I want to mirror the following web site completely: http://tinaztitiz.com
I use the following wget command:
wget -m http://tinaztitiz.com
The web site is a custom CMS and contains many pages with URLs of the following form:
http://tinaztitiz.com/yazi.php?id=943
http://tinaztitiz.com/yazi.php?id=762
Oddly, wget fetches a few of these pages but not all of them. What might be the reason for this?
Note: There are no restrictions imposed by robots.txt.
Update:
Looking at the source code of the web site, I noticed that the pages wget does not detect and crawl have a common property: their anchor URLs are written by the following JavaScript function:
function yazilar()
{
    // Note: the HTML fragments inside these string literals were stripped
    // when the code was pasted; '</a><br>' and '<a href="yazi.php?id=' are
    // plausible reconstructions, not the verbatim originals.
    var ab = '</a><br>';
    var aa = '<a href="yazi.php?id=';
    var ac = '';
    var arr = new Array();
    arr[0] = '12\">'+ac+' Belâgat';
    arr[1] = '15\">'+ac+' Bilim ve Teknoloji';
    //...
    maxi = 14;
    for(i=0;i<=maxi;i++){
        a = aa + arr[i] + ab;
        document.writeln(a);
    }
}
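For reference, assuming aa and ab held the stripped anchor markup as reconstructed above (the URL pattern is an assumption based on the question's example URLs, not the verbatim source), each loop iteration would emit a link like this:

    <a href="yazi.php?id=12"> Belâgat</a><br>

This markup exists only in the browser's DOM after the script runs; it never appears in the raw HTML that wget downloads.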
So, it looks like wget cannot detect anchor tags that are generated dynamically.
Answer
JavaScript is rendered by the browser. wget does exactly what it is supposed to do: it fetches the content. Browsers do the same thing initially; they receive the content exactly as you posted it above, but then they run the JavaScript, which builds the links. wget cannot do that. So, no, you cannot get dynamically generated links using wget alone. You can try something like PhantomJS, though.
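For example, a minimal PhantomJS script along these lines can load the page, let the inline JavaScript run, and print every anchor URL found in the rendered DOM (the file name and output handling are illustrative, not taken from the question):

    // list-links.js: print every anchor href after JavaScript has run
    var page = require('webpage').create();
    page.open('http://tinaztitiz.com', function (status) {
        if (status !== 'success') {
            console.log('failed to load page');
            phantom.exit(1);
        }
        // Collect the href of every <a> element in the rendered DOM,
        // including the ones written by document.writeln().
        var links = page.evaluate(function () {
            return Array.prototype.map.call(
                document.querySelectorAll('a[href]'),
                function (a) { return a.href; }
            );
        });
        console.log(links.join('\n'));
        phantom.exit();
    });

You could then run phantomjs list-links.js > urls.txt and hand the list to wget with wget -i urls.txt, so wget fetches the pages its own crawler misses.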