Friday, December 26, 2014

Why does wget not get all the pages when mirroring this site



I want to mirror the following web site completely: http://tinaztitiz.com



I use the following wget command:



wget -m http://tinaztitiz.com
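(For reference, the wget manual says -m is shorthand for -r -N -l inf --no-remove-listing. A slightly fuller invocation, sketched below, also converts links and fetches page requisites so the mirror is browsable offline; the exact flags are a suggestion, not part of the original command.)

wget -m -k -p http://tinaztitiz.com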


The web site runs on a custom CMS and contains lots of pages with URLs of the following form:




http://tinaztitiz.com/yazi.php?id=943
http://tinaztitiz.com/yazi.php?id=762


Oddly, wget gets a few of these pages but not all of them. I wonder what might be the reason for this?



Note: There is no restriction imposed by robots.txt.



Update:




Looking at the source code of the web site, I noticed that the pages that are not detected and crawled by wget have a common property: their anchor URLs are written by the following JavaScript function:



function yazilar()
{
    // Note: the anchor markup inside the string literals was stripped when the
    // listing was pasted; the <a ...> and </a> fragments below are reconstructed,
    // and the actual href target is an assumption.
    var ab = '</a><br/>';              // markup appended after each link
    var aa = '<a href="yazi.php?id=';  // markup prepended before each link (assumed target)
    var ac = '';

    var arr = new Array();

    arr[0] = '12\">' + ac + ' Belâgat';
    arr[1] = '15\">' + ac + ' Bilim ve Teknoloji';
    //...
    maxi = 14;
    for (i = 0; i < maxi; i++) {
        a = aa + arr[i] + ab;
        document.writeln(a);
    }
}



So, it looks like wget cannot detect anchor tags that are generated dynamically.


Answer



JavaScript is rendered by the browser. wget does exactly what it is supposed to do: it fetches the content. A browser initially does the same thing and receives the content exactly as you posted above, but it then executes the JavaScript and builds the links. wget cannot do that. So no, you cannot fetch links that are generated dynamically using wget alone. You can try something like PhantomJS, though.
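For illustration, here is a minimal PhantomJS sketch (assuming PhantomJS is installed; the file name save-links.js and the default URL are just examples). It loads a page, lets the page's JavaScript run, and prints every anchor href found in the rendered DOM, so the list can be fed back to wget:

// save-links.js
var page = require('webpage').create();
var system = require('system');
var url = system.args.length > 1 ? system.args[1] : 'http://tinaztitiz.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
    }
    // Runs inside the page context, after document.writeln() has executed.
    var links = page.evaluate(function () {
        return Array.prototype.map.call(
            document.querySelectorAll('a[href]'),
            function (a) { return a.href; }
        );
    });
    links.forEach(function (href) { console.log(href); });
    phantom.exit();
});

You could then run something like:

phantomjs save-links.js http://tinaztitiz.com > links.txt
wget -i links.txt

so that wget downloads the pages that are only reachable through the dynamically written anchors.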

