Friday, December 26, 2014

Why does wget not get all the pages when mirroring this site



I want to mirror the following web site completely: http://tinaztitiz.com



I use the following wget command:



wget -m http://tinaztitiz.com
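(For reference, the wget manual says -m is shorthand for -r -N -l inf --no-remove-listing. A slightly fuller invocation, sketched below, also converts links and fetches page requisites so the mirror is browsable offline; the exact flags are a suggestion, not part of the original command.)

wget -m -k -p http://tinaztitiz.com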


The web site runs on a custom CMS and contains lots of pages with URLs of the following form:




http://tinaztitiz.com/yazi.php?id=943
http://tinaztitiz.com/yazi.php?id=762


Oddly, wget gets a few of these pages but not all of them. I wonder what might be the reason for this?



Note: There is no restriction imposed by robots.txt.



Update:




Looking at the source code of the web site, I noticed that the pages that are not detected and crawled by wget have a common property: their anchor URLs are written by the following JavaScript function:



function yazilar()
{
    // Note: the anchor markup inside the string literals was stripped when the
    // listing was pasted; the <a ...> and </a> fragments below are reconstructed,
    // and the actual href target is an assumption.
    var ab = '</a><br/>';              // markup appended after each link
    var aa = '<a href="yazi.php?id=';  // markup prepended before each link (assumed target)
    var ac = '';

    var arr = new Array();

    arr[0] = '12\">' + ac + ' Belâgat';
    arr[1] = '15\">' + ac + ' Bilim ve Teknoloji';
    //...
    maxi = 14;
    for (i = 0; i < maxi; i++) {
        a = aa + arr[i] + ab;
        document.writeln(a);
    }
}



So, it looks like wget cannot detect anchor tags that are generated dynamically.


Answer



JavaScript is rendered by the browser. wget does exactly what it is supposed to do: it fetches the content. A browser initially does the same thing and receives the content exactly as you posted above, but it then executes the JavaScript and builds the links. wget cannot do that. So no, you cannot fetch links that are generated dynamically using wget alone. You can try something like PhantomJS, though.
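For illustration, here is a minimal PhantomJS sketch (assuming PhantomJS is installed; the file name save-links.js and the default URL are just examples). It loads a page, lets the page's JavaScript run, and prints every anchor href found in the rendered DOM, so the list can be fed back to wget:

// save-links.js
var page = require('webpage').create();
var system = require('system');
var url = system.args.length > 1 ? system.args[1] : 'http://tinaztitiz.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
    }
    // Runs inside the page context, after document.writeln() has executed.
    var links = page.evaluate(function () {
        return Array.prototype.map.call(
            document.querySelectorAll('a[href]'),
            function (a) { return a.href; }
        );
    });
    links.forEach(function (href) { console.log(href); });
    phantom.exit();
});

You could then run something like:

phantomjs save-links.js http://tinaztitiz.com > links.txt
wget -i links.txt

so that wget downloads the pages that are only reachable through the dynamically written anchors.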

