Can HTTrack download a website without the index page or a website that has some "isolated" pages?

Friday, July 24, 2015

Can HTTrack only download websites that have an index page? And must the index page link to all the other pages on the site, or, at least, must all the pages on the site be interconnected by links somehow? So, if there is a page that contains no links and is not linked to from any other page, HTTrack will not download it, right?

I am trying to download a website on a free host (in fact it's not really a website but a collection of pictures and some HTML documents that are not necessarily connected to each other). The site is going to be closed in about two weeks, so I need to hurry to download all my pics from it. I tried to download the whole site with HTTrack, but during the process I got this message:

WinHTTrack Website Copier

MIRROR ERROR! * * HTTrack has detected that the current mirror is
empty. If it was an update, the
previous mirror has been restored.
Reason: the first page(s) either could
not be found, or a connection problem
occured.
=> Ensure that the website still exists, and/or check your proxy
settings! <=

I am using Windows XP.

Answer

You're right: such tools work only by following links between pages. If no other page points to a page, that page is "invisible" to HTTrack (and to other "spider" tools). If you know the URLs of these "unlinked" pages, you can add them manually.
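If you already have the URLs of the unlinked files, a small script can also fetch them directly. The following is only a minimal Python sketch of that idea, not something HTTrack itself provides: the function names `local_path` and `download_all`, the `mirror` output folder, and the example.com URLs are all made up for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def local_path(url, root="mirror"):
    """Map a URL to a file path under `root`, mirroring the URL's own path."""
    parsed = urlparse(url)
    path = parsed.path.lstrip("/")
    # A URL that ends in a directory gets a placeholder file name (an assumption).
    if not path or path.endswith("/"):
        path += "index.html"
    return Path(root) / parsed.netloc / path


def download_all(urls, root="mirror"):
    """Fetch each known URL and save it under `root`, creating folders as needed."""
    for url in urls:
        target = local_path(url, root)
        target.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url) as response:
            target.write_bytes(response.read())
        print("saved", target)


# Hypothetical usage with placeholder URLs:
# download_all([
#     "http://example.com/pics/cat.jpg",
#     "http://example.com/docs/notes.html",
# ])
```

Each file ends up under mirror/<host>/<path>, so the folder layout on disk follows the layout on the server.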

However, if the web server has "Directory Browsing" enabled, pointing to a URL that contains a directory and no page name will display a list of all the files in that directory. This is seldom enabled, though, for security reasons. Most of the time, if no page name is specified, the web server will serve a default page (index.html, index.php, default.html, ...) instead of the directory contents.
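If a directory URL does return such a listing, the file links in it can be collected into a list of URLs to add manually. This is a rough Python sketch under assumptions: it expects the listing to be a plain HTML page with one relative link per file, which is how many servers render it but is not guaranteed, and the names `LinkCollector` and `listing_urls` are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def listing_urls(html, base_url):
    """Return absolute URLs for the entries of a directory listing page."""
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        # Assumption: sort links ("?..."), parent or absolute links ("/", ".."),
        # anchors and links to other sites are not entries of this directory.
        if href.startswith(("?", "/", "#", "..")) or "://" in href:
            continue
        urls.append(urljoin(base_url, href))
    return urls
```

Entries ending in "/" are subdirectories, so the same function would have to be run on each of them to list their contents as well.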
