Tuesday, January 20, 2015

linux - Trouble using wget or httrack to mirror archived website

I am trying to use wget to create a local mirror of a website, but I am not getting all of the linked pages.

Here is the website:

http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/


I don't want every page under web.archive.org, only the pages whose URLs begin with http://web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/.




When I use wget -r, the resulting directory tree contains



web.archive.org/web/20110722080716/http://cst-www.nrl.navy.mil/lattice/index.html,


but it is missing other files that are part of this database, e.g.



web.archive.org/web/20110808041151/http://cst-www.nrl.navy.mil/lattice/struk/d0c.html.



Perhaps httrack would do better, but right now it is grabbing too much.



So, how can I grab a local copy of an archived website from the Internet Archive Wayback Machine?
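
One detail visible in the two paths quoted above: the file that was downloaded sits under the timestamp 20110722080716, while the missing file sits under 20110808041151. A filter tied to one fixed timestamp prefix would therefore exclude the second page. The Python sketch below only illustrates that mismatch; it is not a mirroring tool. The helper name is_lattice_capture is hypothetical, and the pattern assumes every capture carries a plain 14-digit timestamp, as both examples here do.

```python
import re

# Hypothetical helper pattern: matches a Wayback Machine capture of the
# lattice site under ANY 14-digit timestamp (assumption: plain timestamps,
# as in the two example paths from the question).
LATTICE_CAPTURE = re.compile(
    r"^(?:https?://)?web\.archive\.org/web/\d{14}/"
    r"http://cst-www\.nrl\.navy\.mil/lattice/"
)


def is_lattice_capture(url: str) -> bool:
    """Return True if url is an archived page of the lattice site."""
    return LATTICE_CAPTURE.match(url) is not None


# The fixed prefix the question asks for, and the two paths it quotes.
fixed_prefix = (
    "web.archive.org/web/20110722080716/"
    "http://cst-www.nrl.navy.mil/lattice/"
)
index_page = (
    "web.archive.org/web/20110722080716/"
    "http://cst-www.nrl.navy.mil/lattice/index.html"
)
missing_page = (
    "web.archive.org/web/20110808041151/"
    "http://cst-www.nrl.navy.mil/lattice/struk/d0c.html"
)

# The missing page does not share the fixed timestamp prefix...
print(missing_page.startswith(fixed_prefix))
# ...but both pages match once the timestamp is allowed to vary.
print(is_lattice_capture(index_page))
print(is_lattice_capture(missing_page))
```

If this reading is right, any recursive download restricted to the single 20110722080716 prefix would skip pages that the archive captured at a different time, which may be part of why the mirror comes out incomplete.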
