Monday, October 6, 2014

wget - Batch download pages from a wiki without special pages


From time to time I find some documentation on the web that I need for offline use on my notebook. Usually I fire up wget and get the whole site.


Many projects, however, are now switching to wikis, which means I also download every single page revision and every "edit me" link.


Is there a tool, or some configuration for wget, that lets me download only files without a query string, or only files matching a certain regexp?


Cheers,


By the way: wget has the very useful -k switch, which converts any in-site links to their local counterparts. That would be another requirement. Example: when fetching pages from http://example.com, all links to "/..." or "http://example.com/..." have to be converted to point at the downloaded counterpart.
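
For example, a minimal sketch of such a mirror run (http://example.com is just the placeholder from above; the exact option set depends on the site):

# -r   recurse into the site
# -k   (--convert-links)    rewrite in-site links to point at the local copies
# -p   (--page-requisites)  also fetch images/CSS needed to render the pages
# -E   (--adjust-extension) save text/html pages with an .html extension
# -np  (--no-parent)        never ascend above the start directory
wget -r -k -p -E -np http://example.com/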


Answer



From the wget man page:



-R rejlist --reject rejlist
    Specify comma-separated lists of file name suffixes or patterns to accept
    or reject. Note that if any of the wildcard characters, *, ?, [ or ],
    appear in an element of acclist or rejlist, it will be treated as a
    pattern, rather than a suffix.



This seems like exactly what you need.
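
As a hedged sketch for a MediaWiki-style wiki (the reject patterns are guesses; adjust them to whatever URLs your wiki actually generates for edit links and old revisions):

# skip edit pages and old revisions by file-name pattern
wget -r -k -np -R "*action=edit*,*oldid=*" http://example.com/wiki/

Keep in mind that -R is matched against file names rather than full URLs, so how it treats query strings can vary between wget versions; newer builds also provide --reject-regex, which is matched against the complete URL.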


Note: to reduce the load on the wiki server, you might want to look at the -w and --random-wait flags.
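
Added to the command above, that could look like this (the 2-second base delay is an arbitrary choice):

wget -r -k -np -w 2 --random-wait -R "*action=edit*,*oldid=*" http://example.com/wiki/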

