Monday, October 6, 2014

wget - Batch download pages from a wiki without special pages


From time to time I find some documentation on the web that I need for offline use on my notebook. Usually I fire up wget and get the whole site.


Many projects, however, are now switching to wikis, which means I also download every single page revision and every "edit me" link.


Is there a tool, or some configuration for wget, that lets me download only files without a query string, or only files matching a certain regexp?


Cheers,


By the way: wget has the very useful -k switch, which converts any in-site links to their local counterparts. That would be another requirement. Example: when fetching pages from http://example.com, all links to "/..." or "http://example.com/..." have to be converted to point at the downloaded counterpart.
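
For example, a minimal sketch of such a mirror run (http://example.com is just the placeholder from above; the exact option set depends on the site):

# -r   recurse into the site
# -k   (--convert-links)    rewrite in-site links to point at the local copies
# -p   (--page-requisites)  also fetch images/CSS needed to render the pages
# -E   (--adjust-extension) save text/html pages with an .html extension
# -np  (--no-parent)        never ascend above the start directory
wget -r -k -p -E -np http://example.com/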


Answer



From the wget man page:



-R rejlist --reject rejlist
    Specify comma-separated lists of file name suffixes or patterns to accept
    or reject. Note that if any of the wildcard characters, *, ?, [ or ],
    appear in an element of acclist or rejlist, it will be treated as a
    pattern, rather than a suffix.



This seems like exactly what you need.
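
As a hedged sketch for a MediaWiki-style wiki (the reject patterns are guesses; adjust them to whatever URLs your wiki actually generates for edit links and old revisions):

# skip edit pages and old revisions by file-name pattern
wget -r -k -np -R "*action=edit*,*oldid=*" http://example.com/wiki/

Keep in mind that -R is matched against file names rather than full URLs, so how it treats query strings can vary between wget versions; newer builds also provide --reject-regex, which is matched against the complete URL.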


Note: to reduce the load on the wiki server, you might want to look at the -w and --random-wait flags.
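
Added to the command above, that could look like this (the 2-second base delay is an arbitrary choice):

wget -r -k -np -w 2 --random-wait -R "*action=edit*,*oldid=*" http://example.com/wiki/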

