From time to time I find some documentation on the web that I need for offline use on my notebook. Usually I fire up wget and get the whole site.
Many projects, however, are now switching to wikis, which means I also download every single page revision and every "edit me" link.
Is there any tool, or any option in wget, that would let me download only files without a query string, or only files matching a certain regexp?
Cheers,
By the way: wget has the very useful -k switch, which converts any in-site links to their local counterparts. That would be another requirement. For example, when fetching pages from http://example.com, all links to "/..." or "http://example.com/..." should be converted to point to the downloaded counterparts.
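For reference, something like this is roughly what I run today (the URL is just a placeholder):

    wget -r -k -p -np http://example.com/docs/

The problem is that on a wiki this also pulls in every history, diff and edit URL.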
Answer
From the wget man page:
-R rejlist --reject rejlist
    Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.
This seems like exactly what you need.
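For instance, to skip MediaWiki-style edit and history links while mirroring, a pattern along these lines should work (the exact patterns are assumptions and depend on the wiki software; quote them so the shell does not expand the wildcards):

    wget -r -k -p -np -R '*action=edit*,*action=history*,*oldid=*' http://example.com/wiki/

Be aware that how -R treats the query-string part of a URL may depend on your wget version; newer releases also have --reject-regex, which matches against the complete URL and is often the more reliable way to drop anything containing a "?".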
Note: to reduce the load on the wiki server, you might want to look at the -w and --random-wait flags.
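Putting it all together, a polite mirror run might look something like this (the two-second wait is just an example value):

    wget -r -k -p -np -w 2 --random-wait -R '*action=edit*,*action=history*' http://example.com/wiki/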