I need a way to export all hyperlinks from a webpage (a single webpage, not an entire website), with an option to specify which links to export, for example only hyperlinks starting with https://superuser.com/questions/ and excluding everything else.
Export to a text file is preferred, with the results listed one below another, one URL per line:
https://superuser.com/questions/1
https://superuser.com/questions/2
https://superuser.com/questions/3
[...]
Answer
If you are running on a Linux or a Unix system (like FreeBSD or macOS), you can open a terminal session and run this command:
# Fetch the page, put each href on its own line, keep only matching
# links, then strip everything except the URL itself:
wget -O - http://example.com/webpage.htm | \
sed 's/href=/\nhref=/g' | \
grep 'href="http://specify.com' | \
sed 's/.*href="//g;s/".*//g' > out.txt
A single line of HTML usually contains multiple tags, so you have to split them apart first: the first sed inserts a newline before every occurrence of href, making sure there is no more than one per line. (Note that \n in the replacement text requires GNU sed; the BSD sed shipped with macOS would insert a literal n there, so use a literal newline in the script instead.)
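For the exact case from the question, the pipeline could look like this (a sketch assuming the page's markup contains absolute links starting with https://superuser.com/questions/; relative links such as href="/questions/..." would not match this prefix):

# Hypothetical concrete example for the question's filter:
wget -O - https://superuser.com/questions | \
sed 's/href=/\nhref=/g' | \
grep 'href="https://superuser.com/questions/' | \
sed 's/.*href="//g;s/".*//g' > out.txt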
To extract links from multiple similar pages, for example all questions on the first 10 pages of this site, use a for loop:
for i in $(seq 1 10); do
    # Fetch page $i and append its question URLs to out.txt.
    # The URL is quoted so the shell does not treat ? as a glob character.
    wget -O - "http://superuser.com/questions?page=$i" | \
    sed 's/href=/\nhref=/g' | \
    grep -E 'href="http://superuser.com/questions/[0-9]+' | \
    sed 's/.*href="//g;s/".*//g' >> out.txt
done
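A page may link to the same question more than once, so the collected list can contain duplicates. If that matters, an extra sort -u pass (standard coreutils; an optional step not in the original command) reduces the output to unique URLs:

# Optional: collapse duplicate URLs, keeping one of each per line.
sort -u out.txt > unique.txt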
Remember to replace http://example.com/webpage.htm with your actual page URL and http://specify.com with the URL prefix you want to match.
You can filter not only by a fixed URL prefix but also by a regular expression pattern if you use egrep or grep -E in the command given above.
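For example, replacing the plain grep stage with this one (an illustrative pattern, not from the original answer) accepts both http:// and https:// links:

# 'https?' matches "http" optionally followed by "s".
grep -E 'href="https?://superuser.com/questions/[0-9]+'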
If you're running Windows, consider taking advantage of Cygwin. Don't forget to select the Wget, grep, and sed packages during installation.