Tuesday, April 7, 2015

How to export all hyperlinks on a webpage?


I need a way to export all hyperlinks on a webpage (a single webpage, not an entire website), together with a way to restrict which links are exported, for example only hyperlinks starting with https://superuser.com/questions/ and nothing else.
Exporting to a text file is preferred, and the results should be listed one URL per line:


https://superuser.com/questions/1
https://superuser.com/questions/2
https://superuser.com/questions/3
[...]

Answer



If you are running Linux or a Unix-like system (such as FreeBSD or macOS), you can open a terminal and run this command:


# fetch the page, split it so that each href= starts a new line,
# keep only links with the wanted prefix, then strip the surrounding markup
wget -O - 'http://example.com/webpage.htm' | \
sed 's/href=/\nhref=/g' | \
grep 'href="http://specify.com' | \
sed 's/.*href="//g;s/".*//g' > out.txt

A single line of HTML usually contains several tags, so they have to be split up first: the first sed inserts a newline before every occurrence of href=, ensuring there is at most one link per line.
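To see what that splitting step does on its own, here is a minimal sketch with made-up markup (note that treating \n in the replacement as a newline is a GNU sed feature; the BSD sed shipped with macOS needs a literal newline or gsed instead):


echo '<a href="https://a.example/">A</a> <a href="https://b.example/">B</a>' | sed 's/href=/\nhref=/g'

This prints three lines, each containing at most one href:

<a 
href="https://a.example/">A</a> <a 
href="https://b.example/">B</a>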
To extract links from several similar pages, for example the questions on the first 10 pages of this site, use a for loop.


# note the quotes around the URL: ? is a shell glob character,
# and >> appends so links from all ten pages end up in one file
for i in $(seq 1 10); do
wget -O - "http://superuser.com/questions?page=$i" | \
sed 's/href=/\nhref=/g' | \
grep -E 'href="http://superuser.com/questions/[0-9]+' | \
sed 's/.*href="//g;s/".*//g' >> out.txt
done
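
One caveat worth knowing: listing pages usually link to each question more than once, so out.txt will contain duplicates. A simple way to collapse them afterwards (sort -u reads the whole file before -o rewrites it, so in-place use is safe):

sort -u out.txt -o out.txt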

Remember to replace http://example.com/webpage.htm with your actual page URL and http://specify.com with the URL prefix you want to keep.
You can filter not only by a fixed URL prefix, but also by a regular expression pattern, if you use egrep or grep -E as in the loop above.
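As an illustrative sketch (this exact pattern is not part of the original answer), you could replace the grep stage with a stricter expression, for example to keep only question links whose numeric ID has at least five digits:

grep -E 'href="http://superuser.com/questions/[0-9]{5,}'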
If you're running Windows, consider taking advantage of Cygwin. Don't forget to select the Wget, grep, and sed packages during installation.


