Wednesday, April 22, 2015

html - Extracting links from a numeric range of web pages



I would like to extract links from a numerical sequence of pages like this:



http://example.com/page001.html
http://example.com/page002.html
http://example.com/page003.html
...
http://example.com/page329.html



What I want as output is a text file of the URLs gathered from the links on these pages:




http://www.test.com/index.html
http://www.google.com
http://www.superuser.com/questions



To be clear, I don't want to download the pages; I just want a list of links.



Windows software would be ideal, but Linux would be okay too. All I can think of is writing a long batch script with Xidel, but it wouldn't be very robust when it encounters errors. Curl can download the range of pages, but then I would still need to parse them somehow.






Thanks to Enigman for putting me on the right track. I created a Perl script that reads URLs from a file and spits out links matching a string stored in $site:




use strict;
use warnings;
use LWP;

# Only print links whose URL contains this string.
my $site = "twitter.com";

my $browser = LWP::UserAgent->new;

# Headers ripped from Chrome; some sites refuse requests that look like a bot.
my @ns_headers = (
    'User-Agent'      => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language' => 'en-GB,en;q=0.8',
);

open(URLLIST, '<', 'urls.txt') or die "Can't open urls.txt: $!";
while (<URLLIST>) {
    chomp;
    print "# $_\n";
    my $response = $browser->get($_, @ns_headers);
    die "Can't get $_ -- ", $response->status_line
        unless $response->is_success;

    # Pull out everything that looks like an href attribute.
    my @urls = $response->content =~ /\shref="?([^\s>"]+)/gi;
    foreach my $url (@urls) {
        if ($url =~ /$site/) {
            print "$url\n";
        }
    }
}
close(URLLIST);



To generate the URL list I made a little batch script:



@echo off
for /l %%i in (0, 15, 75) do @echo http://www.example.com/page_%%i.html
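
If you need the zero-padded numbering from the example at the top (page001 ... page329), a tiny Perl script does the same job; this is just a sketch, and genurls.pl is a made-up name:

use strict;
use warnings;

# Print one URL per line, zero-padded to three digits.
printf "http://example.com/page%03d.html\n", $_ for 1 .. 329;

Run it as perl genurls.pl and redirect the output into urls.txt for the main script to read.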


The link-extraction Perl script just stops on an error, which I prefer; it would be trivial to modify it to carry on instead (see the sketch below). The user-agent and accept headers are ripped from Chrome, because some sites don't like anything that looks like a bot. If you intend to scan sites you do not own, please respect robots.txt and set up a custom user agent.
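
A minimal sketch of that "carry on" change (same variables as the script above): swap the die in the fetch check for a warn plus next, so a failed page is logged and skipped instead of aborting the whole run.

my $response = $browser->get($_, @ns_headers);
unless ($response->is_success) {
    # Log the failure and move on to the next URL instead of dying.
    warn "Can't get $_ -- ", $response->status_line, "\n";
    next;
}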


Answer



If you want to use code to do this, you can do it in Perl using the LWP::Simple or WWW::Mechanize modules.




The following might have what you are after: Find All Links from a web page using LWP::Simple module
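
As a rough illustration of that approach (a sketch, not the linked article's code; the URL below is just a placeholder), something along these lines fetches one page and prints every href it finds:

use strict;
use warnings;
use LWP::Simple;    # exports get()

my $url  = 'http://example.com/page001.html';    # placeholder page
my $html = get($url) or die "Couldn't fetch $url\n";

# Crude href extraction; good enough for simple pages, not a full HTML parser.
print "$1\n" while $html =~ /\shref="?([^\s>"]+)/gi;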



This assumes you are comfortable with a command-line solution in Perl. It works the same on both Windows and Linux. It wouldn't take much to modify it to take the URLs to parse as command-line parameters (see the sketch below).
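
A minimal WWW::Mechanize sketch along those lines, taking the pages to scan as command-line arguments (assuming the module is installed; the script name in the example run is made up):

use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 1 makes get() die automatically on HTTP errors.
my $mech = WWW::Mechanize->new( autocheck => 1 );

for my $page (@ARGV) {
    $mech->get($page);
    # links() returns WWW::Mechanize::Link objects; url_abs() resolves
    # relative hrefs against the page's own URL.
    print $_->url_abs, "\n" for $mech->links;
}

Run it as, for example, perl getlinks.pl http://example.com/page001.html http://example.com/page002.html > links.txt.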

