Wednesday, April 22, 2015

html - Extracting links from a numeric range of web pages



I would like to extract links from a numerical sequence of pages like this:



http://example.com/page001.html
http://example.com/page002.html
http://example.com/page003.html
...
http://example.com/page329.html



What I want as output is a text file of the URLs gathered from the links on these pages:




http://www.test.com/index.html
http://www.google.com
http://www.superuser.com/questions



To be clear, I don't want to download the pages; I just want a list of links.



Windows software would be ideal, but Linux would be okay too. All I can think of is writing a long batch script with Xidel, but it wouldn't be very robust when it encounters errors. Curl can download the range of pages, but then I would still need to parse them somehow.






Thanks to Enigman for putting me on the right track. I created a Perl script that reads URLs from a file and spits out links matching a string stored in $site:




use strict;
use warnings;
use LWP;

# Only print links whose URL contains this string.
my $site = "twitter.com";

my $browser = LWP::UserAgent->new;

# Headers ripped from Chrome; some sites refuse requests that look like a bot.
my @ns_headers = (
    'User-Agent'      => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language' => 'en-GB,en;q=0.8',
);

open(URLLIST, '<', 'urls.txt') or die "Can't open urls.txt: $!";
while (<URLLIST>) {
    chomp;
    print "# $_\n";
    my $response = $browser->get($_, @ns_headers);
    die "Can't get $_ -- ", $response->status_line
        unless $response->is_success;

    # Pull out everything that looks like an href attribute.
    my @urls = $response->content =~ /\shref="?([^\s>"]+)/gi;
    foreach my $url (@urls) {
        if ($url =~ /$site/) {
            print "$url\n";
        }
    }
}
close(URLLIST);



To generate the URL list I made a little batch script:



@echo off
for /l %%i in (0, 15, 75) do @echo http://www.example.com/page_%%i.html
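
If you need the zero-padded numbering from the example at the top (page001 ... page329), a tiny Perl script does the same job; this is just a sketch, and genurls.pl is a made-up name:

use strict;
use warnings;

# Print one URL per line, zero-padded to three digits.
printf "http://example.com/page%03d.html\n", $_ for 1 .. 329;

Run it as perl genurls.pl and redirect the output into urls.txt for the main script to read.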


The link-extraction Perl script just stops on an error, which I prefer; it would be trivial to modify it to carry on instead (see the sketch below). The user-agent and accept headers are ripped from Chrome, because some sites don't like anything that looks like a bot. If you intend to scan sites you do not own, please respect robots.txt and set up a custom user agent.
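
A minimal sketch of that "carry on" change (same variables as the script above): swap the die in the fetch check for a warn plus next, so a failed page is logged and skipped instead of aborting the whole run.

my $response = $browser->get($_, @ns_headers);
unless ($response->is_success) {
    # Log the failure and move on to the next URL instead of dying.
    warn "Can't get $_ -- ", $response->status_line, "\n";
    next;
}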


Answer



If you want to use code to do this, you can do it in Perl using the LWP::Simple or WWW::Mechanize modules.




The following might have what you are after: Find All Links from a web page using LWP::Simple module
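
As a rough illustration of that approach (a sketch, not the linked article's code; the URL below is just a placeholder), something along these lines fetches one page and prints every href it finds:

use strict;
use warnings;
use LWP::Simple;    # exports get()

my $url  = 'http://example.com/page001.html';    # placeholder page
my $html = get($url) or die "Couldn't fetch $url\n";

# Crude href extraction; good enough for simple pages, not a full HTML parser.
print "$1\n" while $html =~ /\shref="?([^\s>"]+)/gi;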



This assumes you are comfortable with a command-line solution in Perl. It works the same on both Windows and Linux. It wouldn't take much to modify it to take the URLs to parse as command-line parameters (see the sketch below).
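
A minimal WWW::Mechanize sketch along those lines, taking the pages to scan as command-line arguments (assuming the module is installed; the script name in the example run is made up):

use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 1 makes get() die automatically on HTTP errors.
my $mech = WWW::Mechanize->new( autocheck => 1 );

for my $page (@ARGV) {
    $mech->get($page);
    # links() returns WWW::Mechanize::Link objects; url_abs() resolves
    # relative hrefs against the page's own URL.
    print $_->url_abs, "\n" for $mech->links;
}

Run it as, for example, perl getlinks.pl http://example.com/page001.html http://example.com/page002.html > links.txt.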

