Programming a Webpage Source Extractor in Python

Posted: June 1, 2012 in Programming, Python

First you will need Python 2.7.3. You can find instructions for installing and using Python in my previous blog entry here.

The source can be extracted using the urllib2 module in Python. The program asks the user to enter the URL of a webpage. Note: the URL must be in complete form, i.e. it must start with “http://”. For example, the input must be http://www.example.com and not www.example.com, or else it might produce errors.
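If you want to check the URL before handing it to urllib2, here is a minimal sketch. The helper name has_http_scheme is my own and not part of the program below, and accepting “https://” as well as “http://” is an assumption on my part. The import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7

def has_http_scheme(url):
    # Hypothetical helper: True only when the URL has an explicit
    # http/https scheme followed by a host name.
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

With this, has_http_scheme("http://www.example.com") is True, while has_http_scheme("www.example.com") is False, so the program could ask the user again instead of failing.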

I have defined two functions: get_source(page), which retrieves and returns the source (the program then prints it), and write_source(location), which saves the source to a user-specified location on your hard disk.
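As a side note, here is a rough sketch of how the same two functions could be arranged so that the page is downloaded only once. This is a variant, not the program in this post: passing the source as a second argument to write_source is my assumption, and the file is opened in binary mode so the bytes are saved exactly as received. The import fallback only exists so the snippet also runs on Python 3.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2.7

def get_source(page):
    # Download the page and return its raw source.
    response = urlopen(page)
    try:
        return response.read()
    finally:
        response.close()

def write_source(location, source):
    # Assumed variant: the already-downloaded source is passed in,
    # so the page is not fetched a second time.
    with open(location, "wb") as fob:
        fob.write(source)
```

You would then call source = get_source(webpage) once and pass that same value to write_source(path, source).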

The program first asks the user to enter the URL of the webpage (for example, you could enter http://www.google.com). Then it asks for the location on your hard disk where the source should be saved. You can enter any location you prefer: for example, C:/source.txt to save the source as a text file, or C:/source.html to save it as an HTML page.

# Source Extractor
# extr3metech.wordpress.com

import urllib2

def get_source(page):
    # Open the URL and return the page source as a string
    url=urllib2.urlopen(page)
    source=url.read()
    url.close()
    return source

def write_source(location):
    # Fetch the page entered by the user and save it in one write
    fob=open(location,"w")
    fob.write(get_source(webpage))
    fob.close()
    print "Source saved in", location

webpage=raw_input("Enter URL to get source : ") # Example: http://www.google.com
path=raw_input("Enter location to save source : ")  # Example: C:/source.txt
print get_source(webpage)

write_source(path)

raw_input("Press Enter to exit..")
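If the URL is incomplete or the site cannot be reached, urlopen raises an exception and the program stops. Below is a small sketch of one way to catch that. The function name try_get_source and the choice to return None on failure are my own assumptions, not part of the program above. The import fallback only exists so the snippet also runs on Python 3.

```python
try:
    from urllib.request import urlopen     # Python 3
    from urllib.error import URLError
except ImportError:
    from urllib2 import urlopen, URLError  # Python 2.7

def try_get_source(page):
    # Hypothetical wrapper: return the page source, or None on failure.
    try:
        response = urlopen(page)
    except (ValueError, URLError) as err:
        # ValueError: URL without a scheme such as "http://"
        # URLError: host not found, connection refused, HTTP errors
        print("Could not open %s: %s" % (page, err))
        return None
    try:
        return response.read()
    finally:
        response.close()
```

For example, try_get_source("www.example.com") prints a message and returns None, because the URL has no “http://” in front of it.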

ΞXΤЯ3МΞ

Comments
  1. Eswar says:

    Hey it works amazing thanks to share……………………
