April 05, 2017

Python to Access Web Data



What is web scraping?

Web sites are written in HTML, which means that each web page is a structured document. However, web sites don’t always provide their data in convenient formats such as CSV or JSON.
This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and gather the data you need in the format most useful to you, while preserving the structure of the data.

Understanding HTML Basics

Scraping is all about HTML tags. I will be using two Python modules for scraping data:
  • urllib
  • BeautifulSoup
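
To make the idea of tags concrete, here is a minimal sketch that lists the opening tags of a tiny page in document order. The sample page and the names SAMPLE_HTML, TagLister and list_tags are made up for this illustration, and it uses the standard library's html.parser only to show the structure; it is not one of the two modules used for the actual scraping in this post.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration
SAMPLE_HTML = """
<html>
  <body>
    <h1>First Page</h1>
    <p>Go to the <a href="http://www.example.com/page2.htm">second page</a>.</p>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def list_tags(html):
    """Return the opening tag names found in an HTML string."""
    parser = TagLister()
    parser.feed(html)
    parser.close()
    return parser.tags


print(list_tags(SAMPLE_HTML))
```

The anchor (a) tag near the end of that list is the one the rest of this post cares about, because its href attribute holds the link.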

Parsing HTML using urllib
Using urllib, you can treat a web page much like a file. You simply indicate
which web page you would like to retrieve and urllib handles all of the HTTP
protocol and header details.
Once the page has been retrieved, we can construct a regular expression to match and extract the link values from its HTML as follows:

href="http://.+?"
The question mark in “.+?” makes the match non-greedy: it finds the smallest possible matching string, whereas a greedy match (without the question mark) tries to find the largest possible matching string.
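
The difference is easiest to see on a line that contains more than one link. The snippet below is only a sketch; the sample text and its example.com URLs are invented for the comparison.

```python
import re

# Hypothetical snippet with two links on a single line
text = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

# Non-greedy: ".+?" stops at the first closing quote it can reach
non_greedy = re.findall(r'href="http://.+?"', text)

# Greedy: ".+" runs on to the last closing quote on the line
greedy = re.findall(r'href="http://.+"', text)

print(non_greedy)
print(greedy)
```

The non-greedy pattern returns the two links separately, while the greedy pattern returns a single match that swallows everything between the first href and the last closing quote.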



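import re
import urllib.request

url = input('Enter - ')
# Fetch the page and decode the bytes so the regular expression can search text
html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
# Capture every http:// URL that appears inside an href attribute
links = re.findall(r'href="(http://.*?)"', html)
for link in links:
    print(link)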
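
The "much like a file" idea from the start of this section can also be sketched directly: the object returned by urlopen can be looped over line by line, just like an open file. The helper name read_lines and the demo URL are assumptions for this example. A data: URL is used as a stand-in so the sketch runs without network access; an http:// address is opened the same way.

```python
import urllib.request


def read_lines(url):
    """Open a URL and return its lines, read the way a file would be."""
    lines = []
    with urllib.request.urlopen(url) as page:
        for raw_line in page:  # iterate line by line, as with open()
            lines.append(raw_line.decode('utf-8').rstrip())
    return lines


# Hypothetical stand-in for a real page: a data: URL carries its content inline
demo_url = 'data:text/plain,first%20line%0Asecond%20line'
print(read_lines(demo_url))
```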


Parsing HTML using BeautifulSoup

The BeautifulSoup library parses HTML input and lets you easily extract the data you need.
You can download and install BeautifulSoup (for example, with pip install beautifulsoup4), or you can simply place the
library's files in the same folder as your application.
We will use urllib to read the page and then use BeautifulSoup to extract the
href attributes from the anchor (a) tags.

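import urllib.request
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

    # Look at the parts of each tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    # An empty tag has no contents, so guard the index
    print('Content:', tag.contents[0] if tag.contents else None)
    print('Attrs:', tag.attrs)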
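
To see what these calls return without fetching a page, here is a small sketch that runs the same lookups on an inline HTML string. The fragment, its example.com URL and the variable names are made up for the illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one anchor with an href, one without
html = ('<p><a href="http://example.com/a" title="first">First</a> '
        '<a name="top">No link</a></p>')
soup = BeautifulSoup(html, 'html.parser')

# Calling the soup with a tag name returns every matching tag
tags = soup('a')

# get() returns the given default (None here) when the attribute is missing
hrefs = [tag.get('href', None) for tag in tags]
# contents holds the tag's children; the first child here is the link text
texts = [str(tag.contents[0]) for tag in tags]
# attrs is a dictionary of all attributes on the tag
first_attrs = tags[0].attrs

print(hrefs)
print(texts)
print(first_attrs)
```

Unlike the regular expression earlier, which only matches links starting with http://, this finds every anchor tag, including ones that have no href at all.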
