What is web scraping?
Web sites are written using HTML, which means that each web page is a structured document. Web sites don’t always provide their data in comfortable formats such as csv or json.
This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and gather the data that you need in a format most useful to you while at the same time preserving the structure of the data. Understanding HTML Basics Scraping is all about html tags. I will be using two Python modules for scraping data.
- Urllib
- Beautifulsoup
Parsing HTML using Urllib
Using urllib, you can treat a web page much like a file. You simply indicate
which web page you would like to retrieve and urllib handles all of the HTTP
protocol and header details. We can construct a well-formed regular expression to match and extract the link values from the above text as follows:
href="http://.+?"
The question mark added to the “.+?” indicates to find the smallest possible matching string and tries to find the largest possible matching string.
import urllib
import re
url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
links = re.findall('href="(http://.*?)"', html)
for link in links:
print link # tab
Enter input : http://www.wsbtv.com/weather/
Parsing HTML using BeautifulSoup
BeautifulSoup library is used parse some HTML input and lets you easily extract the data you need.
You can download and “install” BeautifulSoup or you can simply place the
BeautifulSoup.py file in the same folder as your application.
We will use urllib to read the page and then use BeautifulSoup to extract the
href attributes from the anchor (a) tags.
import urllib
from bs4 import BeautifulSoup
url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
print tag.get('href', None) # tab
print 'TAG:',tag
print 'URL:',tag.get('href', None)
print 'Content:',tag.contents[0]
print 'Attrs:',tag.attrs
No comments:
Post a Comment