Data Engineering with Avishkar: Python to Access Web Data

April 05, 2017

Python to Access Web Data

What is web scraping?

Websites are written in HTML, which means that each web page is a structured document. However, websites don't always provide their data in convenient formats such as CSV or JSON.

This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and gather the data you need, in the format most useful to you, while preserving the structure of the data.

Understanding HTML Basics

Scraping is all about HTML tags. I will be using two Python modules for scraping data:

urllib
BeautifulSoup

Parsing HTML using Urllib

Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve, and urllib handles all of the HTTP protocol and header details.

Once the page text has been read, we can construct a well-formed regular expression to match and extract the link values from it as follows:

href="http://.+?"

The question mark in “.+?” makes the match non-greedy: it finds the smallest possible matching string, whereas a greedy match (“.+” without the question mark) tries to find the largest possible matching string.
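To see the difference between the two matching modes, here is a small sketch that runs both patterns over a made-up line of HTML. The sample string and the variable names are only for illustration and do not come from the page used later.

```python
import re

# Hypothetical one-line HTML snippet with two links (made up for illustration).
sample = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

# Non-greedy: ".+?" stops at the first closing quote it can reach.
non_greedy = re.findall(r'href="http://.+?"', sample)

# Greedy: ".+" runs on to the last closing quote on the line.
greedy = re.findall(r'href="http://.+"', sample)

print(non_greedy)  # two separate matches, one per link
print(greedy)      # a single match that swallows both links
```

The non-greedy pattern is the one you want for link extraction, because each match ends at the closing quote of its own href.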

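import re
import urllib.request

# Written for Python 3: input() replaces raw_input() and urlopen() lives in urllib.request
url = input('Enter - ')

# read() returns bytes, so decode to text before applying the regular expression
html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

# The parentheses capture only the URL, not the surrounding href="..."
links = re.findall(r'href="(http://.*?)"', html)

for link in links:
    print(link)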

Sample input at the prompt: http://www.wsbtv.com/weather/
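The "treat a web page much like a file" idea can be sketched as follows in Python 3, where the function lives in urllib.request. To keep the sketch runnable without a network connection, it reads from a data: URL built from a made-up page (an assumption for this example only); an http:// address would be opened and read the same way.

```python
import re
import urllib.request
from urllib.parse import quote

# Made-up page content for illustration; a real page would come from an http:// URL.
page = (
    '<p>Hello</p>\n'
    '<a href="http://example.com/one">one</a>\n'
    '<a href="http://example.com/two">two</a>\n'
)

# Pack the content into a data: URL so the sketch needs no network access.
url = 'data:text/html,' + quote(page)

links = []
# The response object can be read line by line, much like an open file.
with urllib.request.urlopen(url) as response:
    for raw_line in response:
        line = raw_line.decode('utf-8')
        links.extend(re.findall(r'href="(http://.*?)"', line))

print(links)
```

Reading line by line is optional; calling read() once, as in the program above, is simpler when the page fits comfortably in memory.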

Parsing HTML using BeautifulSoup

The BeautifulSoup library parses HTML input and lets you easily extract the data you need.

You can download and install BeautifulSoup (the code below imports it from the bs4 package, which pip installs under the name beautifulsoup4). Older releases could also be used by simply placing the BeautifulSoup.py file in the same folder as your application.

We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.

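import urllib.request

from bs4 import BeautifulSoup

# Written for Python 3: input() replaces raw_input() and urlopen() lives in urllib.request
url = input('Enter - ')
html = urllib.request.urlopen(url).read()

# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    # contents is empty for an anchor with nothing inside it
    print('Content:', tag.contents[0] if tag.contents else None)
    print('Attrs:', tag.attrs)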
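To try the same calls without fetching a page, here is a minimal sketch that feeds BeautifulSoup a made-up HTML fragment. The fragment and the hrefs list are assumptions for illustration; the second anchor has no href, which shows what get() returns in that case.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration.
html = (
    '<p><a href="http://example.com/one" class="ext">First</a> '
    '<a name="top">No link</a></p>'
)

# html.parser is the parser bundled with Python.
soup = BeautifulSoup(html, 'html.parser')

hrefs = []
for tag in soup('a'):
    # get() returns None when the attribute is missing.
    hrefs.append(tag.get('href', None))
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Content:', tag.contents[0])
    print('Attrs:', tag.attrs)
```

Unlike the regular expression approach, this also finds anchors whose href is relative or missing, so it is worth checking for None before using a value as a URL.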
