CS109 Lecture 7
Data Scraping
Sources
- From a website
- With an API
Copyright and permissions
- Be careful and polite
- Give credit
- Care about media law
- Don’t be evil
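One common way to be polite is to check a site's robots.txt before fetching pages. A minimal sketch using the standard library's urllib.robotparser (Python 3); the robots.txt content, the bot name, and the example.com URLs are all hypothetical, and the rules are parsed from a string so the sketch runs offline:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical name for our scraper.
USER_AGENT = "cs109-example-bot"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # load the rules from the text above

# Ask whether our bot may fetch each (hypothetical) URL.
allowed_public = parser.can_fetch(USER_AGENT, "http://example.com/events.html")
allowed_private = parser.can_fetch(USER_AGENT, "http://example.com/private/data.html")

print(allowed_public, allowed_private)
```

Against a live site, the parser would typically be pointed at the real file with set_url() and read() instead of parse().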
<h1></h1>
<p></p>
<br>
<a href = 'url'>Link</a>
Useful Libraries for Scraping
- urllib
- BeautifulSoup (bs4)
- pattern
- lxml
Get Data From Website
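import urllib2  # Python 2; in Python 3 use urllib.request.urlopen
import bs4
url = 'url'
source = urllib2.urlopen(url).read()  # download the raw HTML
soup = bs4.BeautifulSoup(source, 'html.parser')  # parse it into a tree
soup.findAll('a')  # every <a> tag on the page
tag = soup.find('a')  # only the first <a> tag
tag.get('href')  # the link target of that tag
C = soup.findAll('p', {'class': 'Event'})  # all <p class="Event"> tags
t = C[0]
t.findNextSiblings()  # the tags that follow t at the same level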
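The same calls can be tried without a network connection by parsing an inline HTML string. A minimal sketch, assuming the bs4 package is installed; the page content and example.com links are made up for illustration, and find_all / find_next_siblings are the current names for findAll / findNextSiblings:

```python
import bs4

# Hypothetical page content, inlined so the sketch runs offline.
HTML = """
<html><body>
<h1>Events</h1>
<p class="Event">Lecture</p>
<p class="Event">Lab</p>
<p>Footer</p>
<a href="http://example.com/one">One</a>
<a href="http://example.com/two">Two</a>
</body></html>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

# All link targets on the page.
links = [a.get("href") for a in soup.find_all("a")]

# Only the first link.
first_link = soup.find("a").get("href")

# Paragraphs marked with class="Event".
events = soup.find_all("p", {"class": "Event"})
event_names = [p.get_text() for p in events]

# <p> tags that follow the first event at the same level of the tree.
following = [p.get_text() for p in events[0].find_next_siblings("p")]

print(links, first_link, event_names, following)
```

Note that the sibling search is not filtered by class, so the plain footer paragraph is included alongside the second event.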
Get Data With An API
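import json
import requests
api_key = 'mykey'
url = 'url' + api_key
data = requests.get(url).text  # response body as a string
a = {'a': 1, 'b': 2}
s = json.dumps(a)  # dict -> JSON string (json.dump writes to a file instead)
a2 = json.loads(s)  # JSON string -> dict
dataDict = json.loads(data)  # parse the API response
dataDict.keys()  # top-level fields of the response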
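The JSON steps can be checked on their own, without an API key. A minimal sketch using only the standard library; the response body below is a hypothetical stand-in for what an API might return, and its field names are assumptions:

```python
import json

# Hypothetical API response body; a real one would come from an HTTP request.
data = '{"status": "ok", "results": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]}'

# Round trip: dict -> JSON string -> dict.
a = {"a": 1, "b": 2}
s = json.dumps(a)
a2 = json.loads(s)

# Parse the response and look at its structure.
data_dict = json.loads(data)
top_level_keys = sorted(data_dict.keys())
names = [item["name"] for item in data_dict["results"]]

print(a2 == a, top_level_keys, names)
```

Inspecting keys() first is a quick way to find where the useful records sit inside a nested response.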