1. Simulating a browser in Python: a simple HTML crawler
import urllib2

def readHeiKe(url):
    # Pretend to be a desktop Chrome browser so the site serves normal HTML
    req_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'}
    req_timeout = 5  # seconds
    req = urllib2.Request(url, None, req_header)
    resp = urllib2.urlopen(req, None, req_timeout)
    return resp.read()
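On Python 3, where urllib2 was split into urllib.request, the same fetch might be sketched as follows (the function name read_page is our own choice, not from the original post; the User-Agent header is the one used above):

```python
import urllib.request

def read_page(url, timeout=5):
    # Same browser-masquerading header as the urllib2 version above
    req_header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/31.0.1650.63 Safari/537.36'
    }
    req = urllib.request.Request(url, headers=req_header)
    # urlopen returns a response object; read() gives the raw bytes of the page
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```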
2. Extracting data from the HTML
Beautiful Soup is a Python library for extracting data from HTML and XML documents.
For a usage tutorial, see: http://beautifulsoup.readthedocs.org/zh_CN/latest/
Code example:
from bs4 import BeautifulSoup

url = "#"  # target page URL
html = readHeiKe(url)
soup = BeautifulSoup(html, "lxml")
# Each matching div holds one list item with a link, a title, and a timestamp
arr = soup.find_all("div", class_="newitem clearfix")
for term in arr:
    url = term.a['href']
    title = term.a.string
    time = term.find("div", class_="col-sm-3 text-right").string
    # database processing can be done here, inside the loop
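The "database processing" mentioned in the loop comment could be sketched with the standard-library sqlite3 module; the table name, column names, and save_article helper below are our own assumptions, not part of the original post:

```python
import sqlite3

# An in-memory database for the sketch; pass a file path for a persistent one
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT, pub_time TEXT)')

def save_article(title, url, pub_time):
    # Parameterized insert: sqlite3 escapes the values, so scraped text is safe
    conn.execute('INSERT INTO articles VALUES (?, ?, ?)', (title, url, pub_time))
    conn.commit()

# Inside the loop above one would call, for each item:
save_article('demo title', 'http://example.com/1', '2015-01-01')
```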