I once tried fetching a web page in Objective-C and then picking it apart with regular expressions, and it was quite a slog. Having recently picked up Python, I rebuilt the same feature in fewer than 30 lines, which instantly made me appreciate how elegant Python is.
First we need three libraries: urllib, BeautifulSoup, and lxml. urllib makes the network request, BeautifulSoup reads the data, and lxml parses it. (You can also use the built-in html.parser, but lxml is recommended: it parses faster and is more tolerant of malformed documents.)
from bs4 import BeautifulSoup
import urllib.request
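As a quick aside, the parser is just an argument to BeautifulSoup, so the rest of your code is the same either way. A minimal sketch with a hypothetical HTML snippet:

```python
from bs4 import BeautifulSoup

# Tiny hypothetical snippet just to show the parser is swappable.
html = "<div><a href='/post/1'>Hello</a></div>"

# 'html.parser' is built into the standard library; pass 'lxml' instead
# for faster parsing and better error tolerance once lxml is installed.
soup = BeautifulSoup(html, 'html.parser')
link = soup.select_one('a')
print(link.get_text())  # Hello
print(link['href'])     # /post/1
```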
This time we fetch http://blog.csdn.net/zhz459880251/article/details/50212257 directly. Straight to the code:
url = "http://blog.csdn.net/zhz459880251/article/details/50212257"
headers = {
'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile',
'Cookie':'uuid_tt_dd=-6992902498996105809_20170209; __message_district_code=310000; _message_m=fx40yhjccq4t4ivqrlq3yf2t; UN=zhz459880251; UE=""; BT=1486950243541; _ga=GA1.2.1648016979.1486950242; __utmt=1; __utma=17226283.1648016979.1486950242.1486964688.1486964688.1; __utmb=17226283.1.10.1486964688; __utmc=17226283; __utmz=17226283.1486964688.1.1.utmcsr=yiibai.com|utmccn=(referral)|utmcmd=referral|utmcct=/python/python3-webbug-series1.html; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1486950252,1486955003,1486963532,1486964112; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1486964737; dc_tos=olatc0; dc_session_id=1486964688035; __message_sys_msg_id=0; __message_gu_msg_id=0; __message_cnel_msg_id=0; __message_in_school=0'
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode('UTF-8')
soup = BeautifulSoup(html, 'lxml')
# Note: the first two selectors target the same <a> elements; one list
# supplies the link text, the other the href attribute.
titles = soup.select('div > table > tbody > tr > td > a')
title_urls = soup.select('div > table > tbody > tr > td > a')
title_details = soup.select('div > table > tbody > tr > td > big')
for title, title_url, title_detail in zip(titles, title_urls, title_details):
    data = {
        'title': title.get_text(),
        'url': title_url['href'],
        'detail': title_detail.get_text(),
    }
    print(data)
Of course, some pages don't require any headers at all, in which case you can simply omit that argument.
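Without headers, the request step shrinks to a bare Request (or you can pass the URL string straight to urlopen). A sketch, with no network call made here:

```python
import urllib.request

url = "http://blog.csdn.net/zhz459880251/article/details/50212257"

# No headers dict needed; urlopen(request) -- or urlopen(url) directly --
# would then fetch the page.
request = urllib.request.Request(url)
print(request.get_method())  # GET
print(request.full_url)
```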
OK, that's the whole thing: each parsed result ends up in its own dictionary. It really is that simple!
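If you want to reuse the results rather than just print them, a small variation is to append each dictionary to a list. This sketch parses a hypothetical inline snippet that mirrors the table structure scraped above, so it runs without any network request:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the fetched CSDN page.
html = """
<div><table><tbody><tr>
  <td><a href="/article/1">First post</a></td>
  <td><big>A short summary</big></td>
</tr></tbody></table></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Collect the dicts instead of printing, so the scraped data can be
# saved to a file, filtered, etc. later.
results = []
for link, detail in zip(soup.select('td > a'), soup.select('td > big')):
    results.append({
        'title': link.get_text(),
        'url': link['href'],
        'detail': detail.get_text(),
    })

print(results)
```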