Note before we start: parts of the URLs requested in this chapter are masked with **.
The crawler is split into three modules:
1. Request module: builds the request and hands the fetched page (data) to the parsing module;
2. Parsing module: extracts the data (this chapter uses xpath to pull data out of the page) and hands it to the storage module;
3. Storage module: writes the data to a JSON file.
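Before looking at the real crawler, here is a minimal sketch of how the three modules hand data along. The `fetch`, `parse`, and `store` functions are made-up stand-ins, not the real code below; the point is only the generator pipeline shape:

```python
# Hypothetical stand-ins for the three modules, chained with generators.
def fetch(pages):                # request module: yield raw pages
    for p in pages:
        yield '<html>page %d</html>' % p

def parse(html_iter):            # parsing module: yield extracted records
    for html in html_iter:
        yield {'content': html}

def store(records):              # storage module: collect the records
    return list(records)

result = store(parse(fetch([1, 2])))
```

Because `fetch` and `parse` are generators, no page is downloaded or parsed until `store` actually consumes it; the real crawler below uses the same lazy structure.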
Case overview:
Scrape https://www.qiushibai**.com/text/page/%d/ for each published post's author, age, content, and so on.
A quick walk-through of the crawler's approach:
1. Visit https://www.qiushibai**.com/text/page/%d/ and check whether the site loads its content dynamically;
2. Parse the page content with xpath;
3. Store the scraped data in a JSON file;
4. Write the crawler code, as follows:
import json
from time import sleep
from urllib import request
from lxml import etree

# Request module: build the request object for one page
def request_handle(url, page):
    new_url = url % page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    }
    return request.Request(url=new_url, headers=headers)

# Request module: fetch each page and yield its decoded HTML
def request_html(url, start, end):
    for page in range(start, end + 1):
        req = request_handle(url, page)
        res = request.urlopen(req)
        sleep(1)  # be polite: pause between requests
        yield res.read().decode('utf-8')

# Parsing module: extract author, avatar, age, content and comment count with xpath
def analysis_html(html_list):
    for html in html_list:
        html_tree = etree.HTML(html)
        div_list = html_tree.xpath("//div[@id='content-left']/div")
        for div in div_list:
            item = {}
            item['author'] = div.xpath(".//a/h2/text()")[0]
            item['avatar'] = 'https:' + div.xpath(".//a/img/@src")[0]
            age = div.xpath(".//div[@class='articleGender manIcon']/text()")
            item['age'] = age[0] if len(age) > 0 else None
            item['content'] = div.xpath(".//a/div[@class='content']/span/text()")[0]
            comment = div.xpath(".//a/i/text()")
            item['comment'] = comment[0] if len(comment) > 0 else None
            yield item

# Storage module: append each record to the JSON file, one per line
def save_json(data):
    with open('./qiushi.json', 'a', encoding='utf-8') as fp:
        for item in data:
            fp.write(json.dumps(item, ensure_ascii=False) + '\n')

def main():
    url = 'https://www.qiushibai**.com/text/page/%d/'
    start = int(input('Enter the start page: '))
    end = int(input('Enter the end page: '))
    # Request
    html_list = request_html(url, start, end)
    # Parse
    data = analysis_html(html_list)
    # Store
    save_json(data)

if __name__ == '__main__':
    main()
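The storage module appends one record per line (JSON Lines style), so reading the file back is one json.loads per line. A small sketch, using io.StringIO with made-up records in place of the real qiushi.json file:

```python
import io
import json

# Stand-in for qiushi.json: one JSON object per line (JSON Lines).
fp = io.StringIO('{"author": "A", "age": "25"}\n{"author": "B", "age": null}\n')
records = [json.loads(line) for line in fp if line.strip()]
```

The same loop works unchanged on the real file opened with open('./qiushi.json', encoding='utf-8').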
Friendly reminder: if you don't remember xpath, go back and review section 1.7, Getting to Know Web Parsing Tools (*≧▽≦*)
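As a quick refresher, here is the relative-xpath pattern from the parsing module applied to a tiny made-up HTML snippet (this mimics, but is not, the real page structure):

```python
from lxml import etree

# Made-up HTML echoing the div/a/h2 and div.content layout the parser expects.
html = """
<div id="content-left">
  <div>
    <a><h2>AuthorA</h2></a>
    <a><div class="content"><span>story one</span></div></a>
  </div>
</div>
"""
tree = etree.HTML(html)
items = []
# Absolute xpath selects each post's outer div; relative ".//" xpaths
# then search only inside that div, just like analysis_html does.
for div in tree.xpath("//div[@id='content-left']/div"):
    items.append({
        'author': div.xpath(".//a/h2/text()")[0],
        'content': div.xpath(".//a/div[@class='content']/span/text()")[0],
    })
```

The leading dot in ".//a/h2/text()" is what scopes the search to the current div; without it, the query would match against the whole document on every iteration.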