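1. Identify the data endpoint (the URL that serves the data)
2. Analyze the page structure and confirm that it is a static page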
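One common way to confirm that a page is static is to check that text you can see in the browser already appears in the raw HTML returned by the server, before any JavaScript runs. The sketch below illustrates that idea only; `looks_static` and both sample HTML strings are hypothetical and are not part of the original code.

```python
def looks_static(raw_html, expected_texts):
    """Return True if every expected text appears in the raw HTML.

    raw_html is the response body as fetched without running JavaScript
    (for example, the .text of an HTTP response). expected_texts are
    strings visible in the rendered page, such as a book title.
    """
    return all(text in raw_html for text in expected_texts)


# Hypothetical samples: a server-rendered fragment versus an empty shell
static_html = '<div id="content"><h2><a href="/txt/1/">Book A</a></h2></div>'
dynamic_html = '<div id="content"></div><script src="app.js"></script>'

print(looks_static(static_html, ["Book A"]))
print(looks_static(dynamic_html, ["Book A"]))
```

If the check fails, the data is probably loaded by a script, and XPath on the raw HTML will not find it.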
3. Locate the tags and extract the detail-page URLs
3.1 Build the full detail-page URL
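from lxml import etree

# html_one: HTML text of the list page
html = etree.HTML(html_one)
# Collect the detail-page URL of every book on the list page
try:
    base_url_lists = html.xpath('.//div[@id="content"]/div[@class="cover"]/div')
    for base_url_list in base_url_lists:
        # Prepend the site root to the href taken from the page
        base_url = base_url_list.xpath('./div[2]/h2/a/@href')[0]
        detail_url = "https://m.22shuquge.com{}".format(base_url)
        get_detail_data(detail_url)
except Exception as e:
    # Report the failure instead of silently discarding it
    print(f"Failed to extract detail-page URLs: {e}")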
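String formatting works when every href starts with "/". As a slightly more defensive alternative, the standard library's `urljoin` also handles hrefs that are already absolute or that lack the leading slash. This is a minimal sketch; `build_detail_url` is a hypothetical helper and the example paths are made up for illustration.

```python
from urllib.parse import urljoin

# Site root taken from the snippet above
BASE = "https://m.22shuquge.com"


def build_detail_url(href, base=BASE):
    """Resolve an href from the list page against the site root."""
    return urljoin(base + "/", href)


# Hypothetical hrefs: root-relative, relative, and already absolute
print(build_detail_url("/txt/123/index.html"))
print(build_detail_url("txt/123/index.html"))
print(build_detail_url("https://m.22shuquge.com/txt/123/index.html"))
```

All three calls resolve to the same absolute URL, so the loop does not need to know which form the site emits.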
4. Open the detail page, find where the target data sits, and locate the corresponding tags
5. Scrape the data
# html_two: HTML text of the detail page
html = etree.HTML(html_two)
try:
    # Extract the fields of each book
    datas = html.xpath('//div[@id="content"]/div[@class="cover"]/div/div[2]')
    for data in datas:
        book_name = data.xpath('.//h2/a/text()')[0]
        author = data.xpath('./p[3]/text()')[0]
        # Each of these lines reads "label:value"; keep only the value
        leibie = data.xpath('./p[4]/text()')[0].split(":")[1]      # category
        zhuangtai = data.xpath('./p[5]/text()')[0].split(":")[1]   # status
        gengxin = data.xpath('./p[6]/text()')[0].split(":")[1]     # last update
        book_dict = {
            'name': book_name,
            'author': author,
            'leibie': leibie,
            'zhuangtai': zhuangtai,
            'gengxin': gengxin,
        }
        print(f"{book_dict} scraped successfully")
        # Save the record
        # save(book_dict)
except Exception as e:
    # Report the failure instead of silently discarding it
    print(f"Failed to scrape book data: {e}")
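Two things can go wrong in the extraction above: indexing `[0]` raises `IndexError` when an XPath matches nothing, and `split(":")[1]` keeps only the text up to the next colon, so a value that itself contains a colon (a time such as 12:30, for instance) would be cut short. The helpers below are one possible way to guard against both. `first` and `field_value` are hypothetical names, and accepting the full-width colon is an assumption about how the page might format its labels.

```python
def first(items, default=""):
    """Return the first element of an XPath result list, or a default."""
    return items[0] if items else default


def field_value(text):
    """Return the part after the first colon in a 'label:value' string.

    Both the ASCII colon and the full-width colon (U+FF1A) are accepted.
    If there is no colon, the stripped text is returned unchanged.
    """
    normalized = text.replace("\uff1a", ":")
    _, sep, value = normalized.partition(":")
    return value.strip() if sep else normalized.strip()


# Intended use inside the loop (not executed here):
#   gengxin = field_value(first(data.xpath('./p[6]/text()')))
print(field_value("Updated: 2021-01-01 12:30"))
print(repr(first([])))
```

With these in place, a missing node yields an empty string for that field instead of aborting the whole loop.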
6. Store the data
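The scraping code leaves its `save(book_dict)` call commented out and the notes do not show how the data is stored. As one possible approach, and not necessarily the one originally intended, the sketch below appends each record to a JSON Lines file. The file name `books.jsonl` is a placeholder.

```python
import json


def save(book_dict, path="books.jsonl"):
    """Append one book record to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file
        f.write(json.dumps(book_dict, ensure_ascii=False) + "\n")


# Intended use inside the scraping loop (not executed here):
#   save(book_dict)
```

Appending one line per record means a crash partway through a run does not lose the records already written.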
Summary: the point of this case study is to become fluent in locating tags on static pages and in using XPath.