A Hands-On Python Web Scraping Project
I've been dabbling in web scraping for quite a while now, mostly "programming by way of GitHub" ^^, and recently I decided to shed that label. So I looked up some hands-on scraping projects and set myself a challenge, starting with the simplest one.
I decided to start with Douban Books: scrape the books in a category I'm interested in, extract fields such as title, author, and rating, then sort them by rating from high to low and write the result to a CSV file.
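The sort-and-save step by itself can be sketched like this. The sample rows and the `douban.csv` filename are just placeholders; the point is the float conversion in the sort key (the ratings come out of the page as strings) and the `utf-8-sig` encoding:

```python
import csv

# Hypothetical sample rows in the same shape the scraper collects:
# [title, author, rating-as-string]
books = [
    ['Book A', 'Author A', '8.5'],
    ['Book B', 'Author B', '9.1'],
    ['Book C', 'Author C', '7.9'],
]

# Sort by rating, highest first; ratings are strings, so convert to float
books.sort(key=lambda row: float(row[2]), reverse=True)

# utf-8-sig writes a BOM so Excel displays non-ASCII text correctly
with open('douban.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Author', 'Rating'])
    writer.writerows(books)
```

Sorting on `float(row[2])` rather than the raw string matters: as strings, '10.0' would sort before '9.1'.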
Douban is something of a classic beginner site for scraping: you only need to set a user-agent, and there's no need for the Scrapy framework. A requests + BeautifulSoup + re stack is enough to pull it down. Without further ado, here's the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

def getHTMl(url):  # fetch a page and return its HTML
    header = {
        'cookie': 'bid=ZUbLXkQB7M4; douban-fav-remind=1; __gads=ID=f5c402ca9746fb6a:T=1589220899:S=ALNI_MZ1QGprQXmEaFsUlcdBE8TQmjjGVA; __utmc=30149280; __utmz=30149280.1591805816.3.3.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utma=30149280.1395689765.1589220900.1591805816.1591838258.4; __utmc=81379588; __utmz=81379588.1591838258.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=81379588.1826912001.1591838258.1591838258.1591838258.1; _pk_id.100001.3ac3=463e96a4378e7ffb.1591838258.1.1591838270.1591838258.',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
    try:
        html = requests.get(url, headers=header, timeout=30).text
        return html
    except requests.RequestException:
        return ''  # empty page on failure, so the caller just finds nothing to parse

def Page(ulist, html):  # parse a page and pull out the fields we need
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.find_all('div', class_='info')
    for tag in tags:
        title = tag.h2.a['title']
        author = tag.find_all('div', class_='pub')[0].string
        # the pub line reads "author / publisher / date / price";
        # keep only the part before the first '/'
        author = re.findall(r'.*?/', author)[0].replace('/', '')
        number = tag.find_all('span', class_='rating_nums')[0].string
        ulist.append([title, author, number])
    ulist = sorted(ulist, key=lambda s: float(s[2]), reverse=True)
    return list(ulist)

def info(ulist):  # print the book list
    tplt = '{:6}\t{:10}\t{:3}'
    print(tplt.format('Title', 'Author', 'Rating'))
    for value in ulist:
        print(tplt.format(value[0], value[1], value[2]))

def intoFile(ulist):  # save the scraped rows to a CSV file
    f = open('douban.csv', 'w', encoding='utf-8-sig', newline='')
    csv_writer = csv.writer(f)
    csv_writer.writerow(['Title', 'Author', 'Rating'])
    for line in ulist:
        csv_writer.writerow(line)
    f.close()
    print('Done scraping')

def main():
    sturl = 'https://book.douban.com/tag/推理?start='
    ulist = []
    depth = 236  # crawl depth, i.e. how many result pages to walk through
    for i in range(depth):
        try:
            url = sturl + str(i * 20)
            html = getHTMl(url)
            ulist = Page(ulist, html)
        except Exception:  # a page that fails to parse is simply skipped
            continue
    # info(ulist)
    intoFile(ulist)

main()
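To sanity-check the parsing logic in `Page()` without hitting Douban, you can run the same extraction steps against a hand-written HTML snippet. The markup below is an assumption, modeled only on the selectors the code uses (`div.info`, `h2 > a[title]`, `div.pub`, `span.rating_nums`), and it uses the stdlib `html.parser` so no `lxml` install is needed:

```python
import re
from bs4 import BeautifulSoup

# Hand-crafted snippet mimicking the list-page structure (an assumption,
# not Douban's actual markup)
html = '''
<div class="info">
  <h2><a href="#" title="Example Title">Example Title</a></h2>
  <div class="pub">Some Author / Some Press / 2019 / 39.00</div>
  <span class="rating_nums">8.7</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')  # the post itself uses 'lxml'
tag = soup.find_all('div', class_='info')[0]
title = tag.h2.a['title']
author = tag.find_all('div', class_='pub')[0].string
# same regex trick: grab everything up to the first '/', then drop it
author = re.findall(r'.*?/', author)[0].replace('/', '').strip()
rating = tag.find_all('span', class_='rating_nums')[0].string
print(title, '|', author, '|', rating)
```

This is also the quickest way to notice the fragile spots: if a book has no rating, `find_all('span', class_='rating_nums')[0]` raises an `IndexError`, which is why the main loop wraps each page in a try/except.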
Here's what the scraped result looks like:
This counts as the first small project I've written entirely on my own, purely for practice. There's of course plenty of room for improvement, code readability among other things, and it doesn't touch databases or Scrapy yet. A long road ahead!