Web Scraper Practice Example
Scrape novel listings from Douban (https://www.douban.com/tag/小说/book)
Language: Python
Libraries used: [a red squiggly line in your IDE means the library is not installed yet]
- re — regular expressions
- urllib — sending requests
- bs4 (BeautifulSoup) — parsing the page and extracting data
- sqlite3 — database storage
Scraping approach:
Fetching the data:
- Define the link to scrape (baseurl)
- Fetch the HTML string for that link (askURL)
  request = urllib.request.Request(url, ...)
  response = urllib.request.urlopen(request)
  html = response.read().decode('utf-8')
- Parse the string with BeautifulSoup and collect all matching tags into a list (find_all)
- Iterate over the list
- Convert each item back to str
- Extract all fields with regular expressions and store them in a list (findall)
- Return the list of all data
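The parse-and-extract steps above can be sketched on a small hand-written HTML fragment (the markup below is a hypothetical stand-in for one of Douban's `<dl>` blocks, not fetched from the site):

```python
import re
from bs4 import BeautifulSoup

# hypothetical fragment mimicking one Douban <dl> entry
html = '''
<dl>
  <dd><a class="title" href="https://book.douban.com/subject/1/" target="_blank">活着</a>
      <span class="rating_nums">9.4</span></dd>
</dl>
'''

findname = re.compile(r'<a class="title" href=".*" target="_blank">(.*?)</a>')
findrating = re.compile(r'<span class="rating_nums">(.*?)</span>')

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('dl'):
    item = str(item)  # convert the tag back to a string for regex matching
    name = re.findall(findname, item)[0]
    rating = re.findall(findrating, item)[0]
    print(name, rating)
```

The same find_all → str → findall pattern is what the full script below uses, just applied to the live page.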
Storing in the database:
- Define the database to create
- Connect to the database
  conn = sqlite3.connect(dbpath)
- Get a cursor
  cursor = conn.cursor()
- Execute the SQL [SQL syntax is needed here]
  cursor.execute(sql)
- Commit
  conn.commit()
- Close
  cursor.close()
  conn.close()
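The database steps above in a minimal stand-alone form, using an in-memory database so the example leaves no file behind (the table layout and data here are illustrative only):

```python
import sqlite3

# ':memory:' creates a throwaway in-memory database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table bookinfo (bookname text, bookrating numeric)')
# parameterized insert; ('活着', 9.4) is sample data, not scraped
cursor.execute('insert into bookinfo values (?, ?)', ('活着', 9.4))
conn.commit()
cursor.execute('select * from bookinfo')
row = cursor.fetchone()
print(row)  # ('活着', 9.4)
cursor.close()
conn.close()
```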
# coding=utf-8
import urllib
import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import re
import sqlite3
findbook = re.compile(r'<a href="(.*?)" target="_blank">')
findimg = re.compile(r'<img src="(.*?)"/>', re.S)
findname = re.compile(r'<a class="title" href=".*" target="_blank">(.*?)</a>')
finddesc = re.compile(r'<div class="desc">(.*?)</div>', re.S)
findrating = re.compile(r'<span class="rating_nums">(.*?)</span>')
def main():
    baseurl = 'https://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/book'
    savepath = 'doubanbook.db'
    datalist = getData(baseurl)
    saveData(datalist, savepath)
def getData(baseurl):
    datalist = []
    for page in range(0, 20):
        url = baseurl + '?start={}'.format(page * 15)
        html = askURL(url)
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('dl'):
            data = []
            item = str(item)
            linkBook = re.findall(findbook, item)[0]
            data.append(linkBook)
            linkImg = re.findall(findimg, item)[0]
            data.append(linkImg)
            bookName = re.findall(findname, item)[0]
            data.append(bookName)
            bookDesc = re.findall(finddesc, item)[0]
            data.append(bookDesc.strip())
            rating = re.findall(findrating, item)
            if len(rating) == 0:
                rating = '0.0'
            else:
                rating = rating[0]
            data.append(rating)
            datalist.append(data)
        print("Progress: {}/{}".format(page + 1, 20))
    return datalist
def askURL(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
    }
    request = urllib.request.Request(url, headers=headers)
    html = ""  # fall back to an empty string if the request fails
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html
def saveData(datalist, savepath):
    init_db(savepath)
    conn = sqlite3.connect(savepath)
    cursor = conn.cursor()
    sql = '''
        insert into bookinfo(
            booklink, imglink, bookname, bookinfo, bookrating
        ) values (?, ?, ?, ?, ?);
    '''
    for item in datalist:
        # parameterized query: no manual quoting, and stray quotes in the data cannot break the SQL
        cursor.execute(sql, item)
    conn.commit()
    cursor.close()
    conn.close()
def init_db(dbpath):
    sql = '''
        create table if not exists bookinfo
        (
            id integer primary key autoincrement,
            booklink text,
            imglink text,
            bookname varchar,
            bookinfo varchar,
            bookrating numeric
        );
    '''
    conn = sqlite3.connect(dbpath)
    cursor = conn.cursor()
    cursor.execute(sql)
    conn.commit()
    cursor.close()
    conn.close()
if __name__ == '__main__':
    main()
    print("Scraping complete")
Execution result
A file named doubanbook.db is created in the current directory, containing all the scraped data.
Extra notes
I also learned how to store the data in an Excel file.
Library used: xlwt
- Create a workbook object
  book = xlwt.Workbook(encoding='utf-8')
- Create a sheet
  sheet = book.add_sheet('sheet1')
- Write to a cell [indices start at (0, 0)]
  sheet.write(row, column, value)
- Save the workbook
  book.save(savepath)
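Putting those four xlwt steps together, a minimal sketch that writes a header row plus one data row (the file name books.xls and the sample values are made up for this example):

```python
import os
import xlwt

# create the workbook and one sheet
book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('sheet1')

# write(row, column, value); indices start at (0, 0)
sheet.write(0, 0, 'bookname')
sheet.write(0, 1, 'rating')
sheet.write(1, 0, '活着')
sheet.write(1, 1, 9.4)

savepath = 'books.xls'  # hypothetical output path
book.save(savepath)
print(os.path.exists(savepath))  # True
```

To save the scraper's output this way, you would loop over datalist and call sheet.write once per field, incrementing the row index for each book.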