Two Small Python Web-Scraping Examples (with Source Code)

冉静学习开发

Published 2024-05-01 11:06:42

Column: Programmers. Tags: python, web crawler, programming language

Article link: https://blog.csdn.net/m0_61331407/article/details/138368881

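### 1. Scrape the comments of a Baidu Tieba thread and save them to a CSV file

```python
import csv
import re
import time

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}


def main(page, csvwriter):
    """Scrape one page of the thread and append its comments to the CSV file."""
    url = f'https://tieba.baidu.com/p/7882177660?pn={page}'
    resp = requests.get(url, headers=HEADERS)
    html = resp.text
    # NOTE: the closing tags of the first two patterns were lost when this
    # article was extracted; '</div>' and '</a>' are assumptions. The third
    # pattern may also have lost markup between '楼' and the capture group.
    # Comment text
    comments = re.findall(r'style="display:;">\s*(.*?)</div>', html)
    # Commenting users
    users = re.findall(r'class="p_author_name j_user_card" href=".*?" target="_blank">(.*?)</a>', html)
    # Comment times
    comment_times = re.findall(r'楼(.*?)<div', html)
    for u, c, t in zip(users, comments, comment_times):
        # Filter out abnormal rows (leftover markup or an overlong user name)
        if 'img' in c or 'div' in c or len(u) > 50:
            continue
        csvwriter.writerow((u, t, c))
        print(u, t, c)
    print(f'Page {page} scraped')


if __name__ == '__main__':
    # newline='' keeps the csv module from writing blank rows on Windows
    with open('01.csv', 'a', encoding='utf-8', newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow(('User', 'Time', 'Comment'))
        for page in range(1, 8):  # scrape the first 7 pages
            main(page, csvwriter)
            time.sleep(2)
```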


Key result screenshot:


![](https://img-blog.csdnimg.cn/c1ff6294eeb34d0ebe39f4975b59c134.png)
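
The filter-and-write step of the script above can be tried without touching the network. The following is a minimal sketch, assuming the three `re.findall` calls have already returned parallel lists; the sample rows and the helper name `write_rows` are made up for illustration and do not come from the original script.

```python
import csv
import io


def write_rows(writer, users, comments, times):
    """Write (user, time, comment) rows, skipping rows that look malformed."""
    kept = 0
    for user, comment, when in zip(users, comments, times):
        # Same rule as the script: leftover markup or an overlong user name
        # suggests the regex matched something other than a plain comment.
        if 'img' in comment or 'div' in comment or len(user) > 50:
            continue
        writer.writerow((user, when, comment))
        kept += 1
    return kept


# Hypothetical sample data standing in for the regex results.
users = ['alice', 'bob', 'x' * 60]
comments = ['nice post', '<img src="a.png">', 'hello']
times = ['2022-06-01 10:00', '2022-06-01 10:05', '2022-06-01 10:09']

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
writer = csv.writer(buffer)
writer.writerow(('User', 'Time', 'Comment'))
kept = write_rows(writer, users, comments, times)
print(kept)
print(buffer.getvalue())
```

Because `zip` stops at the shortest list, a page on which the three patterns match a different number of times silently misaligns users and comments, so it is worth comparing the list lengths before trusting the output.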



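### 2. Build a multithreaded crawler that scrapes some chapters of a novel and stores them in a database (at least 10 chapters)


![](https://img-blog.csdnimg.cn/60232f41707b4b3386d4d9ec0c209721.png)


The target for this example is a novel website; we scrape the first novel listed on it.


![](https://img-blog.csdnimg.cn/227e4c8e7ae3402da5a847fa2b6df19f.png)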


Next, inspect the page source to find where each chapter's link is located.


Once the links are located, use XPath to extract each chapter's link and title.


![](https://img-blog.csdnimg.cn/c1a11f371d3c4550b8f5337e80f3d3a1.png)
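
The extraction step can be sketched on a small inline fragment. This example swaps `lxml` for the standard library's `xml.etree.ElementTree` so that it runs without extra packages; ElementTree supports only a subset of XPath and needs well-formed markup, whereas `lxml.etree.HTML` tolerates real-world HTML. The fragment, the base URL, and the helper name `extract_chapters` are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed fragment standing in for the chapter list.
SAMPLE = """
<div>
  <dl>
    <dd><a href="1001.html">Chapter 1</a></dd>
    <dd><a href="1002.html">Chapter 2</a></dd>
    <dd><a href="/book_1/1003.html">Chapter 3</a></dd>
  </dl>
</div>
"""


def extract_chapters(markup, base_url, limit=15):
    """Return (title, absolute_url) pairs for the first `limit` chapters."""
    root = ET.fromstring(markup)
    chapters = []
    for dd in root.findall('./dl/dd')[:limit]:
        link = dd.find('./a')
        if link is None:  # skip <dd> nodes that contain no link
            continue
        # urljoin copes with both relative and root-relative hrefs
        chapters.append((link.text, urljoin(base_url, link.get('href'))))
    return chapters


chapters = extract_chapters(SAMPLE, 'https://example.com/book_1/')
for title, url in chapters:
    print(title, url)
```

Using `urljoin` rather than plain string concatenation keeps the chapter URL correct whether the `href` is relative to the book page or starts from the site root.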


Because the crawler sends requests with `requests` many times, we wrap that call in a function so it can be reused later.


![](https://img-blog.csdnimg.cn/dcc37de90afa4b649deaf4e108392e4d.png)

With every chapter link collected, we open a chapter page and analyze its content.


![](https://img-blog.csdnimg.cn/522c93cf8ed6457c9eae595006c86904.png)


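Inspecting the page shows that the chapter text is present in the page source, so it is static data.


We extract it with a regular expression (`re`) and then clean the result.


![](https://img-blog.csdnimg.cn/b6ef3f53033f421194ff02c53d56db78.png)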
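
As a rough illustration of the extract-then-clean step, here is a minimal sketch using only the standard library. The sample page, the `<div id="content">` container, and the helper name `extract_text` are assumptions; the real site's markup may differ.

```python
import html
import re

# Hypothetical chapter page; the real markup will differ.
SAMPLE_PAGE = (
    '<html><body><h1>Chapter 1</h1>'
    '<div id="content">&nbsp;&nbsp;First line.<br />'
    '&nbsp;&nbsp;Second line.<br /></div>'
    '</body></html>'
)


def extract_text(page):
    """Pull the chapter body out of the page source and tidy it up."""
    # re.S lets '.' span newlines, so multi-line bodies still match.
    match = re.search(r'<div id="content">(.*?)</div>', page, re.S)
    if match is None:
        return ''
    body = match.group(1)
    body = re.sub(r'<br\s*/?>', '\n', body)  # line-break tags -> newlines
    body = html.unescape(body)               # decode &nbsp; and other entities
    body = body.replace('\xa0', ' ')         # non-breaking spaces -> plain spaces
    # Trim each line and drop the empty ones.
    lines = [line.strip() for line in body.splitlines()]
    return '\n'.join(line for line in lines if line)


print(extract_text(SAMPLE_PAGE))
```

Returning an empty string when nothing matches avoids the `IndexError` that indexing `re.findall(...)[0]` raises on a page with an unexpected layout.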


The data then needs to be saved to a database. Here I store the scraped data in MySQL, so we first wrap the database operations in helper functions.


![](https://img-blog.csdnimg.cn/687c3043c26e4d7ebf31b375e296ccc4.png)


Then save the scraped data.


![](https://img-blog.csdnimg.cn/9159d1c055e44b199f3124684a9986b9.png)
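
The storage step can be tried locally without a MySQL server. The sketch below deliberately substitutes the standard library's `sqlite3` (in-memory) for MySQL; the table and column names follow the `INSERT` statement used in this article, while the `id` column, the column types, and the helper names `create_table` and `save_chapter` are assumptions.

```python
import sqlite3
from contextlib import closing


def create_table(conn):
    """Create the novel table if it does not exist yet (assumed schema)."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS novel ('
        ' id INTEGER PRIMARY KEY AUTOINCREMENT,'
        ' novel_name TEXT NOT NULL,'
        ' chapter_name TEXT NOT NULL,'
        ' content TEXT NOT NULL)'
    )


def save_chapter(conn, novel_name, chapter_name, content):
    """Insert one chapter using a parameterized query."""
    # sqlite3 uses '?' placeholders; pymysql uses '%s' for the same purpose.
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            'INSERT INTO novel(novel_name, chapter_name, content) VALUES (?, ?, ?)',
            (novel_name, chapter_name, content),
        )


with closing(sqlite3.connect(':memory:')) as conn:
    create_table(conn)
    save_chapter(conn, 'Demo Novel', 'Chapter 1', 'First line.\nSecond line.')
    save_chapter(conn, 'Demo Novel', 'Chapter 2', 'More text.')
    rows = conn.execute('SELECT chapter_name FROM novel ORDER BY id').fetchall()
    print(rows)
```

Passing the values as parameters instead of formatting them into the SQL string matters here, because chapter text routinely contains quotes that would otherwise break the statement.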


The last step is to use multithreading to speed up the crawler; here we create a thread pool with 5 threads.


![](https://img-blog.csdnimg.cn/670b560289b34323842f14b18890ce5e.png)
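
To see how a 5-thread pool behaves, the following minimal sketch replaces the real download with a stand-in that only sleeps. The function `fake_download`, the chapter titles, and the simulated failure are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(title):
    """Stand-in for the real chapter download; sleeps to simulate network I/O."""
    time.sleep(0.1)
    if title == 'Chapter 3':
        raise ValueError('simulated failure')
    return f'{title}: ok'


titles = [f'Chapter {i}' for i in range(1, 11)]
results, errors = [], []

with ThreadPoolExecutor(max_workers=5) as pool:
    # Map each future back to its title so failures can be reported by name.
    futures = {pool.submit(fake_download, t): t for t in titles}
    for future in as_completed(futures):
        title = futures[future]
        try:
            results.append(future.result())
        except ValueError as exc:
            errors.append((title, str(exc)))

print(len(results), errors)
```

An exception raised inside a submitted task is stored on its future and only surfaces when `result()` is called, so collecting the futures is what makes failed chapters visible.
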
**Program source code:**

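```python
import re
from concurrent.futures import ThreadPoolExecutor
from time import sleep
from urllib.parse import urljoin

import pymysql
import requests
from lxml import etree


def get_conn():
    """Open a MySQL connection and return it together with a cursor."""
    # Create the connection
    conn = pymysql.connect(host="127.0.0.1",
                           user="root",
                           password="root",
                           database="novels",
                           charset="utf8")
    # Create the cursor
    cursor = conn.cursor()
    return conn, cursor


def close_conn(conn, cursor):
    """Close the cursor and the connection, skipping whichever was never opened."""
    if cursor is not None:
        cursor.close()
    if conn is not None:
        conn.close()


def get_xpath_resp(url):
    """Fetch a page and return both the parsed lxml tree and the raw response."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
    resp = requests.get(url, headers=headers, timeout=10)
    tree = etree.HTML(resp.text)  # parse the HTML with etree
    return tree, resp


def get_chapters(url):
    """Return the chapter titles, chapter links, and the novel name."""
    tree, _ = get_xpath_resp(url)
    # Novel name
    novel_name = tree.xpath('//*[@id="info"]/h1/text()')[0]
    # Nodes that hold the chapter entries
    dds = tree.xpath('/html/body/div[4]/dl/dd')
    title_list = []
    link_list = []
    for d in dds[:15]:
        title = d.xpath('./a/text()')[0]  # chapter title
        title_list.append(title)
        link = d.xpath('./a/@href')[0]  # chapter link
        chapter_url = urljoin(url, link)  # build the full link
        link_list.append(chapter_url)
    return title_list, link_list, novel_name


def get_content(novel_name, title, url):
    """Download one chapter, clean its text, and insert it into the database."""
    conn = None
    cursor = None
    try:
        conn, cursor = get_conn()
        # SQL for inserting a row
        sql = 'INSERT INTO novel(novel_name,chapter_name,content) VALUES(%s,%s,%s)'
        _, resp = get_xpath_resp(url)
        # NOTE: the HTML tags in this pattern and in the replace() calls below
        # were stripped when the article was extracted. '<div id="content">',
        # '<br />' and '&nbsp;' are assumptions; check them against the page.
        content = re.findall(r'<div id="content">(.*?)</div>', resp.text, re.S)[0]
        # Clean the content
        content = (content.replace('<br />', '\n')
                          .replace('&nbsp;', ' ')
                          .replace('全本小说网 www.qb5.tw，最快更新宇宙职业选手最新章节！', '')
                          .strip())
        print(title, content)
        cursor.execute(sql, [novel_name, title, content])  # insert the row
        conn.commit()  # commit the transaction to persist the data
    except Exception as exc:
        # Report the failure instead of silently swallowing it
        print(f'Failed to save {title}: {exc}')
    finally:
        sleep(2)
        close_conn(conn, cursor)  # close the database connection


if __name__ == '__main__':
    # Get the chapter titles, chapter links, and the novel name
    title_list, link_list, novel_name = get_chapters('https://www.qb5.tw/book_116659/')
    with ThreadPoolExecutor(5) as t:  # create a pool of 5 threads
        for title, link in zip(title_list, link_list):
            t.submit(get_content, novel_name, title, link)  # start the task
```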

**Result screenshots:**


![](https://img-blog.csdnimg.cn/e26b6e16d9834f05a3a69c06c4d17693.png)

![](https://img-blog.csdnimg.cn/8a8134d919624e44b6f130ba868a6306.png)


