Learning Python Web Scraping for the First Time: A Qiushibaike Project - CSDN Blog

Article link: https://blog.csdn.net/weixin_50342056/article/details/108228012

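I'm in my second year of high school and was bored at home (maybe I don't have enough homework), so I started learning Python web scraping. I already had a little Python background, and over the past couple of days I picked up the basics of the requests and urllib libraries, as well as bs4 and XPath. This post is a record of what I've learned so far as a beginner.
Today I'll scrape the videos from Qiushibaike. They're also nice to watch when I have nothing to do.
Libraries used today: re, requests, and bs4 (I'm not comfortable with XPath yet, so I went with bs4).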

from bs4 import BeautifulSoup
import re
import requests

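Qiushibaike video page: https://www.qiushibaike.com/video/
Open the page and press F12: the developer tools show each video's address in the src attribute of a source element.
That is the link we want, and BeautifulSoup together with re can extract it.
First, fetch and parse the page: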

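url = 'https://www.qiushibaike.com/video/'
headers = {'User-Agent': 'Mozilla/5.0'}                # placeholder; fill in your own
res = requests.get(url=url, headers=headers).text      # send a request to the target site
html = BeautifulSoup(res, 'lxml')                      # parse the HTML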
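
The headers passed to requests.get are meant to be your own. As a rough sketch of what such a dictionary might look like, the snippet below builds a request object with the standard library's urllib.request, which stores the URL and headers without sending anything over the network. The User-Agent string is a made-up placeholder, not a value from this post; the same dictionary could be passed to requests.get as headers=headers.

```python
import urllib.request

url = 'https://www.qiushibaike.com/video/'
# Placeholder browser-like User-Agent (an assumption; copy your own from F12 > Network).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Building the Request does not contact the server; it only records the URL and headers.
req = urllib.request.Request(url, headers=headers)
print(req.full_url)
print(req.header_items())
```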

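Next, analyze the page. This one is easy, because the link sits under the video tag, so we locate that tag first.
Then we write a regular expression and match against it: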

vid = re.compile(r'<source src="//(.*?)" type="video/mp4"/>')

num = 0                                    # counter, used later as the file name
for item in html.find_all('video', controls="controls"):
    item = str(item)                       # serialize the tag so the regex can search it
    video_list = re.findall(vid, item)
    num = num + 1
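
To see what the pattern captures, here is a small offline sketch that runs the same regular expression on a hand-written string. The sample tag and the example.com address are made up for illustration; they stand in for what str(item) might produce for one video tag.

```python
import re

vid = re.compile(r'<source src="//(.*?)" type="video/mp4"/>')

# Hypothetical serialized <video> tag; the address is invented for this demo.
sample = ('<video controls="controls">'
          '<source src="//example.com/clips/demo.mp4" type="video/mp4"/>'
          '</video>')

# findall returns only the captured group: the address without the leading //.
video_list = re.findall(vid, sample)
print(video_list)
```

The non-greedy (.*?) matches as little as possible, stopping at the first point where the rest of the pattern fits, so only the address itself ends up in the list.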


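Set up a counter outside the for loop; we'll use it later to name the video files.
Next, loop over the list, video_list, with a second for loop, which gives us two nested loops. Once we have an address, prepend https:// to get the full URL, then send another request. For headers, just fill in your own.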


for video_url in video_list:
    video_urls = 'https://' + str(video_url)
    response = requests.get(url=video_urls, headers=headers).content    # response body as bytes
    with open('video/' + str(num) + '.mp4', 'wb') as file:
        file.write(response)
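
The URL-building and naming steps can also be separated from the download itself, which makes them easy to check without touching the network. The function name plan_downloads and the sample addresses below are hypothetical, not part of the original script. Note that this sketch numbers every address, whereas the loop above names files after the video tag counter.

```python
def plan_downloads(video_list, folder='video'):
    """Pair each full video URL with the file path it would be saved to."""
    plan = []
    for num, video_url in enumerate(video_list, start=1):
        full_url = 'https://' + video_url          # add the scheme the regex dropped
        path = folder + '/' + str(num) + '.mp4'    # 1.mp4, 2.mp4, ...
        plan.append((full_url, path))
    return plan

# Invented addresses, in the form the regex returns them.
for full_url, path in plan_downloads(['example.com/a.mp4', 'example.com/b.mp4']):
    print(full_url, '->', path)
```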

And that's it. It's pretty simple. Just create a folder named video in your current directory before running the script.
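
If you'd rather not create the folder by hand, the standard library can do it from the script. This is a minimal sketch, assuming the folder is called video as in the open() call above; the helper name ensure_folder is hypothetical, and the demo works inside a temporary directory so it doesn't touch your working directory.

```python
import os
import tempfile

def ensure_folder(path):
    """Create the folder if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Demo inside a throwaway directory; in the scraper you would call ensure_folder('video').
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_folder(os.path.join(tmp, 'video'))
    ensure_folder(target)                  # calling it again is harmless
    created = os.path.isdir(target)
    print(created)
```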

Full code

from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.qiushibaike.com/video/'
headers = {'User-Agent': 'Mozilla/5.0'}                 # placeholder; fill in your own
res = requests.get(url=url, headers=headers).text       # send a request to the target site
html = BeautifulSoup(res, 'lxml')                       # parse the HTML
vid = re.compile(r'<source src="//(.*?)" type="video/mp4"/>')   # captures the video address
num = 0                                                 # counter, used as the file name
for item in html.find_all('video', controls="controls"):
    item = str(item)                                    # serialize the tag so the regex can search it
    video_list = re.findall(vid, item)
    num = num + 1
    for video_url in video_list:
        video_urls = 'https://' + str(video_url)
        response = requests.get(url=video_urls, headers=headers).content    # response body as bytes
        with open('video/' + str(num) + '.mp4', 'wb') as file:    # the video folder must already exist
            file.write(response)