Python Web Scraper: DY Tiantang (电影天堂)

Latest recommended article published on 2024-07-08 00:01:13

麦合学长~

Tags: python, web crawler, programming language

Article link: https://blog.csdn.net/maihexuezhang/article/details/135140238

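This article shows how to use Python's requests library and regular expressions (re) to scrape the yearly "must-watch hits" list from the 电影天堂 site (the code below matches the heading 2023必看热片), extract the href value of each a tag, and then fetch each film's title and download link from its detail page.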

Summary generated by CSDN using AI.
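
The two regular-expression steps in the summary can be tried offline before sending any request. The following is only a minimal sketch: `sample_html`, the hrefs, the titles, and the helper `extract_hrefs` are made up for illustration and are not copied from the real site, whose markup may differ.

```python
import re

# Hypothetical fragment imitating the home-page structure; not real site data.
sample_html = """
<div>2023必看热片</div>
<ul>
<li><a href='/i/100001.html' title="Film A">Film A</a></li>
<li><a href='/i/100002.html' title="Film B">Film B</a></li>
</ul>
<ul>
<li><a href='/i/999999.html' title="Other">Other</a></li>
</ul>
"""

# Step 1: lazily capture the first <ul> block that follows the section heading.
section_pattern = re.compile(r"2023必看热片.*?<ul>(?P<html>.*?)</ul>", re.S)
# Step 2: pull the href out of every list item inside that block.
link_pattern = re.compile(r"<li><a href='(?P<href>.*?)' title")


def extract_hrefs(page_text):
    """Return the hrefs found in the section, or [] if the section is missing."""
    section = section_pattern.search(page_text)
    if section is None:
        return []
    return [m.group("href") for m in link_pattern.finditer(section.group("html"))]


hrefs = extract_hrefs(sample_html)
print(hrefs)
```

Because `.*?` is lazy, the capture stops at the first `</ul>`, so links from later lists on the page are not picked up. Returning an empty list when the heading is absent avoids the `AttributeError` that calling `.group()` on `None` would raise.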

import requests
import re
f = open("电影天堂.csv", mode="w", encoding='utf-8')

url = "https://www.dy2018.com/"
resp = requests.get(url)
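resp.encoding = "gbk"   # The page source declares charset GBK (Chinese sites are usually either GBK or UTF-8)
# print(resp.text)   # Uncomment to check that the page was fetched and decoded correctly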


# 1. Extract the HTML of the "2023必看热片" (2023 must-watch hits) section
obj1 = re.compile(r"2023必看热片.*?<ul>(?P<html>.*?)</ul>", re.S)
result1 = obj1.search(resp.text)
html = result1.group("html")

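# 2. Extract the href value of each <a> tag
obj2 = re.compile(r"<li><a href='(?P<href>.*?)' title")

result2 = obj2.finditer(html)   # Detail: search only within the html extracted in step 1
# Detail: id and class attributes are good anchors for cutting out the part you need
obj3 = re.compile(r'<div id="Zoom">.*?◎片　　名(?P<movie>.*?)<br />.*?<td style="WORD-WRAP: break-word"'
                  r' bgcolor="#fdfddf"><a href="(?P<download>.*?)">', re.S)   # Compile the detail-page pattern once, before the loop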
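for item in result2:
    # Detail: drop the trailing "/" from the base URL, because href already starts with one
    child_url = url.rstrip("/") + item.group("href")
    child_resp = requests.get(child_url)
    child_resp.encoding = 'gbk'

    # Each detail page contains a single match, so search() is enough
    result3 = obj3.search(child_resp.text)
    child_resp.close()
    if result3 is None:   # Skip pages whose layout does not match the pattern
        continue
    movie = result3.group("movie")
    download = result3.group("download")
    print(movie, download)
    f.write(f"{movie},{download}\n")
f.close()  # Close the file only after the for loop has finished
resp.close()    # Close the home-page response only after the for loop has finished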
print("Finished scraping 电影天堂 data.")
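
A film title may itself contain a comma, which would split a hand-built `f"{movie},{download}\n"` row into too many columns. One possible alternative, sketched here with the standard `csv` module, is to let the writer handle quoting. The helper `save_rows`, the header names, and the sample rows are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def save_rows(rows, fileobj):
    """Write (movie, download) pairs as CSV, quoting fields when needed."""
    writer = csv.writer(fileobj)
    writer.writerow(["movie", "download"])
    writer.writerows(rows)


# Hypothetical sample data; the second title contains a comma.
rows = [
    ("Film A", "magnet:?xt=urn:btih:aaa"),
    ("Film B, Part 2", "magnet:?xt=urn:btih:bbb"),
]

# Stands in for open("电影天堂.csv", "w", encoding="utf-8", newline="")
buffer = io.StringIO()
save_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, comma included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""` so that line endings are not translated a second time.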