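Analyzing how the pages relate
- Each page shows 25 entries
- The actual URL is https://movie.douban.com/top250?start=(x), where x is the rank of the first movie on the page minus 1
- There are 10 pages in total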
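The paging rule above can be sketched in a few lines. This is only an illustration: `BASE_URL`, `PAGE_SIZE`, `PAGE_COUNT`, and `page_urls` are names chosen here, not names from the original program.

```python
BASE_URL = "https://movie.douban.com/top250?start={}"
PAGE_SIZE = 25   # entries shown on each page
PAGE_COUNT = 10  # total number of pages


def page_urls():
    """Build one URL per page; start is the first movie's rank minus 1."""
    return [BASE_URL.format(page * PAGE_SIZE) for page in range(PAGE_COUNT)]


urls = page_urls()
print(urls[0])   # first page, start=0
print(urls[-1])  # last page, start=225
```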
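Parsing the required content
The fields are extracted with regular expressions (Python's re module). One entry in the page source looks like this: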
<li>
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span property="v:best" content="10.0"></span>
<span>1457997人评价</span>
</div>
<p class="quote">
<span class="inq">希望让人自由。</span>
</p>
</div>
</div>
</div>
</li>
The pattern that extracts the fields from it:
<div class="item">.*?href="(.*?)">.*?alt="(.*?)" src="(.*?)".*?<span class="rating_num" property="v:average">(.*?)</span>.*?<span>(.*?)</span>
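To check the pattern without sending a request, it can be run against a shortened copy of the entry shown above. This is a minimal sketch: `SAMPLE_HTML` and `ITEM_PATTERN` are names introduced here, and the sample is trimmed to the tags the pattern touches. `re.S` is needed so that `.` also matches the newlines between tags.

```python
import re

# Shortened copy of one entry from the page source (assumption: trimmed by hand).
SAMPLE_HTML = '''
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span property="v:best" content="10.0"></span>
<span>1457997人评价</span>
</div>
</div>
'''

ITEM_PATTERN = re.compile(
    '<div class="item">.*?href="(.*?)">.*?alt="(.*?)" src="(.*?)"'
    '.*?<span class="rating_num" property="v:average">(.*?)</span>'
    '.*?<span>(.*?)</span>',
    re.S,  # let "." match newlines so the pattern can span several lines
)

matches = ITEM_PATTERN.findall(SAMPLE_HTML)
detail_url, title, poster_url, rating, votes = matches[0]
print(detail_url, title, poster_url, rating, votes)
```

The groups come back in the order detail page, title, poster, rating, and number of ratings. The `<span property="v:best" ...>` tag is skipped because the pattern requires a bare `<span>`.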
Experimental program
import re
import requests
'''
URL: https://movie.douban.com/top250?start=
Captured groups, in order:
movie detail page
movie title
movie poster
movie rating
number of ratings
<div class="item">.*?href="(.*?)">.*?alt="(.*?)" src="(.*?)".*?<span class="rating_num" property="v:average">(.*?)</span>.*?<span>(.*?)</span>
'''
'''
Request URL: https://movie.douban.com/
Request Method: GET
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
Cookie: bid=DsWESUOKNDQ; ap_v=0,6.0; _pk_ses.100001.4cf6=*; __yadk_uid=3wbCqfLdzEsPp0OKPP21bNLQS6ez7odQ; _pk_id.100001.4cf6=a9e1240037f50320.1561111925.1.1561112670.1561111925.
'''
url = 'https://movie.douban.com/top250?start={}'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
# requests expects cookies as name/value pairs, so split the raw Cookie header string.
raw_cookie = 'bid=DsWESUOKNDQ; ap_v=0,6.0; _pk_ses.100001.4cf6=*; __yadk_uid=3wbCqfLdzEsPp0OKPP21bNLQS6ez7odQ; _pk_id.100001.4cf6=a9e1240037f50320.1561111925.1.1561112670.1561111925.'
cookies = dict(pair.split('=', 1) for pair in raw_cookie.split('; '))
movie_list = []
# start takes the values 0, 25, ..., 225: one request per page.
for start in range(0, 250, 25):
    response = requests.get(url.format(start), headers=headers, cookies=cookies)
    # re.S lets "." match newlines, so one pattern can span a whole entry.
    movie_list.extend(re.findall(
        '<div class="item">.*?href="(.*?)">.*?alt="(.*?)" src="(.*?)" class="">.*?<span class="rating_num" property="v:average">(.*?)</span>.*?<span>(.*?)</span>',
        response.text, re.S))
# Write the results once, after all pages have been fetched.
with open('movie_list', 'w', encoding='utf-8') as f:
    save_format = 'Title: {1} Detail page: {0} Poster: {2} Rating: {3} Number of ratings: {4}\n'
    for line in movie_list:
        f.write(save_format.format(*line))
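
The last captured group is raw text such as 1457997人评价, not a number. If a numeric count is wanted, one possible post-processing step is sketched below. It is not part of the original program, and `parse_vote_count` is a hypothetical helper name.

```python
import re


def parse_vote_count(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


count = parse_vote_count("1457997人评价")
print(count)
```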