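I. Requirements analysis
Use the requests library and regular expressions to scrape the title, release date, rating, poster image, and other details of each film on the Maoyan Movies TOP100 board.
Project analysis:
1. Identify the target URL
The Maoyan Movies TOP100 board (https://maoyan.com/board/4)
2. Scrape: requests is the data-collection library, and regular expressions (the re module) do the parsing
3. Store: write the results to a file in JSON format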
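The storage step (item 3) can be sketched as follows, as a minimal example rather than the article's final code. The helper names save_as_json and load_json, the file name, and the record fields are all assumptions chosen for illustration.

```python
import codecs
import json
import os
import tempfile


def save_as_json(records, filename):
    """Write a list of dicts to `filename` as UTF-8 encoded JSON."""
    with codecs.open(filename, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese titles readable instead of \uXXXX escapes
        json.dump(records, f, ensure_ascii=False, indent=4)


def load_json(filename):
    """Read the JSON file back into Python objects."""
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)


# Hypothetical sample record; the field names are illustrative only
sample = [{'index': '1', 'title': '我和我的祖国', 'releasetime': '2019-09-30'}]
path = os.path.join(tempfile.gettempdir(), 'maoyan_sample.json')
save_as_json(sample, path)
restored = load_json(path)
```

Reading the file back and comparing it with the original list is a quick way to confirm that non-ASCII titles survive the round trip.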
II. Step-by-step scraping
1. Import the required packages
import codecs
import json
import re
import time
import requests
from colorama import Fore
from fake_useragent import UserAgent
from requests import HTTPError
2. Fetch the page
def download_page(url, params=None):
    """
    Download the HTML page at the given URL.
    :param url: address of the page to fetch
    :param params: optional query-string parameters
    :return: str, or None if the request fails
    """
    try:
        ua = UserAgent()
        headers = {
            'User-Agent': ua.random,
            'Host': 'maoyan.com',
            'Cookie': '__mta=216365737.1586537270106.1586587968494.1586587974745.28; uuid_n_v=v1; uuid=FDFCF0307B4A11EA81A9FDE94F2697E46AE0BB693A874A189C91F0B6F3298114; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1586537263,1586587448; mojo-uuid=290ecc97cc9252d9e67fddc2fdfba5ff; _lxsdk_cuid=17164fd7ca334-0cfe278836c8a7-4c302f7e-e1000-17164fd7cabc8; _lxsdk=FDFCF0307B4A11EA81A9FDE94F2697E46AE0BB693A874A189C91F0B6F3298114; __mta=216365737.1586537270106.1586575636717.1586587450776.15; _csrf=475c87b2782054e49dc109b59dc40641530b6b61ce9e007ba437dd7fc375b344; _lxsdk_s=17167f7d0b2-36e-60f-f57%7C%7C53; mojo-trace-id=46; mojo-session-id={"id":"fc93bad8aa704378a2e0ae28dfb31398","time":1586587226413}; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1586587975; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic'
        }
        # Requests over HTTPS may fail with an SSLError;
        # passing verify=False skips certificate verification.
        response = requests.get(url, params=params, headers=headers)
        # requests.get() does not raise on 4xx/5xx by itself;
        # raise_for_status() turns those responses into an HTTPError.
        response.raise_for_status()
    except HTTPError as e:
        print(Fore.RED + '[-] Failed to fetch %s: %s' % (url, str(e)))
        return None
    else:
        # content returns bytes; text returns a str
        return response.text


if __name__ == '__main__':
    url = 'https://maoyan.com/board/4'
    html = download_page(url)
    print(html)
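In headers we set the 'Host' and 'Cookie' fields. response.text returns the page as a string, and running the script prints it.
Inspecting part of the output shows that all the information we need is present in the returned HTML string.
3. Parse the HTML with regular expressions
(1) Define the regular expression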
# Each entry starts with <dd>; () marks a capture group; .*? lazily matches any characters.
# re.S lets . match newlines as well, so one pattern can span a whole <dd> block.
pattern = re.compile(
    r'<dd>'
    + r'.*?<i class="board-index.*?">(\d+)</i>'  # rank: <i class="board-index board-index-1">1</i>
    + r'.*?<img data-src="(.*?)" alt="(.*?)"'  # image URL and title: <img data-src="xxxxx" alt="我和我的祖国"
    + r'.*?<p class="star">(.*?)</p>'  # leading actors: <p class="star">主演:黄渤,张译,韩昊霖</p>
    + r'.*?<p class="releasetime">(.*?)</p>'  # release date: <p class="releasetime">上映时间:2019-09-30</p>
    + r'.*?</dd>',
    re.S
)
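To see what the pattern captures, here is a minimal sketch that runs it against a hand-written HTML fragment. The fragment is an assumption modelled on the tags quoted in the comments above, not a copy of the real page, and the function name parse_html and the dictionary keys are hypothetical.

```python
import re

pattern = re.compile(
    r'<dd>'
    r'.*?<i class="board-index.*?">(\d+)</i>'
    r'.*?<img data-src="(.*?)" alt="(.*?)"'
    r'.*?<p class="star">(.*?)</p>'
    r'.*?<p class="releasetime">(.*?)</p>'
    r'.*?</dd>',
    re.S
)

# Hypothetical fragment imitating two entries of the board
sample_html = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="#"><img data-src="https://example.com/poster1.jpg" alt="我和我的祖国" /></a>
  <p class="star">
      主演:黄渤,张译,韩昊霖
  </p>
  <p class="releasetime">上映时间:2019-09-30</p>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
  <a href="#"><img data-src="https://example.com/poster2.jpg" alt="Sample Movie" /></a>
  <p class="star">Actor A,Actor B</p>
  <p class="releasetime">2020-01-01</p>
</dd>
'''


def parse_html(html):
    """Return one dict per <dd> entry matched by the pattern."""
    movies = []
    # findall yields one tuple per match, in the order of the capture groups
    for index, image, title, star, releasetime in pattern.findall(html):
        movies.append({
            'index': index,
            'image': image,
            'title': title,
            'star': star.strip(),  # strip the whitespace around the text
            'releasetime': releasetime.strip(),
        })
    return movies


movies = parse_html(sample_html)
```

Because every group uses the lazy .*?, each match stops at the first closing </dd>, so entries do not bleed into one another.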
(2) The parsing code
def download_page(url, params=None):
    # Insert the page-fetching code from step 2 here
    pass
def