Today I'll walk through how to scrape detailed data on upcoming movies from Maoyan.
This is the page we are scraping today.
We need three modules:
import parsel
import requests as r
import xlwt
parsel is a tool that was split out of scrapy. It can query a page with XPath, regular expressions, or CSS selectors.
xlwt is a library for writing data to Excel files (the legacy .xls format).
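As a quick illustration of how xlwt is typically used, the sketch below writes a header row and two data rows to a workbook in the system temp directory. The sheet name, column names, sample rows, and output file name are all placeholders, not values from the scraper.

```python
import os
import tempfile

import xlwt

# Hypothetical sample rows; real values would come from the scraper.
sample_rows = [
    ("Film A", "2020-02-14"),
    ("Film B", "2020-03-01"),
]

workbook = xlwt.Workbook(encoding="utf-8")
sheet = workbook.add_sheet("films")

# Header row at index 0; write(row, column, value) is zero-based.
sheet.write(0, 0, "title")
sheet.write(0, 1, "release_date")

# Data rows start at index 1.
for row_index, (title, release_date) in enumerate(sample_rows, start=1):
    sheet.write(row_index, 0, title)
    sheet.write(row_index, 1, release_date)

# xlwt produces .xls files, so the file name uses that extension.
output_path = os.path.join(tempfile.gettempdir(), "films_demo.xls")
workbook.save(output_path)
```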
Now for the code:
import parsel
import requests as r
import xlwt
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"
}
url = 'https://maoyan.com/films?showType=2&offset='
page = [0, 30, 60, 90]  # offsets of the 4 listing pages
rows = 2
def movie_url():
    """Collect the detail-page URL of every film on the listing pages."""
    film = []
    for i in page:  # fetch each listing page in turn and collect its film links
        response = r.get(f'{url}{i}', headers=headers)
        data_order = parsel.Selector(response.text)
        film_order = data_order.xpath('//div[@class="channel-detail movie-item-title"]/a/@href').extract()
        film_url = ['https://maoyan.com' + a for a in film_order]  # build the full URL of each film's page
        film.extend(film_url)
    return film
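
The two pieces of plain bookkeeping in movie_url() (the page offsets and the URL completion) can be checked without touching the network. The sketch below is an alternative formulation using only the standard library, not the code from this post: `range` generates the same four offsets, and `urllib.parse.urljoin` resolves a relative href against the site root. The constant names and sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://maoyan.com"
LIST_URL = "https://maoyan.com/films?showType=2&offset="

# Same values as the hand-written list [0, 30, 60, 90]: 4 pages, step of 30.
offsets = list(range(0, 120, 30))
list_urls = [f"{LIST_URL}{offset}" for offset in offsets]

# Hypothetical hrefs of the kind the XPath query returns.
sample_hrefs = ["/films/1001", "/films/1002"]

# urljoin resolves the relative path against the site root,
# so no manual string concatenation is needed.
film_urls = [urljoin(BASE_URL, href) for href in sample_hrefs]

print(list_urls)
print(film_urls)
```

Generating the offsets with `range` makes it easier to change the number of pages later, though the explicit list works just as well for four pages.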