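The source code for scraping Maoyan movies and saving the data to Excel is as follows:
Note: this code does not use selenium, so whenever the slider verification appears you must first open
https://maoyan.com/board/4? in a browser and complete the verification manually.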
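The manual check described in this note can also be detected in code before any parsing happens. Below is a minimal sketch, assuming the verification page can be recognized by the text "猫眼验证中心" that shows up in the printed response. The constant `VERIFICATION_MARKER`, the helper `looks_like_verification_page`, and both sample page bodies are hypothetical and exist only for illustration.

```python
# Marker text reported when the slider check is triggered.
# Treating its presence as the signal is an assumption, not a documented API.
VERIFICATION_MARKER = "猫眼验证中心"


def looks_like_verification_page(html: str) -> bool:
    """Return True if the HTML appears to be the verification page."""
    return VERIFICATION_MARKER in html


# Hypothetical page bodies, used only to exercise the helper offline.
blocked = "<html><head><title>猫眼验证中心</title></head><body></body></html>"
normal = "<html><head><title>TOP100</title></head><body><dl></dl></body></html>"

print(looks_like_verification_page(blocked))  # True
print(looks_like_verification_page(normal))   # False
```

In a scraper, a check like this could be applied to `response.text` so the script stops with a reminder to pass the slider check in a browser instead of writing an empty spreadsheet.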
import requests
from lxml import etree
import pandas as pd

rows = []
# Note: Maoyan sometimes requires a slider verification. When it does, print shows the
# "猫眼验证中心" (Maoyan verification center) page instead of the ranking, so open the
# URL in a browser and pass the slider check first.
base_url = 'https://maoyan.com/board/4?offset={}'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}
# Column headers: rank, title, starring, release time
columns = ['排名', '片名', '主演', '时间']
for i in range(50):
    # Each request shifts the offset by one, and dd[1] below reads the first entry on that page
    url = base_url.format(str(i))
    response = requests.get(url, headers=headers)
    print(response.text)
    html = response.text
    xp = etree.HTML(html)
    # print(xp)
    lis = xp.xpath('//*[@id="app"]/div/div/div[1]')
    # print(lis)
    for li in lis:
        # Without the trailing /text(), each of these returns [<Element x at ...>]
        paiming = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/i/text()')
        pianming = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[1]/a/text()')
        zhuyan = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[2]/text()')[0].strip().replace("\xa0\xa0\xa0", "\t").split("\t")
        shijian = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[3]/text()')
        # print(paiming)
        # print(pianming)
        # print(zhuyan)
        # Each result is a list; keep its first item so every cell holds plain text.
        # list.append returns None, so its result is not assigned to anything.
        rows.append([paiming[0], pianming[0], zhuyan[0], shijian[0]])
# A DataFrame is a two-dimensional table; columns supplies the header row
d = pd.DataFrame(rows, columns=columns)
# index=False leaves the index column out of the output file
d.to_excel("猫眼电影.xlsx", index=False)
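Inside the `for li in lis` loop above, every expression starts with `//`, which searches the whole document again, and is pinned to `dd[1]`, so only the first entry of each page is read. The following is a minimal sketch of one alternative: loop over every `dd` and use relative paths that start with `./`. The sample markup is hypothetical and only mimics the nesting implied by the XPath expressions above; the helper names `first_text` and `parse_rows` and the placeholder movie data are assumptions for illustration, not part of the original script.

```python
from lxml import etree

# Hypothetical, simplified markup that mimics only the nesting used by the
# XPath expressions above (dl > dd > i and dl > dd > div > div > div > p).
SAMPLE_HTML = """
<div id="app"><dl>
  <dd><i>1</i><div><div><div>
    <p><a>Movie A</a></p><p>  Starring: Actor 1,Actor 2  </p><p>Release: 2000-01-01</p>
  </div></div></div></dd>
  <dd><i>2</i><div><div><div>
    <p><a>Movie B</a></p><p>  Starring: Actor 3  </p><p>Release: 2001-02-03</p>
  </div></div></div></dd>
</dl></div>
"""


def first_text(node, path):
    """Return the first result of a relative XPath, stripped, or '' if none."""
    found = node.xpath(path)
    return found[0].strip() if found else ""


def parse_rows(html):
    """Collect one [rank, title, starring, release] row per dd element."""
    tree = etree.HTML(html)
    rows = []
    for dd in tree.xpath('//*[@id="app"]//dl/dd'):
        rows.append([
            first_text(dd, "./i/text()"),
            first_text(dd, "./div/div/div[1]/p[1]/a/text()"),
            first_text(dd, "./div/div/div[1]/p[2]/text()"),
            first_text(dd, "./div/div/div[1]/p[3]/text()"),
        ])
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because each path is evaluated relative to its own `dd`, one parsed page yields one row per entry, and a missing field becomes an empty string rather than an `IndexError`.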
As shown in the figure above, the XPath expressions I originally copied were:
paiming = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/i')
pianming = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[1]/a')
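With these, print(paiming) outputs [<Element i at 0x18a666abf08>], a list of element objects rather than text.
However, appending /text() to the end of each XPath makes the printed result the text itself: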
paiming = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/i/text()')
pianming = li.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[1]/a/text()')
The result is:
['2']
['我不是药神']
['主演:徐峥,周一围,王传君']
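The difference between the two forms can be reproduced offline. Here is a minimal sketch, assuming a one-line hypothetical fragment instead of the real page: an XPath ending in an element name returns a list of element objects, one ending in `/text()` returns a list of strings, and the `.text` attribute of an lxml element is another way to reach the same string.

```python
from lxml import etree

# Hypothetical one-line fragment, used only to show the two return types.
tree = etree.HTML("<dl><dd><i>2</i></dd></dl>")

elements = tree.xpath("//dd[1]/i")       # list of Element objects
texts = tree.xpath("//dd[1]/i/text()")   # list of strings

print(elements)          # a list like [<Element i at 0x...>]
print(texts)             # ['2']
print(elements[0].text)  # 2
```

Either form returns a list, so a single value still has to be taken with an index such as `[0]` before it is written to a spreadsheet cell.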