Scraping a Movie Top 100 with Python and XPath / Scraping Douban's Top 250 Movies with Python

Scraping Douban's movie Top 250 is slightly more involved than Maoyan's Top 100. The main tools used here are the BeautifulSoup HTML-parsing library and regular expressions. In my view, for scraping static pages, XPath queries and regular expressions are the most powerful weapons.
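To show what the XPath route looks like, here is a minimal sketch (separate from the main script below) that pulls titles and ratings from the first Top 250 page with lxml. The class names in the XPath expressions ("item", "title", "rating_num") are assumptions about the current page layout, and a browser-like User-Agent is sent because Douban may reject the default one.

# XPath sketch: titles and ratings from one page (class names are assumptions)
import requests
from lxml import etree

page = requests.get("https://movie.douban.com/top250",
                    headers={"User-Agent": "Mozilla/5.0"})
tree = etree.HTML(page.text)
titles = tree.xpath('//div[@class="item"]//span[@class="title"][1]/text()')
ratings = tree.xpath('//span[@class="rating_num"]/text()')
print(list(zip(titles, ratings)))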

Also, when Chinese text comes out garbled in Python, you may need to consider encoding with encode("UTF-8") and decoding with decode("GBK") where appropriate.
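As a rough illustration of how that plays out with requests (a sketch, not part of the script below): requests sometimes guesses the wrong charset, so re-decoding the response, or decoding the raw bytes yourself with the matching codec, usually clears up the garbled output.

# Sketch: two common ways to fix garbled Chinese text from a response
import requests

resp = requests.get("https://movie.douban.com/top250",
                    headers={"User-Agent": "Mozilla/5.0"})
resp.encoding = resp.apparent_encoding  # let requests re-guess the charset
text = resp.text                        # decoded with the corrected encoding

raw = "中文".encode("GBK")              # bytes in GBK
fixed = raw.decode("GBK")               # decode with the same codec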

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time

film_url = "https://movie.douban.com/top250"
url_set = ["https://movie.douban.com/top250"]   # first page
url_setx = ["https://movie.douban.com/top250"]  # kept for testing
# the remaining nine pages are paged with ?start=25, 50, ... 225
for i in range(25, 250, 25):
    url_set.append(film_url + "?start=" + str(i) + "&filter=")
print(url_set)

name = []      # film name
director = []  # director
star = []      # leading actors
date = []      # film year
score = []     # film score

# Douban may reject requests without a browser-like User-Agent
headers = {"User-Agent": "Mozilla/5.0"}

for url in url_set:
    html = requests.get(url, headers=headers).content
    x = BeautifulSoup(html, "html.parser")
    # titles come from the poster <img> tags' alt attribute
    y = x.find_all(name="img", attrs={"class": "", "src": re.compile(".*jpg$")})
    for i in y:
        name.append(i.attrs["alt"])
    # director, cast and year live in the <p class=""> description blocks
    y1 = x.find_all(name="p", attrs={"class": ""})
    for i in y1:
        n = re.search(pattern="导演: (.*)主(.*)", string=i.text)
        if n is not None and n.group(1) is not None:
            director.append(n.group(1))
        else:
            director.append(None)
        if n is not None and n.group(2) is not None:
            # group(2) still starts with "演: ", strip it off
            tmp = re.sub(string=n.group(2), pattern="演: ", repl="")
            star.append(tmp)
        else:
            star.append(None)
        # the first four-digit number in the block is the release year
        m = re.search(pattern="[0-9]{4}", string=i.text)
        if m is not None:
            date.append(m.group(0))
        else:
            date.append(None)
    # ratings sit in <span class="rating_num" property="v:average">
    y2 = x.find_all(name="span", attrs={"class": "rating_num", "property": "v:average"})
    for i in y2:
        if i is not None:
            score.append(float(i.string))
        else:
            score.append(None)
    time.sleep(2)  # pause between pages to avoid hammering the site

# combine the columns into a DataFrame
data = {"name": name, "director": director, "star": star, "date": date, "score": score}
x = pd.DataFrame(data)
print(x)
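Once the table prints correctly, it is usually handier to persist it than to keep re-scraping; a one-line follow-up (the file name is my own choice):

x.to_csv("douban_top250.csv", index=False, encoding="utf-8-sig")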
