Web Scraping in Practice with Requests
Project goal: collect the titles of all 250 movies in the Douban Top 250.
Project URL: http://movie.douban.com/top250
1. Build the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
           "Host": "movie.douban.com"}
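As a quick offline check that the headers are attached correctly, we can prepare a request locally without sending anything over the network. The use of requests.Request(...).prepare() here is just an illustration, not part of the project code:

```python
import requests

# Build the same headers as above and prepare (but do not send) a request,
# so we can inspect what would actually go out on the wire.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
    "Host": "movie.douban.com",
}
req = requests.Request("GET", "http://movie.douban.com/top250", headers=headers).prepare()
print(req.headers["User-Agent"])
print(req.headers["Host"])
```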
A single request only returns the first page, which lists 25 movies. To get all 250 movies, we need to fetch 10 pages in total.
Clicking through to the second page shows that the address changes to: http://movie.douban.com/top250?start=25
The third page's address is: http://movie.douban.com/top250?start=50
From this we can see that the start parameter advances by 25 per page, so a for loop can walk through every page.
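The pagination pattern above (start = 25 × page index) can be sketched as a quick check before writing the full loop:

```python
# Generate the 10 page URLs: each page lists 25 movies,
# so the offset steps through 0, 25, 50, ..., 225.
base = "http://movie.douban.com/top250?start="
urls = [base + str(25 * i) for i in range(10)]
print(urls[0])   # first page
print(urls[1])   # second page
print(urls[-1])  # last page
```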
import requests

def get_movies():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
               "Host": "movie.douban.com"}
    for i in range(0, 10):
        # The offset advances by 25 per page: 0, 25, 50, ..., 225
        link = "http://movie.douban.com/top250?start=" + str(25 * i)
        r = requests.get(link, headers=headers, timeout=3)
        print(i, "response status code:", r.status_code)
        print(r.text)

get_movies()
The result is only the raw HTML of each page; we still need to extract the movie titles from it.
import requests
from bs4 import BeautifulSoup

movie_list = []

def get_movies():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
               "Host": "movie.douban.com"}
    for i in range(0, 10):
        link = "http://movie.douban.com/top250?start=" + str(25 * i)
        r = requests.get(link, headers=headers, timeout=3)
        soup = BeautifulSoup(r.text, "lxml")  # parse r.text into a BeautifulSoup object
        div_list = soup.find_all("div", class_="hd")  # combined query: every <div class="hd">
        for each in div_list:
            movie = each.a.span.text.strip()
            movie_list.append(movie)
    return movie_list  # return after all 10 pages have been scraped

movies = get_movies()
print(movies)
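The extraction step is easiest to verify offline, without hitting the site. The sketch below runs the same BeautifulSoup logic on a hand-written HTML snippet that imitates Douban's list markup (the snippet is an assumption for illustration, not the real page; html.parser is used here to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating two entries of the Top 250 list: the real page
# wraps each title in <div class="hd"><a ...><span class="title">...</span></a></div>.
html = """
<div class="hd"><a href="#"><span class="title">肖申克的救赎</span></a></div>
<div class="hd"><a href="#"><span class="title">霸王别姬</span></a></div>
"""
soup = BeautifulSoup(html, "html.parser")
# Same extraction as in get_movies(): find every <div class="hd">,
# then take the text of the first <span> inside its <a>.
titles = [div.a.span.text.strip() for div in soup.find_all("div", class_="hd")]
print(titles)
```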