Below is a small crawler project that scrapes the Douban Top 250 movie site.
1. Setting up the development environment:
(1). PyCharm
(2). Two packages are needed: requests and bs4
(3). On Windows, open a command prompt (cmd) and run `pip install requests` and `pip install bs4`. The code below also parses pages with the `lxml` parser, so run `pip install lxml` as well.
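To confirm the installation succeeded, a quick import check like the following can be run (a minimal sketch; if either import raises `ModuleNotFoundError`, the corresponding `pip install` did not work):

```python
# Sanity check: both packages must be importable before the crawler can run.
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)
```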
2. How the project works:
This is a Python-based crawler that scrapes the movie details on Douban's Top 250 site. The whole job is split across three functions: one collects the detail-page URLs from each list page, one parses a single detail page, and one drives the loop over all ten list pages.
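The core of all three functions is BeautifulSoup's `find`/`find_all`. As a minimal offline sketch (the sample HTML below is invented to mimic the structure of the Top 250 list page; the real project fetches it over the network), this is how the detail-page links get pulled out:

```python
from bs4 import BeautifulSoup

# Invented sample mimicking the <ol class="grid_view"> list on a Top 250 page.
sample_html = """
<ol class="grid_view">
  <li><a href="https://movie.douban.com/subject/1292052/">肖申克的救赎</a></li>
  <li><a href="https://movie.douban.com/subject/1291546/">霸王别姬</a></li>
</ol>
"""

# html.parser is used here to avoid the lxml dependency in this standalone demo.
soup = BeautifulSoup(sample_html, "html.parser")
links = [li.find("a")["href"]
         for li in soup.find("ol", class_="grid_view").find_all("li")]
print(links)
# → ['https://movie.douban.com/subject/1292052/', 'https://movie.douban.com/subject/1291546/']
```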
3. Project code:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

def get_detail_urls(url):
    """Collect the detail-page URL of every movie on one list page."""
    detail_urls = []
    resp = requests.get(url, headers=headers)
    html = resp.text
    soup = BeautifulSoup(html, 'lxml')
    # Renamed from `list` to avoid shadowing the built-in list().
    li_list = soup.find('ol', class_='grid_view').find_all('li')
    for li in li_list:
        detail_url = li.find('a')['href']
        detail_urls.append(detail_url)
    return detail_urls

def parse_detail_url(url, f):
    """Parse one detail page and write title, director, screenwriter, actors and score to f."""
    resp = requests.get(url, headers=headers)
    html = resp.text
    soup = BeautifulSoup(html, 'lxml')
    name = ''.join(soup.find('div', id='content').find('h1').stripped_strings)
    director = list(soup.find('div', id='info').find('span').find('span', class_='attrs').stripped_strings)
    screenwriter = list(soup.find('div', id='info').find_all('span')[3].find('span', class_='attrs').stripped_strings)
    actor = list(soup.find('span', class_='actor').find('span', class_='attrs').stripped_strings)
    score = soup.find('strong', class_='ll rating_num').string
    print(score)
    f.write('{},{},{},{},{}\n'.format(name, ''.join(director), ''.join(screenwriter), ''.join(actor), score))

def paqu():
    """Walk the ten list pages (start = 0, 25, ..., 225) and scrape every movie."""
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    with open('TOP250.csv', 'a', encoding='utf-8') as f:
        for x in range(0, 250, 25):
            url = base_url.format(x)
            detail_urls = get_detail_urls(url)
            for detail_url in detail_urls:
                parse_detail_url(detail_url, f)

if __name__ == '__main__':
    paqu()
```
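One caveat with building CSV rows via `str.format`: a title or name that itself contains a comma will shift the columns. A sketch of a safer alternative using the standard `csv` module, which quotes such fields automatically (the sample row here is made up for illustration):

```python
import csv
import io

# csv.writer wraps any field containing a comma in quotes,
# which plain '{},{},...'.format(...) string writing does not.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['The Good, the Bad and the Ugly', 'Sergio Leone', '9.2'])
print(buf.getvalue().strip())
# → "The Good, the Bad and the Ugly",Sergio Leone,9.2
```

In the project above, this would mean passing `csv.writer(f)` into `parse_detail_url` and calling `writerow` instead of `f.write`.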
4. Screenshot of the run:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200503170924253.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzMwNDY1Ng==,size_16,color_FFFFFF,t_70)