2021-01-05

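I. Main features
1. Use requests and lxml to scrape the full record for each film in the Douban Movie Top 250 list: title, director, cast, rating, and so on.
2. Use XPath to locate each data item.
3. Extract the data and write it to a CSV file.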
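
As a minimal sketch of steps 1 and 2, the snippet below runs the same kind of XPath queries against a small hand-written HTML fragment. The fragment, the `SAMPLE` name, and the `extract` helper are assumptions made for illustration; the markup only imitates the structure the script expects and is not copied from Douban.

```python
from lxml import etree

# Hypothetical fragment imitating one list entry; not real Douban markup.
SAMPLE = """
<ol class="grid_view">
  <li>
    <div class="hd"><a href="#"><span class="title">Example Movie</span></a></div>
    <div class="bd">
      <p>Director: Someone</p>
      <div class="star">
        <span class="rating"></span>
        <span class="rating_num">9.0</span>
        <span></span>
        <span>1000 ratings</span>
      </div>
      <p class="quote"><span>A one-line blurb.</span></p>
    </div>
  </li>
</ol>
"""

def extract(html_text):
    """Return one dict per <li> under ol.grid_view."""
    tree = etree.HTML(html_text)
    rows = []
    for li in tree.xpath("//ol[@class='grid_view']/li"):
        # xpath() always returns a list, so index it to get a single value.
        title = li.xpath(".//a/span[@class='title']/text()")[0]
        # span[2] is an XPath positional predicate: the second <span> child.
        score = li.xpath(".//div[@class='star']/span[2]/text()")[0]
        # The quote is optional, so check the list instead of indexing blindly.
        quote = li.xpath(".//p[@class='quote']/span/text()")
        rows.append({
            'name': title,
            'rating_score': score,
            'introduce': quote[0] if quote else '',
        })
    return rows

print(extract(SAMPLE))
```

Checking whether the result list is empty, rather than catching an exception, is one way to handle entries that lack an optional field.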

Source code
import requests
from lxml import etree
import pandas as pd
import os
def get_html(url):
    # Send a browser-like User-Agent header with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'}
    try:
        html = requests.get(url, headers=headers)
        html.encoding = html.apparent_encoding
        if html.status_code == 200:
            print('Fetched page source successfully')
        return html.text
    except Exception as e:
        # Return None so the caller can tell that the request failed.
        print('Failed to fetch page source: %s' % e)
        return None
def parse_html(html):
    movies = []
    imgurls = []
    html = etree.HTML(html)
    lis = html.xpath("//ol[@class='grid_view']/li")
    for li in lis:
        name = li.xpath(".//a/span[@class='title']/text()")[0]  # movie title; xpath() returns a list, [0] takes the first match
        director_actor = li.xpath(".//div[@class='bd']/p/text()")[0].strip()  # director and cast; strip() trims surrounding whitespace
        info = li.xpath(".//div[@class='bd']/p/text()")[1].strip()  # second text node; [@class='bd'] is an XPath predicate
        rating_score = li.xpath(".//div[@class='star']/span[2]/text()")[0]
        rating_num = li.xpath(".//div[@class='star']/span[4]/text()")[0]
        try:
            introduce = li.xpath(".//p[@class='quote']/span/text()")[0]
        except IndexError:  # no quote for this entry
            introduce = ""
        imgurl = li.xpath(".//img/@src")[0]

        movie = {'name': name, 'director_actor': director_actor, 'info': info, 'rating_score': rating_score, 'rating_num': rating_num, 'introduce': introduce}  # one dict per movie
        movies.append(movie)
        imgurls.append(imgurl)
    return movies, imgurls

def downloading(url, movie):
    poster_dir = r'/new project/web_scrap/chp10/movieposter'
    # Create the poster folder on first use.
    if not os.path.isdir(poster_dir):
        os.mkdir(poster_dir)
    img = requests.get(url).content
    # Write to an explicit path instead of changing the working directory.
    with open(os.path.join(poster_dir, movie['name'] + '.jpg'), 'wb') as f:
        print("Downloading: %s" % url)
        f.write(img)
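if __name__ == '__main__':
    movies1 = []
    imgurls1 = []
    for i in range(10):  # 10 pages, 25 movies per page
        url = 'https://movie.douban.com/top250?start=' + str(i*25) + '&filter='
        html = get_html(url)
        if html is None:  # skip a page that could not be fetched
            continue
        movies2, imgurls2 = parse_html(html)
        print(len(movies2), len(imgurls2))
        movies1.extend(movies2)
        imgurls1.extend(imgurls2)

        print(len(imgurls1), len(movies1))  # running totals
    # Pair each poster URL with its movie; avoids an IndexError if fewer than 250 were parsed.
    for imgurl, movie in zip(imgurls1, movies1):
        downloading(imgurl, movie)

    os.chdir(r'/new project/web_scrap/chp10')
    moviedata = pd.DataFrame(movies1)
    moviedata.to_csv('movie.csv')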
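
The last step, writing the collected dicts to CSV with pandas, can be tried on its own with made-up rows. This is only a sketch: the `save_movies` helper, the sample rows, and the temporary output path are hypothetical, and the `index=False` and `encoding='utf-8-sig'` arguments are optional choices that the script above does not use.

```python
import os
import tempfile

import pandas as pd

def save_movies(movies, path):
    """Write a list of dicts to a CSV file, one row per movie."""
    df = pd.DataFrame(movies)
    # index=False leaves out the auto-generated row index column.
    # utf-8-sig prepends a BOM, which helps Excel detect UTF-8 text
    # such as Chinese titles.
    df.to_csv(path, index=False, encoding='utf-8-sig')
    return df

# Made-up rows using two of the keys produced by parse_html.
sample = [
    {'name': '示例电影', 'rating_score': '9.0'},
    {'name': 'Example Movie', 'rating_score': '8.5'},
]

out_path = os.path.join(tempfile.mkdtemp(), 'movie_sample.csv')
save_movies(sample, out_path)

# Read the file back as strings to confirm the round trip.
loaded = pd.read_csv(out_path, encoding='utf-8-sig', dtype=str)
print(loaded)
```

Without `index=False`, `to_csv` also writes the DataFrame's row index as an unnamed first column, which is what the script's plain `to_csv('movie.csv')` call does.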

Scraping results
[Image: screenshot of the scraping results]
