豆瓣电影信息采集

豆瓣电影信息采集

1.思路
先能爬一篇电影的信息,之后能爬排行榜url列表,之后在列表里循环一篇信息的爬取程序就大功告成
2.爬一篇信息的过程
技术python3.7.2+requests+BeautifulSoup4

def OneMovieInfo(url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
        , 'Accept-Encoding': 'gzip, deflate, br'
        , 'Accept-Language': 'zh,zh-TW;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6'
        , 'Cache-Control': 'max-age=0'
        , 'Connection': 'keep-alive'
        ,
        'Cookie': 'll="118099"; bid=oLMMQ8yWn4I; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1553687581%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DZo7zPOnH0SlfVKADj2cjWHUbGmPZadbaxVA16Uo7gohz5M6uUf2px5J6LIBc_CWc%26wd%3D%26eqid%3Df9f78bc3000b1696000000035c9b641c%22%5D; _pk_ses.100001.4cf6=*; __utma=30149280.1773706481.1553687581.1553687581.1553687581.1; __utmb=30149280.0.10.1553687581; __utmc=30149280; __utmz=30149280.1553687581.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utma=223695111.603638443.1553687581.1553687581.1553687581.1; __utmb=223695111.0.10.1553687581; __utmc=223695111; __utmz=223695111.1553687581.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __yadk_uid=rVLAl8dxM3Zxd3wGfI2BGiQvOthM6xuP; _vwo_uuid_v2=DF4DCE411E806F129A7A313A65BC63B65|071d6cd2c6aa78a4896e8f2d9377ee57; _pk_id.100001.4cf6=ab773b0cb7ee49c3.1553687581.1.1553687590.1553687581.'
        , 'Host': 'movie.douban.com'
        , 'Referer': 'https://movie.douban.com/'
        , 'Upgrade-Insecure-Requests': '1'
        ,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    }
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    title = soup.select('#content h1')[0].text.replace('\n', '')
    info = soup.select('#info')[0].text.lstrip('\n').rstrip('\n')
    Temp_pinglun = soup.select('.short')
    pinglun = [i.text for i in Temp_pinglun]
    score = soup.select('.rating_num')[0].text
    print(title)
    print(score)
    print(info)
    print(pinglun)

3.爬取排行榜
这时候要注意排行榜数据不再是单页面简单的HTML了,而是通过XHR异步手段传进来的,通过谷歌开发者工具分析确定了真正的json路径,重新构造代码

def All_Url_list(url):
    headers={
    'Accept': '*/*'
    ,'Accept-Encoding': 'gzip, deflate, br'
    ,'Accept-Language': 'zh,zh-TW;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6'
    ,'Connection': 'keep-alive'
    ,'Cookie': 'll="118099"; bid=oLMMQ8yWn4I; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1553687581%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DZo7zPOnH0SlfVKADj2cjWHUbGmPZadbaxVA16Uo7gohz5M6uUf2px5J6LIBc_CWc%26wd%3D%26eqid%3Df9f78bc3000b1696000000035c9b641c%22%5D; _pk_ses.100001.4cf6=*; __utma=30149280.1773706481.1553687581.1553687581.1553687581.1; __utmb=30149280.0.10.1553687581; __utmc=30149280; __utmz=30149280.1553687581.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utma=223695111.603638443.1553687581.1553687581.1553687581.1; __utmb=223695111.0.10.1553687581; __utmc=223695111; __utmz=223695111.1553687581.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __yadk_uid=rVLAl8dxM3Zxd3wGfI2BGiQvOthM6xuP; _vwo_uuid_v2=DF4DCE411E806F129A7A313A65BC63B65|071d6cd2c6aa78a4896e8f2d9377ee57; _pk_id.100001.4cf6=ab773b0cb7ee49c3.1553687581.1.1553689627.1553687581.'
    ,'Host': 'movie.douban.com'
    ,'Referer': 'https://movie.douban.com/typerank?type_name=%E5%96%9C%E5%89%A7&type=24&interval_id=100:90&action='
    ,'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    ,'X-Requested-With': 'XMLHttpRequest'
    }
    data=requests.get(url,headers=headers)
    result=data.json()
    url_list=[i['url'] for i in result]
    return url_list

4最后在main方法里套个循环就完事了

if __name__ == '__main__':
    url='https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=0&limit=20'
    url_list=All_Url_list(url)
    for i in url_list:
        OneMovieInfo(i)

5.程序效果
在这里插入图片描述

  • 2
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值