Scraping the Douban Movie Top 250


FZ小冰

Category: Web scraping. Tags: python, web scraping, html

Article link: https://blog.csdn.net/wangjin56789/article/details/122144329

1. Scraping the Douban Movie Top 250

Note: for reference only.

Contents

1. Scraping the Douban Movie Top 250
Preface
I. Usage steps
- 1. Scraper code (m_douban.py)
Summary

Preface

This is a very basic web scraper, suitable for beginners getting started.

I. Usage steps

1. Scraper code (m_douban.py)

The code is as follows (example):

import requests
from lxml import etree
import time
import csv
def download(args):
    headers={
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36"
    }
    # The list is paginated 25 movies per page: start=0, 25, ..., 225
    for i in range(0,250,25):
        url = "https://movie.douban.com/top250?start={}&filter=".format(i)
        resp = requests.get(url,headers=headers)
        html = etree.HTML(resp.text)
        divs = html.xpath("/html/body/div[3]/div[1]/div/div[1]/ol/li")
        for div in divs:
            img_url= div.xpath("./div/div[1]/a/img/@src")[0] # URL of the poster image
            num=int(div.xpath("./div/div[1]/em/text()")[0]) # rank of the movie (int)
            img_name=div.xpath("./div/div[2]/div[1]/a/span[1]/text()")[0] # movie name
            actor=div.xpath("./div/div[2]/div[2]/p/text()[1]")[0].strip()  # director and cast
            attribute=div.xpath("./div/div[2]/div[2]/p/text()[2]")[0].strip()   # year / country / genres
            score=float(div.xpath("./div/div[2]/div[2]/div/span[2]/text()")[0]) # rating (float)
            evaluate=int(div.xpath("./div/div[2]/div[2]/div/span[4]/text()")[0].strip('人评价')) # number of ratings (int)
            shuzu=attribute.split('/')
            if '\xa0' not in shuzu[0]:
                # Special case: the first field has no non-breaking space, so the
                # split above does not yield year / country / genres. The values
                # for this entry are hard-coded.
                year=1961
                nation='中国大陆'
                attribute_1='剧情 动画 奇幻 古装'
            else:
                year=int(shuzu[0].strip('\xa0'))
                nation=shuzu[1].strip('\xa0')
                attribute_1=shuzu[2].strip('\xa0')
            if len(div.xpath("./div/div[2]/div[2]/p[2]/span")) == 0:
                Good_sentence="No quote available for this movie"
            else:
                Good_sentence=div.xpath("./div/div[2]/div[2]/p[2]/span/text()")[0] # one-line quote for the movie
            # Concatenate all title spans into one title string
            a=div.xpath("./div/div[2]/div[1]/a/span")
            title=''
            for span in a:
                title_text=span.xpath("./text()")[0].strip()
                title=title+title_text
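            if args==0:
                # Download the poster into the img_example directory
                # (fetch the image only when it is actually being saved)
                img=requests.get(img_url)
                with open("img_example/"+img_name+".jpg",mode='wb') as p:
                    p.write(img.content)
            # Store the movie information
            if args==1:
                # newline="" avoids blank lines between rows on Windows;
                # the with block closes the file after each row
                with open("douban.csv",mode="a",encoding="utf-8",newline="") as f:
                    csvwriter=csv.writer(f)
                    csvwriter.writerow([num,title,img_url,actor,year,nation,attribute_1,score,evaluate,Good_sentence])
            print(num,title,img_url,actor,attribute,score,evaluate,Good_sentence)
            time.sleep(2)  # pause between movies to avoid hammering the site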
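    resp.close()
    print('over')

# These calls must sit at module level: inside the function body they would
# never run (and would recurse endlessly if the function were called).
if __name__ == "__main__":
    download(1)  # store the movie information
    download(0)  # store the poster images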
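
The listing hard-codes the year, country, and genres for an entry whose first field lacks a non-breaking space. As a minimal sketch of an alternative, the helpers below parse the same strings with regular expressions and `int`, rather than evaluating the text as code. `parse_info_line` and `parse_vote_count` are hypothetical names introduced here, not part of the original script. The sketch assumes, based on the `strip('\xa0')` calls in the listing, that the three fields are separated by a slash surrounded by non-breaking spaces; the sample strings are made up for illustration.

```python
import re

NBSP = "\xa0"  # non-breaking space, as stripped in the original listing


def parse_info_line(line):
    # Assumed layout: "<year...>{NBSP}/{NBSP}<country>{NBSP}/{NBSP}<genres>"
    fields = [part.strip() for part in line.strip().split(NBSP + "/" + NBSP)]
    if len(fields) != 3:
        raise ValueError("unexpected info line: {!r}".format(line))
    # Take the first four-digit year even if several years are listed
    match = re.match(r"\d{4}", fields[0])
    if match is None:
        raise ValueError("no year found in: {!r}".format(fields[0]))
    return int(match.group()), fields[1], fields[2]


def parse_vote_count(text):
    # Extract the digits from text such as "2000000人评价"
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError("no number found in: {!r}".format(text))
    return int(match.group())


if __name__ == "__main__":
    sample = "1994" + NBSP + "/" + NBSP + "美国" + NBSP + "/" + NBSP + "犯罪 剧情"
    print(parse_info_line(sample))
    print(parse_vote_count("2000000人评价"))
```

If the page layout differs from this assumption, `parse_info_line` raises `ValueError` instead of silently storing wrong values, which makes the mismatch easy to spot.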

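Summary

Note: before running, create a folder named img_example in the directory the script is run from, since the code writes posters to the relative path img_example/. You can also create an empty douban.csv there, although the file is opened in append mode and is created automatically if it is missing. Because rows are appended, delete douban.csv before a rerun to avoid duplicate rows.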
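
The manual setup described above can also be done by the script itself. The following is a minimal sketch, assuming the same `img_example` folder and `douban.csv` file names as the listing. `prepare_output` and the English column names in `HEADER` are hypothetical, chosen here only to mirror the order of the values passed to `csvwriter.writerow`.

```python
import csv
from pathlib import Path

# Illustrative column names; the order follows the row written in the listing
HEADER = ["num", "title", "img_url", "actor", "year",
          "nation", "genres", "score", "votes", "quote"]


def prepare_output(base_dir="."):
    base = Path(base_dir)
    # Create the poster folder if it does not exist yet
    img_dir = base / "img_example"
    img_dir.mkdir(parents=True, exist_ok=True)
    # Create the CSV with a header row only when the file is missing,
    # so an existing file is left untouched
    csv_path = base / "douban.csv"
    if not csv_path.exists():
        with csv_path.open("w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerow(HEADER)
    return img_dir, csv_path


if __name__ == "__main__":
    img_dir, csv_path = prepare_output()
    print(img_dir, csv_path)
```

Calling `prepare_output()` once before `download(1)` and `download(0)` would remove the need to create the folder by hand.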
