Web Scraping (1): Douban Movie Top 250 Ranking Information

肉圆好好吃

Published 2021-03-29 22:21:38

Category: Web scraping

Article link: https://blog.csdn.net/jiliguluguji/article/details/115312898

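Summary (generated automatically by CSDN): This post describes in detail how to use Python's requests and BeautifulSoup libraries to scrape the Douban Movie Top 250, including each film's rank, title, one-line review, and rating, and how to store the data in an Excel file. It walks through the pages by changing the start parameter in the URL, then parses the HTML and finds specific tags to extract the required data.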

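1: Framework

Use requests + BeautifulSoup to scrape information about the top 250 films on Douban Movie.

2: Workflow

2.1 Fetching the page with requests

[Figure: page source code]

As the figure above shows, all of the movie information sits under the <ol> tag. Our task is to have BeautifulSoup locate that position in the fetched page.
The code below uses requests to fetch the data we need: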

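import requests


def get_html(url, code="utf-8"):
    # Send a browser-like User-Agent; the default one may be rejected.
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv, timeout=10)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = code
        return r.text
    except requests.RequestException:
        print("Request failed")
        return None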

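The page URL (https://movie.douban.com/top250?start=25&filter=) shows the pattern: each click on "next page" increases start by 25, which gives 10 pages in total (250 / 25).
For each request, the start parameter is filled in as follows: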

    for i in range(10):
        get_url = url.format(i*25)
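
To make the start arithmetic concrete, here is a minimal sketch that builds all ten page URLs up front. The helper name page_urls and the constants PAGE_SIZE and TOTAL are illustrative and not part of the original script; only the URL template comes from this post.

```python
BASE_URL = "https://movie.douban.com/top250?start={}"  # template used in this post
PAGE_SIZE = 25   # films per page (assumed constant name)
TOTAL = 250      # films in the whole list (assumed constant name)


def page_urls(template=BASE_URL, total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page, with start = 0, 25, ..., 225."""
    return [template.format(start) for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Stepping range() by the page size gives the same values as i*25 for i in range(10), and keeps the page count tied to the two constants.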

The fetched HTML text is passed to BeautifulSoup for parsing. The find method locates the target tags, and the extracted values are stored in a list.

def parse_html(html, i):
    # i is the zero-based page index; it is not needed for the rank (see below).
    soup = BeautifulSoup(html, "html.parser")
    info_page = soup.find("ol", attrs={"class": "grid_view"})
    films = info_page.find_all('li')
    for film in films:
        rank = film.find("em").text  # rank
        name = film.find("span", attrs={"class": "title"}).text  # title
        inq = film.find('span', attrs={'class': 'inq'})  # one-line review, may be absent
        description = inq.text if inq else ""
        score = film.find("span", attrs={"class": "rating_num"}).text  # rating
        # The <em> text already holds the overall rank (1-250), so no page offset is added.
        list_info.append([int(rank), name, description, score])
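
The following sketch runs the same find / find_all lookups on a small hand-written HTML sample, so it can be tried without touching the network. The sample markup, the film names, and the helper text_or_default are assumptions made for illustration; real Douban pages contain far more markup. The helper returns an empty string when a tag is missing, which matters if a list item has no span with class inq.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure described above (not real Douban output).
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <em>1</em>
    <span class="title">Film A</span>
    <span class="rating_num">9.7</span>
    <span class="inq">A one-line comment.</span>
  </li>
  <li>
    <em>2</em>
    <span class="title">Film B</span>
    <span class="rating_num">9.6</span>
  </li>
</ol>
"""


def text_or_default(parent, name, css_class, default=""):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = parent.find(name, attrs={"class": css_class})
    return tag.get_text(strip=True) if tag is not None else default


def parse_sample(html):
    """Collect [rank, title, quote, score] for every <li> under ol.grid_view."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for film in soup.find("ol", attrs={"class": "grid_view"}).find_all("li"):
        rows.append([
            int(film.find("em").get_text(strip=True)),
            text_or_default(film, "span", "title"),
            text_or_default(film, "span", "inq"),
            text_or_default(film, "span", "rating_num"),
        ])
    return rows


rows = parse_sample(SAMPLE_HTML)
print(rows)
```

Returning the rows instead of appending to a global list is a design choice of this sketch; it makes the function easier to test on its own.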

Finally, all results are saved, mainly to an Excel file. The complete code for the whole process is as follows:

# Scrape the Douban movie ranking with beautifulsoup + requests
from bs4 import BeautifulSoup
import requests
import xlwt

list_info = []
url = "https://movie.douban.com/top250?start={}"


def get_html(url, code="utf-8"):
    # Send a browser-like User-Agent; the default one may be rejected.
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv, timeout=10)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = code
        return r.text
    except requests.RequestException:
        print("Request failed")
        return None


def parse_html(html, i):
    # i is the zero-based page index; it is not needed for the rank (see below).
    soup = BeautifulSoup(html, "html.parser")
    info_page = soup.find("ol", attrs={"class": "grid_view"})
    films = info_page.find_all('li')
    for film in films:
        rank = film.find("em").text  # rank
        name = film.find("span", attrs={"class": "title"}).text  # title
        inq = film.find('span', attrs={'class': 'inq'})  # one-line review, may be absent
        description = inq.text if inq else ""
        score = film.find("span", attrs={"class": "rating_num"}).text  # rating
        # The <em> text already holds the overall rank (1-250), so no page offset is added.
        list_info.append([int(rank), name, description, score])


def save_Films(list_infos, filePath):
    # Open the file once and write one record per line.
    with open(filePath, 'a', encoding="utf-8") as f:
        for info in list_infos:
            f.write(str(info) + '\n')


def save_excel(list_infos, filename, list_names):
    workbook = xlwt.Workbook()
    sheet_01 = workbook.add_sheet("sheet_01")
    # Header row
    for col, list_name in enumerate(list_names):
        sheet_01.write(0, col, list_name)
    # One row per film, starting below the header
    for row, info in enumerate(list_infos, start=1):
        for col, value in enumerate(info):
            sheet_01.write(row, col, value)
    workbook.save(filename)


if __name__ == "__main__":
    file_name = "./douban_text.xls"
    list_names = ["Rank", "Name", "Short review", "Rating"]
    for i in range(10):
        get_url = url.format(i*25)
        html = get_html(url=get_url)  # fetch the formatted URL, not the template
        if html is None:
            continue  # skip pages that failed to download
        parse_html(html, i)
    save_excel(list_info, file_name, list_names)

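save_Films stores the results in a plain-text file instead. To use it, call it in place of save_excel and change the extension of file_name accordingly (for example to .txt).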
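
If a text file is preferred over Excel, one possible variation is a CSV file, which spreadsheet programs can also open. This sketch swaps in the standard-library csv module rather than the save_Films function above; save_csv, the sample rows, and the output path are hypothetical names used only for illustration.

```python
import csv
import os
import tempfile


def save_csv(rows, path, header):
    """Write the header and all rows to a UTF-8 CSV file, opening the file once."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Made-up rows in the same [rank, name, review, rating] layout as list_info.
sample_rows = [
    [1, "Film A", "A one-line comment.", "9.7"],
    [2, "Film B", "", "9.6"],
]
out_path = os.path.join(tempfile.gettempdir(), "douban_sample.csv")
save_csv(sample_rows, out_path, ["Rank", "Name", "Short review", "Rating"])

# Read the file back to check what was written; csv returns every cell as a string.
with open(out_path, encoding="utf-8", newline="") as f:
    loaded = list(csv.reader(f))
print(loaded)
```

Unlike str(i) + '\n', the csv writer quotes fields that contain commas or line breaks, so the file can be read back column by column.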
