Scraping the Douban Top 250 Movies with Python and Saving Them to Excel (Beginner-Friendly)

Latest recommended article published on 2022-09-18 11:59:11

exemplify



Tags: python, web scraping, data mining

Article link: https://blog.csdn.net/weixin_57431906/article/details/121303530


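This article shows how to scrape movie information from the Douban Movie Top 250 list in Python, first with the bs4 library combined with re regular expressions and then with xpath, and how to save the data to an Excel file. It provides source code for requesting the pages, extracting the data, and writing to Excel. It also advises adding time.sleep(1) to slow the crawl down and avoid being blocked, and closing each response once the request has finished.

Summary generated automatically by CSDN.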

URL: Douban Movie Top 250 (https://movie.douban.com/top250?start=)

I. Scraping with bs4 and re regular expressions

II. Scraping with xpath

I. Scraping with bs4 and re regular expressions

Source code:

import urllib.request,urllib.error
import re
from bs4 import BeautifulSoup
import xlwt

baseurl = "https://movie.douban.com/top250?start="
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}

title = re.compile('<span class="title">(.*?)</span>')  # movie title
link = re.compile('<a class="" href="(.*?)"')  # movie link
introduction = re.compile('<span class="inq">(.*?)</span>')  # one-line blurb
appraise = re.compile('<span>(.*?)</span>')  # number of ratings
basedata = []  # one list per movie; written to Excel at the end
workbook = xlwt.Workbook(encoding='utf-8')  # create a new Excel workbook
worksheet = workbook.add_sheet('daoban')  # add a worksheet to the workbook


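for i in range(0,10):
    url = baseurl+str(i*25)  # 25 movies per page: start=0, 25, ..., 225
    # build the request with a browser-like User-Agent
    request = urllib.request.Request(url=url,headers=head)
    # send the request and open the response
    response = urllib.request.urlopen(request)
    # read the response and decode it as UTF-8 (the page contains Chinese text)
    html = response.read().decode("utf-8")
    response.close()  # close the response once it has been read
    # print(html)  # the HTML as a string
    soup = BeautifulSoup(html,"html.parser")  # the second argument selects the html.parser parser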

    for item in soup.find_all('div',class_="info"):
        item = str(item)
        data = []  # each movie has several fields; collect them in one list per movie

        ftitle = re.findall(title,item)
        if len(ftitle) == 2:
            etitle = ftitle[0]
            data.append(etitle)
            rtitle = ftitle[1].replace("/","").strip()  # drop the separator and surrounding whitespace
            data.append(rtitle)
        else:
            data.append(ftitle[0])
            data.append(' ')  # no foreign title: leave the cell blank so the columns stay aligned

        flink = re.findall(link,item)[0]
        data.append(flink)
        # print(data)
        # fintroduction = re.findall(introduction,item)[0]  # fails because some movies have no blurb
        fintroduction = re.findall(introduction, item)
        if len(fintroduction) != 0:
            fintroduction = fintroduction[0].replace("。","")  # remove the trailing full stop
            data.append(fintroduction)
        else:
            data.append(" ")  # no blurb: leave the cell blank
        fappraise = re.findall(appraise,item)[0]
        data.append(fappraise)
        basedata.append(data)  # add this movie's list to the overall list

for i in range(len(basedata)):  # one row per scraped movie
    print("Record %d" % i)
    first = basedata[i]
    for j in range(len(first)):  # one column per field, including the rating count
        second = first[j]
        worksheet.write(i,j,second)  # write the field into row i, column j

workbook.save('豆瓣.xls')  # save the Excel file

print(basedata)
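To see what the four patterns pick out without hitting the site, here is a minimal sketch that runs them on a hand-written fragment. The fragment, its values, and the helper name `parse_item` are assumptions made for illustration; the live page markup may differ and can change over time.

```python
import re

# Same patterns as in the script above.
title = re.compile('<span class="title">(.*?)</span>')
link = re.compile('<a class="" href="(.*?)"')
introduction = re.compile('<span class="inq">(.*?)</span>')
appraise = re.compile('<span>(.*?)</span>')

# Hypothetical fragment shaped like one <div class="info"> block (not copied
# from the real page). \xa0 is the non-breaking space that "&nbsp;" becomes
# once BeautifulSoup has parsed the HTML.
SAMPLE_ITEM = (
    '<div class="info">'
    '<a class="" href="https://movie.douban.com/subject/0000000/">'
    '<span class="title">示例电影</span>'
    '<span class="title">\xa0/\xa0Sample Movie</span></a>'
    '<span class="rating_num">9.0</span>'
    '<span>12345人评价</span>'
    '<span class="inq">一句话简介。</span>'
    '</div>'
)


def parse_item(item):
    """Return [title, other title, link, blurb, rating count] for one movie."""
    titles = title.findall(item)
    # Second title is optional; use a blank so the columns stay aligned.
    other = titles[1].replace("/", "").strip() if len(titles) == 2 else " "
    blurbs = introduction.findall(item)
    # The blurb is optional too.
    blurb = blurbs[0].replace("。", "") if blurbs else " "
    return [titles[0], other, link.findall(item)[0], blurb,
            appraise.findall(item)[0]]


print(parse_item(SAMPLE_ITEM))
```

Every row comes back with the same five fields whether or not the optional ones are present, which is what keeps the Excel columns from shifting.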

1. Code for requesting the page and retrieving its HTML source:

import urllib.request,urllib.error

baseurl = "https://movie.douban.com/top250?start="
head
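The two pieces of advice from the summary, pausing between requests and closing each response, could look roughly like the sketch below. The names `page_urls` and `fetch_all`, the `opener` and `delay` parameters, and the shortened User-Agent string are assumptions for illustration and do not come from the article.

```python
import time
import urllib.request

BASE_URL = "https://movie.douban.com/top250?start="
# Placeholder header; the article sends a full browser User-Agent string.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def page_urls(pages=10, page_size=25):
    """Build the page URLs: start=0, 25, ..., 225 by default."""
    return [BASE_URL + str(i * page_size) for i in range(pages)]


def fetch_all(urls, opener=urllib.request.urlopen, delay=1.0):
    """Fetch each URL in turn and return the decoded HTML of every page.

    `opener` can be swapped out (for example in tests); `delay` is the
    pause in seconds between consecutive requests.
    """
    pages = []
    for n, url in enumerate(urls):
        if n:
            time.sleep(delay)  # slow down to reduce the risk of being blocked
        request = urllib.request.Request(url=url, headers=HEADERS)
        # The with-block closes the response even if reading fails.
        with opener(request) as response:
            pages.append(response.read().decode("utf-8"))
    return pages


# Usage (performs real network requests, so it is left commented out):
# html_pages = fetch_all(page_urls())
```

Using a `with` block instead of an explicit `close()` call is a design choice here: the response gets closed on every path, including when decoding raises an exception.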

II. Scraping with xpath

I have used this approach a few times myself. Compared with the one above, it seems to get blocked more easily, so use it with caution.

Source code:

import requests
from lxml import etree
import xlwt

workbook = xlwt.Workbook(encoding='utf-8')  # create a new Excel workbook
worksheet = workbook.add_sheet('daob
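The article's xpath code is cut off at this point. As a rough illustration of XPath-style extraction only, the sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and needs well-formed markup, in place of lxml. The sample fragment and the function name `extract_rows` are assumptions; with lxml, the comparable queries would typically go through `etree.HTML(html).xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment. The real page is HTML and would need an
# HTML-aware parser such as lxml, which the article imports.
SAMPLE_LIST = """
<ol>
  <li><div class="info">
    <a href="https://movie.douban.com/subject/0000001/"><span class="title">示例电影一</span></a>
    <span class="inq">一句话简介</span>
  </div></li>
  <li><div class="info">
    <a href="https://movie.douban.com/subject/0000002/"><span class="title">示例电影二</span></a>
  </div></li>
</ol>
"""


def extract_rows(xml_text):
    """Return [title, link, blurb] for every div with class "info"."""
    root = ET.fromstring(xml_text)
    rows = []
    for info in root.findall(".//div[@class='info']"):
        name = info.findtext(".//span[@class='title']")
        href = info.find(".//a").get("href")
        # Some movies have no blurb; fall back to a blank cell.
        blurb = info.findtext(".//span[@class='inq']", default=" ")
        rows.append([name, href, blurb])
    return rows


for row in extract_rows(SAMPLE_LIST):
    print(row)
```

Querying relative to each `info` block, rather than collecting all titles and all blurbs in separate page-wide lists, keeps the fields of one movie together even when an optional field is missing.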