Web Scraping Study Notes: Scraping the Douban Top 250 in Practice (Proceed with Caution)

Uses the requests library and the BeautifulSoup library.
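Both install with pip (pip install requests beautifulsoup4 lxml; lxml is the parser used below). A minimal fetch-and-parse sketch, just to show the pattern the full script builds on (https://example.com is only a placeholder URL, not part of the actual script):

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com')   # download the raw HTML
soup = BeautifulSoup(resp.text, 'lxml')      # parse it with the lxml parser
print(soup.title.string)                     # the text of the page's <title> tag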

Watch how much you scrape: I crawled the whole thing in one go and my IP got flagged..... (A small throttling sketch is appended after the full code.)

Without further ado, here comes the code!

The complete code is as follows:

import requests
from bs4 import BeautifulSoup


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Cookie': 'bid=IFQLyt3P1VA; __utmc=30149280; __utmc=223695111; __gads=ID=d82deb8436795c5b-227e11179ace0067:T=1636264521:RT=1636264521:S=ALNI_Ma_D80fB_lyk-FQhVvy7K8YYlgzDw; ll="118202"; _vwo_uuid_v2=DB907CD6F73DA2C4D44E6E975F7FB59A7|d9e5bef21c0ca4c326417172b5286bb2; dbcl2="249709309:WlpE/C5Yo0E"; ck=N6JD; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1636271761%2C%22https%3A%2F%2Faccounts.douban.com%2F%22%5D; _pk_id.100001.4cf6=d26da555220694c9.1636264523.2.1636271761.1636266298.; _pk_ses.100001.4cf6=*; __utma=30149280.1730494055.1636264523.1636264523.1636271761.2; __utmb=30149280.0.10.1636271761; __utmz=30149280.1636271761.2.2.utmcsr=accounts.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=223695111.115091681.1636264523.1636264523.1636271761.2; __utmb=223695111.0.10.1636271761; __utmz=223695111.1636271761.2.2.utmcsr=accounts.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/; push_noty_num=0; push_doumail_num=0'
}  # substitute your own User-Agent and Cookie here


# Collect the detail-page URLs from one list page
def get_detail_url(url):
    resp = requests.get(url, headers=headers)  # fetch the page source
    html = resp.text  # keep the raw HTML
    soup = BeautifulSoup(html, 'lxml')  # parse the HTML with BeautifulSoup, using the lxml parser
    lis = soup.find('ol', class_='grid_view').find_all('li')  # inspecting the page shows each movie sits in an <li> under <ol class="grid_view">
    detail_urls = []
    # collect every detail-page URL into a list
    for li in lis:
        detail_url = li.find('a')['href']
        # print(detail_url)
        detail_urls.append(detail_url)
    return detail_urls


# Parse a detail page and write one row to the CSV file
def parse_detail_url(url, f):
    resp = requests.get(url, headers=headers)
    html = resp.text
    soup = BeautifulSoup(html, 'lxml')
    # movie title
    name = list(soup.find('div', id='content').find('h1').stripped_strings)
    name = ''.join(name)
    # director(s)
    director = list(soup.find('div', id='info').find('span').find('span', class_='attrs').stripped_strings)
    director = '/'.join(director)
    # print(director)
    # screenwriter(s)
    screenwriter = list(soup.find('div', id='info').find_all('span')[3].find('span', class_='attrs').stripped_strings)
    screenwriter = '/'.join(screenwriter)
    # print(screenwriter)
    # lead actors
    actor = list(soup.find('span', class_='actor').find('span', class_='attrs').stripped_strings)
    actor = '/'.join(actor)
    # print(actor)
    # rating
    score = soup.find('strong', class_='ll rating_num').string
    print(score)
    f.write('{},{},{},{},{}\n'.format(name, director, screenwriter, actor, score))  # write one CSV row


def main():
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    with open('Top250.csv', 'a', encoding='utf-8') as f:
        for i in range(0, 26, 25):  # pagination: start=0 and start=25; keep the page count low, too many requests get you flagged
            url = base_url.format(i)
            detail_urls = get_detail_url(url)

            for detail_url in detail_urls:
                parse_detail_url(detail_url, f)


if __name__ == '__main__':
    main()
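As noted at the top, pulling too many pages too fast is what gets an IP flagged. Below is a small throttling sketch of main(), under my own assumption (not anything Douban documents) that a 2-4 second pause between detail pages is polite enough; it reuses get_detail_url() and parse_detail_url() from the code above:

import random
import time


def main():
    base_url = 'https://movie.douban.com/top250?start={}&filter='
    with open('Top250.csv', 'a', encoding='utf-8') as f:
        for i in range(0, 26, 25):  # still only the first two list pages
            for detail_url in get_detail_url(base_url.format(i)):
                parse_detail_url(detail_url, f)
                time.sleep(2 + random.random() * 2)  # wait 2-4 seconds between detail pages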
