Scraping Douban's Top 250 Books with Python (BeautifulSoup approach)

Latest recommended article published on 2024-05-01 14:26:58

吃花椒的喵醬


Category: Web crawlers. Tags: python, regular expressions, excel, crawler list

Article link: https://blog.csdn.net/m0_51908955/article/details/113888864

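Required libraries

requests (sends HTTP requests to the site)
beautifulsoup (extracts information from the HTML; imported as bs4)
re (regular expressions, part of the standard library)
fake_useragent (generates randomized User-Agent request headers)
xlwt (writes Excel .xls files)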

Preparation

Open the Douban Top 250 books page at https://book.douban.com/top250, inspect its HTML structure, and locate the elements that hold each book's information.
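Once the relevant elements are located, extraction with BeautifulSoup might look like the minimal sketch below. The sample markup, the class names (`item`, `pl2`, `pl`, `rating_nums`) and the helper name `parse_books` are assumptions made for illustration, not something the original post shows; check them against the live page before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <tr class="item">
    <td>
      <div class="pl2"><a href="https://example.com/book/1" title="Example Book">Example Book</a></div>
      <p class="pl">Example Author / Example Press / 2000-1 / 10.00</p>
      <span class="rating_nums">9.0</span>
    </td>
  </tr>
</table>
"""


def parse_books(html):
    """Return one dict per book row found in the HTML (assumed structure)."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for row in soup.find_all("tr", class_="item"):
        title_box = row.find("div", class_="pl2")
        link = title_box.find("a") if title_box else None
        if link is None:
            continue  # skip rows that do not match the assumed layout
        info = row.find("p", class_="pl")
        rating = row.find("span", class_="rating_nums")
        books.append({
            "title": link.get("title", "").strip(),
            "url": link.get("href", ""),
            "info": info.get_text(strip=True) if info else "",
            "rating": rating.get_text(strip=True) if rating else "",
        })
    return books


print(parse_books(SAMPLE_HTML))
```

Running the parser against a saved fragment like this keeps selector debugging separate from network access, which helps when the site throttles repeated requests.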

Code

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import re
import xlwt

ua = UserAgent()  # generates randomized User-Agent strings


def getBooks(page):
    # Each page lists 25 books; pages after the first are selected
    # with the ?start= offset in the Top 250 URL.
    if page == 0:
        url = 'https://book.douban.com/top250'
    else:
        url = 'https://book.douban.com/top250' + '?start=' + str(page * 25)
    try:
        user_agent = ua.random
        # Send a randomized User-Agent to improve the request's chance of success.
        res = requests.get(url, headers={'User-Agent': user_agent})
        res.raise_for_status()  # raise an exception on a 4xx/5xx status code
        res.encoding = res.apparent_encoding  # detect the encoding so res.text decodes correctly
    except Exception as e:
        print("Request failed:", e)
        return None  # res may be undefined here, so do not read res.text
    return res.text
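The rest of the original article is not available here, so the following is only a rough sketch of how the remaining pieces could fit together: building the URL for every page with the same `?start=` offset as `getBooks`, and writing rows to an Excel file with xlwt. The names `page_url`, `save_rows`, the column headers, the sheet name and the demo row are all hypothetical placeholders, not the author's code.

```python
import os
import tempfile

import xlwt

BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25    # books per page, matching the page * 25 offset
TOTAL_PAGES = 10  # 250 books / 25 per page


def page_url(page):
    """Build the URL for a zero-based page index."""
    return BASE_URL if page == 0 else f"{BASE_URL}?start={page * PAGE_SIZE}"


def save_rows(rows, path, headers=("Title", "Info", "Rating")):
    """Write a header row plus one row per book to an .xls file."""
    workbook = xlwt.Workbook(encoding="utf-8")
    sheet = workbook.add_sheet("top250")
    for col, name in enumerate(headers):
        sheet.write(0, col, name)
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row):
            sheet.write(r, c, value)
    workbook.save(path)


urls = [page_url(p) for p in range(TOTAL_PAGES)]
print(urls[0], urls[-1])

# Placeholder data standing in for parsed results (not real scraped values).
demo_rows = [("Example Book", "Example Author / Example Press", "9.0")]
out_path = os.path.join(tempfile.gettempdir(), "douban_top250_demo.xls")
save_rows(demo_rows, out_path)
```

In a full run, each URL's HTML would be fetched (for example with `getBooks(page)`), parsed into rows, and the accumulated rows passed to `save_rows` once at the end. Note that xlwt produces the legacy .xls format only.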
