Scraping the Douban Top 250 with Python and writing it to a database

youngerXZ

Category: python爬虫 (Python web scraping). Tags: python, database

Article link: https://blog.csdn.net/z_kitty/article/details/107387705

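This article shows how to use Python to send requests that imitate a browser, getting around Douban's anti-scraping measures, and fetch the HTML pages of the Douban Top 250 books. It covers configuring SSL verification in the requests library, handling pagination, processing the scraped data, and finally storing it in a database.

Summary generated automatically by CSDN.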

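Fetching the Douban Top 250 page HTML

HTML page

Douban has anti-scraping measures, so after trying various approaches I settled on sending requests that imitate a browser.
When making the request with requests, you can pass verify=False to skip SSL certificate verification.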

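import requests
from bs4 import BeautifulSoup
import re
from faker import Faker
from database.dbc import Pymysql_dbc

def getHTMLText(url):
    faker = Faker()
    headers = {
        # Random browser-like User-Agent so the request resembles a real browser
        'User-Agent': faker.user_agent(),
        'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
    }
    # Suppress the InsecureRequestWarning that verify=False would otherwise trigger
    requests.packages.urllib3.disable_warnings()
    response = requests.get(url, headers=headers, verify=False)
    if response.status_code == 200:
        # The source is cut off after "BeautifulSoup(response"; the arguments
        # and the return below are an assumed completion, not the author's code.
        html = BeautifulSoup(response.text, 'html.parser')
        return html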
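
The summary mentions pagination, but that part of the article is not shown. A minimal sketch of how the page URLs could be generated follows. It assumes the list lives at https://book.douban.com/top250, shows 25 items per page, and is paged with a `start` query parameter. The names `BASE_URL`, `PAGE_SIZE`, `TOTAL` and `page_urls` are hypothetical and not taken from the article.

```python
from urllib.parse import urlencode

# Assumptions, not from the article: listing URL, page size, and `start` offset paging.
BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25
TOTAL = 250


def page_urls(base_url=BASE_URL, total=TOTAL, page_size=PAGE_SIZE):
    """Yield one URL per listing page, stepping the `start` offset by page_size."""
    for start in range(0, total, page_size):
        yield f"{base_url}?{urlencode({'start': start})}"


if __name__ == "__main__":
    for url in page_urls():
        print(url)
```

Each generated URL could then be passed to getHTMLText(url), ideally with a short pause between requests so the site is not hit too quickly.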

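
The database step is also outside the visible part of the article, and the Pymysql_dbc helper it imports is not shown. The sketch below therefore uses the standard-library sqlite3 module in place of MySQL, only so that it runs without a server. The table name, column names and the `save_books` function are hypothetical, and the rows in the usage part are placeholders rather than scraped data.

```python
import sqlite3


def save_books(conn, books):
    """Create the table if needed, then insert (ranking, title, rating) rows.

    Parameterized queries are used so scraped text is never spliced into SQL.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS douban_top250 ("
        "ranking INTEGER PRIMARY KEY, title TEXT NOT NULL, rating REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO douban_top250 (ranking, title, rating) "
        "VALUES (?, ?, ?)",
        books,
    )
    conn.commit()


if __name__ == "__main__":
    # Placeholder rows for illustration only, not real Douban data.
    sample = [(1, "Example Title A", 9.0), (2, "Example Title B", 8.5)]
    with sqlite3.connect(":memory:") as conn:
        save_books(conn, sample)
        print(conn.execute("SELECT COUNT(*) FROM douban_top250").fetchone()[0])
```

With pymysql the overall shape would be similar, but the placeholders are written as %s instead of ?, and the statements go through a cursor obtained from the connection.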