Scraping Douban Books with Python: 爬取豆瓣读书.py

Latest recommended article published 2024-04-26 18:35:17

weixin_39939530

Tags: scraping Douban Books with Python

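This post shows how to use a Python crawler to scrape the popular book listings on the Douban Read site (read.douban.com): title, author, rating, number of ratings, and price. The records are stored directly in a MongoDB database, which makes them easy to query. The fake_useragent library supplies a random browser User-Agent header, and the page HTML is parsed to extract the required fields.

Summary generated by CSDN using AI technology.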
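As a small illustration of the querying convenience mentioned above, here is a minimal sketch of how a connection string with a custom port and credentials could be built, and how a filter for books rated 9.0 might look. The helper names `build_mongo_uri` and `rating_filter` are hypothetical and not part of the original script. The sketch assumes the rating is stored as text under the key `书的评分`, which is what the script in this post does, and that the site renders the score as `9.0`.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port=27017, username=None, password=None):
    """Return a mongodb:// URI; credentials are percent-encoded for the URI."""
    if username and password:
        auth = "{}:{}@".format(quote_plus(username), quote_plus(password))
    else:
        auth = ""
    return "mongodb://{}{}:{}".format(auth, host, port)


def rating_filter(score):
    """Build a filter for an exact rating match.

    The script stores ratings as text, so the value is compared as a string.
    """
    return {"书的评分": str(score)}


if __name__ == "__main__":
    print(build_mongo_uri("localhost"))
    print(rating_filter(9.0))
    # With pymongo installed and a reachable server, usage would look like:
    #   import pymongo
    #   coll = pymongo.MongoClient(build_mongo_uri("localhost")).Douban_reading.text
    #   for book in coll.find(rating_filter(9.0)):
    #       print(book["书名"])
```

Because the ratings are strings, range queries such as "rated above 9.0" would need the values converted to numbers before insertion; the exact-match filter above avoids that question.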

import requests
from fake_useragent import UserAgent
from pyquery import PyQuery as pq
import csv
import time
import pymongo
import random

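'''
I am not yet comfortable with the csv module, so the write2csv function is commented out:
it raised an error whenever it ran. I will fix it once I have learned csv; for now the data goes straight into a database.
Douban is a static page: all the data we want is already present in the page HTML,
so there is no need to deal with encryption or other advanced tricks. Fetch the HTML, extract the data, and that is it.
If you know csv and want CSV output, fix write2csv; otherwise just change the database host and store the data in the database.
With a database, querying is much more convenient than with CSV, for example finding books rated 9.0.
If your database uses a non-default port, pass the port after the host argument; if it needs a username and password, add those yourself as well.
'''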

# Replace the placeholder with the host of your own MongoDB instance.
client=pymongo.MongoClient(host='your-database-host')
db=client.Douban_reading
coll=db.text
ua=UserAgent()

def parsing(page):
    # `page` fills the `start` query parameter, so it is an item offset rather than a page number.
    URL = 'https://read.douban.com/kind/100?start={}&sort=hot&promotion_only=False&min_price=None&max_price=None&works_type=None'.format(page)
    headers = {
        'User-Agent': ua.random
    }
    sponse = requests.get(URL, headers=headers).text
    doc=pq(sponse)
    All=doc('.item.store-item').items()
    for i in All:
        # Book title
        Title=i.find('.title').text()
        # Author
        The_author=i.find('.author-item').text()
        # Translator (for translated works).
        # Note: this uses the same selector as the author, so both fields hold the same text.
        The_translator=i.find('.author-item').text()
        # Book rating
        Scores_of_the_book=i.find('.rating-average').text()
        # Number of ratings
        How_many_evaluation=i.find('.ratings-link').text()
        #print(How_many_evaluation)
        # Book price
        The_price=i.find('.original-tag').text()
        #print(The_price)
        # Book description
        Introduction_to_the=i.find('.article-desc-brief').text()
        #print(Introduction_to_the)
        # The keys below are the field names stored in MongoDB:
        # title, author, translator, rating, number of ratings, price, description.
        info={}
        info['书名']=Title
        info['作者']=The_author
        info['译者']=The_translator
        info['书的评分']=Scores_of_the_book
        info['多少人评价']=How_many_evaluation
        info['书的价格']=The_price
        info['书的简介']=Introduction_to_the
        coll.insert_one(info)
        print(info)

'''
def write2csv(page):
    print('Writing to CSV file')
    with open('豆瓣读书热门列表.csv','a',newline='',encoding='utf8') as f:
        fieldnames=['书名','作者','译者','书的评分','多少人评价','书的价格','书的简介']
        writer=csv.DictWriter(f,fieldnames=fieldnames)
        writer.writeheader()
        # Note: parsing() returns None, so writerow() fails on the next lines.
        data=parsing(page)
        writer.writerow(data)
    print('Write succeeded')
'''

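# 744 pages in total
for i in range(0,744):
    try:
        # The `start` parameter advances by 20 per page.
        parsing(i*20)
        # Pause for a random 0 to 9 seconds between requests.
        time.sleep(random.randint(0,9))
    except Exception as e:
        print(e.args)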
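The commented-out write2csv above fails for two reasons: parsing() returns nothing, so writerow() receives None, and the header would be written again on every call. The following is a minimal sketch of one way to write the same fields to CSV. The names `write_rows_to_csv` and `FIELDNAMES` are hypothetical, and the sketch assumes parsing() has been changed to collect its `info` dicts into a list and return it, which the original script does not do.

```python
import csv
import os

# Same column names as the dictionary keys built in parsing().
FIELDNAMES = ['书名', '作者', '译者', '书的评分', '多少人评价', '书的价格', '书的简介']


def write_rows_to_csv(rows, path):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # One placeholder row, just to show the call shape.
    sample = dict.fromkeys(FIELDNAMES, '')
    sample['书名'] = 'example title'
    write_rows_to_csv([sample], 'example_books.csv')
```

Each page's rows can then be appended with one call per page. If MongoDB has already inserted the dicts, they carry an extra `_id` key, so either write to CSV before inserting or pass `extrasaction='ignore'` to `csv.DictWriter`.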
