爬取豆瓣"小说"标签下的图书信息(start=0~1000,共 51 页)

"""Scrape book listings from Douban's 小说 (fiction) tag into a CSV file.

For every listing page (start=0,20,...,1000 — 51 pages of 20 books), extract
each book's title, author/publisher line, rating, rating count and blurb,
and append one CSV row per book.
"""
import requests, time, re, csv
from bs4 import BeautifulSoup
import codecs

CSV_PATH = r'C:\Users\Administrator\Desktop\小说.csv'

# Prepend a UTF-8 BOM so Excel on Windows detects the encoding correctly.
with open(CSV_PATH, 'ab+') as fp:
    fp.write(codecs.BOM_UTF8)

urls = ['https://book.douban.com/tag/小说?start={}&type=T/'.format(str(i)) for i in range(0, 1001, 20)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Compiled once, outside the loop; raw string avoids the invalid-escape warning.
NUM_RE = re.compile(r'\d+')

# `with` guarantees the CSV file is closed even if a request raises midway.
with open(CSV_PATH, 'a+', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(('名称', '作者', '评分', '人数', '简介'))
    for url in urls:
        res = requests.get(url, headers=headers)
        soup = BeautifulSoup(res.text, 'lxml')
        for it in soup.select('.subject-item'):
            name = it.h2.a['title']
            author = it.select_one('.pub').text.strip()
            # Unrated books have no .rating_nums node; select_one returns None.
            score_tag = it.select_one('.rating_nums')
            score = score_tag.text if score_tag else ''
            pl_text = it.select_one('.pl').text.strip()
            match = NUM_RE.search(pl_text)
            # Bug fix: the original wrote the undefined name `num` here
            # (the extracted value was bound to `num2`), crashing with
            # NameError on the first row.
            num = match.group() if match else '0'
            content = it.p.text if it.p else ''
            writer.writerow((name, author, score, num, content))
            time.sleep(1)  # polite crawl delay between books

 

参考:https://blog.csdn.net/zxcjxx/article/details/105317054

  • 1
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值