How to Use Web Scraping and Data Analysis to Choose Your First Novel's Genre: A Practical Guide for New Writers
Chapter 1: Acquiring and Parsing Novel Data
Preface
The main purpose of this article is to show how web scraping and data analysis can guide the choice of a first novel's genre and improve a new author's chances of success.
I. Key Novel Data
II. Acquiring Novel Data
1. Scrape the URLs of novels strongly recommended on Qidian within the past 3 months
The code (example):
import requests
from bs4 import BeautifulSoup

headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.289 Mobile Safari/537.36',
'Cookie': 'newstatisticUUID=1687606367_135137479; fu=1132719504; supportwebp=true; supportWebp=true; _ga_D20NXNVDG2=GS1.1.1698680222.2.0.1698680231.0.0.0; _ga_VMQL7235X0=GS1.1.1698680222.2.0.1698680231.0.0.0; _csrfToken=8c77025c-97fa-44ff-8029-a31ab8aa56f9; traffic_utm_referer=https%3A//www.baidu.com/link; Hm_lvt_f00f67093ce2f38f215010b699629083=1710860919,1710935729,1711019219,1711187302; Hm_lpvt_f00f67093ce2f38f215010b699629083=1711187302; _yep_uuid=831915fd-ea5d-b598-f669-6482d91cd7e2; _gid=GA1.2.1754461931.1711187303; _gat_gtag_UA_199934072_2=1; _ga_FZMMH98S83=GS1.1.1711187302.10.0.1711187302.0.0.0; _ga=GA1.1.1178557524.1687606367; _ga_PFYW0QLV3P=GS1.1.1711187302.10.0.1711187302.0.0.0; e1=%7B%22l6%22%3A%22%22%2C%22l1%22%3A%22%22%2C%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22l6%22%3A%22%22%2C%22l1%22%3A10%2C%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A114%22%7D; w_tsfp=ltvgWVEE2utBvS0Q6aLhkkynFT07Z2R7xFw0D+M9Os09AaIpVZ2F1IN9udfldCyCt5Mxutrd9MVxYnGAV94ifhEdRsWTb5tH1VPHx8NlntdKRQJtA83YW1YXKrIh7TVFKT8LcBGy2D15IoFByeNmiA0EsSEg37ZlCa8hbMFbixsAqOPFm/97DxvSliPXAHGHM3wLc+6C6rgv8LlSgS3A9wqpcgQ2Xusewk+A1SgfDngj4RG7dOldNRytI86vWO0wrTPzwjn3apCs2RYx/UJk6EtuWZaxhCfAPX4VKFhsbVzg1Lkkfqf4PuFx6jcbVKQcGg8SoF4Yt+s66wk=',
}  # Request headers; without the Cookie the request returns HTTP 202

########## Get the past-strong-recommendation URL from the Qidian homepage
url = r'https://www.qidian.com/'  # Qidian homepage
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')  # parse the returned page
div_strongrecWrap = soup.find('div', 'book-list-wrap mr30 fl')  # locate the past-strong-recommendation block by its div class
a_tag = div_strongrecWrap.find('a')
strongrecWrap = r'https:' + a_tag['href']  # build the absolute URL from that block's link

########## Collect the URLs of novels strongly recommended in the past 3 months
response = requests.get(strongrecWrap, headers=headers)  # fetch the past-recommendation page
soup = BeautifulSoup(response.text, 'html.parser')
week_lists = soup.findAll('li', 'strongrec-list book-list-wrap')  # one entry per week within the past 3 months
timeList = []
novel_href_list = []
for week in week_lists:  # `week` avoids shadowing the built-in `list`
    start_date = week.find('span', 'date-from').text  # start date of that week
    novelLists = week.findAll('li')
    for novel in novelLists:
        a_tag = novel.find('a', 'name')
        novelHref = r'https:' + a_tag['href']  # URL of each novel
        print(a_tag)
        timeList.append(start_date)  # recommendation date of each novel
        novel_href_list.append(novelHref)  # novel URLs within the past 3 months
The output looks like this:
香江之狼 read online: https://www.qidian.com/book/1038879878/
华娱前夫哥 read online: https://www.qidian.com/book/1038565435/
从电影抽取技能 read online: https://www.qidian.com/book/1038934101/
这个反派过于年幼 read online: https://www.qidian.com/book/1038873193/
我戾太子只想被废 read online: https://www.qidian.com/book/1038971080/
圣杯战争?龙珠战争! read online: https://www.qidian.com/book/1038941759/
How to locate the corresponding page elements is shown below:
2. Fetch each novel's detailed data from its URL
The code:
############ Fetch each novel's detailed data from its URL
novelInfoList = []  # master list of records for novels from the past 3 months
for novelHref in novel_href_list:
    response = requests.get(novelHref, headers=headers)  # fetch the novel's detail page
    soup = BeautifulSoup(response.text, 'html.parser')
    bookName = soup.find('h1', id='bookName').text  # novel title
    book_attribute = soup.find('p', 'book-attribute')
    channel_type = book_attribute.findAll('a')
    channel = channel_type[0].text  # channel
    novel_type = channel_type[1].text  # genre; `novel_type` avoids shadowing the built-in `type`
    count = soup.find('p', 'count').findAll('em')
    total_num_word = count[0].text  # total word count
    total_recommend = count[1].text  # total recommendations
    week_recommend = count[2].text  # weekly recommendations
    mouth_count = soup.find('p', id='monthCount').text  # monthly tickets
    writer_name = soup.find('a', 'writer-name').text  # author name
    level = soup.find('div', 'outer-intro').find('p').text  # author level
    authorInfo = soup.findAll('em', 'color-font-card')
    works_num = authorInfo[0].text  # number of works by this author
    total_words = authorInfo[1].text  # total words the author has written
    write_days = authorInfo[2].text  # number of days the author has been writing
    ##### Gather the free chapters (the words published before the novel went premium)
    catalog_volumes = soup.findAll('div', 'catalog-volume')
    chapter_itemList = []  # the free chapters may span several volumes; collect them all
    for catalog_volume in catalog_volumes:
        free = catalog_volume.find('span', 'free')
        if free:
            chapter_items = catalog_volume.findAll('li', 'chapter-item')
            chapter_itemList.extend(chapter_items)
    ##### Count the words in the free chapters
    novelWords = 0  # free word count of this novel
    for chapter_item in chapter_itemList:
        chapter_href = r'https:' + chapter_item.find('a', 'chapter-name')['href']
        chapter_response = requests.get(chapter_href, headers=headers)  # fetch the chapter page
        chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
        chapter_main = chapter_soup.find('main', 'content mt-1.5em text-s-gray-900 leading-[1.8] relative z-0 r-font-black')
        content_texts = chapter_main.findAll('p')
        novelText = ''.join([_.text for _ in content_texts])  # chapter text
        novelText = ''.join(novelText.split())  # strip all whitespace
        novelWords += len(novelText)
    print(novelWords)
    ############ Assemble this novel's record
    novelDict = {
        'bookName': bookName,
        'channel': channel,
        'type': novel_type,
        'total_num_word': total_num_word,
        'total_recommend': total_recommend,
        'week_recommend': week_recommend,
        'mouth_count': mouth_count,
        'writer_name': writer_name,
        'level': level,
        'works_num': works_num,
        'total_words': total_words,
        'write_days': write_days,
        'novelWords': novelWords,
    }
    novelInfoList.append(novelDict)  # add to the master list
Sample records look like this (fetching the free-chapter word count is slow, so `novelWords` is omitted here):
{'bookName': '香江之狼', 'channel': '都市', 'type': '都市生活', 'total_num_word': '45.04万', 'total_recommend': '1.51万', 'week_recommend': '267', 'mouth_count': '1451', 'writer_name': '任猪飞', 'level': '阅文集团Lv.5作家', 'works_num': '4', 'total_words': '678万', 'write_days': '816'}
{'bookName': '华娱前夫哥', 'channel': '都市', 'type': '娱乐明星', 'total_num_word': '30.66万', 'total_recommend': '1.44万', 'week_recommend': '169', 'mouth_count': '1418', 'writer_name': '蚕食鲸', 'level': '阅文集团Lv.1作家', 'works_num': '1', 'total_words': '31万', 'write_days': '56'}
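Because the crawl is slow, it is worth persisting `novelInfoList` before moving on, so the analysis in later chapters does not have to re-crawl. A minimal standard-library sketch (the filename is an arbitrary choice, not anything the site requires):

```python
import json

def save_novel_info(novel_info_list, path='novel_info.json'):
    """Write the scraped novel records to a human-readable UTF-8 JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(novel_info_list, f, ensure_ascii=False, indent=2)

def load_novel_info(path='novel_info.json'):
    """Read the records back for the analysis step."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

After the scraping loop finishes, a single `save_novel_info(novelInfoList)` call is enough; `ensure_ascii=False` keeps the Chinese titles readable in the file.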
Summary
With a carefully designed scraper, we have now collected detailed data on the novels strongly recommended on Qidian over the past 3 months, including each novel's genre, recommendation counts, monthly tickets, author level, and free word count before going premium.
Next, we will dig into this data to look for patterns and trends that can give new novel writers targeted guidance. To keep the scraper from being IP-banned when the request volume grows, we will also rotate proxy IPs so that data collection stays stable and uninterrupted.
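The proxy-rotation idea can be kept very small: hold a pool of proxy addresses and hand them out round-robin, one per request. A minimal sketch of just the rotation logic (the addresses in `PROXY_POOL` are placeholders, not real proxies):

```python
import itertools

# Placeholder proxy pool -- these addresses are examples, not working proxies.
PROXY_POOL = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]
_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxy():
    """Return the next pool entry in round-robin order, shaped the way
    requests expects its `proxies` argument."""
    proxy = next(_proxy_cycle)
    return {'http': proxy, 'https': proxy}
```

Each call such as `requests.get(url, headers=headers, proxies=next_proxy())` then goes out through a different proxy, spreading the request volume across the pool.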
In addition, to improve readability and reuse, we will wrap the repeated code into reusable function modules, and add handling for failed requests so that the collected data stays complete and accurate.
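The fetch step repeated throughout this chapter, together with the failure handling just mentioned, could be wrapped in one helper. A sketch under one assumption: the `get` callable is injected (e.g. `requests.get` in real use) so the retry logic itself can be tested without touching the network:

```python
import time

def fetch_with_retry(url, get, headers=None, retries=3, delay=1.0):
    """Call `get(url, headers=headers)` until it returns a response with
    status_code 200, retrying up to `retries` times with a growing pause;
    re-raise the last failure if every attempt fails."""
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, headers=headers)
            if response.status_code == 200:
                return response
            last_error = RuntimeError(f'status {response.status_code}')
        except Exception as e:
            last_error = e
        time.sleep(delay * (attempt + 1))  # simple linear backoff between attempts
    raise last_error
```

In the scraper above, `response = requests.get(novelHref, headers=headers)` would become `response = fetch_with_retry(novelHref, requests.get, headers=headers)`, which also covers the 202 responses Qidian returns when the Cookie is missing.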
Finally, based on the analysis of this data, we will attempt to write a first novel, and hopefully deliver a fresh reading experience. Stay tuned as this writing journey unfolds.