This experiment crawls a free online-novel site, scraping the full text of every chapter of a novel by following its table of contents. Below, the scraping tools used in the experiment are introduced along with a walkthrough of the code flow, followed by the complete code and the run results.
1. Scraping Tools
The experiment is written in Python, using the requests library to fetch pages and the BeautifulSoup library to parse them. The specific calls used are:
requests.get(url, headers=headers, timeout=30):
Purpose: sends an HTTP GET request to the given URL and, if the server allows access, returns the full page content (the page source, commonly called the HTML).
url: the target URL
headers: the request headers (used here to supply a User-Agent)
timeout: the time limit for the request, in seconds; if it is exceeded, an exception is raised
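As an illustration, here is a minimal sketch of this call; the URL is a placeholder (example.com), not the novel site used below:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # example User-Agent string
try:
    resp = requests.get('https://example.com', headers=headers, timeout=30)
    resp.raise_for_status()     # raise an exception on 4xx/5xx status codes
    print(resp.text[:100])      # first 100 characters of the page source
except requests.exceptions.Timeout:
    print('request timed out')  # raised when the 30-second limit is exceeded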
soup = BeautifulSoup(page_text, 'lxml')
Purpose: loads page source fetched from the web into a BeautifulSoup object for parsing,
where page_text = requests.get(url).content (the raw bytes of the response body).
soup.select()
Purpose: selects elements with a CSS selector; supported forms include the tag selector (a), the class selector (.), the id selector (#), and hierarchical (descendant/child) selectors.
In this program it is used to pick the novel's chapter links out of the table-of-contents page.
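A toy example of constructing a BeautifulSoup object and using soup.select(); the HTML string here is made up, though the selector mirrors the one used in the real program:

from bs4 import BeautifulSoup

html = '<ul class="mulu_list"><li><a href="1.html">Chapter 1</a></li><li><a href="2.html">Chapter 2</a></li></ul>'
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('.mulu_list > li'):  # class selector plus child combinator
    print(li.a.string, li.a['href'])       # Chapter 1 1.html / Chapter 2 2.html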
soup.find()
Purpose: returns the first tag that matches the given name and attributes.
In this program it is used to extract the body text of each chapter.
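A similar toy example of soup.find(); the id "htmlContent" matches the one on the chapter pages scraped below, but the HTML itself is invented:

from bs4 import BeautifulSoup

html = '<div id="htmlContent"><p>First paragraph.</p></div><div>other text</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='htmlContent')  # first <div> whose id attribute matches
print(div.text)                           # First paragraph.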
2. Complete Code
import random
import requests
import time
from bs4 import BeautifulSoup

# URL of the novel's table-of-contents page
url = 'https://www.biquge3.cc/article/50645/'

# Pool of User-Agent strings; one is picked at random so requests look less uniform
user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) Gecko/20100101 Firefox/61.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6.1 Safari/605.1.15",
]
headers = {'User-Agent': random.choice(user_agent_list)}

# Fetch the table-of-contents page as raw bytes; BeautifulSoup detects the encoding itself
page_text = requests.get(url=url, headers=headers, timeout=30).content
soup = BeautifulSoup(page_text, 'lxml')

# Every <li> under the element with class "mulu_list" is one chapter entry
li_list = soup.select('.mulu_list > li')

fp = open('青帝.txt', 'w', encoding='utf-8')
for li in li_list:
    if li.a is None:  # skip placeholder <li> items that carry no link
        continue
    title = li.a.string
    # chapter hrefs are relative, so append them to the table-of-contents URL
    detail_url = url + li.a['href']
    detail_page_text = requests.get(url=detail_url, headers=headers, timeout=30).content
    soup = BeautifulSoup(detail_page_text, 'html.parser')
    # the chapter body sits in <div id="htmlContent">
    content = soup.find('div', id="htmlContent").text
    fp.write(title + '\n' + content + '\n')
    print(title, ': scraped successfully')
    time.sleep(5)  # pause between chapters to avoid hammering the server
fp.close()
3. Results