I've recently been working through Cui Qingcai's classic book on web-crawler development. I had watched a few videos before, but a book is more systematic by comparison and gives me a broader view of the subject. So far I've read the first four chapters, which cover the basics and how to parse data, though I'm still not entirely clear on how to save it. Without further ado, today we'll crawl a novel from 笔趣阁 and save it in a simple way. (I'm a beginner, so corrections are welcome.)
1. First, import the libraries we need:
import requests
from requests.exceptions import RequestException
from lxml import etree
from bs4 import BeautifulSoup
(1) The requests library sends requests to the server and returns the response.
(2) lxml and bs4 are both libraries for parsing the returned data (lxml also serves as BeautifulSoup's parser backend here).
2. Next, we need to collect the chapter URLs from the index page so we can parse each chapter:
This is the index page; the URLs of all the chapter pages (Chapter 1, Chapter 2, ...) live here. So we first gather the URLs on this page, then follow each one to its chapter page and extract the content we need.
overall_url = 'https://www.biqooge.com/14_14838/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'
}
response = requests.get(overall_url, headers=headers)
text = BeautifulSoup(response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'), 'lxml')
requests.get() takes a few basic parameters, here url and headers. url is the address to request; headers is the request header, and its 'User-Agent' field makes the request look like it comes from a browser, so we are less likely to be flagged as a crawler. The call returns a response object, which exposes the body in two ways: text and content. text returns a Unicode string, decoded with an encoding that requests guesses on its own; content returns raw bytes, so we can pick the decoding ourselves. That is why this example uses response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'): decode as UTF-8, then, since the text is Chinese, re-encode it as GBK.
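To make the text/content difference concrete, here is a minimal sketch (the User-Agent is shortened for brevity; response.encoding and response.apparent_encoding are standard requests attributes):

import requests

url = 'https://www.biqooge.com/14_14838/'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# text: requests decodes the bytes itself, using the charset from the
# HTTP headers or a fallback guess -- and that guess can be wrong.
print(response.encoding)            # the encoding requests decided to use
print(response.apparent_encoding)   # the encoding detected from the body
print(type(response.text))          # <class 'str'>

# content: the raw bytes, so the decoding is under our control.
print(type(response.content))       # <class 'bytes'>
page = response.content.decode('utf-8', 'ignore')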
div = text.find('div', id='list')
dl = div.find_all('dl')[0]
dds = dl.find_all('dd')[9:]   # skip the first 9 <dd> entries (the "latest chapters" links at the top)
urls = []
for dd in dds:
    a = dd.find('a')
    # print(dd.string, overall_url[0:-10] + a.get('href'))
    urls.append(overall_url[0:-10] + a.get('href'))   # [0:-10] strips '/14_14838/', leaving the site root
Next we parse the returned document. Looking at the page, the chapter links sit in <dd> tags inside a <dl> inside the <div> tag with id 'list', and the href attribute of the <a> tag inside each <dd> is a chapter URL. We use find() or find_all() to grab a tag: pass the tag name directly, and any attributes as keyword arguments, in the form find_all('tag name', attribute=value). find_all() returns a list, which we then index into, while find() returns the first match directly. To read a tag's attribute, call tag.get('attribute name').
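As a quick self-contained illustration of find/find_all and get (the HTML snippet below is made up to mirror the page structure):

from bs4 import BeautifulSoup

html = '''
<div id="list">
  <dl>
    <dd><a href="/14_14838/0001.html">Chapter 1</a></dd>
    <dd><a href="/14_14838/0002.html">Chapter 2</a></dd>
  </dl>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', id='list')      # first tag matching name and attribute
for dd in div.find_all('dd'):          # find_all returns a list of matches
    a = dd.find('a')
    print(a.string, a.get('href'))     # tag text and the href attribute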
Full code:
def get_specific_url():
    overall_url = 'https://www.biqooge.com/14_14838/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'
    }
    response = requests.get(overall_url, headers=headers)
    text = BeautifulSoup(response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'), 'lxml')
    div = text.find('div', id='list')
    dl = div.find_all('dl')[0]
    dds = dl.find_all('dd')[9:]   # skip the "latest chapters" links at the top
    urls = []
    for dd in dds:
        a = dd.find('a')
        # print(dd.string, overall_url[0:-10] + a.get('href'))
        urls.append(overall_url[0:-10] + a.get('href'))
    return urls
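RequestException is imported at the top but never actually used; a natural use for it is to guard the request so a network hiccup doesn't crash the whole crawl. A minimal sketch (the timeout value and the get_page helper name are my own choices, not part of the original code):

from requests.exceptions import RequestException

def get_page(url, headers):
    # Fetch one page, returning None instead of raising when the request fails.
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()   # raise for 4xx/5xx status codes
        return response
    except RequestException as e:
        print('request failed:', url, e)
        return None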
3. Parse each chapter page to get its text:
Full code:
def parse_one_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55"
    }
    response = requests.get(url, headers=headers)
    html = response.content
    bf = BeautifulSoup(html, 'lxml')                     # instantiate the parser ('lxml' given explicitly to avoid the parser warning)
    result1 = bf.find_all('div', id='content')           # the <div> tags holding the chapter text
    result1 = result1[0].text.replace(' ' * 4, '\n\n')   # text of the first <div>, with the 4-space indents turned into blank lines
    return result1
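etree from lxml is imported at the top but never used either; since XPath is the other parsing approach the book covers, here is a hedged sketch of the same extraction done with XPath instead of BeautifulSoup (same assumption about the page: the chapter text sits in a <div> with id 'content'):

from lxml import etree

def parse_one_page_xpath(url):
    headers = {'User-Agent': 'Mozilla/5.0'}   # shortened for brevity
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.content)       # build the element tree from raw bytes
    # //div[@id="content"]//text() collects every text node under that div
    chunks = html.xpath('//div[@id="content"]//text()')
    return '\n\n'.join(c.strip() for c in chunks if c.strip())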
4. Finally, define a function to write the text to a file:
def write_to_file(text):
    with open('不二之臣.txt', 'a', encoding='utf-8') as f:
        f.write(text)      # write the chapter text
        f.write('\n\n')    # blank line between chapters
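Note that the file is opened in append mode ('a'), so every run of the script adds the whole book to the end of the file again; delete 不二之臣.txt first (or open it once in 'w' mode before the loop) if you want a fresh copy.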
5. Finally, crawl the book:
if __name__ == '__main__':
    for url in get_specific_url():
        endres = parse_one_page(url)
        print(endres)
        write_to_file(endres)
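One practical caveat: this loop fires requests back to back, and real sites often throttle or block that. A minimal tweak using the standard-library time module (which is left commented out in the imports below; the one-second delay is an arbitrary choice of mine):

import time

if __name__ == '__main__':
    for url in get_specific_url():
        endres = parse_one_page(url)
        print(endres)
        write_to_file(endres)
        time.sleep(1)   # pause between chapters to go easy on the server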
Full code:
import requests
from requests.exceptions import RequestException
from lxml import etree
from bs4 import BeautifulSoup
# import json
# import time

def get_specific_url():
    overall_url = 'https://www.biqooge.com/14_14838/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'
    }
    response = requests.get(overall_url, headers=headers)
    text = BeautifulSoup(response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'), 'lxml')
    div = text.find('div', id='list')
    dl = div.find_all('dl')[0]
    dds = dl.find_all('dd')[9:]   # skip the "latest chapters" links at the top
    urls = []
    for dd in dds:
        a = dd.find('a')
        # print(dd.string, overall_url[0:-10] + a.get('href'))
        urls.append(overall_url[0:-10] + a.get('href'))
    return urls

def parse_one_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55"
    }
    response = requests.get(url, headers=headers)
    html = response.content
    bf = BeautifulSoup(html, 'lxml')
    result1 = bf.find_all('div', id='content')           # the <div> holding the chapter text
    result1 = result1[0].text.replace(' ' * 4, '\n\n')   # text of the first <div>, 4-space indents turned into blank lines
    return result1

def write_to_file(text):
    with open('不二之臣.txt', 'a', encoding='utf-8') as f:
        f.write(text)
        f.write('\n\n')

if __name__ == '__main__':
    for url in get_specific_url():
        endres = parse_one_page(url)
        print(endres)
        write_to_file(endres)