Example 3: Scraping the Full Text of the Novel Doupo Cangqiong (斗破苍穹)

Latest recommended article published on 2021-08-19 18:13:26

Last_xuan1

Category column: # General Crawlers

Article link: https://blog.csdn.net/qq_43391383/article/details/86563474

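Phew, being back home really does feel different. The air felt fresher the moment I stepped out of Guangzhou Station. Home is that familiar yet slightly strange feeling.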

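This example scrapes the full text of the novel Doupo Cangqiong from a Doupo Cangqiong novel site. The first step is to pick the source site carefully, and it should not simply be the top result on Baidu: the official site's robots protocol (robots.txt) spells out its rules clearly, and its text content is hidden in the page as well. Nothing to be done about that, so we scrape a different site instead, haha.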
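
As a rough illustration of what checking the robots protocol looks like in code, here is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt content and the `example.com` URLs are made up for the demonstration and are not taken from either novel site; against a real site you would point the parser at its robots.txt with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site's rules will differ.
sample_robots = """
User-agent: *
Disallow: /book/
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed here.
parser.parse(sample_robots.splitlines())

# Ask whether a crawler with any user agent may fetch each URL.
allowed_home = parser.can_fetch('*', 'https://example.com/index.html')
allowed_book = parser.can_fetch('*', 'https://example.com/book/1.html')

print(allowed_home)  # True: no rule covers /index.html
print(allowed_book)  # False: /book/ is disallowed for every user agent
```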

Approach: the usual options are 1. requests + BeautifulSoup 2. requests + re

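Let's compare the two approaches, starting with the requests + re route.
As shown in the figure:

[image]
All the information we want to extract sits inside p tags,
so should we use re.findall with a regular expression to match the p tags?
See the next figure:

There is another p tag further down the page, so findall would pick up one extra sentence.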
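
To make the problem concrete, here is a minimal sketch of the requests + re route run against a small hand-written HTML snippet. The snippet is an assumption standing in for the real page (only the `articlecon` class name comes from the page itself), and the pattern is a deliberately naive one. It shows how matching every p tag also captures the stray paragraph outside the chapter body.

```python
import re

# Hypothetical stand-in for the chapter page: two paragraphs of chapter
# text inside the content div, plus one unrelated <p> further down.
sample_html = """
<div class="articlecon">
<p>First paragraph of the chapter.</p>
<p>Second paragraph of the chapter.</p>
</div>
<p>Unrelated footer sentence.</p>
"""

# A naive pattern that captures the contents of every <p> tag on the page.
paragraphs = re.findall(r'<p>(.*?)</p>', sample_html, re.S)

for p in paragraphs:
    print(p)
# Three entries are printed: the footer sentence is captured as well.
```
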
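As in Example 2, we can simply use select. A search through the whole page source shows that the <div class="articlecon"> tag appears exactly once, and since it is unique, soup.select() gets the job done.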
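
Here is a small sketch of how the `div.articlecon > p` selector narrows the match to the chapter body. The HTML snippet is made up for illustration, and the built-in `'html.parser'` is used instead of `'lxml'` only so that the sketch runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the chapter page: the content div holds the
# chapter text, and one unrelated <p> sits outside it.
sample_html = """
<div class="articlecon">
<p>First paragraph of the chapter.</p>
<p>Second paragraph of the chapter.</p>
</div>
<p>Unrelated footer sentence.</p>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# "div.articlecon > p" selects only <p> tags that are direct children
# of the div whose class is articlecon.
texts = [p.get_text() for p in soup.select('div.articlecon > p')]

print(texts)
# ['First paragraph of the chapter.', 'Second paragraph of the chapter.']
```

The footer paragraph is left out because it is not a child of the content div, which is exactly what the plain p-tag match could not guarantee.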

from bs4 import BeautifulSoup
import requests
import time

# Browser-like User-Agent header sent with every request
key_value = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                           '(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}


def get_html(url):
    """Return the page text, or None if the request fails."""
    try:
        web_data = requests.get(url, headers=key_value, timeout=10)
        web_data.raise_for_status()
        # Use the encoding detected from the page content
        web_data.encoding = web_data.apparent_encoding
        return web_data.text
    except requests.RequestException:
        return None


def write_file(html, file):
    """Append the text of each <p> directly inside div.articlecon to file."""
    soup = BeautifulSoup(html, 'lxml')
    texts = soup.select('div.articlecon > p')
    for text in texts:
        real_text = text.get_text()
        file.write(real_text + '\n')


if __name__ == '__main__':
    part_url = 'https://m.doupocangqiong1.com/1/t'
    # The with block closes the file even if the loop raises an error
    with open('D:/doupoxiaoshuo.txt', 'a+', encoding='UTF-8') as file:
        for i in range(20, 1677):
            real_url = part_url + str(i) + '.html'
            html = get_html(real_url)
            if html is not None:  # skip pages that failed to download
                write_file(html, file)
            time.sleep(0.5)
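
Before launching the full crawl, it can help to see how many requests the loop will make and how long the pauses alone will take. The sketch below rebuilds the URL list from the same `part_url` and `range(20, 1677)` used in the main block; the `chapter_urls` helper is a hypothetical name introduced only for this illustration.

```python
def chapter_urls(part_url, start, stop):
    """Yield one page URL per index in range(start, stop)."""
    for i in range(start, stop):
        yield part_url + str(i) + '.html'


urls = list(chapter_urls('https://m.doupocangqiong1.com/1/t', 20, 1677))

print(len(urls))   # 1657 pages in total
print(urls[0])     # https://m.doupocangqiong1.com/1/t20.html
print(urls[-1])    # https://m.doupocangqiong1.com/1/t1676.html

# With a 0.5 second pause after every page, sleeping alone takes:
min_seconds = len(urls) * 0.5
print(min_seconds)  # 828.5 seconds, roughly 14 minutes
```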