Python Web Scraper Example

weixin_42789202

Category: python. Tags: novel scraping, web scraper, python, beginner

Article link: https://blog.csdn.net/weixin_42789202/article/details/88012096

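# A beginner's scraper that downloads a novel from the web; experts can skip this.
# -*- coding:utf-8 -*-

import re
from time import sleep

from bs4 import BeautifulSoup
import requests

# The lines above import the required packages. This version targets Python 3,
# so the Python 2 encoding workaround (reload(sys) plus
# sys.setdefaultencoding('utf-8')) is no longer needed.

# Fill in the headers from your own browser; this User-Agent is only a placeholder.
headers = {'User-Agent': 'Mozilla/5.0'}


def xs2(url):
    path = r'E:/Desktop/img/cc.txt'  # output file; Python 3 handles non-ASCII paths directly
    while url:
        req = requests.get(url, headers=headers).text
        soup = BeautifulSoup(req, 'html.parser')  # BeautifulSoup is used because matching is easy with it
        paragraphs = soup.find_all('p')  # Zongheng puts the chapter body in <p> tags, so match them all
        title_txtbox = soup.find_all(class_='title_txtbox')  # match the title
        with open(path, 'a+', encoding='utf-8') as fn:  # append to the output file
            if title_txtbox:
                fn.write(title_txtbox[0].get_text())
            for p in paragraphs:
                pp = p.get_text()
                fn.write(pp)
                print("Writing " + pp)
            fn.write("\n")  # add a newline after each chapter
        nextchapter = soup.find_all(class_='nextchapter')  # get the link to the next chapter
        # Match the href attribute; (.*?) captures the part we want
        ree = re.findall(r'href="(.*?)"', str(nextchapter))
        # findall returns a list, so take the first match; keep looping until there
        # is no next link. The site may still block you along the way.
        url = ree[0] if ree else None
        # Sleep 2 seconds; requests that come too fast may get blocked by anti-scraping
        # measures. If that happens, switching to different headers may let you continue.
        sleep(2)


if __name__ == '__main__':
    url = 'http://book.zongheng.com/chapter/769917/43006084.html'  # Zongheng novel chapter URL
    xs2(url)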
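
The next-chapter step in xs2 pulls the link out with the regular expression href="(.*?)". Below is a minimal sketch of that step on its own, so it can be tried without hitting the site. The helper name next_chapter_url, the sample fragment, and the chapter number 43006085 are made up for illustration; the sketch also assumes the link might be relative, which is why it goes through urljoin.

```python
import re
from urllib.parse import urljoin


def next_chapter_url(html_fragment, current_url):
    """Return the absolute URL of the first href in html_fragment, or None.

    Hypothetical helper, not part of the original script.
    """
    matches = re.findall(r'href="(.*?)"', html_fragment)
    if not matches:
        return None  # no link found: treat this as the last chapter
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(current_url, matches[0])


# Made-up fragment standing in for str(soup.find_all(class_='nextchapter'))
sample = '[<a class="nextchapter" href="/chapter/769917/43006085.html">Next</a>]'
current = 'http://book.zongheng.com/chapter/769917/43006084.html'

print(next_chapter_url(sample, current))
print(next_chapter_url('[]', current))
```

Returning None when nothing matches gives the caller a clear signal to stop, instead of requesting an empty URL.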