Crawler Code

Latest recommended article published 2024-05-27 13:16:19

bitwind


Category: python  Tags: python crawler

Article link: https://blog.csdn.net/bitwind/article/details/83280640


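This code implements a web crawler in Python. It mainly uses the BeautifulSoup library to parse HTML pages, extract a novel's chapter titles and content, and save them to a local file. The crawl consists of fetching the page HTML, parsing the chapter links, downloading the chapters in order, and displaying progress during the download.

Summary generated by CSDN using AI technology.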
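The "download the chapters in order" step in the summary can be illustrated with a small, standard-library-only sketch. It is not taken from the original listing: the helper name `order_chapter_links` and the sample hrefs are hypothetical, and it assumes that chapter pages end in a numeric id followed by `.html` and that a larger id means a later chapter, which may not hold for every site.

```python
import re

# Hypothetical helper, not part of the original listing.
# Assumption: chapter hrefs end in "<digits>.html".
CHAPTER_HREF = re.compile(r'(\d+)\.html$')

def order_chapter_links(hrefs):
    """Return unique chapter hrefs sorted by their numeric page id."""
    numbered = {}
    for href in hrefs:
        match = CHAPTER_HREF.search(href)
        if match is None:
            continue  # skip links that are not chapter pages
        # Keep the first href seen for each id so repeated links collapse.
        numbered.setdefault(int(match.group(1)), href)
    return [numbered[key] for key in sorted(numbered)]

# Made-up sample hrefs for illustration only.
sample = ['/14_14055/10.html', '/14_14055/9.html', '/index.html', '/14_14055/10.html']
print(order_chapter_links(sample))
```

Sorting on the integer id rather than the string avoids `10.html` being placed before `9.html`, and the dictionary removes duplicates such as a "latest chapters" box that repeats links from the full list.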

Crawler code, kept here as a note to self.
#coding=utf-8
#__author__ = chengzhipeng

import re
import os
import sys
from bs4 import BeautifulSoup
from urllib import request
import ssl
# url = 'http://www.biqiuge.com/book/4772/'
# url = 'https://www.qu.la/book/1/'
url = 'http://www.biquge.com.tw/14_14055/'

def getHtmlCode(url):
    # Fetch the page and parse it into a BeautifulSoup tree.
    with request.urlopen(url) as page:
        html = page.read()
    htmlTree = BeautifulSoup(html,'html.parser')
    return htmlTree
    #return htmlTree.prettify()

def getKeyContent(url):
    htmlTree = getHtmlCode(url)

def parserCaption(url):
    htmlTree = getHtmlCode(url)
    storyName = htmlTree.h1.get_text() + '.txt'

    print('Novel title:',storyName)
    aList = htmlTree.find_all('a',href=re.compile(r'\d+\.html'))  # aList is a list of Tag objects (class = Tag); convert each to str before writing to a file
    #print(int(aList[1]['href'][0:-5]))
    print(aList)
    aDealList = []
    for line in a
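
The listing is cut off at this point. As a rough sketch of the last steps the summary describes (saving chapters in order while showing progress), and not the author's code, the loop below writes chapters to one text file. The function name `save_chapters`, its parameters, and the `fetch_text` callable are all hypothetical; `fetch_text` stands in for whatever fetches a chapter page and extracts its text.

```python
import sys

def save_chapters(chapters, fetch_text, out_path, out=sys.stdout):
    """Write each chapter to out_path in order, reporting progress.

    chapters   -- list of (title, url) pairs, already in reading order
    fetch_text -- hypothetical callable: url -> chapter body as str
    out        -- stream that receives the progress line
    """
    total = len(chapters)
    with open(out_path, 'w', encoding='utf-8') as book:
        for index, (title, url) in enumerate(chapters, start=1):
            book.write(title + '\n\n' + fetch_text(url) + '\n\n')
            percent = index * 100 // total
            # '\r' returns to the start of the line so the counter updates in place.
            out.write('\rDownloading: {}/{} ({}%)'.format(index, total, percent))
            out.flush()
    out.write('\n')
    return total
```

In a real run, `fetch_text` would presumably build on `getHtmlCode(url)` plus a site-specific selector for the chapter body, which the visible part of the listing does not show. Passing it in as a parameter also makes the loop easy to try with a stub that returns fixed text.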
