Python Web Scraping with the bs4 Library: A Python 3 requests + BeautifulSoup4 (BS4) Tutorial

Latest recommended article published on 2023-12-05 11:38:09

weixin_39997310


I had only just started learning web scraping with Python and couldn't wait to find a site to practice on, so I picked 新笔趣阁, a novel site.

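Prerequisites

Install Python and the required modules (requests, bs4). If you are not familiar with requests and bs4, skim their official documentation first and then come back to this tutorial.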

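Crawler approach

Beginners writing their first crawler often ask the same question: when does the crawler stop? The answer: a crawler imitates a person browsing the site, so it stops when the page no longer has a "next" link.

1. Keep the links still to be crawled in a queue. Take one link from the queue each time; when the queue is empty, the program ends.

2. Send the request with requests, parse the response with bs4, extract the useful information, and put the "next" link into the queue.

3. Write the result to a txt file (the script uses Python's built-in open()).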
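The three steps can be tried out without touching the network. The following is a minimal illustration, not the tutorial's actual script: `FAKE_SITE`, `fetch`, and `crawl` are hypothetical names, and a dictionary stands in for the real website so that the stopping condition is easy to see.

```python
import queue

# Hypothetical stand-in for a website: each URL maps to (chapter text, next URL).
# A next URL of None means the page has no "next" link.
FAKE_SITE = {
    "/ch1": ("Chapter 1", "/ch2"),
    "/ch2": ("Chapter 2", "/ch3"),
    "/ch3": ("Chapter 3", None),
}


def fetch(url):
    """Pretend to download and parse a page (assumption: no real HTTP here)."""
    return FAKE_SITE[url]


def crawl(start_url):
    """Follow "next" links from start_url and return the chapters in order."""
    q = queue.Queue()
    q.put(start_url)
    chapters = []
    while not q.empty():              # step 1: stop when the queue is empty
        text, next_url = fetch(q.get())
        chapters.append(text)         # step 3 would write this to a file
        if next_url is not None:      # step 2: queue the next link if present
            q.put(next_url)
    return chapters


print(crawl("/ch1"))
```

Because each page contributes at most one new link, the queue never holds more than one URL at a time; it mainly serves as a clear "is there more work?" signal for the loop.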

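The code

Add the site's domain and its IP address to your hosts file so that DNS resolution is skipped. Without this, the script stalled for me after running for a while.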
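Another way to guard against a stalled run, independent of the hosts file, is to give every request a timeout and retry a few times before giving up (requests applies no timeout unless one is passed). The sketch below shows only the retry logic. `fetch_with_retry` and its parameters are hypothetical names, and `fetch` is assumed to be any callable that takes a URL and either returns the page text or raises an exception.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying up to `attempts` times before re-raising."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in real use, catch requests.RequestException
            last_error = exc
            print("attempt %d for %s failed: %s" % (attempt, url, exc))
            if attempt < attempts:
                time.sleep(delay)  # wait before trying again
    raise last_error
```

With requests, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, so that a stalled connection raises `requests.Timeout` instead of blocking indefinitely. The 10-second value is only an example.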

'''
Scrape a single novel from 新笔趣阁 https://www.xbiquge6.com/
Pipeline: requests - bs4 - txt
Python version: 3.7
OS: Windows 10
'''

import queue
import time

import requests
from bs4 import BeautifulSoup

# Queue of chapter URLs still to be fetched
q = queue.Queue()


# Fetch a page and return its HTML, or None if the request fails
def get_content(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    try:
        # requests has no default timeout, so set one to avoid waiting forever
        r = requests.get(url=url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException as e:
        print("Error '%s' happened while fetching %s" % (e, url))
        return None


# Parse a chapter page, save it, and queue the next chapter
def parse_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    chapter = soup.find(name='div', class_="bookname").h1.text
    text = soup.find(name='div', id="content").text
    save(chapter, text)
    # The third link in the bottom navigation bar is taken as "next chapter"
    next1 = soup.find(name='div', class_="bottem1").find_all('a')[2].get('href')
    # On the last chapter this link points back to the novel's index page,
    # so only queue it when it is a real chapter link
    if next1 != '/0_638/':
        q.put(base_url + next1)
    print(next1)


# Append one chapter to the txt file
def save(chapter, content):
    filename = "修罗武神.txt"
    # The with block closes the file even if a write fails
    with open(filename, "a", encoding='utf-8') as f:
        f.write(chapter + '\n')
        # Strip all whitespace from the chapter body
        f.write("".join(content.split()) + '\n')


# Main program
def main():
    start_time = time.time()
    q.put(first_url)
    # Keep going until the queue is empty
    while not q.empty():
        content = get_content(q.get())
        if content is None:
            # Stop if a page could not be fetched
            break
        parse_content(content)
    end_time = time.time()
    project_time = end_time - start_time
    print('Elapsed time (s):', project_time)


# Site root and the first chapter to fetch
base_url = 'https://www.xbiquge6.com'
first_url = 'https://www.xbiquge6.com/0_638/1124120.html'

if __name__ == '__main__':
    main()
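The script builds the next URL by concatenating `base_url` and the link's href, and it recognises the last chapter by comparing the href with `'/0_638/'`. A slightly more general variant, sketched here with the standard library's `urllib.parse.urljoin`, resolves the href against the current page so that relative and absolute links both work. `next_chapter_url` and `index_url` are hypothetical names, the chapter href in the example is made up for illustration, and the assumption that the last chapter links back to the index page is taken from the check in the script.

```python
from urllib.parse import urljoin


def next_chapter_url(page_url, href, index_url):
    """Return the absolute URL of the next chapter, or None at the end.

    href is the raw value of the "next" link's href attribute.
    """
    if not href:
        return None
    absolute = urljoin(page_url, href)
    # Assumption: on the last chapter the "next" link points at the index page
    if absolute.rstrip('/') == index_url.rstrip('/'):
        return None
    return absolute


page = 'https://www.xbiquge6.com/0_638/1124120.html'
index = 'https://www.xbiquge6.com/0_638/'
# Made-up chapter href, used only to show how the link is resolved
print(next_chapter_url(page, '/0_638/1124121.html', index))
# A link back to the index page means there is no next chapter
print(next_chapter_url(page, '/0_638/', index))
```

Returning None for "no next chapter" keeps the stopping condition in one place, so the main loop only needs to check whether there is something to put in the queue.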

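Summary

The result was reasonably successful, but the process was slow: the program took about an hour and a half to run. I'll keep learning. If you have ideas for improving it, please share them so we can compare notes.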

QQ:1156381157
