My First Python Crawler - CSDN Blog

Article link: https://blog.csdn.net/weixin_46102027/article/details/115893795

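First, install the required third-party packages: requests and BeautifulSoup4 (bs4 for short). To install: pip install requests, and likewise pip install beautifulsoup4 for bs4.
For this experiment I will scrape a chapter of a novel. Novel link: http://www.xbiquge.la/0/15/12961.html (Wu Dong Qian Kun)
Check whether the page is encoded as utf-8 or gbk: press F12 on the page to open the developer tools, then type document.charset in the console to see the encoding (the page I chose is utf-8).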
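The same check can also be done from Python instead of the browser console. The sketch below is only an illustration: `sniff_charset` is a hypothetical helper (not part of requests or bs4) that looks for a charset declaration in a `<meta>` tag, and the sample bytes are made up. requests also exposes `response.apparent_encoding`, which guesses the encoding from the body.

```python
import re

def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Charset declarations live in <head>, so only the first few KB are searched.
    head = raw[:4096].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', head, re.I)
    return match.group(1).lower() if match else default

# Made-up sample; with requests, response.content would be passed instead.
sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gbk"></head></html>'
print(sniff_charset(sample))
```

The returned name can be assigned to `response.encoding` before reading `response.text`.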
With the page URL and its encoding in hand, we can send the request. First import the libraries:

from bs4 import BeautifulSoup  # imported, but not used in this single-page experiment
import requests
import re

Send the request:

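url = "http://www.xbiquge.la/0/15/12961.html"
response = requests.get(url)
# Set the encoding explicitly (use 'gbk' for a GBK page). requests takes the
# encoding from the HTTP headers and falls back to ISO-8859-1 when a text
# response declares no charset, so even a UTF-8 page can come out garbled.
response.encoding = 'utf-8'
html = response.text  # source of the chapter page
print(html)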
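A likely explanation for the garbled text when the encoding is not set: the bytes are UTF-8 but get decoded as ISO-8859-1. Whether this particular site omits the charset in its headers is an assumption; the sketch below only shows the effect, using a short byte string as a stand-in for `response.content`.

```python
# Stand-in for response.content: the novel title encoded as UTF-8 bytes.
body = "武动乾坤".encode("utf-8")

as_latin1 = body.decode("iso-8859-1")  # wrong guess: every byte becomes one character
as_utf8 = body.decode("utf-8")         # correct decoding

print(as_latin1)
print(as_utf8)

# The damage is reversible as long as no bytes were dropped.
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

Setting `response.encoding` before reading `response.text` avoids the wrong guess in the first place.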


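fb = open('%s.txt' % 1, 'w', encoding='utf-8')  # create 1.txt to hold the text taken from the page
# Find the chapter body inside <div id="content"></div>; re.S lets "." span line breaks.
# The div can be located either in the fetched source or in the original page.
chapter_content = re.findall(r'<div id="content">(.*?)</div>', html, re.S)[0]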

From the page source you can see that the body text sits inside <div id="content"></div>.
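One detail worth checking with this kind of pattern: by default "." does not match a newline, so a content div that spans several lines is missed unless `re.S` is passed. A small sketch on a made-up fragment (the class names and text are hypothetical, not taken from the real page):

```python
import re

# Hypothetical page fragment; the real chapter page is much longer.
sample_html = '''<div class="bookname"><h1>Chapter 1</h1></div>
<div id="content">&nbsp;&nbsp;&nbsp;&nbsp;First paragraph.<br/><br/>
&nbsp;&nbsp;&nbsp;&nbsp;Second paragraph.</div>
<div class="footer">footer</div>'''

pattern = r'<div id="content">(.*?)</div>'
without_flag = re.findall(pattern, sample_html)     # "." stops at the line break
with_flag = re.findall(pattern, sample_html, re.S)  # "." also matches "\n"

print(len(without_flag))
print(len(with_flag))
```

Indexing the result with `[0]` raises IndexError when nothing matches, so checking the list length first makes failures easier to diagnose.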
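Next, remove the &nbsp; entities and deal with the <br/><br/> tags. Note that <br/> is the HTML line break, so it must not be replaced with an empty string: the resulting text would have no line breaks at all and be painful to read. Replace <br/><br/> with Python's newline \n instead.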

chapter_content=chapter_content.replace('&nbsp;','')
chapter_content=chapter_content.replace('<br/><br/>','\n')
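Pages do not always write the line break exactly as `<br/><br/>`, so a slightly more tolerant cleanup may help. The following is a sketch using only the standard library; `clean_chapter` is a hypothetical helper and the input string is made up.

```python
import re
from html import unescape

def clean_chapter(fragment: str) -> str:
    """Turn the inner HTML of the content div into plain text."""
    # One or more consecutive <br>, <br/> or <br /> tags become a single newline.
    text = re.sub(r'(?:<br\s*/?>\s*)+', '\n', fragment, flags=re.I)
    # Decode entities such as &nbsp; and &amp;, then drop the non-breaking spaces.
    text = unescape(text).replace('\xa0', '')
    return text.strip()

raw = '&nbsp;&nbsp;First line.<br/><br/>&nbsp;&nbsp;Second line.<br />\n&nbsp;Third &amp; last.'
print(clean_chapter(raw))
```

Collapsing each run of break tags into one newline keeps paragraphs apart without leaving blank lines in the output.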

Finally, save the page content to the newly created TXT file:

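# The chapter title is not extracted in this experiment, so only the body is written.
fb.write('\n')
fb.write(chapter_content)
fb.write('\n')
fb.close()  # close the file so the text is flushed to disk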
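If a chapter title is available, the saving step could be wrapped as below. This is a sketch: `save_chapter` is a hypothetical helper, the title and text are placeholders, and the file goes to the system temp directory just to keep the example side-effect free. The `with` block closes the file automatically.

```python
import os
import tempfile

def save_chapter(path, title, content):
    """Write one chapter to a UTF-8 text file; the file is closed on exit."""
    with open(path, 'w', encoding='utf-8') as fb:
        fb.write(title + '\n')
        fb.write(content + '\n')

# Placeholder values; in the crawler these would come from the parsed page.
out_path = os.path.join(tempfile.gettempdir(), '1.txt')
save_chapter(out_path, 'Chapter 1', 'First line.\nSecond line.')
```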

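The finished product: [image]
This is only a simple scraping experiment; the code used to verify the idea is listed below. It does not yet do what I want, so there is also code that scrapes the whole novel starting from the table-of-contents page. The complete version is linked at the end of the article.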

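from bs4 import BeautifulSoup  # imported, but not used in this single-page experiment
import requests
import re

url = "http://www.biquku.la/0/424/229657.html"
response = requests.get(url)
response.encoding = 'utf-8'  # set explicitly; use 'gbk' for a GBK page
html = response.text  # source of the chapter page
print(html)

# The chapter body sits inside <div id="content"></div>; re.S lets "." span line breaks.
chapter_content = re.findall(r'<div id="content">(.*?)</div>', html, re.S)[0]
chapter_content = chapter_content.replace('&nbsp;', '')
chapter_content = chapter_content.replace('<br/><br/>', '\n')
print(chapter_content)

# Save the cleaned text to 1.txt; the with block closes the file.
with open('%s.txt' % 1, 'w', encoding='utf-8') as fb:
    fb.write('\n')
    fb.write(chapter_content)
    fb.write('\n')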
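Scraping the whole novel from the table-of-contents page would start by collecting the chapter links. The sketch below is not the code from the full version: `chapter_links` is a hypothetical helper, the `<dd><a href=...>` markup is an assumption about how such pages are commonly laid out, and the URLs are example.com placeholders.

```python
import re
from urllib.parse import urljoin

def chapter_links(toc_html, toc_url):
    """Return (absolute_url, title) pairs for every chapter link in a TOC page."""
    # Assumed markup: <dd><a href="...">title</a></dd>; adjust to the real page.
    pairs = re.findall(
        r'<dd>\s*<a href=["\']([^"\']+)["\'][^>]*>(.*?)</a>', toc_html, re.S
    )
    # Relative links are resolved against the TOC page's own URL.
    return [(urljoin(toc_url, href), title.strip()) for href, title in pairs]

toc_url = 'http://example.com/book/'
sample_toc = '''<div id="list"><dl>
<dd><a href='ch1.html'>Chapter 1</a></dd>
<dd><a href='/book/ch2.html'>Chapter 2</a></dd>
</dl></div>'''

for link, title in chapter_links(sample_toc, toc_url):
    print(title, link)
```

Each returned URL can then go through the same fetch, extract, clean, and save steps used for the single chapter above.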

Reference: https://www.cnblogs.com/mumu597/p/11355787.html
Full version: https://blog.csdn.net/weixin_46102027/article/details/115899024