A Python Beginner's Web-Scraping Journey (Part 1)

老睿在此

Tags: python, web scraping

Article link: https://blog.csdn.net/weixin_47793683/article/details/117400917

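Summary (auto-generated by CSDN): This article is the author's record of learning web scraping as a Python beginner. It describes how to use urllib and BeautifulSoup to scrape book information from the Biquge site and save the content to a TXT file: parse the HTML, locate the target elements, extract the book title and chapter links, and finally fetch and store the book's text.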

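Newbie here. I have only been using Python for a few months, and for my database course's final project I needed a crawler to scrape some books from Biquge (which, admittedly, is not the right thing to do, and I own up to that).
First, here is the tutorial I learned from: https://www.cnblogs.com/yizhenfeng168/p/6972946.html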

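This crawler uses urlopen to fetch pages over HTTP.

(We covered HTTP when we studied JSP. If you are not familiar with it, look it up online; it is not hard to understand.)

Here is the complete code: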

from urllib.request import urlopen
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf
# Request the home page HTML
html = urlopen("https://www.biquge.com.cn/")
# Parse the HTML with BeautifulSoup
obj = bf(html.read(),'html.parser')
# Extract the title from the head and title tags
# title = obj.head.title
# Find the first element with class=image
con = obj.find(attrs={'class':'image'})
# The href of its <a> tag is the book's identifier (its path on the site)
book=con.a['href']
# html_book is the Biquge address plus the book's identifier
html_book=urlopen("https://www.biquge.com.cn/"+book)
# Parse html_book
obj_book=bf(html_book.read(),'html.parser')
# book_name is the book title
book_name=obj_book.find(id="info").h1.text

# Find id=list, then find every dd inside it (one per chapter)
con_books=obj_book.find(id="list").find_all('dd')
# Write as UTF-8 so the output does not depend on the platform's default encoding;
# the with block closes the file even if a request fails midway
with open(r"E:\学习\python——爬虫\book_check"+"/"+book_name+".txt",'w+',encoding='utf-8') as data:
    # Loop over the chapters to read each chapter's identifier
    for i in con_books:
        # print(i.a['href'])
        book_page=(i.a['href'])
        html_book_page = urlopen("https://www.biquge.com.cn/"+book_page)
        obj_book_page = bf(html_book_page.read(),'html.parser')
        # con_books_page=obj_book_page.find(id="content").find_all('br')
        # for j in con_books_page:
        #     print(j)
        # The element with id=content holds the chapter text
        con_books_page=obj_book_page.find(id="content")
        print(con_books_page.text.replace(u'\xa0', u' '),file=data)
        print(con_books_page.text.replace(u'\xa0', u' '))

Next, I will walk through what I learned while building this crawler.
Let's start with this part:

from urllib.request import urlopen
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf

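These are the imports, nothing unusual. I use PyCharm. urlopen comes from urllib, which ships with Python, so only bs4 (the beautifulsoup4 package) has to be installed beforehand.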

Next:

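html = urlopen("https://www.biquge.com.cn/")
# The string inside the quotes is the URL of the page you want to scrape

obj = bf(html.read(),'html.parser')
# This line parses the HTML with BeautifulSoup (imported above under the short name bf)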
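
As a sketch of a slightly more defensive fetch step: the helper below adds a timeout and a User-Agent header, and decodes the response using the charset the server declares. The name fetch_html is hypothetical and not part of the original code, and whether this site needs a custom User-Agent is an assumption; a plain urlopen call worked for the author.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    # Some sites reject Python's default User-Agent; sending a browser-like
    # one is an assumption here, not something the original code needed.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # The timeout keeps the script from hanging forever on a dead connection.
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in Content-Type, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can be handed to BeautifulSoup in the same way as html.read(), for example bf(fetch_html("https://www.biquge.com.cn/"), 'html.parser').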

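Next we need to locate what we want to scrape (the screenshot is a bit blurry).
Inspecting the element with class=image, we find the book 重生之都市仙尊.
Inside it there is an <a href="/book/32883/">.
That is what we are after: following it opens the book's page.
[image]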

# Find the first element with class=image
con = obj.find(attrs={'class':'image'})
# The href of its <a> tag is the book's identifier (its path on the site)
book=con.a['href']
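
The lookup can be tried offline on a small piece of HTML. This is a minimal sketch: SAMPLE_HTML is an assumed stand-in modelled on the <a href="/book/32883/"> seen in the inspector, not the site's real markup. It also shows two things the original code skips: find() returns None when nothing matches, and urljoin can turn the site-relative href into a full URL without doubling the slash.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# A tiny stand-in for the home page markup (an assumption, not the real page).
SAMPLE_HTML = """
<div class="item">
  <div class="image"><a href="/book/32883/"><img alt="cover"></a></div>
</div>
"""

BASE_URL = "https://www.biquge.com.cn/"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first match, or None when nothing matches.
container = soup.find(attrs={"class": "image"})
if container is None or container.a is None:
    raise SystemExit("no element with class='image' containing a link")

book_href = container.a["href"]
# urljoin resolves the site-relative href against the base URL.
book_url = urljoin(BASE_URL, book_href)
print(book_href)
print(book_url)
```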

That takes us to the page for 重生之都市仙尊.
[image]
There are many chapters listed.
As before, we repeat the same search process.

# html_book is the Biquge address plus the book's identifier
html_book=urlopen("https://www.biquge.com.cn/"+book)
# Parse html_book
obj_book=bf(html_book.read(),'html.parser')
# book_name is the book title
book_name=obj_book.find(id="info").h1.text

# Find id=list, then find every dd inside it (one per chapter)
con_books=obj_book.find(id="list").find_all('dd')
# Write as UTF-8 so the output does not depend on the platform's default encoding;
# the with block closes the file even if a request fails midway
with open(r"E:\学习\python——爬虫\book_check"+"/"+book_name+".txt",'w+',encoding='utf-8') as data:
    # Loop over the chapters to read each chapter's identifier
    for i in con_books:
        # print(i.a['href'])
        book_page=(i.a['href'])
        html_book_page = urlopen("https://www.biquge.com.cn/"+book_page)
        obj_book_page = bf(html_book_page.read(),'html.parser')
        # con_books_page=obj_book_page.find(id="content").find_all('br')
        # for j in con_books_page:
        #     print(j)
        # The element with id=content holds the chapter text
        con_books_page=obj_book_page.find(id="content")
        print(con_books_page.text.replace(u'\xa0', u' '),file=data)
        print(con_books_page.text.replace(u'\xa0', u' '))
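
The file-writing step depends on book_name being a valid file name and on the output file being able to hold Chinese text. Here is a minimal sketch of one way to handle both. The helper safe_filename is hypothetical, a temporary directory stands in for the author's E:\ folder, and the chapter strings are placeholders rather than scraped text.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(name):
    """Replace characters Windows forbids in file names (hypothetical helper)."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


# A temporary directory stands in for the author's output folder.
out_dir = Path(tempfile.mkdtemp())
book_name = "重生之都市仙尊"  # normally read from the book page
out_path = out_dir / (safe_filename(book_name) + ".txt")

chapters = ["first chapter text", "second chapter text"]  # placeholders

# encoding="utf-8" avoids depending on the platform's default encoding;
# the with block closes the file even if something fails midway.
with open(out_path, "w", encoding="utf-8") as data:
    for text in chapters:
        print(text, file=data)

print(out_path.read_text(encoding="utf-8"))
```

Building the path with pathlib also avoids mixing backslashes and forward slashes by hand.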

Here everything I print is also written to the txt file:

print(con_books_page.text.replace(u'\xa0', u' '),file=data)

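The replace call turns non-breaking spaces (\xa0) into ordinary spaces, to avoid the errors that such characters caused.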
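
To make the \xa0 issue concrete, here is a small sketch. It assumes the failure was an encoding error when writing with GBK, the usual default on Chinese-language Windows; the original post does not show the actual traceback, and the sample string is made up.

```python
# \xa0 is a non-breaking space; it often appears in scraped HTML as &nbsp;
text = "第一章\xa0\xa0开始"

# GBK has no mapping for \xa0, so encoding the raw text fails.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
print("raw text encodes as GBK:", gbk_ok)

# After the replacement used in the article, GBK can encode the text.
cleaned = text.replace("\xa0", " ")
cleaned_bytes = cleaned.encode("gbk")

# UTF-8 can encode \xa0 directly, so opening the file with
# encoding="utf-8" is another way around the problem.
utf8_bytes = text.encode("utf-8")
```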
For other specific problems, here are several pages I used while learning.
Fixing the parsing error:
https://www.cnblogs.com/cwp-bg/p/7835434.html

Errors when open() opens a file path:
https://blog.csdn.net/marsjhao/article/details/60333312

[Python] Reading from and writing to txt:
https://blog.csdn.net/zxfhahaha/article/details/81288660

HTML questions (the <br> tag):
https://www.w3school.com.cn/tags/tag_br.asp

Python scraping: using bs4 to get the text inside a tag and the tag's attributes:
https://blog.csdn.net/weixin_45774350/article/details/108930955

Getting the href value of every <a> tag on a page in Python:
https://blog.csdn.net/Homewm/article/details/83651735

A simple Python crawler (scraping download links):
https://www.jianshu.com/p/8fb5bc33c78e

Using Python regular expressions to find a specific attribute value in an HTTP response:
https://blog.csdn.net/sinat_36188088/article/details/103031182

How can a complete beginner get started with Python web scraping?
https://zhuanlan.zhihu.com/p/77560712

Biquge (the GOAT): https://www.biquge.com.cn/
