import re

import requests
from bs4 import BeautifulSoup

import url_manager

root_url = "http://www.crazyant.net"

urls = url_manager.UrlManager()  # create the URL manager
urls.add_new_url(root_url)

fout = open("craw_all_pages.txt", "w", encoding="utf-8")
while urls.has_new_url():
    current_url = urls.get_url()
    try:
        # Give up on a page if the server does not respond within 3 seconds,
        # so the crawler does not hang; a timeout raises an exception.
        r = requests.get(current_url, timeout=3)
    except requests.RequestException:
        print("error, request failed:", current_url)
        continue
    if r.status_code != 200:
        print("error, return status_code is not 200:", current_url)
        continue
    soup = BeautifulSoup(r.text, "html.parser")  # parse the page with BeautifulSoup
    title = soup.title.string  # extract the page title
    fout.write("%s\t%s\n" % (current_url, title))
    fout.flush()  # write to the file immediately
    print("success:%s,%s,%d" % (current_url, title, len(urls.new_urls)))

    # Add newly discovered URLs to the URL manager.
    links = soup.find_all("a")
    for link in links:
        href = link.get("href")
        if href is None:  # skip links without an href attribute
            continue
        # Use a regular expression to keep only article URLs;
        # the dots are escaped so "." matches literally.
        pattern = r"^http://www\.crazyant\.net/\d+\.html$"
        if re.match(pattern, href):
            urls.add_new_url(href)

fout.close()  # close the file and release the resource
The url_manager module comes from an earlier post: Python爬虫—requests、url管理器、HTML (八饱粥的博客, CSDN).
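If you don't have that earlier post at hand, here is a minimal sketch of what UrlManager might look like, assuming only the interface the crawler above relies on (add_new_url, has_new_url, get_url, and a new_urls set); the original implementation may differ in detail.

class UrlManager:
    """Minimal URL manager: tracks URLs waiting to be crawled and URLs already seen.
    This is an illustrative sketch, not necessarily the original implementation."""

    def __init__(self):
        self.new_urls = set()  # URLs not yet crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Ignore empty URLs and URLs already present in either set,
        # so each page is crawled at most once.
        if not url:
            return
        if url in self.new_urls or url in self.old_urls:
            return
        self.new_urls.add(url)

    def get_url(self):
        # Pop a pending URL and record it as crawled.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

    def has_new_url(self):
        return len(self.new_urls) > 0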