What Is a Web Crawler?

打嗝_小王子

Article link: https://blog.csdn.net/qq_40728302/article/details/106031714

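Web crawlers get their name because they crawl across the web. At their core they are recursive: to find URL links, a crawler must first fetch a page, examine its content, and look for another URL; it then fetches the page behind that URL and repeats the whole process over and over.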
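The recursion described above can be sketched without touching the network. In this minimal sketch, `PAGES` is a hypothetical in-memory stand-in for a website (each key is a URL path, each value the HTML served at that path), and `LinkParser`, `extract_links`, and `crawl` are illustrative names that do not come from the original post. The `visited` list and `max_depth` limit are added assumptions that keep the recursion from looping forever.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML body (for illustration only).
PAGES = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/a">A</a> <a href="/d">D</a>',
    "/c": "<p>no links here</p>",
    "/d": '<a href="/c">C</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def crawl(url, visited=None, max_depth=3):
    """Visit url, then recursively visit every page it links to (depth-first)."""
    if visited is None:
        visited = []
    # Stop on pages already seen, on unknown pages, or when the depth budget runs out.
    if url in visited or max_depth < 0 or url not in PAGES:
        return visited
    visited.append(url)
    for link in extract_links(PAGES[url]):
        crawl(link, visited, max_depth - 1)
    return visited
```

Calling `crawl("/a")` walks the toy site depth-first and returns each page once, even though `/a` and `/b` link to each other.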

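When you use a web crawler, you must think very carefully about how much network bandwidth it will consume, and do your best to keep the load on the target server as low as possible.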

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import random
import re

# Mirror site reachable without a VPN:
# http://en.bosimedia.com/wiki/Kevin_Bacon
#
# Example of an entry link on Baidu Baike:
# <a target=_blank href="/item/%E9%9D%A2%E5%90%91%E5%AF%B9%E8%B1%A1">

def getLinks(articleUrl):
    """Return the /item/ links in the first "para" div, or [] on any failure."""
    try:
        html = urlopen("https://baike.baidu.com" + articleUrl)
    except (HTTPError, URLError) as e:
        print("open url error:", e)
        return []
    bsObj = BeautifulSoup(html, features="html5lib")
    target = bsObj.find("div", {"class": "para"})
    if target is None:
        print("None error: no div with class 'para' in", articleUrl)
        return []
    # ((?!:).)* matches any run of characters that contains no colon
    return target.find_all("a", href=re.compile("^(/item/)((?!:).)*$"))

if __name__ == "__main__":
    # The random module seeds itself from the OS, so no explicit seed is needed
    links = getLinks("/item/java")
    while len(links) > 0:
        # Follow one link chosen at random from the current page
        newArticle = random.choice(links).attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
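The loop above requests one page after another with no pause and may fetch the same entry more than once. One possible way to lighten the load on the target server, in line with the advice earlier in this post, is to cache pages already fetched and to wait a minimum time between real requests. The `PoliteFetcher` class below is a hypothetical helper, not part of the original script; the one-second default delay is an arbitrary assumption, and the `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting.

```python
import time


class PoliteFetcher:
    """Wraps a fetch function with a cache and a minimum delay between requests.

    Hypothetical helper for illustration; `fetch` is any callable that takes a
    URL and returns the page content.
    """

    def __init__(self, fetch, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_delay = min_delay  # seconds between real requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_request = None

    def get(self, url):
        # Serve repeated URLs from the cache so the server is hit only once per page.
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_delay seconds have passed since the last request.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        self._cache[url] = self._fetch(url)
        return self._cache[url]
```

As a usage sketch, `getLinks` could call `fetcher.get(url)` instead of `urlopen(url)` directly, with `fetcher = PoliteFetcher(lambda url: urlopen(url).read())` created once at the top of the script.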
