“Six Degrees of Baidu Baike” (Simple Version)

Author: 在技术海洋里潜泳

Article link: https://blog.csdn.net/qq_46273905/article/details/105601844

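You have probably heard of “Six Degrees of Wikipedia”. This post covers only the first stage of that idea: building a crawler that hops from one page to another. The Baidu Baike entry for 金融 (finance) is used as the test page.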

Preparation

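Fixing unreadable URLs: Baidu Baike URLs are percent-encoded, so the Chinese entry title shows up as a run of %XX escape sequences. The code below decodes them back into readable text.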

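#https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
from urllib.parse import unquote
url='https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'
def new_url(url):
    # Decode the percent-encoded UTF-8 bytes back into readable characters
    new_url=unquote(url,'utf8')
    return new_url
print(new_url(url))  # https://baike.baidu.com/item/金融/860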
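
Going in the opposite direction can be useful as well. The following is a minimal sketch, assuming you want to build an entry URL from a readable title: `quote` from `urllib.parse` percent-encodes the title and `unquote` recovers it. The helper name `entry_url` and the constant `BASE` are hypothetical and do not appear in the original post.

```python
from urllib.parse import quote, unquote

BASE = 'https://baike.baidu.com'  # site root used throughout this post

def entry_url(title, entry_id):
    # Hypothetical helper: build an entry URL from a readable title and a numeric id
    return '{}/item/{}/{}'.format(BASE, quote(title, encoding='utf8'), entry_id)

encoded = entry_url('金融', 860)
decoded = unquote(encoded, 'utf8')
print(encoded)
print(decoded)
```

The encoded form is what goes over the wire; the decoded form is only for display.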

Implementation

First, find every link on the page. All of them live in a tags.

from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.parse import unquote
url='https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'
def new_url(url):
    new_url=unquote(url,'utf8')
    return new_url
html=urlopen('https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860')
bs=BeautifulSoup(html,'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        # Both wanted and unwanted links are printed here, so a further filtering step is needed
        print(link.attrs['href'])
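
To see what "collect the href of every a tag" means without touching the network, here is a small offline sketch. It deliberately uses `html.parser.HTMLParser` from the standard library instead of BeautifulSoup, so it is not the post's method, only an illustration of the same idea. The class name `HrefCollector` and the HTML fragment are made up for this example.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)

# Made-up HTML fragment for illustration, not copied from Baidu Baike
sample = '''
<div>
  <a href="/item/%E4%BC%9A%E8%AE%A1/88436">entry link</a>
  <a href="/help">site link</a>
  <a name="anchor-without-href">no href</a>
</div>
'''

collector = HrefCollector()
collector.feed(sample)
print(collector.hrefs)
```

The third tag has no href, which is the case the `if 'href' in link.attrs` check guards against.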

Next, narrow the results down to entry links. The entry links share a common pattern:

Entry links all have the form /item/%E4%BC%9A%E8%AE%A1/88436

Use a regular expression to filter the links:

#^(/item/).*?/[0-9]*$
#https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.parse import unquote
import re
url='https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'
def new_url(url):
    new_url=unquote(url,'utf8')
    return new_url
html=urlopen('https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860')
bs=BeautifulSoup(html,'html.parser')
for link in bs.find_all('a',href=re.compile('^(/item/).*?/[0-9]*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])
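
The pattern can be checked offline against a handful of hrefs before running it on a live page. This is a sketch using the same regular expression as above. The `candidates` list is made up for illustration, and the name `ENTRY_LINK` is hypothetical.

```python
import re

# Same pattern as in the post
ENTRY_LINK = re.compile('^(/item/).*?/[0-9]*$')

# Made-up hrefs for illustration
candidates = [
    '/item/%E4%BC%9A%E8%AE%A1/88436',    # entry with a numeric id
    '/item/%E9%87%91%E8%9E%8D',          # entry without an id
    '/item/%E9%87%91%E8%9E%8D/',         # trailing slash, no digits
    '/help',                             # unrelated site link
    'https://baike.baidu.com/item/%E4%BC%9A%E8%AE%A1/88436',  # absolute URL
    '/item/%E4%BC%9A%E8%AE%A1/88436#1',  # id followed by a fragment
]

matched = [href for href in candidates if ENTRY_LINK.search(href)]
for href in matched:
    print(href)
```

Only the first and third candidates match. Because `[0-9]*` accepts zero digits, a path that ends in a bare slash also passes; if that is not wanted, `[0-9]+` would require at least one digit.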

Wrap the logic in a function to tidy up the structure:

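from urllib.request import urlopen
from bs4 import BeautifulSoup
import random
import re
def getLinks(articleUrl):
    # Fetch the entry page and return every a tag whose href looks like an entry link
    html = urlopen('https://baike.baidu.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find_all('a',href=re.compile('^(/item/).*?/[0-9]*$'))
links=getLinks('/item/%E9%87%91%E8%9E%8D/860')
# Keep hopping to a randomly chosen entry link until a page has none
while len(links)>0:
    newArticle=links[random.randint(0,len(links)-1)].attrs['href']
    print(newArticle)
    links=getLinks(newArticle)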
Complete code:

#https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.parse import unquote
import datetime
import random
import re
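# random.seed() no longer accepts a datetime object (Python 3.11+), so seed with its float timestamp
random.seed(datetime.datetime.now().timestamp())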
def new_url(url):
    new_url=unquote(url,'utf8')
    return new_url
def getLinks(articleUrl):
    html = urlopen('https://baike.baidu.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find_all('a',href=re.compile('^(/item/).*?/[0-9]*$'))
links=getLinks('/item/%E9%87%91%E8%9E%8D/860')
while len(links)>0:
    newArticle=links[random.randint(0,len(links)-1)].attrs['href']
    print(newArticle)
    links=getLinks(newArticle)
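
Note that the complete code defines `new_url` but never calls it, so the printed paths stay percent-encoded. As a small sketch of how the decoding step could be applied to each hop, the hypothetical helper `entry_title` below pulls the readable title out of an href of the form /item/<encoded title>/<id>.

```python
from urllib.parse import unquote

def entry_title(href):
    # Hypothetical helper: '/item/<encoded title>/<id>' -> decoded title
    parts = href.split('/')
    # parts looks like ['', 'item', '<encoded title>', '<id>']
    return unquote(parts[2], 'utf8')

title = entry_title('/item/%E4%BC%9A%E8%AE%A1/88436')
print(title)
```

Calling `print(entry_title(newArticle))` inside the loop instead of `print(newArticle)` would show the entry names directly.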
