Crawling Baidu Baike (Morvan Python)

Latest recommended article published on 2024-09-11 18:01:31

m0_46149106

Tags: python

Article link: https://blog.csdn.net/m0_46149106/article/details/107687529


1. First, open the Baidu Baike page for LeBron James
2. Import the libraries and set the starting URL

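from bs4 import BeautifulSoup  # HTML parsing library
from urllib.request import urlopen  # standard-library function for opening URLs
import re
import random  # used to pick a random link to follow
base_url = "https://baike.baidu.com"  # Baidu Baike
# history of visited paths, starting with the LeBron James entry
his = ["/item/%e5%8b%92%e5%b8%83%e6%9c%97%c2%b7%e8%a9%b9%e5%a7%86%e6%96%af/1989503"]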
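The path stored in `his` is percent-encoded UTF-8. As a quick check of which entry it points to, the following sketch decodes it with the standard library's `urllib.parse.unquote`. The names `start_path` and `decoded` are introduced here only for illustration.

```python
from urllib.parse import quote, unquote

# The starting path from the list above (percent-encoded UTF-8).
start_path = "/item/%e5%8b%92%e5%b8%83%e6%9c%97%c2%b7%e8%a9%b9%e5%a7%86%e6%96%af/1989503"

# unquote() turns each %XX byte sequence back into readable text.
decoded = unquote(start_path)
print(decoded)

# quote() goes the other way; it emits uppercase hex digits,
# so the comparison is done on the lowercased result.
assert quote(decoded).lower() == start_path
```

The decoded text is the Chinese title of the entry followed by its numeric id, which is why the path looks unreadable in its raw form.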

3. Fetch and parse the page

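url = base_url + his[-1]  # the most recent path in the history
html = urlopen(url).read().decode('utf-8')  # download the page and decode it as UTF-8
soup = BeautifulSoup(html, features='lxml')
# print the first <h1> (the page title) and the path that was just fetched
print(soup.find('h1').get_text(), '    url: ', his[-1])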

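`urlopen(url)` sends urllib's default `User-Agent`, which some sites reject or answer with a different page. Whether Baidu Baike does so is an assumption here, not something this post states. If the call above fails, one option is to send a browser-like header and set a timeout. In this sketch, `build_request`, `fetch_html`, and the header string are hypothetical names and values.

```python
from urllib.request import Request, urlopen

BASE_URL = "https://baike.baidu.com"
# Hypothetical browser-like header value; any reasonable string can be substituted.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-crawler)"}

def build_request(path):
    """Build a GET request for a Baike path such as '/item/...'."""
    return Request(BASE_URL + path, headers=HEADERS)

def fetch_html(path, timeout=10):
    """Download the page and decode it as UTF-8 (performs a network call)."""
    with urlopen(build_request(path), timeout=timeout) as resp:
        return resp.read().decode("utf-8")

# Placeholder path; building the request does not download anything.
req = build_request("/item/example")
print(req.full_url, req.get_header("User-agent"))
```

With this in place, `html = fetch_html(his[-1])` could stand in for the `urlopen(url).read().decode('utf-8')` line.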

4. The pattern behind the links on the LeBron James page
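The links to other entries start with /item/, but not every href that starts with /item/ is usable, because some of them have Chinese characters mixed in.
For example:

So a regular expression is used to filter these out: it keeps only the hrefs in which everything after /item/ is a run of percent-encoded bytes (a % followed by two characters).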
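Before applying the pattern to the page, it can be checked on a few hand-written hrefs. Beautiful Soup tests a compiled pattern against an attribute value with `search()`, so the sketch does the same. The sample hrefs and the stricter `strict` pattern are assumptions added for illustration, not part of the post.

```python
import re

# Same pattern as in the post: "/item/" followed only by %XX groups up to the end.
pattern = re.compile("/item/(%.{2})+$")
# Stricter variant (an assumption): require two hex digits after each %.
strict = re.compile(r"/item/(%[0-9A-Fa-f]{2})+$")

# Hypothetical hrefs written by hand for illustration.
samples = [
    "/item/%E7%AF%AE%E7%90%83",            # only percent-encoded bytes
    "/item/%E7%AF%AE%E7%90%83/123456",     # numeric suffix after the name
    "/item/%E7%AF%AE%E7%90%83?fr=navbar",  # query string at the end
    "/item/NBA",                           # plain ASCII name
    "/item/%zz",                           # not valid hex, but ".{2}" accepts it
]

# One (post pattern, strict pattern) pair of booleans per sample.
results = [(bool(pattern.search(h)), bool(strict.search(h))) for h in samples]
for href, flags in zip(samples, results):
    print(href, flags)
```

Only the first sample passes both patterns. Note that a path with a numeric suffix, like the starting path in `his`, is rejected too, so the crawl only follows links whose href ends right after the encoded name.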

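# collect <a> tags that open in a new tab and whose href matches the /item/%XX... pattern
sub_urls = soup.find_all("a", {"href": re.compile("/item/(%.{2})+$"), "target": "_blank"})
if len(sub_urls) != 0:
    # at least one matching link: pick one at random and follow it next
    his.append(random.sample(sub_urls, 1)[0]['href'])
else:
    # no matching link: go back one level; pop() removes and returns the last element by default
    his.pop()
print(his)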
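The snippet above performs a single step. Repeating it turns the history list into a random walk through linked entries. A minimal sketch of that step as a pure function follows, so it can be tried without network access. `next_step` and its guard against emptying the history are additions for illustration, not part of the post.

```python
import random

def next_step(his, candidate_hrefs, rng=random):
    """Advance the crawl history by one step, in place, and return it.

    candidate_hrefs: hrefs that passed the /item/ filter on the current page.
    """
    if candidate_hrefs:
        # Follow one of the matching links at random.
        his.append(rng.choice(candidate_hrefs))
    elif len(his) > 1:
        # Dead end: drop the current page and go back to the previous one.
        his.pop()
    # With a single entry left there is nothing to go back to,
    # so the history stays unchanged.
    return his

history = ["/item/start"]                         # hypothetical starting path
next_step(history, ["/item/%E7%AF%AE%E7%90%83"])  # one candidate, so it is chosen
print(history)
next_step(history, [])                            # no candidates: step back
print(history)
```

In a loop, each iteration would fetch `base_url + his[-1]`, collect the `href` values from `sub_urls`, and pass them to `next_step`. The `len(his) > 1` check matters there: without it, a dead end on the first page would empty the list and the next `his[-1]` would raise an `IndexError`.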