A Beginner's Python Learning Journey, Part 3

Latest recommended article published on 2019-08-13 13:42:32

Mr_wuliboy

Article link: https://blog.csdn.net/Mr_wuliboy/article/details/79858247


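1. Scraping information from a web page with BeautifulSoup: before using BeautifulSoup you have to import it with from bs4 import BeautifulSoup (note the capital B and S). You also need urlopen, which fetches the page's HTML source: from urllib.request import urlopen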

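html = urlopen("page URL").read().decode('utf-8')

print(html) prints the page's HTML source. urlopen on its own returns a response object, so read() and decode('utf-8') are needed to get the source as text (this assumes the page is UTF-8 encoded). Next,

soup = BeautifulSoup(html, features='lxml') parses the HTML source into soup. features selects the parser, and lxml is generally recommended. For example, to pull out the URLs held in the page's <a href=""></a> tags, use: all_href = soup.find_all('a')

print(all_href) prints results in the form <a href="https://mp.csdn.net/postedit"></a>, with the tags still attached. To drop the tags and keep only the URLs, loop over the results with for: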

all_href = soup.find_all('a', href=True)  # href=True skips <a> tags that have no href

all_href = [a['href'] for a in all_href]

print(all_href)
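
The steps in section 1 can be tried without a network connection. The sketch below is only an illustration: SAMPLE_HTML and extract_links are hypothetical names that do not come from the original post, and it uses Python's built-in html.parser instead of lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used instead of a live URL so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/first">First</a>
  <a href="/relative/second">Second</a>
  <a name="anchor-without-href">No link here</a>
</body></html>
"""


def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    # html.parser is an assumption made here to avoid depending on lxml.
    soup = BeautifulSoup(html, features="html.parser")
    # href=True keeps only the <a> tags that actually carry an href attribute,
    # so a['href'] cannot raise KeyError.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML)
print(links)
```

The third <a> tag has no href, which is why filtering with href=True (or checking the attribute first) is safer than indexing every tag directly.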

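2. Using BeautifulSoup to extract information by CSS class: first parse the page's HTML source into soup, as above. month = soup.find_all('li', {"class": "month"}) collects every <li></li> element whose class includes month. Note that the second argument is a dictionary, written with a colon rather than a comma. Then loop over the results and print the text of each one:

for m in month:
    print(m.get_text())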
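
A small offline sketch of the class lookup in section 2 follows. The markup in SAMPLE_HTML and the helper texts_by_class are hypothetical, made up for illustration, and html.parser is again used in place of lxml as an assumption to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical list markup; the second item carries two classes.
SAMPLE_HTML = """
<ul>
  <li class="month">January</li>
  <li class="month highlight">February</li>
  <li class="day">Monday</li>
</ul>
"""


def texts_by_class(html, tag, css_class):
    """Return the text of every <tag> element whose class list contains css_class."""
    soup = BeautifulSoup(html, features="html.parser")
    # The filter is a dict: attribute name -> required value.
    matches = soup.find_all(tag, {"class": css_class})
    return [element.get_text() for element in matches]


months = texts_by_class(SAMPLE_HTML, "li", "month")
print(months)
```

An element with several classes still matches as long as one of them equals the requested class, so the "month highlight" item is included.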

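3. Crawling Baidu Baike with BeautifulSoup: first, a word about random. In this program random is used to pick, at random, one linked keyword from the current page. The complete program is: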

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

base_url = "https://baike.baidu.com"
# Crawl history; it starts at the entry for 网络爬虫 (web crawler)
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(10):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(soup.find('h1').get_text(), ' url:', his[-1])

    # Links to other entries: target="_blank" and an href ending in percent-encoded characters
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        # Follow one of the links, chosen at random
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # Dead end: go back to the previous page
        his.pop()

print(his)

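%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB stands for the four Chinese characters 网络爬虫 (web crawler). Chinese characters in a URL are written in percent-encoded form: each character is converted to its UTF-8 bytes, and each byte is written as % followed by two hexadecimal digits. The pattern (%.{2})+$ comes from looking at the HTML source and working out what the links you want have in common, then using a regular expression to filter out everything else. %.{2} means a % followed by any two characters, and (...)+$ requires one or more such groups running to the end of the href. his.pop() means that if the current page has no linked keywords left, the program goes back to the previous page and picks another one at random. This can be thought of as backtracking, similar in spirit to recursion.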
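
The three points above (percent-encoding, the link filter, and stepping back with his.pop()) can be checked offline with the standard library alone. This is a rough sketch under stated assumptions: the hrefs in candidates are invented for illustration, filter_item_links and advance are hypothetical helpers that are not part of the original program, and random.choice stands in for random.sample(..., 1)[0].

```python
import random
import re
from urllib.parse import quote, unquote

# 1. Percent-encoding: each Chinese character becomes three UTF-8 bytes,
#    and each byte is written as %XX.
ENCODED = "%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB"
decoded = unquote(ENCODED)      # back to the Chinese characters
re_encoded = quote(decoded)     # and forward again

# 2. The link filter: "/item/" followed only by %XX groups up to the end.
ITEM_LINK = re.compile(r"/item/(%.{2})+$")


def filter_item_links(hrefs):
    """Keep the hrefs the pattern accepts, using search() as BeautifulSoup does."""
    return [h for h in hrefs if ITEM_LINK.search(h)]


# Hypothetical hrefs of the kind an entry page might contain.
candidates = [
    "/item/%E7%88%AC%E8%99%AB",          # only percent-encoded groups: kept
    "/item/%E7%88%AC%E8%99%AB/5162711",  # trailing numeric id: rejected
    "/item/Python",                      # plain ASCII title: rejected
    "/wiki/%E7%88%AC",                   # not under /item/: rejected
]
kept = filter_item_links(candidates)


# 3. The history list: go deeper when links exist, otherwise step back.
def advance(history, sub_urls):
    """Mirror the append/pop logic of the crawler on a plain list of hrefs."""
    if sub_urls:
        history.append(random.choice(sub_urls))
    else:
        history.pop()
    return history


history = advance(["/item/start"], kept)       # one link available, so it is chosen
after_dead_end = advance(list(history), [])    # no links, so the last entry is dropped

print(decoded, kept, history, after_dead_end)
```

One thing this makes visible: the starting href in the program ends in /5162711, so it would not pass the filter itself. Only links whose path after /item/ consists purely of percent-encoded groups are followed.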
