python02


ccrispy


Article link: https://blog.csdn.net/xxiizzeefather/article/details/108623441


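Adapted from my teacher's summary:

1) Choose the URL you want to scrape (url)
2) Open that URL from Python (urlopen, etc.)
3) Read the page content (with read())
4) Pass the content you read to BeautifulSoup
5) Use BeautifulSoup to select tags and other information (in place of regular expressions)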
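Step 5 says BeautifulSoup stands in for regular expressions. A rough comparison on a small inline snippet may show why. The sample HTML and variable names here are made up for illustration, and Python's built-in `html.parser` is used so that no extra parser package is assumed:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: one link uses double quotes, the other single quotes.
html = """<p>See <a class="ext" href="https://example.com/a">A</a>
and <a href='/b'>B</a></p>"""

# Regex approach: this pattern only understands double-quoted href values.
regex_links = re.findall(r'href="([^"]+)"', html)

# BeautifulSoup approach: the parser handles either quoting style.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a")]

print(regex_links)  # ['https://example.com/a']
print(soup_links)   # ['https://example.com/a', '/b']
```

The regex could be extended to cover single quotes, but each new variation in the markup needs another pattern change, while the parser-based version stays the same.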

Installing BeautifulSoup

# Python 2+
pip install beautifulsoup4

# Python 3+
pip3 install beautifulsoup4
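To confirm the install worked, one option is to ask Python whether the modules can be found. This is only a sketch: the helper name `is_installed` is made up here. Note that the package is installed as `beautifulsoup4` but imported as `bs4`, and that the `lxml` parser used later in this post is a separate package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# bs4 is the import name of beautifulsoup4; lxml is the parser used later on.
for name in ("bs4", "lxml"):
    print(name, "installed" if is_installed(name) else "missing")
```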

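Getting started with bs4

In an IDE such as PyCharm, pressing Alt+Enter on the red-underlined import and choosing Install also takes care of the download and installation.
Next comes the usual scraping pattern.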

from bs4 import BeautifulSoup
from urllib.request import urlopen

# read() returns bytes; the page contains Chinese text, so decode it as UTF-8
html = urlopen("https://mofanpy.com/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)

Printed output:

<!DOCTYPE html>
<html lang="cn">
<head>
    <meta charset="UTF-8">
    <title>Scraping tutorial 1 | 莫烦Python</title>
    <link rel="icon" href="https://mofanpy.com/static/img/description/tab_icon.png">
</head>
<body>
    <h1>爬虫测试1</h1>
    <p>
        这是一个在 <a href="https://mofanpy.com/">莫烦Python</a>
        <a href="https://mofanpy.com/tutorials/scraping">爬虫教程</a> 中的简单测试.
    </p>

</body>
</html>
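The `decode('utf-8')` call in the fetch code is needed because `read()` returns raw bytes rather than a `str`. A minimal offline sketch of the difference, using a hard-coded byte string in place of a real response (the helper name `decode_body` is an assumption for illustration):

```python
def decode_body(raw: bytes, charset: str = "utf-8") -> str:
    """Turn a response body (bytes) into text, replacing undecodable bytes."""
    return raw.decode(charset, errors="replace")


# Stand-in for what urlopen(...).read() would return: bytes, not str.
raw = "<h1>爬虫测试1</h1>".encode("utf-8")
print(type(raw).__name__)  # bytes

text = decode_body(raw)
print(text)  # <h1>爬虫测试1</h1>
```

With a live response, `response.headers.get_content_charset()` can supply the charset when the server declares one; falling back to UTF-8 otherwise is a common default rather than a guarantee.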

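# features='lxml' selects the lxml parser, which is a separate package (pip install lxml)
soup = BeautifulSoup(html, features='lxml')
# soup.h1 returns the first <h1> tag in the document
print(soup.h1)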

<h1>爬虫测试1</h1>
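Attribute-style access such as `soup.h1` returns the first matching tag, and the tag's text can be read with `get_text()`. A small sketch on an inline copy of the page shown earlier, parsed with Python's built-in `html.parser` so that it runs without lxml (that parser choice is an assumption; this post itself uses lxml):

```python
from bs4 import BeautifulSoup

# Inline copy of the tutorial page, so the example runs without network access.
html = """<!DOCTYPE html>
<html lang="cn">
<head>
    <meta charset="UTF-8">
    <title>Scraping tutorial 1 | 莫烦Python</title>
</head>
<body>
    <h1>爬虫测试1</h1>
    <p>
        这是一个在 <a href="https://mofanpy.com/">莫烦Python</a>
        <a href="https://mofanpy.com/tutorials/scraping">爬虫教程</a> 中的简单测试.
    </p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.h1.get_text()   # text inside the first <h1>
title = soup.title.get_text()  # text inside <title>
first_link = soup.a["href"]    # attribute of the first <a>
missing = soup.h2              # no <h2> on the page, so this is None

print(heading)     # 爬虫测试1
print(title)       # Scraping tutorial 1 | 莫烦Python
print(first_link)  # https://mofanpy.com/
print(missing)     # None
```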

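If the page contains several tags of the same kind, such as the link tag <a>, use find_all() to find all of them, then read the attribute of each one by key, as in l["href"].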

all_href = soup.find_all('a')
all_href = [l['href'] for l in all_href]
print('\n',all_href)

Printed output:

 ['/', '/tutorials/data-manipulation/scraping/']
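The hrefs printed here are site-relative paths, so they cannot be opened on their own. One way to turn them into full URLs is `urljoin` from the standard library. This is a sketch that assumes the page URL used earlier in this post is the base:

```python
from urllib.parse import urljoin

# Base URL: the page the links were scraped from.
page_url = "https://mofanpy.com/static/scraping/basic-structure.html"
relative_hrefs = ["/", "/tutorials/data-manipulation/scraping/"]

# urljoin resolves each href against the page URL, as a browser would.
absolute_hrefs = [urljoin(page_url, href) for href in relative_hrefs]
print(absolute_hrefs)
# ['https://mofanpy.com/', 'https://mofanpy.com/tutorials/data-manipulation/scraping/']
```

An href that is already absolute passes through `urljoin` unchanged, so the same line works for pages that mix both kinds of link.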

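Personal summary

1) Import bs4 and load the page source with `bs4` (BeautifulSoup4), using `lxml` as the parser.
2) Select the tag you want directly and print it.
3) Use `l['...']` inside a for loop to read an attribute by key from each matched tag, then print the results.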
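Point 3 relies on every matched tag actually having the key: `l['href']` raises a `KeyError` on an `<a>` tag with no `href`. A sketch of two ways around that, on made-up inline HTML and with the built-in `html.parser` (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle <a> has no href attribute.
html = """
<p>
  <a href="https://mofanpy.com/">home</a>
  <a name="anchor-only">no link here</a>
  <a href="/tutorials/scraping">tutorial</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: tag.get() returns None instead of raising when the key is absent.
hrefs_with_none = [a.get("href") for a in soup.find_all("a")]

# Option 2: href=True makes find_all keep only tags that have the attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(hrefs_with_none)  # ['https://mofanpy.com/', None, '/tutorials/scraping']
print(hrefs)            # ['https://mofanpy.com/', '/tutorials/scraping']
```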
