HTML&CSS


mmい


Category: Data Mining (dataquest)

Article link: https://blog.csdn.net/zm714981790/article/details/51328569


Introduction

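A lot of information on the internet is neither stored in a database nor exposed through an API; it lives in web pages. One technique for extracting this data is web scraping.
In Python, scraping roughly works like this: load the page with the requests library, then use the BeautifulSoup library to extract the relevant information from it.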

Webpage Structure

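Web pages are written in HyperText Markup Language (HTML). HTML is a markup language with its own syntax rules; the browser downloads a page and follows these rules to render the content correctly for the user. The original post links to a list of all HTML tags.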

Send a GET request to http://dataquestio.github.io/web-scraping-pages/simple.html; response.content then holds the content of the page as bytes:

import requests

# Send a GET request and keep the raw response body (bytes).
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
'''
bytes (<class 'bytes'>)
b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
'''
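Because response.content is a bytes object, it usually has to be decoded before it can be handled as text. The following is a minimal sketch of that step. It reuses the bytes shown above as a literal so that it runs without a network call, and it assumes the page is UTF-8 encoded. With a live response, requests also offers response.text, which returns an already decoded string.

```python
# Sample bytes copied from the response shown above, so no network call is needed.
content = (
    b'<!DOCTYPE html>\n<html>\n    <head>\n'
    b'        <title>A simple example page</title>\n    </head>\n'
    b'    <body>\n        <p>Here is some simple content for this page.</p>\n'
    b'    </body>\n</html>'
)

# Assumption: the page is UTF-8 encoded.
html_text = content.decode("utf-8")

print(type(content).__name__)     # bytes
print(type(html_text).__name__)   # str
print(html_text.splitlines()[0])  # <!DOCTYPE html>
```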

Retrieving Elements From A Page

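Once we have the full HTML, we need to parse it. The BeautifulSoup library can extract the tags from HTML. Tags in HTML are nested inside one another, so they can be organized into a tree structure.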
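To make the tree structure concrete, here is a small sketch that prints each opening tag indented by its nesting depth. It uses Python's standard html.parser module rather than BeautifulSoup, purely for illustration. The class name TagTreePrinter is hypothetical, and the sketch assumes every tag in the input is explicitly closed, which holds for the simple page above.

```python
from html.parser import HTMLParser


class TagTreePrinter(HTMLParser):
    """Records each opening tag, indented by its nesting depth (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent by two spaces per level, then go one level deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up one level.
        self.depth -= 1


# The same structure as the simple example page, with whitespace removed.
html = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)

printer = TagTreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

The output lists html at the top level, head and body one level below it, and title and p one level below those.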

To extract the page's title: because tags are nested layer upon layer, we have to peel back the layers one at a time to reach the title tag.

from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')

# Looking at content, we can see that the p tag is inside the body tag.
body = parser.body
p = body.p

# Text is a property that gets the inside text of a tag.
print(p.text)

# The title, in turn, is inside the head tag.
head = parser.head
title = head.title
title_text = title.text

Using Find All

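Accessing tags directly as attributes, as above, is intuitive but not robust: attribute access returns only the first matching tag. The find_all function instead returns every occurrence of a tag.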

Using find_all to get the title: since we know the title is inside the first head, we first get the head, then take the text of the first title inside it:

parser = BeautifulSoup(content, 'html.parser')

# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag
p = body[0].find_all("p")

# A body can contain many p paragraphs; print the first one.
print(p[0].text)
head = parser.find_all("head")
title = head[0].find_all("title")
title_text = title[0].text
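As a side-by-side sketch of the two access styles, the snippet below parses a small piece of hypothetical markup (two paragraphs and no table; it is not the page used above). It suggests why attribute access is the more fragile of the two: it yields only the first match, and None when the tag is absent, whereas find_all returns a list that is simply empty when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs and no <table>.
html = """
<html>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""
parser = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first matching tag.
first_p = parser.p.text

# find_all returns every match.
all_p = [p.text for p in parser.find_all("p")]

# A missing tag: attribute access gives None, find_all gives an empty list.
missing_attr = parser.table
missing_list = parser.find_all("table")

print(first_p)
print(all_p)
print(missing_attr, len(missing_list))
```

Calling .text on the None returned for the missing table would raise an AttributeError, so code built on attribute access needs an explicit check before each step.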

Element Ids

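In HTML, an element (tag) can carry an id attribute that is unique within the page, so a specific element can be retrieved by its id. Here is an example: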


        <title>A simple example page</title>


        <div>
            <
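A minimal sketch of lookup by id follows. The markup and the id values "first" and "second" are hypothetical, chosen only for illustration, and are not the page from the example above. Both find_all and find accept an id keyword argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs, each with its own id.
html = """
<html>
  <body>
    <div>
      <p id="first">First paragraph.</p>
    </div>
    <p id="second">Second paragraph.</p>
  </body>
</html>
"""
parser = BeautifulSoup(html, "html.parser")

# find_all with an id filter still returns a list, so take its first item.
first = parser.find_all("p", id="first")[0]

# find returns the first match directly (or None if nothing matches).
second = parser.find(id="second")

print(first.text)
print(second.text)
```

Because the id identifies the element, the lookup does not depend on where the paragraph sits in the tree: the first paragraph is found inside the div without walking through body and div by hand.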
