Python Web Scraping

Latest recommended article published on 2023-11-17 14:24:40

cuicui_ruirui

Category: Python Web Scraping

Article link: https://blog.csdn.net/cuicui_ruirui/article/details/105094000

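This article introduces the basic workflow of a Python web scraper, covering data extraction, transformation, and storage (the ETL process). It focuses on fetching web resources with the Requests library and parsing HTML elements with BeautifulSoup to extract the needed information. It explains how to handle unstructured data, how to extract page links, body text, and timestamps, and how to crawl paginated content in bulk.

Summary generated by CSDN using AI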

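I. Why Scrape the Web

An estimated 90% of data is not in our own databases. It is scattered across the web and presented as web pages, that is, as unstructured data with no fixed format. It must be converted into structured data with ETL (Extract, Transform, Load) tools before it can be used.

II. ETL

E: Extract, pulling the data out of the source (raw data)

T: Transform, converting the data (with an ETL script)

L: Load, storing the data (structured data, which can then be analyzed and mined for value)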
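The three stages can be sketched end to end with only the standard library. This is a minimal illustration rather than the crawler built later in the article: the RAW_HTML string stands in for a downloaded page, and the names LinkParser, extract, transform, and load are made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical raw data standing in for a downloaded page.
RAW_HTML = (
    '<ul>'
    '<li><a href="/news/1.html">First headline</a></li>'
    '<li><a href="/news/2.html">Second headline</a></li>'
    '</ul>'
)


class LinkParser(HTMLParser):
    """Collects one {'title', 'url'} record per <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            title = ''.join(self._text).strip()
            self.rows.append({'title': title, 'url': self._href})
            self._href = None


def extract():
    """E: obtain the raw, unstructured data."""
    return RAW_HTML


def transform(raw_html):
    """T: turn the raw HTML into structured records."""
    parser = LinkParser()
    parser.feed(raw_html)
    parser.close()
    return parser.rows


def load(rows, out):
    """L: store the structured records, here as CSV text."""
    writer = csv.DictWriter(out, fieldnames=['title', 'url'])
    writer.writeheader()
    writer.writerows(rows)


rows = transform(extract())
buffer = io.StringIO()
load(rows, buffer)
print(buffer.getvalue())
```

In a real scraper, extract would fetch the page over HTTP and load would write to a file or database; the shape of the pipeline stays the same.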

III. How to Handle Unstructured Data

Web crawler architecture: [figure]

IV. Requests

A package for fetching network resources (URLs)

It can access network resources with the REST operations (POST, PUT, GET, DELETE)
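Each of these operations has a matching function in Requests (requests.get, requests.post, requests.put, requests.delete). As a rough sketch, the requests below are only prepared and never sent, so nothing touches the network. The URL https://example.com/api/items and the payloads are placeholders made up for illustration.

```python
import requests

# Placeholder endpoint, assumed for illustration only.
URL = 'https://example.com/api/items'

prepared = []
for method, kwargs in [
    ('GET', {'params': {'cid': 57918}}),    # query-string parameters
    ('POST', {'data': {'name': 'demo'}}),   # form-encoded body
    ('PUT', {'data': {'name': 'renamed'}}),
    ('DELETE', {}),
]:
    # Request(...).prepare() builds the HTTP request without sending it.
    req = requests.Request(method, URL, **kwargs).prepare()
    prepared.append(req)
    print(req.method, req.url, req.body)
```

To actually send one of them, call the matching function directly, for example requests.get(URL, params={'cid': 57918}).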

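A simple scraper:

In Chrome, right-click the page and choose Inspect, open the Network tab, refresh the page you want to scrape, and select the Doc filter.

Select the first entry under Name and click Headers on the right. This shows the page's URL, request method, and the other information we need, as shown in the figure below.

For this page the Request Method is GET, so we also use the get method when fetching it with Requests.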

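import requests
# Fetch the page with GET, matching the Request Method shown in the browser
htl = requests.get('http://mil.news.sina.com.cn/roll/index.d.html?cid=57918')
htl.encoding = 'utf-8'  # set the encoding explicitly to avoid garbled characters
print(htl.text)  # the response body, i.e. the page's HTML source
# print(htl) would output <Response [200]> instead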

The page fetched here is from Sina.
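In practice a fetch can fail or come back in an unexpected encoding. A slightly more defensive version might look like the sketch below. The function name fetch_html and its get parameter are assumptions added here (the parameter only exists so the HTTP call can be replaced by a stub in tests); timeout, raise_for_status(), and apparent_encoding are part of Requests.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the decoded HTML of `url`, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the
    network call can be swapped out in tests (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)  # give up instead of hanging forever
    response.raise_for_status()           # raises requests.HTTPError on 4xx/5xx
    if not response.encoding or response.encoding.lower() == 'iso-8859-1':
        # ISO-8859-1 is what Requests falls back to for text responses that
        # declare no charset, so treat it as "not declared" and use the
        # encoding guessed from the body instead.
        response.encoding = response.apparent_encoding
    return response.text
```

For the page above this would be called as fetch_html('http://mil.news.sina.com.cn/roll/index.d.html?cid=57918').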

V. Parsing Page Elements with BeautifulSoup

DOM Tree

1. BeautifulSoup.text strips the tags from the page's elements and returns only the text inside them

from bs4 import BeautifulSoup
html_sample='<html>' \
            '<body>' \
            '<h1 id="title">Hello World</h1>' \
            '<a href="#" class="link">This is link1</a>' \
            '<a href="#" class="link">This is link2</a>' \
            '</body>' \
            '</html>'
soup=BeautifulSoup(html_sample,'html.parser')# 'html.parser' names the HTML parser to use; omitting it triggers a warning
print(soup.text)

Output: Hello WorldThis is link1This is link2
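Because .text simply concatenates every text node, the three strings run together. If that is a problem, get_text() accepts a separator and a strip flag, and stripped_strings yields the pieces one at a time. A small sketch using the same sample markup:

```python
from bs4 import BeautifulSoup

html_sample = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

# Join the text nodes with a space instead of running them together.
joined = soup.get_text(separator=' ', strip=True)
# Or keep each text node as its own list item.
pieces = list(soup.stripped_strings)

print(joined)
print(pieces)
```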

2. BeautifulSoup.select finds the HTML elements with a given tag

from bs4 import BeautifulSoup
html_sample='<html>' \
            '<body>' \
            '<h1 id="title">Hello World</h1>' \
            '<a href="#" class="link">This is link1</a>' \
            '<a href="#" class="link">This is link2</a>' \
            '</body>' \
            '</html>'
soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select('h1')  # select() returns a list of every matching element
print(header[0].text)
links = soup.select('a')
for link in links:
    print(link.text)

Output:
Hello World
This is link1
This is link2
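select always returns a list, which is why the heading is read with header[0]. When only the first match is needed, select_one returns a single element (or None if nothing matches), and a tag's attributes can be read like dictionary keys. The sketch below uses made-up href values, since the sample above has "#" for both links.

```python
from bs4 import BeautifulSoup

# Same structure as the sample above, but with hypothetical link targets.
html_sample = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="/news/1.html" class="link">This is link1</a>'
    '<a href="/news/2.html" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

title = soup.select_one('h1').text              # first match only
hrefs = [a['href'] for a in soup.select('a')]   # read the href attribute of each link
missing = soup.select_one('table')              # no <table> in the sample

print(title)
print(hrefs)
print(missing)
```

Checking for None before touching .text avoids an AttributeError on pages where the element is absent.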

3. BeautifulSoup.select can also find elements by their CSS attributes