Python Web Scraping (Part 2)

Latest recommended article published on 2024-03-11 22:34:46

及时行乐及生悲

Article link: https://blog.csdn.net/Hanxuwei/article/details/111990885

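Basic usage of urllib
1. Reading and displaying web page content

import urllib.request

fp = urllib.request.urlopen('http://www.python.org')
print(fp.read(100))           # read the first 100 bytes (a bytes object)
print(fp.read(100).decode())  # read the next 100 bytes and decode them as UTF-8
fp.close()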

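2. Submitting parameters to a web page

(1) Using the GET method to read and display the content of a given URL.

import urllib.request
import urllib.parse

# Encode the parameters as a query string: spam=1&eggs=2&bacon=0
params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
url = "http://www.musi-cal.com/cgi-bin/query?%s" % params
with urllib.request.urlopen(url) as f:
    print(f.read().decode('utf-8'))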
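The encoding step can be tried on its own, without sending a request, to see exactly what ends up in the URL. This is a minimal sketch; the helper name build_get_url is made up for illustration and is not part of urllib.

```python
import urllib.parse


def build_get_url(base_url, params):
    """Append URL-encoded query parameters to base_url (hypothetical helper)."""
    query = urllib.parse.urlencode(params)
    return f"{base_url}?{query}"


url = build_get_url("http://www.musi-cal.com/cgi-bin/query",
                    {"spam": 1, "eggs": 2, "bacon": 0})
print(url)  # http://www.musi-cal.com/cgi-bin/query?spam=1&eggs=2&bacon=0

# parse_qs reverses the encoding; every value comes back as a list of strings
print(urllib.parse.parse_qs(urllib.parse.urlsplit(url).query))
```

Note that the numbers become strings once encoded, so a server (or parse_qs) sees '1', not 1.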

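(2) Using the POST method to submit parameters and read the content of a given page.

import urllib.request
import urllib.parse

data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
data = data.encode('ascii')  # the request body must be bytes
# Passing data makes urlopen() send a POST request instead of GET
with urllib.request.urlopen("http://requestb.in/xrbl82xr", data) as f:
    print(f.read().decode('utf-8'))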
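How urllib chooses between GET and POST can be checked offline by building a Request object and inspecting it. A small sketch, reusing the URL from the example above purely as a placeholder; nothing is sent over the network.

```python
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({"spam": 1, "eggs": 2, "bacon": 0}).encode("ascii")

# Building a Request does not open a connection
post_req = urllib.request.Request("http://requestb.in/xrbl82xr", data=data)
print(post_req.get_method())  # POST, because data is not None
print(post_req.data)          # b'spam=1&eggs=2&bacon=0'

# Without data, the same URL would be fetched with GET
get_req = urllib.request.Request("http://requestb.in/xrbl82xr")
print(get_req.get_method())   # GET
```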

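3. Accessing a page through an HTTP proxy

import urllib.request

proxies = {'http': 'http://proxy.example.com:8080/'}
# FancyURLopener is deprecated; ProxyHandler with build_opener() is the supported way
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
with opener.open("http://www.python.org") as f:
    print(f.read().decode('utf-8'))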

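The requests library for fetching web pages

Overview of the requests library

requests is a concise third-party library for handling HTTP requests. It is built on top of the urllib3 library and wraps it in a simpler interface.

requests supports URL fetching, HTTP keep-alive and connection pooling, automatic content decoding, chunked file upload, connection timeouts, and streaming downloads.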

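Understanding the requests library

The requests.get() method is the main entry point for web scraping and for submitting information to a server.

res = requests.get(url[, timeout=n])

The function stores the returned page in a Response object. The url parameter must use the HTTP or HTTPS scheme, and the optional timeout parameter sets the timeout, in seconds, for each request.

The Response object returned by requests.get() represents the server's response. Its main attributes are as follows.

● status_code: the HTTP status code of the response; 200 means the request succeeded, and 404 means the page was not found.

● text: the response body as a string, that is, the content of the page at url.

● encoding: the encoding used to decode the response body into text.

● content: the response body in binary (bytes) form.

The Response object also provides two methods.

json(): parses the response body as JSON and returns the result, provided the body contains JSON data.

raise_for_status(): raises requests.HTTPError if status_code indicates an error (a 4xx or 5xx code).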
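The attributes and methods above are usually combined into one small fetch routine. The following is a sketch rather than a canonical pattern; the function name fetch_text and the default timeout of 10 seconds are assumptions made for illustration.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        # RequestException also covers timeouts and connection errors
        print(f"request failed: {exc}")
        return None
    return r.text
```

A call such as fetch_text("http://www.python.org") would then return the page as a string, or None after printing the reason for the failure.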

Basic requests operations
(1) Adding headers and setting the user agent

import requests

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)
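To confirm which headers would actually go out, a request can be prepared without being sent. A minimal sketch using the same placeholder endpoint; Request.prepare() and Session.prepare_request() are part of the requests API, and no network access takes place here.

```python
import requests

url = "https://api.github.com/some/endpoint"
headers = {"user-agent": "my-app/0.0.1"}

# prepare() builds the request as it would be sent, without sending it
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive

# With no custom header, a Session falls back to the library default
session = requests.Session()
default = session.prepare_request(requests.Request("GET", url))
print(default.headers["User-Agent"])   # python-requests/<installed version>
```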

(2) Accessing a web page and submitting data

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)  # sent as form-encoded data
print(r.text)  # view the page content; output omitted

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
r = requests.post(url, json=payload)  # sent as a JSON body
print(r.text)     # view the page content; output omitted
print(r.headers)  # view the response headers; output omitted

print(r.headers['Content-Type'])
application/json; charset=utf-8

print(r.headers['Content-Encoding'])
gzip
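The difference between the data and json parameters shows up in the request body and its Content-Type header. The sketch below prepares both requests without sending them, using the same placeholder URLs as above.

```python
import requests

# data= encodes the dictionary as an HTML form
form = requests.Request("POST", "http://httpbin.org/post",
                        data={"key1": "value1", "key2": "value2"}).prepare()
print(form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form.body)                     # key1=value1&key2=value2

# json= serializes the dictionary and sets a JSON content type
js = requests.Request("POST", "https://api.github.com/some/endpoint",
                      json={"some": "data"}).prepare()
print(js.headers["Content-Type"])    # application/json
print(js.body)                       # the payload serialized as JSON
```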

(3) Getting and setting cookies

Reading the cookies attribute after fetching a page with get():

r = requests.get("http://www.baidu.com/")
r.cookies  # view the cookies

Passing the cookies parameter when fetching a page with get():

url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)  # set the cookies
print(r.text)
{
  "cookies": {
    "cookies_are": "working"
  }
}
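Cookies passed this way end up in a single Cookie request header. The sketch below prepares the request offline to show that header, first from a plain dictionary and then from a RequestsCookieJar restricted to a domain and path; the values are the same illustrative ones as above.

```python
import requests

url = "http://httpbin.org/cookies"

# A plain dict is converted to a cookie jar behind the scenes
prepared = requests.Request("GET", url,
                            cookies={"cookies_are": "working"}).prepare()
print(prepared.headers["Cookie"])  # cookies_are=working

# A cookie jar allows the cookie to be limited to a domain and path
jar = requests.cookies.RequestsCookieJar()
jar.set("cookies_are", "working", domain="httpbin.org", path="/cookies")
scoped = requests.Request("GET", url, cookies=jar).prepare()
print(scoped.headers["Cookie"])    # cookies_are=working
```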

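The beautifulsoup4 library
1. Overview of beautifulsoup4

The beautifulsoup4 library is also known as bs4 or BeautifulSoup.

It is a third-party Python library for parsing web pages, used to quickly convert fetched pages into a structure that is easy to query.

beautifulsoup4 converts a web page into a DOM tree.

beautifulsoup4 provides simple methods and Python-like syntax for searching, navigating, and modifying the resulting DOM tree, and it automatically converts incoming documents to Unicode.

beautifulsoup4 objects

BeautifulSoup converts an HTML document into a tree structure in which every node is an object. These objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Tag object: a tag in the HTML document.

NavigableString object: holds the text inside a tag; a tag's string attribute returns a NavigableString object.

BeautifulSoup object: represents the entire content of a document; most of the time it can be treated as a special Tag.

Comment object: a special type of NavigableString object whose content does not include the comment delimiters.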
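The four object types can be seen side by side in a very small document. This sketch assumes a made-up one-line snippet and uses html.parser, the parser that ships with Python, so that lxml is not required.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical one-line document containing text and a comment
soup = BeautifulSoup("<p>text<!-- a note --></p>", "html.parser")
text, comment = soup.p.contents

print(type(soup).__name__)     # BeautifulSoup
print(type(soup.p).__name__)   # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
print(repr(str(comment)))      # ' a note ' (no <!-- --> delimiters)

# Comment is a subclass of NavigableString
print(isinstance(comment, NavigableString))  # True
```

Because a Comment is also a NavigableString, code that needs to skip comments should test for Comment specifically.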

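beautifulsoup4: working with the parse tree

Traversing the document tree

(1) Getting direct child nodes

The contents and children attributes return the direct children of a Tag: contents is a list, and children is an iterator.

(2) Getting all descendant nodes

The descendants attribute iterates recursively over all descendants of a Tag; loop over it to read the content.

(3) Getting node content

If a tag contains no other tags, its string attribute returns the text in the tag;

if a tag contains exactly one nested tag, string returns the text of the innermost tag;

if a Tag contains multiple child nodes, string is None.

(4) Getting multiple pieces of content

The strings attribute yields every text string inside a tag; loop over it to read them.

(5) Parent node

The parent node is the node one level above the current node; the parent attribute returns it.

(6) Sibling nodes

Sibling nodes are nodes at the same level as the current node. The next_sibling attribute returns the next sibling of the current node, and previous_sibling returns the previous one; if no such node exists, the attribute is None.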
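The six cases above can be exercised on one short snippet. The markup here is hypothetical and was chosen only so that each case appears once; html.parser is used to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering each case described above
html = "<div><p>one</p><p>two <b>bold</b></p><p><i>deep</i></p></div>"
div = BeautifulSoup(html, "html.parser").div
first, second, third = div.contents  # contents is a list of direct children

print([child.name for child in div.children])  # ['p', 'p', 'p']
print(len(list(div.descendants)))              # counts nested tags and strings
print(first.string)                  # one   (no nested tags)
print(third.string)                  # deep  (exactly one nested tag)
print(second.string)                 # None  (several children)
print(list(second.strings))          # ['two ', 'bold']
print(first.parent.name)             # div
print(first.next_sibling is second)  # True
print(first.previous_sibling)        # None, there is no earlier sibling
```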

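Searching the document tree

(1) The find_all() method

Searches all descendants of the current Tag. The syntax is as follows.

find_all(name, attrs, recursive, text, limit, **kwargs)

name: matches tags whose name is name.

attrs: filters by tag attribute values, given as a dictionary.

recursive: pass recursive=False to search only the direct children of the Tag.

text: searches the text content of the document instead of tags (called string since Beautiful Soup 4.4.0).

limit: limits the number of results returned.

(2) The find() method

The find() method returns the first match, or None if nothing matches.

find(name, attrs, recursive, text)

The parameters have the same meaning as in find_all(), except that find() takes no limit.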
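Each parameter can be tried on a small tree. The markup below is invented for this sketch, and the newer string keyword is used in place of text.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical menu with two direct links and one nested link
html = """<div id="menu">
<a class="nav" href="/home" id="link1">Home</a>
<a class="nav" href="/docs" id="link2">Docs</a>
<span><a class="ext" href="http://example.com/" id="link3">Example</a></span>
</div>"""
menu = BeautifulSoup(html, "html.parser").div

print(len(menu.find_all("a")))                   # 3
print(len(menu.find_all("a", recursive=False)))  # 2, the nested link is skipped
print(menu.find_all(attrs={"class": "ext"}))     # only the Example link
print(menu.find_all("a", limit=1))               # only the Home link
print(menu.find_all(string=re.compile("Doc")))   # ['Docs']
print(menu.find("a")["id"])                      # link1
print(menu.find("table"))                        # None
```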

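(3) Filtering elements with CSS selectors

CSS selectors are used to select page elements. The basic kinds are tag selectors, class selectors, and id selectors.

In CSS, a tag name is written without any prefix, a class name is prefixed with a dot (.), and an id is prefixed with a hash sign (#).

The bs4 library supports the same syntax through the soup.select() method, which returns a list of matching tags.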
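The three selector kinds look like this in practice. The list markup is a made-up example; select() is the only bs4 call being demonstrated.

```python
from bs4 import BeautifulSoup

# Hypothetical list used only to demonstrate the selectors
html = """<ul id="langs">
<li class="compiled">Go</li>
<li class="scripting">Python</li>
<li class="scripting" id="fav">Ruby</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

print([li.get_text() for li in soup.select("li")])          # tag selector
print([li.get_text() for li in soup.select(".scripting")])  # class selector
print(soup.select("#fav")[0].get_text())                    # id selector
print(soup.select("ul#langs > li.compiled"))                # selectors can be combined
print(soup.select("table"))                                 # empty when nothing matches
```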

beautifulsoup4: basic operations

from bs4 import BeautifulSoup

BeautifulSoup('hello world!', 'lxml')  # missing tags are added and completed automatically
<html><body><p>hello world!</p></body></html>
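Tag completion is easiest to see with input that opens a tag and never closes it. This is a small sketch with made-up input strings; it uses html.parser, which closes open tags but, unlike lxml, does not wrap the result in html and body tags.

```python
from bs4 import BeautifulSoup

# The unclosed <span> is a made-up input for illustration
soup = BeautifulSoup("<span>hello world!", "html.parser")
print(soup)              # <span>hello world!</span>
print(soup.span.string)  # hello world!

# Nested open tags are closed in the right order as well
nested = BeautifulSoup("<p><b>bold", "html.parser")
print(nested)            # <p><b>bold</b></p>
```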

A brief introduction to BeautifulSoup usage
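The examples in this part operate on an object named soup. Its definition is not shown, but the outputs are consistent with the "three sisters" sample used in the Beautiful Soup documentation. The setup below is therefore an assumption: the markup, the attribute order, and the choice of html.parser are all illustrative.

```python
from bs4 import BeautifulSoup

# Assumed sample document; attribute values follow the outputs shown in this section
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.a)           # the first <a> tag, including its attributes
print(soup.p["class"])  # ['title']
```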

soup.body  # view the content of the body tag
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

soup.p  # view the first paragraph
The Dormouse's story

soup.p['class']  # view a tag attribute
['title']

soup.p.get('class')  # another way to view a tag attribute
['title']

soup.p.text  # view the paragraph text
"The Dormouse's story"

soup.p.contents  # view the paragraph's children
[The Dormouse's story]

soup.a
Elsie

soup.a.attrs  # view all attributes of the tag
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}

import re

soup.find_all(href=re.compile("elsie"))  # find tags whose href contains a given keyword
[Elsie]

soup.find(id='link3')  # find the tag whose id attribute is 'link3'
Tillie

soup.find_all('a', id='link3')  # find a tags whose id attribute is 'link3'
[Tillie]

for link in soup.find_all('a'):
    print(link.text, ':', link.get('href'))
Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie

print(soup.get_text())  # return all the text
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

soup.a['id'] = 'test_link1'  # change the value of a tag attribute
soup.a
Elsie

soup.a.string.replace_with('test_Elsie')  # change the tag text; returns the replaced string
'Elsie'

soup.a.string
'test_Elsie'

for child in soup.body.children:  # iterate over the direct children
    print(child)
The Dormouse's story
Once upon a time there were three little sisters; and their names were
test_Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

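# Placeholder markup: the original string is not shown, so a small document is assumed
test_doc = '<html><head></head><body><p></p><p></p></body></html>'
s = BeautifulSoup(test_doc, 'lxml')

for child in s.html.children:  # iterate over the direct children
    print(child)

for child in s.html.descendants:  # iterate over all descendants
    print(child)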
