python代码网站_求python抓网页的代码

最新推荐文章于 2024-04-06 14:41:34 发布

weixin_39738115

最新推荐文章于 2024-04-06 14:41:34 发布

阅读量84

点赞数

文章标签： python代码网站

展开全部

python3.x中使用urllib.request模块来抓取网页代码，通过urllib.request.urlopen函数e69da5e6ba9062616964757a686964616f31333337626230取网页内容，获取的为数据流，通过read()函数把数字读取出来，再把读取的二进制数据通过decode函数解码(编号可以通过查看网页源代码中得知，如下例中为gbk编码。)，这样就得到了网页的源代码。

如下例所示，抓取本页代码：import urllib.request

html = urllib.request.urlopen('

).read().decode('gbk') #注意抓取后要按网页编码进行解码

print(html)

以下为urllib.request.urlopen函数说明：

urllib.request.urlopen(url,

data=None, [timeout, ]*, cafile=None, capath=None,

cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to

the server, or None

if no such data is needed. data may also be an iterable object and in

that case Content-Length value must be specified in the headers. Currently HTTP

requests are the only ones that use data; the HTTP request will be a

POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or

sequence of 2-tuples and returns a string in this format. It should be encoded

to bytes before being used as the data parameter. The charset parameter

in Content-Type

header may be used to specify the encoding. If charset parameter is not sent

with the Content-Type header, the server following the HTTP 1.1 recommendation

may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to

use charset parameter with encoding used in Content-Type header with the Request.

urllib.request module uses HTTP/1.1 and includes Connection:close header

in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for

blocking operations like the connection attempt (if not specified, the global

default timeout setting will be used). This actually only works for HTTP, HTTPS

and FTP connections.

If context is specified, it must be a ssl.SSLContext instance describing the various SSL

options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of

trusted CA certificates for HTTPS requests. cafile should point to a

single file containing a bundle of CA certificates, whereas capath

should point to a directory of hashed certificate files. More information can be

found in ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

For http and https urls, this function returns a http.client.HTTPResponse object which has the

following HTTPResponse

Objects methods.

For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a

urllib.response.addinfourl object which can work as context manager and has methods such as

geturl() — return the URL of the resource retrieved,

commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such

as headers, in the form of an email.message_from_string() instance (see Quick

Reference to HTTP Headers)

getcode() – return the HTTP status code of the response.

Raises URLError on errors.

Note that None

may be returned if no handler handles the request (though the default installed

global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment

variable like http_proxy is set), ProxyHandler is default installed and makes sure the

requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has

been discontinued; urllib.request.urlopen() corresponds to the old

urllib2.urlopen.

Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be

obtained by using ProxyHandler objects.

Changed in version 3.2: cafile

and capath were added.

Changed in version 3.2: HTTPS virtual

hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be

an iterable object.

Changed in version 3.3: cadefault

was added.

Changed in version 3.4.3: context

was added.

weixin_39738115

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python代码网站_求python抓网页的代码

展开全部python3.x中使用urllib.request模块来抓取网页代码，通过urllib.request.urlopen函数e69da5e6ba9062616964757a686964616f31333337626230取网页内容，获取的为数据流，通过read()函数把数字读取出来，再把读取的二进制数据通过decode函数解码(编号可以通过查看网页源代码中得知，如下例中为gbk编码。)，这...
复制链接

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。