
The Python urllib module

Reposted on 2015-11-18 22:36:46

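urllib provides a set of functions for working with URLs.

GET

The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response. This is done with the urlopen function, whose argument is either a URL string or a Request object, and which returns an HTTPResponse object.
For example, to fetch the Douban URL https://api.douban.com/v2/book/2129650 and print the response: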

from urllib import request

url = 'https://api.douban.com/v2/book/2129650'
# urlopen accepts a URL string or a Request object and returns an HTTPResponse
with request.urlopen(url) as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

The HTTPResponse object is described as follows:
An HTTPResponse instance wraps the HTTP response from the server. It provides access to the response headers and the entity body. The response is an iterable object and can be used in a with statement.

HTTPResponse.read([amt])

Reads and returns the response body, or up to the next amt bytes.

HTTPResponse.readinto(b)

Reads up to the next len(b) bytes of the response body into the buffer b. Returns the number of bytes read.

New in version 3.3.

HTTPResponse.getheader(name, default=None)

Return the value of the header name, or default if there is no header matching name. If there is more than one header with the name name, return all of the values joined by ', '. If default is any iterable other than a single string, its elements are similarly returned joined by commas.

HTTPResponse.getheaders()

Return a list of (header, value) tuples.

HTTPResponse.fileno()

Return the fileno of the underlying socket.

HTTPResponse.msg

A http.client.HTTPMessage instance containing the response headers. http.client.HTTPMessage is a subclass of email.message.Message.

HTTPResponse.version

HTTP protocol version used by server. 10 for HTTP/1.0, 11 for HTTP/1.1.

HTTPResponse.status

Status code returned by server.

HTTPResponse.reason

Reason phrase returned by server.

HTTPResponse.debuglevel

A debugging hook. If debuglevel is greater than zero, messages will be printed to stdout as the response is read and parsed.

HTTPResponse.closed

Is True if the stream is closed.
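
To see several of these attributes and methods together without depending on an external site, here is a minimal sketch that starts a throwaway HTTP server on 127.0.0.1 and fetches from it. HelloHandler, fetch_summary and the response body are hypothetical names made up for this illustration. The empty ProxyHandler is only there so that system proxy settings do not interfere with the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short UTF-8 body."""

    body = 'hello, urllib'.encode('utf-8')

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(self.body)))
        self.end_headers()
        self.wfile.write(self.body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_summary():
    """Fetch from a local demo server and collect what HTTPResponse exposes."""
    server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0: the OS picks one
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = 'http://127.0.0.1:%d/' % server.server_port
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = request.build_opener(request.ProxyHandler({}))
        with opener.open(url) as f:
            summary = {
                'status': f.status,
                'reason': f.reason,
                'version': f.version,
                'content_type': f.getheader('Content-Type'),
                'missing': f.getheader('X-Not-There', 'n/a'),  # falls back to default
                'first5': f.read(5),  # up to the next 5 bytes
                'rest': f.read(),     # whatever is left of the body
            }
        summary['closed'] = f.closed  # the with block has closed the response
        return summary
    finally:
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    for key, value in fetch_summary().items():
        print('%s: %r' % (key, value))
```

Calling read(5) and then read() shows that the body is consumed as a stream: the second call returns only the bytes that the first call left unread.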

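To simulate a browser sending a GET request, we need to use a Request object. By adding HTTP headers to the Request object, we can disguise the request as coming from a browser; the User-Agent header is the one that identifies the browser. For example, to request the Python home page as Firefox: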

from urllib import request

url = 'https://www.python.org/'
req = request.Request(url)
# The header name is User-Agent, with a hyphen rather than an underscore
req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
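
As for which attributes and methods a Request object has, the following small sketch inspects one without sending anything over the network. It reuses the URL and User-Agent string from the example above. Note that add_header stores the header name capitalized as 'User-agent', which is the spelling has_header and get_header expect.

```python
from urllib import request

req = request.Request('https://www.python.org/')
req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11')

print(req.full_url)                  # the URL passed to the constructor
print(req.type)                      # the URL scheme
print(req.host)                      # the host part of the URL
print(req.get_method())              # 'GET', because no data is attached
print(req.has_header('User-agent'))  # header names are stored capitalized
print(req.get_header('User-agent'))
print(req.header_items())            # list of (name, value) tuples
```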

Simulating a Weibo login with GET:

from urllib import request

print('Login to weibo.cn...')

url='https://passport.weibo.cn/sso/login?username=xxxxxx&password=xxxxxx'
print(url)

req=request.Request(url)
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
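
The query string in the URL above is written by hand. When the values contain characters such as @, & or spaces, it is safer to let parse.urlencode do the escaping. A minimal sketch follows; build_login_url and the sample credentials are hypothetical, made up only for illustration.

```python
from urllib import parse


def build_login_url(username, password):
    """Return the login URL with both values percent-encoded in the query string."""
    query = parse.urlencode({'username': username, 'password': password})
    return 'https://passport.weibo.cn/sso/login?' + query


url = build_login_url('me@example.com', 'p&ss word')
print(url)

# parse_qs reverses the encoding, which makes the round trip easy to check
print(parse.parse_qs(parse.urlsplit(url).query))
```

The resulting string can be passed to request.Request(url) exactly as in the example above.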

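POST

To send a request as a POST, just pass the data parameter as bytes.

We simulate a Weibo login: first read the login email and password, then pass them encoded as username=xxx&password=xxx, following the format used by the weibo.cn login page: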

from urllib import request, parse

print('Login to weibo.cn...')
url = 'https://passport.weibo.cn/sso/login'
email = input('Email: ')
password = input('Password: ')
# urlencode turns the (name, value) pairs into a form-encoded string
login_data = parse.urlencode([
    ('username', email),
    ('password', password),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])
req = request.Request(url)
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

# Passing data (as bytes) makes urlopen send a POST instead of a GET
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
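
The switch from GET to POST can be checked without contacting the server. The sketch below builds the body the same way as the example above, with made-up credentials and a shortened field list, and shows that a Request reports POST as soon as it carries data. Here the body is given to the Request constructor instead of to urlopen; both attach it to the request.

```python
from urllib import parse, request

# Placeholder credentials, for illustration only
login_data = parse.urlencode([
    ('username', 'me@example.com'),
    ('password', 'secret'),
    ('savestate', '1'),
])
body = login_data.encode('utf-8')  # the request body must be bytes, not str

url = 'https://passport.weibo.cn/sso/login'
get_req = request.Request(url)
post_req = request.Request(url, data=body)

print(body)
print(get_req.get_method())   # no data attached
print(post_req.get_method())  # data attached
```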

