Learning Web Scraping: Some Hands-On Practice


Preface: thanks to 老污龟.
[Repost] The urllib.request module in Python 3 (Chinese).

1. The urllib.request module

urllib.request.urlopen

The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Some parameters

1. url

Open the URL url, which can be either a string or a Request object.

The Request class

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
This class is an abstraction of a URL request.

When calling urllib.request.urlopen(), the url argument can be a Request object directly.

For example:

url = 'http://placekitten.com/g/500/500'
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)

Also, note the second parameter data: just as with urlopen(), we can attach the form data here instead of passing it to urlopen() separately.
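The two ways of attaching the body can be sketched like this (the form field below is a placeholder, not a real API parameter; the urlopen() calls are commented out since they need network access):

```python
import urllib.parse
import urllib.request

url = 'http://placekitten.com/g/500/500'
form = {'key': 'value'}                        # placeholder form field
data = urllib.parse.urlencode(form).encode('utf-8')

# Style 1: attach the body when constructing the Request
req = urllib.request.Request(url, data)
# response = urllib.request.urlopen(req)

# Style 2: hand the body to urlopen() together with the URL
# response = urllib.request.urlopen(url, data)

print(req.data)           # the encoded body is stored on the Request
print(req.get_method())   # a Request carrying data defaults to POST
```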

headers

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header value, which is used by a browser to identify itself — some HTTP servers only allow requests coming from common browsers as opposed to scripts.

Sometimes we want to disguise the client, which means changing the User-Agent; to do that we modify the headers parameter.

e.g.1

head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

data = {}
data['i'] = content          # content: the text to translate (defined elsewhere)
data['doctype'] = 'json'

data = urllib.parse.urlencode(data).encode('utf-8')

req = urllib.request.Request(url, data, head)

response = urllib.request.urlopen(req)

Besides setting headers directly, we can also use the add_header() method.

Request.add_header

Request.add_header(key, val)

Add another header to the request.
e.g.2

data = {}
data['i'] = content          # content: the text to translate (defined elsewhere)
data['doctype'] = 'json'

data = urllib.parse.urlencode(data).encode('utf-8')

req = urllib.request.Request(url, data)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36')

response = urllib.request.urlopen(req)

2. data

data must be an object specifying additional data to send to the server, or None if no such data is needed.

Currently HTTP requests are the only ones that use data. The supported object types include bytes, file-like objects, and iterables. If no Content-Length nor Transfer-Encoding header field has been provided, HTTPHandler will set these headers according to the type of data. Content-Length will be used to send bytes objects, while Transfer-Encoding: chunked as specified in RFC 7230, Section 3.3.1 will be used to send files and other iterables.

For an HTTP POST request method, data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII string in this format. It should be encoded to bytes before being used as the data parameter.
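As a quick sketch of that buffer format: urlencode() accepts either a mapping or a sequence of 2-tuples (the latter preserves order and allows repeated keys), and the result still has to be encoded to bytes:

```python
import urllib.parse

# a mapping...
print(urllib.parse.urlencode({'i': 'hello', 'doctype': 'json'}))   # i=hello&doctype=json

# ...or a sequence of 2-tuples, which allows repeated keys
print(urllib.parse.urlencode([('q', 'a'), ('q', 'b')]))            # q=a&q=b

# encode to bytes before using it as the data parameter
buf = urllib.parse.urlencode({'i': 'hello'}).encode('utf-8')
print(type(buf))                                                   # <class 'bytes'>
```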

Looking at the form data Youdao Translate submits (visible in Chrome's developer tools), this is the data we need. We then use urllib.parse.urlencode() to turn it into bytes.

Some methods: geturl(), info(), getcode()

geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)

getcode() – return the HTTP status code of the response.

An example:

>>> import urllib.request
>>> response = urllib.request.urlopen("http://www.fishc.com")
>>> type(response)
<class 'http.client.HTTPResponse'>

>>> response.geturl()     # get the final URL
'https://ilovefishc.com/'

>>> response.info()     # get the meta-information
<http.client.HTTPMessage object at 0x00000239F755BBC8>

>>> print(response.info())
Server: Tengine
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Date: Mon, 27 Apr 2020 09:55:23 GMT
Last-Modified: Thu, 13 Feb 2020 09:27:50 GMT
Vary: Accept-Encoding
ETag: "5e451696-e1a"
Via: cache24.l2et2[124,0], kunlun6.cn171[1320,0]
Timing-Allow-Origin: *
EagleId: ddb5c89d15879813233984199e

>>> response.getcode()    # 200 means the page opened normally
200

response.read()

This reads the page content and returns bytes, so the result needs to be decoded with decode():

response = urllib.request.urlopen(req)

html = response.read().decode('utf-8')

2. The urllib.parse module

urllib.parse.urlencode

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

Pass in a dict
Get back a string
Encode it to bytes

data = {}

data['i'] = 'c'
data['doctype'] = 'json'

print(data)
print(type(data))   # dict

data = urllib.parse.urlencode(data)

print(data)
print(type(data))   # str

data = data.encode("utf-8")   # encode with utf-8

print(data)
print(type(data))   # bytes

Output:

{'i': 'c', 'doctype': 'json'}
<class 'dict'>

i=c&doctype=json
<class 'str'>

b'i=c&doctype=json'
<class 'bytes'>

3. The json module

[Reference] The Python json module explained in depth (Chinese).

json.loads()

def loads(s, encoding=None, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
# Deserialize a JSON document (a str) into a Python object (a dict here)
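A minimal example of that round trip:

```python
import json

doc = '{"i": "hello", "doctype": "json", "errorCode": 0}'

obj = json.loads(doc)       # deserialize the JSON string into a Python object
print(type(obj))            # <class 'dict'>
print(obj['errorCode'])     # 0
```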

4. Hands-on practice

1. Downloading an image

import urllib.request as u

url = 'http://placekitten.com/g/500/500'

response = u.urlopen(url)

html = response.read()

with open('cat_g_500_500.jpg','wb') as file:
    file.write(html)

2. Youdao Dictionary

import urllib.request
import json
import urllib.parse

content = input("Enter the text to translate: ")

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

data = {}

data['i'] = content
data['doctype'] = 'json'

data = urllib.parse.urlencode(data).encode('utf-8')


req = urllib.request.Request(url, data)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36')
response = urllib.request.urlopen(req)

html = response.read().decode('utf-8')

target = json.loads(html)                       # json.loads() returns a dict
print(target['translateResult'][0][0]['tgt'])   # print the translated text
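The chained indexing above assumes a response shaped roughly like the mock below (field names based on what Youdao returned at the time; the real payload may differ):

```python
import json

# a mock of the Youdao response body, for illustration only
html = json.dumps({
    'type': 'EN2ZH_CN',
    'errorCode': 0,
    'translateResult': [[{'src': 'hello', 'tgt': '你好'}]],
})

target = json.loads(html)
print(target['translateResult'][0][0]['tgt'])   # 你好
```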

5. Closing words

I'm a beginner here; corrections are welcome.
