The general format of a URL (items in square brackets [] are optional):
protocol :// hostname[:port] / path / [;parameters][?query]#fragment
A URL consists of three parts:
The first part is the protocol: http, https, ftp, file, ed2k, etc.
The second part is the domain name or IP address of the server hosting the resource (sometimes including a port number; each transfer protocol has a default port, e.g. HTTP's default port is 80).
The third part is the concrete address of the resource, such as a directory or file name.
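These components can be inspected programmatically with urllib.parse.urlparse. A minimal sketch (the URL below is made up purely to illustrate each part):

```python
from urllib.parse import urlparse

# Hypothetical URL containing every optional component
parts = urlparse("http://www.example.com:80/path/to/page;params?key=value#section")
print(parts.scheme)    # the protocol: 'http'
print(parts.hostname)  # the host name: 'www.example.com'
print(parts.port)      # the port as an int: 80
print(parts.path)      # the resource path: '/path/to/page'
print(parts.params)    # parameters of the last path segment: 'params'
print(parts.query)     # the query string: 'key=value'
print(parts.fragment)  # the fragment: 'section'
```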
The relevant module in Python 3: urllib (a package)
urllib is a package that collects several modules for working with URLs:
• urllib.request for opening and reading URLs
• urllib.error containing the exceptions raised by urllib.request
• urllib.parse for parsing URLs
• urllib.robotparser for parsing robots.txt files
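As a quick offline illustration of one of these submodules, urllib.robotparser can evaluate robots.txt rules without touching the network, by feeding the parser the rules line by line (the rules and URLs below are invented for the example):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse a hypothetical robots.txt from a list of lines
# instead of fetching one over the network
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "http://example.com/private/secret.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))           # True
```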
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Open the URL url, which can be either a string or a Request object.
Example:
import urllib.request
response = urllib.request.urlopen(url)
html = response.read()       # bytes
html = html.decode('utf-8')  # decode to str
print(html)
import urllib.request
response = urllib.request.urlopen("http://placekitten.com/g/200/300")
cat_img = response.read()  # the content returned is binary (bytes)
with open('cat_200_300.jpg', 'wb') as f:
    f.write(cat_img)
The code above can also be written differently: note how a Request object is built first, then opened with that object. Also look at the related methods on the response, such as:
response.getcode()
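A sketch of that Request-based variant; the offline part (building the Request and inspecting it) runs as-is, while the urlopen call, which needs network access, is shown in comments:

```python
import urllib.request

req = urllib.request.Request("http://placekitten.com/g/200/300")
print(req.full_url)      # the URL stored on the request
print(req.get_method())  # 'GET', since no data was supplied

# Opening the Request works exactly like opening a URL string
# (requires network access):
# response = urllib.request.urlopen(req)
# print(response.getcode())  # the HTTP status code, e.g. 200
# with open('cat_200_300.jpg', 'wb') as f:
#     f.write(response.read())
```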
Example 2: Youdao Dictionary
Use the browser's source-inspection tools to see what data the page transmits; the tools differ between browsers, e.g. Chrome and Firefox.
I am using Firefox.
Note the Request Headers section under Headers:
In particular, look at User-Agent: servers commonly use it to tell whether a site is being accessed by a program or by a person.
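Setting a browser-like User-Agent on a Request is straightforward; a minimal sketch (the UA string below is just an example, not a requirement of Youdao's API):

```python
import urllib.request

# A hypothetical browser User-Agent string, used only for illustration
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'
}
req = urllib.request.Request("http://fanyi.youdao.com/translate", headers=headers)
# Request normalizes header keys with str.capitalize()
print(req.get_header('User-agent'))

# Headers can also be added after construction:
req.add_header('Referer', 'http://fanyi.youdao.com/')
print(req.has_header('Referer'))  # True
```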
Note: scraping Youdao Translate may fail with {'errorcode': 50}.
Workaround:
Remove the _o after translate in the URL obtained from the inspector; the error then disappears and scraping works normally. I don't know why.
Next, look at Form Data: the form fields.
Also: apart from the doctype and i keys, the other fields in data can be deleted and the translation still runs normally.
The returned data is in JSON format.
My program:
import urllib.request
import urllib.parse
import json

content = input('Enter the text to translate: ')
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
data = {}
data['i'] = content
data['from'] = 'AUTO'
data['to'] = 'AUTO'
data['smartresult'] = 'dict'
data['client'] = 'fanyideskweb'
data['salt'] = '15890140685391'
data['sign'] = '483f27800c357ce0e9a60057df27dda1'
data['ts'] = '1589014068539'
data['bv'] = 'abf85f8020851128b561472c8a7b924d'
data['doctype'] = 'json'
data['version'] = '2.1'
data['keyfrom'] = 'fanyi.web'
data['action'] = 'FY_BY_CLICKBUTTION'
data = urllib.parse.urlencode(data).encode('utf-8')  # encode the data into the format the page expects, handling Chinese correctly
response = urllib.request.urlopen(url, data)
html = response.read().decode('utf-8')  # decode the returned bytes
# print(html)
target = json.loads(html)
# print(target['translateResult'][0][0]['src'])
print('Translation result:', target['translateResult'][0][0]['tgt'])
From the help documentation:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Open the URL url, which can be either a string or a Request object.
data must be an object specifying additional data to be sent to the server, or None if no such data is needed. See Request for details.
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
This class is an abstraction of a URL request.
url should be a string containing a valid URL.
data must be an object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data. The supported object types include bytes, file-like objects, and iterables of bytes-like objects. If no Content-Length nor Transfer-Encoding header field has been provided, HTTPHandler will set these headers according to the type of data. Content-Length will be used to send bytes objects, while Transfer-Encoding: chunked as specified in RFC 7230, Section 3.3.1 will be used to send files and other iterables.
For an HTTP POST request method, data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII string in this format. It should be encoded to bytes before being used as the data parameter.
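For instance, a short sketch of preparing POST data with urlencode (the field names and values are made up for the example):

```python
from urllib.parse import urlencode

# Hypothetical form fields
fields = {'i': 'hello', 'from': 'AUTO', 'to': 'AUTO'}
encoded = urlencode(fields)     # ASCII str: 'i=hello&from=AUTO&to=AUTO'
data = encoded.encode('utf-8')  # bytes, ready to pass as the data parameter
print(encoded)
print(data)
```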
headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to "spoof" the User-Agent header value, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).
An appropriate Content-Type header should be included if the data argument is present. If this header has not been provided and data is not None, Content-Type: application/x-www-form-urlencoded will be added as a default.
The next two arguments are only of interest for correct handling of third-party HTTP cookies:
origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.
unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.
method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise. Subclasses may indicate a different default method by setting the method attribute in the class itself.
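These defaults can be checked directly on Request objects without any network access; a short sketch (the URL is illustrative):

```python
import urllib.request

url = "http://www.example.com/"

# No data and no explicit method -> defaults to GET
r1 = urllib.request.Request(url)
print(r1.get_method())  # 'GET'

# data supplied -> defaults to POST
r2 = urllib.request.Request(url, data=b"key=value")
print(r2.get_method())  # 'POST'

# An explicit method overrides the default
r3 = urllib.request.Request(url, method='HEAD')
print(r3.get_method())  # 'HEAD'
```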