python: Web Crawler Learning and Teaching (1)

The general format of a URL (parts in square brackets [] are optional):

protocol :// hostname[:port] / path / [;parameters][?query]#fragment

A URL consists of three parts:
The first part is the protocol: http, https, ftp, file, ed2k, ...
The second part is the domain name or IP address of the server that hosts the resource (sometimes including a port number; every transfer protocol has a default port, e.g. http defaults to port 80).
The third part is the specific address of the resource, such as a directory or file name.
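
To check these parts programmatically, urllib.parse.urlparse (part of the package introduced below) splits a URL into its components. A minimal sketch, using a made-up URL:

from urllib.parse import urlparse

parts = urlparse('http://www.example.com:80/dir/page.html;type=a?q=cat#top')
print(parts.scheme)    # 'http'                -- the protocol
print(parts.netloc)    # 'www.example.com:80'  -- host name and port
print(parts.path)      # '/dir/page.html'      -- path to the resource
print(parts.params)    # 'type=a'              -- the ;parameters part
print(parts.query)     # 'q=cat'               -- the ?query part
print(parts.fragment)  # 'top'                 -- the #fragment part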

The module to use in Python 3: urllib (a package)

urllib is a package that collects several modules for working with URLs:
    •    urllib.request for opening and reading URLs
    •    urllib.error containing the exceptions raised by urllib.request
    •    urllib.parse for parsing URLs
    •    urllib.robotparser for parsing robots.txt files

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Open the URL url, which can be either a string or a Request object.

Example:

import urllib.request

url = 'http://placekitten.com/'    # any valid URL string works here

response = urllib.request.urlopen(url)

html = response.read()             # read() returns raw bytes

html = html.decode('utf-8')        # decode the bytes into a str

print(html)
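
The module list above also mentions urllib.error; as a minimal sketch (same test URL), here is how the exceptions that urlopen can raise are caught:

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://placekitten.com/', timeout=10)
    html = response.read().decode('utf-8')
except urllib.error.HTTPError as e:
    # the server answered with an error status code, e.g. 404
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e:
    # the request never got a proper answer: DNS failure, refused connection, timeout, ...
    print('URL error:', e.reason)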


Test site: http://placekitten.com/

import urllib.request

response = urllib.request.urlopen("http://placekitten.com/g/200/300")
cat_img = response.read()       # what read() returns here is binary content (the image bytes)

with open('cat_200_300.jpg', 'wb') as f:
    f.write(cat_img)

The code above can also be written as shown below: note that a Request object is built first and then opened with urlopen. At the same time, look at the related methods on the response object, such as:

response.getcode()
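
A sketch of that rewritten version, with the common response methods gathered together:

import urllib.request

# build a Request object first, then open it with urlopen
req = urllib.request.Request('http://placekitten.com/g/200/300')
response = urllib.request.urlopen(req)

print(response.getcode())   # the HTTP status code, e.g. 200
print(response.geturl())    # the URL that was actually retrieved
print(response.info())      # the response headers

cat_img = response.read()

with open('cat_200_300.jpg', 'wb') as f:
    f.write(cat_img)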


Example 2: Youdao Dictionary

Test URL: http://fanyi.youdao.com/

Use the browser's source-inspection (developer) tools to see what information the page transmits; the tools differ from browser to browser, e.g. Chrome vs. Firefox.

I am using Firefox.

Pay attention to the Request Headers section under Headers:

In particular, note User-Agent: sites generally use it to judge whether a program or a human is visiting.
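
As a hedged sketch, a script can present a browser-like User-Agent by passing a headers dict when building the Request (the UA string below is only an example; copy the real one from your own browser):

import urllib.request

url = 'http://fanyi.youdao.com/'
headers = {
    # an example browser User-Agent string
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/80.0.3987.149 Safari/537.36',
}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)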

Note: scraping Youdao Translate can fail with an error: {'errorcode': 50}

Workaround:

In the URL captured from the developer tools, delete the _o that follows translate; the error then disappears and scraping works normally. I don't know why.

 

Next, look at the Form Data (the form fields):

Also: apart from the doctype and i keys, which cannot be removed, all the other keys can be deleted and the translation still runs normally.

The returned data is in JSON format.

My program:

import urllib.request
import urllib.parse
import json

content=input('Enter the text to translate: ')
url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'

data={}
data['i']=content
data['from']='AUTO'
data['to']='AUTO'
data['smartresult']='dict'
data['client']='fanyideskweb'
data['salt']='15890140685391'
data['sign']='483f27800c357ce0e9a60057df27dda1'
data['ts']='1589014068539'
data['bv']='abf85f8020851128b561472c8a7b924d'
data['doctype']='json'
data['version']='2.1'
data['keyfrom']='fanyi.web'
data['action']='FY_BY_CLICKBUTTION'

data=urllib.parse.urlencode(data).encode('utf-8')  # pack the form data into the url-encoded bytes the site expects; this also handles the Chinese text correctly

response=urllib.request.urlopen(url,data)

html=response.read().decode('utf-8')                # decode the bytes that come back

#print(html)
target=json.loads(html)

#print(target['translateResult'][0][0]['src'])
print('Translation result:',target['translateResult'][0][0]['tgt'])


A look at the help documentation:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be an object specifying additional data to be sent to the server, or None if no such data is needed. See Request for details.

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data must be an object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data. The supported object types include bytes, file-like objects, and iterables of bytes-like objects. If no Content-Length nor Transfer-Encoding header field has been provided, HTTPHandler will set these headers according to the type of data. Content-Length will be used to send bytes objects, while Transfer-Encoding: chunked as specified in RFC 7230, Section 3.3.1 will be used to send files and other iterables.

For an HTTP POST request method, data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII string in this format. It should be encoded to bytes before being used as the data parameter.
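
A quick illustration of that urlencode-then-encode step (the field names here are arbitrary):

from urllib.parse import urlencode

form = {'i': 'hello', 'doctype': 'json'}
data = urlencode(form)        # 'i=hello&doctype=json'  -- an ASCII str
data = data.encode('utf-8')   # b'i=hello&doctype=json' -- bytes, ready to POST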

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header value, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib’s default user agent string is "Python-urllib/2.6" (on Python 2.6).

An appropriate Content-Type header should be included if the data argument is present. If this header has not been provided and data is not None, Content-Type: application/x-www-form-urlencoded will be added as a default.

The next two arguments are only of interest for correct handling of third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise. Subclasses may indicate a different default method by setting the method attribute in the class itself.
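
A small sketch of how the default method flips with data (the URL is just a placeholder):

import urllib.request
from urllib.parse import urlencode

url = 'http://example.com/'

req = urllib.request.Request(url)
print(req.get_method())   # 'GET'  -- no data, so GET is the default

payload = urlencode({'q': 'cat'}).encode('utf-8')
req = urllib.request.Request(url, data=payload)
print(req.get_method())   # 'POST' -- data present, so POST is the default

req = urllib.request.Request(url, method='HEAD')
print(req.get_method())   # 'HEAD' -- an explicit method always wins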

