Using urllib

斯内客
Article link: https://blog.csdn.net/weixin_38942791/article/details/79730265
#coding=utf-8

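'''
Created on 2018-03-27

@author: BH Wong
    Concept: the urllib module is powerful. It can fetch data and can also send data to a server to get the result you need. Compared with spider01, this script adds a request header.
               req = request.Request(url) builds the request, and the response returned by request.urlopen(req) provides info(), getcode(), geturl() and other methods. See the test() function for details.
      trans() sends form data to Youdao Dictionary and reads back the result.
    Workflow:
    (1) Analyze the page: use Chrome DevTools to see what is sent to the server when translating.
    (2) Build the data dict.
    (3) Open the page and send the data with res = urlopen(url, data).
    (4) Read the response body with html = res.read().
    (5) Detect the character encoding with code = chardet.detect(html).get('encoding').
    (6) Decode the bytes into text with html = html.decode(code).
    (7) Call json.loads(html) to parse the JSON string into Python objects (a step that was new to me).
    Related modules:
          json.loads() parses a JSON string into Python objects; json.dumps() serializes Python objects into a JSON string.
        This blog post covers them in more detail: https://www.cnblogs.com/xiaomingzaixian/p/7286793.html
    Motto:
'''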
from urllib import request,parse
import chardet
import json
def trans():
    n = 0  # number of translation attempts so far
    while n<5:
        txt = input('Enter the text to translate: ')
        if txt == '':
            print('Input is empty, stopping.')
            break
        else:
            url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&sessionFrom=https://www.baidu.com/link'
            data = {
                'from':'AUTO',
                'to':'AUTO',
                'smartresult':'dict',
                'client':'fanyideskweb',
                'salt':'1500092479607',
                'sign':'c98235a85b213d482b8e65f6b1065e26',
                'doctype':'json',
                'version':'2.1',
                'keyfrom':'fanyi.web',
                'action':'FY_BY_CL1CKBUTTON',
                'typoResult':'true'}
            data['i'] = txt
            data = parse.urlencode(data).encode('utf-8')    
            try:
                req = request.Request(url)
                response = request.urlopen(req,data)
                html = response.read()
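                # chardet may report None as the encoding, so fall back to UTF-8.
                code = chardet.detect(html).get('encoding') or 'utf-8'
                html = html.decode(code)
                # Named "parsed" rather than "re" to avoid confusion with the regex module.
                parsed = json.loads(html)
                result = parsed.get('translateResult')[0][0].get('tgt')
                print('Translation:',result)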
            except Exception as e:
                print(e)
            finally:
                n = n + 1
def test():
    head = {} # Add a request header that mimics a browser, so the request is less likely to be blocked.
    head['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
    req = request.Request("http://www.neihanshequ.com/",headers = head)
    res = request.urlopen(req)
    html = res.read()
    #print(res.info()) returns the response headers
    #print(res.geturl()) returns the URL that was actually retrieved
    #print(res.getcode()) returns the HTTP status code of the response
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    #test()
    # Use urllib.request to send data and parse what comes back.
    # Below is a simple example that sends text to Youdao Dictionary and prints the translated result.
    trans()
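
The request-building and response-parsing steps of trans() can be tried without touching the network. The sketch below is only an illustration under stated assumptions: build_request and extract_translation are hypothetical helper names, the form is trimmed to a few of the fields used above, and the sample response is an assumed shape inferred from the keys that trans() reads, not a captured Youdao reply.

```python
import json
from urllib import parse, request

# Endpoint taken from the script above, without its query string.
# Nothing is sent in this sketch; the Request object is only inspected.
URL = 'http://fanyi.youdao.com/translate'


def build_request(text):
    """Return a Request carrying `text` as a URL-encoded form body (hypothetical helper)."""
    form = {'i': text, 'doctype': 'json', 'from': 'AUTO', 'to': 'AUTO'}
    # urlencode produces a str; urlopen needs bytes for the POST body.
    body = parse.urlencode(form).encode('utf-8')
    headers = {'User-Agent': 'Mozilla/5.0'}
    return request.Request(URL, data=body, headers=headers)


def extract_translation(raw):
    """Decode a UTF-8 JSON body and return its 'tgt' field (hypothetical helper)."""
    parsed = json.loads(raw.decode('utf-8'))
    return parsed['translateResult'][0][0]['tgt']


req = build_request('hello world')
print(req.get_method())  # a Request with a body defaults to POST
print(req.data)          # the encoded form body as bytes

# Assumed response shape, based only on the keys trans() accesses.
sample = json.dumps(
    {'translateResult': [[{'src': 'hello', 'tgt': '你好'}]]}
).encode('utf-8')
print(extract_translation(sample))
```

Attaching the body to the Request itself is equivalent to passing it as the second argument of urlopen, as trans() does; either way urllib switches the method from GET to POST.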
    
    
    
#coding=utf-8


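'''
Created on 2018-03-27

@author: BH Wong
    Concept:
     This section covers simple page fetching with urllib.
     (1) Python 3 reorganized Python 2's urllib and urllib2 into a single urllib package with the submodules urllib.request, urllib.parse, urllib.error, and others.
      * urllib.request opens URLs and fetches page content
      * urllib.parse parses and builds URLs
      * urllib.error defines the exceptions raised by urllib.request
    Workflow: see the example below
    Related modules: chardet
    Motto:
'''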
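# Open a web page and read its content. The original target was Qiushibaike (https://www.qiushibaike.com/); the code below opens a CSDN page instead.
# Enough talk, let's get started.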
from urllib import request
import chardet
if __name__ == '__main__':
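    # Open the page; urlopen returns a response object.
    response = request.urlopen("https://mp.csdn.net/postlist")
    # Read the raw bytes of the page.
    content = response.read()
    # Use chardet to detect the page encoding without inspecting the source by hand.
    # chardet.detect(content) also returns a 'confidence' value between 0 and 1.
    # The detected encoding can be None, so fall back to UTF-8.
    code = chardet.detect(content).get('encoding') or 'utf-8'
    # Decode the bytes into text using the detected encoding.
    content = content.decode(code)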
    print(content)
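
The docstring above also names urllib.parse and urllib.error, which the script itself never uses. The following is a rough sketch of how they might fit in, not part of the original script: describe_url and fetch_text are hypothetical names, and fetch_text reads the charset from the Content-Type header instead of using chardet, which is a deliberate swap to keep the example standard-library only. Only describe_url is called, so the sketch runs offline.

```python
from urllib import error, parse, request


def describe_url(url):
    """Split a URL into host, path, and decoded query parameters (hypothetical helper)."""
    parts = parse.urlsplit(url)
    return parts.netloc, parts.path, parse.parse_qs(parts.query)


def fetch_text(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        with request.urlopen(url, timeout=timeout) as res:
            raw = res.read()
            # Prefer the charset declared in the Content-Type header;
            # assume UTF-8 when the server does not declare one.
            charset = res.headers.get_content_charset() or 'utf-8'
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', e.code, e.reason)
        return None
    except error.URLError as e:
        # No usable response (DNS failure, refused connection, and so on).
        print('Connection problem:', e.reason)
        return None
    return raw.decode(charset, errors='replace')


host, path, query = describe_url(
    'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
)
print(host)
print(path)
print(query)  # repeated keys are collected into one list
```

Calling fetch_text("https://mp.csdn.net/postlist") would mirror the script above, with failures reported rather than raised.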