python爬虫中对含中文的url处理

在练习urllib操作中,遇到了url中含有中文字符的问题。比如http://dotamax.com/,看下源码的话,上方的搜索框的name=p,输入内容点击搜索以后,通过GET方法进行传递,比如我们搜索”意“,url变为http://dotamax.com/search/?q=意。但是url中是不允许出现中文字符的,这时候就改用urllib.parse.quote方法对中文字符进行转换。

url = "http://dotamax.com/"
search = "search/?q=" + urllib.parse.quote("意")
html = urllib.request.urlopen(url + search)

这样就可以正常获取页面了。

需要注意的是不能对整个url调用quote方法。

print(urllib.parse.quote("http://dotamax.com/search/?q=意"))

上面代码输出结果:

http%3A//dotamax.com/search/%3Fq%3D%E6%84%8F

可以看到,' : ', ' ? ', ' = '都被解码,因此需要将最后的中文字符部分调用quote方法后接在后面。

但是还有更方便的方法:

import urllib.parse

b = b'/:?='
print(urllib.parse.quote("http://dotamax.com/search/?q=意", b))
输出结果为:

http://dotamax.com/search/?q=%E6%84%8F
这就是我们想要的结果了。对quote方法是用help命令可以看到如下信息:

Help on function quote in module urllib.parse:

quote(string, safe='/', encoding=None, errors=None)
    quote('abc def') -> 'abc%20def'
    
    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted.
    
    RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
    the following reserved characters.
    
    reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                  "$" | ","
    
    Each of these characters is reserved in some component of a URL,
    but not necessarily in all of them.
    
    By default, the quote function is intended for quoting the path
    section of a URL.  Thus, it will not encode '/'.  This character
    is reserved, but in typical usage the quote function is being
    called on a path where the existing slash characters are used as
    reserved characters.
    
    string and safe may be either str or bytes objects. encoding must
    not be specified if string is a str.
    
    The optional encoding and errors parameters specify how to deal with
    non-ASCII characters, as accepted by the str.encode method.
    By default, encoding='utf-8' (characters are encoded with UTF-8), and
    errors='strict' (unsupported characters raise a UnicodeEncodeError).

None

safe为可以忽略的字符,可以str类型或者bytes类型。

更详细的一些用法可以看这里:

http://www.nowamagic.net/academy/detail/1302863

  • 2
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 3
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值