Temporary workaround for Python's urllib/parse.py failing on links that contain Chinese text

Crawling a web page failed.

The following error was raised:

    netloc '微信小程序:某某某' contains invalid characters under NFKC normalization
        f'))
      File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 714, in each
        if callback(func, i, element) is False:
      File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 132, in callback
        return func(*args[:func_code(func).co_argcount])
      File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 1690, in rep
        urljoin(base_url, attr_value.strip()))
      File "/usr/lib64/python3.6/urllib/parse.py", line 512, in urljoin
        urlparse(url, bscheme, allow_fragments)
      File "/usr/lib64/python3.6/urllib/parse.py", line 368, in urlparse
        splitresult = urlsplit(url, scheme, allow_fragments)
      File "/usr/lib64/python3.6/urllib/parse.py", line 441, in urlsplit
        _checknetloc(netloc)
      File "/usr/lib64/python3.6/urllib/parse.py", line 410, in _checknetloc
        "characters under NFKC normalization")
    ValueError: netloc '微信小程序:某某某' contains invalid characters under NFKC normalization

 

Debugging showed that parse.py fails when it processes links of the following form:

<a href="http://微信小程序:某某某某" target="_blank">
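The failure can probably be reproduced without pyquery. The sketch below assumes a Python build that includes the _checknetloc check; the base URL and the helper name join_fails are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base URL; the href mirrors the one from the failing page.
# \uff1a is the fullwidth colon that appears in the original link.
BASE_URL = "http://example.com/"
HREF = "http://微信小程序\uff1a某某某某"


def join_fails(base, href):
    """Return True if urljoin raises ValueError for this href."""
    try:
        urljoin(base, href)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    print("problem link rejected:", join_fails(BASE_URL, HREF))
    print("ordinary link rejected:", join_fails(BASE_URL, "/about"))
```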

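After some searching:

unicodedata.normalize is what normalizes the URL here.

My guess was that the Chinese URL causes the error. More precisely, the netloc contains a fullwidth colon ':' (U+FF1A), which NFKC normalization turns into an ASCII ':', one of the characters ('/?#@:') that _checknetloc rejects.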
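To see which characters NFKC normalization actually changes, a small check like the following may help. It is only a sketch: changed_chars is a hypothetical helper, and it normalizes one character at a time, whereas the standard library normalizes the whole netloc at once.

```python
import unicodedata

# The netloc from the error message; \uff1a is the fullwidth colon.
NETLOC = "微信小程序\uff1a某某某"


def changed_chars(text):
    """Return (original, normalized) pairs for characters that NFKC alters."""
    pairs = []
    for ch in text:
        norm = unicodedata.normalize("NFKC", ch)
        if norm != ch:
            pairs.append((ch, norm))
    return pairs


if __name__ == "__main__":
    # Only the fullwidth colon is expected to change; the Chinese
    # characters themselves are left as they are.
    print(changed_chars(NETLOC))
```

If this reading is right, the Chinese characters alone do not trigger the error; the fullwidth colon does.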

I found no fix online and did not have the energy to dig deeper, so for now I commented this check out.

Comment out lines 407 to 410 of parse.py.


vi /usr/lib64/python3.6/urllib/parse.py:

    394 def _checknetloc(netloc):
    395     if not netloc or not any(ord(c) > 127 for c in netloc):
    396         return
    397     # looking for characters like \u2100 that expand to 'a/c'
    398     # IDNA uses NFKC equivalence, so normalize for this check
    399     import unicodedata
    400     n = netloc.replace('@', '')   # ignore characters already included
    401     n = n.replace(':', '')        # but not the surrounding text
    402     n = n.replace('#', '')
    403     n = n.replace('?', '')
    404     netloc2 = unicodedata.normalize('NFKC', n)
    405     if n == netloc2:
    406         return
    407 #    for c in '/?#@:':
    408 #         if c in netloc2:
    409 #             raise ValueError("netloc '" + netloc + "' contains invalid " +
    410 #                              "characters under NFKC normalization")

 

Hoping there is a better way to solve this.
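One possibly less invasive option is to leave parse.py untouched and catch the ValueError in the crawler's own code, skipping links that urllib rejects. The sketch below is an assumption, not a tested fix for the pyquery call in the traceback: safe_urljoin and absolute_links are hypothetical helpers, and wiring them into pyquery is not shown.

```python
from urllib.parse import urljoin


def safe_urljoin(base, href, default=None):
    """Join href onto base, returning default if urllib rejects the URL."""
    try:
        return urljoin(base, href.strip())
    except ValueError:
        # Raised, for example, when the netloc fails the NFKC check.
        return default


def absolute_links(base, hrefs):
    """Resolve hrefs against base, dropping the ones urllib rejects."""
    joined = (safe_urljoin(base, h) for h in hrefs)
    return [u for u in joined if u is not None]


if __name__ == "__main__":
    links = ["b.html", "http://微信小程序\uff1a某某某某"]
    print(absolute_links("http://example.com/a/", links))
```

This keeps the standard library's check in place for every other URL and only drops the links that cannot be parsed.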
