Crawling a web page failed.
The following error was thrown:
netloc '微信小程序:某某某' contains invalid characters under NFKC normalization
    f'))
  File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 714, in each
    if callback(func, i, element) is False:
  File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 132, in callback
    return func(*args[:func_code(func).co_argcount])
  File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 1690, in rep
    urljoin(base_url, attr_value.strip()))
  File "/usr/lib64/python3.6/urllib/parse.py", line 512, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib64/python3.6/urllib/parse.py", line 368, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib64/python3.6/urllib/parse.py", line 441, in urlsplit
    _checknetloc(netloc)
  File "/usr/lib64/python3.6/urllib/parse.py", line 410, in _checknetloc
    "characters under NFKC normalization")
ValueError: netloc '微信小程序:某某某' contains invalid characters under NFKC normalization
Debugging showed that parse.py fails when it processes links of the following kind:
<a href="http://微信小程序:某某某某" target="_blank">
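Looking into it, unicodedata.normalize is what normalizes the netloc of the URL before it is checked.
My first guess was that the Chinese URL itself was the problem. Reading _checknetloc more closely, though, plain Chinese characters are unchanged by NFKC and pass the check; the error is raised only when normalization produces one of the characters / ? # @ :. The colon in the href is therefore most likely a full-width ':' (U+FF1A), which NFKC turns into an ASCII ':'.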
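To check that reading of the error, here is a minimal sketch that reproduces it without pyquery. It assumes the real href contains a full-width colon (U+FF1A), which cannot be confirmed from the traceback alone. The helper netloc_error is a hypothetical name used only for this illustration.

```python
import unicodedata
from urllib.parse import urlsplit

# Assumption: the page uses a full-width colon, which looks like ':' on screen.
FULLWIDTH_COLON = "\uff1a"

# Plain Chinese text is unchanged by NFKC, so it cannot trigger the check.
chinese_only = "微信小程序"
chinese_unchanged = unicodedata.normalize("NFKC", chinese_only) == chinese_only

# The full-width colon is folded into an ASCII ':' by NFKC.
folded = unicodedata.normalize("NFKC", FULLWIDTH_COLON)


def netloc_error(url):
    """Return the ValueError message urlsplit raises for url, or None."""
    try:
        urlsplit(url)
    except ValueError as exc:
        return str(exc)
    return None


# Same link twice: once with an ASCII colon, once with the full-width one.
ascii_result = netloc_error("http://微信小程序:某某某某")
fullwidth_result = netloc_error("http://微信小程序" + FULLWIDTH_COLON + "某某某某")

print(chinese_unchanged, repr(folded))
print(ascii_result)
print(fullwidth_result)
```

On an interpreter that includes the _checknetloc check, only the full-width variant should be rejected, which matches the message in the traceback.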
I found no fix online and had no time to dig further, so for now I commented out this check.
Comment out lines 407 to 410 of parse.py:
vi /usr/lib64/python3.6/urllib/parse.py
394 def _checknetloc(netloc):
395     if not netloc or not any(ord(c) > 127 for c in netloc):
396         return
397     # looking for characters like \u2100 that expand to 'a/c'
398     # IDNA uses NFKC equivalence, so normalize for this check
399     import unicodedata
400     n = netloc.replace('@', '')   # ignore characters already included
401     n = n.replace(':', '')        # but not the surrounding text
402     n = n.replace('#', '')
403     n = n.replace('?', '')
404     netloc2 = unicodedata.normalize('NFKC', n)
405     if n == netloc2:
406         return
407 #    for c in '/?#@:':
408 #        if c in netloc2:
409 #            raise ValueError("netloc '" + netloc + "' contains invalid " +
410 #                             "characters under NFKC normalization")
I hope there is a better way to solve this.
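One possible alternative, sketched here under assumptions rather than tested against the original crawler: leave parse.py untouched and catch the ValueError per link. Editing the standard library changes behavior for every program that uses this interpreter, and this check was added to urllib as a security fix. The helper safe_urljoin and the example base URL are hypothetical names for illustration. The idea is to resolve href values with a wrapper like this instead of letting a single bad link abort the whole page.

```python
from urllib.parse import urljoin


def safe_urljoin(base_url, link, fallback=None):
    """Join link onto base_url without raising on a malformed link.

    When urllib rejects the link, return `fallback` if one is given,
    otherwise return the link unchanged.
    """
    try:
        return urljoin(base_url, link.strip())
    except ValueError:
        return link if fallback is None else fallback


base = "http://example.com/news/index.html"  # hypothetical page URL
links = [
    "/about",                           # ordinary relative link
    "http://微信小程序\uff1a某某某某",    # full-width colon, rejected by urllib
]

# The bad link is passed through as-is; the good one is made absolute.
resolved = [safe_urljoin(base, link) for link in links]
for original, result in zip(links, resolved):
    print(original, "->", result)
```

Passing fallback="" (or filtering such links out beforehand) would drop the unusable hrefs entirely; which is preferable depends on what the crawler does with the links afterwards.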