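Python 3 is driving me crazy: a problem fetching a web page
Trying to keep up with the times, I installed Python 3.3, only to find that most of the material online is written for 2.7. No big deal, I figured I would work things out on my own. But I wrote a program to fetch a web page and it fails as soon as I run it. I found several similar Python 3 snippets online and they fail with the same error. The environment and version mismatches of these scripting languages are really driving me up the wall. Could anyone with similar experience take a look?
Code: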
import urllib.parse
import urllib.request
url='http://www.xxx.com'
user_agent='Mozilla/4.0 (compatible; MSIE5.5; Windows NT)'
values={'name':'Michael Foord',
'location':'Northampton',
'language':'Python'}
headers={ 'User-Agent' : user_agent}
data=urllib.parse.urlencode(values)
req=urllib.request.Request(url, data, headers)
response=urllib.request.urlopen(req)
the_page=response.read()
print (the_page)
As soon as I run it I get:
File "E:/work/url.py", line 13, in <module>
response=urllib.request.urlopen(req)
File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 471, in open
req = meth(req)
File "C:\Python33\lib\urllib\request.py", line 1183, in do_request_
raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
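I tried several other people's code and none of it works either. Has anyone run into something similar? Is some version in my environment wrong?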
------Solution--------------------
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. It should be encoded to bytes before being used as the data parameter. The charset parameter in Content-Type header may be used to specify the encoding. If charset parameter is not sent with the Content-Type header, the server following the HTTP 1.1 recommendation may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to use charset parameter with encoding used in Content-Type header with the Request.
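In other words, the str returned by urlencode() has to be encoded to bytes before it is passed to Request. A minimal sketch of the original script with that one change, assuming UTF-8 is an acceptable charset for the server; the helper names build_post_request and fetch are made up for illustration, and the Content-Type header is an addition that was not in the original code:

```python
import urllib.parse
import urllib.request


def build_post_request(url, values, user_agent):
    """Build a POST request whose body is bytes, as Python 3 requires."""
    # urlencode() returns a str such as 'name=Michael+Foord&location=...'
    encoded = urllib.parse.urlencode(values)
    # The POST body must be bytes, so encode it explicitly (assumption: UTF-8).
    data = encoded.encode('utf-8')
    headers = {
        'User-Agent': user_agent,
        # Tell the server which charset the body uses (assumption: UTF-8).
        'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
    }
    return urllib.request.Request(url, data, headers)


def fetch(req):
    """Send the request and return the response body decoded as text."""
    with urllib.request.urlopen(req) as response:
        # read() returns bytes; decoding assumes the page is UTF-8.
        return response.read().decode('utf-8')


req = build_post_request(
    'http://www.xxx.com',
    {'name': 'Michael Foord', 'location': 'Northampton', 'language': 'Python'},
    'Mozilla/4.0 (compatible; MSIE5.5; Windows NT)',
)
# print(fetch(req))  # uncomment to actually send the request
```

The only line that matters for the TypeError is the .encode('utf-8') call; everything else matches the code in the question.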
------Solution--------------------
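About the TypeError:
In 2.7, str is a byte string (ASCII is the default encoding); in 3.3, str is Unicode text (UTF-8 is the default encoding). Both are called strings, but sockets (which urllib and similar modules are built on) work with bytes.
A timeout is a network problem and usually not a problem in your program; it is best to add exception handling for it.
It can also be triggered by other errors (for example, sending the wrong content), so fix the other issues first and then test again.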
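The exception handling suggested above could look roughly like this. It is only a sketch: the function name fetch_or_none, the 10-second default, and the opener parameter (there so the function can be tried without a network) are all made up for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(req, timeout=10, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with opener(req, timeout=timeout) as response:
            return response.read()
    except socket.timeout:
        # Raised directly when reading the response takes too long.
        print('request timed out')
    except urllib.error.URLError as exc:
        # Covers connection failures (including connect timeouts) and,
        # through the HTTPError subclass, HTTP error statuses.
        print('request failed:', exc)
    return None
```

Note that this does not help with the TypeError in the question, which is raised before anything is sent over the network.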
------Solution--------------------
req=urllib.request.Request(url, data, headers)
print(req.read())
What does this output? Does it still raise an error?
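One caveat: a urllib.request.Request object has no read() method, so the print above would fail with an AttributeError. Only the response returned by urlopen() can be read. If the goal is to check what is about to be sent, a sketch like the following inspects the request without touching the network (the values are taken from the question, and the body is encoded as UTF-8 by assumption):

```python
import urllib.parse
import urllib.request

values = {'name': 'Michael Foord'}
# Encode the form data to bytes so the request is valid in Python 3.
data = urllib.parse.urlencode(values).encode('utf-8')
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE5.5; Windows NT)'}
req = urllib.request.Request('http://www.xxx.com', data, headers)

print(req.full_url)        # the URL the request targets
print(req.get_method())    # POST, because data was supplied
print(req.data)            # the bytes that will be sent as the body
print(req.header_items())  # the headers as a list of (name, value) pairs
```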