1. First, it is not a User-Agent problem. About the headers: include as many of the fields the browser actually sends as you can. The site I was scraping changed its anti-crawling policy at some point: Referer is now required, and Accept-Encoding must not be set in the crawler, otherwise the returned page is garbled (it comes back compressed). The site also added cookie validation and a hidden form field.
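A minimal sketch of what that headers dict might look like; every value below is a generic browser placeholder, not this site's actual values, and Accept-Encoding is deliberately left out:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Referer': 'https://xxxx.com',  # the site now rejects requests without a Referer
    # 'Accept-Encoding' is intentionally omitted: with it set, the response is
    # compressed and read().decode('utf-8') produces garbage
}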
import urllib.request
import urllib.parse
import http.cookiejar
from scrapy.selector import Selector  # assuming Scrapy's Selector, which matches the .xpath()/.extract() calls below

URL_ROOT = 'https://xxxx.com'  # this page is fetched first to obtain the cookie and the hidden-field value
cookie = http.cookiejar.CookieJar()  # a CookieJar instance to hold the cookies
handler = urllib.request.HTTPCookieProcessor(cookie)  # cookie handler built with urllib.request's HTTPCookieProcessor
opener = urllib.request.build_opener(handler)  # build an opener from the handler
urllib.request.install_opener(opener)  # install it globally so urlopen() uses it
request = urllib.request.Request(url=URL_ROOT,headers=headers,method='GET')
response = urllib.request.urlopen(request)
# print the cookie values:
# for item in cookie:
# print('Name = ' + item.name)
# print('Value = ' + item.value)
the_tempPage = response.read()
# print(the_tempPage)
tempResult = the_tempPage.decode('utf-8')
selector = Selector(text=tempResult)
hvalue = selector.xpath('//*[@id="hiddenvalue"]/input[1]/@value').extract()[0]  # hidden-field value needed for the POST below
# print(hvalue)
2. I originally passed the POST parameters in the form shown below; it never succeeded in my tests, even though the same request works in Postman.
# data={"key1":value1,"key2":value2,"hvalue":hvalue}
# data = urllib.parse.urlencode(data)
# data = data.encode('utf-8')
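For reference, that urlencoded body would normally be attached to the request via the data argument; this is the standard urllib pattern, not the code that ended up working against this site:
# request = urllib.request.Request(url, data=data, headers=headers, method='POST')
# response = urllib.request.urlopen(request)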
So instead I put the parameters directly into the URL as a query string.
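A sketch of how that URL can be assembled with urllib.parse.urlencode; the '/query' path and the key1/key2 names are placeholders carried over from the commented-out dict above, and only hvalue comes from the hidden field read earlier:
params = urllib.parse.urlencode({'key1': value1, 'key2': value2, 'hvalue': hvalue})
url = URL_ROOT + '/query?' + params  # '/query' is a placeholder path, not the real endpoint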
Finally, the code that fetches the result page:
urllib.request.install_opener(opener)  # reuse the opener so the request carries the cookie obtained above
req = urllib.request.Request(url,headers=headers,method="POST")
response = urllib.request.urlopen(req)
the_page = response.read()
result = the_page.decode('UTF-8')
selector = Selector(text=result)
included_names = selector.xpath('/html/body/table/tr')
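To pull the data out of those rows, each <tr> can then be walked cell by cell; the sketch below assumes the cells are plain <td> text nodes, which may differ on the real page:
for row in included_names:
    cells = row.xpath('./td/text()').extract()  # list of cell strings for this row
    print(cells)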