Today I ran into, for the first time, a requests Form Data that comes as a dictionary. Someone had asked for help on the CSDN forum, so I gave it a try myself as practice.
The Form Data looks like this:
{"token":"","pn":10,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"title","cnum":"001","sort":"{\"webdate\":0}","ssort":"title","cl":200,"terminal":"","condition":[{"fieldName":"categorynum","equal":"003001001001","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2},{"fieldName":"infoc","equal":"130121","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":null,"highlights":"title","statistics":null,"unionCondition":null,"accuracy":"","noParticiple":"0","searchRange":null,"isBusiness":"1"}:
First of all, this is not quite valid Python syntax: null has to become None, true has to become True, and there were one or two bracket issues as well. I swapped them out one by one and then deleted the keys whose values were 0 or None (I probably could have deleted those before doing the replacements, silly me~~).
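In hindsight the manual replacement isn't actually necessary: json.loads can parse the captured Form Data string directly, and it converts null/true/false into None/True/False by itself. A minimal sketch (raw_formdata here is just a shortened stand-in for the payload above):

import json

# shortened stand-in for the Form Data string copied from the browser's network panel
raw_formdata = '{"token":"","pn":10,"rn":10,"time":null,"isBusiness":"1"}'
payload = json.loads(raw_formdata)   # null -> None, true -> True, false -> False happen automatically
print(payload['time'])               # None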
Then I tried the POST request, passing the payload with the data keyword, and it just wouldn't work no matter what. After looking into it I found that requests can also take a json keyword, and sure enough, passing json did the trick. After trimming the Form Data down, I started crawling to see, and the data really did come down easily...
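My understanding of why: data= sends the dict as an application/x-www-form-urlencoded body (and nested structures like the condition list don't map cleanly onto form encoding), while json= serializes the whole dict into a JSON body and sets Content-Type: application/json, which is what this interface expects. Roughly:

import requests

url = 'http://www.hebpr.gov.cn/inteligentsearch/rest/inteligentSearch/getFullTextData'
payload = {"pn": 0, "rn": 10, "isBusiness": "1"}

# form-encoded body: pn=0&rn=10&isBusiness=1 -- this is what the server kept rejecting
r1 = requests.post(url, data=payload)

# JSON body: {"pn": 0, "rn": 10, "isBusiness": "1"} with Content-Type: application/json
r2 = requests.post(url, json=payload)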
Once I had the URLs, I tried going into the detail pages to scrape the actual content. The formatting was a mess; after a simple cleanup it's still not perfect, but it will do.
import requests
import re
import json
import pprint
from lxml import etree


def clear_str(string):
    # strip all whitespace (spaces, tabs, newlines) from a single text fragment
    return re.sub(r'\s|\n', '', string)


def clean_list(ls):
    # clean every fragment, drop the empty ones, and join the rest into one string
    buff_list = list(map(clear_str, ls))
    return_list = [i for i in buff_list if len(i) > 0]
    return ''.join(return_list)


# trimmed-down Form Data: only the fields the interface actually needs
item = {"pn": 0,
        "rn": 10,
        "condition": [{"fieldName": "categorynum",
                       "equal": "003001001001",
                       "likeType": 2},
                      {"fieldName": "infoc",
                       "equal": "130201",
                       "likeType": 2}],
        "isBusiness": "1"}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
url = 'http://www.hebpr.gov.cn/inteligentsearch/rest/inteligentSearch/getFullTextData'

# post the payload as a JSON body (the json keyword, not data)
html = requests.post(url, json=item)
code = html.status_code
text = html.text
data = json.loads(text)

for count, single_info in enumerate(data['result']['records']):
    try:
        item = dict()
        item['index'] = count + 1
        item['title'] = single_info['title']
        item['content'] = single_info['content']
        item['data'] = single_info['webdate']
        # linkurl is relative, so prepend the site prefix
        item['url'] = 'http://www.hebpr.gov.cn/hbggfwpt' + \
            single_info['linkurl']
        # fetch the detail page and pull out the announcement body text
        sub_text = requests.get(item['url'], headers=headers).text
        text_list = etree.HTML(sub_text).xpath(
            '//div[@class="ewb-copy"]//text()')
        item['detail_content'] = clean_list(text_list)
        pprint.pprint(item)
        print()
    except KeyError:
        continue
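If you want more than the first ten records, my guess is that pn is the offset of the first record and rn is the page size (the captured payload had pn 10 and rn 10), so paging would look roughly like the sketch below. The field meanings are my assumption, not something the site documents:

import requests

url = 'http://www.hebpr.gov.cn/inteligentsearch/rest/inteligentSearch/getFullTextData'
base_item = {"rn": 10,
             "condition": [{"fieldName": "categorynum",
                            "equal": "003001001001",
                            "likeType": 2}],
             "isBusiness": "1"}

for page in range(3):                        # first three pages as a demo
    payload = dict(base_item, pn=page * 10)  # assumption: pn = record offset
    resp = requests.post(url, json=payload)
    for record in resp.json()['result']['records']:
        print(record['title'])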