Scraping a Service Website: AJAX Asynchronous Loading, POST Carrying a JSON Dictionary


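Today I ran into something new with requests: a Form Data that is a single JSON object (a dictionary) rather than the usual key-value pairs. Someone on the CSDN forum asked for help with it, so I used it as practice.
The Form Data looks like this: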

{"token":"","pn":10,"rn":10,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"title","cnum":"001","sort":"{\"webdate\":0}","ssort":"title","cl":200,"terminal":"","condition":[{"fieldName":"categorynum","equal":"003001001001","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2},{"fieldName":"infoc","equal":"130121","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":2}],"time":null,"highlights":"title","statistics":null,"unionCondition":null,"accuracy":"","noParticiple":"0","searchRange":null,"isBusiness":"1"}:

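First, this is not quite Python syntax: null has to become None and true has to become True, and there were one or two bracket issues as well. I replaced these one by one, then deleted the entries whose value was 0 or None (I probably could have deleted them before doing the replacements, silly me).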
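The manual replacement can probably be avoided by letting the standard `json` module parse the copied text, since it maps null to None and true to True on its own. The sketch below is one way to do that, not what the post actually did: `raw` is a shortened copy of the captured Form Data, and `prune` is a hypothetical helper that assumes the server treats None and empty-string fields as optional.

```python
import json

# Shortened copy of the captured Form Data; the raw string keeps the \" escapes intact.
raw = r'''{"token":"","pn":10,"rn":10,"sort":"{\"webdate\":0}","condition":[{"fieldName":"categorynum","equal":"003001001001","notEqual":null,"isLike":true,"likeType":2}],"time":null,"isBusiness":"1"}'''


def prune(value):
    """Recursively drop dict entries whose value is None or an empty string.

    Hypothetical helper: it assumes the server does not need those fields.
    """
    if isinstance(value, dict):
        return {k: prune(v) for k, v in value.items()
                if v is not None and v != ""}
    if isinstance(value, list):
        return [prune(v) for v in value]
    return value


# json.loads converts null -> None and true -> True without manual edits.
payload = prune(json.loads(raw))
print(payload)
```

The resulting `payload` is an ordinary Python dict, so it can be trimmed further by hand before being sent.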
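Then I tried the POST request, passing the dictionary with the data keyword, and it failed no matter what I did. After some checking I found that requests.post also accepts a json keyword, and passing the dictionary as json worked. After trimming the Form Data down, I started scraping and the results came back easily.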
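The likely reason is the body format. With a dict, `data=` makes requests send a form-encoded body, while `json=` serializes the dict to JSON and sets the Content-Type header to application/json, which is presumably what this endpoint expects. The sketch below uses only the standard library to show the two body shapes for a flat payload; it is an approximation for illustration and does not reproduce how requests encodes nested values.

```python
import json
from urllib.parse import urlencode

# A flat subset of the payload, enough to show the difference.
payload = {"pn": 0, "rn": 10, "isBusiness": "1"}

# Roughly what data=payload puts on the wire: key=value pairs.
form_body = urlencode(payload)

# Roughly what json=payload puts on the wire: a JSON document.
json_body = json.dumps(payload)

print(form_body)  # pn=0&rn=10&isBusiness=1
print(json_body)  # {"pn": 0, "rn": 10, "isBusiness": "1"}
```

An equivalent alternative to `json=` is to pass `data=json.dumps(payload)` together with an explicit `Content-Type: application/json` header.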

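With the URLs in hand, I tried following each one to its detail page and scraping the full content. The formatting was messy; after some simple cleaning it is still not perfect, but it is acceptable.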
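Part of the formatting problem may come from removing every whitespace character and joining all text nodes into one string, which also removes the spaces between words and the breaks between lines. A gentler variation, sketched here as an assumption rather than the post's method, collapses whitespace inside each text node and keeps one node per line. `clean_lines` and the `sample` list are hypothetical; the sample stands in for what an XPath `//text()` query might return.

```python
import re


def clean_lines(text_nodes):
    """Collapse whitespace inside each node, drop empty nodes, keep line breaks."""
    lines = (re.sub(r'\s+', ' ', node).strip() for node in text_nodes)
    return '\n'.join(line for line in lines if line)


# Hypothetical text nodes, including tabs, newlines and a non-breaking space.
sample = ['\r\n   Project name:  ', 'Road repair\t', '\n\n',
          '  Bid opening:\xa02019-06-01 ']
print(clean_lines(sample))
```

Whether one node per line reads well depends on how the page splits its text, so the join character may need adjusting for a real page.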

import requests
import re
import json
import pprint
from lxml import etree


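def clear_str(string):
    # Remove all whitespace; \s already matches \n, so one pattern is enough.
    return re.sub(r'\s+', '', string)


def clean_list(ls):
    # Strip whitespace from every text node, drop the empty ones, join the rest.
    cleaned = map(clear_str, ls)
    return ''.join(s for s in cleaned if s)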


item = {"pn": 0,
        "rn": 10,
        "condition": [{"fieldName": "categorynum",
                       "equal": "003001001001",
                       "likeType": 2},
                      {"fieldName": "infoc",
                       "equal": "130201",
                       "likeType": 2}],
        "isBusiness": "1"}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

url = 'http://www.hebpr.gov.cn/inteligentsearch/rest/inteligentSearch/getFullTextData'

# Send the payload as a JSON body (json=), not as form data (data=).
response = requests.post(url, json=item, headers=headers, timeout=10)
response.raise_for_status()
data = response.json()
for count, single_info in enumerate(data['result']['records']):
    try:
        # Use a fresh dict per record instead of reusing the payload name `item`.
        record = dict()
        record['index'] = count + 1
        record['title'] = single_info['title']
        record['content'] = single_info['content']
        record['date'] = single_info['webdate']
        record['url'] = 'http://www.hebpr.gov.cn/hbggfwpt' + \
            single_info['linkurl']
        # Fetch the detail page and extract all text under the content div.
        sub_text = requests.get(record['url'], headers=headers).text
        text_list = etree.HTML(sub_text).xpath(
            '//div[@class="ewb-copy"]//text()')
        record['detail_content'] = clean_list(text_list)
        pprint.pprint(record)
        print()
    except KeyError:
        # Skip records that lack any of the expected fields.
        continue
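
The script above only requests one page. To fetch more, the payload could be regenerated per page. The sketch below is an assumption, not something the post confirms: it guesses that pn is a record offset and rn the page size, which is consistent with the captured request (pn 10, rn 10) but should be checked against the site. `BASE_PAYLOAD` and `page_payload` are hypothetical names.

```python
import copy

# Same trimmed payload as in the script above.
BASE_PAYLOAD = {"pn": 0,
                "rn": 10,
                "condition": [{"fieldName": "categorynum",
                               "equal": "003001001001",
                               "likeType": 2},
                              {"fieldName": "infoc",
                               "equal": "130201",
                               "likeType": 2}],
                "isBusiness": "1"}


def page_payload(page, page_size=10):
    """Return a fresh payload for a zero-based page number."""
    payload = copy.deepcopy(BASE_PAYLOAD)
    payload["pn"] = page * page_size  # assumed: pn is a record offset
    payload["rn"] = page_size
    return payload


for page in range(3):
    print(page_payload(page)["pn"])
```

Each generated payload could then be passed as `json=` to `requests.post`, ideally with a short pause between requests.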