Web Crawlers, Part 13: Data Extraction with JSON and JsonPath


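1. Introduction

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate. It suits any scenario where data is exchanged, such as communication between a website's front end and back end.

JSON and XML are broadly comparable as interchange formats.

Python's standard library includes a json module (it already ships with Python 2.7), so a plain import json is all that is needed.

Official documentation: http://docs.python.org/library/json.html

Online JSON parser: http://www.json.cn/#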

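2. JSON

Put simply, JSON is built from JavaScript's objects and arrays. These two structures, nested inside each other, can represent arbitrarily complex data.

Object: in JavaScript an object is whatever is enclosed in { }, structured as key-value pairs { key: value, key: value, ... }. In object-oriented terms the key is a property of the object and the value is that property's value, read as object.key. A value may be a number, string, boolean, null, array, or object. In JSON text the keys themselves must be double-quoted strings.

Array: in JavaScript an array is whatever is enclosed in [ ], for example ["Python", "javascript", "C++", ...]. Elements are read by index, as in most languages, and each element may be a number, string, boolean, null, array, or object.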
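To make the two structures concrete, here is a minimal sketch in Python 3. The sample document and its field names (site, languages, owner) are made up for illustration; only json.loads comes from the standard library.

```python
import json

# Hypothetical sample: an object whose values include an array and a nested object.
text = (
    '{"site": "example", '
    '"languages": ["Python", "javascript", "C++"], '
    '"owner": {"name": "大猫", "city": "北京"}}'
)

doc = json.loads(text)  # JSON object -> dict, JSON array -> list

first_language = doc["languages"][0]  # arrays are read by index
owner_city = doc["owner"]["city"]     # objects are read by key

print(first_language)
print(owner_city)
```

Once parsed, the JavaScript notation object.key becomes dictionary indexing, doc["key"], on the Python side.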

The json module provides four functions, dumps, dump, loads, and load, for converting between JSON text and Python data types.

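3. json.loads()

Decodes a JSON-formatted string into a Python object. The type conversions from JSON to Python are:

JSON             Python
object           dict
array            list
string           unicode (str in Python 3)
number (int)     int (or long in Python 2)
number (real)    float
true             True
false            False
null             None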

import json

strList = '[1, 2, 3, 4]'

strDict = '{"city": "北京", "name": "大猫"}'

json.loads(strList)
# [1, 2, 3, 4]

json.loads(strDict)  # Python 2: decoded strings come back as unicode objects
# {u'city': u'\u5317\u4eac', u'name': u'\u5927\u732b'}
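The same decoding step can be sketched in Python 3, where decoded strings are plain str. The record below is hypothetical and only borrows field names from the kind of data used later in this article; the second half shows what happens with text that is not valid JSON.

```python
import json

# Hypothetical record covering each JSON value type.
raw = (
    '{"id": 5, "score": 9.5, "name": "北京", '
    '"isSelected": false, "parent": null, "tags": ["a", "b"]}'
)
record = json.loads(raw)

# Map each key to the name of the Python type it was decoded into.
types = {key: type(value).__name__ for key, value in record.items()}
print(types)

# Malformed input raises json.JSONDecodeError (a subclass of ValueError).
try:
    json.loads("{'id': 5}")  # single-quoted keys are not valid JSON
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)
```

Catching json.JSONDecodeError is worthwhile in a crawler, since a server may return an HTML error page where JSON was expected.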

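4. json.dumps()

Encodes a Python object into a JSON string and returns it as a str.

The type conversions from Python to JSON are:

Python               JSON
dict                 object
list, tuple          array
str, unicode         string
int, long, float     number
True                 true
False                false
None                 null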

# Python 2 session (print statement; byte strings are passed to chardet)
import json
import chardet

listStr = [1, 2, 3, 4]
tupleStr = (1, 2, 3, 4)
dictStr = {"city": "北京", "name": "大刘"}

json.dumps(listStr)
# '[1, 2, 3, 4]'
json.dumps(tupleStr)
# '[1, 2, 3, 4]'

# Note: by default json.dumps() escapes every non-ASCII character (ensure_ascii=True).
# Pass ensure_ascii=False to keep the characters as they are, here as UTF-8 text.
# chardet.detect() returns a dict whose confidence entry is the detection confidence.

json.dumps(dictStr)
# '{"city": "\\u5317\\u4eac", "name": "\\u5927\\u5218"}'

chardet.detect(json.dumps(dictStr))
# {'confidence': 1.0, 'encoding': 'ascii'}

print json.dumps(dictStr, ensure_ascii=False)
# {"city": "北京", "name": "大刘"}

chardet.detect(json.dumps(dictStr, ensure_ascii=False))
# {'confidence': 0.99, 'encoding': 'utf-8'}

chardet is a third-party encoding-detection module that can be installed with pip.
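For readers on Python 3, the effect of ensure_ascii can be checked without chardet. The sketch below uses a made-up record; indent and sort_keys are standard json.dumps parameters shown here only as optional extras.

```python
import json

# Hypothetical record; the tuple is included to show that it becomes a JSON array.
record = {"name": "大刘", "city": "北京", "ids": (1, 2, 3), "active": True, "parent": None}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep non-ASCII characters as-is
pretty = json.dumps(record, ensure_ascii=False, indent=2, sort_keys=True)

print(escaped)
print(readable)
print(pretty)
```

Both forms decode back to the same data with json.loads; ensure_ascii only changes how the text looks, which matters mostly when a person will read the saved file.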

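5. json.dump()

Serializes a Python built-in type to JSON and writes the result to a file.

import json

listStr = [{"city": "北京"}, {"name": "大刘"}]
# A with block closes (and flushes) the file once the data has been written.
with open("listStr.json", "w") as f:
    json.dump(listStr, f, ensure_ascii=False)

dictStr = {"city": "北京", "name": "大刘"}
with open("dictStr.json", "w") as f:
    json.dump(dictStr, f, ensure_ascii=False)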
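In Python 3 it is safer to name the file encoding explicitly when ensure_ascii=False is used, because the platform default may not be UTF-8. A minimal sketch, assuming a temporary directory as a stand-in for the crawler's real output location:

```python
import json
import os
import tempfile

cities = [{"city": "北京"}, {"name": "大刘"}]

# The temporary directory is an assumption made so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "listStr.json")

    # Write UTF-8 text with the Chinese characters left unescaped.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cities, f, ensure_ascii=False, indent=2)

    # Read the raw text back to see exactly what was stored.
    with open(path, encoding="utf-8") as f:
        written_text = f.read()

print(written_text)
```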

6. json.load()

Reads JSON text from a file and converts it into Python types.

import json

# Python 2 session; the with block closes each file after reading.
with open("listStr.json") as f:
    strList = json.load(f)
print strList
# [{u'city': u'\u5317\u4eac'}, {u'name': u'\u5927\u5218'}]

with open("dictStr.json") as f:
    strDict = json.load(f)
print strDict
# {u'city': u'\u5317\u4eac', u'name': u'\u5927\u5218'}
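json.load accepts any object with a read() method, not only a file opened from disk. The Python 3 sketch below uses io.StringIO as a stand-in for an open text file, which is an assumption made to keep the example free of disk access.

```python
import io
import json

json_text = '[{"city": "北京"}, {"name": "大刘"}]'

# io.StringIO behaves like an opened text file; json.load only calls its read() method.
fake_file = io.StringIO(json_text)
items = json.load(fake_file)

# json.loads is the string counterpart: load(f) gives the same result as loads(f.read()).
same_items = json.loads(json_text)

print(items)
print(items == same_items)
```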

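7. JsonPath

JsonPath is a library for extracting specific pieces of information from a JSON document. Implementations exist for several languages, including JavaScript, Python, PHP, and Java.

JsonPath is to JSON what XPath is to XML.

Download: https://pypi.python.org/pypi/jsonpath

Installation: follow the Download URL link to download jsonpath, unpack the archive, and run python setup.py install.

Official documentation: http://goessner.net/articles/JsonPath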

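8. JsonPath vs. XPath Syntax

JSON has a clear structure, reads easily, and has little complexity, so matching against it is straightforward. The table below lists each JsonPath operator next to its XPath counterpart.

XPath    JSONPath            Description
/        $                   the root object/element
.        @                   the current object/element
/        . or []             child operator
..       n/a                 parent operator
//       ..                  recursive descent
*        *                   wildcard: all objects/elements regardless of name
@        n/a                 attribute access (JSON has no attributes)
[]       []                  subscript operator
|        [,]                 union of alternate names or array indices
n/a      [start:end:step]    array slice operator
[]       ?()                 applies a filter (script) expression
n/a      ()                  script expression
()       n/a                 grouping in XPath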
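The recursive-descent operator (..) is the one used most when scraping. As a rough illustration of what an expression such as $..author asks for, here is a hand-written sketch in plain Python 3. The helper find_all and the sample store data are hypothetical; this is not how the jsonpath library is implemented, and the library may order its results differently.

```python
def find_all(node, key):
    """Collect the values stored under `key` at any depth, in the spirit of `$..key`."""
    found = []
    if isinstance(node, dict):
        for k, value in node.items():
            if k == key:
                found.append(value)
            found.extend(find_all(value, key))  # keep descending into the value
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical sample document.
store = {
    "store": {
        "book": [
            {"author": "A", "price": 8.95},
            {"author": "B", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

authors = find_all(store, "author")
prices = find_all(store, "price")
print(authors)
print(prices)
```

With the jsonpath package installed, the corresponding call would be jsonpath.jsonpath(store, '$..author'), the same function used in the case study in this article.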

9. Case Study

Using Lagou's city JSON file at http://www.lagou.com/lbs/getAllCitySearchLabels.json as the example, extract the names of all cities.

# Fetch Lagou's city JSON file and extract every city name.

import requests
import jsonpath
import json

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'http://www.lagou.com/lbs/getAllCitySearchLabels.json'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
response = requests.get(url, verify=False, headers=headers)
html = response.text

print('------------1.json格式,类型为str-----------')  # 1. JSON text, type str
print(html)
print(type(html))

# Convert the JSON text into a Python object
jsonobj = json.loads(html)
print('------------2.python对象-----------')  # 2. Python object
print(jsonobj)
print(type(jsonobj))

# Starting from the root, match every name node at any depth
citylist = jsonpath.jsonpath(jsonobj, '$..name')
print('------------3.提取name-----------')  # 3. extracted names
print(citylist)
print(type(citylist))

# Serialize the names back to JSON text without escaping the Chinese characters
content = json.dumps(citylist, ensure_ascii=False)
print('------------4.json格式,类型为str-----------')  # 4. JSON text, type str
print(content)
print(type(content))

# Write the result as UTF-8 bytes; the with block closes the file
with open('city.json', 'wb') as fp:
    fp.write(content.encode('utf-8'))
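When the structure of the response is known in advance, the city names can also be collected by walking the explicit path content, data, allCitySearchLabels instead of using a recursive JsonPath expression. The offline sketch below works on three records copied from the response shown in this article, so it runs without network access; the variable names are illustrative only.

```python
import json

# Three records copied from the article's sample response, trimmed to two letters.
sample = '''{"state":1,"message":"success","content":{"data":{"allCitySearchLabels":{
"A":[{"id":723,"name":"安阳","parentId":545,"code":"171500000","isSelected":false},
     {"id":601,"name":"鞍山","parentId":535,"code":"081600000","isSelected":false}],
"B":[{"id":5,"name":"北京","parentId":1,"code":"010100000","isSelected":false}]}}}}'''

labels = json.loads(sample)["content"]["data"]["allCitySearchLabels"]

# Flatten the per-letter lists into one list of city names.
names = [city["name"] for letter in sorted(labels) for city in labels[letter]]

# Or keep the grouping: letter -> list of names.
names_by_letter = {letter: [c["name"] for c in cities] for letter, cities in labels.items()}

print(json.dumps(names, ensure_ascii=False))
print(json.dumps(names_by_letter, ensure_ascii=False))
```

The explicit path breaks if the site renames a level, whereas $..name keeps working but also picks up any other name field that may appear elsewhere in the document.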

Output of the crawler script (truncated):

------------1.json格式,类型为str-----------
{
   "state":1,"message":"success","content":{
   "data":{
   "allCitySearchLabels":{
   "A":[{
   "id":723,"name":"安阳","parentId":545,"code":"171500000","isSelected":false},{
   "id":601,"name":"鞍山","parentId":535,"code":"081600000","isSelected":false},{
   "id":105795,"name":"澳门特别行政区","parentId":562,"code":"330100000","isSelected":false},{
   "id":671,"name":"安庆","parentId":541,"code":"131800000","isSelected":false},{
   "id":825,"name":"安顺","parentId":553,"code":"240400000","isSelected":false},{
   "id":903,"name":"阿勒泰","parentId":560,"code":"310400000","isSelected":false},{
   "id":897,"name":"阿克苏","parentId":560,"code":"311800000","isSelected":false},{
   "id":862,"name":"安康","parentId":556,"code":"270400000","isSelected":false},{
   "id":819,"name":"阿坝藏族羌族自治州","parentId":552,"code":"230700000","isSelected":false},{
   "id":598,"name":"阿拉善盟","parentId":534,"code":"070300000","isSelected":false}],"B":[{
   "id":5,"name":"北京","parentId":1,"code":"010100000","isSelected":false},{
   "id":570,"name":"保定","parentId":532,"code":"051100000","isSelected":false},{
   "id":666,"name":"蚌埠","parentId":541,"code":"131300000","isSelected":false},{
   "id":588,"name":"包头","parentId":534,"code":"071300000","isSelected":false},{
   "id":717,"name":"滨州","parentId":544,"code":"161400000","isSelected":false},{
   "id":856,"name":"宝鸡","parentId":556,"code":"271000000","isSelected":false},{
   "id":678,"name":"亳州","parentId":541,"code":"130500000","isSelected":false},{
   "id":789,"name":"北海","parentId":549,"code":"211000000","isSelected":false},{
   "id":794,"name":"百色","parentId":549,"code":"210500000","isSelected":false},{
   "id":817,"name":"巴中","parentId":552,"code":"230500000","isSelected":false},{
   "id":828,"name":"毕节","parentId":553,"code":"240700000","isSelected":false},{
   "id":603,"name":"本溪","parentId":535,"code":"081400000","isSelected":false},{
   "id":896,"name":"巴音郭楞","parentId":560,"code":"311700000","isSelected":false},{
   "id":834,"name":"保山","parentId":554,"code":"251400000","isSelected":false},{
   "id":597,"name":"巴彦淖尔","parentId":534,"code":"070400000","isSelected":false},{
   "id":895,"name":"博尔塔拉","parentId":560,"code":"311600000","isSelected":false},{
   "id":620,"name":"白城","parentId":536,"code":"090400000","isSelected":false},{
   "id":618,"name":"白山","parentId":536,"code":"090600000","isSelected":false}],"C":[{
   "id":801,"name":"成都","parentId":552,"code":"230100000","isSelected":false},{
   "id":749,"name":"长沙","parentId":547,"code":"190100000","isSelected":false},{
   "id":8,"name":"重庆","parentId":4,"code":"040100000","isSelected":false},{
   "id":613,"name":"长春","parentId":536,"code":"090100000","isSelected":false},{
   "id":638,"name":"常州","parentId":539,"code":"112000000","isSelected":false},{
   "id":573,"name":"沧州","parentId":532,"code":"050800000","isSelected":false},{
   "id":590,"name":"赤峰","parentId":534,"code":"071100000","isSelected":false},{
   "id":758,"name":"郴州","parentId":547,"code":"190500000","isSelected":false},{
   "id":781,"name":"潮州","parentId":548,"code":"200500000","isSelected":false},{
   "id":755,"name":"常德","parentId":547,"code":"190800000","isSelected":false},{
   "id":673,"name":"滁州","parentId":541,"code":"131100000","isSelected":false},{
   "id":611,"name":"朝阳","parentId":535,"code":"080600000","isSelected":false},{
   "id":572,"name":"承德","parentId":532,"code":"050900000","isSelected":false},{
   "id":679,"name":"池州","parentId":541,"code":"130600000","isSelected":false},{
   "id":836,"name":"楚雄","parentId":554,"code":"251200000","isSelected":false},{
   "id":894,"name":"昌吉","parentId":560,"code":"311500000","isSelected":false},{
   "id":905,"name":"崇左","parentId":549,"code":"211400000","isSelected":false}],"D":[{
   "id":779,"name":"东莞","parentId":548,"code":"200300000","isSelected":false},{
   "id":600,"name":"大连","parentId":535,"code":"081700000","isSelected":false},{
   "id":715,"name":"德州","parentId":544,"code":"161600000","isSelected":false},{
   "id":627,"name":"大庆","parentId":537,"code":"101300000","isSelected":false},{
   "id":805,"name":"德阳","parentId":552,"code":"231700000","isSelected":false},{
   "id":706,"name":"东营","parentId":544,"code":"162000000","isSelected":false},{
   "id":577,"name":"大同","parentId":533,"code":"061200000","isSelected":false},{
   "id":815,"name":"达州","parentId":552,"code":"230300000","isSelected":false},{
   "id":841,"name":"大理","parentId":554,"code":"250700000","isSelected":false},{
   "id":604,"name":"丹东","parentId":535,"code":"081300000","isSelected":false},{
   "id":874,"name":"定西","parentId":557,"code":"280400000","isSelected":false},{
   "id":842,"name":"德宏","parentId":554,"code":"250600000","isSelected":false},{
   "id":107620,"name":"儋州","parentId":550,"code":"220201000","isSelected":false},{
   "id":845,"name":"迪庆","parentId":554,"code":"250300000","isSelected":false}],"E":[{
   "id":741,"name":"鄂州","parentId":546,"code":"181600000","isSelected":false},{
   "id":748,"name":"恩施","parentId":546,"code":"180300000","isSelected":false},{
   "id":592,"name":"鄂尔多斯","parentId":534,"code":"070900000","isSelected":false}],"F":[{
   "id":768,"name":"佛山","parentId":548,"code":"202000000","isSelected":false},{
   "id":681,"name":"福州","parentId":542,"code":"140100000","isSelected":false},{
   "id":674,"name":"阜阳","parentId":541,"code":"131000000","isSelected":false},{
   "id":700,"name":"抚州","parentId":543,"code":"150200000","isSelected":false},{
   "id":602,"name":"抚顺","parentId":535,"code":"081500000","isSelected":false},{
   "id":607,"name":"阜新","parentId":535,"code":"081000000","isSelected":false},{
   "id":790,"name":"防城港","parentId":549,"code":"210900000","isSelected":false}],"G":[{
   "id":763,"name":"广州","parentId":548,"code":"200100000","isSelected":false},{
   "id":822,"name":"贵阳","parentId":553,"code":"240100000","isSelected":false},{
   "id":787,"name":"桂林","parentId":549,"code":"211200000","isSelected":false},{
   "id":697,"name":"赣州","parentId":543,"code":"150500000","isSelected":false},{
   "id":807,"name":"广元","parentId":552,"code":"231900000","isSelected":false},{
   "id":792,"name":"贵港","parentId":549,"code":"210700000","isSelected":false},{
   "id":814,"name":"广安","parentId":552,"code":"230200000","isSelected":false},{
   "id":889,"name":"固原","parentId":559,"code":"300400000","isSelected":false},{
   "id":820,"name":"甘孜藏族自治州","parentId":552,"code":"230800000","isSelected":false}],"H":[{
   "id":653,"name":"杭州","parentId":540,"code":"120100000","isSelected":false},{
   "id":664,"name":"合肥","parentId":541,"code":"130100000","isSelected":false},{
   "id":773,"name":"惠州","parentId":548,"code":"202500000","isSelected":false},{
   "id":622,"name":"哈尔滨","parentId":537,"code":"100100000","isSelected":false},{
   "id":799,"name":"海口","parentId":550,"code":"220100000","isSelected":false},{
   "id":587,"name":"呼和浩特","parentId":534,"code":"070100000","isSelected":false},{
   "id":568,"name":"邯郸","parentId":532,"code":"051300000","isSelected":false},{
   "id":657,"name":"湖州","parentId":540,"code":"122200000","isSelected":false},{
   "id":752,"name":"衡阳","parentId":547,"code":"191100000","isSelected":false},{
   "id":108353,"name":"海外","parentId":108352,"code":"350100000","isSelected":false},{
   "id":643,"name":"淮安","parentId":539,"code":"112500000","isSelected":false},{
   "id":718,"name":"菏泽","parentId":544,"code":"160200000","isSelected":false},{
   "id":575,"name":"衡水","parentId":532,"code":"050600000","isSelected":false},{
   "id":776,"name":"河源","parentId":548,"code":"201400000","isSelected":false},{
   "id":760,"name":"怀化","parentId":547,"code":"190300000","isSelected":false},{
   "id":745,"name":"黄冈","parentId":546,"code":"181100000","isSelected":false},{
   "id":737,"name":"黄石","parentId":546,"code":"181200000","isSelected":false},{
   "id":672,"name":"黄山","parentId":541,"code":"131900000","isSelected":false},{
   "id":612,"name":"葫芦岛","parentId":535,"code":"080500000","isSelected":false},{
   "id":669,"name":"淮北","parentId":541,"code":"131600000","isSelected":false},{
   "id":667,"name":"淮南","parentId":541,"code":"131400000","isSelected":false},{
   "id":593,"name":"呼伦贝尔","parentId":534,"code":"070800000","isSelected":false},{
   "id":860,"name":"汉中","parentId":556,"code":"270600000","isSelected":false},{
   "id":795,"name":"贺州","parentId":549,"code":"210400000","isSelected":false},{
   "id":837,"name":"红河","parentId":554,"code":"251100000","isSelected":false},{
   "id":796,"name":"河池","parentId":549,"code":"210300000","isSelected":false},{
   "id":724,"name":"鹤壁","parentId":545,"code":"171600000","isSelected":false},{
   "id":879,"name":"海东","parentId":558,"code":"290200000","isSelected":false},{
   "id":625,"name":"鹤岗","parentId":537,"code":"101500000","isSelected":false},{
   "id":893,"name":"哈密","parentId":560,"code":"311400000","isSelected":false}],"J":[{
   "id":702,"name":"济南","parentId":544,"code":"160100000","isSelected":false},{
   "id":659,"name":"金华","parentId":540,"code":"122400000","isSelected":false},{
   "id":656,"name":"嘉兴","parentId":540,"code":"122100000","isSelected":false},{
   "id":769,"name":"江门","parentId":548,"code":"202100000","isSelected":false},{
   "id":709,"name":"济宁","parentId":544,"code":"162300000","isSelected":false},{
   "id":582,"name":"晋中","parentId":533,"code":"060700000","isSelected":false},{
   "id":614,"name":"吉林","parentId":536,"code":"091000000","isSelected":false},{
   "id":694,"name":"九江","parentId":543,"code":"150800000","isSelected":false},{
   "id":782,"name":"揭阳","parentId":548,"code":"200600000","isSelected":false},{
   "id":744,"name":"荆州","parentId":546,"code":"181900000","isSelected":false},{
   "id":726,"name":"焦作","parentId":545,"code":"171800000","isSelected":false},{
   "id":605,"name":"锦州","parentId":535,"code":"081200000","isSelected":false},{
   "id":742,"name":"荆门","parentId":546,"code":"181700000","isSelected":false},{
   "id":698,"name":"吉安","parentId":543,"code":"150400000","isSelected":false},{
   "id":692,"name":"景德镇","parentId":543,"code":"151000000","isSelected":false},{
   "id":580,"name":"晋城","parentId":533,"code":"060900000","isSelected":false},{
   "id":629,"name":"佳木斯","parentId":537,"code":"101100000","isSelected":false},{
   "id":872,"name":"酒泉","parentId":557,"code":"280600000","isSelected":false},{
   "id":107292,"name":"济源","parentId":545,"code"