Java crawler for administrative divisions_7 - Fetching administrative divisions from an API with a crawler (urllib).ipynb

{

"cells": [

{

"cell_type": "markdown",

"metadata": {},

"source": [

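"In this tutorial, you will learn how to fetch administrative divisions through the AMap (高德地图) API.\n",

"\n",

"Base data provided:\n",

"\n",

"    None. We build the data from nothing."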

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# Observing network requests"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

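"We fetch the data from AMap. First, observe which network requests the site makes when you search for one of Shenzhen's districts on the AMap website.\n",

"\n",

"In Chrome, right-click and choose Inspect, or open Developer Tools from the settings menu, then click the Network tab to see the requests (other browsers have a similar feature; look around for it).\n",

"\n",

"The principles of crawling are covered separately; here we use the 'crawler 2.0' approach.\n",

"\n",

"Each network request has\n",

"\n",

"    Response Headers\n",

"    Request Headers\n",

"    Query String Parameters\n",

"\n",

"Of these, the request headers and the query string parameters are the ones we care about."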

]

},
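To make the idea of query string parameters concrete, here is a minimal sketch that splits a request URL, as you might copy it from the Network panel, into its endpoint and its parameters. The URL below is a made-up placeholder on `example.com`, not a real AMap request; only `urllib.parse` from the standard library is used.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL standing in for one copied from the browser's Network panel.
copied_url = (
    "https://example.com/search"
    "?keywords=%E7%BD%97%E6%B9%96%E5%8C%BA&output=json&page=1"
)

parts = urlsplit(copied_url)

# Everything before the "?" is the endpoint.
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs percent-decodes the values (UTF-8 by default) and returns a list
# per key; each key appears once here, so keep only the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(endpoint)
print(params)
```

Reading a request this way shows which parameters you would need to reproduce when building the same request from code.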

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# The JSON data format"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"In these network requests, the server returns its data as JSON. So what is JSON?"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

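"First, some background. Get familiar with Python's\n",

"\n",

"list\n",

"\n",

"dict\n",

"\n",

"tuple\n",

"\n",

"Nest dicts and lists freely and you get the shape of a JSON document (a tuple is written out as a JSON array, the same as a list).\n",

"\n",

"JSON examples"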

]

},
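As a small illustration of how Python containers map onto JSON, the sketch below round-trips a nested structure through the standard `json` module. The `name` and `adcode` values match the district queried later in this tutorial; the `center` coordinates are illustrative numbers chosen for the example, not data returned by AMap.

```python
import json

record = {
    "name": "罗湖区",
    "adcode": "440303",
    # Illustrative numbers only. A tuple has no JSON type of its own,
    # so it is serialized as a JSON array.
    "center": (114.131, 22.548),
    "districts": [],
}

# ensure_ascii=False keeps Chinese characters readable in the output text.
text = json.dumps(record, ensure_ascii=False)

# Parsing the text back yields dicts and lists; the tuple comes back as a list.
restored = json.loads(text)

print(text)
print(type(restored["center"]))
```

This is why the parsed API response later in the notebook consists only of dicts, lists, and strings.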

{

"cell_type": "markdown",

"metadata": {},

"source": [

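"Scraping the AMap website directly is difficult because it has anti-crawling measures.\n",

"\n",

"However, AMap provides a dedicated data API for developers.\n",

"\n",

"AMap Open Platform (高德地图开放平台)\n",

"\n",

"You need to register as an AMap developer and apply for a key."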

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

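"On the platform, AMap provides a district query service for developers, along with its documentation: District Query (行政区查询)\n",

"\n",

"Pick a district query there and look at the network requests it makes."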

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# Fetching the administrative divisions"

]

},

{

"cell_type": "code",

"execution_count": 1,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-19T03:44:12.354768Z",

"start_time": "2020-01-19T03:44:11.803160Z"

}

},

"outputs": [],

"source": [

"# Import the packages needed for the crawler\n",

"import urllib\n",

"from urllib import parse\n",

"from urllib import request\n",

"\n",

"import pandas as pd\n",

"# Import json to parse the JSON response later\n",

"import json"

]

},

{

"cell_type": "code",

"execution_count": 2,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-19T03:44:13.091717Z",

"start_time": "2020-01-19T03:44:13.086729Z"

}

},

"outputs": [],

"source": [

"mykey = 'enter your key here'\n",

"# Your developer key tells AMap who is requesting the data. There is a daily quota; register as a developer to get your own key."

]

},
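If you would rather not paste the key into the notebook, one option is to read it from an environment variable. This is only a sketch: the variable name `AMAP_KEY` and the helper `load_amap_key` are names invented for this example, not something AMap prescribes.

```python
import os


def load_amap_key(env_var="AMAP_KEY"):
    """Return the developer key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "register as an AMap developer and export your key first."
        )
    return key


# Usage, once the variable is set in your shell:
# mykey = load_amap_key()
```

Keeping the key out of the notebook makes it easier to share the file without leaking your quota.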

{

"cell_type": "code",

"execution_count": 3,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-19T03:44:16.032917Z",

"start_time": "2020-01-19T03:44:14.094053Z"

}

},

"outputs": [],

"source": [

"keywords = '罗湖区'\n",

"\n",

"# Endpoint of the district query API\n",

"url = 'https://restapi.amap.com/v3/config/district?'\n",

"\n",

"# Query parameters\n",

"dict1 = {\n",

"    'subdistrict':'3',\n",

"    'showbiz':'false',\n",

"    'extensions':'all',\n",

"    'key':mykey,  # developer key; tells AMap who is requesting the data (daily quota applies)\n",

"    's':'rsv3',\n",

"    'output':'json',\n",

"    'level':'district',\n",

"    'keywords':keywords,\n",

"    'platform':'JS',\n",

"    'logversion':'2.0',\n",

"    'sdkversion':'1.4.10'\n",

"}\n",

"\n",

"# Encode the query parameters and append them to the endpoint\n",

"url_data = parse.urlencode(dict1)\n",

"url = url+url_data\n",

"\n",

"# Build the request object (named req so it does not shadow the imported request module)\n",

"req = urllib.request.Request(url)\n",

"\n",

"# Send the request\n",

"response = urllib.request.urlopen(req)\n",

"\n",

"# Read the response body\n",

"webpage = response.read()\n",

"\n",

"# Parse the body as JSON\n",

"result = json.loads(webpage.decode('utf8','ignore'))"

]

},
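Once `result` is parsed, two follow-up steps are usually useful: checking that the request succeeded and turning the `polyline` string into numeric coordinates. The sketch below works on a hand-written sample that mimics the response fields shown in this notebook (`status`, `info`, `infocode`, `districts`, `polyline`), using the first three boundary points of 罗湖区. The helper names `check_response` and `parse_polyline` are invented for this example, and splitting on `|` assumes that AMap separates disjoint boundary parts with that character.

```python
def check_response(result):
    """Return the list of districts, or raise if the API reported a failure."""
    if result.get("status") != "1":
        raise RuntimeError(
            f"AMap request failed: {result.get('info')} ({result.get('infocode')})"
        )
    return result["districts"]


def parse_polyline(polyline):
    """Convert 'lng,lat;lng,lat|lng,lat;...' into a list of rings of (lng, lat) floats."""
    rings = []
    for part in polyline.split("|"):  # assumed separator between disjoint parts
        ring = []
        for pair in part.split(";"):
            if not pair:
                continue  # skip empty fragments, e.g. from a trailing ';'
            lng, lat = pair.split(",")
            ring.append((float(lng), float(lat)))
        if ring:
            rings.append(ring)
    return rings


# Hand-written sample with the same field layout as the real response.
sample = {
    "status": "1",
    "info": "OK",
    "infocode": "10000",
    "districts": [
        {
            "name": "罗湖区",
            "adcode": "440303",
            "polyline": "114.105177,22.531626;114.104808,22.532512;114.104774,22.535038",
        }
    ],
}

districts = check_response(sample)
rings = parse_polyline(districts[0]["polyline"])
print(districts[0]["name"], len(rings), len(rings[0]))
```

With the real `result` from the request above, the same two calls would apply unchanged: `parse_polyline(check_response(result)[0]['polyline'])`.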

{

"cell_type": "code",

"execution_count": 4,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-19T03:44:17.628606Z",

"start_time": "2020-01-19T03:44:17.611648Z"

},

"scrolled": false

},

"outputs": [

{

"data": {

"text/plain": [

"{'status': '1',\n",

" 'info': 'OK',\n",

" 'infocode': '10000',\n",

" 'count': '1',\n",

" 'suggestion': {'keywords': [], 'cities': []},\n",

" 'districts': [{'citycode': '0755',\n",

" 'adcode': '440303',\n",

" 'name': '罗湖区',\n",

" 'polyline': '114.105177,22.531626;114.104808,22.532512;114.104774,22.535038;114.104757,22.535105;114.104772,22.5352;114.104764,22.535834;114.104699,22.540773;114.104687,22.541316;114.104589,22.546031;114.104519,22.547975;114.104464,22.548114;114.104502,22.548445;114.104486,22.548663;114.10449,22.548786;114.104477,22.549163;114.104506,22.549251;114.104505,22.549327;114.10447,22.549363;114.104341,22.552936;114.104281,22.555434;114.104472,22.555779;114.104487,22.555809;114.104508,22.555845;114.104557,22.555933;114.104576,22.556013;114.104576,22.556037;114.104574,22.556168;114.10456,22.556475;114.10456,22.556561;114.104559,22.556903;114.104552,22.557291;114.104551,22.557399;114.104547,22.557726;114.104541,22.557852;114.104542,22.558166;114.104536,22.558579;114.104534,22.558701;114.104529,22.559019;114.104521,22.559395;114.104523,22.559815;114.104503,22.560243;114.104498,22.560353;114.104502,22.560685;114.104495,22.561075;114.104487,22.561174;114.104496,22.561506;114.104496,22.561921;114.104496,22.562368;114.104502,22.562489;114.104504,22.562812;114.104506,22.563216;114.104504,22.563617;114.104508,22.563748;114.104512,22.564046;114.104508,22.56422;114.104498,22.564475;114.104502,22.564899;114.104511,22.565285;114.104509,22.565474;114.104508,22.565722;114.104517,22.56614;114.104521,22.566593;114.104523,22.567017;114.104518,22.567455;114.104524,22.567872;114.104514,22.56789;114.104473,22.567887;114.104281,22.56785;114.104165,22.567827;114.104041,22.567776;114.103984,22.567714;114.103221,22.566996;114.101604,22.566261;114.100625,22.565896;114.099474,22.565566;114.098216,22.565256;114.096829,22.56491;114.095665,22.564683;114.094653,22.564538;114.093662,22.564387;114.09319,22.564326;114.092677,22.564341;114.092628,22.564336;114.092616,22.564338;114.092547,22.564334;114.092482,22.564333;114.092448,22.564344;114.092346,22.564394;114.092238,22.564462;114.092232,22.564471;114.092198,22.564515;114.092159,22.564576;114.092053,22.564749;114.091951,22.564934;114.091793,22.565191;11
4.091604,22.565503;114.091404,22.565793;114.09114,22.566201;114.091064,22.566319;114.090868,22.566609;114.090934,22.56668;114.090822,22.566674;114.090558,22.567061;114.090367,22.567349;114.090287,22.567475;114.090217,22.567571;114.090162,22.567638;114.08992,22.56787;114.089529,22.568249;114.089399,22.56837;114.089068,22.568689;114.088819,22.568935;114.088709,22.569037;114.088646,22.569078;114.088535,22.569171;114.088247,22.569416;114.088167,22.569495;114.088081,22.569532;114.08804,22.569578;114.087999,22.569556;114.087781,22.569619;114.087106,22.56982;114.086649,22.569956;114.08641,22.570031;114.086242,22.570088;114.086163,22.570103;114.085801,22.570148;114.085672,22.570161;114.085463,22.570188;114.085116,22.570236;114.084791,22.570277;114.084619,22.570303;114.084456,22.570309;114.084331,22.570304;114.084081,22.570304;114.083769,22.570298;114.083665,22.570292;114.083473,22.570284;114.083275,22.570264;114.083075,22.570242;114.082817,22.57021;114.082622,22.570188;114.082508,22.570159;114.08228,22.570096;114.082174,22.570069;114.081817,22.569965;114.081591,22.569903;114.081421,22.569841;114.081242,22.569769;114.080933,22.569619;114.080908,22.569568;114.08085,22.569578;114.080686,22.569497;114.080538,22.569445;114.080433,22.569393;114.080271,22.569293;114.079923,22.569056;114.079662,22.568882;114.079393,22.568705;114.079153,22.568538;114.079117,22.568517;114.079096,22.568519;114.079028,22.568567;114.078887,22.568673;114.078607,22.56885;114.078495,22.568924;114.078397,22.569016;114.078186,22.569219;114.078015,22.56938;114.077965,22.569438;114.077919,22.569561;114.077803,22.569836;114.077751,22.569966;114.077635,22.570266;114.077564,22.57047;114.077481,22.570689;114.077396,22.570895;114.077328,22.571063;114.077286,22.571168;114.077251,22.571219;114.077139,22.571326;114.076993,22.571499;114.076867,22.571625;114.076762,22.571693;114.076688,22.571739;114.07653,22.571842;114.076423,22.571923;114.076276,22.572064;114.076119,22.57222;114.076027,22.572324;114.075995,22.572373;11
4.075944,22.572466;114.075442,22.573406;114.074444,22.57527;114.074429,22.575304;114.0744
