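Scraping County-Level-and-Above Administrative Division Codes with jsoup and Exporting Them to a JSON File
I needed this data for a project at work, and most of what I could find online dated from 2017, so I wrote my own scraper and converted the result into a JSON file for my own use.
Code
- First you need jsoup. My project uses Maven, so I added jsoup and the other required jars as dependencies: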
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.11.3</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
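- Next, the data has to come from the official website of the Ministry of Civil Affairs of the People's Republic of China. The link below lists the data along with the changes made over the years:
http://www.mca.gov.cn/article/sj/xzqh/2018/
I used the October 2018 edition, "2018年10月中华人民共和国县以上行政区划代码" (county-level-and-above administrative division codes of the PRC). Once you have looked at how the page is structured, you can get to work.
- Now for the actual code: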
package com.province;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import java.io.*;
import java.util.*;
/**
* @author Mr.c
* @Description
* @create 2018/12/6 17:43
* Copyright: Copyright (c) 2018
*/
public class AreaCode {
    // Two-digit province prefix, set and used by deal()
    private static String top2;

    @Test
    public void test() throws IOException {
        String url = "http://www.mca.gov.cn/article/sj/xzqh/2018/201804-12/20181011221630.html";
        // maxBodySize(0) removes jsoup's body size limit so the whole table is downloaded
        Document doc = Jsoup.connect(url).maxBodySize(0).get();
        String fileName = doc.select("link[rel=File-List]").attr("href");
        fileName = fileName.replace(".files/filelist.xml", "");
        System.out.println(fileName);
        // Data rows are selected as <tr> elements with height="19"
        Elements trList = doc.select("tr[height=19]");
        List<CodeMessage> codeMessageList = new ArrayList<>();
        int i = 0;
        for (Element tr : trList) {
            // The code and name cells are picked out by this page-specific class
            Elements tdList = tr.select("[class=xl7021820]");
            if (tdList.size() < 2) {
                continue; // skip rows that do not carry both a code and a name
            }
            String code = tdList.get(0).text();
            String text = tdList.get(1).text();
            ++i;
            System.out.print(code + "-->>>" + text);
            System.out.println("\t\t\t\t count so far: " + i);
            codeMessageList.add(new CodeMessage(code, text));
        }
        List<CodeMessage> list = new ArrayList<>();
        deal(codeMessageList, list);
        // List.toString() combined with CodeMessage.toString() yields a JSON array
        System.out.println(list.toString());
        createJsonFile(list.toString());
    }
    public void createJsonFile(String jsonStr) {
        String fullPath = this.getClass().getResource("/").getPath() + "codeArea.json";
        boolean flag = true;
        File file = new File(fullPath);
        // Create the parent directory if it does not exist
        if (!file.getParentFile().exists()) {
            file.getParentFile().mkdirs();
        }
        // FileOutputStream creates the file or truncates an existing one;
        // try-with-resources closes the writer even if the write fails
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8")) {
            writer.write(jsonStr);
        } catch (IOException e) {
            flag = false;
            e.printStackTrace();
        }
        System.out.println("Created administrative-division JSON file: " + flag);
    }
    public void deal(List<CodeMessage> codeMessageList, List<CodeMessage> list) {
        // Pull out provinces and municipalities: their codes end in "0000"
        Iterator<CodeMessage> iter = codeMessageList.iterator();
        while (iter.hasNext()) {
            CodeMessage cm = iter.next();
            if (cm.getCode().endsWith("0000")) {
                list.add(cm);
                iter.remove();
            }
        }
        // Second level: cities, or districts in the case of a municipality
        for (CodeMessage province : list) {
            String code = province.getCode();
            String text = province.getText();
            List<CodeMessage> cityList = new ArrayList<>();
            top2 = code.substring(0, 2);
            // A top-level name containing "市" (e.g. 北京市) is a municipality,
            // so every remaining entry with its prefix is kept
            boolean municipality = text.contains("市");
            iter = codeMessageList.iterator();
            while (iter.hasNext()) {
                CodeMessage city = iter.next();
                String cityCode = city.getCode();
                // A single combined test: an entry matching both rules is added and
                // removed only once (a second iter.remove() would throw IllegalStateException)
                if (cityCode.startsWith(top2) && (municipality || cityCode.endsWith("00"))) {
                    cityList.add(city);
                    iter.remove();
                }
            }
            province.setList(cityList);
        }
    }
}
/**
 * Code/name holder for one administrative division, used to store the scraped data
 */
class CodeMessage {
    private String code;
    private String text;
    private List<CodeMessage> list;

    public CodeMessage(String code, String text) {
        this.code = code;
        this.text = text;
    }

    public String getCode() {
        return code;
    }

    public void setCode(String code) {
        this.code = code;
    }

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    public List<CodeMessage> getList() {
        return list;
    }

    public void setList(List<CodeMessage> list) {
        this.list = list;
    }

    // Renders the object as JSON by hand; values are not escaped
    @Override
    public String toString() {
        final StringBuilder sb = new StringBuilder("{");
        sb.append("\"code\":\"")
                .append(code).append('\"');
        sb.append(",\"text\":\"")
                .append(text).append('\"');
        sb.append(",\"list\":")
                .append(list);
        sb.append('}');
        return sb.toString();
    }
}
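The `deal` method above groups entries purely by code prefix. For anyone who wants the county level as well, here is a minimal sketch of how a six-digit code could be classified and linked to its parent. It assumes the usual GB/T 2260 layout (digits 1-2 for the province, 3-4 for the prefecture, 5-6 for the county). `CodeLevels`, `levelOf` and `parentOf` are hypothetical names that are not part of the listing above, and the fallback rule is my own assumption based on the fact that districts of a municipality have no prefecture row in the data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of the original scraper. */
final class CodeLevels {
    enum Level { PROVINCE, PREFECTURE, COUNTY }

    /** Classifies a six-digit code by its trailing zeros. */
    static Level levelOf(String code) {
        if (code.endsWith("0000")) return Level.PROVINCE;
        if (code.endsWith("00")) return Level.PREFECTURE;
        return Level.COUNTY;
    }

    /**
     * Returns the code of the closest ancestor that is present in knownCodes,
     * or null for a province-level code.
     */
    static String parentOf(String code, Set<String> knownCodes) {
        String provinceCode = code.substring(0, 2) + "0000";
        switch (levelOf(code)) {
            case PROVINCE:
                return null;
            case PREFECTURE:
                return provinceCode;
            default:
                String prefectureCode = code.substring(0, 4) + "00";
                // Assumption: fall back to the province when the prefecture row
                // is absent from the data (e.g. districts of a municipality)
                return knownCodes.contains(prefectureCode) ? prefectureCode : provinceCode;
        }
    }

    public static void main(String[] args) {
        List<String> codes = Arrays.asList("110000", "110101", "130000", "130100", "130102");
        Set<String> known = new HashSet<>(codes);
        for (String code : codes) {
            System.out.println(code + " " + levelOf(code) + " parent=" + parentOf(code, known));
        }
    }
}
```

With a parent lookup like this, each entry can be attached to whichever `CodeMessage` owns the returned code, which would give three levels instead of two.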
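- The contents of the generated JSON file, codeArea.json, are shown below. I only needed the province and city levels, so only those two are generated; other levels can be produced by adapting the code.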
[{
"code": "110000",
"text": "北京市",
"list": [{
"code": "110101",
"text": "东城区",
"list": null
}, {
"code": "110102",
"text": "西城区",
"list": null
}, {
"code": "110105",
"text": "朝阳区",
"list": null
}, {
"code": "110106",
"text": "丰台区",
"list": null
}, {
"code": "110107",
"text": "石景山区",
"list": null
}, {
"code": "110108",
"text": "海淀区",
"list": null
}, {
"code": "110109",
"text": "门头沟区",
"list": null
}, {
"code": "110111",
"text": "房山区",
"list": null
}, {
"code": "110112",
"text": "通州区",
"list": null
}, {
"code": "110113",
"text": "顺义区",
"list": null
}, {
"code": "110114",
"text": "昌平区",
"list": null
}, {
"code": "110115",
"text": "大兴区",
"list": null
}, {
"code": "110116",
"text": "怀柔区",
"list": null
}, {
"code": "110117",
"text": "平谷区",
"list": null
}, {
"code": "110118",
"text": "密云区",
"list": null
}, {
"code": "110119",
"text": "延庆区",
"list": null
}]
}, {
"code": "120000",
"text": "天津市",
"list": [{
"code": "120101",
"text": "和平区",
"list": null
}, {
"code": "120102",
"text": "河东区",
"list": null
}, {
"code": "120103",
"text": "河西区",
"list": null
}, {
"code": "120104",
"text": "南开区",
"list": null
}, {
"code": "120105",
"text": "河北区",
"list": null
}, {
"code": "120106",
"text": "红桥区",
"list": null
}, {
"code": "120110",
"text": "东丽区",
"list": null
}, {
"code": "120111",
"text": "西青区",
"list": null
}, {
"code": "120112",
"text": "津南区",
"list": null
}, {
"code": "120113",
"text": "北辰区",
"list": null
}, {
"code": "120114",
"text": "武清区",
"list": null
}, {
"code": "120115",
"text": "宝坻区",
"list": null
}, {
"code": "120116",
"text": "滨海新区",
"list": null
}, {
"code": "120117",
"text": "宁河区",
"list": null
}, {
"code": "120118",
"text": "静海区",
"list": null
}, {
"code": "120119",
"text": "蓟州区",
"list": null
}]
}, {
"code": "130000",
"text": "河北省",
"list": [{
"code": "130100",
"text": "石家庄市",
"list": null
}, {
"code": "130200",
"text": "唐山市",
"list": null
}, {
"code": "130300",
"text": "秦皇岛市",
"list": null
}, {
"code": "130400",
"text": "邯郸市",
"list": null
}, {
"code": "130500",
"text": "邢台市",
"list": null
}, {
"code": "130600",
"text": "保定市",