Scraping Administrative Division Codes from the National Bureau of Statistics

Target URL: 2022 statistical division codes and urban-rural classification codes (2022年统计用区划代码和城乡划分代码)

 

Result preview: the scraper collects province_code, province_name, city_code, city_name, county_code, county_name, town_code, town_name, village_code, and village_name for every row.
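The codes are hierarchical, and the scraper below relies on that when it rebuilds page URLs from code prefixes. As a minimal sketch, assuming the usual 12-digit layout (2 digits for province, 2 for city, 2 for county, 3 for town, 3 for village), a hypothetical helper named `split_code` could break a village code into its levels. The helper name and the returned dictionary keys are illustrative and not part of the original script.

```python
def split_code(code):
    """Split a 12-digit division code into one cumulative prefix per level.

    Assumption: the layout is 2 (province) + 2 (city) + 2 (county)
    + 3 (town) + 3 (village) digits. split_code is a hypothetical helper.
    """
    if len(code) != 12 or not code.isdigit():
        raise ValueError("expected a 12-digit numeric code: %r" % (code,))
    return {
        "province": code[:2],   # first 2 digits
        "city": code[:4],       # first 4 digits
        "county": code[:6],     # first 6 digits
        "town": code[:9],       # first 9 digits
        "village": code,        # all 12 digits
    }
```

With a helper like this, the parent codes of any village row can be derived from the village code alone, which is handy for checking that the scraped columns agree with each other.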

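Scraping issue: I first built a multithreaded crawler, but sending many requests to the site within a short time causes page requests to fail, so I fall back to a single thread and crawl slowly.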
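One way to cope with requests that fail under load is to retry with a growing delay. The sketch below is not the approach used in this post (which only sleeps a random 0 to 3 seconds between requests); it assumes a fetch function that returns None on failure, and the name `fetch_with_retry` and its parameters are hypothetical.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or retries run out.

    Assumption: fetch is any callable that returns None on failure.
    The wait after a failed attempt is base_delay * 2**attempt seconds
    plus up to 1 second of random jitter.
    """
    for attempt in range(retries):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < retries - 1:
            # Back off before the next attempt; no wait after the last one.
            sleep(base_delay * (2 ** attempt) + random.random())
    return None
```

The `sleep` parameter exists only so that the waiting can be swapped out when trying the function without real delays. Any page-fetching function could be passed as `fetch`, provided it signals failure by returning None.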

The code is as follows:

from lxml import etree
import requests
import time
import random


def get_html(url):
    """Fetch url and return the parsed lxml tree, or None if the request fails."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print("request failed:", url, exc)
        return None
    response.encoding = "utf8"
    if not response.text.strip():
        return None
    return etree.HTML(response.text)


base_url = "http://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2022/"
url = base_url + "index.html"
province_html = get_html(url)
if province_html is None:
    raise SystemExit("province_html is None " + url)
# Each province link is "<code>.html"; drop the extension to get the code.
province_href = province_html.xpath('//tr[@class="provincetr"]/td/a/@href')
province_name = province_html.xpath('//tr[@class="provincetr"]/td/a/text()')
province = dict(zip([p.split(".")[0] for p in province_href], province_name))
for p_key in province:
    url_city = base_url + p_key + ".html"
    time.sleep(random.randint(0, 3))  # random 0-3 s pause between requests
    city_html = get_html(url_city)
    if city_html is None:
        print("city_html is None", url_city)
        continue
    city_code = city_html.xpath('//tr[@class="citytr"]/td[1]/a/text()')
    city_name = city_html.xpath('//tr[@class="citytr"]/td[2]/a/text()')
    city_href = city_html.xpath('//tr[@class="citytr"]/td[1]/a/@href')
    for c_num in range(len(city_href)):
        county_url = base_url + city_href[c_num]
        time.sleep(random.randint(0, 3))
        county_html = get_html(county_url)
        if county_html is None:
            print("county_html is None", county_url)
            continue
        # Only rows that carry a link are collected, so the three lists stay aligned.
        county_code = county_html.xpath('//tr[@class="countytr"]/td[1]/a/text()')
        county_name = county_html.xpath('//tr[@class="countytr"]/td[2]/a/text()')
        county_href = county_html.xpath('//tr[@class="countytr"]/td[1]/a/@href')
        for t_num in range(len(county_href)):
            # County links are relative to the province directory.
            # base_url already ends with "/", so no extra slash is added.
            town_url = base_url + city_href[c_num].split('/')[0] + "/" + county_href[t_num]
            time.sleep(random.randint(0, 3))
            town_html = get_html(town_url)
            if town_html is None:
                print("town_html is None", town_url)
                continue
            town_code = town_html.xpath('//tr[@class="towntr"]/td[1]/a/text()')
            town_name = town_html.xpath('//tr[@class="towntr"]/td[2]/a/text()')
            town_href = town_html.xpath('//tr[@class="towntr"]/td[1]/a/@href')
            for v_num in range(len(town_href)):
                # Cut the ".html" suffix by slicing: rstrip(".html") strips a
                # set of characters, not a suffix.
                code_ = town_href[v_num].split("/")[1][:-len(".html")]
                # The first two digits are the province, the next two the city.
                village_url = base_url + code_[0:2] + "/" + code_[2:4] + "/" + town_href[v_num]
                time.sleep(random.randint(0, 3))
                village_html = get_html(village_url)
                if village_html is None:
                    print("village_html is None", village_url)
                    continue
                # Village rows have no links: td[1] is the code, td[3] the name.
                village_code = village_html.xpath('//tr[@class="villagetr"]/td[1]/text()')
                village_name = village_html.xpath('//tr[@class="villagetr"]/td[3]/text()')
                for v_code, v_name in zip(village_code, village_name):
                    print(p_key, province[p_key], city_code[c_num], city_name[c_num], county_code[t_num],
                          county_name[t_num], town_code[v_num], town_name[v_num], v_code, v_name)
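The script above only prints each row. If the rows need to be kept, one option is to write them to a CSV file instead. The following is a minimal sketch under that assumption: `FIELDS`, `write_rows`, and the output path are hypothetical names, and `rows` stands for any iterable of 10-item sequences in the same order as the printed values.

```python
import csv

# Column names, in the same order as the values printed by the scraper.
FIELDS = ["province_code", "province_name", "city_code", "city_name",
          "county_code", "county_name", "town_code", "town_name",
          "village_code", "village_name"]


def write_rows(path, rows):
    """Write 10-item rows to a CSV file with a header; return the row count.

    The file is encoded as utf-8-sig (UTF-8 with a byte order mark), which
    spreadsheet programs tend to need to display Chinese names correctly.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        for row in rows:
            if len(row) != len(FIELDS):
                raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(row)))
            writer.writerow(row)
            count += 1
    return count
```

Because `rows` can be a generator, the innermost loop of the scraper could yield tuples instead of printing them, so that rows are written as they arrive rather than collected in memory first.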
