Scraping Data from the Ministry of Civil Affairs Website

Latest recommended article published on 2024-05-29 19:32:08

猫小咪编程

Category: Python web scraping

Article link: https://blog.csdn.net/jaffe507/article/details/105614420

Goal

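1. URL: http://www.mca.gov.cn/ - Civil Affairs Data (民政数据) - Administrative Division Codes (行政区划代码)
   i.e.: http://www.mca.gov.cn/article/sj/xzqh/2019/
2. Goal: scrape the latest administrative division codes of the People's Republic of China at county level and above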

Implementation steps

1. Extract the link to the latest administrative division codes from the Civil Affairs Data page

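# Characteristics
1. The newest entry is listed first
2. Title format: 2019年X月中华人民共和国县以上行政区划代码 ("Administrative division codes of the People's Republic of China at county level and above, month X of 2019")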
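
The two characteristics above suggest a simple selection rule: walk the links in page order and take the first one whose title fits the naming format. The following is a minimal sketch of that rule, not the article's own code. The `pick_latest` helper, the sample titles, and the `/example/...` hrefs are all hypothetical and only illustrate the idea offline.

```python
import re

# Title format from the list page: <year>年<month>月中华人民共和国县以上行政区划代码
TITLE_PATTERN = re.compile(r"^\d{4}年\d{1,2}月中华人民共和国县以上行政区划代码$")


def pick_latest(links):
    """Return the href of the first (title, href) pair whose title matches.

    The list page shows the newest entry first, so the first match
    is assumed to be the latest one. Returns None if nothing matches.
    """
    for title, href in links:
        # A link may have no title attribute at all, so guard against None
        if title and TITLE_PATTERN.match(title):
            return href
    return None


# Hypothetical sample data in page order (newest first)
SAMPLE_LINKS = [
    ("2019年11月县级以下区划变更情况", "/example/change-11.html"),
    ("2019年11月中华人民共和国县以上行政区划代码", "/example/code-11.html"),
    ("2019年10月中华人民共和国县以上行政区划代码", "/example/code-10.html"),
    (None, "/example/untitled.html"),
]

print(pick_latest(SAMPLE_LINKS))
```

Matching the full title format is stricter than checking only the last two characters, which may help if other entries on the page also happen to end in 代码.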

2. Extract the real link from the second-level page link (anti-scraping measure: the response embeds JS that points to a new link)

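1. Send a request to the second-level page link, read the response, and inspect the embedded JS code
2. Use a regular expression to extract the real second-level page link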
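
The regex step can be tried offline before touching the network. The sketch below assumes the embedded JS has the form `window.location.href="..."`, as described above. The `SAMPLE_RESPONSE` string, the `example.com` URL, and the `extract_true_link` helper are hypothetical stand-ins for a real response.

```python
import re

# Hypothetical response body: a page whose only job is to redirect via JS
SAMPLE_RESPONSE = """
<html><head>
<script>
    window.location.href="http://example.com/real/page.html";
</script>
</head><body></body></html>
"""

# Dots are escaped so they match literally; whitespace around "=" is tolerated
REDIRECT_PATTERN = re.compile(r'window\.location\.href\s*=\s*"(.*?)"', re.S)


def extract_true_link(html):
    """Return the redirect target embedded in the page's JS, or None."""
    match = REDIRECT_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_true_link(SAMPLE_RESPONSE))
```

Returning `None` instead of indexing into an empty result keeps the caller from crashing when a page carries no redirect.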

3. Extract the required data from the real link
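
To show what this extraction step does without a network request, here is a rough sketch that reads codes and names out of a small sample table. It deliberately swaps the article's lxml/XPath approach for the standard-library `html.parser`, so it runs with no third-party packages. The `RowParser` class, the `parse_rows` helper, and the sample markup are hypothetical; the markup assumes, as the article's XPath does, that data rows are `<tr height="19">` with the code in the second cell and the name in the third.

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like the target table
SAMPLE_TABLE = """
<table>
  <tr height="19"><td></td><td>110000</td><td>北京市</td></tr>
  <tr height="19"><td></td><td>110101</td><td> 东城区</td></tr>
  <tr height="20"><td></td><td>ignored</td><td>ignored</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collect (code, name) pairs from rows with height="19"."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None    # cells of the row being read, None outside data rows
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and dict(attrs).get("height") == "19":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_td = True

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            # Second cell holds the code, third cell holds the name
            if len(self._cells) >= 3:
                self.rows.append((self._cells[1].strip(), self._cells[2].strip()))
            self._cells = None


def parse_rows(html):
    parser = RowParser()
    parser.feed(html)
    return parser.rows


for code, name in parse_rows(SAMPLE_TABLE):
    print(name, code)
```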

4. Code implementation

import requests
from lxml import etree
import re

class GovementSpider(object):
  def __init__(self):
    self.url = 'http://www.mca.gov.cn/article/sj/xzqh/2019/'
    self.headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)'}


  # Get the "false" link: the list-page link that only redirects via JS
  def get_false_link(self):
    html = requests.get(url=self.url, headers=self.headers).text
    # Parse the list page
    parse_html = etree.HTML(html)
    # a_list: [<Element a at xx>, <Element a at xxx>]
    a_list = parse_html.xpath('//a[@class="artitlelist"]')
    for a in a_list:
      # get() returns the value of an attribute, or None if it is missing
      title = a.get('title')
      # The newest entry comes first, so stop at the first match
      if title and title.endswith('代码'):
        false_link = 'http://www.mca.gov.cn' + a.get('href')
        self.get_true_link(false_link)
        break


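  # Get the real link
  def get_true_link(self,false_link):
    # Request the false link first, then extract the real link from its response
    html = requests.get(url=false_link, headers=self.headers).text
    # Extract the real link with a regular expression (dots escaped to match literally)
    re_bds = r'window\.location\.href="(.*?)"'
    pattern = re.compile(re_bds,re.S)
    links = pattern.findall(html)
    # Guard against pages that carry no JS redirect
    if links:
      self.parse_html(links[0])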

  # Extract the data
  def parse_html(self,true_link):
    html = requests.get(url=true_link, headers=self.headers).text

    # Extract the data with XPath
    parse_html = etree.HTML(html)
    tr_list = parse_html.xpath('//tr[@height="19"]')
    for tr in tr_list:
      code = tr.xpath('./td[2]/text()')
      name = tr.xpath('./td[3]/text()')
      # Skip rows whose code or name cell has no text
      if code and name:
        print(name[0].strip(), code[0].strip())


  # Entry point
  def main(self):
    self.get_false_link()

if __name__ == '__main__':
  spider = GovementSpider()
  spider.main()