Python Web Scraping and Text Mining - 4: World Heritage Sites in Danger

Article link: https://blog.csdn.net/hjh00/article/details/57083978

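This post works through the case study from the overview chapter: World Heritage Sites in Danger. The book implements it with three R libraries: stringr, XML and maps. Reproducing the same functionality in Python also requires setting up a suitable environment. The development environment used here is Windows.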

1. Preparation

1) Download osgeo4w-setup-x86_64.exe from http://download.osgeo.org/osgeo4w/ and install OSGeo4W;

2) Download basemap-1.0.8-cp27-none-win_amd64.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and install it with pip:

pip install basemap-1.0.8-cp27-none-win_amd64.whl

2. Code walkthrough

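The implementation has three parts: 1) download the HTML page with urllib2; 2) parse it with lxml's etree and extract the data from the HTML tables, which requires XPath; 3) draw the map with basemap. The complete code follows: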

# -*- coding:utf8 -*-

import urllib2
from lxml import etree

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np


def read_html(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    page = response.read()
    return etree.HTML(page)


def get_tables(doc):
    tables = []
    elements = doc.xpath( '/html/body//table')
    for tab in elements:
        table = []
        rows = tab.xpath('./tr')
        for row in rows:
            rec = []
            for column in row.xpath('./th | ./td'):
                rec.append(column)
            table.append(rec)
        tables.append(table)
    return tables


def to_decimal(coord):
    # coord is a UTF-8 byte string such as "36°50′N": degrees, minutes,
    # optional seconds, then a hemisphere letter (N, S, E or W).
    hemisphere = coord[-1]
    parts = coord[:-1].replace("′", "°").replace("″", "°").split("°")
    # Degrees, minutes and seconds weigh 1, 1/60 and 1/3600 respectively.
    value = sum(float(p) / 60 ** i for i, p in enumerate(parts) if p)
    # Southern and western coordinates are negative in decimal degrees.
    return -value if hemisphere in "SW" else value


def get_location(latitude, longitude):
    return (round(to_decimal(latitude), 2), round(to_decimal(longitude), 2))


def get_heritage_data(table):
    data = []
    # Row 0 holds the column headers, so start from row 1.
    for i in range(1, len(table)):
        row = table[i]
        name = row[0].xpath('./a/text()')
        geo = row[2].xpath('.//span/a[@class="external text"]/span/span/span')

        name =  name[0].encode("utf-8")
        latitude =  geo[0].text.encode("utf-8")
        longitude = geo[1].text.encode("utf-8")
        loc = get_location(latitude, longitude)
        data.append((name, loc[0], loc[1]))

    return data


def main():
    # Scrape the data
    url = 'http://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger'
    doc = read_html(url)
    tables = get_tables(doc)
    data = get_heritage_data(tables[1])
    latitude = []
    longitude = []
    for rec in data:
        latitude.append(rec[1])
        longitude.append(rec[2])

    # Draw the map
    world_map = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80,
                        llcrnrlon=-180, urcrnrlon=180, lat_ts=20, resolution='c')

    world_map.drawcoastlines()
    #world_map.fillcontinents(color="coral", lake_color='aqua')
    world_map.drawparallels(np.arange(-90, 91, 30))
    world_map.drawmeridians(np.arange(-180, 181, 60))
    #world_map.drawmapboundary(fill_color='aqua')
    world_map.drawmapboundary()
    # A Basemap instance is called with (longitude, latitude), in that order.
    x, y = world_map(longitude, latitude)
    world_map.scatter(x, y, 50)
    plt.title('World Heritage in Danger')
    plt.show()

if __name__ == '__main__':
    main()

read_html takes a URL, downloads the page, and returns it parsed into an lxml element tree.
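Some servers reject requests that carry urllib's default User-Agent, so it can help to set one explicitly and to pass a timeout. The following is a minimal sketch of the download step only, not the code used above: build_request, fetch and the USER_AGENT value are hypothetical names chosen for illustration, and the import fallback is there so the same lines run on Python 2 (as in this article) and on Python 3.

```python
try:
    from urllib2 import Request, urlopen          # Python 2, as in the article
except ImportError:
    from urllib.request import Request, urlopen   # Python 3

# Hypothetical identifier; replace it with one that describes your own script.
USER_AGENT = "heritage-scraper-example/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=30):
    """Download url and return the raw response body as bytes."""
    response = urlopen(build_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

The bytes returned by fetch can be handed to etree.HTML in the same way read_html does.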

get_tables extracts the tables. Each table is stored as a Python list of rows, and each row is a list of the <td> or <th> elements it contains.
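get_tables selects rows with './tr', which only matches <tr> elements that are direct children of <table>. If the downloaded HTML wraps its rows in <tbody>, that expression matches nothing, because lxml's HTML parser keeps the markup as it was served. The sketch below shows one way to cover both cases. It assumes lxml is installed; the sample markup and the table_rows helper are made up for illustration, and it returns cell text instead of the elements that get_tables keeps.

```python
from lxml import etree

# Hypothetical sample markup; not taken from the Wikipedia page.
SAMPLE = """
<html><body>
<table>
  <tbody>
    <tr><th>Name</th><th>Location</th></tr>
    <tr><td>Site A</td><td>Country X</td></tr>
  </tbody>
</table>
</body></html>
"""


def table_rows(doc):
    """Return each table as a list of rows, each row a list of cell texts."""
    tables = []
    for tab in doc.xpath("/html/body//table"):
        # The union matches rows with or without a <tbody> wrapper.
        rows = tab.xpath("./tr | ./tbody/tr")
        tables.append([
            [cell.xpath("string()").strip() for cell in row.xpath("./th | ./td")]
            for row in rows
        ])
    return tables


doc = etree.HTML(SAMPLE)
tables = table_rows(doc)
```

With this sample, './tr' alone finds no rows, while the union finds both of them.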

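get_heritage_data extracts the name and the latitude/longitude of each site from a table, using XPath to pick out the elements. Because the coordinates sit inside nested <span> elements within the <td>, I used the browser's developer tools to work out an XPath expression that matches them.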
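The coordinate strings that get_heritage_data extracts are in degrees/minutes/seconds form with a trailing hemisphere letter, and get_location turns them into decimal degrees. The conversion can be sketched on its own as below. This is an illustration under stated assumptions: dms_to_decimal is a hypothetical helper name, the sample strings are made up, and it works on unicode strings with a regular expression instead of the byte-string replace/split used in the listing.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import re


def dms_to_decimal(text):
    """Convert a string such as '25°33′16″N' to signed decimal degrees."""
    text = text.strip()
    hemisphere = text[-1].upper()
    if hemisphere not in "NSEW":
        raise ValueError("missing hemisphere letter: %r" % text)
    # Pull out degrees, minutes and (optionally) seconds, in that order.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text[:-1])]
    if not 1 <= len(numbers) <= 3:
        raise ValueError("unexpected coordinate format: %r" % text)
    # Degrees weigh 1, minutes 1/60 and seconds 1/3600.
    value = sum(n / 60.0 ** i for i, n in enumerate(numbers))
    # South and west are negative in decimal degrees.
    return -value if hemisphere in "SW" else value


# Made-up sample strings in the same notation.
lat = dms_to_decimal("25°33′16″N")
lon = dms_to_decimal("80°59′47″W")
```

The sign matters for plotting: without it, sites in the southern or western hemisphere would land in the wrong place on the map.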