Get all the coordinates of listed banks from Google Maps with Python 3


Summary

Last week, my girlfriend was writing her finance paper. She needed a dataset of the distances between each bank's headquarters and its sub-branches in China. However, after asking nearly every seller on Taobao, we could not find anyone who could collect this data, so I decided to collect it for her myself. This article describes how to get all the data with the Google Maps API. I ran into many unforeseen exceptions while developing this project, so the article also explains what the problems were and how to resolve them.


Google API

First of all, we have to register a Google API key. While I was using the API, it was free and allowed 150,000 queries per day.
To get an API key, go to https://console.developers.google.com/, create a project, then enable the Google Places API Web Service in the API manager.
Once this API is enabled, we can get data from Google Maps.
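
A minimal sketch of one way to keep the key out of the source code, assuming it is stored in an environment variable. The variable name GOOGLE_API_KEY and the helper load_api_key are hypothetical names chosen for this illustration; they are not part of the Google API or of the original project.

```python
import os

def load_api_key(env_name='GOOGLE_API_KEY', default='YOUR_API_KEY'):
    """Return the API key stored in the environment, or a placeholder.

    Both the environment variable name and the placeholder are assumptions
    made for this sketch.
    """
    return os.environ.get(env_name, default)

# The rest of the article refers to the key as API_ID.
API_ID = load_api_key()
```

With this in place, the key can be changed without touching the script.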


urllib

In Python 3, we use urllib to send a request and get data. (It's almost the same in Python 2; please check the documentation on http://Python.org.)
We need urllib.request, urllib.parse, and HTTPError from urllib.error.
We also need to handle JSON and do some math.
So we just import them into our project.

import urllib.request, urllib.parse, json
from urllib.error import HTTPError
from math import radians, cos, sin, asin, sqrt, pi
import time
import os

Prelude

We need to find the addresses of all the branches of the listed banks in China. So let's first define some information: the names of the banks and the coordinates of their headquarters.

banks = ['平安银行','宁波银行','浦发银行','华夏银行','民生银行','招商银行','南京银行','兴业银行','北京银行','农业银行','交通银行','工商银行','光大银行','建设银行','中国银行','中信银行']

banks_hq = {
    '平安银行' : (22.5407058,114.1075029),
    '宁波银行' : (29.8097237,121.5420964),
    '浦发银行' : (31.2378726,121.4899615),
    '华夏银行' : (39.9076986,116.4202794),
    '民生银行' : (39.9060149,116.3715725),
    '招商银行' : (22.5368386,114.0225838),
    '南京银行' : (32.0544521,118.7841394),
    '兴业银行' : (26.0928929,119.3020881),
    '北京银行' : (39.9172526,116.3578085),
    '农业银行' : (39.9085414,116.4227061),
    '交通银行' : (31.2395722,121.5040103),
    '工商银行' : (39.9087474,116.3656513),
    '光大银行' : (39.9182534,116.3635837),
    '建设银行' : (39.9129233,116.3581889),
    '中国银行' : (39.9076719,116.3734256),
    '中信银行' : (39.9308017,116.4350518)
}

Cities' coordinates

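I believe that most sub-branches are in cities, so I found a list that contains all the cities in China and their coordinates, and converted it into a CSV file so that we can use it easily.
The CSV file that we need looks like this. (Note that, judging by the sample row, the fourth column actually holds the latitude and the fifth the longitude, despite the header names; the code below relies on that order.)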

P0,P1,P2,Longitude,Latitude,Size
广东省,广州市,荔湾区,23.13,113.27,0
...

I uploaded it to my GitHub; here's the link:
https://github.com/Voyager2718/Spider/blob/develop/AllAddressByGoogleMaps/cities.csv
Then we define a function to read the CSV file.

def getCities(file, passHead = 1):
    """
    Get all cities and their coordinates.
    @param file: Path of the CSV file.
    @param passHead: How many header lines should be ignored.
    """
    array = []
    # Read-only access is enough; the file contains Chinese text, so UTF-8 is assumed.
    with open(file, 'r', encoding='utf-8') as fp:
        for line in fp:
            if passHead > 0:
                passHead -= 1
                continue
            line = line.rstrip('\n')
            if line:
                # Skip empty lines so that every entry has all its fields.
                array.append(line.split(','))
    return array

This function is quite simple: it reads the file and returns a list with one entry per line, each entry being the list of that line's fields.
The second parameter sets how many header lines to skip.
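
As an alternative sketch, the same parsing can be done with the standard csv module, which also copes with quoted fields. The function name read_cities and the conversion of the two coordinate columns to float are assumptions of this example, not part of the original code; it accepts any iterable of lines so that it can be tried without the real file.

```python
import csv
import io

def read_cities(lines, skip_header=1):
    """Parse city rows from an iterable of CSV lines.

    Returns a list of (P0, P1, P2, coord_a, coord_b) tuples, where the two
    coordinate columns are converted to float (an assumption of this sketch).
    """
    cities = []
    for index, row in enumerate(csv.reader(lines)):
        if index < skip_header or len(row) < 5:
            continue  # skip header lines and incomplete rows
        try:
            cities.append((row[0], row[1], row[2], float(row[3]), float(row[4])))
        except ValueError:
            continue  # skip rows whose coordinates are not numbers
    return cities

# In-memory sample with the same layout as cities.csv.
sample = io.StringIO(
    'P0,P1,P2,Longitude,Latitude,Size\n'
    '广东省,广州市,荔湾区,23.13,113.27,0\n'
)
print(read_cities(sample))
```

For the real file, the same function could be fed with open('cities.csv', encoding='utf-8', newline='').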


Distance between 2 coordinates

I found this function on StackOverflow and modified it a little to make it clearer and to fit the coordinate format used by the Google API. The 2 parameters are coordinates given as (latitude, longitude) tuples, like (23.13,113.27).

def distance(coord0, coord1):
    """
    Calculate the great circle distance in kilometers between two points
    on the earth (specified in decimal degrees as (latitude, longitude))
    """
    r = 6371  # mean radius of the earth in kilometers
    lat1,lon1 = coord0[0],coord0[1]
    lat2,lon2 = coord1[0],coord1[1]
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    return r * c

The return value is in kilometers.
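
As a quick sanity check of the haversine formula, the sketch below recomputes the distance between two of the headquarters listed earlier, 平安银行 and 招商银行, which are both in Shenzhen. The function haversine_km is a self-contained restatement written for this example, and the mean earth radius of 6371 km is an assumption (other sources use slightly different values).

```python
from math import radians, cos, sin, asin, sqrt

def haversine_km(coord0, coord1, radius_km=6371.0):
    """Great circle distance in kilometers between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (coord0[0], coord0[1], coord1[0], coord1[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

pingan = (22.5407058, 114.1075029)     # 平安银行 headquarters, from banks_hq
zhaoshang = (22.5368386, 114.0225838)  # 招商银行 headquarters, from banks_hq
print(round(haversine_km(pingan, zhaoshang), 1), 'km')
```

The two points lie almost on the same latitude, so the result should come out at a little under 9 km.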


Google Places API

API parameters

According to the Google Places API documentation, we need to send a request to the following URL to get data.
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=-33.8670522,151.1957362&radius=500&types=food&name=cruise&key=YOUR_API_KEY
Where

parameter    Detail
key          The API key that we applied for above.
location     The coordinates that we want to search around.
radius       The radius, in meters, of the range that we want to search.

or

parameter    Detail
key          The API key that we applied for above.
pagetoken    Page token for multi-page results.

are required. This means that we can use either the parameters in the first table or the parameters in the second table.
In this project, we also need some optional parameters.

parameter    Detail
language     The language that we want in the returned JSON.
keyword      The keyword of the POI that we want.
types        The type of POI that we want.

As the Google Places API documentation states, each request returns at most 20 POIs, and a search returns at most 60 POIs in total. If there are more pages, the returned JSON contains a specific key, next_page_token. So if we find the key next_page_token in the returned JSON, we should request the remaining results.


However, according to a question on StackOverflow and my own experience, if we request the next page immediately after getting the first one, the API is very likely to return an INVALID_REQUEST status. The official documentation says that there is a short delay before the token becomes valid, but I found that the delay can be 3-5 minutes, which is unacceptable. So we have to request the other pages a couple of minutes later. We will see how to handle this problem in the following sections.
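
One way to cope with this delay, sketched under assumptions: retry the page-token request with growing pauses until the API stops answering INVALID_REQUEST. The function fetch_page_with_wait, its fetch argument, and the delay values are hypothetical; fetch stands for any callable that takes a token and returns the parsed JSON as a dict.

```python
import time

def fetch_page_with_wait(fetch, token, delays=(2, 10, 60, 180), sleep=time.sleep):
    """Call fetch(token) until the status is no longer INVALID_REQUEST.

    fetch  -- hypothetical callable returning the parsed JSON response (a dict)
    delays -- seconds to wait before each new attempt (values are assumptions)
    Returns the last response, which may still be INVALID_REQUEST if every
    attempt failed.
    """
    response = fetch(token)
    for delay in delays:
        if response.get('status') != 'INVALID_REQUEST':
            break
        sleep(delay)  # give the token time to become valid
        response = fetch(token)
    return response
```

Blocking like this is simple but slow when there are thousands of cities, which is why this article postpones all the page-token requests to the end instead.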



Then we write 2 functions to build the URLs that we want.

API_ID = 'YOUR_API_KEY' # Placeholder: replace it with your own API key.

def getOtherPages(token, api_id = API_ID, lang = 'zh-CN'):
    # Request the next page of a previous search by its page token.
    url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?language=' + str(lang) + '&pagetoken=' + str(token) + '&key=' + str(api_id)
    return query(url)

def getResults(coord, keyword, api_id = API_ID, type = 'bank', lang = 'zh-CN', radius = 10000):
    # Search for POIs matching the keyword around the coordinate (latitude, longitude).
    keyword = urllib.parse.quote(keyword)
    url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?types=' + str(type) +'&location=' + str(coord[0]) + ',' + str(coord[1]) + '&radius=' + str(radius) + '&keyword=' + str(keyword) + '&language=' + str(lang) + '&key=' + str(api_id)
    return query(url)

urllib.parse.quote(keyword) encodes the keyword with URL percent-encoding so that we can use Unicode characters (anything beyond ASCII) in the request.
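
To see what the encoding does, here is a small sketch. It also shows urllib.parse.urlencode, a standard-library alternative to concatenating the query string by hand; the helper name build_nearby_url is made up for this example.

```python
import urllib.parse

print(urllib.parse.quote('平安银行'))
# %E5%B9%B3%E5%AE%89%E9%93%B6%E8%A1%8C

def build_nearby_url(coord, keyword, api_id, type='bank', lang='zh-CN', radius=10000):
    """Build a Nearby Search URL; urlencode percent-encodes every value."""
    params = {
        'types': type,
        'location': str(coord[0]) + ',' + str(coord[1]),
        'radius': radius,
        'keyword': keyword,
        'language': lang,
        'key': api_id,
    }
    base = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?'
    # safe=',' keeps the comma between latitude and longitude readable.
    return base + urllib.parse.urlencode(params, safe=',', quote_via=urllib.parse.quote)

print(build_nearby_url((23.13, 113.27), '平安银行', 'YOUR_API_KEY'))
```

Letting urlencode handle every value means that no parameter is forgotten when the keyword or the language contains special characters.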


Query

Once the URL is built, we can query for information.
So we define a function query like this:

def query(url):
    while True:
        try:
            results = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))
            if results['status'] == 'OK' or results['status'] == 'ZERO_RESULTS':
                if 'next_page_token' in results:
                    return {'token' : results['next_page_token'], 'results' : results['results']}
                return {'results' : results['results']}
            print('-' * 50)
            if 'error_message' in results:
                print('Status: ' + str(results['status']) + ' Error:', results['error_message'])
            else:
                print('Status: ' + str(results['status']))
            print('-' * 50)
            if results['status'] == 'OVER_QUERY_LIMIT':
                raise Exception('Over limit.')
            if results['status'] == 'REQUEST_DENIED':
                raise Exception('Request denied.')
            print('Got a "INVALID_REQUEST" error. Retry later...')
            print('URL:\n\t' + url)
            return {'url' : url}
        except HTTPError:
            print('Got a HTTP error. Retrying...')

While requesting information, the API might sometimes return a 500 error, a 503 error, etc., in which case urllib.request.urlopen() raises an HTTPError. But we should not give up; we need to request again. So we put the code into an infinite loop and leave it (here, by returning) once the request has succeeded.
That's the reason why we use

while True:
    try:
        urllib.request.urlopen(something)
    except HTTPError:
        print('HTTPError')
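
The loop above retries forever and without pausing. A bounded variant could look like the following sketch; the function name open_with_retry, the number of attempts, and the pause are assumptions of this example, and fetch stands for any callable that may raise HTTPError (for instance a small wrapper around urllib.request.urlopen).

```python
import time
from urllib.error import HTTPError

def open_with_retry(fetch, url, attempts=5, pause=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on HTTPError at most `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except HTTPError as error:
            print('Got HTTP error ' + str(error.code) + ' (attempt ' + str(attempt) + ')')
            if attempt == attempts:
                raise  # give up and let the caller decide
            sleep(pause)  # short pause before the next attempt
```

Whether giving up is acceptable depends on the job: for a long crawl, an unbounded loop such as the one in query may be preferable to losing a city.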

Then we use a series of if statements to handle the different cases.
If the status of the response is 'OK' or 'ZERO_RESULTS', we check whether 'next_page_token' is in the returned JSON. If it is, we return the page token along with the results and leave it to be handled by the caller. Otherwise we just return the results.
The other statuses are handled in the remaining if statements: 'OVER_QUERY_LIMIT' and 'REQUEST_DENIED' raise an exception.
Finally, that leaves us with the 'INVALID_REQUEST' status, which most likely means that we should request the URL again later, as discussed above. So we just return the URL.
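
The branching can be tried without any network access by applying it to already parsed JSON. The sketch below restates the status handling as a separate function; the name interpret_response and the sample dictionaries are invented for this illustration.

```python
def interpret_response(results, url):
    """Map a parsed Places response to the dictionaries used in this article."""
    status = results['status']
    if status in ('OK', 'ZERO_RESULTS'):
        out = {'results': results['results']}
        if 'next_page_token' in results:
            out['token'] = results['next_page_token']
        return out
    if status in ('OVER_QUERY_LIMIT', 'REQUEST_DENIED'):
        raise Exception(status + ': ' + results.get('error_message', 'no details'))
    # Anything else is treated like INVALID_REQUEST: hand back the URL for a later retry.
    return {'url': url}

print(interpret_response({'status': 'ZERO_RESULTS', 'results': []}, 'URL'))
# {'results': []}
print(interpret_response({'status': 'INVALID_REQUEST'}, 'URL'))
# {'url': 'URL'}
```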


Get all results

As discussed above, we need to get the other pages after a delay. So we define another function that handles the next_page_token values after all the first-page requests are done.

def getResultsInCitiesDelayed(cities, keyword, api_id = API_ID, type = 'bank', lang = 'zh-CN', radius = 10000):
    results = [] # (P0, P1, P2, place) tuples, one per unique place.
    ids = []     # Ids of the places that are already in results.
    tokens = []  # (city, token) pairs of the pages that still have to be read.

    def collect(city, result):
        # Store the new places of one page; return the token of the next page, if any.
        for res in result.get('results', []):
            if res['id'] not in ids:
                results.append((city[0], city[1], city[2], res))
                ids.append(res['id'])
        return result.get('token')

    num = 0
    for city in cities:
        num += 1
        print('Running in method delay...' + keyword + ' ' + str(num) + '/' + str(len(cities)))
        token = collect(city, getResults((city[3],city[4]), keyword, api_id, type, lang, radius))
        if token:
            tokens.append((city, token))
    rnum = 0
    while tokens:
        unread = [] # Tokens that have to wait for the next round.
        for city, token in tokens:
            rnum += 1
            print('Re-reading...' + str(rnum))
            re_read = getOtherPages(token, api_id, lang)
            if 'url' in re_read:
                unread.append((city, token)) # INVALID_REQUEST: retry in the next round.
                continue
            token = collect(city, re_read)
            if token:
                unread.append((city, token)) # This page has a following page.
        tokens = unread
    return results

First of all, we do not want duplicated values, so we keep a list ids that contains the id of every result seen so far.
And the structure

for res in result['results']:
    if res['id'] not in ids:
        ...

checks whether the id is already in the list. If not, we add the result to the list results.
Then we need to go back to every page that came with a 'token', because it has another page to read. For each of them we call getOtherPages. If there is no error (no 'url' in the returned value; a 'url' means that the API returned INVALID_REQUEST, so we leave the token alone and wait for the next round of re-reads), we add the new results to the list results.
Finally, we check again whether there are still any tokens that need to be re-read.
We keep looping as long as a 'next_page_token' remains, so the code sits in a while loop that ends only when there are no more tokens.


Extract results

The result of the requests is an object that contains everything the API returned. So we have to define functions to extract the parts that we are interested in.
In this case, we define a function

def extractCoords(data):
    coords = []
    for item in data:
        coords += [(item[3]['geometry']['location']['lat'],item[3]['geometry']['location']['lng'])]
    return coords

to extract all coordinates.
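
To make the shape of the data concrete, here is a sketch that applies the same extraction to a hand-written sample. The sample entry only mimics the structure used above (three city columns followed by one place dictionary); its name and coordinates are made up.

```python
def extract_coords(data):
    """Return the (lat, lng) pair of every (P0, P1, P2, place) entry."""
    return [(item[3]['geometry']['location']['lat'],
             item[3]['geometry']['location']['lng']) for item in data]

sample = [
    ('广东省', '广州市', '荔湾区',
     {'name': 'hypothetical branch',
      'geometry': {'location': {'lat': 23.12, 'lng': 113.24}}}),
]
print(extract_coords(sample))
# [(23.12, 113.24)]
```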


Run for all banks

To save ourselves some work, we define a function that runs for all banks.

def runAll(cities, extractFunction, source = banks, api_id = API_ID, type = 'bank', lang = 'zh-CN', radius = 10000):
    results = {}
    for s in source:
        results[s] = extractFunction(getResultsInCitiesDelayed(cities, s, api_id, type, lang, radius))
    return results

This function runs for every bank listed in banks and returns a dictionary that holds the results for each bank, like

{'平安银行':[results],'宁波银行':[results],...}

Average distance

Finally, we define functions to calculate the average distance for each bank.

def averageDistance(headQuarter, locations):
    dist = []
    for item in locations:
        if type(item[0]) == float and type(item[1]) == float:
            dist += [distance(item,headQuarter)]
    if not dist:
        return None # No valid location: the average is undefined.
    # Divide by the number of distances actually computed.
    return sum(dist)/len(dist)

def getAllAverageDistance(headQuarters, banksLocations):
    averages = {}
    for loc in banksLocations:
        averages[loc] = averageDistance(headQuarters[loc], banksLocations[loc])
    return averages

getAllAverageDistance returns the average distance for each bank in a dictionary.
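
A worked example may help here. The sketch below averages the distance from one headquarters to two branch locations; the branch coordinates are invented for the illustration, and the helper names haversine_km and average_distance_km are hypothetical.

```python
from math import radians, cos, sin, asin, sqrt

def haversine_km(coord0, coord1, radius_km=6371.0):
    """Great circle distance in kilometers between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (coord0[0], coord0[1], coord1[0], coord1[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

def average_distance_km(head_quarter, locations):
    """Mean distance from the headquarters to the locations, or None if there are none."""
    distances = [haversine_km(head_quarter, loc) for loc in locations]
    return sum(distances) / len(distances) if distances else None

hq = (22.5407058, 114.1075029)  # 平安银行 headquarters, from banks_hq
# Made-up branches: one at the headquarters itself, one exactly 1 degree further north.
branches = [(22.5407058, 114.1075029), (23.5407058, 114.1075029)]
print(average_distance_km(hq, branches))
```

One degree of latitude is about 111 km, so the average of 0 km and about 111 km should be close to 55.6 km.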


Test

>>> all = runAll(getCities('cities.csv'),extractCoords,api_id='AIzaSyCcJUqHWucOoG9r1nscshfBRQE6oycDY04')
Running in method delay...平安银行 1/3179
Running in method delay...平安银行 2/3179
Running in method delay...平安银行 3/3179
Running in method delay...平安银行 4/3179
Running in method delay...平安银行 5/3179
Running in method delay...平安银行 6/3179
Running in method delay...平安银行 7/3179

Finally, it returns the values that we want.
Don’t worry, my API ID has been changed. ;)


Project address

Check my GitHub to find the entire code.


(Branch master recommended. Branch new_algo contains the algorithm that requests the next pages immediately, so it has to wait for a long time because of the API delay.)
