Python: Normalizing company-name suffixes of a LinkedIn user's connections (data normalization)

CODE:

#!/usr/bin/python 
# -*- coding: utf-8 -*-

'''
Created on 2014-8-19
@author: guaguastd
@name: company_suffix_normalize.py
'''

# import json
import os
import csv
from collections import Counter
from operator import itemgetter
from prettytable import PrettyTable

# specify csv directory
CSV_FILE = os.path.join(r"E:", "\\", "eclipse", "LinkedIn", "dfile", "my_connections.csv")

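# define a list of (old, new) pairs; every occurrence of the first item
# is replaced by the second. Order matters: ', Inc.' must come before
# ', Inc', otherwise 'Acme, Inc.' would be left as 'Acme.'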
transforms = [(', Inc.', ''), (', Inc', ''), (', LLC', ''), (', LLP', ''), (' LLC', ''), (' Inc.', ''), (' Inc', '')]

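# read the exported connections and close the file once it has been parsed
with open(CSV_FILE) as csv_file:
    contacts = list(csv.DictReader(csv_file, delimiter=',', quotechar='"'))

# keep only the contacts that list a non-empty company name
companies = [c['Company'].strip() for c in contacts if c['Company'].strip() != '']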

for i, _ in enumerate(companies):
    for transform in transforms:
        companies[i] = companies[i].replace(*transform)

pt = PrettyTable(field_names=['Company', 'Freq'])
pt.align = 'l'
c = Counter(companies)
# add one row per company, most frequent first
for company, freq in sorted(c.items(), key=itemgetter(1), reverse=True):
    pt.add_row([company, freq])
print(pt)

RESULT:

+---------------------------------------+------+
| Company                               | Freq |
+---------------------------------------+------+
| ??????????                            | 1    |
| ??                                    | 1    |
| SoftTalent Consulting ??????????????? | 1    |
| SJTU                                  | 1    |
| WatchGuard Technologies               | 1    |
| Hebei Meishen Chemical Group CO.,Ltd  | 1    |
| Bloomberg LP                          | 1    |
| DiHao trading Co.,Ltd                 | 1    |
| CET                                   | 1    |
| Pica8                                 | 1    |
| Microsoft                             | 1    |
+---------------------------------------+------+
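
One caveat about the script above: `str.replace` removes a suffix wherever it occurs in the name, so a hypothetical company such as 'Blue Incubator' would become 'Blueubator'. The following is a minimal sketch of an alternative that strips a suffix only at the end of the name. It assumes Python 3; the names `SUFFIXES`, `SUFFIX_RE`, and `normalize_company`, as well as the sample data, are made up for illustration and are not part of the original script.

```python
import re
from collections import Counter

# Hypothetical suffix list; extend it to match your own data.
SUFFIXES = ['Inc.', 'Inc', 'LLC', 'LLP']

# Match an optional comma, some whitespace, and one suffix,
# but only at the very end of the company name.
SUFFIX_RE = re.compile(
    r',?\s+(?:' + '|'.join(re.escape(s) for s in SUFFIXES) + r')$',
    re.IGNORECASE)


def normalize_company(name):
    """Strip one trailing legal suffix from a company name."""
    return SUFFIX_RE.sub('', name.strip())


# Made-up sample names, not taken from the CSV used above.
sample = ['Acme, Inc.', 'Acme Inc', 'Acme LLC', 'Blue Incubator', 'Widgets, LLP']
counts = Counter(normalize_company(n) for n in sample)
for company, freq in counts.most_common():
    print(company, freq)
```

Because the pattern is anchored with `$`, the order of the suffixes no longer affects the result, and names that merely contain ' Inc' in the middle are left untouched.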

