Entity resolution is a common yet difficult problem in data cleaning and data integration. Here we will show how to use Spark to apply powerful, scalable text-analysis techniques and perform entity resolution across datasets. "Entity resolution" describes the process of combining records from different data sources that refer to the same entity; other common names for it include entity linking, duplicate detection, record matching, object identification, and data fusion. In short, it means finding the records in a dataset that describe the same entity across different data sources (e.g., data files, books, websites, databases).
Here we will work with records from two different databases. The Amazon records have the schema:
"id","title","description","manufacturer","price"
The Google records have the schema:
"id","name","description","manufacturer","price"
Each line can be parsed with a regular expression: group 1 captures the quoted ID, and groups 2-4 (title/name, description, manufacturer) are concatenated into a single text string per product. Lines that fail to parse are tagged -1, and the header line is tagged 0:

import re

DATAFILE_PATTERN = '^(.+),"(.+)",(.*),(.*),(.*)'

def removeQuotes(s):
    """Strip all double-quote characters from a string."""
    return ''.join(i for i in s if i != '"')

def parseDatafileLine(datafileLine):
    """Parse a line into ((id, 'title description manufacturer'), 1),
    or tag it as the header (0) or an invalid line (-1)."""
    match = re.search(DATAFILE_PATTERN, datafileLine)
    if match is None:
        print('Invalid datafile line: %s' % datafileLine)
        return (datafileLine, -1)
    elif match.group(1) == '"id"':
        print('Header datafile line: %s' % datafileLine)
        return (datafileLine, 0)
    else:
        # Concatenate title/name, description, and manufacturer into one text field.
        product = '%s %s %s' % (match.group(2), match.group(3), match.group(4))
        return ((removeQuotes(match.group(1)), product), 1)
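As a quick sanity check, the parser can be exercised on a couple of hand-written lines (the inputs below are hypothetical examples, not records from the actual datasets):

# Hypothetical inputs, for illustration only.
print(parseDatafileLine('"id","title","description","manufacturer","price"'))
# prints 'Header datafile line: ...' and returns (<the line>, 0)
print(parseDatafileLine('"b001","some product",a short description,acme,9.99'))
# returns (('b001', 'some product a short description acme'), 1)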
Now load the data files. First, set up the paths:
import sys
import os
from databricks_test_helper import Test
data_dir = os.path.join('databricks-datasets', 'cs100', 'lab3', 'data-001')
GOOGLE_PATH = 'Google.csv'
GOOGLE_SMALL_PATH = 'Google_small.csv'
AMAZON_PATH = 'Amazon.csv'
AMAZON_SMALL_PATH = 'Amazon_small.csv'
GOLD_STANDARD_PATH = 'Amazon_Google_perfectMapping.csv'
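With the paths defined, each file can be read into an RDD and run through parseDatafileLine. The helper below is a minimal sketch, assuming a SparkContext sc is available (as it is in a Databricks notebook); loadData is our own name for this convenience wrapper, which drops the header line and reports malformed lines using the status codes returned above:

def loadData(path):
    # Read the raw file and tag every line via parseDatafileLine.
    filename = os.path.join(data_dir, path)
    raw = sc.textFile(filename).map(parseDatafileLine).cache()
    # Report a few malformed lines (tagged -1), if any.
    for line, _ in raw.filter(lambda x: x[1] == -1).take(10):
        print('Failed to parse: %s' % line)
    # Keep only the successfully parsed (id, text) pairs (tagged 1).
    return raw.filter(lambda x: x[1] == 1).map(lambda x: x[0]).cache()

googleSmall = loadData(GOOGLE_SMALL_PATH)
amazonSmall = loadData(AMAZON_SMALL_PATH)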