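If we can measure the overlap between two texts, we get a good estimate of how similar their word usage is, which in turn is a good estimate of how much they overlap in meaning.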
import numpy as np
import pandas as pd
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""
corpus = {}
# Build one binary bag-of-words dict per sentence: token -> 1
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())
# pd.DataFrame.from_records() is designed for creating a DataFrame from tuples and dicts
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
print(df)
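# Transpose back so that each sentence is a column (a binary token vector)
df = df.T
# The dot product of two binary vectors counts the tokens the sentences share
print(df.sent0.dot(df.sent1))
print(df.sent0.dot(df.sent2))
print(df.sent0.dot(df.sent3))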
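
Because every vector entry is 0 or 1, each dot product above is just the number of distinct tokens two sentences have in common. The following is a minimal sketch of the same overlap computed with plain Python sets, assuming the same whitespace tokenization as above; the helper name `shared_tokens` is hypothetical and not part of the original code.

```python
def shared_tokens(a, b):
    """Return the set of whitespace-delimited tokens found in both strings."""
    return set(a.split()) & set(b.split())


sent0 = "Thomas Jefferson began building Monticello at the age of 26."
sent2 = "He moved into the South Pavilion in 1770."
sent3 = "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."

# The size of each set matches the corresponding dot product of binary vectors.
overlap_02 = shared_tokens(sent0, sent2)
overlap_03 = shared_tokens(sent0, sent3)
print(overlap_02, len(overlap_02))
print(overlap_03, len(overlap_03))
```

Listing the shared tokens themselves shows why a raw overlap count is only a rough proxy for meaning: a function word such as "the" counts as much as a content word such as "Monticello".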