Entity Resolution on the Way to Becoming a Data Scientist

米老鼠与刘老根

Article link: https://blog.csdn.net/qq_37623085/article/details/83963617

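I did this almost entirely from the assignment statement and my own understanding. I did not consult the reference answers, and the answers are not complete either.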

Project requirements:

Completeness of the content
Usability (workable, easy to operate, good-looking)
Timing (earlier or later)

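Defining the project goal:

(Entity identification)
We have a file of product records from Amazon (Amazon.csv) and a file of product records from Google's database (Google.csv). How do we match the records in the two files to each other?
Original problem and answers (not consulted for this post):
http://nbviewer.ipython.org/github/biddata/datascience/blob/master/F14/hw1/hw1.ipynb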

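Getting the data:

We use the Amazon and Google product record files provided with the assignment.
First we inspect the data.

Both datasets have the same layout: the first column is the id that distinguishes each entity, and the remaining four columns hold the information we will use for entity resolution.
Looking more closely at those four columns, the words are separated by many different delimiters, such as : , / ( - ).

Splitting the text into useful tokens:

Following the assignment, we split each string on spaces, commas and other delimiters, and use stopwords.txt to remove words such as "is" and "of" that carry no meaning for entity resolution. The requirement is simple, so the tokenize function can be written directly (its final version appears in the code listing at the end of this post).
Because natural-language text is irregular, we also discard strings of length 1 or less (the code keeps only tokens with len(c)>1).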
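
The splitting rule described above can be sketched as follows. This is a minimal illustration, not the post's exact function: the name simple_tokenize and the small STOPWORDS set are assumptions made for the example (the post reads its stopwords from stopwords.txt).

```python
import re

# Hypothetical stopword set for illustration; the post loads its list from stopwords.txt.
STOPWORDS = {"is", "of", "the", "for", "and"}

# '-' is placed last inside the character class so that it is a literal hyphen.
SPLIT_PATTERN = re.compile(r'[ ,;:/.&"!()-]')


def simple_tokenize(text, stopwords=STOPWORDS):
    """Split text on delimiters, then drop stopwords and tokens shorter than 2 characters."""
    tokens = []
    for piece in SPLIT_PATTERN.split(text):
        if len(piece) > 1 and piece not in stopwords:
            tokens.append(piece)
    return tokens


print(simple_tokenize('adobe photoshop (full-version) for mac/windows, v 7'))
# ['adobe', 'photoshop', 'full', 'version', 'mac', 'windows']
```

Consecutive delimiters produce empty strings when splitting, which is why the length filter is needed in addition to the stopword filter.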

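Computing TF-IDF:

With the tokenize function above we can build a dictionary of the form {id1:[tokens1],id2:[tokens2],…}, which we call all_tokens (built by getAllTokens in the listing at the end of this post).
From it we can compute the TF-IDF of every token. To do that we first need TF and IDF. In the code, TF is the token's count divided by the number of tokens in the record, and IDF is the number of records divided by the number of records that contain the token (no logarithm is applied). Following these definitions we can write one function for each (both are slow in this first version; an inverted index is used later to speed them up).

Multiplying TF by IDF gives a token's TF-IDF. Doing this for every token under every id yields a dictionary of the form {id1:{token1:TF_IDF(token1),token2:…},id2:…}, which we call all_TF_IDF (see getTF_IDF in the listing).
This gives us the TF-IDF of every token.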
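
To make the definitions concrete, here is a small sketch on a made-up two-record dataset. The function names and the toy records are assumptions for illustration; the IDF used is records / records-containing-the-token, without a logarithm, matching the listing at the end of this post.

```python
def term_frequencies(tokens):
    """TF: count of each token divided by the number of tokens in the record."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


def inverse_document_frequencies(all_tokens):
    """IDF: number of records divided by the number of records containing the token."""
    n_records = len(all_tokens)
    record_counts = {}
    for tokens in all_tokens.values():
        for token in set(tokens):  # count each record once per token
            record_counts[token] = record_counts.get(token, 0) + 1
    return {token: n_records / count for token, count in record_counts.items()}


def tf_idf(all_tokens):
    """Return {record_id: {token: TF * IDF}} for every record."""
    idf = inverse_document_frequencies(all_tokens)
    return {
        record_id: {token: tf * idf[token]
                    for token, tf in term_frequencies(tokens).items()}
        for record_id, tokens in all_tokens.items()
    }


# Toy data, invented for this example.
toy = {"a1": ["adobe", "photoshop", "adobe"], "a2": ["adobe", "acrobat"]}
weights = tf_idf(toy)
print(weights["a2"])  # {'adobe': 0.5, 'acrobat': 1.0}
```

In the toy data "adobe" occurs in both records, so its IDF is 1 and it is weighted less than "acrobat", which occurs in only one.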

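Computing cosine similarity:

As the assignment requires, each token is one dimension, and the value along that dimension is the token's TF-IDF. This lets us compute the angle between two entities.
Two entities have some tokens in common and some that are not, so the dimensions must be the union of their tokens. Python's set type is the natural tool here: the union gives the dimensions of the vectors, and assigning each dimension its TF-IDF value (0 when the entity lacks the token) gives the vectors themselves.
The angle is computed by calcCos in the listing at the end of this post.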
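
A worked example may help here. The sketch below computes the cosine over the union of two token sets using only the standard library. The function name and the two weight dictionaries are made up for illustration, and returning 0.0 for an empty vector is a choice made for this sketch (the post's calcCos uses -1 as its sentinel).

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Cosine of the angle between two sparse {token: weight} vectors."""
    dims = set(weights_a) | set(weights_b)  # union of tokens = vector dimensions
    dot = sum(weights_a.get(d, 0.0) * weights_b.get(d, 0.0) for d in dims)
    norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
    norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption for this sketch: an empty vector matches nothing
    return dot / (norm_a * norm_b)


a = {"adobe": 1.0, "photoshop": 2.0}
b = {"adobe": 1.0, "acrobat": 2.0}
# Only "adobe" is shared: dot = 1, both norms are sqrt(5), so the cosine is 1/5.
print(cosine_similarity(a, b))
```

Tokens that occur in only one of the two records contribute nothing to the dot product, but they still increase that record's norm and therefore lower the similarity.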

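Performing entity resolution:

Using the TF-IDF dictionaries for Google and Amazon, we look for the best-matching Google record for every Amazon row. The idea is simple: for each Amazon entity, scan all Google entities for the one with the largest cosine, then compare that cosine with a threshold and discard the match if it does not exceed the threshold (see mapping in the listing at the end of this post).
The matches that remain are our entity resolution result.
We can write them to a csv file (outputMap in the listing).

We then compute the accuracy to evaluate the model. The idea is: accuracy = number of our matches that appear in the perfect mapping / total number of our matches (compare in the listing). This ratio is what is usually called precision.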
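
The matching and scoring steps can be sketched together on toy data. Everything named here (best_matches, precision, the amazon/google/gold values) is hypothetical and only illustrates the procedure described above; it is not the post's code.

```python
import math


def cosine(u, v):
    """Cosine between two sparse {token: weight} vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_matches(left, right, threshold):
    """For each left record keep the most similar right record if it beats the threshold."""
    matches = {}
    for left_id, left_vec in left.items():
        best_id, best_score = None, -1.0
        for right_id, right_vec in right.items():
            score = cosine(left_vec, right_vec)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score > threshold:
            matches[left_id] = best_id
    return matches


def precision(matches, gold_pairs):
    """Share of proposed (left_id, right_id) pairs that appear in the gold mapping."""
    if not matches:
        return 0.0
    correct = sum(1 for pair in matches.items() if pair in gold_pairs)
    return correct / len(matches)


# Toy TF-IDF vectors and gold mapping, invented for this example.
amazon = {"a1": {"adobe": 1.0, "photoshop": 2.0},
          "a2": {"norton": 2.0, "antivirus": 1.0}}
google = {"g1": {"adobe": 1.0, "photoshop": 1.5},
          "g2": {"norton": 1.0, "utilities": 2.0}}
gold = {("a1", "g1")}

loose = best_matches(amazon, google, 0.125)
strict = best_matches(amazon, google, 0.8)
print(len(loose), precision(loose, gold))    # 2 0.5
print(len(strict), precision(strict, gold))  # 1 1.0
```

Even on this tiny example a higher threshold keeps fewer matches but a larger share of them is correct.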

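Optimizing the algorithm:

The plan sounds fine, but computing IDF this way is so slow that the program never even reached the matching step, let alone produced a result.
So, following the hint in the assignment, we use an inverted index: a dictionary of the form {token:[id1,id2],…} that records, for each token, the ids of all entities containing it. The index can be built inside tokenize, so we modify tokenize accordingly.
The code that computes IDF and TF-IDF has to change as well.

Now the IDF of each token can be computed quickly: instead of scanning every record's token list for each token, we simply take the length of the token's entry in the index. This reduces the cost of computing IDF from roughly O(n^2) to O(n). Now we can run the entity resolution.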
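
The difference between the two ways of computing IDF can be sketched like this. The function names and toy records are assumptions for illustration; both functions use the same records / records-containing-the-token definition as the listing at the end of this post.

```python
def build_inverted_index(all_tokens):
    """Map each token to the list of record ids that contain it (each id once)."""
    index = {}
    for record_id, tokens in all_tokens.items():
        for token in set(tokens):
            index.setdefault(token, []).append(record_id)
    return index


def idf_naive(token, all_tokens):
    """Scan every record's token list to count the records containing the token."""
    containing = sum(1 for tokens in all_tokens.values() if token in tokens)
    return len(all_tokens) / containing


def idf_indexed(token, all_tokens, index):
    """Read the count directly from the inverted index."""
    return len(all_tokens) / len(index[token])


# Toy data, invented for this example.
toy = {"a1": ["adobe", "photoshop", "adobe"],
       "a2": ["adobe", "acrobat"],
       "a3": ["norton", "antivirus"]}
index = build_inverted_index(toy)
print(index["adobe"])                      # ['a1', 'a2']
print(idf_indexed("adobe", toy, index))    # 1.5
```

The naive version rescans the whole dataset for every token, while the indexed version pays for one pass up front and then answers each lookup with a dictionary access.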
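
Tuning the threshold:

First we set the threshold to 0.
This yields 1363 matches with an accuracy of 33.82%. Given that the best achievable accuracy is only 64%, this result is not bad.
Next we set the threshold to 1/8.
This yields 941 matches, and the accuracy rises to 41.02%.
With the threshold at 0.8, only 17 matches remain, but the accuracy reaches 52.94%.
Finally we move to the threshold that gives the highest precision, 0.93.
The accuracy reaches 100%, but strikingly only 1 match is left, which is practically worthless in a real application.
Below is a sample of the matches (threshold 0.125):
[Figure: sample of the matches at threshold 0.125]
Compared with the ideal mapping:
[Figure: the corresponding entries of the ideal mapping]
A good many of the matches are correct, so in practice the threshold does not need to be set very high. Overly strict conditions drastically reduce the number of results.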
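
The trade-off can be explored with a small threshold sweep. The sketch below assumes the best match and its cosine have already been computed for each record; the scored list, the gold set and the function name are made up for illustration and are not the post's data or results.

```python
def sweep_thresholds(scored_matches, gold_pairs, thresholds):
    """Return (threshold, matches kept, precision) for each threshold.

    scored_matches: list of (left_id, right_id, cosine) for each record's best match.
    Precision is None when no match survives the threshold.
    """
    rows = []
    for threshold in thresholds:
        kept = [(a, g) for a, g, score in scored_matches if score > threshold]
        correct = sum(1 for pair in kept if pair in gold_pairs)
        rows.append((threshold, len(kept), correct / len(kept) if kept else None))
    return rows


# Hypothetical best-match scores and gold mapping, invented for this example.
scored = [("a1", "g1", 0.95), ("a2", "g2", 0.40),
          ("a3", "g9", 0.15), ("a4", "g4", 0.05)]
gold = {("a1", "g1"), ("a2", "g2"), ("a4", "g4")}

rows = sweep_thresholds(scored, gold, [0, 0.125, 0.8])
for threshold, kept, prec in rows:
    print(threshold, kept, prec)
```

Printing the number of kept matches next to the precision makes it easy to see at which point a stricter threshold stops being worth the lost results.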
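Finally, here is a word cloud built from all the tokens of both datasets (it shows that both datasets consist of products closely related to software):
[Figure: word cloud of all tokens in the two datasets]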

I won't post the code; it is a bit messy.

———————————————2019.3.31————————————————
A semester has passed, so here is the code after all. It is for reference only; please do not copy it.

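import csv
import re
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import PIL.Image
THRESHOLD=0.125#threshold (original note: 7.82-8.3)

_stopwords=None#cache, so stopwords.txt is read only once

def tokenize(string,rindex,iid):#split a string into tokens and update the inverted index
    global _stopwords
    if _stopwords is None:
        with open('stopwords.txt','r') as f:
            _stopwords=set(f.read().split())
    result=[]
    #'-' goes last in the class so it is a literal; the original '!-(' was parsed as a range
    for c in re.split(r'[ ,;:/.&"!()-]',string):
        if c not in _stopwords and len(c)>1:
            result.append(c)
            ids=rindex.setdefault(c,[])
            if not ids or ids[-1]!=iid:#record each id once, even if the token repeats in the row
                ids.append(iid)
    return result

def getTF(token,tokens):#TF(token): share of this row's tokens that equal token
    return tokens.count(token)/len(tokens)

def getIDF(token,all_tokens,r_index):#IDF(token): number of rows / rows containing token (no log)
    tokensNum=len(all_tokens)
    count=len(r_index[token])
    return tokensNum/count

def getAllTokens(filename):
    #returns the tokens of every row {id1:[tokens],id2:[...],...} and the inverted index {token:[id1,id2],...}
    all_tokens={}
    rindex={}
    with open(filename,'r',newline='') as fo:
        reader=csv.reader(fo)#handles quoted fields that contain commas
        next(reader,None)#skip the header row
        for data in reader:
            if len(data)<5:
                continue#skip blank or malformed rows
            iid=data[0]
            info=' '.join(data[1:5])
            tokens=tokenize(info,rindex,iid)
            all_tokens[iid]=tokens
    return all_tokens,rindex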
    
def getTF_IDF(all_tokens,r_index):#all_tokens={id1:[tokens],...}
    #compute TF and IDF for every token of every row and multiply them
    all_TF_IDF={}
    for iid,tokens in all_tokens.items():
        TF_IDF={}
        for token in tokens:
            tf=getTF(token,tokens)
            idf=getIDF(token,all_tokens,r_index)
            TF_IDF[token]=tf*idf
        all_TF_IDF[iid]=TF_IDF
    return all_TF_IDF#{id1:{token1:TF_IDF(token1),token2:...},id2:...}

def calcCos(tif1,tif2):#arguments have the form {token1:TF_IDF(token1),token2:...}
    dims=set(tif1.keys())|set(tif2.keys())
    if len(dims)<=0:
        return -1
    #one dimension per token in the union; a missing token contributes 0
    vec1=np.array([tif1.get(dim,0) for dim in dims],dtype=float)
    vec2=np.array([tif2.get(dim,0) for dim in dims],dtype=float)
    denom=np.linalg.norm(vec1)*np.linalg.norm(vec2)
    if denom==0:
        return -1#one side has no tokens, so the angle is undefined
    return float(np.dot(vec1,vec2)/denom)

def judge(ati,gti,grindex):
    #optional pre-filter: True if more than 10% of the Amazon row's tokens occur in the Google row
    gid=gti[0]
    count=0
    for token in ati[1].keys():
        if gid in grindex.get(token,[]):
            count+=1
    return count>len(ati[1])*0.1


def mapping(atif,gtif,threshold,gr_index):#Amazon and Google TF-IDF dicts; threshold; Google inverted index
    mapp={}
    for aid,atfidf in atif.items():
        best_cos,best_gid=-1,-1
        for gid,gtfidf in gtif.items():
            #pre-filter left disabled, as in the original:
            #if not judge((aid,atfidf),(gid,gtfidf),gr_index):
            #    continue
            cos=calcCos(atfidf,gtfidf)
            if cos>best_cos:
                best_cos,best_gid=cos,gid
        if best_cos>threshold:
            mapp[aid]=best_gid
    return mapp

def outputMap(mapp):
    with open('myMapping.csv','w') as fw:
        for row in mapp.items():
            fw.write(','.join(row)+'\n')

def compare(test,perfect):
    #share of lines in test that also appear in perfect, ignoring quotes and line endings
    with open(test,'r') as fo1:
        ls1=[line.replace('"','').strip() for line in fo1]
    with open(perfect,'r') as fo2:
        ls2={line.replace('"','').strip() for line in fo2}
    if not ls1:
        return 0.0#no matches were produced
    count=sum(1 for line in ls1 if line in ls2)
    return count/len(ls1)
def wordcloudplot(txt):
    path = r'C:\Windows\Fonts\FZSTK.TTF'
    alice_mask = np.array(PIL.Image.open('bg.jpg'))
    wordcloud = WordCloud(font_path=path,
                          background_color="white",
                          margin=5, width=1800, height=800, mask=alice_mask, max_words=80, max_font_size=60,
                          random_state=42).generate(txt)
    wordcloud.to_file('dora.jpg')
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
        
def main():    
    at,aridx=getAllTokens('Amazon_small.csv')
    print('amazon tokenize')
    atif=getTF_IDF(at,aridx)
    print('amazon finish')
    gt,gridx=getAllTokens('Google_small.csv')
    print('google tokenized')
    gtif=getTF_IDF(gt,gridx)
    print('start mapping')
    mapp=mapping(atif,gtif,THRESHOLD,gridx)
    print(THRESHOLD,len(mapp.items()))
    outputMap(mapp)
    print('Output Finished')
    com=compare('myMapping.csv','Amazon_Google_perfectMapping.csv')
    print('correct rate:',com)
def draw():
    at,aridx=getAllTokens('Amazon.csv')
    gt,gridx=getAllTokens('Google.csv')
    als=at.values()
    gls=gt.values()
    txt=[]
    for i in als:
        txt+=i
    for i in gls:
        txt+=i
    words=' '.join(txt)
    wordcloudplot(words)
    
if __name__=='__main__':
    main()