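I have a very large data frame with two columns named sentence1 and sentence2.
I'm trying to create a new column containing the words that differ between the two sentences, for example: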
sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")
sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")
df = as.data.frame(cbind(sentence1,sentence2))
My data frame has the following structure:
ID sentence1 sentence2
1 This is sentence one This is the sentence four
2 This is sentence two This is the sentence five
3 This is sentence three This is the sentence six
My expected result is:
ID sentence1 sentence2 Expected_Result
1 This is ... This is ... one the four
2 This is ... This is ... two the five
3 This is ... This is ... three the six
In R, I tried splitting the sentences and then getting the elements that differ between the resulting lists, for example:
df$split_Sentence1
df$split_Sentence2
df$Dif
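But this approach doesn't work when I apply setdiff...
In Python, I tried using NLTK, first getting the tokens and then extracting the difference between the two lists, like this: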
from nltk.tokenize import word_tokenize
df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))
At this point, I haven't found a function that gives me the result I need.
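To make the expected column concrete, here is a rough sketch of the logic I have in mind, in plain Python. It assumes simple whitespace splitting instead of NLTK's word_tokenize, treats words as case-sensitive, and the helper name word_diff is just something I made up for illustration. It may well not be the best way to do this on a very large data frame.

```python
def word_diff(s1, s2):
    """Hypothetical helper: words only in s1, then words only in s2, in original order."""
    w1, w2 = s1.split(), s2.split()   # assumption: whitespace tokenization
    set1, set2 = set(w1), set(w2)     # sets give fast membership checks
    only1 = [w for w in w1 if w not in set2]
    only2 = [w for w in w2 if w not in set1]
    return " ".join(only1 + only2)


sentence1 = ["This is sentence one", "This is sentence two", "This is sentence three"]
sentence2 = ["This is the sentence four", "This is the sentence five", "This is the sentence six"]

# Apply the helper row by row, pairing the two columns.
expected_result = [word_diff(a, b) for a, b in zip(sentence1, sentence2)]
print(expected_result)
```

With a pandas data frame, I suppose the same helper could be applied as df['Expected_Result'] = [word_diff(a, b) for a, b in zip(df.sentence1, df.sentence2)].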
I hope you can help me. Thanks.