python - Extracting the words that differ between two sentences

I have a very large data frame with two columns named sentence1 and sentence2.

I am trying to create a new column containing the words that differ between the two sentences, for example:

sentence1=c("This is sentence one", "This is sentence two", "This is sentence three")

sentence2=c("This is the sentence four", "This is the sentence five", "This is the sentence six")

df = as.data.frame(cbind(sentence1,sentence2))

My data frame has the following structure:

ID  sentence1               sentence2
1   This is sentence one    This is the sentence four
2   This is sentence two    This is the sentence five
3   This is sentence three  This is the sentence six

My expected result is:

ID  sentence1    sentence2    Expected_Result
1   This is ...  This is ...  one the four
2   This is ...  This is ...  two the five
3   This is ...  This is ...  three the six

In R, I tried splitting the sentences and then taking the elements that differ between the resulting lists, for example:

df$split_Sentence1<-strsplit(df$sentence1, split=" ")

df$split_Sentence2<-strsplit(df$sentence2, split=" ")

df$Dif<-setdiff(df$split_Sentence1, df$split_Sentence2)

But this approach does not work when setdiff is applied...

In Python, I tried NLTK: first getting the tokens and then extracting the difference between the two lists, for example:

from nltk.tokenize import word_tokenize

df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))

df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))

At this point, I have not found a function that gives me the result I need.

I hope you can help me. Thanks.

Best answer: Here is an R solution.

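I created an exclusiveWords function that finds the words appearing in only one of the two sets and returns a "sentence" made up of those words. I wrapped it in Vectorize() so that it can process all rows of the data.frame at once.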

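# keep the sentences as character strings rather than factors
df = as.data.frame(cbind(sentence1,sentence2), stringsAsFactors = FALSE)

exclusiveWords <- function(x, y){
  # split each sentence into its words
  x <- strsplit(x, " ")[[1]]
  y <- strsplit(y, " ")[[1]]
  # all distinct words from both sentences
  u <- union(x, y)
  # words missing from x (only in y), then words missing from y (only in x)
  u <- union(setdiff(u, x), setdiff(u, y))
  return(paste0(u, collapse = " "))
}

# vectorize so the function works row by row on whole columns
exclusiveWords <- Vectorize(exclusiveWords)

df$result <- exclusiveWords(df$sentence1, df$sentence2)

df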

#                sentence1                 sentence2        result
# 1   This is sentence one This is the sentence four  the four one
# 2   This is sentence two This is the sentence five  the five two
# 3 This is sentence three This is the sentence six  the six three
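
Since the question also asks about Python, the same idea can be sketched there. This is only a minimal sketch, not part of the original answer: the function name exclusive_words is made up for illustration, it splits on whitespace instead of using NLTK, and it lists the words unique to the first sentence before those unique to the second, which matches the order in the question's expected result rather than the order the R code produces.

```python
def exclusive_words(s1, s2):
    """Return the words found in only one of the two sentences, joined by spaces.

    Hypothetical helper for illustration; splits on whitespace only.
    """
    w1 = s1.split()
    w2 = s2.split()
    set1, set2 = set(w1), set(w2)
    # dict.fromkeys drops repeated words while keeping their first-seen order
    only1 = [w for w in dict.fromkeys(w1) if w not in set2]
    only2 = [w for w in dict.fromkeys(w2) if w not in set1]
    return " ".join(only1 + only2)


sentence1 = ["This is sentence one", "This is sentence two", "This is sentence three"]
sentence2 = ["This is the sentence four", "This is the sentence five", "This is the sentence six"]

# Apply the function pair by pair, as Vectorize() does for the rows in R
results = [exclusive_words(a, b) for a, b in zip(sentence1, sentence2)]
print(results)
# ['one the four', 'two the five', 'three the six']
```

With a pandas data frame, the same list comprehension over zip(df["sentence1"], df["sentence2"]) could be assigned to a new column. The comparison is case-sensitive and treats punctuation as part of a word, so real text may need normalizing first.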
