Task 1: For the CDIAL-BIAS-race dataset, use the jieba segmentation tool to segment the file. Output: the segmented file.
Task 2: Using the segmented file, count the frequencies of the sensitive words (race.txt) that may carry racial or cultural bias. Output format: one word and its frequency per line, sorted by frequency in ascending order, e.g.:
#种族歧视:188
#黑人:339
…
Task 3: Using the unsegmented file, count the frequencies of the same sensitive words (race.txt).
Task 4: Are the results of Task 2 and Task 3 the same? If not, find the causes of the inconsistencies and all the corresponding sentences.
For Task 1, download jieba from GitHub; a single function call is enough.
# encoding=utf-8
import jieba

# Read the raw corpus. Raw strings (r"...") avoid backslash-escape
# problems in Windows paths.
with open(r"D:\personal\Desktop\Code\实验1\CDIAL-BIAS-race.txt", "r", encoding="utf-8") as inf:
    text = inf.read()

# Precise-mode segmentation; lcut returns a list of tokens.
words = jieba.lcut(text)

# Write the segmented text as tab-separated tokens.
with open(r"D:\personal\Desktop\Code\实验1\分词.txt", "w", encoding="utf-8") as f:
    f.write("\t".join(words))
Task 2: count the word frequencies.
# encoding=utf-8

# Load the sensitive-word list, one word per line; skip empty lines.
with open(r"D:\personal\Desktop\Code\实验1\race.txt", "r", encoding="utf-8") as inf:
    sensitive_words = [w for w in inf.read().split("\n") if w != ""]

# Load the segmented corpus produced in Task 1 (tab-separated tokens).
with open(r"D:\personal\Desktop\Code\实验1\分词.txt", "r", encoding="utf-8") as inf:
    tokens = inf.read().split("\t")

# Count exact token matches for each sensitive word.
# ("count" instead of "sum", which would shadow the built-in.)
results = []
for word in sensitive_words:
    count = 0
    for token in tokens:
        if word == token:
            count += 1
    results.append((word, count))

# The task requires ascending order by frequency.
results.sort(key=lambda pair: pair[1])

with open(r"D:\personal\Desktop\Code\实验1\词频统计.txt", "w", encoding="utf-8") as f:
    for word, count in results:
        f.write("#" + word + ": " + str(count) + "\n")
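An equivalent, shorter way to get the token-level counts is collections.Counter, which tallies every token in one pass. A minimal sketch; the token list and word list below are made-up stand-ins for the real files:

```python
from collections import Counter

# Hypothetical inputs standing in for 分词.txt and race.txt.
tokens = ["黑人", "是", "人", "黑人", "种族歧视"]   # segmented corpus tokens
sensitive_words = ["种族歧视", "黑人"]              # lines of race.txt

freq = Counter(tokens)  # token -> frequency, computed in one pass

# Keep only the sensitive words, sorted ascending as the task requires.
results = sorted(((w, freq[w]) for w in sensitive_words), key=lambda p: p[1])
for word, count in results:
    print("#" + word + ": " + str(count))
# prints:
# #种族歧视: 1
# #黑人: 2
```

Counter avoids the nested loop over all tokens for every word, which matters once the word list grows.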
The code for Task 3 is almost the same as Task 2.
# encoding=utf-8

# Load the sensitive-word list, one word per line; skip empty lines.
with open(r"D:\personal\Desktop\Code\实验1\race.txt", "r", encoding="utf-8") as inf:
    sensitive_words = [w for w in inf.read().split("\n") if w != ""]

# Load the raw (unsegmented) corpus.
with open(r"D:\personal\Desktop\Code\实验1\CDIAL-BIAS-race.txt", "r", encoding="utf-8") as inf:
    text = inf.read()

# Count substring occurrences by sliding a window over the raw text,
# so overlapping matches are also counted.
with open(r"D:\personal\Desktop\Code\实验1\分词前的词频统计.txt", "w", encoding="utf-8") as f:
    for word in sensitive_words:
        length = len(word)
        count = 0
        for i in range(len(text) - length + 1):
            if word == text[i:i + length]:
                count += 1
        f.write("#" + word + ": " + str(count) + "\n")
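The sliding-window comparison used above counts overlapping occurrences, which Python's built-in str.count does not; the two can disagree. A self-contained illustration (the sample string is made up):

```python
def count_overlapping(text, word):
    """Count occurrences of word in text, including overlapping ones."""
    length = len(word)
    return sum(1 for i in range(len(text) - length + 1)
               if text[i:i + length] == word)

sample = "aaaa"
print(sample.count("aa"))               # 2: str.count skips past each match
print(count_overlapping(sample, "aa"))  # 3: overlapping matches are counted
```

For typical sensitive words the two counts coincide, but the window version matches the loop in the code above exactly.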
Task 4: find the inconsistent sentences and analyze the cause.
# encoding=utf-8
import jieba

# Load the corpus line by line so mismatching sentences can be printed.
with open(r"D:\personal\Desktop\Code\实验1\CDIAL-BIAS-race.txt", "r", encoding="utf-8") as inf:
    sentences = inf.read().splitlines()

# Load the sensitive-word list, one word per line; skip empty lines.
with open(r"D:\personal\Desktop\Code\实验1\race.txt", "r", encoding="utf-8") as inf:
    sensitive_words = [w for w in inf.read().split("\n") if w != ""]

for sentence in sentences:
    seg_list = jieba.lcut(sentence, cut_all=False)  # precise mode
    for word in sensitive_words:
        length = len(word)
        # Substring count on the raw sentence (as in Task 3).
        raw_count = 0
        for i in range(len(sentence) - length + 1):
            if word == sentence[i:i + length]:
                raw_count += 1
        # Exact token count on the segmented sentence (as in Task 2).
        seg_count = seg_list.count(word)
        if raw_count != seg_count:
            print("[original sentence]: " + sentence)
            print("substring count:\t" + str(raw_count))
            print("[precise mode] " + "/ ".join(seg_list))
            print("token count:\t" + str(seg_count))
The discrepancies arise mainly from pairs such as 东南亚人/东南亚 and 黑人/美国黑人: when jieba emits the longer word as a single token, the shorter sensitive word contained in it no longer matches at the token level, while the raw substring count of Task 3 still includes it.
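The mechanism can be shown without running jieba at all. If the segmenter emits 美国黑人 as one token (a hypothetical segmentation used only for illustration), the token-level count for 黑人 drops below the raw substring count:

```python
sentence = "美国黑人和黑人音乐"
# Hypothetical segmentation standing in for jieba's output.
tokens = ["美国黑人", "和", "黑人", "音乐"]

raw_count = sentence.count("黑人")   # substring match also hits inside 美国黑人
token_count = tokens.count("黑人")   # exact token match misses it

print(raw_count, token_count)  # 2 1
```

This is exactly the pattern Task 4 detects: every sentence where the two counts differ contains a sensitive word embedded in a longer segmented token.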