Python POS tagging: part-of-speech tagging for text classification

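This post describes a problem faced by a newcomer who is using Python for text classification. The code includes several preprocessing steps: lowercasing, punctuation removal, tokenization, stop-word removal, and lemmatization. The author ran into trouble with part-of-speech (POS) tagging: tagging with nltk did not give the expected result, and the author is asking for help finding why the POS tags do not appear in the output.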

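I am new to Python and am working on a text classification problem. I put together the code below from various online resources, but it does not do POS tagging. Can someone help me find the line where I am actually going wrong? I do POS tagging in the code, but the tags do not show up in the results. I also tried POS tagging with nltk directly, but that did not work for me either. Any help would be appreciated. Thanks.

import pandas as pd
from collections import defaultdict
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Add the Data using pandas
Corpus = pd.read_csv(r"U:\FAHAD UL HASSAN\Python Code\projectdatacor.csv",encoding='latin-1')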

# Data Pre-processing - This will help in getting better results through the classification algorithms

# Remove blank rows if any, then reset the index so it matches the positional loop below.
Corpus = Corpus.dropna(subset=['description']).reset_index(drop=True)

# Change all the text to lower case. This is required as python interprets 'design' and 'DESIGN' differently
Corpus['description'] = Corpus['description'].str.lower()

# Punctuation Removal (regex=True makes pandas treat the pattern as a regular expression)
Corpus['description'] = Corpus['description'].str.replace(r'[^\w\s]', '', regex=True)

# Tokenization : In this each entry in the corpus will be broken into a list of words
Corpus['description'] = [word_tokenize(entry) for entry in Corpus['description']]

# Remove stop words and non-alphabetic tokens, and perform word lemmatization.
# WordNetLemmatizer requires POS tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
STOPWORDS = set(stopwords.words('english'))
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Initializing WordNetLemmatizer() once, outside the loop
word_Lemmatized = WordNetLemmatizer()

for index, entry in enumerate(Corpus['description']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for Stop words and consider only alphabets
        if word not in STOPWORDS and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'description_final'
    Corpus.loc[index, 'description_final'] = str(Final_words)

print(Corpus['description_final'].head())

These are the results I get. The code does many things, such as tokenization and stop-word removal, but it does not show the POS tags in my results.
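
To see where the tags go, it can help to look at tag_map in isolation. The sketch below rebuilds it with plain string literals, on the assumption that "n", "a", "v" and "r" are the values of wn.NOUN, wn.ADJ, wn.VERB and wn.ADV in NLTK. The sample tags are hand-picked Penn Treebank tags, not output from the data in the question.

```python
from collections import defaultdict

# Assumption: these one-letter codes equal wn.NOUN, wn.ADJ, wn.VERB, wn.ADV.
tag_map = defaultdict(lambda: "n")  # anything unmapped falls back to noun
tag_map["J"] = "a"  # JJ, JJR, JJS  -> adjective
tag_map["V"] = "v"  # VB, VBD, VBG  -> verb
tag_map["R"] = "r"  # RB, RBR, RBS  -> adverb

# Hand-picked Penn Treebank tags, for illustration only.
sample_tags = ["NN", "NNS", "VBD", "VBG", "JJ", "JJR", "RB", "DT", "CD"]

# Only the first letter of each tag is used for the lookup.
mapped = {tag: tag_map[tag[0]] for tag in sample_tags}

for tag, pos in mapped.items():
    print(f"{tag:>4} -> {pos}")
```

In the loop from the question, each tag is reduced to one of these codes, passed to lemmatize, and then discarded, because only word_Final is appended to Final_words. That would explain a description_final column that contains lemmas but no tags.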

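
If the goal is to see the tags in description_final, one option is to store (lemma, tag) pairs rather than bare lemmas. The following is a minimal sketch of that idea, not the code from the question: lemmatize_keep_tags and identity_lemmatize are hypothetical names, the tagged list is hand-written in the (word, tag) shape that pos_tag returns, and the stand-in lemmatizer returns each word unchanged so the block runs without NLTK or its data files.

```python
from collections import defaultdict

# Assumption: these one-letter codes equal wn.NOUN, wn.ADJ, wn.VERB, wn.ADV.
TAG_MAP = defaultdict(lambda: "n")
TAG_MAP.update({"J": "a", "V": "v", "R": "r"})


def lemmatize_keep_tags(tagged_tokens, lemmatize, stopwords=frozenset()):
    """Return (lemma, tag) pairs so the POS tag survives into the result.

    tagged_tokens: iterable of (word, tag) pairs, the shape pos_tag returns.
    lemmatize: callable taking (word, wordnet_pos) and returning a lemma.
    """
    kept = []
    for word, tag in tagged_tokens:
        # Same filter as in the question: skip stop words and non-alphabetic tokens.
        if word in stopwords or not word.isalpha():
            continue
        lemma = lemmatize(word, TAG_MAP[tag[0]])
        kept.append((lemma, tag))  # keep the tag instead of dropping it
    return kept


def identity_lemmatize(word, pos):
    """Hypothetical stand-in for a real lemmatizer; returns the word as-is."""
    return word


# Hand-written example input, not real tagger output.
tagged = [("the", "DT"), ("engineers", "NNS"), ("designed", "VBD"), ("2", "CD")]
result = lemmatize_keep_tags(tagged, identity_lemmatize, stopwords={"the"})
print(result)
```

With NLTK installed, the same function could presumably be called inside the loop as lemmatize_keep_tags(pos_tag(entry), word_Lemmatized.lemmatize, STOPWORDS), with the returned list stored in description_final.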
