Preprocessing

clean_context

I substitute some special symbols using regular expressions and then split the string on a set of predefined delimiters to produce tokens.

Parameters

input: a string containing the raw context.
output: a list of tokens (strings).

Example

input: “Even though supervised ones tend to perform best in terms of accuracy, they often lose ground to more flexible knowledge-based solutions, which do not require training by a word expert for every disambiguation target.”
output: [‘even’, ‘though’, ‘supervised’, ‘ones’, ‘tend’, ‘to’, ‘perform’, ‘best’, ‘in’, ‘terms’, ‘of’, ‘accuracy’, ‘they’, ‘often’, ‘lose’, ‘ground’, ‘to’, ‘more’, ‘flexible’, ‘knowledgebased’, ‘solutions’, ‘which’, ‘do’, ‘not’, ‘require’, ‘training’, ‘by’, ‘a’, ‘word’, ‘expert’, ‘for’, ‘every’, ‘disambiguation’, ‘target’]

import re

def clean_context(ctx_in, has_target=False):
    # has_target is kept in the signature but not used in this function.
    # Patterns for symbols to replace or remove.
    replace_newline = re.compile(r"\n")
    replace_dot = re.compile(r"\.")
    replace_cite = re.compile(r"'")
    replace_frac = re.compile(r"[\d]*frac[\d]+")
    replace_num = re.compile(r"\s\d+\s")
    rm_context_tag = re.compile(r'<.{0,1}context>')
    rm_cit_tag = re.compile(r'\[[eb]quo\]')
    rm_misc = re.compile(r"[\[\]\$`()%/,\.:;-]")

    ctx = replace_newline.sub(' ', ctx_in)  # (' <eop> ', ctx)

    ctx = replace_dot.sub(' ', ctx)   # .sub(' <eos> ', ctx)
    ctx = replace_cite.sub(' ', ctx)  # .sub(' <cite> ', ctx)
    ctx = replace_frac.sub(' <frac> ', ctx)    # fraction-like patterns -> placeholder
    ctx = replace_num.sub(' <number> ', ctx)   # standalone numbers -> placeholder
    ctx = rm_cit_tag.sub(' ', ctx)
    ctx = rm_context_tag.sub('', ctx)
    ctx = rm_misc.sub('', ctx)

    # Lower-case, split on the predefined delimiters, and drop empty strings.
    word_list = [word for word in re.split(r'`|, | +|\? |! |: |; |\(|\)|_|,|\.|"|“|”|\'', ctx.lower()) if word]
    return word_list
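
As a quick sanity check (not part of the original pipeline), the function can be run on the example sentence above:

sentence = ("Even though supervised ones tend to perform best in terms of accuracy, "
            "they often lose ground to more flexible knowledge-based solutions, which do not "
            "require training by a word expert for every disambiguation target.")
tokens = clean_context(sentence)
print(tokens[:5])                   # ['even', 'though', 'supervised', 'ones', 'tend']
print('knowledgebased' in tokens)   # True: the hyphen in "knowledge-based" is stripped by rm_misc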

lemmatize_data

I lemmatize each token so that inflected forms are reduced to a common base (e.g. 'solutions' becomes 'solution'), which shrinks the vocabulary.

Parameters

input_data: a list of tokens returned by the clean_context function.
output: a list of lemmatized tokens.

Example

input: [‘even’, ‘though’, ‘supervised’, ‘ones’, ‘tend’, ‘to’, ‘perform’, ‘best’, ‘in’, ‘terms’, ‘of’, ‘accuracy’, ‘they’, ‘often’, ‘lose’, ‘ground’, ‘to’, ‘more’, ‘flexible’, ‘knowledgebased’, ‘solutions’, ‘which’, ‘do’, ‘not’, ‘require’, ‘training’, ‘by’, ‘a’, ‘word’, ‘expert’, ‘for’, ‘every’, ‘disambiguation’, ‘target’]
output: [‘even’, ‘though’, ‘supervised’, ‘one’, ‘tend’, ‘to’, ‘perform’, ‘best’, ‘in’, ‘term’, ‘of’, ‘accuracy’, ‘they’, ‘often’, ‘lose’, ‘ground’, ‘to’, ‘more’, ‘flexible’, ‘knowledgebased’, ‘solution’, ‘which’, ‘do’, ‘not’, ‘require’, ‘training’, ‘by’, ‘a’, ‘word’, ‘expert’, ‘for’, ‘every’, ‘disambiguation’, ‘target’]

from nltk.stem import WordNetLemmatizer

def lemmatize_data(input_data):
    # Map each token to its WordNet lemma (defaults to noun lemmatization).
    result = []
    wnl = WordNetLemmatizer()
    for token in input_data:
        result.append(wnl.lemmatize(token))
    return result
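
Chaining the two utilities reproduces the example output above; this is a minimal sketch, assuming the sentence variable from the earlier snippet:

tokens = lemmatize_data(clean_context(sentence))
print(tokens[:10])
# ['even', 'though', 'supervised', 'one', 'tend', 'to', 'perform', 'best', 'in', 'term']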

In summary

Now I take advantage of the features of DataFrame and the two utility functions described above to preprocess the data.

import re
import nltk
from nltk.corpus import stopwords
from KaiCode.preprocessing import clean_context, lemmatize_data

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
# Remove blank rows if any.
Corpus['full_text'].dropna(inplace=True)
# Clean and lemmatize each document into a list of tokens.
Corpus['full_text'] = [lemmatize_data(clean_context(entry)) for entry in Corpus['full_text']]
# Remove stopwords and join the remaining tokens back into one string per document.
Corpus['full_text'] = [' '.join([token for token in entry if token not in stop_words]) for entry in Corpus['full_text']]

Step 1: drop all the empty rows.
Step 2: build a list whose elements are lists of clean tokens.
Step 3: remove all the stopwords from each sentence (a toy end-to-end sketch follows below).
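
For a concrete picture of what these steps produce, here is a toy end-to-end run on a one-row corpus; the DataFrame contents and the class label are made up for illustration, since the real Corpus is loaded elsewhere in the project.

import pandas as pd

# Hypothetical one-row corpus; the real data is loaded elsewhere.
Corpus = pd.DataFrame({
    'full_text': ["Even though supervised ones tend to perform best in terms of accuracy."],
    'class': ['word-sense-disambiguation'],
})
Corpus['full_text'].dropna(inplace=True)
Corpus['full_text'] = [lemmatize_data(clean_context(entry)) for entry in Corpus['full_text']]
Corpus['full_text'] = [' '.join([t for t in entry if t not in stop_words]) for entry in Corpus['full_text']]
print(Corpus['full_text'][0])
# -> 'even though supervised one tend perform best term accuracy'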

from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# encode the label (instantiated and fitted here; skip the fit if it was done earlier)
Encoder = LabelEncoder()
Corpus['class'] = Encoder.fit_transform(Corpus['class'])
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['full_text'], Corpus['class'], test_size=0.3)

# feature engineering: fit TF-IDF on the whole corpus, then transform both splits
Tfidf_vect = TfidfVectorizer()
Tfidf_vect.fit(Corpus['full_text'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

Step 1: encode each category into an integer, which serves as the final label for our model.
Step 2: split the dataset into a training set and a test set.
Step 3: build the TF-IDF matrices that serve as the final input to our model (a short usage sketch follows below).
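
To show how these matrices would be consumed, here is a hypothetical baseline fit on the TF-IDF features; this is only a sketch, not the model actually used later in the project.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical baseline; any scikit-learn estimator that accepts sparse input works here.
clf = MultinomialNB()
clf.fit(Train_X_Tfidf, Train_Y)
predictions = clf.predict(Test_X_Tfidf)
print('accuracy:', accuracy_score(Test_Y, predictions))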
