Getting Started with NLTK

The NLTK module is a massive toolkit designed to help you across the whole natural language processing (NLP) workflow. NLTK gives you everything from splitting paragraphs into sentences and splitting out words, to identifying the parts of speech of those words, highlighting key topics, and even helping your machine understand what a text is about.
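
A minimal sketch of those first steps - sentence splitting, word tokenization, and part-of-speech tagging - with NLTK (the sample text is made up for illustration):

    import nltk
    nltk.download('punkt')                       # sentence/word tokenizer models
    nltk.download('averaged_perceptron_tagger')  # POS tagger model
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "NLTK is a leading platform for NLP. It ships with many corpora."  # made-up sample
    for sentence in sent_tokenize(text):   # split the paragraph into sentences
        words = word_tokenize(sentence)    # split each sentence into tokens
        print(nltk.pos_tag(words))         # tag every token with its part of speech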

  • Tokenizing - splitting a body of text into sentences and words.
  • Part-of-speech (POS) tagging.
  • Machine learning with the Naive Bayes classifier.
  • How to use Scikit-learn (sklearn) together with NLTK.
  • Training a classifier on a dataset.
  • Live, streaming sentiment analysis with Twitter (a minimal sentiment-scoring sketch follows this list).
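
For the last bullet, the Twitter streaming itself needs API credentials, but the sentiment-scoring half can be sketched with NLTK's bundled VADER analyzer (the tweet text is made up):

    import nltk
    nltk.download('vader_lexicon')  # lexicon that VADER scores against
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    tweet = "I love this new phone, the battery is amazing!"  # hypothetical tweet
    print(sia.polarity_scores(tweet))  # dict of neg/neu/pos/compound scores

The rest of the walkthrough is the sklearn text-classification workflow from the list above:
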
    # Loading the data set - training data.
    from sklearn.datasets import fetch_20newsgroups
    twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
    # You can check the target names (categories) and some data files with the following commands.
    twenty_train.target_names  # prints all the categories
    print("\n".join(twenty_train.data[0].split("\n")[:3]))  # prints the first three lines of the first data file
    # Extracting features from text files
    from sklearn.feature_extraction.text import CountVectorizer
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(twenty_train.data)
    X_train_counts.shape
    # TF-IDF
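    # With the default smooth_idf=True, sklearn computes
    # tf-idf(t, d) = tf(t, d) * (ln((1 + n) / (1 + df(t))) + 1),
    # where n is the number of documents and df(t) is the document frequency
    # of term t, and then L2-normalizes each row.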
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    X_train_tfidf.shape
    # Machine Learning
    # Training Naive Bayes (NB) classifier on training data.
    from sklearn.naive_bayes import MultinomialNB
    clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
    # Building a pipeline: we can write less code and do all of the above by building a pipeline as follows.
    # The names 'vect', 'tfidf' and 'clf' are arbitrary but will be used later.
    # We will be using 'text_clf' going forward.
    from sklearn.pipeline import Pipeline
    
    text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
    
    text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
    # Performance of NB Classifier
    import numpy as np
    twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
    predicted = text_clf.predict(twenty_test.data)
    np.mean(predicted == twenty_test.target)
    # Training Support Vector Machines - SVM and calculating its performance
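    # SGDClassifier with loss='hinge' fits a linear SVM by stochastic gradient
    # descent; penalty='l2' is the regularizer and alpha scales its strength.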
    
    from sklearn.linear_model import SGDClassifier
    text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                             ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, random_state=42))])
    
    text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)
    predicted_svm = text_clf_svm.predict(twenty_test.data)
    np.mean(predicted_svm == twenty_test.target)
    # Grid Search
    # Here, we are creating a list of parameters for which we would like to do performance tuning.
    # All the parameter names start with the pipeline step name (remember the arbitrary names we gave).
    # E.g. vect__ngram_range: here we are telling it to try unigrams and bigrams and choose whichever is optimal.
    
    from sklearn.model_selection import GridSearchCV
    parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
    # Next, we create an instance of the grid search by passing the classifier, parameters,
    # and n_jobs=-1, which tells it to use all available cores on the user's machine.
    
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
    gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
    
    # To see the best mean score and the params, run the following code
    
    gs_clf.best_score_
    gs_clf.best_params_
    
    # The accuracy now increases to ~90.6% for the NB classifier (not so naive anymore!),
    # with the corresponding parameters {'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}.
    
    # Similarly, doing a grid search for SVM.
    parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf-svm__alpha': (1e-2, 1e-3)}
    
    gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
    gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)
    
    
    gs_clf_svm.best_score_
    gs_clf_svm.best_params_
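
    # (Assumption: pandas is available.) cv_results_ records every parameter
    # combination tried; this shows each one with its mean cross-validated score.
    import pandas as pd
    print(pd.DataFrame(gs_clf_svm.cv_results_)[['params', 'mean_test_score']])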
    
    # NLTK
    # Removing stop words (here via CountVectorizer's built-in English stop word list)
    from sklearn.pipeline import Pipeline
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), 
                         ('clf', MultinomialNB())])
    
    # Stemming Code
    
    import nltk
    # nltk.download()  # opens the interactive downloader GUI
    nltk.download('stopwords')  # stop word list needed by ignore_stopwords below
    from nltk.stem.snowball import SnowballStemmer
    stemmer = SnowballStemmer("english", ignore_stopwords=True)
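    # ignore_stopwords=True makes the stemmer pass English stop words through unstemmed.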
    
    class StemmedCountVectorizer(CountVectorizer):
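        # Override build_analyzer so each token from the standard analyzer is stemmed.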
        def build_analyzer(self):
            analyzer = super().build_analyzer()
            return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]
        
    stemmed_count_vect = StemmedCountVectorizer(stop_words='english')
    
    text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()), 
                                 ('mnb', MultinomialNB(fit_prior=False))])
    
    text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)
    
    predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
    
    np.mean(predicted_mnb_stemmed == twenty_test.target)
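
As a quick usage check, the fitted pipeline classifies raw strings end to end; the two sample documents below are made up for illustration:

    docs_new = ['God is love', 'OpenGL on the GPU is fast']  # hypothetical documents
    predicted_new = text_mnb_stemmed.predict(docs_new)       # vectorize -> tf-idf -> classify
    for doc, category in zip(docs_new, predicted_new):
        print('%r => %s' % (doc, twenty_train.target_names[category]))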
