Out-of-core text classification with Python and scikit-learn

weixin_39637386

Article link: https://blog.csdn.net/weixin_39637386/article/details/111365668

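This example shows how scikit-learn can be used for out-of-core text classification, that is, learning from data that does not fit into main memory. We use an online classifier, i.e. one that supports the partial_fit method, and feed it batches of examples. To guarantee that the feature space stays the same over time, we use a HashingVectorizer, which projects each example into the same feature space. This is especially useful for text classification, where new features (words) may appear in each batch.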
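As a minimal sketch of the idea described above, the snippet below feeds a few mini-batches to an online classifier through a stateless HashingVectorizer. The toy texts, the labels, and the choice of SGDClassifier with these parameters are assumptions made for illustration; they are not the Reuters data or the exact configuration used later in this example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up mini-batches standing in for chunks streamed from disk.
batches = [
    (["cheap pills buy now", "meeting agenda attached"], [1, 0]),
    (["win money fast", "quarterly report draft"], [1, 0]),
    (["limited offer click here", "lunch tomorrow at noon"], [1, 0]),
]

# HashingVectorizer is stateless: it has no vocabulary to fit, so words that
# first appear in a later batch still map into the same fixed set of columns.
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)

clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit must know every class up front

for texts, labels in batches:
    X = vectorizer.transform(texts)      # only this batch is held in memory
    clf.partial_fit(X, labels, classes=all_classes)

X_new = vectorizer.transform(["buy cheap pills"])
prediction = clf.predict(X_new)
```

Because the vectorizer never stores a vocabulary, every batch is transformed into a sparse matrix with the same number of columns, which is what lets partial_fit keep updating a single weight vector.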

# Authors: Eustache Diemert
#          @FedericoV
# License: BSD 3 clause
from glob import glob
import itertools
import os.path
import re
import tarfile
import time
import sys
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
from html.parser import HTMLParser
from urllib.request import urlretrieve
from sklearn.datasets import get_data_home
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import MultinomialNB
def _not_in_sphinx():
    # Hack to detect whether we are running by the sphinx builder
    return '__file__' in globals()

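Reuters dataset related routines

The dataset used in this example is Reuters-21578 as provided by the UCI ML repository. It is automatically downloaded and uncompressed on the first run.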

class ReutersParser(HTMLParser):
    """Utility class to parse an SGML file and yield documents one at a time."""
    def __init__(self, encoding='latin-1'):
        HTMLParser.__init__(self)
        self._reset()
        self.encoding = encoding
    def handle_starttag(self, tag, attrs):
        method = 'start_' + tag
        getattr(self, method, lambda x: None)(attrs)
    def handle_endtag(self, tag):
        method = 'end_' + tag
        getattr(self, method, lambda: None)()
    def _reset(self):
        self.in_title = 0
        self.in_body = 0
        self.in_topics = 0
        self.in_topic_d = 0
        self.title = ""
        self.body = ""
        self.topics = []
        self.topic_d = ""
    def parse(self, fd):
        self.docs = []
        for chunk in fd:
            self.feed(chunk.decode(self.encoding))
            # Yield every document completed by this chunk, then clear the buffer
            for doc in self.docs:
                yield doc
            self.docs = []
        self.close()
    def handle_data(self, data):
        if self.in_body:
            self.body += data
        elif self.in_title:
            self.title += data
        elif self.in_topic_d:
            self.topic_d += data
    def start_reuters(self, attributes):
        pass
    def end_reuters(self):
        self.body = re.sub(r'\s+', r' ', self.body)
        self.docs.append({'title': self.title,
                          'body': self.body,
                          'topics': self.topics})
        self._reset()
    def start_title(self, attributes):
        self.in_title = 1
    def end_title(self):
        self.in_title = 0
    def start_body(self, attributes):
        self.in_body = 1
    def end_body(self):
        self.in_body = 0
    def start_topics(self, attributes):
        self.in_topics = 1
    def end_topics(self):
        self.in_topics = 0
    def start_d(self, attributes):
        self.in_topic_d = 1
    def end_d(self):
        self.in_topic_d = 0
        self.topics.append(self.topic_d)
        self.topic_d = ""
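The parser class above routes each tag to a `start_<tag>` or `end_<tag>` method via `getattr`, falling back to a no-op for tags it does not care about. The sketch below isolates that dispatch pattern in a much smaller parser; `TitleCollector` and the sample markup are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical mini-parser: collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.current = ""
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Look up start_<tag>; unknown tags fall through to a no-op.
        getattr(self, 'start_' + tag, lambda attrs: None)(attrs)

    def handle_endtag(self, tag):
        getattr(self, 'end_' + tag, lambda: None)()

    def handle_data(self, data):
        if self.in_title:
            self.current += data

    def start_title(self, attrs):
        self.in_title = True

    def end_title(self):
        self.in_title = False
        self.titles.append(self.current)
        self.current = ""


parser = TitleCollector()
parser.feed("<reuters><title>Grain prices rise</title><body>ignored</body></reuters>")
# A document may be split across chunks; the parser keeps its state between feeds.
parser.feed("<reuters><title>Oil output")
parser.feed(" steady</title></reuters>")
parser.close()
titles = parser.titles
```

Since the parser keeps its state between calls to `feed`, a file can be read in small chunks and completed documents handed out as soon as their closing tag arrives, which is the behaviour the `parse` generator above relies on.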
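def stream_reuters_documents(data_path=None):
    """Iterate over documents of the Reuters dataset.

    The Reuters archive is automatically downloaded and uncompressed if the
    `data_path` directory does not exist.

    Documents are represented as dictionaries with 'body' (str),
    'title' (str), 'topics' (list(str)) keys.
    """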
