Text Classification with FastText (Java Edition)

Text classification, also known as automatic text classification, is the process by which a computer maps a text carrying information to one or more predefined categories or topics; the algorithmic model that performs this mapping is called a classifier. (This definition is borrowed from another author's article, which covers the history of text classification and several classification algorithms in detail; it is worth a read if you are interested.)

This post focuses on using FastText to implement text classification. If you want to understand the underlying theory, I recommend the linked article, where the author explains the principles in great detail; credit to them.

Dependency

<dependency>
    <groupId>com.github.sszuev</groupId>
    <artifactId>fasttext</artifactId>
    <version>1.0.0</version>
</dependency>

If you just want the jar, search for FastText on the Maven Repository website, where all available versions are listed.

Training the Model

The following code trains the model. The trained model is saved to the specified path; you only need to train once, after which the model can be reused for prediction.

try {
    Main.train(new String[]{"supervised",
            "-input", "./parameter/category_train.txt",  // labeled training data
            "-output", "./parameter/commodity_model",     // output path and name of the trained model
            "-dim", "64",          // dimension of the word vectors
            "-lr", "0.5",          // learning rate
            "-wordNgrams", "2",    // max length of word n-grams
            "-minCount", "1",      // minimum word frequency
            "-bucket", "10000000"  // number of hash buckets
    });
} catch (Exception e) {
    e.printStackTrace();
}

So what does a labeled training set look like? Below is a sample taken from the official FastText test data. Every label starts with the __label__ prefix; this prefix is configurable, but that is the default. The sample is in English; for Chinese text you should consider word segmentation and stop-word filtering first (a small preprocessing sketch follows the sample).

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
__label__tea What kind of tea do you boil for 45minutes?
__label__baking __label__baking-powder __label__baking-soda __label__leavening How long can batter sit before chemical leaveners lose their power?
__label__food-safety __label__soup Can I RE-freeze chicken soup after it has thawed?
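The following is only a rough sketch of producing such a __label__-formatted training file from a tab-separated "category<TAB>text" file. The file names, the tokenize placeholder, and the stop-word set are all hypothetical; in a real Chinese pipeline you would plug in a proper segmenter (e.g. HanLP or jieba) instead of the simple whitespace split.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PrepareTrainingFile {

    // Hypothetical stop-word set; load a proper list in practice.
    private static final Set<String> STOP_WORDS = Set.of("的", "了", "和");

    // Placeholder tokenizer: splits on whitespace. Replace with a real
    // Chinese segmenter (HanLP, jieba, ...) when the input is Chinese.
    private static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file: one "category<TAB>text" row per line.
        List<String> rows = Files.readAllLines(
                Paths.get("./parameter/raw_data.tsv"), StandardCharsets.UTF_8);

        List<String> fastTextLines = rows.stream()
                .map(row -> row.split("\t", 2))
                .filter(parts -> parts.length == 2)
                .map(parts -> "__label__" + parts[0] + " " +
                        tokenize(parts[1]).stream()
                                .filter(tok -> !STOP_WORDS.contains(tok))
                                .collect(Collectors.joining(" ")))
                .collect(Collectors.toList());

        // Written in the same format as the sample above.
        Files.write(Paths.get("./parameter/category_train.txt"),
                fastTextLines, StandardCharsets.UTF_8);
    }
}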

There are actually many more training parameters you can set, listed below. Since there are quite a few, the best way to understand what each one means is to spend some time with the official documentation. (A sketch that uses a few of the optional flags follows the list.)

The following arguments are mandatory:
    -input        training file path
    -output       output file path

The following arguments are optional:
    -lr           learning rate [0.05]
    -lrUpdateRate change the rate of updates for the learning rate [100]
    -dim          size of word vectors [100]  // this determines the dimension of the word vectors
    -ws           size of the context window [5]
    -epoch        number of epochs [5]
    -minCount     minimal number of word occurences [1]  // minimum word frequency
    -neg          number of negatives sampled [5]
    -wordNgrams   max length of word ngram [1]
    -loss         loss function {ns, hs, softmax} [ns]
    -bucket       number of buckets [2000000]
    -minn         min length of char ngram [3]
    -maxn         max length of char ngram [6]
    -thread       number of threads [12]
    -t            sampling threshold [0.0001]
    -label        labels prefix [__label__]  // here you can set your own label prefix
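As a sketch only (the flag values below are placeholders, not taken from the original post), the optional flags are passed to Main.train exactly like the mandatory ones. For example, training with a custom label prefix, more epochs, and the softmax loss could look like this:

try {
    Main.train(new String[]{"supervised",
            "-input", "./parameter/category_train.txt",
            "-output", "./parameter/commodity_model",
            "-label", "__myTag__",  // hypothetical custom label prefix used in the training file
            "-epoch", "25",         // more passes over the training data
            "-loss", "softmax",     // plain softmax instead of the default ns
            "-thread", "4"          // number of training threads
    });
} catch (Exception e) {
    e.printStackTrace();
}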

The trained model is written to commodity_model.bin, and a word-vector file commodity_model.vec is generated alongside it. The word-vector file looks like this (a minimal parsing sketch follows the sample):

37899 100
</s> -0.55874 -0.10935 0.17115 0.2388 0.2805 0.11883 0.10674 -0.13214 0.32224 -0.54488 0.4239 -0.51117 -0.56261 -0.23805 0.68661 0.35541 -0.77821 -0.53468 0.60013 0.063523 1.1078 -0.31767 -0.38917 -0.85577 -0.35906 0.23197 0.99904 0.26574 -0.27303 0.0091292 0.62824 0.35154 -0.18146 0.62103 -0.65914 0.99872 0.16585 0.2622 -0.1481 -0.22537 -0.27048 0.075146 -0.15598 -0.28847 0.16145 0.10381 0.32652 -0.45171 -0.26597 -0.061287 0.01858 0.50429 -0.17517 0.21205 -0.023571 0.37332 -0.41411 -0.34945 0.28114 -0.0046294 0.39406 0.39902 -0.25752 -0.052666 0.27889 -0.53025 -0.13618 1.1267 0.032445 -0.62555 0.45881 0.24994 -0.12783 1.0336 -0.56122 0.36742 0.26783 -0.064086 0.56198 -0.054111 0.27858 0.43912 -0.21053 0.20468 0.29792 0.16496 0.13347 0.85231 -0.048318 -0.86905 -0.57763 -0.26486 0.58158 0.49263 -0.14475 0.27656 0.29959 -0.37355 -0.24024 -0.83969
新款 -0.36792 -0.052885 0.2175 0.25894 0.14599 0.67911 0.0028615 0.13432 0.26988 0.44109 -0.047146 0.30435 -0.0052764 0.086225 -0.024577 -0.19852 -0.13228 0.27177 -0.26321 0.53231 0.030532 -0.12034 -0.21005 -0.035567 -0.09993 -0.21439 0.30124 -0.081924 0.219 -0.27545 -0.1321 -0.19909 0.49169 -0.35514 0.010071 0.32131 0.13274 0.0017961 -0.25752 0.15799 -0.15891 -0.19768 0.11458 -0.071166 -0.0049989 -0.10033 -0.089192 0.0051551 0.1948 -0.32105 -0.12673 -0.021479 0.2035 -0.3036 0.042287 -0.18418 -0.16937 0.0034305 -0.00054013 -0.20262 0.21633 0.55448 0.5047 -0.30521 -0.33969 0.23641 -0.13683 -0.039237 -0.038262 0.30688 -0.42023 0.17422 -0.36334 -0.027693 0.13593 0.055707 -0.45232 0.21901 0.0038972 0.073224 -0.24337 -0.17771 0.36014 0.0093526 -0.12155 -0.20782 -0.10076 0.0017971 -0.24683 -0.35155 -0.33855 -0.032884 -0.10803 -0.11618 0.06904 0.36682 0.30256 -0.11811 -0.2719 -0.3967
款 0.10719 -0.047097 0.035313 -0.11604 -0.11128 0.098889 0.07338 -0.062592 -0.36591 0.33696 -0.034139 0.41233 -0.044029 0.26736 -0.35058 0.0011242 0.012222 0.28305 -0.42311 0.49023 -0.063058 0.063913 0.065111 -0.073718 0.0074959 -0.36847 -0.041656 -0.09248 0.11114 -0.20892 -0.16453 -0.47284 0.23434 -0.26671 -0.22557 0.31284 0.21075 0.018571 0.10862 0.093459 -0.40156 0.10052 0.15448 0.38009 0.057931 0.10816 -0.456 -0.088661 0.25591 -0.25163 -0.44263 -0.44058 -0.1008 -0.38986 0.14394 -0.45649 0.031487 0.030237 -0.10706 -0.16413 0.43202 0.39386 0.21072 0.26409 0.43194 0.21329 0.10789 0.1402 0.23095 0.044306 -0.19572 -0.022013 -0.080285 -0.13523 0.076271 -0.073187 -0.044929 -0.15822 0.15628 -0.32383 -0.35998 0.16517 -0.20735 -0.066001 -0.012753 0.08091 -0.057798 -0.14446 -0.10688 -0.19407 -0.38554 0.12981 0.28439 -0.030242 0.16122 0.47328 -0.30161 -0.27374 -0.3731 -0.051387
斤 -0.18143 -0.40266 0.036 0.009181 0.12026 -0.29335 -0.15063 -0.078739 -0.1841 0.038445 0.11106 -0.32147 0.1611 0.38768 -0.14828 -0.30435 0.0057699 -0.1971 0.14482 -0.20802 0.71822 0.062339 0.067823 0.23867 -0.013739 -0.20461 0.045837 -0.19937 0.0002158 0.020062 -0.051749 -0.38833 0.010522 -0.056716 0.14504 0.16327 -0.096777 0.079007 0.015782 -0.073867 0.14208 -0.30683 -0.37885 -0.0098635 -0.071231 -0.079818 0.36771 0.097449 -0.54126 0.081269 0.017566 -0.80512 -0.18466 -0.45232 0.11477 0.53177 -0.36932 -0.46266 0.22826 0.20209 0.20264 0.22821 -0.1284 0.30707 0.33509 0.25278 0.18359 -0.122 -0.16871 0.078083 0.03181 0.043715 -0.027783 0.18351 0.22478 0.12866 0.54027 0.033152 0.14358 -0.08765 0.82464 0.19282 -0.64852 -0.045806 -0.12577 0.46663 0.24793 0.40814 0.10723 -0.050818 0.39453 -0.16839 -0.096019 -0.30689 0.3576 0.078405 -0.070142 -0.10879 0.08264 0.23929
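The first line of the .vec file holds the vocabulary size and the vector dimension, and every following line is a word followed by its vector components. As a minimal sketch (not from the original post; it relies only on the plain-text format shown above), the file can be read back like this:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LoadVecFile {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(
                Paths.get("./parameter/commodity_model.vec"), StandardCharsets.UTF_8);

        // Header line: "<vocabulary size> <vector dimension>"
        String[] header = lines.get(0).trim().split("\\s+");
        int vocabSize = Integer.parseInt(header[0]);
        int dim = Integer.parseInt(header[1]);

        Map<String, float[]> vectors = new HashMap<>(vocabSize);
        for (int i = 1; i < lines.size(); i++) {
            String[] parts = lines.get(i).trim().split("\\s+");
            if (parts.length != dim + 1) {
                continue; // skip malformed or empty lines
            }
            float[] vec = new float[dim];
            for (int j = 0; j < dim; j++) {
                vec[j] = Float.parseFloat(parts[j + 1]);
            }
            vectors.put(parts[0], vec);
        }
        System.out.println("Loaded " + vectors.size() + " word vectors of dimension " + dim);
    }
}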

Prediction

With the model ready, we can start predicting. I used 25,253 rows of already-labeled data to check the model's accuracy.

// load the model
FastText fastText = FastText.load("./parameter/commodity_model.bin");
// predict
Map<String, Float> map = fastText.predictLine("some line of text", 1);

The second argument, 1, means only the single best label is returned; pass k to get the top k labels. The returned map contains the labels and their scores. The results are shown below.

[Screenshot: prediction results]

The measured accuracy is 86.0%. After inspecting the rows where the prediction and the manual label disagree, at least 700-800 of them are cases where the predicted label actually fits better, and another 500-600 are cases where both labels are reasonable, so the effective accuracy should be above 90%. That basically meets the requirements for production use.
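As a rough sketch of how such an accuracy figure can be computed (this is not the author's original evaluation code; the test-file path is hypothetical, and the file is assumed to use the same single "__label__category text" format as the training data):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;

public class EvaluateModel {
    public static void main(String[] args) throws Exception {
        // FastText comes from the com.github.sszuev:fasttext dependency above.
        FastText fastText = FastText.load("./parameter/commodity_model.bin");

        // Assumed test file: one "__label__category text..." row per line.
        List<String> rows = Files.readAllLines(
                Paths.get("./parameter/category_test.txt"), StandardCharsets.UTF_8);

        int total = 0;
        int correct = 0;
        for (String row : rows) {
            int firstSpace = row.indexOf(' ');
            if (firstSpace < 0) {
                continue;
            }
            String expected = row.substring(0, firstSpace);
            String text = row.substring(firstSpace + 1);

            // Top-1 prediction: the returned map holds labels and their scores.
            Map<String, Float> prediction = fastText.predictLine(text, 1);
            String predicted = prediction.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse("");

            total++;
            if (expected.equals(predicted)) {
                correct++;
            }
        }
        System.out.printf("Accuracy: %.1f%% (%d/%d)%n",
                100.0 * correct / total, correct, total);
    }
}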

