Text Classification with FastText (Java Edition)

Text classification, also known as automatic text classification, is the process by which a computer maps a text carrying information to one or more predefined categories or topics; the algorithmic model that performs this mapping is called a classifier. (This definition is borrowed from another author's article, which covers the history of text classification and several classification algorithms in detail; it is worth a read if you are interested.)

This post focuses on using FastText to implement text classification. If you want to understand the underlying theory, I recommend the linked article, where the author explains the principles in great detail; credit to them.

Dependency

<dependency>
    <groupId>com.github.sszuev</groupId>
    <artifactId>fasttext</artifactId>
    <version>1.0.0</version>
</dependency>

If you just want the jar, search for FastText on the Maven Repository website, where all available versions are listed.

Training the Model

The following code trains the model. The trained model is saved to the specified path; you only need to train once, after which the model can be reused for prediction.

try {
    Main.train(new String[]{"supervised",
            "-input", "./parameter/category_train.txt",  // labeled training data
            "-output", "./parameter/commodity_model",     // output path and name of the trained model
            "-dim", "64",          // dimension of the word vectors
            "-lr", "0.5",          // learning rate
            "-wordNgrams", "2",    // max length of word n-grams
            "-minCount", "1",      // minimum word frequency
            "-bucket", "10000000"  // number of hash buckets
    });
} catch (Exception e) {
    e.printStackTrace();
}

So what does a labeled training set look like? Below is a sample taken from the official FastText test data. Every label starts with the __label__ prefix; this prefix is configurable, but that is the default. The sample is in English; for Chinese text you should consider word segmentation and stop-word filtering first (a small preprocessing sketch follows the sample).

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
__label__tea What kind of tea do you boil for 45minutes?
__label__baking __label__baking-powder __label__baking-soda __label__leavening How long can batter sit before chemical leaveners lose their power?
__label__food-safety __label__soup Can I RE-freeze chicken soup after it has thawed?
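The following is only a rough sketch of producing such a __label__-formatted training file from a tab-separated "category<TAB>text" file. The file names, the tokenize placeholder, and the stop-word set are all hypothetical; in a real Chinese pipeline you would plug in a proper segmenter (e.g. HanLP or jieba) instead of the simple whitespace split.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PrepareTrainingFile {

    // Hypothetical stop-word set; load a proper list in practice.
    private static final Set<String> STOP_WORDS = Set.of("的", "了", "和");

    // Placeholder tokenizer: splits on whitespace. Replace with a real
    // Chinese segmenter (HanLP, jieba, ...) when the input is Chinese.
    private static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file: one "category<TAB>text" row per line.
        List<String> rows = Files.readAllLines(
                Paths.get("./parameter/raw_data.tsv"), StandardCharsets.UTF_8);

        List<String> fastTextLines = rows.stream()
                .map(row -> row.split("\t", 2))
                .filter(parts -> parts.length == 2)
                .map(parts -> "__label__" + parts[0] + " " +
                        tokenize(parts[1]).stream()
                                .filter(tok -> !STOP_WORDS.contains(tok))
                                .collect(Collectors.joining(" ")))
                .collect(Collectors.toList());

        // Written in the same format as the sample above.
        Files.write(Paths.get("./parameter/category_train.txt"),
                fastTextLines, StandardCharsets.UTF_8);
    }
}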

There are actually many more training parameters you can set, listed below. Since there are quite a few, the best way to understand what each one means is to spend some time with the official documentation. (A sketch that uses a few of the optional flags follows the list.)

The following arguments are mandatory:
    -input        training file path
    -output       output file path

The following arguments are optional:
    -lr           learning rate [0.05]
    -lrUpdateRate change the rate of updates for the learning rate [100]
    -dim          size of word vectors [100]  // this determines the dimension of the word vectors
    -ws           size of the context window [5]
    -epoch        number of epochs [5]
    -minCount     minimal number of word occurences [1]  // minimum word frequency
    -neg          number of negatives sampled [5]
    -wordNgrams   max length of word ngram [1]
    -loss         loss function {ns, hs, softmax} [ns]
    -bucket       number of buckets [2000000]
    -minn         min length of char ngram [3]
    -maxn         max length of char ngram [6]
    -thread       number of threads [12]
    -t            sampling threshold [0.0001]
    -label        labels prefix [__label__]  // here you can set your own label prefix
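As a sketch only (the flag values below are placeholders, not taken from the original post), the optional flags are passed to Main.train exactly like the mandatory ones. For example, training with a custom label prefix, more epochs, and the softmax loss could look like this:

try {
    Main.train(new String[]{"supervised",
            "-input", "./parameter/category_train.txt",
            "-output", "./parameter/commodity_model",
            "-label", "__myTag__",  // hypothetical custom label prefix used in the training file
            "-epoch", "25",         // more passes over the training data
            "-loss", "softmax",     // plain softmax instead of the default ns
            "-thread", "4"          // number of training threads
    });
} catch (Exception e) {
    e.printStackTrace();
}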

The trained model is written to commodity_model.bin, and a word-vector file commodity_model.vec is generated alongside it. The word-vector file looks like this (a minimal parsing sketch follows the sample):

37899 100
</s> -0.55874 -0.10935 0.17115 0.2388 0.2805 0.11883 0.10674 -0.13214 0.32224 -0.54488 0.4239 -0.51117 -0.56261 -0.23805 0.68661 0.35541 -0.77821 -0.53468 0.60013 0.063523 1.1078 -0.31767 -0.38917 -0.85577 -0.35906 0.23197 0.99904 0.26574 -0.27303 0.0091292 0.62824 0.35154 -0.18146 0.62103 -0.65914 0.99872 0.16585 0.2622 -0.1481 -0.22537 -0.27048 0.075146 -0.15598 -0.28847 0.16145 0.10381 0.32652 -0.45171 -0.26597 -0.061287 0.01858 0.50429 -0.17517 0.21205 -0.023571 0.37332 -0.41411 -0.34945 0.28114 -0.0046294 0.39406 0.39902 -0.25752 -0.052666 0.27889 -0.53025 -0.13618 1.1267 0.032445 -0.62555 0.45881 0.24994 -0.12783 1.0336 -0.56122 0.36742 0.26783 -0.064086 0.56198 -0.054111 0.27858 0.43912 -0.21053 0.20468 0.29792 0.16496 0.13347 0.85231 -0.048318 -0.86905 -0.57763 -0.26486 0.58158 0.49263 -0.14475 0.27656 0.29959 -0.37355 -0.24024 -0.83969
新款 -0.36792 -0.052885 0.2175 0.25894 0.14599 0.67911 0.0028615 0.13432 0.26988 0.44109 -0.047146 0.30435 -0.0052764 0.086225 -0.024577 -0.19852 -0.13228 0.27177 -0.26321 0.53231 0.030532 -0.12034 -0.21005 -0.035567 -0.09993 -0.21439 0.30124 -0.081924 0.219 -0.27545 -0.1321 -0.19909 0.49169 -0.35514 0.010071 0.32131 0.13274 0.0017961 -0.25752 0.15799 -0.15891 -0.19768 0.11458 -0.071166 -0.0049989 -0.10033 -0.089192 0.0051551 0.1948 -0.32105 -0.12673 -0.021479 0.2035 -0.3036 0.042287 -0.18418 -0.16937 0.0034305 -0.00054013 -0.20262 0.21633 0.55448 0.5047 -0.30521 -0.33969 0.23641 -0.13683 -0.039237 -0.038262 0.30688 -0.42023 0.17422 -0.36334 -0.027693 0.13593 0.055707 -0.45232 0.21901 0.0038972 0.073224 -0.24337 -0.17771 0.36014 0.0093526 -0.12155 -0.20782 -0.10076 0.0017971 -0.24683 -0.35155 -0.33855 -0.032884 -0.10803 -0.11618 0.06904 0.36682 0.30256 -0.11811 -0.2719 -0.3967
款 0.10719 -0.047097 0.035313 -0.11604 -0.11128 0.098889 0.07338 -0.062592 -0.36591 0.33696 -0.034139 0.41233 -0.044029 0.26736 -0.35058 0.0011242 0.012222 0.28305 -0.42311 0.49023 -0.063058 0.063913 0.065111 -0.073718 0.0074959 -0.36847 -0.041656 -0.09248 0.11114 -0.20892 -0.16453 -0.47284 0.23434 -0.26671 -0.22557 0.31284 0.21075 0.018571 0.10862 0.093459 -0.40156 0.10052 0.15448 0.38009 0.057931 0.10816 -0.456 -0.088661 0.25591 -0.25163 -0.44263 -0.44058 -0.1008 -0.38986 0.14394 -0.45649 0.031487 0.030237 -0.10706 -0.16413 0.43202 0.39386 0.21072 0.26409 0.43194 0.21329 0.10789 0.1402 0.23095 0.044306 -0.19572 -0.022013 -0.080285 -0.13523 0.076271 -0.073187 -0.044929 -0.15822 0.15628 -0.32383 -0.35998 0.16517 -0.20735 -0.066001 -0.012753 0.08091 -0.057798 -0.14446 -0.10688 -0.19407 -0.38554 0.12981 0.28439 -0.030242 0.16122 0.47328 -0.30161 -0.27374 -0.3731 -0.051387
斤 -0.18143 -0.40266 0.036 0.009181 0.12026 -0.29335 -0.15063 -0.078739 -0.1841 0.038445 0.11106 -0.32147 0.1611 0.38768 -0.14828 -0.30435 0.0057699 -0.1971 0.14482 -0.20802 0.71822 0.062339 0.067823 0.23867 -0.013739 -0.20461 0.045837 -0.19937 0.0002158 0.020062 -0.051749 -0.38833 0.010522 -0.056716 0.14504 0.16327 -0.096777 0.079007 0.015782 -0.073867 0.14208 -0.30683 -0.37885 -0.0098635 -0.071231 -0.079818 0.36771 0.097449 -0.54126 0.081269 0.017566 -0.80512 -0.18466 -0.45232 0.11477 0.53177 -0.36932 -0.46266 0.22826 0.20209 0.20264 0.22821 -0.1284 0.30707 0.33509 0.25278 0.18359 -0.122 -0.16871 0.078083 0.03181 0.043715 -0.027783 0.18351 0.22478 0.12866 0.54027 0.033152 0.14358 -0.08765 0.82464 0.19282 -0.64852 -0.045806 -0.12577 0.46663 0.24793 0.40814 0.10723 -0.050818 0.39453 -0.16839 -0.096019 -0.30689 0.3576 0.078405 -0.070142 -0.10879 0.08264 0.23929
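The first line of the .vec file holds the vocabulary size and the vector dimension, and every following line is a word followed by its vector components. As a minimal sketch (not from the original post; it relies only on the plain-text format shown above), the file can be read back like this:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LoadVecFile {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(
                Paths.get("./parameter/commodity_model.vec"), StandardCharsets.UTF_8);

        // Header line: "<vocabulary size> <vector dimension>"
        String[] header = lines.get(0).trim().split("\\s+");
        int vocabSize = Integer.parseInt(header[0]);
        int dim = Integer.parseInt(header[1]);

        Map<String, float[]> vectors = new HashMap<>(vocabSize);
        for (int i = 1; i < lines.size(); i++) {
            String[] parts = lines.get(i).trim().split("\\s+");
            if (parts.length != dim + 1) {
                continue; // skip malformed or empty lines
            }
            float[] vec = new float[dim];
            for (int j = 0; j < dim; j++) {
                vec[j] = Float.parseFloat(parts[j + 1]);
            }
            vectors.put(parts[0], vec);
        }
        System.out.println("Loaded " + vectors.size() + " word vectors of dimension " + dim);
    }
}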

Prediction

With the model ready, we can start predicting. I used 25,253 rows of already-labeled data to check the model's accuracy.

// load the model
FastText fastText = FastText.load("./parameter/commodity_model.bin");
// predict
Map<String, Float> map = fastText.predictLine("some line of text", 1);

The second argument, 1, means only the single best label is returned; pass k to get the top k labels. The returned map contains the labels and their scores. The results are shown below.

[Screenshot: prediction results]

The measured accuracy is 86.0%. After inspecting the rows where the prediction and the manual label disagree, at least 700-800 of them are cases where the predicted label actually fits better, and another 500-600 are cases where both labels are reasonable, so the effective accuracy should be above 90%. That basically meets the requirements for production use.
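As a rough sketch of how such an accuracy figure can be computed (this is not the author's original evaluation code; the test-file path is hypothetical, and the file is assumed to use the same single "__label__category text" format as the training data):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;

public class EvaluateModel {
    public static void main(String[] args) throws Exception {
        // FastText comes from the com.github.sszuev:fasttext dependency above.
        FastText fastText = FastText.load("./parameter/commodity_model.bin");

        // Assumed test file: one "__label__category text..." row per line.
        List<String> rows = Files.readAllLines(
                Paths.get("./parameter/category_test.txt"), StandardCharsets.UTF_8);

        int total = 0;
        int correct = 0;
        for (String row : rows) {
            int firstSpace = row.indexOf(' ');
            if (firstSpace < 0) {
                continue;
            }
            String expected = row.substring(0, firstSpace);
            String text = row.substring(firstSpace + 1);

            // Top-1 prediction: the returned map holds labels and their scores.
            Map<String, Float> prediction = fastText.predictLine(text, 1);
            String predicted = prediction.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse("");

            total++;
            if (expected.equals(predicted)) {
                correct++;
            }
        }
        System.out.printf("Accuracy: %.1f%% (%d/%d)%n",
                100.0 * correct / total, correct, total);
    }
}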

