Mining frequent patterns with the FPGrowth algorithm bundled with Mahout

tangtang5156

Category: mahout. Tags: hadoop, mapreduce, mahout, fpgrowth

Article link: https://blog.csdn.net/tangtang5156/article/details/40982303

Create a test file and upload it to HDFS. My test file (fp.txt) is just a few lines of numbers I made up, one transaction per line:

1,5,2,3
5,7,3,4
5,2,3
1,5,2,7,3,4
1,2,4
5,2,4
1,2,3
1,5,2,6,3
1,5,6,3

hadoop fs -put fp.txt /

hadoop jar /opt/mahout-distribution-0.9/mahout-examples-0.9-job.jar org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i /fp.txt -o /out -s 1 -method mapreduce
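
As a point of reference for what the job above is expected to compute, here is a minimal brute-force sketch in Python. It is not Mahout's FP-tree based method; it simply counts, for the nine test transactions, how many lines contain each item and each pair of items, and keeps those that reach a minimum support (the threshold that -s sets on the command line). The names TRANSACTIONS, frequent_itemsets and max_size are illustrative assumptions, and the threshold 3 is chosen only to make the filtering visible.

```python
from collections import Counter
from itertools import combinations

# The nine test transactions from fp.txt, one comma-separated line each.
TRANSACTIONS = """1,5,2,3
5,7,3,4
5,2,3
1,5,2,7,3,4
1,2,4
5,2,4
1,2,3
1,5,2,6,3
1,5,6,3""".splitlines()


def frequent_itemsets(lines, min_support, max_size=2):
    """Count every itemset of up to max_size items and keep the frequent ones.

    Brute force: fine for a handful of lines, unusable on real data sets.
    """
    counts = Counter()
    for line in lines:
        # De-duplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(line.split(",")))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    result = frequent_itemsets(TRANSACTIONS, min_support=3)
    for pattern, support in sorted(result.items(), key=lambda kv: -kv[1]):
        print(" ".join(pattern), support)
```

With -s 1, as in the command above, every itemset that occurs at all counts as frequent, so a higher threshold is usually more informative.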

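After the job finishes, the result looks garbled when viewed directly.

Cause: Mahout writes its result in serialized form (a Hadoop SequenceFile), so it has to be converted to a text file and saved locally before it can be read: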
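
A quick way to confirm that a downloaded output file really is a SequenceFile, rather than damaged text, is to look at its first bytes: Hadoop SequenceFiles begin with the three ASCII bytes SEQ followed by a version byte. The following is a minimal sketch under that assumption; the function names and the idea of copying the part file to a local path first are hypothetical, not part of Mahout.

```python
def looks_like_sequence_file(header: bytes) -> bool:
    """Return True if the given leading bytes carry the SequenceFile magic.

    A SequenceFile starts with b"SEQ" plus one version byte.
    """
    return len(header) >= 4 and header[:3] == b"SEQ"


def file_is_sequence_file(path: str) -> bool:
    """Check a local copy of the file (for example one fetched with hadoop fs -get)."""
    with open(path, "rb") as f:
        return looks_like_sequence_file(f.read(4))
```

If the check succeeds, the binary content is expected and seqdumper is the right tool to turn it into text.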

mahout seqdumper -i /out/frequentpatterns/part-r-00000 -o /home/mahout_test/fpresult1.txt

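The dumped result was still garbled.

Guess: every file Mahout processes may have to be a sequence file, and the problem may be that my input file is plain text rather than serialized.

I tried to convert it to a sequence file with the command mahout seqdirectory -i /test/ -o /test1/ -c UTF-8, but it failed with: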

Error: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
   at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:164)
   at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.<init>(CombineFileRecordReader.java:126)
   at org.apache.mahout.text.MultipleTextFileInputFormat.createRecordReader(MultipleTextFileInputFormat.java:43)
   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:155)
   ... 10 more
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
   at org.apache.mahout.text.WholeFileRecordReader.<init>(WholeFileRecordReader.java:59)
   ... 15 more

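I could not resolve this. The final IncompatibleClassChangeError on TaskAttemptContext is commonly a sign that the jar was compiled against Hadoop 1.x but is running on Hadoop 2.x, where TaskAttemptContext is an interface, which may be what happened here.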

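So I tried testing with the sample data that ships with Mahout instead.

The test sample is in the Mahout source tree, at F:\mahout\mahout-distribution-0.9-src\mahout-distribution-0.9\core\src\test\resources\retail.dat

Upload it to HDFS, then run the following command: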

hadoop jar /opt/mahout-distribution-0.9/mahout-examples-0.9-job.jar org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i /test/retail.dat -o /out2 -s 5 -method mapreduce

The result is written to /out2 on HDFS. Dump it to a local text file:

mahout seqdumper -i /out2/frequentpatterns/part-r-00000 -o /home/mahout_test/fpresult5.txt

Once it has been saved locally, open it with vi fpresult5.txt to see the result. Part of it is shown below:

Key: 10 : Value: ([10 ],6)
Key: 10 39 : Value: ([10 39 ],5)
Key: 1034 : Value: ([1034 ],5)
Key: 1146 : Value: ([1146 ],5)
Key: 13518 : Value: ([13518 ],15)
Key: 14098 : Value: ([14098 ],6)
Key: 14099 : Value: ([14099 ],6)
Key: 14386 : Value: ([14386 ],14)
Key: 15094 : Value: ([15094 ],11)
Key: 15685 : Value: ([15685 ],7)
Key: 15686 : Value: ([15686 ],7)
Key: 170 : Value: ([170 ],12)
Key: 2046 : Value: ([2046 ],19)
Key: 225 : Value: ([225 ],7)
Key: 225 2238 : Value: ([225 2238 ],6)
Key: 237 : Value: ([237 ],5)
Key: 286 : Value: ([286 ],14)
Key: 31 : Value: ([31 ],8)
Key: 32 : Value: ([32 ],273)
Key: 32 1046 : Value: ([32 1046 ],10)

The key is the frequent pattern; the value repeats the pattern together with the number of times it occurs.
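
For further processing, the dumped lines can be parsed back into structured data. The sketch below assumes the exact line layout shown above ("Key: ... : Value: ([items ],count)") and allows for a value holding more than one (pattern, count) pair; the names parse_line, LINE_RE and PATTERN_RE are illustrative and not part of Mahout.

```python
import re

# "Key: <key> : Value: <value>" as printed by seqdumper in the listing above.
LINE_RE = re.compile(r"^Key: (.*?) *: Value: (.*)$")
# One "([item item ],count)" group inside the value.
PATTERN_RE = re.compile(r"\(\[([^\]]*)\],(\d+)\)")


def parse_line(line):
    """Return (key, [(items, count), ...]) or None for lines that are not records."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # header or summary lines such as "Input Path: ..."
    key, value = match.groups()
    patterns = [(items.split(), int(count))
                for items, count in PATTERN_RE.findall(value)]
    return key, patterns


if __name__ == "__main__":
    print(parse_line("Key: 32 1046 : Value: ([32 1046 ],10)"))
```

Sorting the parsed records by count is then a one-liner, which is easier than scanning the text file by eye.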

If you only want to know how many results there are, add the -c option:

mahout seqdumper -i /out2/frequentpatterns/part-r-00000 -o /home/mahout_test/fpresult5.txt -c

fpresult5.txt then contains:

Input Path: /out2/frequentpatterns/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns
Count: 147

So there are 147 results.
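
If the count is needed in a script, it can be read from the summary text. This is a small sketch that assumes the "Count: N" line format shown above; read_count is a hypothetical helper name.

```python
import re


def read_count(summary_text):
    """Extract the integer from a 'Count: N' line, or None if no such line exists."""
    match = re.search(r"^Count:\s*(\d+)\s*$", summary_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "Input Path: /out2/frequentpatterns/part-r-00000\n"
        "Key class: class org.apache.hadoop.io.Text Value Class: "
        "class org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns\n"
        "Count: 147\n"
    )
    print(read_count(sample))
```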