An Overview of Web Mining Tools, with Detailed Descriptions

1)MALLET
A Machine Learning for Language Toolkit
http://mallet.cs.umass.edu/
“an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text”
Documentation is minimal, but the toolkit covers a lot:
Building feature vectors
Various classification methods (Naïve Bayes, max-ent, boosting, winnowing)
Evaluation: precision, recall, F1, etc.
N-grams
Selecting features using information gain
Includes some example front-end code
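To make two of the items above concrete, here is a minimal Python sketch of information gain for feature selection and of precision, recall and F1 for evaluation. It does not use MALLET's API; the function names and the toy spam/ham data are assumptions made up for illustration, and documents are simplified to sets of words.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Entropy reduction from knowing whether `term` occurs in a document.

    `docs` is a list of word sets, `labels` the matching class labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy corpus: two spam and two ham documents.
    docs = [{"cheap", "pills"}, {"cheap", "offer"},
            {"meeting", "notes"}, {"lunch", "notes"}]
    labels = ["spam", "spam", "ham", "ham"]
    for term in ("cheap", "pills"):
        print(term, round(information_gain(docs, labels, term), 3))
    print(precision_recall_f1(labels, ["spam", "ham", "spam", "ham"], "spam"))
```

A term that splits the classes perfectly (here "cheap") gets the highest gain, which is why ranking terms by this score is a common way to pick features.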

2)MinorThird
http://minorthird.sourceforge.net/
“a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text”
Documentation seems to be pretty good: comprehensive Javadocs, tutorial, FAQ…
Has the concept of “spans” (sequences of words) that can be extracted and classified based on content or context
Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
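The span and stand-off annotation ideas can be sketched roughly as follows. This is not MinorThird's API: `Span`, `LabelStore` and the helper functions are hypothetical names, and the sketch assumes a document is simply a list of tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A contiguous run of tokens in one document (hypothetical structure)."""
    doc_id: str
    start: int  # index of the first token in the span
    end: int    # index one past the last token


class LabelStore:
    """Annotations kept apart from the text, keyed by span."""

    def __init__(self):
        self._labels = {}

    def add(self, span, label_type, value):
        self._labels.setdefault(span, {})[label_type] = value

    def get(self, span, label_type):
        return self._labels.get(span, {}).get(label_type)

    def spans_with(self, label_type, value):
        return [s for s, props in self._labels.items()
                if props.get(label_type) == value]


def span_text(tokens, span):
    """Content of the span: the words it covers."""
    return " ".join(tokens[span.start:span.end])


def left_context(tokens, span, width=2):
    """Context of the span: up to `width` tokens immediately before it."""
    return tokens[max(0, span.start - width):span.start]


if __name__ == "__main__":
    tokens = "Dr. Smith visited Paris last week".split()
    labels = LabelStore()
    person = Span("doc1", 1, 2)
    labels.add(person, "entity", "PERSON")
    labels.add(person, "pos", "NNP")
    print(span_text(tokens, person), left_context(tokens, person))
    print(labels.spans_with("entity", "PERSON"))
```

Because the labels live in their own store, the same document can carry several independent annotation layers (part of speech, entity type, and so on) without the text itself being modified.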
3)Weka
http://www.cs.waikato.ac.nz/~ml/weka/
“Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.”
Has a GUI
Extensive documentation
Website lists a number of compatible datasets (regression and classification problems)
Also lists many Weka-related projects
4)CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
“a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters”
Partitional, agglomerative and graph-partitioning algorithms
Various similarity/distance metrics
Many options/tools for visualizing and summarizing clustering results
Claims to scale to hundreds of thousands of objects in tens of thousands of dimensions
wCluto: web-based application built on CLUTO
gCluto: cross-platform graphical application
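As an illustration of the kind of similarity metric such a package works with, here is a minimal Python sketch of cosine similarity over sparse vectors. It is not CLUTO code; representing each object as a dict from dimension name to weight is an assumption chosen to suit high-dimensional, mostly-zero data such as text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Iterate over the smaller vector so the cost depends on its size only.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = {"web": 2.0, "mining": 1.0}
    doc2 = {"web": 1.0, "mining": 1.0, "tools": 1.0}
    doc3 = {"clustering": 3.0}
    print(round(cosine_similarity(doc1, doc2), 3))
    print(cosine_similarity(doc1, doc3))
```

Only the non-zero entries are stored and visited, which is what keeps this practical when objects have tens of thousands of dimensions.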
5)MG4J
http://mg4j.dsi.unimi.it/
“a collaborative effort aimed at providing a free Java implementation of inverted-index compression techniques; as a by-product, it offers several general-purpose optimised classes, including fast & compact mutable strings, bit-level I/O, fast unsynchronised buffered streams, (possibly signed) minimal perfect hashing for very large strings collections, etc.”
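To show what inverted-index compression means in practice, here is a minimal Python sketch that stores a posting list as gaps between document IDs and writes each gap with an Elias gamma code. This is one classic textbook scheme, chosen as an assumption; it is not MG4J's API, and bits are kept in a string purely for readability where a real implementation would pack them.

```python
def to_gaps(doc_ids):
    """Turn sorted, distinct, non-negative doc IDs into positive gaps."""
    gaps, previous = [], -1
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: rebuild the doc IDs by a running sum."""
    doc_ids, current = [], -1
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


def gamma_encode(n):
    """Elias gamma code: (bit length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def compress(doc_ids):
    return "".join(gamma_encode(gap) for gap in to_gaps(doc_ids))


def decompress(bits):
    return from_gaps(gamma_decode_all(bits))


if __name__ == "__main__":
    postings = [3, 4, 10, 11]
    bits = compress(postings)
    print(bits, len(bits), "bits")
    print(decompress(bits))
```

Small gaps get short codes, so a term that appears in many nearby documents costs only a few bits per posting.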