1)MALLET
A Machine Learning for Language Toolkit
http://mallet.cs.umass.edu/
“an integrated collection of Java code useful for statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text”
Minimally documented, but covers a lot of ground:
Building feature vectors
Various classification methods (Naïve Bayes, maximum entropy, boosting, Winnow)
Evaluation: precision, recall, F1, etc.
N-grams
Selecting features using information gain
The distribution includes some examples of front-end code
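To make the evaluation measures listed above concrete, here is a minimal sketch in plain Java of precision, recall, and F1 for a single target label. It does not use MALLET's own API; the class name `EvalSketch`, the method `precisionRecallF1`, and the toy "spam"/"ham" labels are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of MALLET: per-label precision, recall and F1.
class EvalSketch {

    /** Returns {precision, recall, f1} for one target label. */
    static double[] precisionRecallF1(List<String> gold, List<String> predicted, String target) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("gold and predicted must have the same length");
        }
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < gold.size(); i++) {
            boolean isGold = target.equals(gold.get(i));
            boolean isPredicted = target.equals(predicted.get(i));
            if (isGold && isPredicted) tp++;        // correctly predicted target
            else if (!isGold && isPredicted) fp++;  // predicted target, but wrong
            else if (isGold && !isPredicted) fn++;  // missed a true target
        }
        // Convention assumed here: an undefined ratio (zero denominator) is reported as 0.
        double precision = (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
        double f1 = (precision + recall) == 0.0
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("spam", "spam", "ham", "ham", "spam");
        List<String> predicted = Arrays.asList("spam", "ham", "ham", "spam", "spam");
        double[] m = precisionRecallF1(gold, predicted, "spam");
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n", m[0], m[1], m[2]);
    }
}
```

In the toy data there are two true positives, one false positive, and one false negative for "spam", so all three measures come out to 2/3.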
2)MinorThird
http://minorthird.sourceforge.net/
“a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text”
Documentation seems to be pretty good: comprehensive Javadocs, tutorial, FAQ…
Has the concept of “spans” (sequences of words) that can be extracted and classified based on content or context
Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
3)Weka
http://www.cs.waikato.ac.nz/~ml/weka/
“Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.”
Has a GUI
Extensive documentation
Website lists a number of compatible datasets (regression and classification problems)
Also lists many Weka-related projects
4)CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
“a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters”
Partitional, agglomerative and graph-partitioning algorithms
Various similarity/distance metrics
Many options/tools for visualizing and summarizing clustering results
Claims to scale to hundreds of thousands of objects in tens of thousands of dimensions
wCluto: web-based application built on CLUTO
gCluto: cross-platform graphical application
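As an illustration of what a similarity metric over document vectors looks like, the sketch below computes cosine similarity between two sparse term-weight vectors in plain Java. It is not CLUTO code and says nothing about how CLUTO implements its metrics; the class name `CosineSketch` and the map-based vector representation are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of CLUTO: cosine similarity of sparse vectors.
class CosineSketch {

    /** Cosine of the angle between two sparse vectors keyed by term. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Convention assumed here: similarity with an all-zero vector is 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> doc1 = new HashMap<>();
        doc1.put("cluster", 1.0);
        doc1.put("graph", 1.0);
        Map<String, Double> doc2 = new HashMap<>();
        doc2.put("cluster", 1.0);
        System.out.println(cosine(doc1, doc2));
    }
}
```

Storing only non-zero terms keeps the computation proportional to the number of terms a document actually contains, which matters when the vocabulary runs to tens of thousands of dimensions.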
5)MG4J
http://mg4j.dsi.unimi.it/
“a collaborative effort aimed at providing a free Java implementation of inverted-index compression techniques; as a by-product, it offers several general-purpose optimised classes, including fast & compact mutable strings, bit-level I/O, fast unsynchronised buffered streams, (possibly signed) minimal perfect hashing for very large strings collections, etc.”
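For readers unfamiliar with inverted-index compression, the following plain-Java sketch shows the general idea: store the gaps between sorted document IDs rather than the IDs themselves, then write each gap in as few bytes as possible. Variable-byte coding is used here only as a simple stand-in; it is not claimed to be the scheme MG4J uses, and the class `PostingCompressionSketch` and its methods are hypothetical, not part of the MG4J API.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of MG4J: gap + variable-byte coding of a posting list.
class PostingCompressionSketch {

    /** Encodes non-negative, ascending document IDs as variable-byte gaps. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedDocIds) {
            int gap = id - previous; // small when IDs are dense
            previous = id;
            // Emit 7 bits at a time, low-order first; high bit set means "more bytes follow".
            while (gap >= 128) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): rebuilds gaps, then sums them back into document IDs. */
    static int[] decode(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;          // continuation byte: keep accumulating
            } else {
                previous += gap;     // last byte of this gap: recover the ID
                ids.add(previous);
                gap = 0;
                shift = 0;
            }
        }
        int[] result = new int[ids.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ids.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] postings = {3, 7, 150, 1000};
        byte[] compressed = encode(postings);
        System.out.println(compressed.length + " bytes instead of " + 4 * postings.length);
        System.out.println(Arrays.toString(decode(compressed)));
    }
}
```

For the four IDs in `main`, the gaps are 3, 4, 143, and 850, which take 1, 1, 2, and 2 bytes respectively, so the list fits in 6 bytes rather than 16.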