There are numerous other software packages relevant to machine learning and text that in various ways are related to MALLET:
- NLTK also has "Classifier" and "ClassifierTrainer" classes for plug-and-play classifiers, as well as implementations of Naive Bayes, MaxEnt, feature selection, a "Token" class, and finite state transducers with iterators over transitions. In addition, it has facilities for tagging, parsing, and information extraction.
- OpenNLP is also a Java package intended for text processing. It also has rich pipelines: chains of individual Foo2Bar-style pipe components that can be arbitrarily configured and plugged together.
- JavaNLP (from Chris Manning's group at Stanford).
- BioJava: The BioJava Project is an open-source project dedicated to providing Java tools for processing biological data. This will include objects for manipulating sequences, file parsers, CORBA interoperability, DAS, access to ACeDB, dynamic programming, and simple statistical routines, to name just a few things.
- There is information available about various Finite State Machine software, including pointers to the AT&T Finite State package, which also has very general finite state transducers with iterators over transitions, arbitrary transition costs, and generalized implementations of Viterbi and forward-backward. In addition, it has epsilon transitions, composition, and much more.
- Weka: Plug-and-play machine learning components in Java, including classes for "Classifier", "NaiveBayes", "DecisionStump", "LogisticRegression", etc. It also has methods for splitting training sets, nice evaluation tools, and GUI components to boot.
- Orange: component-based data mining software in C++, including SVM, logistic regression, clustering, and lots more.
- Libbow also has mechanisms for feature extraction pipelines, plug-and-play classifiers, feature selection, and clustering. It is written in C.
- COLT: High Performance Scientific Computing in Java.
- There are Java applets available for various machine learning algorithms.
- Bayesian Networks in Java.
- Fei Sha and Fernando Pereira have written a CRF implementation in Java.
- Jonathan Baxter wrote much machine learning code in the late 1990s, including a rich "optimization" package.
- Ray Mooney has also been writing machine learning code in Java lately.
- Alias-i's LingPipe 2.0 is an open-source Java toolkit including scalable, high-throughput implementations of entity extraction, sentence boundary annotation, and within-document coreference. There is an extensively documented Java API, and configurable commands for handling plain text, HTML, or XML. Version 2.0 adds character and token language models, multiclass and binary classification, hierarchical clustering, spelling correction, general statistical estimators, and utilities for parsing MEDLINE.
- Zhang Le has an (apparently very fast) implementation of Maximum Entropy training in C++.
- General Architecture for Text Engineering (GATE) is Java GPL software that also supports automatic semantic tagging, where the tagging is performed against ontologies [1].
- KIM is a powerful semantic annotation engine [2].
If you know about (or have written) another relevant toolkit, please feel free to add it to this page.
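
Several of the entries above describe pipelines built from small "Foo2Bar" steps that can be plugged together. As a rough illustration of that idea only, here is a minimal sketch using nothing but the Java standard library. The class name `PipelineSketch` and the step names are hypothetical and are not taken from MALLET, OpenNLP, Libbow, or any other toolkit listed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration of chained "Foo2Bar" pipe steps.
// None of these names come from a real toolkit.
public class PipelineSketch {

    // Step 1: raw text -> lowercased text.
    static final Function<String, String> TEXT_2_LOWERCASE =
        s -> s.toLowerCase(Locale.ROOT);

    // Step 2: text -> list of whitespace-separated tokens.
    static final Function<String, List<String>> TEXT_2_TOKENS =
        s -> Arrays.asList(s.trim().split("\\s+"));

    // Step 3: tokens -> tokens longer than two characters.
    static final Function<List<String>, List<String>> TOKENS_2_LONG_TOKENS =
        tokens -> tokens.stream()
                        .filter(t -> t.length() > 2)
                        .collect(Collectors.toList());

    // Plug the steps together; each step's output type is the next step's input type.
    static final Function<String, List<String>> PIPELINE =
        TEXT_2_LOWERCASE.andThen(TEXT_2_TOKENS).andThen(TOKENS_2_LONG_TOKENS);

    public static void main(String[] args) {
        // Prints [the, cat, sat, mat]
        System.out.println(PIPELINE.apply("The cat sat on a MAT"));
    }
}
```

Because each step is an ordinary `Function`, steps can be reordered or swapped wherever the input and output types line up. That is the "arbitrarily configured" property in miniature. The real toolkits typically pass richer instance objects between steps than plain strings and lists.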