Open-Source Projects for Text Classification and Clustering

http://mallet.cs.umass.edu/index.php/Similar_software

From Mallet

There are numerous other software packages relevant to machine learning and text that in various ways are related to MALLET:

  • NLTK (http://nltk.sourceforge.net/) also has "Classifier" and "ClassifierTrainer" classes for plug-and-play classifiers, as well as implementations of Naive Bayes, MaxEnt, feature selection, a "Token" class, and finite state transducers with iterators over transitions. In addition, it has facilities for tagging, parsing, and information extraction.
  • OpenNLP (http://opennlp.sourceforge.net/) is also a Java package intended for text processing. It also has rich pipelines: chains of individual Foo2Bar pipe components that can be arbitrarily configured and plugged together.
  • JavaNLP (http://www-nlp.stanford.edu/javanlp/), from Chris Manning's group at Stanford.
  • BioJava (http://www.biojava.org/). The BioJava Project is an open-source project dedicated to providing Java tools for processing biological data. This will include objects for manipulating sequences, file parsers, CORBA interoperability, DAS, access to ACeDB, dynamic programming, and simple statistical routines to name just a few things.
  • There is information about various finite state machine software at http://www.cs.jhu.edu/~jason/405/software.html, including pointers to the AT&T Finite State package, which also has very general finite state transducers with iterators over transitions, arbitrary transition costs, and generalized implementations of Viterbi and Forward-Backward. In addition, it has epsilon transitions, composition, and much more.
  • Weka: Plug-and-play machine learning components in Java http://www.cs.waikato.ac.nz/~ml/weka, including classes for "Classifier", "NaiveBayes", "DecisionStump", "LogisticRegression", etc. It also has methods for splitting training sets, and nice evaluation tools, and GUI components to boot.
  • Orange, component-based data mining software in C++, includes SVM, logistic regression, clustering, and lots more: http://magix.fri.uni-lj.si/orange/.
  • Libbow http://www.cs.cmu.edu/~mccallum/bow also has mechanisms for feature extraction pipelines, plug-and-play classifiers, feature selection, clustering. It is written in C.
  • COLT: High Performance Scientific Computing in Java: http://tilde-hoschek.home.cern.ch/~hoschek/colt/.
  • There are Java applets for various machine learning algorithms at http://www.cse.unsw.edu.au/~cs9417.
  • Bayesian Networks in Java: http://www-2.cs.cmu.edu/~javabayes.
  • Fei Sha and Fernando Pereira have written a CRF implementation in Java.
  • Jonathan Baxter (http://www.panscient.com/) wrote much machine learning code in the late 1990s, including a rich "optimization" package.
  • Ray Mooney (http://www.cs.utexas.edu/~mooney) has also been writing machine learning code in Java lately.
  • Alias-i's LingPipe 2.0 (http://www.alias-i.com/lingpipe) is an open-source Java toolkit including scalable high throughput implementations of entity extraction, sentence boundary annotation and within-document coreference. There is an extensively documented Java API, and configurable commands for handling plain text, HTML or XML. Version 2.0 adds character and token language models, multiclass and binary classification, hierarchical clustering, spelling correction, general statistical estimators, and utilities for parsing MEDLINE.
  • Zhang Le has an (apparently very fast) implementation of Maximum Entropy training in C++ (http://homepages.inf.ed.ac.uk/s0450736/maxent.html#soft).
  • Generalised Architecture for Text Engineering (GATE) is Java GPL software that also supports automatic semantic tagging, where the tagging is performed against ontologies (http://www.gate.ac.uk/).
  • KIM is a powerful semantic annotation engine (http://www.ontotext.com/kim/semanticannotation.html).

If you know about (or have written) another relevant toolkit, please feel free to add it to this page.

Original post: http://hi.baidu.com/phpasp/blog/item/2dc3a834d9bbd5b2d0a2d3a2.html
