Introducing a bioinformatics toolkit

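While skimming recent papers on bioinformatics software, I noticed that many of them rely on toolkits. This post introduces one of them, Bow. I could not open some of the others, but there are still plenty of toolkits for SVMs and related methods.

For example, SVM light http://svmlight.joachims.org/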

PASBio http://research.nii.ac.jp/~collier/projects/PASBio/

ROSTLAB http://rostlab.org/cms/index.php?id=94

Stanford Parser http://nlp.stanford.edu/downloads/lex-parser.shtml

Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students.

The name of the library rhymes with `low', not `cow'.

About the library

The library provides facilities for:

  • Recursively descending directories, finding text files.
  • Finding `document' boundaries when there are multiple documents per file.
  • Tokenizing a text file, according to several different methods.
  • Including N-grams among the tokens.
  • Mapping strings to integers and back again, very efficiently.
  • Building a sparse matrix of document/token counts.
  • Pruning vocabulary by word counts or by information gain.
  • Building and manipulating word vectors.
  • Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
  • Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
  • Scoring queries for retrieval or classification.
  • Writing all data structures to disk in a compact format.
  • Reading the document/token matrix from disk in an efficient, sparse fashion.
  • Performing test/train splits, and automatic classification tests.
  • Operating in server mode, receiving and answering queries over a socket.
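
To make a few of these facilities concrete, here is a minimal Python sketch of three of them: mapping token strings to integers and back, building a sparse document/token count matrix, and setting TFIDF weights. It is not libbow's C API. The `Vocabulary` class, the function names, the regex tokenizer and the `log(N / df)` weighting are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


class Vocabulary:
    """Hypothetical two-way map between token strings and integer ids."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def add(self, token):
        # Assign the next free id the first time a token is seen.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.id_to_token)
            self.id_to_token.append(token)
        return self.token_to_id[token]


def tokenize(text):
    # One simple tokenizing method among many: lowercase runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def build_counts(documents, vocab):
    # Sparse matrix stored as one {token_id: count} dict per document.
    return [dict(Counter(vocab.add(t) for t in tokenize(doc)))
            for doc in documents]


def tfidf(rows):
    # Weight each count by log(N / document frequency) (assumed variant).
    n_docs = len(rows)
    doc_freq = Counter(tid for row in rows for tid in row)
    return [
        {tid: count * math.log(n_docs / doc_freq[tid])
         for tid, count in row.items()}
        for row in rows
    ]


docs = ["the cat sat", "the dog sat down", "the cat ran"]
vocab = Vocabulary()
counts = build_counts(docs, vocab)
weights = tfidf(counts)
```

Because "the" occurs in every document, its TFIDF weight comes out as zero, while a word that occurs in a single document, such as "dog", receives the largest weight.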

The library does not:

  • Have English parsing or part-of-speech tagging facilities.
  • Do smoothing across N-gram models.
  • Claim to be finished.
  • Have good documentation.
  • Claim to be bug-free.

It is known to compile on most UNIX systems, including Linux, Solaris, SunOS, IRIX and HP-UX. Over a year ago, it compiled on Windows NT (with a GNU build environment); it no longer does, but probably could with small fixes. Patches to the code are most welcome. It is developed on a Linux system.

The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL).

Citation

You are welcome to use the code under the terms of the license for research or commercial purposes; however, please acknowledge its use with a citation:

   McCallum, Andrew Kachites.  "Bow: A toolkit for statistical language
   modeling, text retrieval, classification and clustering."
   http://www.cs.cmu.edu/~mccallum/bow.  1996.

Here is a BibTeX entry:

   @unpublished{McCallumLibbow,
      author = "Andrew Kachites McCallum",
      title = "Bow: A toolkit for statistical language modeling, 
               text retrieval, classification and clustering",
      note = "http://www.cs.cmu.edu/~mccallum/bow",
      year = 1996}

Obtaining the Source

Source code for the library can be downloaded from this directory. Each version is labeled with an eight-digit sequence giving its year, month and day, so the most recent version is the one with the largest number.
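
Since the eight-digit stamps sort by date, choosing the newest archive is a numeric comparison. The sketch below shows one way to do it in Python. The function name and the file names in the example are hypothetical and are not taken from the actual download directory.

```python
import re


def latest_version(filenames):
    """Return the name carrying the largest eight-digit date stamp."""
    dated = []
    for name in filenames:
        # Match exactly eight digits that are not part of a longer number.
        match = re.search(r"(?<!\d)(\d{8})(?!\d)", name)
        if match:
            dated.append((int(match.group(1)), name))
    if not dated:
        return None  # no name carried a date stamp
    return max(dated)[1]


# Hypothetical file names, for illustration only.
names = ["bow-19980103.tar.gz", "bow-20020213.tar.gz", "README"]
newest = latest_version(names)
```

Names without an eight-digit stamp, such as `README` here, are ignored.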

Unfortunately I do not have time to help rainbow's many users with all their compilation and usage problems. Feel free to send me mail asking for help, but please do not necessarily expect me to have time to help. Most appreciated are bug reports accompanied by fixes.

Bow Library Front-Ends

The library source distribution currently includes three executable programs based on the library.

  • Rainbow is an executable program that does document classification. While mostly designed for classification by naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing and K-nearest neighbor.
  • Arrow is an executable program that does document retrieval. It currently only performs simple TFIDF-based retrieval.
  • Crossbow is an executable program that does document clustering (and also classification).
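
To give a feel for the kind of classification Rainbow is mostly designed for, here is a minimal Python sketch of naive Bayes with Laplace smoothing. It is not Rainbow's implementation or interface. The multinomial model, the function names and the toy training data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (label, list_of_tokens) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for label, tokens in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(tokens, model):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior plus the sum of log word likelihoods.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen in training
            # Laplace smoothing: add one to every word count.
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example.
training = [
    ("sports", "ball game team win".split()),
    ("sports", "team score game".split()),
    ("politics", "vote election party".split()),
    ("politics", "party vote win".split()),
]
model = train(training)
```

Calling `classify("game team".split(), model)` picks the class whose smoothed word probabilities best explain the query. The add-one term keeps a word that is missing from one class from driving that class's probability to zero.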