Tools
Machine Translation systems
Instructions
- Wonderful pages about how to download a bunch of tools and some data and put them together to build a very competent baseline statistical MT system: NAACL 2006 WMT or 2009 WMT.
Freely downloadable
- System from 1999 JHU workshop. Mainly of historical interest.
- Franz Och. C++. GPL.
- Phrase-based model building kit
- An Open-Source Java Statistical Phrase-Based MT Decoder
- A new open-source phrase-based MT decoder with functionality beyond Pharaoh.
- Andreas Zollmann and Ashish Venugopal
Free, but getting them requires hassle
- Philipp Koehn, ISI.
- Machine Translation Tool Kit. Deng and Byrne.
Part of Speech Taggers
Freely downloadable
- Loglinear tagger in Java (by Kristina Toutanova)
- An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml. Comes with pre-compiled models. Runs on Linux, Mac OS X, and Windows.
- Based on TiMBL
- A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
- POS Tagger based on SVMs (uses SVMlight). LGPL.
- Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
- Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
- A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
- An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things by Torbjörn Lager. Web demo also available. Prolog.
- SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
- An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
- Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
- A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the Wikipedia page or you could try a reimplementation such as fnTBL.
- A Common Lisp HMM tagger available by ftp.
- Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
Free, but require registration
- The ISSCO tagger. HMM tagger. Need to register to download.
- Online registration.
- Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.
Usable by email or on the web, but not distributed freely
- From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
- Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
- The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though; this site is more of an advertisement for licensing the real thing, which is available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.
- The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
- Tags any of 14 languages (European and Arabic), online on the web.
Not free
- Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
- Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
- Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
- Infogistics, an Edinburgh spinoff, has a tagger and NP/verb group chunker available commercially, including an evaluation version.
No longer available
- The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools which no longer seem to be available.
NP chunking
Downloadable
- SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
- A Java reimplementation of Ramshaw and Marcus (1995).
- A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
Generic sequence models
Downloadable
- Generic CRF-based model in C++. Open source. By the author of YamCha.
- Generic CRF-based sequence models in OCaml. Open source. By Ben Wellner.
- A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, parsing.
Parsers
Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.
Semantic Parsers
Downloadable
- PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
- FrameNet-based by Katrin Erk.
- A general package, but it has particularly been used for SRL.
Named Entity Recognition
Downloadable
- A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.
- Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
- SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
Coreference (Anaphora) Resolution
Downloadable
- A Beautiful Anaphora Resolution Toolkit. Java. By Yannick Versley and many others. Apache with GPL components.
- Java. GPL.
Language modeling toolkits
Downloadable
Downloadable, but requires registration
- by Andreas Stolcke is another good system for building language models, freely available for research purposes.
Not yet classified
- A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/ . Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/
Friendly concordancing and text analysis tools
- The thing to get if you are working in the Windows world.
Text summarization tools
- A public domain portable multi-document summarization system. (Dragomir Radev and others.)
Other
Downloadable
- Tilburg's Memory Based Learner by Walter Daelemans et al. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications.
- TIMEX2 standard taggers (site at Mitre).
- An open source Python package for NLP application development, with tools for tokenization, POS tagging, and parsing, by Ed Loper and Steven Bird.
- Ngram Statistics Package: Perl code that implements: Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats, and semantic distances derived via WordNet.
- The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
- by Hugo WL ter Doest.
- A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.
- WordNet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages, an Italian/English/Spanish MultiWordNet, and there's now a site for Global WordNet. (See also Mappings between WordNet versions and the Perl WordNet-Similarity module by Ted Pedersen, and WordNet Domains (coarse-grained sense topic classifications).)
- A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
- A collection of various tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).
- Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis, veronis@univ-aix.fr. Some stuff including a multilingual text editor is downloadable. MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.
- Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
- Text Data Mining API from Lehigh University.
- Japanese morphological analyzer. Descendant of JUMAN.
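Two of the association measures named in the Ngram Statistics Package entry above, the Dice coefficient and (pointwise) mutual information, can be illustrated with a small sketch. This is not the package's Perl code: it is a minimal Python illustration under stated assumptions. The function name `bigram_scores` and the toy sentence are hypothetical, and word marginals are taken from plain unigram counts, which is a simplification of how a real collocation tool would tabulate them.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (dice, pmi)} for adjacent word pairs in tokens.

    Simplifying assumption: marginal counts come from unigram frequencies
    rather than from a full bigram contingency table.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), n12 in bigrams.items():
        # Dice coefficient: twice the joint count over the sum of the marginals.
        dice = 2.0 * n12 / (unigrams[w1] + unigrams[w2])
        # Pointwise mutual information: log2 of P(w1, w2) / (P(w1) * P(w2)).
        p12 = n12 / n_bigrams
        p1 = unigrams[w1] / n_tokens
        p2 = unigrams[w2] / n_tokens
        pmi = math.log2(p12 / (p1 * p2))
        scores[(w1, w2)] = (dice, pmi)
    return scores


if __name__ == "__main__":
    toy = "new york is a new city in new york".split()
    ranked = sorted(bigram_scores(toy).items(), key=lambda kv: -kv[1][0])
    for pair, (dice, pmi) in ranked:
        print(pair, round(dice, 3), round(pmi, 3))
```

Both scores rise when two words occur together more often than their individual frequencies would suggest; on corpora this small the numbers are only illustrative.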
Free, but require registration
- A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
- University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
- A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
- SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).
Unsure
- A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein, silberz@ladl.jussieu.fr.
The PennTools page collects information on a variety of NLP systems, many of which are available externally.
Corpora
Large collections aimed at the NLP community
- Email: ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
- Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
- Sells various corpora (including Brown and London-Lund). Information on the corpora is available on the web, by sending the message "help" to fileserv@nora.hd.uib.no, or by ftp to nora.hd.uib.no. Manuals for these corpora are also available.
- TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
- Email: lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a postscript file.
- Provides mainly literary texts. Has a bright new web site. Email: info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
- Sentence collections in MySQL database for 17 mainly European languages.
- A 100 million word corpus of British English. You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee. And now, an XML edition.
- A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. You need to sign a license agreement, available at the WWW site. Also available from the LDC.
- At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
- Million word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
- This link provides its own annotations.
- Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.
- Rich video and transcripts.
Particular languages
English
English language corpora available from the sites above are not repeated here.
- The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).
- A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
- 2 million words from faculty and committee meetings and White House press conferences (50K word sample free on internet).
Chinese
Chinese language corpora available from the sites above are not repeated here.
- By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.
Multilingual
- A parallel corpus of EU documents across all member states. 8 million words or more in each of 20 languages.
- Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. Downloadable.
- An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
- Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)
- From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal.
- Parallel text in all EU languages. (In particular try European legislation.)
- Parallel and other text in Central and Eastern European languages.
Bosnian
Czech
- Literature translations in Czech and English
- 100 million words of contemporary Czech.
French
- Various French literary works.
- 150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).
German
- Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.
- Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Available free of charge to academics.
Russian
- 150 million words, 5 million words POS-tagged, some in dependency treebank.
- Various literary works.
Slovene
- 1 M words, free to download + on-line concordances.
Spanish and Portuguese
- Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
- It's not clear what their availability is.
- Morphosyntactically tagged telecommunication manuals, available by ftp.
- In total about 70 million words, available free, from various sources (newswire, etc.)
- 4 annual CDROMs with full text.
- Portuguese-English parallel corpus. (In general, there are various resources at the Linguateca site.)
Swedish
- Has various searchable part-of-speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.
Treebanks
Name | Language | Size | Availability | Comments |
---|---|---|---|---|
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG) |
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking |
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material. |
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures. |
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch. |
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus. |
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines. |
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts. |
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependencies analyses are available (some are free on the web, others via a license agreement) | An under construction Bulgarian HPSG treebank. |
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax. |
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus. |
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand-corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format. |
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s. |
- Deriving Linguistic Resources from Treebanks.
Treebanks
CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.
Resources for Word Sense Disambiguation
- Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.
- Includes various WSD systems.
- Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
- Judgments on how similar the meanings of synsets are and how common they are in the BNC from Jordan Boyd-Graber.
Literature
There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:
Entirely or mainly English
- Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
- Extensive and good quality. Still in the gopher age, though.
- The index here only covers books in English, but there are lots of links to other collections of material in all languages.
- The oldest and largest project to get out-of-copyright literature online, freely available. (Or see the mirror, Sailor's Project Gutenberg site.)
- Large collection of SGML text, mainly in English, but also in other major languages.
- Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.
- Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
- From University of Wolverhampton (R. Mitkov, C. Barbu et al.).
Acquisition data
- Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.
SGML/XML
- This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones usi