A list of NLP tools and resources compiled by the Stanford NLP Group

This article is reposted from: http://fuliang.iteye.com/blog/1882983

 

Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources

 

Contents

 

* Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Coreference, Language models, Concordances, Summarization, Other
* Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
* SGML/XML
* Dictionaries
* Lexical/morphological resources
* Courses, Syllabi, and other Educational Resources
* Mailing lists
* Other stuff on the Web: General, IR, IE/Wrappers, People, Societies

 

Tools

 

Machine Translation systems

 

Instructions

 

* Building a baseline statistical phrase MT system
Wonderful pages about how to download a bunch of tools and some data and put them together to build a very competent baseline statistical MT system:  NAACL 2006 WMT or  2009 WMT.

 

Freely downloadable

 

* EGYPT system
System from 1999 JHU workshop. Mainly of historical interest.
* GIZA++ and  mkcls
Franz Och. C++. GPL.
* Thot
Phrase-based model building kit
* Phramer
An Open-Source Java Statistical Phrase-Based MT Decoder
* Moses
A new open-source phrase-based MT decoder with functionality beyond Pharaoh.
* Syntax Augmented Machine Translation via Chart Parsing
Andreas Zollmann and Ashish Venugopal

 

Free, but getting them requires hassle

 

* Pharaoh decoder
Philipp Koehn, ISI.
* MTTK
Machine Translation Tool Kit. Deng and Byrne.

 

Part of Speech Taggers

 

Freely downloadable

 

* Stanford POS tagger
Loglinear tagger in Java (by Kristina Toutanova)
* hunpos
An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml. Comes with pre-compiled models. Runs on Linux, Mac OS X, and Windows.
* MBT: Memory-based Tagger
Based on TiMBL
* TreeTagger
A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
* SVMTool
POS Tagger based on SVMs (uses SVMlight). LGPL.
* ACOPOST (formerly ICOPOST)
Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
* MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger
Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
* fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
* mu-TBL
An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things by Torbjörn Lager. Web demo also available. Prolog.
* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
* QTAG Part of speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
* The TOSCA/LOB tagger.
Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
* The venerable Brill's Transformation-based learning Tagger
A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the  Wikipedia page or you could try a reimplementation such as  fnTBL.
* Original Xerox Tagger
A common lisp HMM tagger available by  ftp.
* Lingua-EN-Tagger
Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
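
Several of the taggers above (hunpos, QTAG, the Xerox tagger, Lingua-EN-Tagger) are described as HMM taggers. As a rough illustration of how such a tagger picks a tag sequence, here is a minimal Viterbi decoder for a bigram HMM in Python. It is only a sketch and is not taken from any of the packages listed: the function name `viterbi`, the three-tag tag set, the `floor` probability for unseen events, and all the numbers are made up for the example rather than estimated from a corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Probabilities missing from a table fall back to `floor` (an assumption
    made here for simplicity; real taggers use proper smoothing).
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p, (p, t))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p, (t, words[i])), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy, hand-set parameters (illustrative only).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}
EMIT = {
    ("DET", "the"): 0.7,
    ("NOUN", "dog"): 0.4,
    ("VERB", "barks"): 0.3, ("VERB", "dog"): 0.01,
}

if __name__ == "__main__":
    print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))
```

With these toy numbers, "dog" comes out as a noun rather than a verb because the DET-to-NOUN transition and the noun emission both outweigh the verb reading. The trigram taggers in the list (TnT, ACOPOST) condition on two previous tags instead of one, but the dynamic programming idea is the same.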

 

Free, but require registration

 

* TATOO
The ISSCO tagger. HMM tagger. Need to register to download.
* PoSTech Korean morphological analyzer and tagger
Online registration.
* TnT - A Statistical Part-of-Speech Tagger
Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.

 

Usable by email or on the web, but not distributed freely

 

* Memory-based tagger
From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene.  Other MBL demos are also available.
* Birmingham tagger
Accepts only  plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
* CLAWS tagger
The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though -- this site is more of an advertisement for licensing the real thing -- available as software for Suns or as a paid service.) You can also find info on  CLAWS tagsets, though that page doesn't seem to link to the  C7 tagset.
* The AMALGAM tagger
The  AMALGAM Project also has various other useful resources, in particular  a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
* Xerox XRCE MLTT Part Of Speech Taggers
Tags any of 14 languages (European and Arabic), online on the web.
* Portuguese taggers on the web:  Projecto Natura and  a QTAG adaptation.

 

Not free

 

* Lingsoft
Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing  info@lingsoft.fi. There is an  online demo.
* Conexor
Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
* Xerox
Xerox has morphological analyzers and taggers for many languages. There are  demos of some of their tools on the web. More information can be obtained by contacting  Daniella Russo.
* Infogistics
Infogistics, an Edinburgh spinoff, has a tagging and NP/Verb group chunker available commercially, including an evaluation version.

 

No longer available

 

* LT POS and LT TTT
The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools which no longer seem to be available.

 

NP chunking

 

Downloadable

 

* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
* Mark Greenwood's Noun Phrase Chunker
A Java reimplementation of Ramshaw and Marcus (1995).
* fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

 

Generic sequence models

 

Downloadable

 

* CRF++
Generic CRF-based model in C++. Open source. By the author of YamCha.
* Carafe
Generic CRF-based sequence models in OCaml. Open source. By Ben Wellner.
* FreeLing
A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, parsing.

 

Parsers

 

Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.

 

Semantic Parsers

 

Downloadable

 

* ASSERT
PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
* Shalmaneser
FrameNet-based by Katrin Erk.
* Tree Kernels in SVMlight by Alessandro Moschitti.
A general package, but it has particularly been used for SRL.

 

Named Entity Recognition

 

Downloadable

 

* Stanford Named Entity Recognizer
A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.
* LingPipe
Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
* YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

 

Coreference (Anaphora) Resolution

 

Downloadable

 

* BART
A Beautiful Anaphora Resolution Toolkit. Java. By Yannick Versley and many others. Apache with GPL components.
* Guitar
Java. GPL.

 

Language modeling toolkits

 

Downloadable

 

* IRSTLM Toolkit
Compatible with SRILM, suitable for very large language models. LGPL. By Marcello Federico, Nicola Bertoldi et al.
* CMU-Cambridge Statistical Language Modeling toolkit

 

Downloadable, but requires registration

 

* The  SRI Language Modeling toolkit
by Andreas Stolcke is another good system for building language models, freely available for research purposes.

 

Not yet classified

 

* Lextools
A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from:  http://www.research.att.com/sw/tools/fsm/

 

Friendly concordancing and text analysis tools

 

* Wordsmith Tools (Mike Scott)
The thing to get if you are working in the Windows world.

 

Text summarization tools

 

* A prototype Java Summarisation applet (System Quirk)
* MEAD
A public domain portable multi-document summarization system. (Dragomir Radev and others.)

 

Other

 

Downloadable

 

* Tilburg University's  TiMBL
Tilburg's Memory Based Learner by Walter Daelemans et al. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications.
* splitta
Statistical sentence boundary detection by Dan Gillick.
* Time Expression taggers
TIMEX2 standard taggers (site at Mitre).
* NLTK
An open source Python package for NLP application development with tools such as tokenization, POS tagging and parsers by Ed Loper and Steven Bird.
* Ted Pedersen's code
Ngram Statistics Package: Perl code that implements: Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats, and semantic distances derived via WordNet.
* ISIP tools
The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
* Mem. A Perl implementation of Generalized and Improved Iterative Scaling
by Hugo WL ter Doest.
* Automorphology
A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.
* Wordnet
Wordnet is available by  ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages, an  Italian/English/Spanish MultiWordNet and there's now a site for  Global Wordnet. (See also  Mappings between WordNet versions and  Perl WordNet-Similarity module by Ted Pedersen, and  WordNet Domains (coarse-grained sense topic classifications).)
* Penn XTAG project
A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
* Dan Melamed's Assorted Tools
A collection of various tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).
* MULTEXT
Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis veronis@univ-aix.fr. Some stuff including a multilingual text editor is downloadable.  MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.
* Naive Bayes algorithm
Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
* HDDI
Text Data Mining API from Lehigh University.
* Emdros: a text database engine for linguistic analysis and research
* Chasen
Japanese morphological analyzer. Descendant of JUMAN.
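
The Ngram Statistics Package entry above names several word-association measures, among them the Dice Coefficient and Mutual Information. To make two of them concrete, here is a small Python sketch that computes them from bigram counts. It is not the Perl package's code or interface: the function names are invented for this example, and "mutual information" is read here as pointwise mutual information, which is one common interpretation for word pairs.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs, plus how often each word occurs in first and second position."""
    pairs = list(zip(tokens, tokens[1:]))
    joint = Counter(pairs)
    first = Counter(w1 for w1, _ in pairs)
    second = Counter(w2 for _, w2 in pairs)
    return joint, first, second, len(pairs)


def dice(n11, n1p, np1):
    """Dice coefficient: 2 * joint count / (count of w1 as first word + count of w2 as second word)."""
    return 2.0 * n11 / (n1p + np1)


def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information in bits: log2 of P(w1, w2) / (P(w1) * P(w2))."""
    return math.log2((n11 * npp) / (n1p * np1))


if __name__ == "__main__":
    # A tiny made-up text, just to exercise the functions.
    tokens = "new york is in new york state".split()
    joint, first, second, total = bigram_counts(tokens)
    n11 = joint[("new", "york")]
    print(dice(n11, first["new"], second["york"]))
    print(pmi(n11, first["new"], second["york"], total))
```

On counts this small the scores mean little; the measures only become informative over a reasonably large corpus, and low-count pairs tend to get inflated PMI values.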

 

Free, but require registration

 

* Stuttgart's  IMS Corpus Workbench (CWB)
A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
* Gate
University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
* MITRE's  Alembic Workbench
A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
* SNoW
SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).

 

Unsure

 

* INTEX
A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein silberz@ladl.jussieu.fr

 

The PennTools page collects information on a variety of NLP systems, many of which are available externally.

 

Corpora

 

Large collections aimed at the NLP community

 

* LDC (Linguistic Data Consortium) and its  catalogue by year.
Email:  ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an  LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
* European Language Resources Association and its  catalogue.
Distribution agency is  ELDA. Rapidly growing collection of materials in European languages.
* ICAME (International Computer Archive of Modern English)
Sells various corpora (including Brown and London-Lund). Information on corpora on the web, by sending the message help to fileserv@nora.hd.uib.no, by ftp to nora.hd.uib.no. Also, manuals for these corpora.
* Reuters @ NIST
Reuters corpora are now distributed by NIST.
* TRACTOR
TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
* CLR (Consortium for Lexical Research)
Email:  lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to  clr.nmsu.edu. Their  catalog is available as a postscript file.
* OTA (Oxford Text Archive)
Provides mainly literary texts. Has a bright new web site. Email:  info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to  ota.ox.ac.uk. Some require negotiations with the providers.
* Leipzig Corpora Collection
Sentence collections in MySQL database for 17 mainly European languages.
* BNC (British National Corpus)
A 100 million word corpus of British English. You can  search it online from their simple web interface or via  View, a much better interface by Mark Davies, and there is an  index to genres by David Lee. And now, an  XML edition.
* European Corpus Initiative Multilingual Corpus I (ECI/MCI)
A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. You need to sign a license agreement, available at the WWW site. Also available from the LDC.
* Survey of English Usage
At the Department of English Language and Literature at University College London. Includes the  British part of ICE, the  International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also,  Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
* International Corpus of English (ICE)
Million word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
* Corpora held by Lancaster University
This link provides its own annotations.
* The European Language Activity Network
Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.
* Talkbank.
Rich video and transcripts.

 

Particular languages

 

English

 

English language corpora available from the sites above are not repeated here.

 

* Corpora by Geoffrey Sampson's team
The  SUSANNE corpus and the  CHRISTINE corpus (SUSANNE markup of a speech corpus).
* Michigan Corpus of Academic Spoken English (MICASE). 1.7 million words from 1997-2001.
* Penn-Helsinki Parsed Corpus of Middle English
A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
* Corpus of Professional, Spoken American-English (CPSA)
2 million words from faculty and committee meetings and White House press conferences (50K word sample free on internet).
* Lancaster Parsed Corpus
* Dialogue Diversity Corpus (Bill Mann)
* American National Corpus

 

Chinese

 

Chinese language corpora available from the sites above are not repeated here.

 

* The Lancaster Corpus of Mandarin Chinese (LCMC)
By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.

 

Multilingual

 

* JRC-Acquis
A parallel corpus of EU documents across all member states. 8 million words or more in each of 20 languages.
* EMILLE/CIIL
Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. Downloadable.
* OPUS
An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
* World Health Organization Computer Assisted Translation page.
Also includes a good selection of links on Computer Assisted Translation. (See also  the copyright page.)
* Searchable Canadian Hansard French-English parallel texts (1986-1993)
From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal
* European Union web server
Parallel text in all EU languages. (In particular try  European legislation.)
* TELRI CD-ROMs
Parallel and other text in Central and Eastern European languages.

 

Bosnian

 

* The Oslo Corpus of Bosnian Texts.

 

Czech

 

* Parallel Czech-English
Literature translations in Czech and English
* Czech National Corpus project: SYN2000
100 million words of contemporary Czech.

 

French

 

* Association des Bibliophiles Universels
Various French literary works.
* American and French Research on the Treasury of the French Language (ARTFL)
150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).

 

German

 

* COSMAS Corpus
Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection
* NEGRA Corpus
Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Free of charge for academic use.

 

Russian

 

* Russian National Corpus
150 million words, 5 million words POS-tagged, some in dependency treebank.
* Library of Russian Internet Libraries
Various literary works.

 

Slovene

 

* Slovene-English parallel corpus
1 M words, free to download + on-line concordances.
* Coming soon:  Slovene reference corpus of 100 M words

 

Croatian

 

* Croatian National Corpus
100 M words

 

Spanish and Portuguese

 

* TychoBrahe Parsed Corpus of Historical Portuguese
Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
* Information about Mark Davies' collection of (mainly historical) Spanish and Portuguese.
It's not clear what their availability is.
* The CUMBRE corpus. Contact Professor Aquilino Sánchez
* The CRATER Spanish corpus
Morphosyntactically tagged telecommunication manuals. Available by ftp.
* Corpus resources for Portuguese
In total about 70 million words, available free, from various sources (newswire, etc.)
* Folha de S. Paulo newspaper
4 annual CDROMs with full text.
* COMPARA
Portuguese-English parallel corpus. (In general, there are various resources at the Linguateca site.)
* See also under ELRA, above.

 

Swedish

 

* Spraakdata, Department of Swedish, Göteborg University.
Has various searchable part of speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.

 

Treebanks

 

| Name | Language | Size | Availability | Comments |
| --- | --- | --- | --- | --- |
| Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG) |
| BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking |
| ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material. |
| Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependency analyses are available (some are free on the web, others via a license agreement) | An under construction Bulgarian HPSG treebank. |
| Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax. |
| The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines. |
| Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus. |
| Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus. |
| NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures. |
| TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch. |
| Icelandic Parsed Historical Corpus (IcePaHC) | Icelandic | 1,000,000 words | Free download (LGPL) | Texts from 1150 through 2008! |
| TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts. |
| Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format. |
| Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s. |

 

* Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data
* Syntactic Spanish Database (SDB) University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
* CKIP Chinese Treebank (Taiwan). Based on Academia Sinica corpus. (There's also a 100 sentence Chinese treebank at U. Maryland.)
* LDC Korean Treebank.
* Dublin-Essex Treebank project
Deriving Linguistic Resources from Treebanks.

 

Discourse

 

CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.

 

Resources for Word Sense Disambiguation

 

* The  Senseval web site
Has a comprehensive selection of resources for WSD, including a good  list of WSD data resources, but not yet the  new SEMCOR.
* Ted Pedersen's code
Includes various WSD systems.
* SenseClusters
Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
* Evocation WordNet synset similarity judgments
Judgments on how similar the meanings of synsets are and how common they are in the BNC from Jordan Boyd-Graber.

 

Literature

 

There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:

 

Entirely or mainly English

 

* Alex: A Catalogue of Electronic Texts on the Internet
Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
* Wiretap Electronic Text Archive
Extensive and good quality. Still in the gopher age, though.
* The On-line Books Page
The index here only covers books in English, but there are lots of links to other collections of material in all languages.
* Project Gutenberg
The oldest and largest project to get out of copyright literature online, freely available. (Or see the mirror,  Sailor's Project Gutenberg site.)
* The Electronic Text Center of the University of Virginia
Large collection of SGML text, mainly in English, but also in other major languages.
* Center for Electronic Texts in the Humanities
Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.
* Oxford Electronic Text Library Editions
Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in  Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
* Coreference annotated texts
From the University of Wolverhampton (R. Mitkov, C. Barbu et al.).

 

Acquisition data

 

* CHILDES database.
Database of child language transcriptions in English and many other languages. Texts are also available by  ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.

 

SGML/XML

 

* Robin Cover's  SGML/XML Web Page
This is a wonderful compendium of information on SGML and XML, including  information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).
* Information about the Text Encoding Initiative (TEI). (The Pizza Chef acts as a TEI tag set selector.)
* Xaira
XML Aware Indexing and Retrieval Application. The successor of SARA.
* Microsoft's XML page
* W3C XML page.
* The Corpus Encoding Standard.
An SGML instance designed for language engineering applications. Also the  XML version.

 

Dictionaries

 

Dictionaries of subcategorization frames

 

The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).

 

* COBUILD
Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).
* LDOCE
Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.
* OALD
Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different incompatible format. However, a  partial version of the third edition (with this information) is available free online from the Oxford Text Archive.

 

Not exactly a dictionary, but other popular sources are:

 

* Levin (1993)
Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc.) not made by the above dictionaries. The index of verbs is online.
* English subcategorization evaluation resources
Gold standard data, from Cambridge University (Anna Korhonen)

 

See also COMLEX and CELEX available from the LDC.

 

Dictionaries of assorted languages on the web

 

* The old version of Robert Beard's Web of Online Dictionaries long ago mutated into  YourDictionary.com. I'm told the IPO has been delayed. Nevertheless, it's the most comprehensive index of dictionaries available on the web.

 

Names

 

U.S. names with frequency information are available from the Census Bureau.

 

SGML structured dictionaries

 

* Cambridge International Dictionary of English and other products in SGML.

 

Lexical/morphological resources

 

* English SENSEVAL Resources
Dictionary entries and tagged examples for 35 words.
* ARIES Natural Language Tools
Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.

 

Courses, Syllabi, and other Educational Resources

 

"Techie"

 

* Foundations of Statistical Natural Language Processing 
Some information about, and sample chapters from, Christopher Manning and Hinrich Schütze's new textbook, published in June 1999 by MIT Press. Read about  courses using this book.
* Corpus-based Linguistics
Christopher Manning's Fall 1994 CMU course syllabus (a postscript file).
* Statistical NLP: Theory and Practice
Christopher Manning's Spring 1996 CMU course materials.
* John Lafferty and Roni Rosenfeld's Spring 1997 CMU course Language and Statistics.
* Boston University (John D. Burger and Lynette Hirschman)
A good course and web site, by the looks!
* Draft of Data-Intensive Linguistics
By Chris Brew and Marc Moens.
* Statistical Natural Language Processing course
By Joakim Nivre. Elsnet supported.
* Short Course: Statistical Methods in NLP
By Philip Resnik
* Linguist's Guide to Statistics by Brigitte Krenn and Christer Samuelsson.
* Statistical and Corpora Based Methods for Processing Natural Languages
By Alon Itai, Technion Computer Science Department. (Don't read those old drafts of mine though ... get the real thing!)
* CS 241 Statistical Models in Natural-Language Processing
Eugene Charniak, Brown University.
* Michael Littman, Duke: 1997, 1998.

 

"Corpus Linguistics"

 

* A tutorial on concordances and corpora by Cathy Ball
* Tony Berber Sardinha's Corpus Linguistics course
Powerpoint slides in an interesting mixture of English and Portuguese (plus the rest of his homepage!)
* Concordancing and corpus linguistics
Notes prepared by Phil Benson, Hong Kong University.
* Computational Approaches to Collocations
Discussion of all the measures that have been used, and software for calculating them. By Evert and Krenn.

 

Mailing lists

 

Mailing lists that have information on these topics include:

 

* Corpora
The main mailing list for info on corpus-based linguistics. Subscribe by sending the message: subscribe corpora to listserv@uib.no. Or if you want to subscribe with a different email address, send: subscribe corpora email-address (Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!). Or you can subscribe on the web.
* Empiricist
The empiricist list appears to be defunct now. You used to send a "subscribe" message to empiricists-request@unagi.cis.upenn.edu.

 

Other stuff on the Web

 

General resources

 

* NIST Human Language Technology programs
Including: TREC, TIDES, ACE, ....
* Text summarization
Tons of resources (tutorials, bibliographies, and software) for document summarization, maintained by Dragomir Radev.
* PropositionBank @ UPenn
* Statistical MT
* Bookmarks for Corpus-based Linguists
An extensive annotated collection by David Lee, aimed at linguistics more than NLP (includes web-searchable corpora and concordancing options).
* HLTCentral
European site aiming to increase transfer of language technologies to the commercial market. News, etc.
* Linguistic annotation
A description of formats for linguistic annotation by Steven Bird.
* CTI Textual Studies, University of Oxford, Guide to Digital Resources
Lists text analysis tools, corpora, and other stuff.
* U. Essex W3-Corpora
Lots of teaching material, links, and online corpora.
* Computational Linguistics and NLP (Kenji Kita, Tokushima U.)
A good, well-organized list of CL references, concentrating on corpus-based and statistical NLP methods. See also Software tools for NLP.
* HLT Central
European Human Language Technology site
* Survey of the State of the Art in Human Language Technology
* ACL SIGLEX list of Lexical Resources
* Online materials for a course on Learning Dynamical Systems at Brown University.
Lots of neat info.
* Expert Advisory Group for Language Engineering Standards (EAGLES) home page
European standards organization.
* Materials prepared for Michael Barlow's Corpus Linguistics course
* Corpus Linguistics University of Birmingham
* Chris Brew's Teaching Materials for statistical NLP
Not much there last time I looked; you might also try his home page.
* Edinburgh LTG HelpDesk's FAQ
Many of the questions in the FAQ concern issues related to corpora and tagging.
* Content Analysis Resources
Qualitative Text Analysis, Concordances, etc.
* MT paper archive
Lots of papers, etc.

 

Information Retrieval

 

* The SMART IR system
* ACM SIGIR
* Managing Gigabytes
* TREC conference
* Text-based Intelligent Systems (Bruce Croft)

 

Information Extraction/Wrapper Induction

 

* Introduction to Information Extraction Technology. A tutorial by Douglas E. Appelt and David Israel.
* IE data sets
Updated versions (i.e., now well-formed XML) of classic IE data sets: Seminar Announcements and Corporate Acquisitions.
* Web -> KB. CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic model IE work, and links to data sets.
* RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea). Appears to not have been updated since 1999.
* Message Understanding Conference (MUC) information. A US government funded information extraction exercise (from the 1990s).
* Web IR and IE (Einat Amitay). Various links on IR and IE on the web.
* Web question answering system (University of Michigan)
* GATE: General Architecture for Text Engineering (Sheffield)
* Genia Project. Biomedical text information extraction corpus (Tsujii lab). And IE tutorial slides.

 

People's homepages

 

Home pages with something useful on them.

 

* University of Texas at Austin Machine Learning Research Group
* Steven Abney (until 1997)
* Adam Berger
Various stuff on statistical MT and maximum entropy models
* Alex Chengyu Fang
Provides a lot of info on the kinds of things they get up to at UCL, without actually giving you anything to play with yourself.

 

Societies/Journals

 

* International Quantitative Linguistics Association/Journal of Quantitative Linguistics
Not very hip.
* Association for Computational Linguistics/Computational Linguistics
Hipper