Some Useful Corpora

Suggested corpora and resources, in English unless stated otherwise
(not all of them are free of charge)

Genre-specific corpora:
- Genre: SMS Messages = NUS SMS corpus:
http://wing.comp.nus.edu.sg:8080/SMSCorpus/  (English / Chinese)

- Genre: chatlogs = CODIAC chatlogs
( http://data.eol.ucar.edu/codiac/dss/id=92.124 ;
http://data.eol.ucar.edu/codiac/dss/id=88.044 ;
http://data.eol.ucar.edu/codiac/dss/id=107.010 )

- Genre: chatlogs = Many Eyes datasets: some chatlogs can be found
here:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets

- Genre: chats and Switchboard conversations = Samples of the
Switchboard corpus and the NPS Chat corpus are included in NLTK data
( http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml ). The NPS
Chat corpus ( http://faculty.nps.edu/cmartell/NPSChat.htm ) is a
POS-tagged chat corpus, and the Switchboard corpus
( http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html ) is a corpus of
telephone conversations.
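
As a minimal sketch of how the NLTK samples mentioned above might be explored, the following counts POS tags over tagged chat posts. It assumes NLTK is installed and that the NPS Chat sample has been fetched with nltk.download("nps_chat"). The helper names tag_counts and load_nps_chat_tagged_posts are illustrative and not part of NLTK.

```python
from collections import Counter


def tag_counts(tagged_posts):
    """Count POS tags over an iterable of posts, each a list of (word, tag) pairs."""
    counts = Counter()
    for post in tagged_posts:
        for _word, tag in post:
            counts[tag] += 1
    return counts


def load_nps_chat_tagged_posts():
    """Return the POS-tagged posts from NLTK's NPS Chat sample.

    Assumption: NLTK is installed and nltk.download("nps_chat") has been run.
    """
    # Imported lazily so that tag_counts can be used without NLTK installed.
    from nltk.corpus import nps_chat
    return nps_chat.tagged_posts()
```

Calling tag_counts(load_nps_chat_tagged_posts()).most_common(5) would then list the most frequent tags in the sample. The Switchboard sample is exposed in the same way, as nltk.corpus.switchboard.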

- The Linguistic Data Consortium has a large collection of telephone
conversations: many files in a variety of languages. See
http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text
(not free)

- Genre: blogs = The corporate weblogs dataset in the TREC datasets
( http://ir.dcs.gla.ac.uk/test_collections/ ) is not free. Helpful wiki:
http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG
- Genre: corporate blogs = It is possible to pull corporate blog feeds
or scrape the blogs from this list:
http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/
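
Pulling blog feeds, as suggested above, could be sketched as follows. This is a rough illustration that assumes the blogs publish plain RSS 2.0 feeds (Atom feeds use different, namespaced tags and would need separate handling). The function names parse_rss_items and fetch_feed are hypothetical, and the feed URL of each blog has to be looked up by hand.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_rss_items(rss_xml):
    """Extract (title, link) pairs from the <item> elements of an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        # findtext returns None when the child element is missing.
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
    return items


def fetch_feed(url, timeout=10):
    """Download a feed and return its raw bytes (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

The output of fetch_feed can be passed directly to parse_rss_items, since ET.fromstring accepts both bytes and text.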

- The Göteborg Spoken Language Corpus and other corpora in
Swedish ( http://spraakbanken.gu.se/ )

- Genre: tweets = The Twitter corpus associated with the paper
www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is
available here: https://sites.google.com/site/twittersentimenthelp/for-researchers

- Genre: tweets and other microblogs = Microblog track
( http://sites.google.com/site/trecmicroblogtrack/ ) (not free)

- Genre: newswires = Reuters newswire collections
( http://trec.nist.gov/data/reuters/reuters.html )

- Genre: emails = Enron corpus ( http://www.cs.cmu.edu/~enron/ );
categorized Enron emails ( http://sgi.nu/enron/corpora.php )
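
A minimal sketch of reading an email corpus such as Enron with the Python standard library is given below. It assumes the download has been unpacked into a directory tree with one plain-text, RFC 822 style message per file, and that messages are single-part. The names parse_message and iter_messages are illustrative only.

```python
from email import message_from_string
from pathlib import Path


def parse_message(raw_text):
    """Parse one raw message into a dict of a few headers plus the body."""
    msg = message_from_string(raw_text)
    # Assumption: single-part plain-text messages; multipart bodies are skipped.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "body": body,
    }


def iter_messages(root_dir):
    """Yield a parsed message for every file found under root_dir."""
    for path in sorted(Path(root_dir).rglob("*")):
        if path.is_file():
            # latin-1 maps every byte to a character, so decoding never fails.
            yield parse_message(path.read_text(encoding="latin-1"))
```

Iterating over iter_messages on the unpacked corpus directory gives one dictionary per message, which is usually enough to start building a word list or a sender/recipient table.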

- Genre: emails = Junk email corpus
( http://clg.wlv.ac.uk/resources/junk-emails/index.php )

- Genre: FAQs = 200 FAQs
( http://www.itri.brighton.ac.uk/~Marina.Santini/#Download )

Resources:
- In terms of words and concepts, there are two main resources for
English. The first is WordNet, originally from Princeton; it is included
in NLTK (and can also be obtained separately). It organizes English
words according to their relationships: synonym, hyponym, part of a
whole, etc. The other resource is word association norms, available
from the University of South Florida
( http://w3.usf.edu/FreeAssociation/ ).
- Article: Hella Koo Finding: Twitter Dialect -
http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/
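
For the WordNet resource described above, the relationships it encodes can be inspected through NLTK roughly as follows. This is a sketch that assumes NLTK is installed and that nltk.download("wordnet") has been run. The helper names summarize_synset and lookup are illustrative, while lemmas, hyponyms and part_meronyms are methods of NLTK's synset objects.

```python
def summarize_synset(synset):
    """Collect the relations mentioned in the text for one WordNet synset."""
    return {
        "synonyms": sorted(lemma.name() for lemma in synset.lemmas()),
        "hyponyms": sorted(s.name() for s in synset.hyponyms()),
        "part_meronyms": sorted(s.name() for s in synset.part_meronyms()),
    }


def lookup(word):
    """Summarize every WordNet synset of a word.

    Assumption: NLTK is installed and the WordNet data has been downloaded.
    """
    # Imported lazily so that summarize_synset can be used without NLTK.
    from nltk.corpus import wordnet
    return [summarize_synset(s) for s in wordnet.synsets(word)]
```

A call such as lookup("tree") returns one dictionary per sense of the word, listing its synonyms, its more specific terms, and its parts.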

- Genre: tweets = the suggestion is to use the Twitter API to crawl a
Twitter dataset.
- DiscoverText is a program that makes it easy to collect Twitter
feeds. Their website is here:
http://discovertext.com/defaultDT2.aspx
A free 30-day trial is available and is enough to collect a batch of
Twitter messages.

Note:
Genre: tweets = The Edinburgh Tweets corpus has been withdrawn:
http://demeter.inf.ed.ac.uk/
