Some Useful Corpora

kite1988

Tags: blogs resources collections dataset sms website

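Copyright notice: this is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/kite1988/article/details/7207044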

Suggested corpora and resources, in English unless stated otherwise
(not all of them are free of charge)

Genre-specific corpora:
- Genre: SMS Messages = NUS SMS corpus:
http://wing.comp.nus.edu.sg:8080/SMSCorpus/ (English / Chinese)

- Genre: chatlogs = CODIAC chatlogs
( http://data.eol.ucar.edu/codiac/dss/id=92.124 ;
http://data.eol.ucar.edu/codiac/dss/id=88.044 ;
http://data.eol.ucar.edu/codiac/dss/id=107.010 )

- Genre: chatlogs = Many Eyes datasets: some chatlogs can be found
here:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets

- Genre: chats and Switchboard conversations =
samples of the Switchboard corpus and the NPS Chat corpus are included in
NLTK data ( http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml ). The
NPS Chat corpus ( http://faculty.nps.edu/cmartell/NPSChat.htm ) is a
POS-tagged chat corpus, and the Switchboard corpus
( http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html ) is a corpus of
telephone conversations.

- The Linguistic Data Consortium has a good deal of telephone
conversation data: many files in a variety of languages. See
http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text
(not free)

- Genre: blogs = The corporate weblogs dataset in the TREC datasets
( http://ir.dcs.gla.ac.uk/test_collections/ ) is not free. Helpful wiki:
http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG
- Genre: corporate blogs = It is possible to pull corporate blog feeds
or scrape the blogs from this list:
http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/

- The Göteborg Spoken Language Corpus and other corpora in
Swedish ( http://spraakbanken.gu.se/ )

- Genre: tweets = The Twitter corpus associated with the paper
www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is
available here: https://sites.google.com/site/twittersentimenthelp/for-researchers

- Genre: tweets and other microblogs = MicroBlog track
http://sites.google.com/site/trecmicroblogtrack/ (not free)

- Genre: newswire = Reuters newswire collections
( http://trec.nist.gov/data/reuters/reuters.html )

- Genre: emails = Enron corpus ( http://www.cs.cmu.edu/~enron/ );
categorized Enron emails ( http://sgi.nu/enron/corpora.php )

- Genre: emails = Junk email corpus
( http://clg.wlv.ac.uk/resources/junk-emails/index.php )

- Genre: FAQs = 200 FAQs
( http://www.itri.brighton.ac.uk/~Marina.Santini/#Download )
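
The corporate blogs entry above suggests pulling blog feeds. A minimal sketch of that idea in Python follows, assuming the blog publishes a standard RSS 2.0 feed. The function names and the sample feed are hypothetical and only for illustration; Atom feeds use different element names and would need different handling.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_rss_items(xml_text):
    """Extract title, link, and description from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return posts


def fetch_feed(url, timeout=10):
    """Download a feed and return its parsed items (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_rss_items(response.read())


# Hypothetical sample feed, used here so the sketch runs offline.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Corporate Blog</title>
  <item>
    <title>Product update</title>
    <link>http://example.com/blog/product-update</link>
    <description>We shipped a new release.</description>
  </item>
</channel></rss>"""

posts = parse_rss_items(SAMPLE_FEED)
```

To build a small corpus this way, one would call `fetch_feed` on each feed URL gathered from the list and store the returned dictionaries. Feeds usually expose only recent posts, so scraping the blog pages is still needed for older material.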

Resources:
- For words and concepts, there are two main resources for English.
The first is WordNet, originally from Princeton; it is included in NLTK
(and can also be obtained separately). It organizes English words
according to their relationships: synonym, hyponym, part of a whole
(meronym), etc. The other resource is the Word Association Norms,
available from the University of South Florida
( http://w3.usf.edu/FreeAssociation/ ).
- Article: Hella Koo Finding: Twitter Dialect -
http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/
- Genre: tweets = the suggestion is to use the Twitter API to crawl a
Twitter dataset.
- DiscoverText is a program that makes it easy to collect Twitter
feeds. Their website is here:
http://discovertext.com/defaultDT2.aspx
One can sign up for a free 30-day trial and collect a batch of Twitter messages.
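
The WordNet relationships mentioned in the list above can be inspected through NLTK. The sketch below assumes NLTK is installed and the WordNet data has been fetched once with `nltk.download("wordnet")`. The function name `describe_relations` is hypothetical; the WordNet calls it uses (`synsets`, `lemma_names`, `hyponyms`, `part_meronyms`) are part of NLTK's WordNet reader.

```python
def describe_relations(word, max_senses=2):
    """Return synonyms, hyponyms, and part meronyms for the first senses of `word`.

    Requires NLTK and its 'wordnet' data; the import is done lazily so that
    this module can still be loaded on a machine without NLTK.
    """
    from nltk.corpus import wordnet as wn

    rows = []
    for synset in wn.synsets(word)[:max_senses]:
        rows.append({
            "synset": synset.name(),
            # Lemmas that share this sense, i.e. synonyms.
            "synonyms": synset.lemma_names(),
            # More specific concepts ("is a kind of").
            "hyponyms": [s.name() for s in synset.hyponyms()],
            # Parts of the whole ("is a piece of").
            "part_meronyms": [s.name() for s in synset.part_meronyms()],
        })
    return rows
```

Calling `describe_relations("tree")` returns one dictionary per sense. The exact entries depend on the installed WordNet version, so no expected output is shown here.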

Note:
Genre: tweets = The Edinburgh Tweets corpus has been withdrawn:
http://demeter.inf.ed.ac.uk/
