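(Update, April 13: Over the past two days I checked this word list's coverage against some online articles and a set of GMAT materials. The coverage of its 20,000 words really is very high: almost everything was covered, and only one or two very unusual words could not be found. The word families attached to its first 5,000 words probably add up to more than 10,000 words; if you can use them fluently, your English is already quite good.)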
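For anyone who wants to repeat this kind of coverage check, here is a minimal Python sketch. It is only an illustration, not the method used for the check described above: the function name `coverage`, the tiny word list, and the sample sentence are all made up, and it matches exact surface forms only, so it ignores word families.

```python
import re

def coverage(text, wordlist):
    """Return (share of tokens found in wordlist, sorted list of missing words)."""
    # Lowercase the text and keep alphabetic tokens (with an optional apostrophe part).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0, []
    known = {w.lower() for w in wordlist}
    covered = sum(1 for t in tokens if t in known)
    missing = sorted({t for t in tokens if t not in known})
    return covered / len(tokens), missing

# Hypothetical data, for illustration only.
toy_list = ["the", "cat", "sat", "dog", "slept"]
ratio, missing = coverage("The cat sat whilst the dog slept", toy_list)
print(f"{ratio:.0%} covered, missing: {missing}")
```

With a real word list you would load the 20,000 entries from a file and run the function over a longer article; the list of missing words shows which words fall outside it.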
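I am preparing for an MBA program that starts in August, so lately I have been deliberately searching online for word lists to strengthen my vocabulary. Many of the obscure words in the GMAT and GRE word lists appear only in specialized writing and are rarely used in everyday study and life, so learning them is not very efficient. I then found a 6,138-word frequency list that is very popular online, but my head was spinning before I finished it: for one thing it is drawn from British English, and for another its spellings are old-fashioned, with words such as "whilst". In modern American usage, "whilst" surely falls outside the top 20,000 words, which shows how dated that list is. Persistence paid off, though: I finally found a brand-new word list that comes from CCAE (the Corpus of Contemporary American English, which the project's own website abbreviates as COCA).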
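The American CCAE project is still ongoing and has so far collected 400 million words of text. This base material includes the most widely read fiction and magazines of the twenty years from 1990 to 2009 ("TIME", "New Yorker", and others take part in the project), films and television programs, a large number of telephone and face-to-face conversation transcripts, and even the 9/11 report. The project classifies and counts this material statistically by period, text type, and so on, which amounts to compiling a new dictionary of American English usage with word frequencies and common usages.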
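The best thing about this word list is that each word comes not only with its frequency and synonyms but also with its "collocate set". A collocate set is the collection of words that are most closely associated with a given word and appear most densely around it. With it, we can see the dozens of most common usages and contexts that Americans have for that word. For example, in the collocate set for "break", the first four common neighboring words are law, heart, news, and rule, so we can guess that the most frequent uses of the word are "break law", "break heart", "breaking news", and "break the rule". For building a feel for the language, this is many times more useful than the example sentences in a dictionary. Below is the site's own English description of its features, or you can simply read it on the website at http://www.wordfrequency.info.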
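To make the idea of a collocate set concrete, here is a minimal Python sketch that counts the words appearing near a target word. It is an assumption-laden toy, not the method the project uses: the function name `collocate_counts`, the window size, the stop-word set, and the sample sentences are all invented for illustration, and real collocate lists are normally built from a large corpus with lemmatization and statistical weighting.

```python
import re
from collections import Counter

def collocate_counts(text, node, window=2, stopwords=frozenset()):
    """Count words that occur within `window` tokens of each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Look at up to `window` tokens on each side of the node word.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] not in stopwords:
                counts[tokens[j]] += 1
    return counts

# Hypothetical sample text, for illustration only.
SAMPLE = (
    "You should never break the law. "
    "Do not break my heart. "
    "He did break the law again. "
    "They break the rule often."
)
STOPWORDS = frozenset({"the", "my", "do", "not", "he", "did", "they",
                       "you", "should", "never", "again"})

print(collocate_counts(SAMPLE, "break", stopwords=STOPWORDS).most_common())
```

On this toy text "law" comes out ahead of "heart" and "rule", which is the same kind of ranking the collocate set gives for "break", only on a vastly smaller scale.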
In addition, if you help them by posting a message on a large forum for English learners to spread the word (one post is enough) and then email them the link, you can get the e-book with the 5,000-word frequency list and collocate sets for free. The printed edition of the book can also be bought on AMAZON.
For now, this is the best word list I have seen.
COMPARE (to data from the British National Corpus / American National Corpus)
There are many English word lists and frequency lists out on the Web. Some are good, some are very bad. Not all frequency lists are created equal. One should be very, very suspicious of word lists that are taken from small samples of web data, outdated texts, or corpora that are too small to effectively model what is happening in the real world. Or worse, word lists that don't give you any idea what they are based on. As the saying goes: "garbage in (bad texts), garbage out (frequency lists)".
Rather than focusing too much on a comparison with specific wordlists that are out there on the Web, here's some questions you might ask yourself as you consider downloading or purchasing a word list:
Depth and accuracy. Why do so many wordlists on the web contain just the top 1000-3000 words of English? Why not the top 10,000 or 20,000? It's because even a bad corpus (the collection of texts that the word lists are based on) can produce a moderately accurate list for the very most frequent words. But because the corpus is neither deep nor balanced enough, you start getting messy data for medium and lower frequency words. Ask to see samples of the top 10,000 or 20,000 words (e.g. every 7th or 10th word). If they don't have it, then you should be very, very suspicious of that word list.
Genres. Does the corpus contain texts from a wide variety of genres -- spoken, fiction, popular magazines, newspapers, and academic journals? Frequency lists that are based on just one of these may only contain 40-50% of the words from a more balanced corpus. Our frequency list is based on the Corpus of Contemporary American English (COCA), which is almost perfectly balanced across genres.
Size. COCA contains more than 400 million words, and each of the top 20,000 words occurs at least 300 times. In a small 10-20 million word corpus, some of these words would occur just 7-8 times. At that point, the lower frequency words might make it into the list "by chance", whereas others are left out. No such problem with COCA.
How recent is it? Language change happens. If the word list is based on 15-20 year-old texts (or much worse, 100 year old public domain novels), then it will be missing many of the words from the modern language. COCA is based on texts from 1990-2009 (20 million words each year) -- or in other words, virtually right up to the current time.
Is it just a bare wordlist? Word lists are nice, but to be really useful (especially for language learning) there ought to be some indication of what these words mean and how they are used. Most of our frequency lists contain the top 20-30 collocates (nearby words) for each word in the list, which creates a great "sketch" of each word.
--------------------------------------------------------------------------------
Summary. There are many word frequency lists out on the web. Some are just OK, and some are truly bad. The frequency lists that we have created are the only ones that are based on a large, recent, and balanced corpus of English, and which provide indications of the meaning and use of each word.
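The corpus-size point in the quoted text can be checked with simple proportional scaling. The Python sketch below assumes that a word's count shrinks in direct proportion to corpus size, which is a simplification; the function name `expected_count` is made up for illustration.

```python
def expected_count(count_in_big, big_size, small_size):
    """Scale a word's count from a large corpus down to a smaller one, proportionally."""
    return count_in_big * small_size / big_size

# A word seen 300 times in a 400-million-word corpus:
print(expected_count(300, 400_000_000, 10_000_000))  # 7.5
print(expected_count(300, 400_000_000, 20_000_000))  # 15.0
```

Under this assumption, a word at the 300-occurrence threshold would appear about 7.5 times in a 10-million-word corpus, in line with the quoted "7-8 times", and about 15 times in a 20-million-word corpus. Counts that low are easily distorted by a handful of texts, which is the "by chance" problem the quoted text describes.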