[1] Dunning, T. 1993. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (Mar. 1993), 61-74.
The structure of the paper is:
1. Introduction, 2. The assumption of normality, 3. The tradition of Chi-squared tests, 4. Binomial distributions for text analysis, 5. Likelihood ratio tests, 6. Practical results, 7. Conclusions, 8. Summary of formulae, 9. References
Ted divides past work on text analysis into three categories: the first relies on enormous volumes of text to obtain good measures; the second works with small volumes of text and either corrects empirically for the resulting error or ignores the issue; the third performs no statistical analysis at all. To address the problems with these older methods, Ted presents a "practical measure that is motivated by statistical considerations and that can be used in a number of settings." The measure "has better asymptotic behavior than more traditional measures." After working through the statistical theory behind the likelihood ratio test, Ted reports practical results. These show that "parametric statistical analysis based on the binomial or multinomial distribution extends the applicability of statistical methods to much smaller texts than models using normal distributions and shows good promise in early applications of the method".
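To make the likelihood ratio test more concrete for myself, here is a minimal Python sketch of the statistic for comparing two binomial samples, as I understand it from the paper's summary of formulae: -2 log λ = 2 [log L(p1, k1, n1) + log L(p2, k2, n2) - log L(p, k1, n1) - log L(p, k2, n2)], where L(p, k, n) = p^k (1 - p)^(n - k), p1 = k1/n1, p2 = k2/n2 and p = (k1 + k2)/(n1 + n2). The function names `log_l` and `llr` and the counts in the usage lines are my own, chosen for illustration only; they do not come from the paper.

```python
import math


def log_l(p, k, n):
    """Log of the binomial likelihood kernel p^k * (1 - p)^(n - k).

    Terms with a zero exponent are skipped, so 0 * log(0) counts as 0.
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def llr(k1, n1, k2, n2):
    """-2 log(lambda) for the hypothesis that both samples share one rate.

    k1 successes out of n1 trials in the first sample, k2 out of n2 in
    the second. Both n1 and n2 must be positive.
    """
    p1 = k1 / n1                   # rate estimated from sample 1 alone
    p2 = k2 / n2                   # rate estimated from sample 2 alone
    p = (k1 + k2) / (n1 + n2)      # pooled rate under the null hypothesis
    return 2.0 * (
        log_l(p1, k1, n1) + log_l(p2, k2, n2)
        - log_l(p, k1, n1) - log_l(p, k2, n2)
    )


# Hypothetical counts: a word seen 10 times in 100 tokens of one text
# and once in 100 tokens of another.
print(llr(10, 100, 1, 100))
# Identical rates give a statistic of zero.
print(llr(10, 100, 10, 100))
```

A larger value means the two observed rates are less plausibly the same. Under the null hypothesis the statistic is asymptotically chi-squared distributed with one degree of freedom in this two-sample case, which is what lets it be turned into a significance level.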
This paper was recommended by Razvan, but I don't think I get it.