Mutual reinforcement: Massive quantities of data; the new norm of evaluation

Introduction to the Special Issue on CL Using Large Corpora

— Church & Mercer, 1993


The data-intensive approach to language, which is becoming known as Text Analysis, takes a pragmatic approach that is well suited to meet the recent emphasis on numerical evaluations and concrete deliverables [A Pendulum Swung Too Far]. Text Analysis focuses on broad (though possibly superficial) coverage of unrestricted text, rather than deep analysis of (artificially) restricted domains.


Section 1: The Influence from the Speech Community

A consensus: Stochastic methods based on Shannon’s noisy channel model are outperforming knowledge-based approaches based on hand-crafted transition rules.

Back in the 1970s, the more data-intensive methods were probably beyond the means of many researchers, especially those working in universities. Perhaps some researchers turned to the knowledge-based approach because they couldn’t afford the alternative. It is an interesting fact that most of the authors of knowledge-based papers have a university affiliation whereas most of the authors of the data-intensive papers have an industrial affiliation.

The Raleigh System: A Foundation for the Revival of Empiricism

In the midst of all of this excitement over high-level knowledge-based NLP techniques, IBM formed a speech group around the nucleus of an existing group (moved from Raleigh to Yorktown Heights early in 1972) that had been designed in accordance with the prevailing anti-empiricist attitudes of the time, though it would soon serve as a foundation for the revival of empiricism in the speech and language communities.

The back end of the Raleigh system converted labels into a sequence of words using an artificial finite-state grammar so small that the FSM could be written down on a single piece of paper. It was designed to overcome problems such as missed rapid phones and wrongly segmented stressed vowels by applying a complicated set of hand-tuned penalties and bonuses to the various paths, in order to favor those paths where the low-level acoustics matched the high-level grammatical constraints. Most systems now use parameters trained on real data rather than such a complicated set of hand-tuned rules.

In a radical departure from the prevailing attitudes of the time, the Yorktown group turned to Shannon's theory of communication in the presence of noise and recast the speech recognition problem in terms of transmission through a noisy channel. The Yorktown group used three levels of HMMs to compute the conditional probabilities necessary for the noisy channel. (Does the shadow of rationalism still linger here, never quite dispelled?) At first, the values of the parameters in these HMMs were carefully constructed by hand, but eventually they would all be replaced with estimates obtained by training on real data using statistical estimation procedures such as the Forward-Backward algorithm. Initially, these procedures needed initial estimates carefully prepared over several weeks of hand work, but these days most researchers find that they do not need to be nearly so careful in obtaining initial estimates.
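To make the recasting concrete, here is the noisy-channel view in its standard textbook form (not quoted from the IBM papers themselves): the observed acoustics A are treated as a corrupted transmission of an intended word sequence W, and recognition is decoding:

    \hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)

P(W) plays the role of the source (language) model and P(A | W) the channel (acoustic) model; the three levels of HMMs mentioned above supplied these conditional probabilities.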

Emboldened by this success, the group began by throwing out the phonological rules. Thus, they accepted only a single pronunciation for each word. Any change in these pronunciations was treated as a mislabeling from the front end.

Finally, they removed the dictionary-lookup HMM, taking for the pronunciation of each word its spelling (e.g., through is assumed to be pronounced like tuh huh ruh oh uu guh huh). After training, the system learned that with words like late the front end often missed the e. Similarly, it learned that g's and h's are often silent.

One by one, pieces of the system that had been assiduously assembled by speech experts yielded to probabilistic modeling.


Section 2: POS Tagging

Probabilistic taggers are a major improvement over earlier technologies that ignored lexical probabilities and other preferences that can be estimated statistically from corpus evidence.
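As a minimal sketch of what "estimated statistically from corpus evidence" means here (the tiny corpus and Penn-style tags below are invented for illustration), lexical probabilities P(tag | word) can be read directly off relative frequencies in a tagged corpus:

    from collections import Counter, defaultdict

    # Hypothetical tagged corpus: (word, tag) pairs as they might come
    # from a resource like the Brown corpus.
    tagged_corpus = [
        ("the", "DT"), ("flies", "NNS"), ("like", "IN"), ("honey", "NN"),
        ("time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"),
        ("arrow", "NN"),
    ]

    # Count how often each word occurs with each tag.
    word_tag_counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        word_tag_counts[word][tag] += 1

    def lexical_prob(word, tag):
        """Relative-frequency estimate of P(tag | word)."""
        counts = word_tag_counts[word]
        total = sum(counts.values())
        return counts[tag] / total if total else 0.0

    # "flies" is ambiguous, but the corpus evidence weights each reading.
    print(lexical_prob("flies", "NNS"))  # 0.5
    print(lexical_prob("flies", "VBZ"))  # 0.5

A real tagger would combine these lexical probabilities with contextual (tag-transition) probabilities, but even this crude table captures preferences that rule-based systems of the time simply ignored.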

Chomsky's case against preference-based approximations: The tradition of ignoring preferences dates back to Chomsky's introduction of the competence approximation. Chomsky was concerned that the approximations very much in vogue at the time were inappropriate for his needs. The competence approximation is more appropriate for modelling long-distance dependencies such as agreement constraints and wh-movement, but at the cost of missing certain crucial local constraints, especially these kinds of preferences. Probabilistic models provide a theoretical abstraction of language. They are designed to capture the more important aspects of language and ignore the less important ones, where what counts as important may depend on the application. Inevitably, before the field has exhausted the low-hanging fruit, no one will care about the long tail, because, probabilistically, it is the easily captured features that get the most weight.

Binomial model: the probability of word w showing up in document d.
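The note breaks off before the formula; the standard binomial model it points at would be the following (filled in here as an assumption about the intended model): if each of the n word tokens in document d is an occurrence of w independently with probability p, then the probability of seeing w exactly k times in d is

    P(k) = \binom{n}{k} p^{k} (1 - p)^{n - k}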
