[Paper Summary] A Pendulum Swung Too Far

— Kenneth Church, 2011, Linguistic Issues in Language Technology (LiLT)


The field has oscillated between Rationalism and Empiricism and back, switching roughly every couple of decades. The reason grandparents and grandchildren get along so well is that they have a common enemy.
 
1950s: Empiricism (Shannon, Skinner, Firth, Harris)
1970s: Rationalism (Pierce, Chomsky, Minsky)
1990s: Empiricism (IBM Speech Group, AT&T Bell Labs)
2010s: A Return to Rationalism? Forget about it; 2010 turned out to be year one of deep learning (tongue in cheek).

  • Our prediction that the pendulum would swing back to Rationalism by now was not exactly a prediction, but more of a plea for inclusiveness. CL used to be an interdisciplinary combination of Humanities and Engineering, with more Humanities in Europe and more Engineering in Asia. As the field took a hard turn toward Empiricism in the 1990s, we have gained new interdisciplinary connections to ML, but the connections to Linguistics and Humanities are no longer as strong as they used to be. We would be better off if we could find ways to work together. There has been way too much talk about firing linguists. [Church & Hestness 2019]

This paper was written when people were riding the tide of empiricism, but it reviews some of the rationalist positions (Pierce, Chomsky, Minsky, “PCM”) that our generation rebelled against. It is a shame that our generation has been so successful that these rationalist positions are being forgotten. Some of the more important rationalists, like Pierce, are no longer even mentioned in currently popular textbooks.

The 1990s witnessed a resurgence of interest in 1950s-style empirical and statistical methods of language analysis. Empiricism was at its peak in the 1950s. At that time, it was common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their co-occurrences with other words. Firth (1957) summarized the approach with a memorable line: "You shall know a word by the company it keeps."
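As a minimal sketch of that distributional idea (not from the paper; the toy corpus and window size are invented purely for illustration), one can characterize a word simply by counting its neighbors within a small window:

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count, for each word, the words that occur within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

# Toy corpus: "strong" keeps company with beverages, "powerful" with machines.
corpus = "strong tea strong coffee powerful computer powerful engine".split()
print(cooccurrence_counts(corpus)["strong"])
# Counter({'tea': 2, 'strong': 2, 'coffee': 1, 'powerful': 1})
```

With a real corpus, these co-occurrence profiles become the vectors on which distributional classification (and, much later, word embeddings) is built.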

Regrettably, interest in empiricism faded in the late 1950s and early 1960s with a number of significant events, including Chomsky’s criticism of n-grams in Syntactic Structures [Chomsky 1957], Minsky and Papert’s criticism of neural networks in Perceptrons [Minsky & Papert 1969], and Pierce’s skepticism about speech recognition [“Whither Speech Recognition?”] and machine translation [the ALPAC report].

The most immediate reason for the empirical renaissance is the availability of massive quantities of data. The data-intensive approach to language is pragmatic and well suited to the recent emphasis on numerical evaluations and concrete deliverables.

Revival from winter: The community found the pragmatic approach attractive in the early 1990s because the field was in the midst of a severe funding winter. So the community was relatively receptive to a new approach that promised reliable results that we could bank on. That’s why concrete deliverables and tangible evaluation results were valued so much.

Fifteen years of picking low-hanging fruit has produced a relatively stable stream of results, and relatively stable funding, at least when compared to the AI Winters. Thus far, n-grams and FSMs have served us well. While there are obvious limitations, it is hard to point to more effective alternatives. In other words, attempts to capture unusual long-distance dependencies tend to fix a few fringe cases, but break more cases than they fix. At least in our generation's experience, addressing the more common short-distance dependencies is more important than the less common long-distance ones.


Pierce was one of AI’s more outspoken critics: "Funding AI is real stupidity; I thought of it the first time I saw it." Pierce had little patience with AI: "After growing wildly for years, the field of computing appears to be reaching its infancy." He objected to anything that attempted to come close to human intelligence. He chaired the (in)famous ALPAC report. He also wrote “Whither Speech Recognition?”, which had a chilling effect on funding for speech recognition. He was a highly accomplished executive at the top of his game. The poor folks on the other side of the debate were simply no match; some of Pierce’s opponents were junior faculty about to be denied tenure. But even so, there is no reason to ignore his contributions to the field, inconvenient as they may be.

To be crisp, Pierce objected to two things: evaluation & pattern matching

  • Pierce objected to evaluation by demos: it is hard to gauge the success of an attempt even when statistics are given, and it is not easy to see a practical, economically sound application for speech recognition with this capability.
  • Pierce objected to pattern matching as “artful deception” that is apt to succeed better and more quickly than science.

Pierce endorsed two positions. 1) Pierce was a strong supporter of basic science. He objected to attempts to sell science as something other than it is (e.g., applications), as well as attempts to misrepresent progress with misleading demos and/or mindless metrics (such as the kind of evaluations that are routinely performed today). 2) On the other hand, there was also a practical side to Pierce. He was a strong supporter of applied work, but under very different rules, e.g., in terms of a business case. Applied work should be evaluated as applied work (based on a business case), and science should be evaluated as science (based on peer review). Perhaps that’s what led to my inner “dichotomy” toward my research.


ELIZA Effect

The ELIZA effect refers to the susceptibility of people to read far more understanding than is warranted into strings of symbols, especially when strung together by computers. From a psychological standpoint, the ELIZA effect is the result of a subtle cognitive dissonance between the user’s awareness of the program’s limitations and their behavior towards its output.

Pierce followed up the ‘artful deception’ remark with a reference to Weizenbaum’s doctor program, ELIZA. Weizenbaum himself became a strong opponent of AI when he realized just how convincing his ELIZA program was to the public. In “Incomprehensible Programs” (a chapter of Computer Power and Human Reason), Weizenbaum criticized non-theory-based programs: theory-based programs rest on mathematical control theory and on firmly established physical theories. Such theory-based programs enjoy the enormously important advantage that, when they misbehave, their human monitors can detect that their performance does not correspond to the dictates of their theory and can diagnose the reason for the failure. But most existing programs are not theory-based; they are heuristic stratagems that appear to “work” under most foreseen circumstances. His own ELIZA was precisely of this type, as were Winograd’s language-understanding system and Newell and Simon’s General Problem Solver.
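To make the “heuristic stratagem” point concrete, here is a toy ELIZA-style rule (a sketch, not Weizenbaum’s actual script; the patterns and replies are invented): a regular expression simply reflects the user’s words back, with no understanding behind it.

```python
import re

# Hypothetical ELIZA-style rules: keyword patterns mapped to reply templates.
rules = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"my (.*)", re.I),     "Tell me more about your {0}."),
]

def respond(utterance):
    for pattern, template in rules:
        m = pattern.match(utterance)
        if m:
            return template.format(*m.groups())
    return "Please go on."   # default when no pattern matches

print(respond("I feel anxious about my thesis"))
# -> "Why do you feel anxious about my thesis?"
```

The reply looks attentive, yet the program has no model of anxiety or theses; that gap between appearance and mechanism is exactly the ELIZA effect.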

Pattern recognition has its strengths and weaknesses. On the positive side, pattern recognition makes it possible to make progress on applications by finessing many hard scientific questions. On the other hand, it is hard to make progress on the key scientific questions because short-term finesses distract from long-term science.

Many engineering tasks share the same two threads of research: a pragmatic engineering approach and a more ambitious scientific program. We have a better chance of making progress on big open scientific questions if we address them head-on rather than finessing around them.


Chomsky wrote about the limitations of n-grams. Chomsky was a student of Zellig Harris, whose distributional hypothesis is nicely summarized by Firth’s quote, “You shall know a word by the company it keeps.” At the time there was a lot of excitement over the Shannon-McMillan-Breiman Theorem, which was interpreted to say that, in the limit, under just a couple of minor caveats and a little bit of not-very-important fine print, n-gram statistics are sufficient to capture all the information in a string. Chomsky realized that while that may be true in the limit, n-grams are far from the most parsimonious representation of many linguistic facts. Truncated n-gram systems (where n = 3 or 5) can capture many agreement facts, but not all. Our generation has been fortunate to have plenty of low-hanging fruit to pick (the facts that can be captured with short n-grams), but the next generation will be less fortunate, since most of those facts will have been pretty well picked over before they retire.
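A minimal sketch of that limitation (the toy corpus and example sentence are invented for illustration): a trigram table can record local agreement, but a dependency spanning more than n words never fits inside any single n-gram.

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n=3):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

corpus = "the dog barks . the dogs bark .".split()
trigram_counts = Counter(ngrams(corpus, 3))

# Local agreement fits inside a single trigram:
print(trigram_counts[("the", "dog", "barks")])   # 1
print(trigram_counts[("the", "dogs", "bark")])   # 1

# But in "the dog near the parked cars barks", the subject "dog" and the
# verb "barks" are separated by four words, so no trigram contains both;
# a truncated n-gram model cannot see that agreement at all.
```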

Chomsky not only objected to n-grams, but he also objected to finite-state methods, which include HMMs and CRFs. Finite-state methods can capture everything n-grams capture, and go beyond that.

However, finite-state methods cannot capture center-embedding. Chomsky established center-embedding as the key difference between context-free and finite-state: if (and only if) the grammar is center-embedded, then it requires unbounded memory (a stack); otherwise, it can be processed with finite memory (an FSM). More formally, a grammar is center-embedded if there is a non-terminal A that can generate xAy where both x and y are non-empty. If either x or y is empty, then we have the simpler cases of left-branching and right-branching, which can be processed with FSMs, unlike center-embedding, which requires unbounded memory. Chomsky assumed that English has a non-terminal S (for sentence or clause) that generates itself with non-empty material on both sides, e.g.: 1) S → If S, then S. 2) S → Either S, or S. 3) S → The man who said that S, is arriving today.
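Here is a minimal sketch (not from the paper) of why that matters computationally, using the toy grammar S → "if" S "then" S | "it rains". The recognizer below leans on the call stack, i.e., unbounded memory; a fixed-size FSM would have to remember an unbounded number of pending "if"s and therefore cannot recognize this language.

```python
def parse_s(tokens, i=0):
    """Return the index just past one S starting at position i, or -1 on failure."""
    if tokens[i:i + 2] == ["it", "rains"]:       # S -> "it rains"
        return i + 2
    if i < len(tokens) and tokens[i] == "if":    # S -> "if" S "then" S
        j = parse_s(tokens, i + 1)               # center-embedded S
        if j != -1 and j < len(tokens) and tokens[j] == "then":
            return parse_s(tokens, j + 1)        # right-branching S
    return -1

sent = "if if it rains then it rains then it rains".split()
print(parse_s(sent) == len(sent))   # True: doubly center-embedded, parsed with a stack
```

The first embedded S sits between non-empty material ("if" on the left, "then S" on the right), which is exactly Chomsky’s xAy condition; the depth of such nesting is what a finite-state device cannot count.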


Minsky objected to perceptrons (i.e., linear separators). The objection also has implications for popular techniques in pattern-matching tasks such as Word-Sense Disambiguation (WSD), Author Identification, IR and Sentiment Analysis. The objection to perceptrons applies to many variations of logistic regression, SVMs, naive Bayes, etc., including both discriminative and generative variants.
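A minimal sketch of the Minsky and Papert objection (the training loop and epoch count are arbitrary choices for illustration): no single linear separator can compute XOR, so a perceptron trained on it never gets all four points right.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                # XOR labels: not linearly separable

w, b = np.zeros(2), 0.0
for epoch in range(100):                   # classic perceptron learning rule
    for xi, yi in zip(X, y):
        pred = int(w @ xi + b > 0)
        w += (yi - pred) * xi
        b += yi - pred

preds = (X @ w + b > 0).astype(int)
print(preds, "accuracy:", (preds == y).mean())   # at most 3 of 4 correct, ever
```

The same geometric limitation is inherited by any model that reduces to a single linear decision boundary over the raw features, which is why the objection carries over to the linear classifiers listed above unless the features themselves are enriched.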

For the tasks listed above, there is an important difference in stopwords. Author identification places content words on a stop list because this task is more interested in style than content. Term weighting can be viewed as a generalization of stop lists. In search engines, it is common to learn optimal weights with learning-to-rank algorithms. User logs tend to be more informative than document features because the web tends to have more readers than writers, which adds value by helping users discover the wisdom of the crowd.
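A small sketch of “term weighting generalizes stop lists” (toy documents invented for illustration): inverse document frequency drives the weight of ubiquitous words toward zero, removing them softly rather than via an explicit list.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell today".split(),
]

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing term)."""
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df) if df else 0.0

print(round(idf("the", docs), 2))     # 0.0  -> behaves like a stopword
print(round(idf("stock", docs), 2))   # 1.1  -> carries content
```

An author-identification system would want roughly the opposite weighting, which is the point above: what counts as noise depends on the task.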

Learning to rank is a pragmatic approach that uses relatively simple ML and pattern matching techniques to finesse problems that might otherwise require AI-complete understanding.
“Rather than trying to get computers to understand the content and whether it is useful, we watch people who read the content and look at whether they found it useful. Crowds find the wisdom on the web. Computers surface that wisdom.” — GREG LINDEN, 2007
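A hedged sketch of the idea behind the quote (the features, numbers, and perceptron-style update are invented for illustration, not any production ranker): treat a click on a lower-ranked result over a skipped higher-ranked one as a pairwise preference, and adjust a linear scorer until it respects that preference.

```python
import numpy as np

# Hypothetical per-result features: [query-term overlap, historical click rate]
skipped = np.array([[0.9, 0.1]])      # ranked first by content match, but skipped
clicked = np.array([[0.6, 0.8]])      # ranked lower, but clicked by users

w = np.zeros(2)
for _ in range(100):                   # pairwise perceptron-style updates
    for good, bad in zip(clicked, skipped):
        if w @ good <= w @ bad:        # preference from the log is violated
            w += good - bad

print(w @ clicked[0] > w @ skipped[0])   # True: the log-derived preference now holds
```

The machinery is simple pattern matching over log-derived labels; nothing in it understands the documents, which is exactly Linden’s point.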

Why Current Technology Ignores Predicates
Because it is hard to make sense of modifiers unless you know what they are modifying. You can’t do much with a feature if you don’t know the sign, even if you know the magnitude is large.
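A toy illustration of the sign-versus-magnitude point (the lexicon weights and sentences are made up): a bag-of-words scorer sees a strong feature but not what it modifies or whether it is negated, so the magnitude is right and the sign is wrong.

```python
# Hypothetical sentiment lexicon weights.
weights = {"good": +2.0, "terrible": -2.0}

def bow_score(text):
    """Sum of word weights, ignoring what modifies what."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())

print(bow_score("the plot was good"))            #  2.0, correct
print(bow_score("not good , not good at all"))   #  4.0, confidently wrong sign
```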


What do all these have to offer us today?

Thus far, the field has done well by picking low-hanging fruit. In good times, when there are lots of easy pickings, we should take advantage of the opportunities. But if those opportunities should dry up, we would be better off following Pierce’s advice: it is better to address the core scientific challenges than to continue to look for easy pickings that are no longer there.

Those Who Ignore History Are Doomed To Repeat It

For the most part, the empirical revivals in ML, IR and Speech Recognition have simply ignored PCM’s arguments. The issues that give rise to excitement today seem much the same as those that were responsible for previous rounds of excitement. But we expect the growth to require a degree of critical analysis that its more romantic advocates have always been reluctant to pursue — perhaps because the spirit of connectionism seems itself to go somewhat against the grain of analytic rigor.

The revival of empiricism was an exciting time. We never imagined that that effort would be as successful as it turned out to be. At the time, all we wanted was a seat at the table. …but we may have been too successful. Not only have we succeeded in making room for what we were interested in, but now there is no longer much room for anything else. We were standing right in the middle of the dramatic shift from Rationalism to Empiricism with no end in sight.

Something to Do with Teaching: Gaps in Courses on CL

One side of the debate is written out of the textbooks and forgotten, only to be revived/reinvented by the next generation.

Contemporary textbooks in CL have remarkably little to say about PCM. Contemporary textbooks ought to teach both the strengths and the weaknesses of useful approximations such as neural networks. Both sides of the debate have much to offer, even if the recipients of such an education choose to take one side or the other. We do the next generation a disservice when we dismiss one side or the other with harsh words like “incorrect conjecture” and “not much has changed”.

Chomsky receives more coverage than Pierce and Minsky in contemporary textbooks. But these textbooks quickly move on to describe the revival of empirical methods, with relatively little discussion of the arguments, the motivations for the revival, and the implications for current practice and the future.

e.g. Chomsky argued that finite-state Markov processes, while a possibly useful engineering heuristic, were incapable of being a complete cognitive model of human grammatical knowledge. These arguments led many linguists and computational linguists away from statistical models altogether. The resurgence of N-gram models came from … [Jurafsky & Martin 2000, Speech and Language Processing]

The following quotes introduce the student to the existence of a controversy, but they don’t help the student appreciate what it means for them.

  • But it must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term.
  • Anytime a linguist leaves the group the recognition rate goes up.
  • Statistical considerations are essential to an understanding of the operation and development of languages.
  • One’s ability to produce and recognize grammatical utterances is not based on notions of statistical approximations and the like.

To prepare students for what might come after the low-hanging fruit has been picked over, it would be good to provide today’s students with a broad education that makes room for many topics such as syntax, morphology, phonetics, historical linguistics and language universals. We are graduating CL students these days who have very deep knowledge of one particular narrow sub-area (such as ML and statistical MT) but may not have heard of Greenberg’s Universals, Raising, Equi, quantifier scope, gapping, island constraints, generative capacity, the Chomsky Hierarchy and so on. Students working on coreference should know about c-command and disjoint reference. Those working on speech recognition need to know about lexical stress: phonological stress has all sorts of consequences on downstream phonetic and acoustic processes. Speech recognizers currently don’t do much with lexical stress, which seems like a missed opportunity since stress is one of the more salient properties in speech signals. We ought to teach students about the phonology and acoustic-phonetics of lexical stress, so they will be ready when the state of the art advances past the current bottlenecks at the segmental level, since there are long-distance dependencies associated with stress that span more than tri-phones.
