1. Basic views and methods of the two schools of NLP, rationalist and statistical:
Symbolic approach: encode all required information into the computer (rationalism)
linguistic knowledge (static knowledge, context-dependent knowledge)
world knowledge (uniqueness of reference, type of noun, situational associativity between nouns)
Statistical approach: infer language properties from language samples (empiricism)
Collect a large collection of texts relevant to your domain
For each noun, compute its probability of taking a certain determiner
$P(\text{determiner} \mid \text{noun}) = \frac{\text{freq}(\text{noun}, \text{determiner})}{\text{freq}(\text{noun})}$
Given a new noun, select the determiner with the highest likelihood as estimated on the training corpus
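The statistical recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy list `pairs` and the helper names `train`, `prob`, and `best_determiner` are hypothetical and not part of the notes, and a real system would extract (determiner, noun) pairs from a large tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (determiner, noun) pairs (an assumption for
# illustration; real counts would come from a large domain corpus).
pairs = [
    ("the", "sun"), ("the", "sun"), ("a", "sun"),
    ("a", "dog"), ("a", "dog"), ("the", "dog"), ("a", "dog"),
]

def train(pairs):
    """Count freq(noun, determiner) for every noun in the corpus."""
    joint = defaultdict(Counter)
    for det, noun in pairs:
        joint[noun][det] += 1
    return joint

def prob(joint, det, noun):
    """Estimate P(det | noun) = freq(noun, det) / freq(noun)."""
    counts = joint.get(noun)
    if not counts:
        return 0.0
    return counts[det] / sum(counts.values())

def best_determiner(joint, noun):
    """Pick the determiner with the highest estimated probability."""
    counts = joint.get(noun)
    if not counts:
        return None  # noun never seen in training: no estimate available
    return counts.most_common(1)[0][0]

joint = train(pairs)
print(prob(joint, "the", "sun"))
print(best_determiner(joint, "dog"))
```

A noun that never occurs in the training corpus has no counts at all, which is why the sketch returns `None` in that case; handling such nouns needs smoothing or backing off to features of the noun.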
2. Encoding
Big5: the first byte ranges from 0xA1 to 0xF9
the second byte ranges from 0x40 to 0x7E and from 0xA1 to 0xFE; ASCII characters are still represented with a single byte
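As a rough illustration of how these byte ranges let a program tell one-byte and two-byte characters apart, here is a minimal sketch. It assumes the standard Big5 ranges (lead byte 0xA1-0xF9, trail byte 0x40-0x7E or 0xA1-0xFE); the function names are hypothetical, and only Python's built-in `big5` codec is used to produce test bytes.

```python
from typing import List

def is_big5_lead(b: int) -> bool:
    # Assumed lead-byte range of standard Big5.
    return 0xA1 <= b <= 0xF9

def is_big5_trail(b: int) -> bool:
    # Assumed trail-byte ranges of standard Big5.
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

def split_big5(data: bytes) -> List[bytes]:
    """Split a Big5 byte string into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        if (is_big5_lead(data[i]) and i + 1 < len(data)
                and is_big5_trail(data[i + 1])):
            units.append(data[i:i + 2])  # double-byte character
            i += 2
        else:
            units.append(data[i:i + 1])  # single byte, e.g. ASCII
            i += 1
    return units

encoded = "A中B".encode("big5")
print([u.decode("big5") for u in split_big5(encoded)])
```

Because the trail-byte range 0x40-0x7E overlaps ASCII, a byte can only be interpreted correctly by scanning from the start of the string, as the loop does.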
3. Identifying the word-formation type:
- noun compound (大人, 小人, 热心, 水手, 黑板, 去年)
- verb compound (寄生, 飞驰, 杂居, 火葬, 面授, 单恋)
- coordinative compound of synonyms (报告, 声音, 奇怪, 帮助, 学习, 购买)
- antonymous compound (买卖, 左右, 高矮, 大小, 开关, 长短)
- verb-object compound (放心, 鼓掌, 动员, 司机, 主席, 干事)
- verb-complement compound (进来, 进去, 介入, 改良, 打破, 推翻)
- subject-predicate compound (地震, 心疼, 民主, 自决, 胆小, 年轻)
- noun-measure complement compound (人口, 羊群, 书本, 花朵, 枪支)
- modifier-noun compound (情人节, 小说家, 加油站, 大学生, 金黄色)
- verb-object trisyllabic compound (开玩笑, 吹牛皮, 吃豆腐)
- subject-verb-object compound (胆结石, 鬼画符, 鬼打墙)
- description + noun compound (棒棒糖, 乒乓球, 呼啦圈)
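4. Give example words for the three types of structural ambiguity
- Overlapping ambiguity [网球会, 美国会]: a character can be attached either to the preceding word or to the following word (美国/会 vs. 美/国会)
- Combinatorial ambiguity [才能, 学生会]: both the combined reading and the split reading are valid (才能 vs. 才/能)
- Mixed type [太平洋, 太平, 平淡]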
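One way to see these ambiguities concretely is to enumerate every segmentation that a lexicon allows. The sketch below is an illustration only: the tiny `lexicon` and the function name `segmentations` are assumptions, not part of the notes. A string with more than one valid segmentation is ambiguous for a dictionary-based segmenter.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words found in lexicon."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in lexicon:
            # Extend each segmentation of the remainder with this head word.
            for rest in segmentations(text[i:], lexicon):
                results.append([head] + rest)
    return results

# Hypothetical mini-lexicon covering two examples from the list above.
lexicon = {"美", "美国", "国会", "会", "才", "能", "才能"}

# Overlapping ambiguity: 国 can attach to 美 (美国/会) or to 会 (美/国会).
print(segmentations("美国会", lexicon))

# Combinatorial ambiguity: 才能 is valid as one word or as 才/能.
print(segmentations("才能", lexicon))
```

The enumeration is exponential in the worst case, so it is only meant for inspecting short strings; practical segmenters score the candidates instead of listing them all.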
5. Write down three types of unknown words
- abbreviation: 国考 (short for 国家公务员考试, the national civil service examination)
- proper name / named entity: 鹿鼎记
- compound: 光敏感, 流体力学
- place name: 天涯海角
- organization name: 上海合作组织
- derived word: 审计人, 审计员, 审计局, 审计处
- numeric-type compound: 八月十五, 第一, 八点十分
6. Definition of entropy
- Defined by the second law of thermodynamics
- A measure of the energy not available for work in a thermodynamic process
- A closed system always tends toward a state of maximum entropy
7. For limited substitutability, limited modifiability, and limited compositionality, give two quantitative features:
- one sense per collocation
- one sense per discourse