Stanford Speech and Language Processing - 2.2 Words & 2.3 Corpora

2.2 Words

“I do uh main- mainly business data processing”

Fragments: main- (a word broken off partway through)
Filled pauses: uh (a hesitation sound)

“Seuss’s cat in the hat is different from other cats!”

  • Lemma: same stem, part of speech, rough word sense
    (the base, dictionary form)
    cat and cats = same lemma
  • Wordform: the full inflected surface form
    (the word exactly as it appears in the text)
    cat and cats = different wordforms
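
As a minimal sketch of the distinction, the snippet below compares two words first as wordforms (plain string equality) and then as lemmas. The lookup table `LEMMA_TABLE` and the helper names are hypothetical, made up for this illustration; a real lemmatizer would rely on morphological analysis rather than a hand-written dictionary.

```python
# Hypothetical toy table mapping wordforms to lemmas (an assumption for illustration).
LEMMA_TABLE = {"cat": "cat", "cats": "cat", "hat": "hat", "hats": "hat"}

def lemma_of(wordform):
    """Return the lemma for a wordform, falling back to the lowercased form itself."""
    w = wordform.lower()
    return LEMMA_TABLE.get(w, w)

def same_lemma(a, b):
    """True if the two wordforms map to the same lemma."""
    return lemma_of(a) == lemma_of(b)

print(same_lemma("cat", "cats"))  # same lemma: True
print("cat" == "cats")            # same wordform: False
```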

They picnicked by the pool, then lay back on the grass and looked at the stars.

Type: an element of the vocabulary.
Token: an instance of that type in running text.
How many?
16 tokens (or 18, counting the comma and the period)
14 types (or 16, counting punctuation), since "the" occurs three times
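
The counting can be checked with a short script. This is a rough sketch that assumes a simple regex tokenizer (runs of word characters, with each punctuation mark optionally kept as its own token); the function names are illustrative, and other tokenization choices would give different counts.

```python
import re

SENTENCE = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

def tokenize(text, keep_punct=False):
    """Split text into word tokens; optionally keep each punctuation mark as a token."""
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    return re.findall(pattern, text)

def count_tokens_and_types(text, keep_punct=False):
    """Return (number of tokens, number of distinct types), case-sensitive."""
    tokens = tokenize(text, keep_punct)
    return len(tokens), len(set(tokens))

print(count_tokens_and_types(SENTENCE))                   # (16, 14)
print(count_tokens_and_types(SENTENCE, keep_punct=True))  # (18, 16)
```

The only repeated word is "the" (three occurrences), which is why the number of types is two lower than the number of tokens.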

N = number of tokens
V = vocabulary = set of types
|V| = size of the vocabulary
Heaps' Law = Herdan's Law: $|V| = kN^{\beta}$
where often $0.67 < \beta < 0.75$
i.e., vocabulary size grows faster than the square root of the number of word tokens
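
To get a feel for this growth rate, here is a small sketch that evaluates the formula for a few corpus sizes and prints the square root of N alongside for comparison. The values `k=30.0` and `beta=0.7` are assumptions chosen only for illustration (beta sits inside the range quoted above); in practice both constants depend on the corpus and would have to be fitted to it.

```python
import math

def heaps_vocab_size(n_tokens, k=30.0, beta=0.7):
    """Predicted vocabulary size |V| = k * N**beta (k and beta are illustrative)."""
    return k * n_tokens ** beta

for n in (10_000, 1_000_000, 100_000_000):
    v = heaps_vocab_size(n)
    print(f"N={n:>11,}  predicted |V|={v:>12,.0f}  sqrt(N)={math.sqrt(n):>8,.0f}")
```

With beta = 0.7, multiplying N by 100 multiplies the predicted |V| by 100^0.7 (about 25), whereas the square root of N grows only by a factor of 10.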

2.3 Corpora

Words don’t appear out of nowhere.
A text is produced by a specific writer (or writers), at a specific time, in a specific variety of a specific language, for a specific function.

Corpora vary along dimensions such as:

  • Language: 7097 languages in the world
  • Variety, like African American Language varieties.

AAL Twitter posts might include forms like “iont” (I don’t)

  • Code switching, e.g., Spanish/English, Hindi/English:

S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]

H/E: dost tha or ra- hega … dont wory … but dherya rakhe [“he was and will remain a friend … don’t worry … but have faith”]

  • Genre: newswire, fiction, non-fiction, scientific articles, Wikipedia
  • Author Demographics: writer’s age, gender, race, socioeconomic status, etc.

Corpus datasheets

  • Motivation: Why was the corpus collected, by whom, and who funded it?
  • Situation: In what situation was the text written?
  • Collection process: If it is a subsample, how was it sampled? Was there consent? Pre-processing?
  • Also: annotation process, language variety, speaker demographics