Stanford Speech and Language Processing - 2.2 Words & 2.3 Corpora

Latest recommended article published 2021-03-20 19:01:15

周某1111

Category: Self-study; Tags: natural language processing

Article link: https://blog.csdn.net/weixin_48760912/article/details/114835845


2.2 Words

“I do uh main- mainly business data processing”

Fragment: main- (a word broken off partway through)
Filled pause: uh (a hesitation sound)
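
As a rough illustration of the two disfluency types, the sketch below labels each whitespace-separated token of the example utterance. It assumes that a trailing hyphen marks a fragment (the transcription convention used in the example) and that the filled-pause inventory is just "uh" and "um"; both the function name and the inventory are illustrative choices, not part of the original notes.

```python
# Assumed, minimal inventory of filled pauses (illustrative only).
FILLED_PAUSES = {"uh", "um"}

def label_disfluencies(utterance):
    """Label each whitespace-separated token as fragment, filled pause, or word."""
    labels = []
    for tok in utterance.split():
        if tok.endswith("-"):
            # Assumption: the transcript marks broken-off words with a trailing hyphen.
            labels.append((tok, "fragment"))
        elif tok.lower() in FILLED_PAUSES:
            labels.append((tok, "filled pause"))
        else:
            labels.append((tok, "word"))
    return labels

labels = label_disfluencies("I do uh main- mainly business data processing")
for tok, kind in labels:
    print(f"{tok}\t{kind}")
```

Whether such tokens are kept or dropped depends on the application: a speech recognizer may need them, while a text summarizer usually does not.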

“Seuss’s cat in the hat is different from other cats!”

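Lemma: a set of word forms sharing the same stem, part of speech, and rough word sense
(the base or dictionary form)
cat and cats = same lemma
Wordform: the full inflected surface form
(the word as it actually appears in the text)
cat and cats = different wordforms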
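
The lemma/wordform distinction can be sketched with a tiny lookup table. The table `LEMMA_OF` below is a hypothetical, hand-written mapping used only for illustration; a real system would use a morphological analyzer or lemmatizer instead.

```python
from collections import defaultdict

# Hypothetical hand-written wordform -> lemma table (illustration only).
LEMMA_OF = {"cat": "cat", "cats": "cat", "hat": "hat", "hats": "hat"}

def group_by_lemma(wordforms):
    """Group distinct wordforms under their shared lemma."""
    groups = defaultdict(set)
    for w in wordforms:
        # Unknown wordforms fall back to being their own lemma.
        groups[LEMMA_OF.get(w, w)].add(w)
    return dict(groups)

groups = group_by_lemma(["cat", "cats", "hat"])
print(groups)
```

Here "cat" and "cats" are two wordforms but land in a single lemma group.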

They picnicked by the pool, then lay back on the grass and looked at the stars.

Type: an element of the vocabulary.
Token: an instance of that type in running text.
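How many in the sentence above?
16 tokens (or 18 if the comma and period are counted)
14 types (or 16 if punctuation is counted); "the" occurs three times but is a single type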
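
A minimal sketch of counting tokens and types for the example sentence, assuming a simple regular-expression tokenizer and case-sensitive types. The `keep_punct` flag is an illustrative parameter that switches between counting and ignoring punctuation marks.

```python
import re

SENTENCE = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

def count_tokens_and_types(text, keep_punct=False):
    """Return (number of tokens, number of distinct types) for a text."""
    # \w+ matches word tokens; [^\w\s] matches single punctuation marks.
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    tokens = re.findall(pattern, text)
    return len(tokens), len(set(tokens))

print(count_tokens_and_types(SENTENCE))                   # (16, 14)
print(count_tokens_and_types(SENTENCE, keep_punct=True))  # (18, 16)
```

The counts change with the tokenization decisions, which is why a single sentence can have more than one defensible answer.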

N = number of tokens
V = vocabulary = set of types
$|V|$ is the size of the vocabulary
Heaps’ Law = Herdan’s Law: $|V| = kN^{\beta}$
where $k$ is a positive constant and often $0.67 < \beta < 0.75$
i.e., vocabulary size grows faster than the square root of the number of word tokens
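
To make the growth rate concrete, the sketch below evaluates $|V| = kN^{\beta}$ for a few corpus sizes. The values `k = 30` and `beta = 0.7` are assumptions chosen only for illustration (the notes give a range for $\beta$ and no value for $k$); real values depend on the corpus and genre.

```python
import math

def heaps_vocab_size(n_tokens, k, beta):
    """Predicted vocabulary size |V| = k * N**beta under Heaps' (Herdan's) law."""
    return k * n_tokens ** beta

K, BETA = 30, 0.7  # assumed illustrative constants, not fitted to any corpus

for n in (10_000, 100_000, 1_000_000):
    v = heaps_vocab_size(n, K, BETA)
    # The ratio to sqrt(N) keeps increasing because beta > 0.5.
    print(f"N={n:>9,}  |V|~{v:>10,.0f}  |V|/sqrt(N)={v / math.sqrt(n):.1f}")

# Doubling the corpus multiplies the predicted vocabulary by 2**beta,
# which lies between sqrt(2) and 2 for 0.5 < beta < 1.
doubling_factor = 2 ** BETA
print(f"doubling factor: {doubling_factor:.3f}")
```

In other words, each doubling of the corpus keeps adding new word types, but fewer than a proportional share.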

2.3 Corpora

Words don’t appear out of nowhere.
A text is produced by a specific writer(s), at a specific time, in a specific variety of a specific language, for a specific function.

Corpora vary along dimensions like:

Language: 7097 languages in the world
Variety, like African American Language varieties.

AAL Twitter posts might include forms like “iont” (I don’t)

Code switching, e.g., Spanish/English, Hindi/English:

S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]

H/E: dost tha or rahega … dont wory … but dherya rakhe [“he was and will remain a friend … don’t worry … but have faith”]

Genre: newswire, fiction, non-fiction, scientific articles, Wikipedia
Author Demographics: writer’s age, gender, race, socioeconomic status, etc.

Corpus datasheets

Motivation: Why was the corpus collected, by whom, and who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample, how was it sampled? Was there consent? What pre-processing was applied?
Also: annotation process, language variety, speaker demographics
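
One way to keep these questions attached to a corpus is a small record type. The sketch below is an assumed layout, not a standard schema: the class name `CorpusDatasheet`, its field names, and the example values are all hypothetical placeholders that mirror the items listed above.

```python
from dataclasses import dataclass, asdict

@dataclass
class CorpusDatasheet:
    """Hypothetical record mirroring the datasheet questions in these notes."""
    motivation: str
    situation: str
    collection_process: str
    annotation_process: str = ""
    language_variety: str = ""
    speaker_demographics: str = ""

    def missing_fields(self):
        """Names of fields that are still empty, in declaration order."""
        return [name for name, value in asdict(self).items() if not value.strip()]

# Placeholder values for illustration only; they describe no real corpus.
sheet = CorpusDatasheet(
    motivation="Example: collected for a course project",
    situation="Example: informal written posts",
    collection_process="Example: random subsample, with consent",
    language_variety="Example: Spanish/English code-switching",
)
print(sheet.missing_fields())
```

Listing the empty fields makes it easy to see which parts of the documentation are still owed before a corpus is shared.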
