2.2 Words
“I do uh main- mainly business data processing”
**Fragments:** main- (a word that is broken off partway through)
**Filled pauses:** uh (a hesitation sound)
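As a rough illustration of the two disfluency types in the utterance above, here is a minimal sketch that drops them before further processing. The `FILLED_PAUSES` set and the rule "a token ending in a hyphen is a fragment" are assumptions made for this example, not a standard inventory, and whether disfluencies should be removed at all depends on the application.

```python
# Hypothetical, deliberately tiny inventory of filled pauses.
FILLED_PAUSES = {"uh", "um"}

def remove_disfluencies(utterance):
    """Drop filled pauses and broken-off fragments from a transcribed utterance."""
    kept = []
    for token in utterance.split():
        if token.lower() in FILLED_PAUSES:
            continue  # filled pause, e.g. "uh"
        if token.endswith("-"):
            continue  # fragment, e.g. "main-" (assumed transcription convention)
        kept.append(token)
    return " ".join(kept)

utterance = "I do uh main- mainly business data processing"
print(remove_disfluencies(utterance))  # I do mainly business data processing
```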
“Seuss’s cat in the hat is different from other cats!”
- Lemma: same stem, part of speech, rough word sense (the root form)
cat and cats = same lemma
- Wordform: the full inflected surface form
cat and cats = different wordforms
They picnicked by the pool, then lay back on the grass and looked at the stars.
Type: an element of the vocabulary.
Token: an instance of that type in running text.
How many?
16 tokens (or 18 if punctuation marks count as tokens)
14 types (or 16 with punctuation)
N = number of tokens
V = vocabulary = set of types; $|V|$ is the size of the vocabulary
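To make N and |V| concrete, the following sketch counts tokens and types in the picnic sentence. The regex tokenizer and the `keep_punct` switch are simplifying assumptions for this example; real tokenizers handle clitics, numbers, and multiword names differently, which is why the counts depend on the conventions chosen.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

def tokenize(text, keep_punct=False):
    """Split text into word tokens; optionally keep punctuation marks as tokens."""
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    return re.findall(pattern, text)

def count_tokens_and_types(text, keep_punct=False):
    """Return (N, |V|): the number of tokens and the number of distinct types."""
    tokens = tokenize(text, keep_punct)
    return len(tokens), len(set(tokens))

print(count_tokens_and_types(sentence))                   # (16, 14)
print(count_tokens_and_types(sentence, keep_punct=True))  # (18, 16)
```

The only repeated type is "the" (three tokens), so the type count is two less than the token count in both settings.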
Heaps' Law = Herdan's Law: $|V| = kN^{\beta}$, where often $0.67 < \beta < 0.75$
i.e., vocabulary size grows faster than the square root of the number of word tokens
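The growth claim can be checked numerically. The sketch below evaluates the Heaps/Herdan formula for a few corpus sizes and compares it with square-root growth. The values `K = 40.0` and `BETA = 0.7` are hypothetical placeholders chosen only for illustration (β is picked from inside the quoted range); in practice k and β are fitted to a particular corpus.

```python
def heaps_vocab_size(n_tokens, k, beta):
    """Predicted vocabulary size |V| = k * N**beta."""
    return k * n_tokens ** beta

K = 40.0    # hypothetical constant, corpus dependent
BETA = 0.7  # assumed value inside the 0.67 to 0.75 range

for n in (10_000, 1_000_000, 100_000_000):
    predicted = heaps_vocab_size(n, K, BETA)
    print(f"N={n:>11,}  predicted |V|={predicted:>12,.0f}  sqrt(N)={n ** 0.5:>8,.0f}")

# Multiplying N by 100 multiplies |V| by 100**beta, which exceeds the
# factor of 10 that square-root growth would give whenever beta > 0.5.
print(100 ** BETA > 100 ** 0.5)
```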
2.3 Corpora
Words don’t appear out of nowhere.
A text is produced by a specific writer(s), at a specific time, in a specific variety of a specific language, for a specific function.
Corpora vary along dimensions like:
- Language: 7097 languages in the world
- Variety, like African American Language varieties.
AAL Twitter posts might include forms like “iont” (I don’t)
- Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! It was beautiful:)]
H/E: dost tha or ra- hega … dont wory … but dherya rakhe [“he was and will remain a friend … don’t worry … but have faith”]
- Genre: newswire, fiction, non-fiction, scientific articles, Wikipedia
- Author Demographics: writer’s age, gender, race, socioeconomic status, etc.
Corpus datasheets
- Motivation: Why was the corpus collected, by whom, and who funded it?
- Situation: In what situation was the text written?
- Collection process: If it is a subsample how was it sampled? Was there consent? Pre-processing?
- Plus: annotation process, language variety, speaker demographics
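One way to keep these datasheet questions attached to a corpus is a small record type. This is only a sketch: the class name `CorpusDatasheet`, its field names, and the example values are hypothetical and simply mirror the items listed above, not any standard schema.

```python
from dataclasses import dataclass

@dataclass
class CorpusDatasheet:
    """Hypothetical record mirroring the datasheet questions listed above."""
    motivation: str            # why collected, by whom, who funded it
    situation: str             # in what situation the text was written
    collection_process: str    # sampling, consent, pre-processing
    annotation_process: str = ""
    language_variety: str = ""
    speaker_demographics: str = ""

    def missing_fields(self):
        """Names of fields that were left empty."""
        return [name for name, value in vars(self).items() if not value.strip()]

# Made-up example values, for illustration only.
sheet = CorpusDatasheet(
    motivation="Course project on code switching",
    situation="Public social media posts",
    collection_process="Random subsample; usernames replaced with @username",
)
print(sheet.missing_fields())
# ['annotation_process', 'language_variety', 'speaker_demographics']
```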