Excerpts:
The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched, like "other" or "there", and false negatives, strings that we incorrectly missed, like "The". Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:
- Increasing precision (minimizing false positives)
- Increasing recall (minimizing false negatives)
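
To make the two error types concrete, here is a minimal sketch that scores a naive pattern for the word "the" against a stricter one. The sample sentence, the helper names (`match_spans`, `precision_recall`), and the choice to treat the word-boundary pattern as the gold standard are all assumptions made for illustration, not part of the excerpt.

```python
import re


def match_spans(pattern, text):
    """Return the set of (start, end) spans where pattern matches text."""
    return {m.span() for m in re.finditer(pattern, text)}


def precision_recall(predicted, gold):
    """Return (precision, recall) for a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample sentence chosen to trigger both kinds of error.
text = "The other day the cat sat there by the door."

# Assumed gold standard: "the" or "The" as a whole word.
gold = match_spans(r"\b[tT]he\b", text)

# Naive pattern: also matches inside "other" and "there" (false positives)
# and misses the capitalized "The" (false negative).
naive = match_spans(r"the", text)

precision, recall = precision_recall(naive, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.50 recall=0.67
```

Here the naive pattern finds four spans, of which two are real instances of the word, so precision is 2/4; it recovers two of the three real instances, so recall is 2/3.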
Before almost any natural language processing of a text, the text has to be normalized. At least three tasks are commonly applied as part of any normalization process:
- Segmenting or tokenizing words from running text
- Normalizing word formats
- Segmenting sentences in running text
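
The three tasks above can be sketched with plain regular expressions. This is a deliberately naive illustration, assuming English text, case folding as the only word-format normalization, and sentence boundaries at ".", "!" or "?" followed by whitespace; abbreviations such as "Dr." would be split incorrectly. The function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naively split text after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into words (keeping internal apostrophes) and punctuation."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


def normalize(token):
    """Normalize word format by case folding only (an assumed simplification)."""
    return token.lower()


text = "The cat sat. It didn't move!"

sentences = split_sentences(text)
tokens = [[normalize(t) for t in tokenize(s)] for s in sentences]

print(sentences)
# prints: ['The cat sat.', "It didn't move!"]
print(tokens)
# prints: [['the', 'cat', 'sat', '.'], ['it', "didn't", 'move', '!']]
```

Segmenting sentences first and tokenizing second is only one possible ordering; it is used here because it keeps each function small.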
Noun
- corpus (plural: corpora)
- ut