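I've recently been reading about text modeling: given a large chunk of text, how do you model it?
Take a MeTA configuration as an example: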
[[analyzers]]
method = "ngram-word"
ngram = 1
[[analyzers.filter]]
type = "whitespace-tokenizer"
[[analyzers.filter]]
type = "lowercase"
[[analyzers.filter]]
type = "alpha"
[[analyzers.filter]]
type = "length"
min = 2
max = 35
[[analyzers.filter]]
type = "list"
file = "lemur-stopwords.txt"
[[analyzers.filter]]
type = "porter2-stemmer"
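
To make the filter chain above more concrete, here is a minimal Python sketch that imitates each step in the same order. It is not MeTA's implementation, only an approximation under stated assumptions: `STOPWORDS` is a tiny hypothetical stand-in for lemur-stopwords.txt, `toy_stem` is a crude placeholder rather than the Porter2 algorithm, and the "alpha" step is assumed to strip non-letter characters from each token. The names `analyze`, `toy_stem`, and `STOPWORDS` are made up for illustration.

```python
import re

# Hypothetical stand-in for lemur-stopwords.txt; the real list is much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}


def toy_stem(token):
    """Crude suffix stripper. A placeholder only, NOT the Porter2 algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, min_len=2, max_len=35):
    """Return unigram counts after a filter chain modeled on the config above."""
    counts = {}
    for token in text.split():                      # whitespace-tokenizer
        token = token.lower()                       # lowercase
        token = re.sub(r"[^a-z]", "", token)        # alpha (assumed: drop non-letters, ASCII only)
        if not (min_len <= len(token) <= max_len):  # length, min = 2, max = 35
            continue
        if token in STOPWORDS:                      # list (stopword removal)
            continue
        token = toy_stem(token)                     # stands in for porter2-stemmer
        counts[token] = counts.get(token, 0) + 1    # ngram = 1: count single words
    return counts


print(analyze("The cats are chasing a cat!"))  # {'cat': 2, 'chas': 1}
```

The resulting counts are a bag-of-words representation of the text: "cats" and "cat!" collapse into the same term, while "The", "are", and "a" are dropped by the stopword and length filters.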
The configuration above tells MeTA how to process the text before indexing the documents. "ngram = 1" configures MeTA to use unigrams (single words). Each "[[analyzers.filter]]" entry defines a text filter that applies a specific transformation to the text. T