(1) Download and installation
- Download: https://download.csdn.net/download/cyinfi/10299520
- Extract: tar -zxvf srilm-1.7.2.tar.gz
- Enter the directory: srilm-1.7.2
- Build: make World (note: point the build at the right directory by editing the Makefile and adding the line SRILM = $(PWD))
- Test: make test
If the output contains lines like the following, the build succeeded (a consolidated session is sketched after the output):
fngram-count: stdout output IDENTICAL.
fngram-count: stderr output IDENTICAL.
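Putting the steps together, a complete session might look like this minimal sketch; the archive name and the bin/i686-m64 machine-type directory are assumptions that depend on your download and platform:

tar -zxvf srilm-1.7.2.tar.gz
cd srilm-1.7.2
# edit the Makefile near the top, adding: SRILM = $(PWD)
make World                             # builds the libraries and command-line tools
make test                              # diffs tool output against references; "IDENTICAL" means pass
export PATH=$PATH:$PWD/bin/i686-m64    # assumed machine type; check bin/ for the actual name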
(2) Option reference (a combined usage example follows the list)
-version: print version information
-order: max ngram order
Default value: 3
-debug: debugging level for lm
Default value: 0
-skipoovs: skip n-gram contexts containing OOVs
-df: use disfluency ngram model
-tagged: use a tagged LM
-factored: use a factored LM
-skip: use skip ngram model
-hiddens: use hidden sentence ngram model
-hidden-vocab: hidden ngram vocabulary
-hidden-not: process overt hidden events
-classes: class definitions
-simple-classes: use unique class model
-expand-classes: expand class-model into word-model
Default value: -1
-expand-exact: compute expanded ngrams longer than this exactly
Default value: 0
-stop-words: stop-word vocabulary for stop-Ngram LM
-decipher: use bigram model exactly as recognizer
-unk: vocabulary contains unknown word tag
-nonull: remove <NULL> in LM
-map-unk: word to map unknown words to
-zeroprob-word: word to back off to for zero probs
-tolower: map vocabulary to lowercase
-multiwords: split multiwords for LM evaluation
-ppl: text file to compute perplexity from
-text-has-weights: text file contains sentence weights
-escape: escape prefix to pass data through -ppl
-counts: count file to compute perplexity from
-counts-entropy: compute entropy (not perplexity) from counts
-count-order: max count order used by -counts
Default value: 0
-float-counts: use fractional -counts
-use-server: port@host to use as LM server
-cache-served-ngrams: enable client side caching
-server-port: port to listen on as probability server
Default value: 0
-server-maxclients: maximum number of simultaneous server clients
Default value: 0
-gen: number of random sentences to generate
Default value: 0
-gen-prefixes: file of prefixes to generate sentences
-seed: seed for randomization
Default value: 1521617620
-vocab: vocab file
-vocab-aliases: vocab alias file
-nonevents: non-event vocabulary
-limit-vocab: limit LM reading to specified vocabulary
-codebook: codebook for quantized LM parameters
-write-codebook: output codebook (for validation)
-write-with-codebook: write ngram LM using codebook
-quantize: quantize ngram LM using specified number of bins
Default value: 0
-lm: file in ARPA LM format
-bayes: context length for Bayes mixture LM
Default value: 4294967295
-bayes-scale: log likelihood scale for -bayes
Default value: 1
-mix-lm: LM to mix in
-lambda: mixture weight for -lm
Default value: 0.5
-mix-lm2: second LM to mix in
-mix-lambda2: mixture weight for -mix-lm2
Default value: 0
-mix-lm3: third LM to mix in
-mix-lambda3: mixture weight for -mix-lm3
Default value: 0
-mix-lm4: fourth LM to mix in
-mix-lambda4: mixture weight for -mix-lm4
Default value: 0
-mix-lm5: fifth LM to mix in
-mix-lambda5: mixture weight for -mix-lm5
Default value: 0
-mix-lm6: sixth LM to mix in
-mix-lambda6: mixture weight for -mix-lm6
Default value: 0
-mix-lm7: seventh LM to mix in
-mix-lambda7: mixture weight for -mix-lm7
Default value: 0
-mix-lm8: eighth LM to mix in
-mix-lambda8: mixture weight for -mix-lm8
Default value: 0
-mix-lm9: ninth LM to mix in
-mix-lambda9: mixture weight for -mix-lm9
Default value: 0
-context-priors: context-dependent mixture weights file
-loglinear-mix: use log-linear mixture LM
-read-mix-lms: read mixture LMs from -lm file
-maxent: Read a maximum entropy model
-mix-maxent: Mixed LMs in the interpolation scheme are maximum entropy models
-maxent-convert-to-arpa: Convert maxent model to backoff model
-null: use a null language model
-cache: history length for cache language model
Default value: 0
-cache-lambda: interpolation weight for -cache
Default value: 0.05
-dynamic: interpolate with a dynamic lm
-hmm: use HMM of n-grams model
-count-lm: use a count-based LM
-msweb-lm: use Microsoft Web LM
-adapt-mix: use adaptive mixture of n-grams model
-adapt-decay: history likelihood decay factor
Default value: 1
-adapt-iters: EM iterations for adaptive mix
Default value: 2
-adapt-marginals: unigram marginals to adapt base LM to
-base-marginals: unigram marginals of base LM
-adapt-marginals-beta: marginals adaptation weight
Default value: 0.5
-adapt-marginals-ratios: compute ratios between marginals-adapted and base probs
-dynamic-lambda: interpolation weight for -dynamic
Default value: 0.05
-reverse: reverse words
-no-sos: don't insert start-of-sentence tokens
-no-eos: don't insert end-of-sentence tokens
-rescore-ngram: recompute probs in N-gram LM
-write-lm: re-write LM to file
-write-bin-lm: write LM to file in binary format
-write-oldbin-lm: write LM to file in old binary format
-write-vocab: write LM vocab to file
-renorm: renormalize backoff weights
-prune: prune redundant probs
Default value: 0
-minprune: prune only ngrams at least this long
Default value: 2
-prune-lowprobs: prune low probability N-grams
-prune-history-lm: LM used for history probabilities in pruning
-memuse: show memory usage
-nbest: nbest list file to rescore
-nbest-files: list of N-best filenames
-split-multiwords: split multiwords in N-best lists
-multi-char: multiword component delimiter
Default value: "_"
-write-nbest-dir: output directory for N-best rescoring
-decipher-nbest: output Decipher n-best format
-max-nbest: maximum number of hyps to consider
Default value: 0
-no-reorder: don't reorder N-best hyps after rescoring
-rescore: hyp stream input file to rescore
-decipher-lm: DECIPHER(TM) LM for nbest list generation
-decipher-order: ngram order for -decipher-lm
Default value: 2
-decipher-nobackoff: disable backoff hack in recognizer LM
-decipher-lmw: DECIPHER(TM) LM weight
Default value: 8
-decipher-wtw: DECIPHER(TM) word transition weight
Default value: 0
-rescore-lmw: rescoring LM weight
Default value: 8
-rescore-wtw: rescoring word transition weight
Default value: 0
-noise: noise tag to skip
-noise-vocab: noise vocabulary to skip
-help: Print this message
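As one example of combining the options above, the sketch below statically interpolates two ARPA-format models and writes out the merged result; lm1.arpa, lm2.arpa, mixed.arpa, and test.txt are placeholder file names, and -lambda is the weight given to the -lm model (0.5 by default):

ngram -order 3 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.7 -write-lm mixed.arpa
ngram -order 3 -lm mixed.arpa -ppl test.txt    # evaluate the merged model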
(3) Common commands (run them from the binary subdirectory under the lm directory, or add that directory to your PATH)
Generate the count file: ngram-count -text train.txt -order 3 -write train.txt.count
Build the language model: ngram-count -read train.txt.count -order 3 -lm LM -interpolate -kndiscount
Compute perplexity: ngram -ppl test.txt -order 3 -lm LM > result
Score each sentence individually: ngram -ppl test.txt -order 3 -lm LM -debug 1 > result (a sanity-check sketch follows)
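Once the model is built, the options from section (2) apply directly; for instance, a quick sanity check is to sample random sentences from the LM (a minimal sketch, assuming the model file is named LM as above; output goes to stdout):

ngram -lm LM -order 3 -gen 10    # print 10 sentences randomly generated from the LM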
Reference: https://blog.csdn.net/u011500062/article/details/50780935