Fisher Language Model


<bibimport />

The design choices behind the various language models built on the Fisher Corpus are described below.


The scripts, data and generated language models

The implementation uses the SRILM toolkit, and some of the scripts are taken from this gigaword LM recipe. The main script is go_all.sh, and supporting scripts are here.

The data and models are here.
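
For reference, the core of the build looks roughly like this. It is only a minimal sketch of the steps in go_all.sh, not the script itself; the file names (train.txt, vocab.10k, dev.txt) and the exact option settings are assumptions.

# count n-grams once and reuse the counts for all models
ngram-count -order 4 -text train.txt -write counts.4gram.gz

# 3-gram model with modified Kneser-Ney smoothing, with a fixed vocab and OOV words mapped to <unk>
ngram-count -order 3 -read counts.4gram.gz -vocab vocab.10k -unk -kndiscount -interpolate -lm lm.kn.3gram.gz

# Good-Turing variant: simply drop the discounting options
ngram-count -order 3 -read counts.4gram.gz -vocab vocab.10k -unk -lm lm.gt.3gram.gz

# evaluate on the dev set; the cross-entropy in bits is log2 of the reported ppl
ngram -order 3 -unk -lm lm.kn.3gram.gz -ppl dev.txt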

Source Text

The language model is built using only the repaired training data transcriptions, where each unique partial word is treated as a separate word.

N-gram order

2-gram, 3-gram and 4-gram models are built. 5-grams and higher orders are not worth the extra computational resources they require in an ASR system: going from 4-grams to 5-grams lowers entropy by only about 0.06 bits <bib id='Goodman2001abit' />, and essentially nothing is gained beyond that.

Vocab Sizes

The generated LMs use only the N most frequent words, mapping the rest to the UNK token, where N is:

N Comparable vocab sizes
500 SVitchboard vocab size used by the JHU06 workshop
1000
5000
10000 Similar to the Switchboard vocab size used by the JANUS speech recognition group (9800)
20k WSJ/NAB vocab size used in the 1995 ARPA continuous speech evaluation
70957 All repaired words in the training data.

Which of these sizes will actually be used in the recognizer remains to be seen.
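
One way to produce these cut-down vocabularies is to sort the unigram counts and keep the top N words. This is only a sketch with assumed file names, not necessarily what go_all.sh does:

# write word frequencies from the training text (one "word count" pair per line)
ngram-count -order 1 -text train.txt -write1 unigram.counts

# keep the 10000 most frequent words; the sentence-boundary tags <s> and </s>
# may need special handling depending on how the counts were produced
sort -k2,2 -nr unigram.counts | head -n 10000 | awk '{print $1}' > vocab.10k

# the resulting list is then passed to ngram-count via -vocab, with -unk mapping everything else to UNK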


Some experiments

The model quality is reported as the cross-entropy per word: the log probability (base 2) that the model estimated from the training data assigns to the test data, negated and divided by the number of tokens in the test corpus. See Perplexity for more details. The percentage of tokens in the test data that do not occur in the vocabulary (out-of-vocabulary %) is also reported.
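
Concretely, for test tokens <math>w_1,\dots,w_N</math> and an n-gram model <math>p</math> estimated from the training data, the reported cross-entropy is

<math>H = -\frac{1}{N}\sum_{i=1}^{N}\log_2 p(w_i \mid w_{i-n+1},\dots,w_{i-1})</math>

and the perplexity is <math>2^H</math>.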

The experiments consider:

  • the n-gram order (2, 3, or 4)
  • the vocab size (500, 1k, 5k, 10k, 20k, all)
  • choice of smoothing (modified Kneser-Ney or Good-Turing)
  • using entropy pruning or not
The cross-entropy (bits per word) and out-of-vocabulary (OOV) percentages (max is 100%) on the dev and test sets. The cross-entropy values can be compared with those reported in <bib id='Chen1998empirical' />.
n-gram order vocab total ngrams smoothing pruning dev entropy dev OOV % test entropy test OOV %
2-gram 500 127889 GT no-pr 5.802 15.461 5.858 15.093
2-gram 500 60296 GT pr 5.845 15.461 5.903 15.093
2-gram 500 127889 KN no-pr 5.801 15.461 5.857 15.093
2-gram 500 70752 KN pr 6.142 15.461 6.204 15.093
2-gram 1k 266359 GT no-pr 6.247 10.013 6.288 9.806
2-gram 1k 100363 GT pr 6.29 10.013 6.332 9.806
2-gram 1k 266359 KN no-pr 6.243 10.013 6.284 9.806
2-gram 1k 121297 KN pr 6.452 10.013 6.503 9.806
2-gram 5k 781554 GT no-pr 6.922 3.117 6.956 2.917
2-gram 5k 196611 GT pr 6.96 3.117 6.992 2.917
2-gram 5k 781554 KN no-pr 6.908 3.117 6.943 2.917
2-gram 5k 265487 KN pr 6.942 3.117 6.976 2.917
2-gram 10k 1013070 GT no-pr 7.098 1.663 7.121 1.526
2-gram 10k 224037 GT pr 7.146 1.663 7.168 1.526
2-gram 10k 1013070 KN no-pr 7.079 1.663 7.105 1.526
2-gram 10k 327131 KN pr 7.125 1.663 7.149 1.526
2-gram 20k 1196182 GT no-pr 7.214 0.81 7.228 0.748
2-gram 20k 243371 GT pr 7.269 0.81 7.281 0.748
2-gram 20k 1196182 KN no-pr 7.193 0.81 7.209 0.748
2-gram 20k 387048 KN pr 7.247 0.81 7.261 0.748
2-gram all 1413153 GT no-pr 7.331 0.271 7.338 0.254
2-gram all 292708 GT pr 7.389 0.271 7.392 0.254
2-gram all 1413153 KN no-pr 7.315 0.271 7.323 0.254
2-gram all 503370 KN pr 7.371 0.271 7.377 0.254
3-gram 500 919391 GT no-pr 5.463 15.461 5.516 15.093
3-gram 500 201453 GT pr 5.496 15.461 5.548 15.093
3-gram 500 1844779 KN no-pr 5.453 15.461 5.507 15.093
3-gram 500 456598 KN pr 5.567 15.461 5.62 15.093
3-gram 1k 1309423 GT no-pr 5.878 10.013 5.915 9.806
3-gram 1k 267219 GT pr 5.926 10.013 5.962 9.806
3-gram 1k 2902901 KN no-pr 5.86 10.013 5.899 9.806
3-gram 1k 544497 KN pr 5.947 10.013 5.989 9.806
3-gram 5k 2098252 GT no-pr 6.543 3.117 6.573 2.917
3-gram 5k 375374 GT pr 6.618 3.117 6.645 2.917
3-gram 5k 5251482 KN no-pr 6.511 3.117 6.544 2.917
3-gram 5k 673910 KN pr 6.632 3.117 6.67 2.917
3-gram 10k 2342585 GT no-pr 6.722 1.663 6.741 1.526
3-gram 10k 400424 GT pr 6.809 1.663 6.824 1.526
3-gram 10k 5934715 KN no-pr 6.686 1.663 6.709 1.526
3-gram 10k 716844 KN pr 6.819 1.663 6.846 1.526
3-gram 20k 2516904 GT no-pr 6.842 0.81 6.851 0.748
3-gram 20k 417045 GT pr 6.935 0.81 6.94 0.748
3-gram 20k 6366532 KN no-pr 6.803 0.81 6.816 0.748
3-gram 20k 760104 KN pr 6.944 0.81 6.961 0.748
3-gram all 2713750 GT no-pr 6.962 0.271 6.963 0.254
3-gram all 465498 GT pr 7.057 0.271 7.053 0.254
3-gram all 6737974 KN no-pr 6.928 0.271 6.933 0.254
3-gram all 858818 KN pr 7.068 0.271 7.077 0.254
4-gram 500 2456011 GT no-pr 5.401 15.461 5.455 15.093
4-gram 500 264641 GT pr 5.433 15.461 5.486 15.093
4-gram 500 7576736 KN no-pr 5.398 15.461 5.453 15.093
4-gram 500 569016 KN pr 5.695 15.461 5.75 15.093
4-gram 1k 2899330 GT no-pr 5.822 10.013 5.859 9.806
4-gram 1k 326234 GT pr 5.869 10.013 5.905 9.806
4-gram 1k 10204831 KN no-pr 5.811 10.013 5.85 9.806
4-gram 1k 602069 KN pr 6.048 10.013 6.096 9.806
4-gram 5k 3534140 GT no-pr 6.492 3.117 6.521 2.917
4-gram 5k 425285 GT pr 6.57 3.117 6.596 2.917
4-gram 5k 14591103 KN no-pr 6.466 3.117 6.499 2.917
4-gram 5k 681505 KN pr 6.681 3.117 6.723 2.917
4-gram 10k 3717947 GT no-pr 6.673 1.663 6.691 1.526
4-gram 10k 448768 GT pr 6.762 1.663 6.777 1.526
4-gram 10k 15603196 KN no-pr 6.642 1.663 6.664 1.526
4-gram 10k 719912 KN pr 6.864 1.663 6.897 1.526
4-gram 20k 3856702 GT no-pr 6.793 0.81 6.801 0.748
4-gram 20k 464271 GT pr 6.889 0.81 6.894 0.748
4-gram 20k 16181859 KN no-pr 6.759 0.81 6.771 0.748
4-gram 20k 762636 KN pr 6.987 0.81 7.01 0.748
4-gram all 4031432 GT no-pr 6.913 0.271 6.913 0.254
4-gram all 512518 GT pr 7.011 0.271 7.006 0.254
4-gram all 16621480 KN no-pr 6.884 0.271 6.889 0.254
4-gram all 862439 KN pr 7.109 0.271 7.124 0.254


Same thing, but disallowing UNK and word fragments

The cross-entropy (bits per word) and out-of-vocabulary (OOV) percentages (max is 100%) on the dev and test sets. The cross-entropy values can be compared with those reported in <bib id='Chen1998empirical' />.
n-gram order vocab total ngrams smoothing pruning dev entropy dev OOV % test entropy test OOV %
2-gram 500 126891 GT no-pr 6.219 15.461 6.266 15.093
2-gram 500 64134 GT pr 6.224 15.461 6.271 15.093
2-gram 500 126891 KN no-pr 6.442 15.461 6.506 15.093
2-gram 500 72315 KN pr 6.445 15.461 6.509 15.093
2-gram 1k 264166 GT no-pr 6.490 10.044 6.522 9.835
2-gram 1k 106118 GT pr 6.502 10.044 6.534 9.835
2-gram 1k 264166 KN no-pr 6.621 10.044 6.667 9.835
2-gram 1k 123634 KN pr 6.629 10.044 6.675 9.835
2-gram 5k 758743 GT no-pr 6.935 3.530 6.967 3.269
2-gram 5k 197912 GT pr 6.972 3.530 7.002 3.269
2-gram 5k 758743 KN no-pr 6.954 3.530 6.988 3.269
2-gram 5k 264873 KN pr 6.987 3.530 7.021 3.269
2-gram 10k 972903 GT no-pr 7.068 2.222 7.093 2.010
2-gram 10k 222968 GT pr 7.114 2.222 7.138 2.010
2-gram 10k 972903 KN no-pr 7.068 2.222 7.095 2.010
2-gram 10k 324605 KN pr 7.112 2.222 7.138 2.010
2-gram 20k 1137401 GT no-pr 7.162 1.490 7.180 1.341
2-gram 20k 240038 GT pr 7.214 1.490 7.231 1.341
2-gram 20k 1137401 KN no-pr 7.152 1.490 7.172 1.341
2-gram 20k 382880 KN pr 7.203 1.490 7.222 1.341
2-gram all 1283056 GT no-pr 7.221 1.138 7.235 1.020
2-gram all 271116 GT pr 7.275 1.138 7.287 1.020
2-gram all 1283056 KN no-pr 7.207 1.138 7.223 1.020
2-gram all 458278 KN pr 7.262 1.138 7.276 1.020
3-gram 500 786161 GT no-pr 5.894 15.461 5.939 15.093
3-gram 500 227568 GT pr 5.915 15.461 5.958 15.093
3-gram 500 1639004 KN no-pr 6.181 15.461 6.244 15.093
3-gram 500 442809 KN pr 6.237 15.461 6.302 15.093
3-gram 1k 1155260 GT no-pr 6.131 10.044 6.159 9.835
3-gram 1k 289463 GT pr 6.167 10.044 6.193 9.835
3-gram 1k 2615123 KN no-pr 6.285 10.044 6.329 9.835
3-gram 1k 534098 KN pr 6.352 10.044 6.399 9.835
3-gram 5k 1971240 GT no-pr 6.556 3.530 6.584 3.269
3-gram 5k 381235 GT pr 6.628 3.530 6.653 3.269
3-gram 5k 4897039 KN no-pr 6.562 3.530 6.594 3.269
3-gram 5k 667339 KN pr 6.692 3.530 6.728 3.269
3-gram 10k 2232977 GT no-pr 6.690 2.222 6.711 2.010
3-gram 10k 402731 GT pr 6.773 2.222 6.791 2.010
3-gram 10k 5602110 KN no-pr 6.674 2.222 6.698 2.010
3-gram 10k 712123 KN pr 6.806 2.222 6.833 2.010
3-gram 20k 2414260 GT no-pr 6.786 1.490 6.800 1.341
3-gram 20k 417236 GT pr 6.876 1.490 6.885 1.341
3-gram 20k 6052980 KN no-pr 6.759 1.490 6.777 1.341
3-gram 20k 756645 KN pr 6.899 1.490 6.919 1.341
3-gram all 2563519 GT no-pr 6.847 1.138 6.856 1.020
3-gram all 446877 GT pr 6.938 1.138 6.943 1.020
3-gram all 6366489 KN no-pr 6.816 1.138 6.828 1.020
3-gram all 819805 KN pr 6.959 1.138 6.974 1.020
4-gram 500 1686580 GT no-pr 5.849 15.461 5.894 15.093
4-gram 500 304801 GT pr 5.868 15.461 5.911 15.093
4-gram 500 5508209 KN no-pr 6.155 15.461 6.219 15.093
4-gram 500 521968 KN pr 6.276 15.461 6.343 15.093
4-gram 1k 2250207 GT no-pr 6.082 10.044 6.110 9.835
4-gram 1k 356969 GT pr 6.118 10.044 6.143 9.835
4-gram 1k 8097795 KN no-pr 6.245 10.044 6.290 9.835
4-gram 1k 577415 KN pr 6.399 10.044 6.450 9.835
4-gram 5k 3250889 GT no-pr 6.505 3.530 6.533 3.269
4-gram 5k 433381 GT pr 6.580 3.530 6.604 3.269
4-gram 5k 13193700 KN no-pr 6.515 3.530 6.547 3.269
4-gram 5k 675976 KN pr 6.731 3.530 6.772 3.269
4-gram 10k 3532281 GT no-pr 6.639 2.222 6.660 2.010
4-gram 10k 452650 GT pr 6.726 2.222 6.743 2.010
4-gram 10k 14564693 KN no-pr 6.627 2.222 6.651 2.010
4-gram 10k 718111 KN pr 6.848 2.222 6.882 2.010
4-gram 20k 3719834 GT no-pr 6.736 1.490 6.749 1.341
4-gram 20k 466285 GT pr 6.829 1.490 6.838 1.341
4-gram 20k 15383234 KN no-pr 6.713 1.490 6.730 1.341
4-gram 20k 761847 KN pr 6.941 1.490 6.967 1.341
4-gram all 3870394 GT no-pr 6.796 1.138 6.805 1.020
4-gram all 495306 GT pr 6.891 1.138 6.895 1.020
4-gram all 15902348 KN no-pr 6.769 1.138 6.782 1.020
4-gram all 825449 KN pr 7.000 1.138 7.021 1.020

Conclusions

Smoothing Alone

Modified Kneser-Ney beats Good-Turing across all n-gram orders and vocab sizes, confirming the results in <bib id='Chen1998empirical' />.

Smoothing and Pruning

To make the language model smaller, it is common to drop n-grams which occur fewer than k times (count cutoffs). Alternatively, one can drop n-grams whose removal increases the entropy of the model by less than some threshold, as described in <bib id='Stolcke1998entropy-based' />.
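
In SRILM terms the two options look roughly like this; the file names and thresholds below are placeholders, not the settings used for the tables above:

# count cutoff: ignore 3-grams seen fewer than 2 times and 4-grams seen fewer than 3 times
ngram-count -order 4 -read counts.4gram.gz -vocab vocab.10k -unk -gt3min 2 -gt4min 3 -lm lm.cutoff.gz

# entropy pruning: drop n-grams whose removal increases the model entropy by less than the threshold
ngram -order 4 -unk -lm lm.full.gz -prune 1e-8 -write-lm lm.pruned.gz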

Good-Turing smoothing with entropy pruning generally outperforms modified Kneser-Ney smoothing with entropy pruning. In the experiments above, this holds for 3- and 4-gram models, despite the GT models having many fewer n-grams. For pruned 2-gram models, the perplexity is about the same for GT and KN smoothing, but the GT models have many fewer n-grams. This is consistent with the findings in <bib id='Siivola2007On' />.

Which model should we use?

Clearly, we want GT smoothing with pruning, or KN without pruning. Beyond this, we have to somehow estimate the LM's effect on the overall WER, and make some guess at compute time as a function of n-gram order, number of n-grams and vocab size. I don't know any good ways to make these estimates, but roughly:

  • From <bib id='Goodman2001abit' /> we can estimate that a 1-bit improvement in the LM yields a WER improvement of somewhere around 1.5% at a starting WER of around 9% (Fig. 14 of <bib id='Goodman2001abit' />) and around 4% at starting WERs in the range of 35% to 52% (Fig. 15 of <bib id='Goodman2001abit' />).
  • We are guessing our starting WER will probably be somewhere around 40%-50%.
  • The WER is at least the OOV rate, and probably higher, because an OOV word hurts the recognition of the neighboring words, and perhaps of the entire utterance.
  • We don't know the effect on WER of the increased confusability that comes with a larger vocab, so assume it doesn't exist.

So a really simple model of WER as a function of entropy H is: <math>\mathrm{WER}(H) = \mathrm{startingWER} + \mathrm{OOVrate} + 4H - \mathrm{offset}</math>. Plotting this against the number of n-grams gives the plot WER vs. number of n-grams.
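
For example, taking the pruned GT 3-gram numbers from the first table, moving from the 5k to the 10k vocab changes the predicted WER by

<math>\Delta \mathrm{WER} \approx 4\,(6.809 - 6.618) + (1.663 - 3.117) \approx -0.7,</math>

i.e. under this crude model the 10k vocab is about 0.7% absolute better despite its higher entropy.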

The rough horizontal stripes in the plot correspond to different vocab sizes, with the 500-word vocab (and its corresponding 15% OOV) at the top. If we included the confusability-of-large-vocab effect, the stripes would move up, with the bottom stripes moving up more.

The relationship between the number of n-grams and the computational cost (i.e. how likely we are to retain the correct hypothesis in the beam) is no easier to estimate, so we might as well consider options to the left of the knee in the curve (i.e. fewer than 500,000 n-grams).

So, to me a decent choice looks like a GT-smoothed, entropy-pruned 2-gram or 3-gram model with a 5k or 10k vocab. Will it work? Who knows...

Do not read: Rantings, and leftovers

Differences between ngram-count and make-big-lm

There is some subtle difference between ngram-count and make-big-lm, which I cannot track down.

make-big-lm -name .big -order 3 -sort -read ./LM/counts/ngrams -kndiscount -lm try -unk -debug 1

and

ngram-count -order 3 -read LM/counts/ngrams -kn1 LM/counts/kn1-3.txt -kn2 LM/counts/kn2-3.txt -kn3 LM/counts/kn3-3.txt -kn4 LM/counts/kn4-3.txt -kn5 LM/counts/kn5-3.txt -kn6 LM/counts/kn6-3.txt -kn7 LM/counts/kn7-3.txt -kn8 LM/counts/kn8-3.txt -kn9 LM/counts/kn9-3.txt -kndiscount1 -kndiscount2 -lm try -interpolate -debug 1

give different results, even though the KN discounts are identical in both cases above.

ngram-count seems to do better on vocab sizes < 20k, and make-big-lm is slightly better on the full vocab.

I suspect it has something to do with make-big-lm allocating 0.05 of the total probability to the unk token for some reason, while ngram-count allocates almost nothing to it.
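
A quick way to check this is to compare the unigram entries for <unk> in the two ARPA files (the file names here are placeholders); the first field of a unigram line is the log10 probability, and since the unigram section comes first in an ARPA file, the first match is the one of interest:

grep '<unk>' try.makebiglm | head -1
grep '<unk>' try.ngramcount | head -1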

Compare the following table against the models in the #Model Quality section.

The cross-entropy (bits per word) and out-of-vocabulary (OOV) percentages (max is 100%) on the dev and test sets, for the language models generated by the original gigaword recipe script. The cross-entropy values can be compared with those reported in <bib id='Chen1998empirical' />.
n-gram order vocab ngram counts (1-gram/2-gram/3-gram) dev entropy dev OOV % test entropy test OOV %
2-gram 5k 5000/580572/0 6.924 3.117 6.958 2.917
2-gram 20k 20000/980894/0 7.200 0.810 7.217 0.748
2-gram all 70957/1180725/0 7.286 0.271 7.297 0.254
3-gram 5k 5000/776554/3234773 6.530 3.117 6.566 2.917
3-gram 20k 20000/1176182/4038261 6.809 0.810 6.827 0.748
3-gram all 70957/1342196/4201973 6.898 0.271 6.910 0.254

Caching, Clustering and all that

Not worth trying, since trigrams are already fairly space-efficient (see Section 11.2, "All hope abandon, ye who enter here", in <bib id='Goodman2001abit' />).

In particular, they have this to say about the best smoothing strategy they themselves have developed:

"Kneser-Ney smoothing leads to improvements in theory, but in practice, most language models are built with high count cutoffs, to conserve space, and speed the search; with high count cutoffs, smoothing doesn’t matter."

References

<references/>

<bibprint />
