Fisher Language Model

The design choices behind the various language models built on the Fisher Corpus are described below. The implementation uses the SRILM toolkit, and some of the scripts are taken from this gigaword LM recipe. The main script is go_all.sh, and supporting scripts are here.

Source Text

The language model is built using only the repaired training data transcriptions, where each unique partial word is treated as a separate word.

N-gram order

2-gram, 3-gram, and 4-gram models are built. 4-grams and higher do not yield significant improvements and are not worth the extra computational resources they require in an ASR system, and essentially nothing is gained beyond 5-grams (0.06 bits lower entropy going from 4-grams to 5-grams<ref name="goodman2001abit"/>).

Vocab Sizes

The generated LMs use only the N most frequent words, mapping all other words to the UNK token, where N is:

{| class="wikitable"
! N !! Comparable vocab sizes
|-
| 500 || Svitchboard vocab size used by JHU06 workshop
|-
| 1000 ||
|-
| 5000 ||
|-
| 10000 || Similar to Switchboard vocab size used by JANUS speech recognition group (9800)
|-
| 20k || WSJ/NAB vocab size used in the 1995 ARPA continuous speech evaluation
|-
| 70957 || All repaired words in the training data
|}

Which of these sizes will actually be used in the recognizer remains to be seen.
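
A minimal sketch of how such a restricted vocabulary can be prepared with SRILM and standard Unix tools is given below. The file names train.txt and vocab.5k are illustrative (only the counts path LM/counts/ngrams appears in the commands further down); the authoritative steps are in go_all.sh.

 # Count unigrams over the repaired training transcriptions (file name illustrative).
 ngram-count -order 1 -text train.txt -write counts.1gram
 # Keep the 5000 most frequent words as the LM vocabulary.
 sort -k2,2 -n -r counts.1gram | head -n 5000 | cut -f1 > vocab.5k
 # Count higher-order n-grams with the restricted vocabulary; training words
 # outside vocab.5k are replaced by the <unk> token.
 ngram-count -order 3 -text train.txt -vocab vocab.5k -unk -write LM/counts/ngrams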

Smoothing

Modified Kneser-Ney is used, since it performs best across a variety of n-gram counts and training corpus sizes<ref name="chen1998empirical">[http://research.microsoft.com/~joshuago/tr-10-98.pdf Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical report TR-10-98, Harvard University, August 1998.]</ref>.
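
In SRILM terms this corresponds to the -kndiscount and -interpolate options of ngram-count. A sketch, reusing the counts path from the commands later on; the output name fisher.3gram.lm is illustrative:

 # Estimate an interpolated modified Kneser-Ney trigram LM from existing counts
 # (-ukndiscount would select the original, unmodified Kneser-Ney discounting).
 ngram-count -order 3 -read LM/counts/ngrams \
   -kndiscount -interpolate -unk -lm fisher.3gram.lm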

Some experiments

Model Quality

The model quality is reported as the cross-entropy per word: the negative log (base 2) probability that the model estimated from the training data assigns to the test data, divided by the number of tokens in the test corpus. See [http://en.wikipedia.org/wiki/Perplexity Perplexity] for more details. The percentage of tokens in the test data that do not occur in the vocabulary (out-of-vocabulary %) is also reported.
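
Written out, this is the standard per-word cross-entropy of a test corpus <math>w_1 \ldots w_N</math> under an n-gram model <math>p</math>, and perplexity is simply its exponential:

<math>H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 p(w_i \mid w_{i-n+1},\ldots,w_{i-1}), \qquad \mathrm{perplexity} = 2^{H}</math>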

The following table shows the cross-entropy of models with no n-gram pruning at all: the only pruning that takes place is the vocabulary restriction.
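
Entropies of this kind can be obtained with SRILM's ngram tool. A sketch, assuming the model and the dev transcriptions live in files named fisher.3gram.lm and dev.txt (names not taken from the recipe); the cross-entropy in bits per word is log2 of the reported ppl:

 # Score the dev set; ngram prints the number of sentences, words and OOVs,
 # followed by logprob, ppl and ppl1.  Whether OOVs are scored via <unk> (add
 # -unk) or excluded from the entropy is determined by go_all.sh.
 ngram -order 3 -lm fisher.3gram.lm -ppl dev.txt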

{| class="wikitable sortable"
|+ The cross-entropy (bits per word) and out-of-vocabulary (OOV) percentages (max is 100%) on the dev and test sets, for the language models '''with no n-gram pruning'''. The cross entropy values can be compared with those used in <ref name="chen1998empirical"/>.
! n-gram<br>order !! vocab !! ngram<br>count !! dev<br>entropy !! dev<br>OOV % !! test<br>entropy !! test<br>OOV %
|-
| 2-gram || 500 || ngram 1=500<br>ngram 2=127389 || 5.801 || 15.461 || 5.857 || 15.093
|-
| 2-gram || 1k || ngram 1=1000<br>ngram 2=265359 || 6.243 || 10.013 || 6.284 || 9.806
|-
| 2-gram || 5k || ngram 1=5000<br>ngram 2=776554 || 6.908 || 3.117 || 6.943 || 2.917
|-
| 2-gram || 10k || ngram 1=10000<br>ngram 2=1003070 || 7.079 || 1.663 || 7.105 || 1.526
|-
| 2-gram || 20k || ngram 1=20000<br>ngram 2=1176182 || 7.193 || 0.810 || 7.209 || 0.748
|-
| 2-gram || all || ngram 1=70957<br>ngram 2=1342196 || 7.315 || 0.271 || 7.323 || 0.254
|-
| 3-gram || 500 || ngram 1=500<br>ngram 2=127389<br>ngram 3=1716890 || 5.453 || 15.461 || 5.507 || 15.093
|-
| 3-gram || 1k || ngram 1=1000<br>ngram 2=265359<br>ngram 3=2636542 || 5.860 || 10.013 || 5.899 || 9.806
|-
| 3-gram || 5k || ngram 1=5000<br>ngram 2=776554<br>ngram 3=4469928 || 6.511 || 3.117 || 6.544 || 2.917
|-
| 3-gram || 10k || ngram 1=10000<br>ngram 2=1003070<br>ngram 3=4921645 || 6.686 || 1.663 || 6.709 || 1.526
|-
| 3-gram || 20k || ngram 1=20000<br>ngram 2=1176182<br>ngram 3=5170350 || 6.803 || 0.810 || 6.816 || 0.748
|-
| 3-gram || all || ngram 1=70957<br>ngram 2=1342196<br>ngram 3=5324821 || 6.928 || 0.271 || 6.933 || 0.254
|-
| 4-gram || 500 || ngram 1=500<br>ngram 2=127389<br>ngram 3=1716890<br>ngram 4=5731957 || 5.398 || 15.461 || 5.453 || 15.093
|-
| 4-gram || 1k || ngram 1=1000<br>ngram 2=265359<br>ngram 3=2636542<br>ngram 4=7301930 || 5.811 || 10.013 || 5.850 || 9.806
|-
| 4-gram || 5k || ngram 1=5000<br>ngram 2=776554<br>ngram 3=4469928<br>ngram 4=9339621 || 6.466 || 3.117 || 6.499 || 2.917
|-
| 4-gram || 10k || ngram 1=10000<br>ngram 2=1003070<br>ngram 3=4921645<br>ngram 4=9668481 || 6.642 || 1.663 || 6.664 || 1.526
|-
| 4-gram || 20k || ngram 1=20000<br>ngram 2=1176182<br>ngram 3=5170350<br>ngram 4=9815327 || 6.759 || 0.810 || 6.771 || 0.748
|-
| 4-gram || all || ngram 1=70957<br>ngram 2=1342196<br>ngram 3=5324821<br>ngram 4=9883506 || 6.884 || 0.271 || 6.889 || 0.254
|}

Pruning

To make the language model smaller, it is common to drop n-grams which occur fewer than k times in the training data.
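
With SRILM such count cutoffs are applied at estimation time via the -gtNmin options. A sketch; the cutoff values and the output name fisher.3gram.pruned.lm are chosen only for illustration and are not taken from this recipe:

 # Keep only bigrams seen at least twice and trigrams seen at least three times
 # (the unpruned models above use no count cutoffs at all).
 ngram-count -order 3 -read LM/counts/ngrams \
   -gt2min 2 -gt3min 3 \
   -kndiscount -interpolate -unk -lm fisher.3gram.pruned.lm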

Differences between ngram-count and make-big-lm

There is some subtle difference between ngram-count and make-big-lm, which I cannot track down.

 make-big-lm -name .big -order 3 -sort -read ./LM/counts/ngrams -kndiscount -lm try -unk -debug 1

and

 ngram-count -order 3 -read LM/counts/ngrams \
   -kn1 LM/counts/kn1-3.txt -kn2 LM/counts/kn2-3.txt -kn3 LM/counts/kn3-3.txt \
   -kn4 LM/counts/kn4-3.txt -kn5 LM/counts/kn5-3.txt -kn6 LM/counts/kn6-3.txt \
   -kn7 LM/counts/kn7-3.txt -kn8 LM/counts/kn8-3.txt -kn9 LM/counts/kn9-3.txt \
   -kndiscount1 -kndiscount2 -lm try -interpolate -debug 1

give different results, even though the KN discounts are identical in both cases above.

ngram-count seems to do better on vocab sizes < 20k, and make-big-lm is slightly better on the full vocab.

I suspect it has something to do with make-big-lm allocating .05 of total probability to the unk token for some reason, while ngram-count allocates almost nothing to it.
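
One way to check this suspicion is to compare the unigram entry for <unk> in the two ARPA-format outputs (the file names below are placeholders for wherever the two -lm outputs were saved). The first column of the matching line is the log10 probability, so a value near log10(0.05) ≈ -1.3 would confirm the 0.05 allocation:

 # First match is the \1-grams: entry: "<log10 prob>  <unk>  <backoff weight>".
 grep -m 1 "<unk>" lm.make-big-lm.arpa
 grep -m 1 "<unk>" lm.ngram-count.arpa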

Compare the following table against the models in the [[#Model Quality]] section above.

{| class="wikitable sortable"
|+ The cross-entropy (bits per word) and out-of-vocabulary (OOV) percentages (max is 100%) on the dev and test sets, for the language models generated by '''the original giga-word recipe script'''. The cross entropy values can be compared with those used in <ref name="chen1998empirical"/>.
! n-gram<br>order !! vocab !! ngram<br>count !! dev<br>entropy !! dev<br>OOV % !! test<br>entropy !! test<br>OOV %
|-
| 2-gram || 5k || ngram 1=5000<br>ngram 2=580572<br>ngram 3=0 || 6.924 || 3.117 || 6.958 || 2.917
|-
| 2-gram || 20k || ngram 1=20000<br>ngram 2=980894<br>ngram 3=0 || 7.200 || 0.810 || 7.217 || 0.748
|-
| 2-gram || all || ngram 1=70957<br>ngram 2=1180725<br>ngram 3=0 || 7.286 || 0.271 || 7.297 || 0.254
|-
| 3-gram || 5k || ngram 1=5000<br>ngram 2=776554<br>ngram 3=3234773 || 6.530 || 3.117 || 6.566 || 2.917
|-
| 3-gram || 20k || ngram 1=20000<br>ngram 2=1176182<br>ngram 3=4038261 || 6.809 || 0.810 || 6.827 || 0.748
|-
| 3-gram || all || ngram 1=70957<br>ngram 2=1342196<br>ngram 3=4201973 || 6.898 || 0.271 || 6.910 || 0.254
|}

Caching, Clustering and all that

Not worth trying, since trigrams are already fairly space-efficient (see Section 11.2, "All hope abandon, ye who enter here", in <ref name="goodman2001abit">Joshua Goodman. A Bit of Progress in Language Modeling, Extended Version. Microsoft Research Technical Report MSR-TR-2001-72, 2001.</ref>).

In particular, they have this to say about the best smoothing strategy they themselves have developed:

"Kneser-Ney smoothing leads to improvements in theory, but in practice, most language models are built with high count cutoffs, to conserve space, and speed the search; with high count cutoffs, smoothing doesn’t matter."

<references/>
