Fisher Language Model

The design choices behind the various language models built on the Fisher Corpus are described below. The implementation uses the [http://www.speech.sri.com/projects/srilm/ SRILM toolkit] and some of the scripts are taken from this [http://www.inference.phy.cam.ac.uk/kv227/lm_giga/ gigaword LM recipe]. The main script is [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/LM/go_all.sh go_all.sh], and supporting scripts are [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/LM/ here].
=Source Text=
The language model is built using only the repaired training data transcriptions, where each unique partial word is treated as a separate word.
=N-gram order=
2-gram, 3-gram and 4-gram models are built. Going beyond 4-grams does not yield significant improvements and is not worth the extra computational resources required in an ASR system; essentially nothing is gained beyond 5-grams (0.06 bits lower entropy going from 4-grams to 5-grams<ref name="goodman2001abit"/>).
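
The exact commands live in [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/LM/go_all.sh go_all.sh]; as a rough sketch only (train.txt and the output names below are placeholders, not what the script actually uses), the three orders can be built with SRILM's ngram-count:

 # Sketch: build 2-, 3- and 4-gram LMs from the repaired transcriptions
 # (one sentence per line).  File names are placeholders.
 for n in 2 3 4; do
   ngram-count -order $n -text train.txt -lm fisher.${n}gram.lm.gz
 done

The vocabulary and smoothing options discussed below would be added to the same ngram-count call.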
=Vocab Sizes=
The generated LMs use only the N most frequent words, mapping the rest to the UNK token, where N is:
{| class="wikitable"   
! N
! Comparable vocab sizes
|-
! 500
| Svitchboard vocab size used by JHU06 workshop
|-
! 1000
|
|-
! 5000
|
|-
! 10000
| Similar to the Switchboard vocab size used by the JANUS speech recognition group (9800)
|-
! 20k
| WSJ/NAB vocab size used in the 1995 ARPA continuous speech evaluation
|-
! 70957
| All repaired words in the training data.
|}

Which of these sizes will actually be used in the recognizer remains to be seen.
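
As a sketch of how such a cutoff vocabulary might be produced (file names are placeholders throughout), the N most frequent words can be taken from unigram counts and passed back to ngram-count:

 # Sketch: build a 20000-word vocabulary from unigram counts; placeholder file names.
 ngram-count -order 1 -text train.txt -write train.1cnt
 sort -k2,2 -nr train.1cnt | awk '{print $1}' | head -n 20000 > vocab.20k
 # -vocab restricts counting to these words; -unk maps everything else to SRILM's <unk> token
 ngram-count -order 3 -text train.txt -vocab vocab.20k -unk -lm fisher.20k.3gram.lm.gz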
=Smoothing=
Modified Kneser-Ney smoothing is used, since it performs best across a variety of n-gram counts and training corpus sizes<ref name="chen1998empirical">[http://research.microsoft.com/~joshuago/tr-10-98.pdf Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical report TR-10-98, Harvard University, August 1998.]</ref>.
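
In SRILM this corresponds to the -kndiscount option (Chen and Goodman's modified Kneser-Ney), usually combined with -interpolate for the interpolated variant; a sketch with placeholder file names, not necessarily the settings used in go_all.sh:

 # Sketch: 3-gram LM with interpolated modified Kneser-Ney smoothing.
 # File names are placeholders.
 ngram-count -order 3 -text train.txt -vocab vocab.20k -unk \
   -kndiscount -interpolate -lm fisher.20k.3gram.kn.lm.gz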
 
=Pruning=
=Caching, Clustering and all that=

Not worth trying, since trigrams are fairly space-efficient (see Section 11.2, "All hope abandon, ye who enter here", in <ref name="goodman2001abit">Joshua Goodman. A Bit of Progress in Language Modeling, Extended Version. Microsoft Research Technical Report MSR-TR-2001-72.</ref>).

In particular, the author has this to say about the best smoothing strategy he himself helped develop:

"Kneser-Ney smoothing leads to improvements in theory, but in practice, most language models are built with high count cutoffs, to conserve space, and speed the search; with high count cutoffs, smoothing doesn’t matter."

=Model Quality=

The quality of each model is measured by its [[wikipedia:Perplexity|perplexity]].
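
Perplexity can be reported with SRILM's ngram tool on a held-out transcription set; a minimal sketch, with heldout.txt and the model name as placeholders:

 # Sketch: report perplexity of one of the LMs on held-out text.
 ngram -order 3 -lm fisher.20k.3gram.kn.lm.gz -unk -ppl heldout.txt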


<references/>
