Fisher Baseline Experiments
From SpeechWiki
These are traditional phone-based baselines against which to compare our mixed-unit systems. An implementation using GMTK, with associated scripts, is here.
Multi-pronunciation Monophone Model
The initial model is a monophone model, with the number of states per phone specified in phone.state and 64 Gaussians per state.
The dictionary used allows multiple pronunciations (up to 6). See more details here.
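To make the multi-pronunciation setup concrete, here is a minimal sketch of a dictionary that allows up to 6 pronunciations per word. The words, phone symbols, and helper function are illustrative assumptions, not the actual dictionary format used here.

```python
# Hypothetical illustration of a multi-pronunciation dictionary with a
# per-word cap of 6 variants, as in the setup described above.
MAX_PRONS = 6

dictionary = {
    "the": [["dh", "ah"], ["dh", "iy"]],
}

def add_pronunciation(dictionary, word, phones):
    """Add a pronunciation variant, enforcing the per-word cap."""
    prons = dictionary.setdefault(word, [])
    if phones not in prons and len(prons) < MAX_PRONS:
        prons.append(phones)

add_pronunciation(dictionary, "the", ["dh", "ax"])  # "the" now has 3 variants
```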
Training on the entire Fisher training set (1,775,831 utterances, specified in Fisher Corpus) takes exceedingly long: ~15 hours for a single EM iteration on a 32-CPU cluster. Possible solutions are:
- Train on data where word boundaries are observed (thanks, Chris). This requires force-aligning the entire Fisher corpus with some existing recognizer.
- Use a smaller corpus.
- Use TeraGrid (our allocation of 30,000 CPU-hours is only ~40 days on our cluster, so it is almost not worth the effort of porting).
- Experiment with better triangulations (not much hope for improvement, since the RVs within each frame are densely connected in our DBN).
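As a sanity check on the figures above (30,000 CPU-hour allocation, 32-CPU cluster, ~15 wall-clock hours per EM iteration, all taken from the text):

```python
# Back-of-envelope check of the TeraGrid allocation versus our cluster.
ALLOCATION_CPU_HOURS = 30_000
CLUSTER_CPUS = 32
HOURS_PER_EM_ITER = 15  # wall-clock, on the 32-CPU cluster

wall_hours = ALLOCATION_CPU_HOURS / CLUSTER_CPUS               # 937.5 hours
days = wall_hours / 24                                         # ~39 days
em_iters = ALLOCATION_CPU_HOURS / (HOURS_PER_EM_ITER * CLUSTER_CPUS)  # ~62 iterations

print(f"allocation = {days:.1f} days on our cluster, ~{em_iters:.1f} EM iterations")
```

which confirms the "~40 days" estimate, and suggests the allocation buys roughly 60 EM iterations at the current per-iteration cost.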
Decoding and Scoring
Decoding is done with GMTKViterbiNew and scoring with sclite.
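For reference, the WER that sclite reports is the word-level Levenshtein alignment cost (substitutions + deletions + insertions) divided by the reference length. A minimal sketch of that computation (not sclite's actual implementation, which also handles alignment reports and scoring rules):

```python
# Word error rate via word-level edit distance, as sclite computes it.
def wer(ref_words, hyp_words):
    R, H = len(ref_words), len(hyp_words)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[R][H] / R

print(wer("the cat sat".split(), "the cat sat down".split()))  # 1 insertion / 3 ref words
```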
Preliminary Results
Tested on the first 500 utterances of the Dev Set, using the following language models. LM_SCALE and LM_PENALTY were not tuned. The language model and scoring treat all word fragments, filled pauses, and non-speech events as distinct words, so the WER reported is conservative.
Vocab | Lang Model | Config | WER |
---|---|---|---|
10000 | bigram | config 0 | 67.3% |
10000 | trigram | config 9 | 67.2% |
5000 | trigram | config 10 | 67.7% |
1000 | trigram | config 11 | 69.5% |
500 | trigram | config 12 | 71.6% |
Decoding and Scoring as NIST does it
Tested on the first 500 utterances of the Dev Set, using the 10k-vocab bigram LM. LM_SCALE and LM_PENALTY were not tuned (fixed at 10 and -1 respectively).
The language model and scoring treat all word fragments, filled pauses, and non-speech events as distinct words, so the WER reported is conservative. This is still using config 0.
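For intuition about what LM_SCALE and LM_PENALTY do, here is a sketch of one common convention for combining scores during decoding (a scaled LM log-probability plus a per-word penalty added to the acoustic log-likelihood); the exact form used by GMTKViterbiNew may differ:

```python
# Hedged sketch: one common way decoders combine acoustic and LM scores.
# Values below are the untuned settings quoted in the text.
LM_SCALE = 10.0
LM_PENALTY = -1.0

def hypothesis_score(acoustic_loglik, lm_logprob, num_words):
    """Total log score of one hypothesis under this convention."""
    return acoustic_loglik + LM_SCALE * lm_logprob + LM_PENALTY * num_words
```

Tuning these two numbers on held-out data is cheap and usually worth a point or two of WER, which is why the untuned results above should be read as pessimistic.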
LM | post-viterbi processing and sclite Scoring rules | Config | WER |
---|---|---|---|
Same as in prev section (for comparison purposes) | filled pauses and word fragments treated as words | config 0 | 67.3% |
Same as in prev section, but vocab contains no word fragments (vocab filled in with more whole words to reach 10k), and no <unk> | word fragments match any word with common prefix/suffix, filled pauses and non-speech sounds mapped to %hesitation, and are optionally deletable | config 0 | 64.7% |
Same as in prev section | word fragments deleted in post-viterbi processing, and made optionally deletable in reference transcription | config 17 | 66.6% |
Same as in prev section | word fragments and <unk> deleted in post-viterbi processing (but unk is allowed in the language model), and made optionally deletable in reference transcription | config 17 | 64.9% |
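The NIST-style cleanup in the rows above can be sketched as a token rewrite pass: filled pauses and non-speech events map to %hesitation, and word fragments are deleted. The specific marker symbols below (bracketed events, a trailing "-" for fragments) are assumptions for illustration, not the exact symbols used by these configs:

```python
# Hypothetical sketch of the post-Viterbi cleanup described above.
FILLED_PAUSES = {"[uh]", "[um]", "[noise]", "[laughter]"}  # assumed markers

def postprocess(tokens):
    out = []
    for tok in tokens:
        if tok in FILLED_PAUSES:
            out.append("%hesitation")   # map to the NIST hesitation token
        elif tok.endswith("-"):
            continue                    # word fragment: delete
        else:
            out.append(tok)
    return out

print(postprocess(["i", "[uh]", "thi-", "think", "so"]))
# -> ['i', '%hesitation', 'think', 'so']
```

Making %hesitation (and, in the reference, fragments) optionally deletable is then handled on the sclite side, so a hypothesis is not penalized for omitting them.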