Fisher Baseline Experiments
From SpeechWiki
Revision as of 20:02, 2 October 2008
First we need a reasonable traditional phone-based recognizer against which to compare. An implementation of it using GMTK and associated scripts is here.
It uses the phone set, the multi-pronunciation phonetic dictionary and MVA-normalized PLP observations described in Fisher Corpus.
== Monophone Model ==
The initial model is monophone, with the number of states per phone specified in phone.state. The number of Gaussians per state will be determined by tuning.
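As a back-of-the-envelope illustration of how the tuned Gaussian count affects model size (the phone count, states per phone, and Gaussians per state below are placeholder values, not the actual contents of phone.state), the total number of mixture components scales multiplicatively:

```python
def total_gaussians(num_phones, states_per_phone, gaussians_per_state):
    """Total mixture components across all monophone HMM states.

    All three arguments are illustrative assumptions; the real values
    come from phone.state and from tuning.
    """
    return num_phones * states_per_phone * gaussians_per_state

# e.g. 45 phones x 3 states/phone x 16 Gaussians/state:
print(total_gaussians(45, 3, 16))  # 2160
```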
Training on the entire Fisher training set (1,775,831 utterances, specified in Fisher Corpus) takes an exceedingly long time: ~15 hours for a single EM iteration on a 32-CPU cluster. Possible solutions are:
* Do training on data where word boundaries are observed (thanks, Chris). For this we need to force-align the entire Fisher corpus using some existing recognizer.
* Use a smaller corpus.
* Use TeraGrid (our allocation of 30,000 CPU-hours is only ~40 days on our cluster - almost not worth the effort of porting).
* Experiment with better triangulations (not much hope for improvement, since the RVs within each frame are densely connected in our DBN).
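The TeraGrid arithmetic above can be checked directly; a small sketch using only the numbers stated in the text:

```python
# Figures from the text: a 30,000 CPU-hour allocation and a 32-CPU cluster.
ALLOCATION_CPU_HOURS = 30_000
CLUSTER_CPUS = 32
HOURS_PER_EM_ITERATION = 15  # wall-clock, on the 32-CPU cluster

hours_on_cluster = ALLOCATION_CPU_HOURS / CLUSTER_CPUS   # 937.5 wall-clock hours
days_on_cluster = hours_on_cluster / 24                  # ~39 days, matching "~40 days"
em_iterations = hours_on_cluster / HOURS_PER_EM_ITERATION  # allocation buys ~62 iterations

print(round(days_on_cluster, 1))  # 39.1
print(int(em_iterations))         # 62
```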
== Decoding ==
=== Preliminary Results ===
Tested on the first 500 utterances of the Dev Set.
{| class="wikitable sortable"
|+ Experiments with language model and vocabulary choices, untuned LM_SCALE and LM_PENALTY
! Vocab
! Lang Model
! Config
! WER
|-
| 10000
| bigram
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/mpTest/PARAM/test.grid config 0]
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/mpTrigram/config9/test0/accuracy/out.nosil.trn.dtl 67.3%]
|-
| 10000
| trigram
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/PARAM/test.grid config 9]
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/mpTrigram/config9/test0/accuracy/out.nosil.trn.dtl 67.2%]
|-
| 5000
| trigram
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/PARAM/test.grid config 10]
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/mpTrigram/config10/test0/accuracy/out.nosil.trn.dtl 67.7%]
|-
| 1000
| trigram
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/PARAM/test.grid config 11]
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/mpTrigram/config11/test0/accuracy/out.nosil.trn.dtl 69.5%]
|-
| 500
| trigram
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/PARAM/test.grid config 12]
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/mpTrigram/config12/test0/accuracy/out.nosil.trn.dtl 71.6%]
|}
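For reference, the WER percentages in the linked .dtl scoring reports follow the standard definition used by sclite-style scorers. A minimal sketch (the error counts below are purely illustrative, not taken from any of the reports above):

```python
def wer(substitutions, deletions, insertions, reference_words):
    """Word error rate in percent: (S + D + I) / N, over reference word count N."""
    return 100.0 * (substitutions + deletions + insertions) / reference_words

# Hypothetical counts for a 1000-word reference:
print(f"{wer(500, 150, 80, 1000):.1f}%")  # 73.0%
```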