Fisher Baseline Experiments

This page describes a set of traditional phone-based baselines against which to compare our mixed-unit systems.
An implementation using GMTK and associated [[GMTK parallel tools|scripts]] is [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/ here].
 
=Multi-pronunciation Monophone Model=

The initial model is monophone, with the number of states per phone specified in [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/phone.state phone.state] and 64 Gaussians per state.
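
For concreteness, a minimal sketch of working with such a state table, assuming phone.state lists one phone and its state count per line (the exact file format is an assumption, not confirmed by this page):

<pre>
# Sketch: read a phone -> number-of-states table (Python).
# ASSUMPTION: each line of phone.state is "<phone> <num_states>";
# the real file layout may differ.
def read_phone_states(path):
    states = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                states[fields[0]] = int(fields[1])
    return states

phone_states = read_phone_states("phone.state")
# With 64 Gaussians per state, the total mixture count is:
total_gaussians = 64 * sum(phone_states.values())
print(len(phone_states), "phones,", total_gaussians, "Gaussians")
</pre>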

The [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/dict/fisherPhoneticMpronDict.txt dictionary used] allows multiple pronunciations (up to 6); see [[Fisher Dictionaries|here]] for more details.
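
A minimal sketch of loading such a dictionary, assuming one pronunciation per line with the word repeated for alternates (the actual layout of fisherPhoneticMpronDict.txt may differ):

<pre>
# Sketch: load a multi-pronunciation dictionary (Python).
# ASSUMPTION: each line is "<word> <phone> <phone> ...", with a word
# repeated on several lines (up to 6) for alternate pronunciations.
from collections import defaultdict

def load_mpron_dict(path):
    prons = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                prons[fields[0]].append(fields[1:])
    return prons

prons = load_mpron_dict("fisherPhoneticMpronDict.txt")
assert max(len(p) for p in prons.values()) <= 6  # up to 6 per word
</pre>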

Training on the entire Fisher training set (1,775,831 utterances, specified in [[Fisher Corpus]]) takes an exceedingly long time: ~15 hours for a single EM iteration on a 32-CPU cluster. Possible solutions:
* Do training on data where word boundaries are observed (thanks, Chris); for this we need to force-align the entire Fisher corpus using some existing recognizer.
* Use a smaller corpus.
* Use TeraGrid (our allocation of 30,000 CPU-hours is only about 40 days on our cluster, almost not worth the effort of porting); see the arithmetic sketch after this list.
* Fool around with better triangulations (not much hope for improvement, since the RVs within each frame are densely connected in our DBN).
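
A back-of-the-envelope check of the TeraGrid trade-off, using only the numbers quoted above:

<pre>
# Pure arithmetic from the figures on this page (Python).
cluster_cpus = 32
hours_per_iteration = 15            # one EM pass over the training set
teragrid_cpu_hours = 30000

cpu_hours_per_iteration = cluster_cpus * hours_per_iteration      # 480
cluster_days = teragrid_cpu_hours / float(cluster_cpus) / 24      # ~39 days
iterations = teragrid_cpu_hours / float(cpu_hours_per_iteration)  # 62.5
print(cluster_days, iterations)  # the allocation buys ~62 EM iterations
</pre>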

==Decoding and Scoring==

Decoding is done by GMTKViterbiNew and scoring by sclite.
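
As a concrete sketch, hypotheses can be written in sclite's "trn" transcript format (words followed by the utterance id in parentheses); the step of extracting word strings from GMTKViterbiNew output is assumed, not shown:

<pre>
# Sketch: dump hypotheses in sclite "trn" format (Python).
# ASSUMPTION: hyps maps utterance ids to already-decoded word strings;
# extracting them from GMTKViterbiNew output is not shown here.
def write_trn(hyps, path):
    with open(path, "w") as f:
        for utt_id in sorted(hyps):
            f.write("%s (%s)\n" % (hyps[utt_id], utt_id))

write_trn({"fsh_00001-A-0001": "yeah i think so"}, "out.trn")
# Then score against a reference transcript, e.g.:
#   sclite -r ref.trn trn -h out.trn trn -o dtl
</pre>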

===Preliminary Results===

Tested on the first 500 utterances of the Dev Set, using the language models below. LM_SCALE and LM_PENALTY were not tuned. The language model and scoring treat all word fragments, filled pauses, and non-speech events as distinct words, so the WER reported is conservative.
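
To make "conservative" concrete, here is a plain word-level Levenshtein WER sketch: because filled pauses are scored as distinct words, a hypothesis that drops an "uh" is charged a deletion.

<pre>
# Sketch: standard Levenshtein word error rate (Python).
# Filled pauses count as full reference words here, so omitting
# them inflates the WER -- hence "conservative".
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i-1][j-1] + (r[i-1] != h[j-1]),
                          d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(r)][len(h)] / float(len(r))

print(wer("uh yeah i i think so", "yeah i think so"))  # 2/6 ~ 0.33
</pre>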

{| class="wikitable sortable"
|+ Experiments with language model and vocab choices
! Vocab
! Lang Model
! Config
! WER
|-
| 10000
| bigram
| config 0
| 67.3%
|-
| 10000
| trigram
| config 9
| 67.2%
|-
| 5000
| trigram
| config 10
| 67.7%
|-
| 1000
| trigram
| config 11
| 69.5%
|-
| 500
| trigram
| config 12
| [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/exp/baseline/mpTrigram/config12/test0/accuracy/out.nosil.trn.dtl 71.6%]
|}
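
Since LM_SCALE and LM_PENALTY were left untuned, a natural follow-up is a small grid search on the dev set; the decode_and_score helper below is hypothetical shorthand for one GMTKViterbiNew + sclite run returning its WER:

<pre>
# Sketch: grid search over LM scale and word-insertion penalty (Python).
# HYPOTHETICAL: decode_and_score(scale, penalty) stands for one
# GMTKViterbiNew decode plus sclite scoring; it is not a real API.
def tune_lm(decode_and_score):
    best = (float("inf"), None, None)
    for scale in (4, 8, 12, 16):
        for penalty in (-4, 0, 4, 8):
            w = decode_and_score(scale, penalty)
            if w < best[0]:
                best = (w, scale, penalty)
    return best  # (best WER, best scale, best penalty)
</pre>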

===Decoding and Scoring as NIST does it===

TODO
