Fisher Baseline Experiments

From SpeechWiki


Revision as of 03:26, 6 November 2008

A set of traditional phone-based baselines against which to compare our mixed-unit systems. An implementation using GMTK and the associated scripts is here.



Multi-pronunciation Monophone Model

The initial model is monophone, with the number of states per phone specified in phone.state and 64 Gaussians per state.

The dictionary used allows multiple pronunciations (up to 6). See more details here.

Training on the entire Fisher training set (1,775,831 utterances, specified in Fisher Corpus) takes an exceedingly long time: more than 15 hours for a single EM iteration on a 32-CPU cluster, or about 20 cluster compute-days to reach 64-Gaussian mixtures, doubling the number of mixtures after converging with 6 iterations.
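As a sanity check on that estimate, here is a back-of-the-envelope sketch. The per-iteration cost and iterations-per-stage come from the numbers above; the exact mixture-splitting schedule (start at 1 Gaussian per state, double after each convergence) is an assumption, so the result is only expected to agree with the quoted ~20 compute-days in order of magnitude.

```python
HOURS_PER_EM_ITER = 15   # observed: one EM iteration on the 32-CPU cluster
ITERS_PER_STAGE = 6      # EM iterations to converge before each doubling
TARGET_GAUSSIANS = 64

# Assumed schedule: train with 1 Gaussian per state, then double after each
# convergence: 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 64, i.e. 7 training stages.
stages, n = 1, 1
while n < TARGET_GAUSSIANS:
    n *= 2
    stages += 1

total_hours = stages * ITERS_PER_STAGE * HOURS_PER_EM_ITER
print(stages, total_hours, round(total_hours / 24, 1))  # 7 630 26.2
```

The sketch lands slightly above the quoted figure, consistent with some stages converging in fewer than 6 iterations.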

Possible solutions are:

  • Do training on data where word boundaries are observed (thanks, Chris). For this we need to force-align the entire Fisher corpus using some existing recognizer.
  • Use a smaller corpus.
  • Use TeraGrid (our allocation of 30,000 CPU-hours is only about 40 days on our cluster, almost not worth the effort of porting).
  • Experiment with better triangulations (little hope for improvement, since the RVs within each frame are densely connected in our DBN).

Training on word-aligned data

To speed training up, I've used the above model to obtain word boundaries via forced alignment; see the forceAlign.pl script. The word boundaries are in wordObservationsWordAligned.scp. Some of the utterances were impossible to word-align to their transcriptions; transcription and scp files omitting those bad utterances are also generated.
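The filtering step can be sketched as below, assuming the utterance ID is the first whitespace-separated field of each scp/transcription line. The IDs and entries shown are made up; the real files are produced by forceAlign.pl.

```python
def filter_aligned(lines, bad_ids):
    """Drop utterances whose forced alignment failed; the utterance ID is
    assumed to be the first whitespace-separated field of each line."""
    return [ln for ln in lines if ln.split()[0] not in bad_ids]

# Hypothetical scp entries and failed-alignment IDs.
scp = ["fsh_001 obs/fsh_001.bin",
       "fsh_002 obs/fsh_002.bin",
       "fsh_003 obs/fsh_003.bin"]
bad = {"fsh_002"}
print(filter_aligned(scp, bad))  # keeps fsh_001 and fsh_003
```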

The model with observed word boundaries trains about 6 times faster, taking about 3-4 days. The model from word-aligned training is in the mpWAmodel dir.


Decoding and Scoring

Decoding is done by GMTKViterbiNew and scoring by sclite.

Preliminary Results

Tested on the first 500 utterances of the Dev Set, using the following language models. LM_SCALE and LM_PENALTY were not tuned. The language model and scoring treat all word fragments, filled pauses, and non-speech events as distinct words, and so the WER reported is conservative.

Experiments with language model and vocab choices:

Vocab   Lang Model   Config      WER
10000   bigram       config 0    67.3%
10000   trigram      config 9    67.2%
5000    trigram      config 10   67.7%
1000    trigram      config 11   69.5%
500     trigram      config 12   71.6%
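The note above, that treating filled pauses as ordinary words makes the reported WER conservative, can be illustrated with a standard edit-distance WER computation (a minimal sketch, not the sclite implementation):

```python
def wer(ref, hyp):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed by word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / len(r)

ref = "uh i think that's right"
hyp = "i think that's right"
print(wer(ref, hyp))  # 0.2: the missed filled pause counts as a deletion

# Dropping filled pauses from both sides before scoring removes the penalty:
strip = lambda s: " ".join(w for w in s.split() if w not in {"uh", "um"})
print(wer(strip(ref), strip(hyp)))  # 0.0
```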

Testing the word-aligned model quality

Tested on the first 500 utterances of the Dev Set, using the 10k-vocab 3-gram LM with no word fragments and no <unk>, mapping filled pauses (uh, huh, etc.) and non-speech (e.g. [LAUGH]) to an optionally deletable %hesitation, and allowing word fragments to match the prefix or suffix of a hypothesized word (identical to row two of the next table, but with the bigram replaced by a trigram and mpModel by mpWAModel). Using the model in mpWAModel yields 65.6% WER, about 1.1% absolute worse than not specifying word boundaries.
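A sketch of the %hesitation mapping. The filled-pause inventory below is an assumption, and the parentheses follow sclite's convention that a parenthesized word in the reference transcript is optionally deletable:

```python
import re

FILLED_PAUSES = {"uh", "um", "huh", "mm"}   # assumed inventory
NONSPEECH = re.compile(r"^\[.*\]$")         # bracketed events like [LAUGH]

def to_hesitation(token):
    """Map filled pauses and non-speech events to an optionally deletable
    %hesitation (parentheses mark optional deletion in sclite .trn refs)."""
    if token.lower() in FILLED_PAUSES or NONSPEECH.match(token):
        return "(%hesitation)"
    return token

line = "uh i think [LAUGH] that's right"
print(" ".join(to_hesitation(t) for t in line.split()))
# (%hesitation) i think (%hesitation) that's right
```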


Decoding and Scoring as NIST does it

Tested on the first 500 utterances of the Dev Set, using the 10k-vocab 2-gram LM. LM_SCALE and LM_PENALTY were not tuned (fixed at 10 and -1, respectively).

The language model and scoring treat all word fragments, filled pauses, and non-speech events as distinct words, and so the WER reported is conservative. This is still using config 0.

Experiments with different scoring rules:

Row 1: LM same as in the previous section (for comparison purposes). Filled pauses and word fragments treated as words. Config 0; WER 67.3%.
Row 2: LM same as in the previous section, but the vocab contains no word fragments (filled in with more whole words to reach 10k) and no <unk>. Word fragments match any word with a common prefix/suffix; filled pauses and non-speech sounds are mapped to %hesitation and made optionally deletable. Config 0; WER 64.7%.
Row 3: LM same as in the previous section. Word fragments deleted in post-Viterbi processing and made optionally deletable in the reference transcription. Config 17; WER 66.6%.
Row 4: LM same as in the previous section. Word fragments and <unk> deleted in post-Viterbi processing (but <unk> is allowed in the language model) and made optionally deletable in the reference transcription. Config 17; WER 64.9%.
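The fragment rule from row 2 can be sketched as follows. The hyphen marking for fragments ("wor-" for a truncated word, "-tion" for a clipped onset) is an assumed convention for illustration:

```python
def fragment_matches(fragment, word):
    """A transcribed fragment counts as correct if the hypothesized word
    shares its prefix (trailing hyphen) or suffix (leading hyphen)."""
    if fragment.endswith("-"):
        return word.startswith(fragment[:-1])
    if fragment.startswith("-"):
        return word.endswith(fragment[1:])
    return fragment == word    # not a fragment: require an exact match

print(fragment_matches("wor-", "word"))      # True
print(fragment_matches("-tion", "station"))  # True
print(fragment_matches("wor-", "about"))     # False
```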