Fisher Corpus

From SpeechWiki

(Difference between revisions)
Line 121: Line 121:
==PLPs for MLP classifiers==
==PLPs for MLP classifiers==
PLPs created in exactly the same way as the training data for MLPs described in  
PLPs created in exactly the same way as the training data for MLPs described in  
-
[http://www.cstr.ed.ac.uk/downloads/publications/2007/Cetin_icassp07_tandem.pdf J. Frankel et al., “Articulatory feature classifiers trained on 2000 hours. of telephone speech,” ICASSP, 2007 ]
+
<ref name="frankel2007articulatory">[http://www.cstr.ed.ac.uk/downloads/publications/2007/Cetin_icassp07_tandem.pdf J. Frankel et al., “Articulatory feature classifiers trained on 2000 hours. of telephone speech,” ICASSP, 2007]</ref>
 +
The hcopy config file to generate PLP features for MLP input is [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/hcopy.wav2plpForMlp.config here].
 +
This way, we can use the MLPs presented in the above paper for segmenting the speech for  [[timeshrinking]] experiments.
 +
==mean and variance normalized, ARMAed PLPs for gaussian mixtures ==
 +
The second set of features is used to construct the mixture gaussian models. The features are PLPs, deltas and accelerations generated with [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/scr/hcopy.wav2plp.config this] hcopy config.  The following aspects are slightly non-standard:
 +
* The mel-frequency filter bank is constructed only over the 125 Hz-3800 Hz band, not over the entire telephone speech range of 0-4000 Hz.  <ref name="MVA">[http://ssli.ee.washington.edu/people/chiaping/mva.html MVA: a noise-robust feature processing scheme]</ref> reports a slight benefit from this, although in <ref name="Hain1998Htk"/> band-limiting has an ambiguous effect on accuracy.
 +
* The 0th cepstral coefficient is used, instead of the log-energy again due to experiments in <ref name="MVA"/>.
-
The hcopy config PLP for MLP file is
+
At this point, only the frames which correspond to transcribed audio are extracted, and the following steps are performed only on frames from time periods of transcribed audio.  The features are still stored in one file per conversation side.
-
<pre>
+
-
# PLP base coefficient config courtesy of AMI eval RT05s
+
-
#the commented out lines and their replacements by Arthur 4/24/08
+
===Normalization===
-
#MAXTRYOPEN=10
+
* The cepstral coefficients, the deltas and the accelerations are each normalized to zero mean and unit variance, as in <ref name="MVA"/>. This differs from the HTK book, which normalizes only the coefficients and computes the deltas and accelerations afterwards (so the deltas and accelerations are not re-normalized). Normalization is done per conversation side, as recommended in <ref name="Hain1998Htk">[http://citeseer.ist.psu.edu/267590.html Hain 1998, The 1998 HTK System For Transcription of Conversational Telephone Speech]</ref>.
-
MAXTRYOPEN=3
+
-
SOURCEKIND = WAVEFORM
+
-
#SOURCEFORMAT = NIST
+
-
SOURCEFORMAT = WAV
+
-
TARGETFORMAT = HTK
+
-
TARGETKIND = PLP_E
+
-
TARGETRATE = 100000.0
+
-
HPARM: SAVECOMPRESSED = T
+
-
HPARM: SAVEWITHCRC = T
+
-
HPARM: ZMEANSOURCE = T
+
-
HPARM: WINDOWSIZE = 250000.0
+
-
HPARM: USEHAMMING = T
+
-
HPARM: PREEMCOEF = 0.97
+
-
HPARM: NUMCHANS = 24
+
-
HPARM: LPCORDER = 12
+
-
HPARM: COMPRESSFACT = 0.3333333
+
-
HPARM: NUMCEPS = 12
+
-
HPARM: CEPLIFTER = 22
+
-
HPARM: ESCALE = 1.0
+
-
HPARM: ENORMALISE = T
+
-
HPARM: SILFLOOR = 50.0
+
-
HPARM: USEPOWER = T
+
-
HPARM: CEPSCALE = 10
+
-
# required to read fisher/switchboard wavefiles as HTK does not support shorten compression
 
-
#HWAVEFILTER    = 'w_decode -o pcm $ -'
 
-
</pre>
 
-
==PLPs for gaussian mixtures ==
+
<references/>
-
PLPs, same as above, except bandlimited to 125hz-3800khz, along with deltas and delta-deltas.  The cepstral mean is normalized on the PLPs and cepstral variance is normalized for the PLPs, delta and delta-deltas.  Normalization is done per conversation side.  This is suggested in [http://citeseer.ist.psu.edu/267590.html The 1998 HTK System For Transcription of Conversational Telephone Speech]
+
-
 
+
-
 
+
-
 
+
-
The PLPs over which gaussians will be built are generated with this hcopy config:
+
-
<pre>
+
-
#modified from the "PLP base coefficient config courtesy of AMI eval RT05s"
+
-
#the original param is commented and it's replacement is right next to it, if there was a change
+
-
#-arthur 4/24/08
+
-
 
+
-
#MAXTRYOPEN=10
+
-
MAXTRYOPEN=3
+
-
SOURCEKIND = WAVEFORM
+
-
#SOURCEFORMAT = NIST
+
-
SOURCEFORMAT = WAV
+
-
 
+
-
TARGETFORMAT = HTK
+
-
#TARGETKIND = PLP_E
+
-
TARGETKIND = PLP_E_D_A
+
-
TARGETRATE = 100000.0
+
-
HPARM: SAVECOMPRESSED = T
+
-
HPARM: SAVEWITHCRC = T
+
-
HPARM: ZMEANSOURCE = T
+
-
HPARM: WINDOWSIZE = 250000.0
+
-
HPARM: USEHAMMING = T
+
-
HPARM: PREEMCOEF = 0.97
+
-
HPARM: NUMCHANS = 24
+
-
HPARM: LPCORDER = 12
+
-
HPARM: COMPRESSFACT = 0.3333333
+
-
HPARM: NUMCEPS = 12
+
-
HPARM: CEPLIFTER = 22
+
-
HPARM: ESCALE = 1.0
+
-
HPARM: ENORMALISE = T
+
-
HPARM: SILFLOOR = 50.0
+
-
HPARM: USEPOWER = T
+
-
HPARM: CEPSCALE = 10
+
-
 
+
-
#ADDED BY ARTHUR
+
-
#cut-off frequencies for telephone speech
+
-
#FIXME: may be rerun with 125hz-3800khz as in [hain1999Htk]
+
-
HPARM: LOFREQ = 125
+
-
HPARM: HIFREQ = 3800
+
-
 
+
-
#cepstral mean and variance will be normalized after the utterances are extracted
+
-
 
+
-
</pre>
+
-
 
+
-
At this point we extract only those observations corrseponding to transcribed speech.  There is still one file per conversation side.
+
-
Cepstral mean and variance statistics are collected with <code>HCompV -c</code>, per conversation side, and global variance estimated on the training set.
+
-
Finally, the each converstation side is zero-meaned, and variance renormalized to match the global variance on the training set.
+
-
The end result is in the plpcut dir.
+

Revision as of 06:42, 24 May 2008

The Fisher corpus is still relatively new and rough; this page is meant to help people quickly build a basic speech recognizer with it.

Train/Devel/Test partition

I've split the entire Fisher corpus into Train/Devel/Test partitions of 80/10/10 percent.

The utterance id file is in filelists/uttIds.txt, and the splits are as follows:

Set         Conversation Sides    Lines in uttIds.txt
Training    00001A to 09360B
Devel       09361A to 10530B
Test        10531A to 11699B
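
The split above can be sketched as a small lookup. This is only an illustration: it assumes a conversation side id is a five-digit conversation number followed by A or B (as in the table), and the function names are made up for this example, not taken from any script on this page.

```python
# Inclusive conversation-number ranges, copied from the table above.
PARTITIONS = [
    ("Training", 1, 9360),
    ("Devel", 9361, 10530),
    ("Test", 10531, 11699),
]


def conversation_number(side_id):
    """Return the numeric part of a side id such as '09361A' (assumed format)."""
    return int(side_id[:-1])


def partition_of(side_id):
    """Return the partition name that a conversation side belongs to."""
    number = conversation_number(side_id)
    for name, first, last in PARTITIONS:
        if first <= number <= last:
            return name
    raise ValueError("conversation side %r is outside 00001-11699" % side_id)


def partition_shares():
    """Return the fraction of conversations in each partition."""
    total = sum(last - first + 1 for _, first, last in PARTITIONS)
    return {name: (last - first + 1) / total for name, first, last in PARTITIONS}
```

Under these assumptions, partition_shares() comes out very close to 0.80, 0.10 and 0.10, which matches the 80/10/10 split described above.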


Dictionaries

Corpus Statistics
total non-empty utterances                                                             2223159
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )))                 283935
total word tokens in corpus (including uncertain words)                                21905137  100%
total non-speech markers enclosed in [] (e.g. [LAUGH])                                 559629    2.555%
total partial words (starting or ending in -)                                          153098    0.6990%
total partial words that could be repaired                                             101550    0.4636%
Vocab statistics on the raw corpus
total unique words                                                                     64924     100%
unique words occurring once in the corpus                                              23192     35.72%
unique words occurring once or twice in the corpus                                     31272     48.17%
corpus coverage if vocab does not include words occurring once in the corpus                     99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus            99.857%
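
For readers who want to recompute this kind of statistic on their own token lists, here is a minimal sketch. It is not the script that produced the numbers above; it assumes the corpus is available as a flat list of word tokens, and it measures coverage as the fraction of tokens that remain once rare words are dropped from the vocabulary.

```python
from collections import Counter


def vocab_stats(tokens, max_dropped_count):
    """Vocabulary statistics when words seen <= max_dropped_count times are dropped.

    tokens: non-empty list of word tokens (hypothetical input format).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    # Word types that would be excluded from the vocabulary.
    rare_words = [w for w, c in counts.items() if c <= max_dropped_count]
    # Tokens lost because their word type was excluded.
    dropped_tokens = sum(counts[w] for w in rare_words)
    return {
        "unique_words": len(counts),
        "rare_words": len(rare_words),
        "rare_share": len(rare_words) / len(counts),
        "coverage": 1 - dropped_tokens / total_tokens,
    }
```

Calling vocab_stats(tokens, 1) corresponds to the "occurring once" rows and vocab_stats(tokens, 2) to the "once or twice" rows.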

In Fisher, partial words (those starting or ending with a '-') often have the complete word nearby (within 6 words on the same conversation side). In those cases I've replaced the '-' with the missing part of the word, taken from the nearby word that shares the same non-'-' part, and enclosed that missing part in [] brackets. Statistics for this new vocabulary are below:

Vocab statistics on the corpus with repaired partial words
total unique words                                          79742     100%
unique words occurring once in the corpus                   32703     41.01%
unique words occurring once or twice in the corpus          42967     53.88%
words not in the cmudict 0.6 dictionary                     18588     23.3%
Corpus coverage by words not in the cmudict 0.6 dictionary            ??


Note that there are two uses for the [] brackets:

  1. A complete word in [] brackets denotes a non-speech event, e.g. [LAUGH] or [SIGH].
  2. A word with only part of it enclosed in [] brackets denotes a partial word; the part in [] brackets is the missing (unspoken) part (e.g. RA[THER] => R AE).
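
The partial-word repair and the two bracket conventions can be sketched roughly as follows. This is a guess at the procedure, not the actual repair script: the function names, the nearest-neighbour-first search order, and the rule of skipping candidates that themselves contain '-' or '[' are all assumptions made for the example.

```python
def repair_partial(words, index, window=6):
    """Return words[index] with its missing part filled in inside [] brackets.

    words: the words of one conversation side, in order.
    Returns the word unchanged if it is not partial or no match is nearby.
    """
    partial = words[index]
    if partial.endswith("-") and len(partial) > 1:
        spoken, spoken_is_prefix = partial[:-1], True
    elif partial.startswith("-") and len(partial) > 1:
        spoken, spoken_is_prefix = partial[1:], False
    else:
        return partial  # not a partial word

    lo = max(0, index - window)
    hi = min(len(words), index + window + 1)
    # Try the closest neighbours first (assumed tie-break: left before right).
    neighbours = sorted((i for i in range(lo, hi) if i != index),
                        key=lambda i: abs(i - index))
    for i in neighbours:
        full = words[i]
        if "-" in full or "[" in full or len(full) <= len(spoken):
            continue  # candidate is itself partial, a marker, or too short
        if spoken_is_prefix and full.startswith(spoken):
            return spoken + "[" + full[len(spoken):] + "]"
        if not spoken_is_prefix and full.endswith(spoken):
            return "[" + full[:-len(spoken)] + "]" + spoken
    return partial


def bracket_kind(token):
    """Classify a token by its use of [] brackets."""
    if token.startswith("[") and token.endswith("]") and token.count("[") == 1:
        return "non-speech"  # e.g. [LAUGH]
    if "[" in token and "]" in token:
        return "partial"     # e.g. RA[THER]
    return "word"
```

For example, in the word sequence I WOULD RA- RATHER GO, the partial word RA- would be rewritten as RA[THER].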


Phonetic Dictionary

Each word's pronunciation was derived using the Phonetic Transcription Tool, in this order of preference:

  1. If the word is in the dictionary, use the dictionary pronunciation.
  2. If the word contains numbers, spell out the single digits.
  3. If the word contains underscores, treat it as an acronym and spell out the single letters.
  4. If the word is a partial word (has [] brackets) and the whole word is in the dictionary, do forced alignment.
  5. Otherwise, do Viterbi decoding.

Phonetic Transcription Tool still could not handle some of the words. These I transcribed by hand, and they are listed here.
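
The order of preference above amounts to a simple cascade of checks. The sketch below only decides which strategy would apply to a word; it does not call the Phonetic Transcription Tool, and the function name, the strategy labels and the example words are hypothetical. It also assumes non-speech markers such as [LAUGH] have been filtered out beforehand.

```python
def pronunciation_strategy(word, dictionary):
    """Return the name of the first strategy in the preference order that applies.

    dictionary: a set (or dict) of words that have a dictionary pronunciation.
    """
    if word in dictionary:
        return "dictionary"
    if any(ch.isdigit() for ch in word):
        return "spell_digits"
    if "_" in word:
        return "spell_letters"
    if "[" in word and "]" in word:
        # Partial word: strip the brackets to recover the whole word.
        whole = word.replace("[", "").replace("]", "")
        if whole in dictionary:
            return "forced_alignment"
    return "viterbi_decoding"
```

With a dictionary containing RATHER, the partial word RA[THER] would be sent to forced alignment, while a word not covered by any rule falls through to Viterbi decoding.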

Mixed Unit Dictionary

Language Model

Acoustic Model

There are two sets of PLP feature vectors created for the entire corpus.

PLPs for MLP classifiers

These PLPs are created in exactly the same way as the MLP training data described in <ref name="frankel2007articulatory">J. Frankel et al., “Articulatory feature classifiers trained on 2000 hours of telephone speech,” ICASSP, 2007</ref>. The hcopy config file used to generate the PLP features for MLP input is here. This lets us use the MLPs presented in that paper to segment the speech for timeshrinking experiments.

Mean- and variance-normalized, ARMA-filtered PLPs for Gaussian mixtures

The second set of features is used to build the Gaussian mixture models. The features are PLPs, deltas and accelerations generated with this hcopy config. The following aspects are slightly non-standard:

  • The mel-frequency filter bank is constructed only over the 125 Hz-3800 Hz band, not over the entire telephone speech range of 0-4000 Hz. <ref name="MVA">MVA: a noise-robust feature processing scheme</ref> reports a slight benefit from this, although in <ref name="Hain1998Htk"/> band-limiting has an ambiguous effect on accuracy.
  • The 0th cepstral coefficient is used instead of the log-energy, again based on experiments in <ref name="MVA"/>.

At this point, only the frames which correspond to transcribed audio are extracted, and the following steps are performed only on frames from time periods of transcribed audio. The features are still stored in one file per conversation side.
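
Selecting the frames that fall inside transcribed segments could look roughly like the sketch below. It assumes a 10 ms frame shift (the TARGETRATE = 100000.0 setting of the earlier hcopy config, in units of 100 ns), that frame i starts at i times the frame shift, and that segment times come as (start, end) pairs in seconds. The function name is made up, and the 25 ms analysis window is ignored for simplicity.

```python
FRAME_SHIFT_MS = 10  # assumed: TARGETRATE = 100000.0 (100 ns units) = 10 ms


def transcribed_frame_indices(segments, num_frames, frame_shift_ms=FRAME_SHIFT_MS):
    """Return sorted indices of frames that start inside a transcribed segment.

    segments: list of (start_seconds, end_seconds) pairs for one conversation side.
    num_frames: number of feature frames stored for that side.
    """
    keep = set()
    for start, end in segments:
        # Work in whole milliseconds to avoid floating-point surprises.
        start_ms = round(start * 1000)
        end_ms = round(end * 1000)
        first = max(0, start_ms // frame_shift_ms)
        last = min(num_frames, -(-end_ms // frame_shift_ms))  # ceiling division
        keep.update(range(first, last))
    return sorted(keep)
```

Overlapping segments are merged by the set, so no frame is extracted twice.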

Normalization

  • The cepstral coefficients, the deltas and the accelerations are each normalized to zero mean and unit variance, as in <ref name="MVA"/>. This differs from the HTK book, which normalizes only the coefficients and computes the deltas and accelerations afterwards (so the deltas and accelerations are not re-normalized). Normalization is done per conversation side, as recommended in <ref name="Hain1998Htk">Hain 1998, The 1998 HTK System For Transcription of Conversational Telephone Speech</ref>.
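
As a rough illustration of per-conversation-side normalization, the sketch below normalizes every feature dimension (coefficients, deltas and accelerations alike) to zero mean and unit variance over the frames of one side. It is plain Python rather than the tooling actually used here, the function name is made up, and using the population variance (dividing by the number of frames) is an assumption.

```python
import math


def normalize_side(frames):
    """Mean- and variance-normalize all frames of one conversation side.

    frames: non-empty list of equal-length lists of floats, one list per frame,
            holding the coefficients, deltas and accelerations side by side.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # Population standard deviation per dimension (assumed convention).
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    return [
        # A constant dimension has zero variance; map it to 0.0 instead of dividing.
        [(f[d] - means[d]) / stds[d] if stds[d] > 0 else 0.0 for d in range(dims)]
        for f in frames
    ]
```

Because every column is treated the same way, the deltas and accelerations end up normalized too, which is the difference from the HTK book scheme described above.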


<references/>
