Fisher Corpus

This page links to the various things I've done with the Fisher corpus. It may be helpful for quickly building a basic speech recognizer.
=Train/Devel/Test partitions=
For all the models and experiments, the entire Fisher corpus is split 80/10/10 percent into Train/Devel/Test partitions as follows.

The utterance id file is [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/filelists/scliteUttIds.txt scliteUttIds.txt]. The utterance IDs are in the 'swb' format that sclite understands, so sclite can report accuracy statistics per conversation side.
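For example, once a hypothesis file uses these ids, a per-side summary can be produced with something like the following (the file names are hypothetical; check the sclite documentation for the exact options):

<pre>
# per-conversation-side scoring; ref.trn and hyp.trn are hypothetical trn-format files
# whose utterance ids come from scliteUttIds.txt
sclite -r ref.trn trn -h hyp.trn trn -i swb -o sum stdout
</pre>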
There is also another utterance id file, [http://mickey.ifp.uiuc.edu/speech/akantor/fisher/filelists/uttIds.txt uttIds.txt], which is used by some tools (only by resegment.pl, I think). It is kept only for compatibility and should not be used.
Additionally, the scp files and the corresponding transcription files pointing to the word-aligned data cover only a subset of all the utterances (some utterances could not be word aligned). They are partitioned differently, as specified in the table below.
The splits are as follows:

{| class="wikitable"
! Set
! Conversation Sides
! Lines in scliteUttIds.txt
! Lines in wordAlignedTranscriptions.txt and in subphoneAlignedTranscriptions.txt
|-
! Training
| 00001A to 09360B
| 1 to 1775831
| 1 to 1775773
|-
! first quarter of Training
| 00001A to 2340B
| 1 to 465067
| 1 to 465031
|-
! Devel
| 09361A to 10530B
| 1775832 to 1991965
| 1775774 to 1991904
|-
! Test
| 10531A to 11699B
| 1991965 to 2223159
| 1991905 to 2223080
|}
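The line ranges above can be used to cut per-set id lists straight out of scliteUttIds.txt, for example (output file names are made up):

<pre>
# utterance ids per set, using the scliteUttIds.txt line ranges from the table above
sed -n '1,1775831p'       scliteUttIds.txt > train.uttIds
sed -n '1775832,1991965p' scliteUttIds.txt > devel.uttIds
# note: the table lists 1991965 as both the last Devel line and the first Test line;
# check that boundary before relying on these ranges
sed -n '1991965,2223159p' scliteUttIds.txt > test.uttIds
</pre>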
=Vocabulary=

{| class="wikitable" style="text-align:center"
|+ Some corpus statistics
! total non-empty utterances
| 2223159
|-
! total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )))
| 283935
|-
! total word tokens in corpus (including uncertain words)
| 21905137 || 100%
|-
! total non-speech markers enclosed in [] (e.g. [LAUGH])
| 559629 || 2.555%
|-
! total partial words (starting or ending in -)
| 153098 || 0.6990%
|-
! total partial words that could be repaired
| 101550 || 0.4636%
|}

{| class="wikitable" style="text-align:center"
|+ Some vocab statistics
! total unique words
| 64924 || 100%
|-
! unique words occurring once in the corpus
| 23192 || 35.72%
|-
! unique words occurring once or twice in the corpus
| 31272 || 48.17%
|-
! corpus coverage if vocab does not include words occurring once in the corpus
| 99.894%
|-
! corpus coverage if vocab does not include words occurring once or twice in the corpus
| 99.857%
|}
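For reference, counts of this kind can be reproduced with a small pipeline along these lines (the transcription file name is hypothetical, and the (( )), [] and partial-word markup would have to be stripped first):

<pre>
# total tokens, unique words, and words occurring exactly once
tr -s ' ' '\n' < transcriptions.txt | sed '/^$/d' | sort > tokens.txt
wc -l < tokens.txt                           # total word tokens
sort -u tokens.txt | wc -l                   # unique words
uniq -c tokens.txt | awk '$1 == 1' | wc -l   # singletons
</pre>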
=Observation file format=

The pfiles have this form:

{| class="wikitable"
! Columns !! Category !! Data Description
|-
| 0:38 || PLPs || 13 PLPs, delta PLPs and delta-delta PLPs
|-
| 39 || Word boundaries determined through forced alignment || word Id
|-
| 40 || || word Transition (0 or 1 valued)
|-
| 41 || Timeshrinking || Segment start
|-
| 42 || || Segment duration
|-
| 43 || || Representative Frame
|-
| 44:66 || MLPs || PCA_to_95_percent_variance(log(MLP activations))
|}
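Assuming the standard ICSI pfile utilities are available, the layout can be checked against this table with something like the following (the pfile name is hypothetical):

<pre>
pfile_info -i fisher_train.pfile           # sentence, frame and feature counts
pfile_print -i fisher_train.pfile | head   # first few frames; columns as in the table above
</pre>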
=Language Model=

See the [[Fisher Language Model]] page.

=Acoustic Model=

There are two sets of PLP feature vectors created for the entire corpus.

==PLPs for MLP classifiers==

These PLPs are created in exactly the same way as the training data for the MLPs described in [http://www.cstr.ed.ac.uk/downloads/publications/2007/Cetin_icassp07_tandem.pdf J. Frankel et al., “Articulatory feature classifiers trained on 2000 hours of telephone speech,” ICASSP, 2007].

The HCopy config file for these PLPs is:

<pre>
# PLP base coefficient config courtesy of AMI eval RT05s

#the commented out lines and their replacements by Arthur 4/24/08
#MAXTRYOPEN=10
MAXTRYOPEN=3
SOURCEKIND = WAVEFORM
#SOURCEFORMAT = NIST
SOURCEFORMAT = WAV
TARGETFORMAT = HTK
TARGETKIND = PLP_E
TARGETRATE = 100000.0
HPARM: SAVECOMPRESSED = T
HPARM: SAVEWITHCRC = T
HPARM: ZMEANSOURCE = T
HPARM: WINDOWSIZE = 250000.0
HPARM: USEHAMMING = T
HPARM: PREEMCOEF = 0.97
HPARM: NUMCHANS = 24
HPARM: LPCORDER = 12
HPARM: COMPRESSFACT = 0.3333333
HPARM: NUMCEPS = 12
HPARM: CEPLIFTER = 22
HPARM: ESCALE = 1.0
HPARM: ENORMALISE = T
HPARM: SILFLOOR = 50.0
HPARM: USEPOWER = T
HPARM: CEPSCALE = 10

# required to read fisher/switchboard wavefiles as HTK does not support shorten compression
#HWAVEFILTER    = 'w_decode -o pcm $ -'
</pre>
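As a sketch of how these features are generated (the config and scp file names here are hypothetical), HCopy is run with the config above and a script file that maps each wave file to its output feature file:

<pre>
# plp_mlp.conf holds the config above; plp_mlp.scp lists "input.wav output.plp" pairs
HCopy -T 1 -C plp_mlp.conf -S plp_mlp.scp
</pre>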
 
==PLPs for gaussian mixtures==

These are the same PLPs as above, except band-limited to 125 Hz to 3800 Hz and extended with deltas and delta-deltas. The cepstral mean is normalized on the static PLPs, and the cepstral variance is normalized on the PLPs, deltas and delta-deltas. Normalization is done per conversation side. This is suggested in [http://citeseer.ist.psu.edu/267590.html The 1998 HTK System For Transcription of Conversational Telephone Speech].

The PLPs over which the gaussians will be built are generated with this HCopy config:

<pre>
#modified from the "PLP base coefficient config courtesy of AMI eval RT05s"
#the original param is commented out and its replacement is right next to it, if there was a change
#-arthur 4/24/08

#MAXTRYOPEN=10
MAXTRYOPEN=3
SOURCEKIND = WAVEFORM
#SOURCEFORMAT = NIST
SOURCEFORMAT = WAV
TARGETFORMAT = HTK
#TARGETKIND = PLP_E
TARGETKIND = PLP_E_D_A
TARGETRATE = 100000.0
HPARM: SAVECOMPRESSED = T
HPARM: SAVEWITHCRC = T
HPARM: ZMEANSOURCE = T
HPARM: WINDOWSIZE = 250000.0
HPARM: USEHAMMING = T
HPARM: PREEMCOEF = 0.97
HPARM: NUMCHANS = 24
HPARM: LPCORDER = 12
HPARM: COMPRESSFACT = 0.3333333
HPARM: NUMCEPS = 12
HPARM: CEPLIFTER = 22
HPARM: ESCALE = 1.0
HPARM: ENORMALISE = T
HPARM: SILFLOOR = 50.0
HPARM: USEPOWER = T
HPARM: CEPSCALE = 10

#ADDED BY ARTHUR
#cut-off frequencies for telephone speech
#FIXME: may be rerun with 125hz-3800hz as in [hain1999Htk]
HPARM: LOFREQ = 125
HPARM: HIFREQ = 3800
#cepstral mean and variance will be normalized after the utterances are extracted
</pre>
At this point only those observations corresponding to transcribed speech are extracted. There is still one file per conversation side.

Cepstral mean and variance statistics are collected with <code>HCompV -c</code>, per conversation side, and the global variance is estimated on the training set.
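A rough sketch of this step, assuming one scp file per conversation side (the file names are hypothetical, and the exact CMN/CVN masking flags should be checked against the HTK book):

<pre>
# collect cepstral mean/variance statistics for one conversation side into cmn/
HCompV -C plp_gmm.conf -c cmn -k '*.%%%' -q mv -S fe_00001_A.scp
</pre>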
Finally, each conversation side is zero-meaned and its variance is renormalized to match the global variance estimated on the training set. The end result is in <code>plpcut</code>.

The [[experiment infrastructure]] needs its own page.

=The experiments=

The goal of these experiments is to explore the utility of using mixed units (phones, syllables and whole words) for large vocabulary speech recognition. These experiments are performed on the Fisher corpus.

The phonetic and mixed-unit [[Fisher Dictionaries| dictionaries]], the [[Fisher Language Model | language model]]s and the [[Fisher Front End | front end]] used in my pronunciation experiments all have their own pages.

See also the [[Fisher Baseline Experiments]] and the [[Mixed Unit Experiments]].

[[Category:Fisher Experiments]]