Fisher Corpus

From SpeechWiki

Latest revision as of 20:07, 23 September 2009

This page links to the various things I've done with the Fisher corpus. It may be helpful for quickly building a basic speech recognizer.


Train/Devel/Test partitions

For all the models and experiments, the entire Fisher corpus is split into Train/Devel/Test partitions of 80/10/10 percent.

The utterance ID file is scliteUttIds.txt (http://mickey.ifp.uiuc.edu/speech/akantor/fisher/filelists/scliteUttIds.txt). The utterance IDs are in the 'swb' format that sclite understands, so sclite can report accuracy statistics per conversation side.

There is also another utterance ID file, uttIds.txt (http://mickey.ifp.uiuc.edu/speech/akantor/fisher/filelists/uttIds.txt), which is used by some tools (only by resegment.pl, I think). It exists only for compatibility and should not be used.

Additionally, the scp files and the corresponding transcription files that point to the word-aligned data cover only a subset of the utterances, since some utterances could not be word aligned. Their line ranges therefore differ from those of scliteUttIds.txt, and are given in the last column of the table below.

The splits are as follows:

{| class="wikitable"
! Set
! Conversation Sides
! Lines in scliteUttIds.txt
! Lines in wordAlignedTranscriptions.txt and in subphoneAlignedTranscriptions.txt
|-
! Training
| 00001A to 09360B
| 1 to 1775831
| 1 to 1775773
|-
! First quarter of Training
| 00001A to 02340B
| 1 to 465067
| 1 to 465031
|-
! Devel
| 09361A to 10530B
| 1775832 to 1991965
| 1775774 to 1991904
|-
! Test
| 10531A to 11699B
| 1991965 to 2223159
| 1991905 to 2223080
|}
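
As a rough illustration of how the line ranges above might be used, here is a minimal Python sketch that pulls one partition out of a local copy of scliteUttIds.txt. The names `PARTITIONS`, `read_lines`, `read_partition` and `shared_lines` are hypothetical, and the sketch assumes one utterance ID per line with 1-based, inclusive ranges. The ranges are copied from the table unchanged. Note that the table lists line 1991965 in both the Devel and the Test range, and `shared_lines` makes that kind of overlap visible.

```python
from itertools import islice

# 1-based, inclusive line ranges in scliteUttIds.txt, copied from the table.
# The dictionary keys are hypothetical labels chosen for this sketch.
PARTITIONS = {
    "train": (1, 1775831),
    "train_first_quarter": (1, 465067),
    "devel": (1775832, 1991965),
    "test": (1991965, 2223159),
}


def read_lines(lines, first, last):
    """Return lines first..last (1-based, inclusive) without trailing newlines."""
    # islice is 0-based and end-exclusive, hence first - 1 and last.
    return [line.rstrip("\n") for line in islice(lines, first - 1, last)]


def read_partition(path, name):
    """Read the utterance IDs of one partition from a local utterance ID file."""
    first, last = PARTITIONS[name]
    with open(path) as handle:
        return read_lines(handle, first, last)


def shared_lines(a, b):
    """Count the line numbers that two partitions have in common."""
    (a_first, a_last), (b_first, b_last) = PARTITIONS[a], PARTITIONS[b]
    return max(0, min(a_last, b_last) - max(a_first, b_first) + 1)
```

For example, `read_partition("scliteUttIds.txt", "devel")` would return the Devel utterance IDs, assuming the file has been downloaded to the working directory.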


Observation file format

Each frame in the pfiles has the following columns:

{| class="wikitable"
! Columns !! Category !! Data Description
|-
| 0:38 || PLPs || 13 PLPs, delta PLPs and delta-delta PLPs
|-
| 39 || Word boundaries determined through forced alignment || word Id
|-
| 40 || || word Transition (0 or 1 valued)
|-
| 41 || Timeshrinking || Segment start
|-
| 42 || || Segment duration
|-
| 43 || || Representative Frame
|-
| 44:66 || MLPs || PCA_to_95_percent_variance(log(MLP activations))
|}
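
The column layout above can be sketched in Python as a mapping from field names to slices. This is only an illustration: it assumes a frame has already been read into a sequence of 67 values (reading the pfile itself is not shown), that the column ranges in the table are inclusive (so 0:38 is 39 columns, matching 13 PLPs plus deltas and delta-deltas), and the names `PFILE_FIELDS`, `FRAME_WIDTH` and `split_frame` are hypothetical.

```python
# Column layout of one pfile frame, taken from the table. The table's ranges
# are inclusive, while Python slices exclude the end index, so 0:38 becomes
# slice(0, 39). Field names are hypothetical labels chosen for this sketch.
PFILE_FIELDS = {
    "plp": slice(0, 39),                    # 13 PLPs + deltas + delta-deltas
    "word_id": slice(39, 40),               # from forced alignment
    "word_transition": slice(40, 41),       # 0 or 1 valued
    "segment_start": slice(41, 42),         # timeshrinking
    "segment_duration": slice(42, 43),      # timeshrinking
    "representative_frame": slice(43, 44),  # timeshrinking
    "mlp_pca": slice(44, 67),               # PCA of log MLP activations
}

FRAME_WIDTH = 67  # columns 0 through 66


def split_frame(frame):
    """Split one frame (a sequence of 67 values) into named fields."""
    if len(frame) != FRAME_WIDTH:
        raise ValueError(f"expected {FRAME_WIDTH} columns, got {len(frame)}")
    return {name: list(frame[cols]) for name, cols in PFILE_FIELDS.items()}
```

Keeping the layout in one table-like dictionary means a change to the observation file format only has to be made in one place.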


The experiment infrastructure needs its own page.

The experiments

The goal of these experiments is to explore the utility of using mixed units (phones, syllables and whole words) for large vocabulary speech recognition. These experiments are performed on the Fisher Corpus.

The phonetic and mixed-unit dictionaries, the language models and the front end used in my pronunciation experiments all have their own pages.

The experiments themselves are described on the Fisher Baseline Experiments and Mixed Unit Experiments pages.
