The Fisher corpus is still relatively new and rough. This page is meant to help people quickly build a basic speech recognizer with it.
Vocabulary
Some Corpus Statistics
total non-empty utterances | 2223159
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) ) | 283935
total word tokens in corpus (including uncertain words) | 21905137 | 100%
total non-speech markers enclosed in [] (e.g. [LAUGH]) | 559629 | 2.555%
total partial words (starting or ending in -) | 153098 | 0.6990%
total partial words that could be repaired | 101550 | 0.4636%
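The categories counted above can be tallied with a few lines of Python. This is a minimal sketch, not the script that produced the numbers on this page. It assumes each utterance has already been reduced to its text (timestamps and speaker labels removed), that non-speech markers are single whitespace-delimited tokens, and that words inside (( )) count as ordinary word tokens. The function name `count_tokens` and the counter keys are made up for illustration.

```python
import re
from collections import Counter

# (( NO WAY ))  -> uncertain word or phrase; the inner words are kept as tokens
UNCERTAIN = re.compile(r"\(\(\s*(.*?)\s*\)\)")
# [LAUGH]       -> non-speech marker (assumed to be one token with no spaces)
NON_SPEECH = re.compile(r"^\[[^\]]+\]$")


def count_tokens(utterance: str) -> Counter:
    """Tally token categories for one utterance (hypothetical helper)."""
    counts = Counter()
    counts["uncertain_spans"] += len(UNCERTAIN.findall(utterance))
    # Strip the (( )) wrappers but keep the words they enclose.
    text = UNCERTAIN.sub(r" \1 ", utterance)
    for token in text.split():
        counts["tokens"] += 1
        if NON_SPEECH.match(token):
            counts["non_speech"] += 1
        elif token.startswith("-") or token.endswith("-"):
            counts["partial"] += 1
    return counts


if __name__ == "__main__":
    print(count_tokens("(( NO WAY )) [LAUGH] I TH- THINK SO"))
```

Summing the returned counters over all utterances gives corpus-level totals. Whether non-speech markers should count toward the word-token total is a design choice; here they do.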
Some Vocab Statistics
total unique words | 64924 | 100%
unique words occurring once in the corpus | 23192 | 35.72%
unique words occurring once or twice in the corpus | 31272 | 48.17%
corpus coverage if vocab does not include words occurring once in the corpus | 99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus (23192 + 2 × 8080 = 39352 tokens excluded) | 99.820%
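Corpus coverage is measured in tokens, not in unique words: dropping a word that occurs n times removes n tokens from the covered set. For example, dropping the 23192 singletons removes 23192 of the 21905137 tokens, which gives the 99.894% figure. The sketch below shows the calculation on a toy corpus. The function name `coverage` and its `min_count` parameter are assumptions for illustration, not part of any Fisher tooling.

```python
from collections import Counter


def coverage(freqs: Counter, min_count: int) -> float:
    """Fraction of corpus tokens whose word occurs at least min_count times."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    # Each kept word contributes all of its occurrences, not just one.
    kept = sum(count for count in freqs.values() if count >= min_count)
    return kept / total


if __name__ == "__main__":
    # Toy corpus: the=3, cat=2, and four words that occur once (9 tokens total).
    freqs = Counter("the cat sat on the mat the cat ran".split())
    for min_count in (1, 2, 3):
        print(min_count, f"{100 * coverage(freqs, min_count):.3f}%")
```

With `min_count=2` the vocabulary excludes words seen once; with `min_count=3` it excludes words seen once or twice.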
Language Model