From SpeechWiki
The Fisher corpus is still relatively new and rough around the edges. This page is meant to help people quickly build a basic speech recognizer with it.
Vocabulary
Some Corpus Statistics
total utterances | 2223159
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY )) ) | 283935
total word tokens in corpus | 21905137 | 100%
total non-speech markers enclosed in [] (e.g. [LAUGH]) | 559629 | 2.555%
total partial words (starting or ending in -) | 154130 | 0.7036%
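The categories above can be counted with a short script. The following is only a minimal sketch: it assumes each utterance is already available as a plain string of whitespace-separated tokens (the function name `corpus_stats` and the sample utterances are made up for illustration), and it assumes that non-speech markers and partial words are included in the word-token total, as the percentages in the table suggest.

```python
import re
from collections import Counter

# Spans of uncertain transcription, e.g. "(( NO WAY ))".
UNCERTAIN_SPAN = re.compile(r"\(\(.*?\)\)")


def corpus_stats(utterances):
    """Count the token categories listed above for an iterable of utterance strings."""
    stats = Counter()
    for text in utterances:
        stats["utterances"] += 1
        stats["uncertain_spans"] += len(UNCERTAIN_SPAN.findall(text))
        # Drop the (( )) delimiters but keep the words inside them.
        cleaned = text.replace("((", " ").replace("))", " ")
        for token in cleaned.split():
            stats["tokens"] += 1
            if token.startswith("[") and token.endswith("]"):
                stats["non_speech"] += 1      # e.g. [LAUGH]
            elif token.startswith("-") or token.endswith("-"):
                stats["partial_words"] += 1   # e.g. TH-
    return stats


# Hypothetical sample utterances, not taken from the corpus.
sample = ["(( NO WAY )) [LAUGH] I TH- THINK SO", "YEAH"]
stats = corpus_stats(sample)
for name, count in stats.items():
    print(f"{name}: {count}")
```

The actual transcript files would first need to be parsed into utterance strings; that step is left out here because it depends on the file layout.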
Some Vocab Statistics
total unique words | 64924 | 100%
unique words occurring once in the corpus | 23192 | 35.72%
unique words occurring once or twice in the corpus | 31272 | 48.17%
corpus coverage if vocab does not include words occurring once in the corpus | 99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus | 99.857%
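Coverage here means the fraction of word tokens in the corpus that remain in the vocabulary after rare words are dropped. The sketch below shows one way to compute it from a table of word counts (the helper name `vocab_stats` and the toy counts are illustrative assumptions), and then repeats the arithmetic on the figures listed above. Note that a word occurring twice removes two tokens: under the assumption that all of the listed counts refer to the same 21905137-token total, the token-weighted coverage without words occurring once or twice comes out at about 99.820%, while the listed 99.857% matches subtracting the 31272 word types directly.

```python
from collections import Counter


def vocab_stats(word_counts, max_dropped_count):
    """Return (types dropped, token coverage) when words seen at most
    max_dropped_count times are left out of the vocabulary."""
    total_tokens = sum(word_counts.values())
    dropped = [c for c in word_counts.values() if c <= max_dropped_count]
    return len(dropped), 1.0 - sum(dropped) / total_tokens


# Toy example: A x3, B x2, C x1, D x4 (10 tokens in total).
counts = Counter("A A A B B C D D D D".split())
print(vocab_stats(counts, 1))  # drops C only
print(vocab_stats(counts, 2))  # drops B and C

# The same arithmetic applied to the figures listed above.
TOTAL_TOKENS = 21905137
ONCE = 23192             # word types occurring exactly once
ONCE_OR_TWICE = 31272    # word types occurring once or twice
twice = ONCE_OR_TWICE - ONCE  # word types occurring exactly twice

coverage_no_once = 1 - ONCE / TOTAL_TOKENS
coverage_no_once_or_twice = 1 - (ONCE + 2 * twice) / TOTAL_TOKENS
print(f"{coverage_no_once:.3%}")
print(f"{coverage_no_once_or_twice:.3%}")
```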
Language Model