Fisher Corpus
From SpeechWiki
Revision as of 02:05, 28 March 2008
The Fisher corpus is still relatively new and rough. This page is meant to help people quickly build a basic speech recognizer with it.
Vocabulary
Some Corpus Statistics

Statistic | Count | % of word tokens |
---|---|---|
total non-empty utterances | 2223159 | |
total uncertain words or phrases enclosed in (( )) (e.g. (( NO WAY ))) | 283935 | |
total word tokens in corpus (including uncertain words) | 21905137 | 100% |
total non-speech markers enclosed in [] (e.g. [LAUGH]) | 559629 | 2.555% |
total partial words (starting or ending in -) | 154130 | 0.7036% |
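
Counts like the ones above can be gathered with a short script. The following is a minimal sketch, not the procedure used to produce this table: it assumes the transcript text has already been extracted to one utterance per string, that tokens are separated by whitespace, and that non-speech markers and partial words count as word tokens. The function name `marker_stats` and the dictionary keys are made up for illustration.

```python
import re
from collections import Counter

# Assumed markup: (( NO WAY )) for uncertain spans, [LAUGH] for non-speech.
UNCERTAIN = re.compile(r"\(\(\s*(.*?)\s*\)\)")
NON_SPEECH = re.compile(r"\[[^\]\s]+\]")

def marker_stats(utterances):
    """Count utterances, tokens, and markup categories (hypothetical helper)."""
    stats = Counter()
    for text in utterances:
        text = text.strip()
        if not text:
            continue  # skip empty utterances
        stats["utterances"] += 1
        stats["uncertain_spans"] += len(UNCERTAIN.findall(text))
        # Drop the (( )) delimiters but keep the words inside them.
        plain = text.replace("((", " ").replace("))", " ")
        for token in plain.split():
            stats["tokens"] += 1
            if NON_SPEECH.fullmatch(token):
                stats["non_speech"] += 1
            elif token.startswith("-") or token.endswith("-"):
                stats["partial"] += 1
    return stats

sample = ["i (( no way )) th- think so [laugh]", "", "(( )) yeah"]
print(dict(marker_stats(sample)))
```

An empty `(( ))` is counted as an uncertain span here. Whether the table treats it that way is not stated, so totals from this sketch may differ from the figures above.
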
Statistic | Value | % of unique words |
---|---|---|
total unique words | 64924 | 100% |
unique words occurring once in the corpus | 23192 | 35.72% |
unique words occurring once or twice in the corpus | 31272 | 48.17% |
corpus coverage if vocab does not include words occurring once in the corpus | 99.894% | |
corpus coverage if vocab does not include words occurring once or twice in the corpus | 99.857% | |
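
The coverage figures can be checked from the counts in the table. The sketch below shows one way to do it, weighting each dropped word by the number of tokens it accounts for. `coverage_without_rare` is a hypothetical helper that works on any word-count mapping. The constants are copied from the table, and the variable names are illustrative.

```python
from collections import Counter

def coverage_without_rare(counts, max_rare_count):
    """Fraction of tokens still covered after dropping words seen
    at most max_rare_count times (hypothetical helper)."""
    total = sum(counts.values())
    lost = sum(c for c in counts.values() if c <= max_rare_count)
    return 1.0 - lost / total

# Toy example: "c" occurs once, "b" twice, "a" three times.
toy = Counter("a a a b b c".split())
print(coverage_without_rare(toy, 1))  # 5 of 6 tokens remain
print(coverage_without_rare(toy, 2))  # 3 of 6 tokens remain

# Same arithmetic with the figures from the table.
TOTAL_TOKENS = 21905137
ONCE = 23192
ONCE_OR_TWICE = 31272
TWICE = ONCE_OR_TWICE - ONCE  # words occurring exactly twice

cov_once = 1 - ONCE / TOTAL_TOKENS
# Each word occurring twice removes two tokens from the corpus.
cov_once_or_twice = 1 - (ONCE + 2 * TWICE) / TOTAL_TOKENS
print(f"{100 * cov_once:.3f}%  {100 * cov_once_or_twice:.3f}%")
```

Under this token-weighted calculation, the once-only case matches the tabulated 99.894%. The once-or-twice case comes out at about 99.820%, slightly below the tabulated 99.857%. The tabulated value would be consistent with counting each such word as a single lost token, though that is only a guess about how the table was produced.
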