Fisher Corpus

From SpeechWiki

Revision as of 02:05, 28 March 2008

The Fisher corpus is still relatively new and rough; this page is meant to help people quickly build a basic speech recognizer with it.

Vocabulary

Some Corpus Statistics
total non-empty utterances: 2223159
total uncertain words or phrases enclosed in (( )), e.g. (( NO WAY )): 283935
total word tokens in corpus (including uncertain words): 21905137 (100%)
total non-speech markers enclosed in [], e.g. [LAUGH]: 559629 (2.555%)
total partial words (starting or ending in -): 154130 (0.7036%)
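
The counts above could be gathered with a script along the following lines. This is only a minimal sketch, not the script used for this page: it assumes each utterance is already available as a plain text string with speaker and timing fields stripped, that non-speech markers count as word tokens, and that a lone "-" is not a partial word. The function name count_markers is hypothetical.

```python
import re

# (( ... )) marks an uncertain word or phrase; [WORD] marks a non-speech event.
UNCERTAIN = re.compile(r"\(\(\s*(.*?)\s*\)\)")
NON_SPEECH = re.compile(r"\[[^\]\s]+\]")


def count_markers(utterances):
    """Tally corpus statistics over an iterable of utterance strings."""
    stats = {"utterances": 0, "uncertain": 0, "tokens": 0,
             "non_speech": 0, "partial": 0}
    for text in utterances:
        if not text.strip():
            continue  # skip empty utterances
        stats["utterances"] += 1
        stats["uncertain"] += len(UNCERTAIN.findall(text))
        # Drop the (( )) brackets but keep the words inside them,
        # so uncertain words still count as tokens (an assumption).
        plain = text.replace("((", " ").replace("))", " ")
        for tok in plain.split():
            stats["tokens"] += 1
            if NON_SPEECH.fullmatch(tok):
                stats["non_speech"] += 1
            elif len(tok) > 1 and (tok.startswith("-") or tok.endswith("-")):
                stats["partial"] += 1
    return stats
```

For example, count_markers(["(( NO WAY )) i th- think so [LAUGH]", "", "yeah"]) sees two non-empty utterances, one uncertain phrase, eight tokens, one non-speech marker and one partial word.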
Some vocab statistics
total unique words: 64924 (100%)
unique words occurring once in the corpus: 23192 (35.72%)
unique words occurring once or twice in the corpus: 31272 (48.17%)
corpus coverage if vocab does not include words occurring once in the corpus: 99.894%
corpus coverage if vocab does not include words occurring once or twice in the corpus: 99.857%
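
The vocabulary figures above can be derived from a table of word frequencies. The sketch below is one way to do it, under the assumption that coverage means the share of word tokens that remain in the vocabulary after rare words are dropped. The function name vocab_stats and the max_rare parameter are hypothetical.

```python
from collections import Counter


def vocab_stats(counts, max_rare=1):
    """Summarize a word -> frequency mapping.

    Words seen at most `max_rare` times are treated as out of vocabulary.
    """
    total_tokens = sum(counts.values())
    rare = [w for w, c in counts.items() if c <= max_rare]
    rare_tokens = sum(counts[w] for w in rare)
    return {
        "unique": len(counts),
        "rare_types": len(rare),
        # Share of the vocabulary (word types) that is rare.
        "rare_type_pct": 100.0 * len(rare) / len(counts),
        # Share of corpus tokens still covered once rare words are dropped.
        "coverage_pct": 100.0 * (total_tokens - rare_tokens) / total_tokens,
    }


counts = Counter("a a a b b c".split())
once = vocab_stats(counts, max_rare=1)
once_or_twice = vocab_stats(counts, max_rare=2)
```

Applied to the totals listed above, dropping the 23192 singletons removes 23192 of 21905137 tokens, which gives the listed 99.894%. Under the same definition, the 8080 words seen exactly twice would remove two tokens each, for 39352 tokens in total and a coverage of about 99.820%. The listed 99.857% instead matches subtracting 31272 tokens, so it may have been computed with each rare word counted once.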

Language Model
