Units Paper

From SpeechWiki

Revision as of 17:31, 28 April 2009 by Arthur (Talk | contribs)


Outline

Intro
Unit Selection
- Mistake instance
  Unit
  Replacement
- Multiwords
Baseline Description
- Vocab: single most frequent pronunciation from a multi-pronunciation dictionary (better than multi-pronunciation)
Results
- what to emphasize? Ideally, units+DTs will beat just DTs for every number of components. Even if we cannot grow the components until improvement bottoms out, at least there will be a trend.
Conclusion
- Future work: consider context during unit selection (right now the unit is context-free: the same unit appears in every context where a replacement took place).
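
The baseline vocabulary choice in the outline (keep only the single most frequent pronunciation of each word) could be sketched roughly as follows. This is only an illustration: the input format, the function name and the toy pronunciations are assumptions, not the scripts actually used for the paper.

```python
from collections import Counter

def most_frequent_pronunciations(aligned_tokens):
    """Map each word to the pronunciation observed most often.

    aligned_tokens: iterable of (word, pronunciation) pairs, e.g. taken from
    a forced alignment of the training transcripts (hypothetical input).
    """
    counts = {}
    for word, pron in aligned_tokens:
        counts.setdefault(word, Counter())[pron] += 1
    # most_common(1) returns [(pron, count)]; ties go to the first one seen.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Toy data, invented for the example.
tokens = [
    ("the", "dh ah"), ("the", "dh iy"), ("the", "dh ah"),
    ("and", "ae n d"), ("and", "ah n"), ("and", "ah n"),
]
single_pron = most_frequent_pronunciations(tokens)
print(single_pron)
```

The resulting dictionary has exactly one entry per word, which is what the single-pronunciation baseline needs.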

Tests for units paper

tests to run
compPer:	units:	monophone states	Mix:	totalComp:	WER
512	1		503	256k	TR
256	1		1000	256k	TR
64	1		3854	256k	49.3
64	2		2000	256k	?
32	4		2000	256k	?
32	2		4000	256k	?

alternatively

256	48	137	503	127971	53.0
128	48	137	1033	131185	50.9
32	48	137	3845	122907	51.4
64	112	615	2024	~128k	TR
16	4		2000	128k	?
16	112	615	4000	128k	?

The units make it worse

with LM_PENALTY=0

DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

sentences                                         500
with errors                             89.2%   ( 446)

  with substitions                      72.4%   ( 362)
  with deletions                        26.2%   ( 131)
  with insertions                       74.4%   ( 372)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   99.1%   (5107)

Percent Correct           =   28.6%   (1476)

Percent Substitution      =   66.2%   (3411)
Percent Deletions         =    5.1%   ( 265)
Percent Insertions        =   27.8%   (1431)
Percent Word Accuracy     =    0.9%


Ref. words                =           (5152)
Hyp. words                =           (6318)
Aligned words             =           (6583)

CONFUSION PAIRS                  Total                 (2790)
                                With >=  1 occurances (2790)
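
As a sanity check when reading these reports, the summary figures can be recomputed from the raw counts using the usual definitions (errors = substitutions + deletions + insertions, divided by the number of reference words). The sketch below uses the counts from the LM_PENALTY=0 report above; the function name is just for illustration.

```python
def report_summary(ref_words, subs, dels, ins):
    """Recompute the word-level summary figures from raw counts."""
    errors = subs + dels + ins
    correct = ref_words - subs - dels
    return {
        "wer": 100.0 * errors / ref_words,
        "correct": 100.0 * correct / ref_words,
        "accuracy": 100.0 * (ref_words - errors) / ref_words,
        "hyp_words": ref_words - dels + ins,   # words the recognizer output
        "aligned_words": ref_words + ins,      # reference slots plus insertions
    }

# Counts copied from the LM_PENALTY=0 report.
s = report_summary(ref_words=5152, subs=3411, dels=265, ins=1431)
print(round(s["wer"], 1), round(s["correct"], 1), round(s["accuracy"], 1))
print(s["hyp_words"], s["aligned_words"])
```

With so many insertions the error count nearly equals the number of reference words, which is why word accuracy sits close to zero even though 28.6% of the reference words are recognized correctly.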

with LM_PENALTY=-1

test2kUtt/config16Disaster/test0/accuracy/out.nosil.trn.dtl 
DETAILED OVERALL REPORT FOR THE SYSTEM: test2kUtt/config16/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             88.2%   ( 441)

  with substitions                      72.8%   ( 364)
  with deletions                        32.8%   ( 164)
  with insertions                       68.6%   ( 343)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   94.6%   (4857)

Percent Correct           =   27.6%   (1417)

Percent Substitution      =   65.4%   (3358)
Percent Deletions         =    7.0%   ( 357)
Percent Insertions        =   22.3%   (1142)
Percent Word Accuracy     =    5.4%


Ref. words                =           (5132)
Hyp. words                =           (5917)
Aligned words             =           (6274)

LM_PENALTY = -2

DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

sentences                                         500
with errors                             87.2%   ( 436)

  with substitions                      72.8%   ( 364)
  with deletions                        38.8%   ( 194)
  with insertions                       62.8%   ( 314)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   91.4%   (4674)

Percent Correct           =   26.9%   (1374)

Percent Substitution      =   64.0%   (3271)
Percent Deletions         =    9.2%   ( 468)
Percent Insertions        =   18.3%   ( 935)
Percent Word Accuracy     =    8.6%


Ref. words                =           (5113)
Hyp. words                =           (5580)
Aligned words             =           (6048)

CONFUSION PAIRS                  Total                 (2699)
                                 With >=  1 occurances (2699)

LM_PENALTY = -3

DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             87.0%   ( 435)

   with substitions                      72.6%   ( 363)
   with deletions                        40.6%   ( 203)
   with insertions                       56.4%   ( 282)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   88.9%   (4535)

Percent Correct           =   25.4%   (1298)

Percent Substitution      =   63.0%   (3215)
Percent Deletions         =   11.5%   ( 588)
Percent Insertions        =   14.4%   ( 732)
Percent Word Accuracy     =   11.1%


Ref. words                =           (5101)
Hyp. words                =           (5245)
Aligned words             =           (5833)

CONFUSION PAIRS                  Total                 (2671)
                                 With >=  1 occurances (2671)

LM_PENALTY = -4

DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             86.6%   ( 433)

   with substitions                      72.2%   ( 361)
   with deletions                        45.8%   ( 229)
   with insertions                       49.8%   ( 249)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   87.2%   (4446)

Percent Correct           =   24.5%   (1249)

Percent Substitution      =   60.4%   (3078)
Percent Deletions         =   15.1%   ( 770)
Percent Insertions        =   11.7%   ( 598)
Percent Word Accuracy     =   12.8%


Ref. words                =           (5097)
Hyp. words                =           (4925)
Aligned words             =           (5695)

CONFUSION PAIRS                  Total                 (2590)
                                 With >=  1 occurances (2590)

LM_PENALTY = -5

DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             86.4%   ( 432)

   with substitions                      72.2%   ( 361)
   with deletions                        49.2%   ( 246)
   with insertions                       46.2%   ( 231)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   87.0%   (4423)

Percent Correct           =   23.1%   (1176)

Percent Substitution      =   58.9%   (2998)
Percent Deletions         =   17.9%   ( 912)
Percent Insertions        =   10.1%   ( 513)
Percent Word Accuracy     =   13.0%


Ref. words                =           (5086)
Hyp. words                =           (4687)
Aligned words             =           (5599)

CONFUSION PAIRS                  Total                 (2533)
                                 With >=  1 occurances (2533)

LM_PENALTY = -6

DETAILED OVERALL REPORT FOR THE SYSTEM: test/config16/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             87.0%   ( 435)

   with substitions                      72.6%   ( 363)
   with deletions                        53.4%   ( 267)
   with insertions                       40.4%   ( 202)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   86.6%   (4394)

Percent Correct           =   21.5%   (1088)

Percent Substitution      =   56.8%   (2883)
Percent Deletions         =   21.7%   (1101)
Percent Insertions        =    8.1%   ( 410)
Percent Word Accuracy     =   13.4%


Ref. words                =           (5072)
Hyp. words                =           (4381)
Aligned words             =           (5482)

CONFUSION PAIRS                  Total                 (2472)
                                 With >=  1 occurances (2472)
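
To see the trend across the LM_PENALTY sweep at a glance, the headline numbers from the reports above can be collected in one place. The snippet below only re-tabulates those figures; picking the setting where insertions and deletions are closest is a common tuning heuristic, not something the reports themselves prescribe.

```python
# (LM_PENALTY, WER %, deletions %, insertions %) copied from the reports above.
sweep = [
    (0, 99.1, 5.1, 27.8),
    (-1, 94.6, 7.0, 22.3),
    (-2, 91.4, 9.2, 18.3),
    (-3, 88.9, 11.5, 14.4),
    (-4, 87.2, 15.1, 11.7),
    (-5, 87.0, 17.9, 10.1),
    (-6, 86.6, 21.7, 8.1),
]

best = min(sweep, key=lambda row: row[1])                     # lowest WER
balanced = min(sweep, key=lambda row: abs(row[2] - row[3]))   # del closest to ins

for penalty, wer, dels, ins in sweep:
    print(f"{penalty:>3}  WER={wer:5.1f}  del={dels:5.1f}  ins={ins:5.1f}")
print("lowest WER at LM_PENALTY =", best[0])
print("deletions ~ insertions at LM_PENALTY =", balanced[0])
```

WER is still falling at -6, but the gain from -4 onward is under one point, while deletions keep climbing and the percentage of correct words keeps dropping.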

Clearly there is something wrong with the monophone model: a triunit Gaussian model comes out about even with the monophone Gaussian model, when it should be better by about 10% WER, I think.

Monophone tests

Trying to track down where the error is coming from:

Standard monophone, converged once (WER 86.8%):

DETAILED OVERALL REPORT FOR THE SYSTEM: trTest/config19/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             83.4%   ( 417)

   with substitions                      69.4%   ( 347)
   with deletions                        52.8%   ( 264)
   with insertions                       33.6%   ( 168)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   86.8%   (4426)

Percent Correct           =   18.9%   ( 961)

Percent Substitution      =   54.9%   (2796)
Percent Deletions         =   26.3%   (1340)
Percent Insertions        =    5.7%   ( 290)
Percent Word Accuracy     =   13.2%


Ref. words                =           (5097)
Hyp. words                =           (4047)
Aligned words             =           (5387)

The Units monophone:

DETAILED OVERALL REPORT FOR THE SYSTEM: monoUnitTest/config66/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             86.8%   ( 434)

   with substitions                      72.2%   ( 361)
   with deletions                        43.6%   ( 218)
   with insertions                       51.2%   ( 256)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   87.7%   (4497)

Percent Correct           =   25.2%   (1291)

Percent Substitution      =   61.8%   (3166)
Percent Deletions         =   13.0%   ( 668)
Percent Insertions        =   12.9%   ( 663)
Percent Word Accuracy     =   12.3%


Ref. words                =           (5125)
Hyp. words                =           (5120)
Aligned words             =           (5788)

CONFUSION PAIRS                  Total                 (2569)
                                 With >=  1 occurances (2569)

So there are probably two problems:

one with monophones (more units make things worse?!),
and one with triunits (adding context does not make things better).

I will dig into the monophones first.

config 27 Units with --maxStates 500 --unitType wordInternalOnly --subUnits asBefore --growUnitSet 0 , LM_PENALTY=-1

DETAILED OVERALL REPORT FOR THE SYSTEM: testc67OnBreakdownModelc26/config67/test0/accuracy/out.nosil.trn

SENTENCE RECOGNITION PERFORMANCE

 sentences                                         500
 with errors                             88.6%   ( 443)

   with substitions                      73.8%   ( 369)
   with deletions                        36.8%   ( 184)
   with insertions                       56.0%   ( 280)


WORD RECOGNITION PERFORMANCE

Percent Total Error       =   91.1%   (4679)

Percent Correct           =   24.5%   (1260)

Percent Substitution      =   65.1%   (3340)
Percent Deletions         =   10.4%   ( 534)
Percent Insertions        =   15.7%   ( 805)
Percent Word Accuracy     =    8.9%


Ref. words                =           (5134)
Hyp. words                =           (5405)
Aligned words             =           (5939)

CONFUSION PAIRS                  Total                 (2693)

Problems fixed so far

All previous tests were run with the bigram model! All interesting tests should be rerun.
Only 1 subunit on each boundary was clustered, a disadvantage against traditional units, where the center unit was also clustered. Now up to two subunits on each boundary are untied and clustered.
Many unused GMs were left in the trainable params. This probably didn't affect the accuracy but slowed everything down.
SUB_PHONE_COUNTER_CARD was used where WORDSTATE_COUNTER_CARD should have been used. Any word with more than 15 substates was given 0 probability?! Another reason to redo all the monophone tests.
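
To gauge how much of the vocabulary the counter-cardinality bug could have touched, one could list the words whose substate count exceeds the limit. The sketch below assumes a simple word-to-substate-count mapping; the lexicon entries and the function name are invented for illustration, and only the limit of 15 comes from the note above.

```python
MAX_SUBSTATES_WITH_BUG = 15  # from the note above; actual cardinality values are not given here

def words_zeroed_by_bug(substates_per_word, limit=MAX_SUBSTATES_WITH_BUG):
    """Return the words that have more substates than the counter could index."""
    return sorted(w for w, n in substates_per_word.items() if n > limit)

# Hypothetical lexicon: substate counts are made up for the example.
lexicon = {"a": 3, "seven": 15, "seventeen": 24, "university": 30}
affected = words_zeroed_by_bug(lexicon)
print(affected)
```

Running a check like this over the real dictionary would show which reference words could never have been hypothesized in the earlier tests.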

Finally, at least something reasonable, and (hopefully) a 2% WER improvement in the baseline. So, redoing the monophone tests:

monophone tests
trainConfig	testConfig	Descr	States	WER	WER 2k	Comments
triUnitsModel/config28	testc67triUnitsModelc28/config67	baseline but using the new testing stuff: --maxStates 0 --unitType wordInternalOnly --subUnits asBefore --growUnitSet 0	137	84.3%	86.0%
moreDTsModel/config20	trTestTrigramFixed/config19	baseline	137	84.3%		Same as above but using older code
triUnitsModel/config27	testc67TriOnBreakdownModelc27/config67	--maxStates 500 --unitType wordInternalOnly --subUnits asBefore --growUnitSet 0	627	87.2	87.6
triUnitsModel/config26	testc67TriOnBreakdownModelc26/config67	--maxStates 500 --unitType wordInternalOnly --subUnits asBefore	627	87.1 was 91.1%	TE
triUnitsModel/config25	xx	--maxStates 500 --unitType wordInternalOnly	636	xx	87.1
triUnitsModel/config24	xx	--maxStates 500	615	xx	xx
triUnitsModel/config29	xx	--maxStates 500 --unitType multiWordOnly	637	xx	88.2	There must be something wrong with this - the confusion pairs are not making sense and there are way too many deletions
triUnitsModel/config30	xx	--maxStates 500 --unitType wordInternalOnly --subUnits asBefore, initializeUnitsFromFile	627	xx	86.0