Computer Resources

From SpeechWiki

Revision as of 04:42, 27 October 2009 by Arthur (Talk | contribs)


This Wiki

Is generally readable by anyone, and editable by anyone with an account (an account can be created by anyone, too). However, there is a namespace 'SST' containing pages and work that only members of the sst_group can view or edit.

  • To check if you are a member, look here.
  • To become a member, email Arthur or Mark.
  • To create a page in the SST namespace, start the title of your page with SST: e.g. SST:Test_Page.
  • To move an existing page into or out of the SST namespace, use the 'move' tab at the top of the page.

LVCSR at Illinois Computer Resources

  • Data:
    • Corpora we develop and distribute
    • We are members of LDC. Most LDC data is organized as described in the Data Organization README. Some useful slices of LDC data that have not been moved to ifp-32-2 include:
      • /workspace/fluffy1/12hour - 12 hours extracted from Switchboard 1, with SPHERE and WAV audio, MFCCs, transcriptions.
      • /workspace/fluffy1/{train-ws96,train-ws97,misc-ws97} - The ICSI phonetically transcribed Switchboard-1 extracts
      • /workspace/fletcher1/bdc - The Boston Directions Corpus, two speakers have prosodic transcriptions, others don't
      • /workspace/nibbler0/data/ylzheng/WS04/DATA - Tsinghua Wu-accented Mandarin (MFCC and FMT only, no waveforms)
      • /workspace/fluffy1/penn_treebank
  • Time-aligned Switchboard Disfluency corpus
    • mickey0/sw_disTime-0.9.9 - merged from the original Switchboard time transcription and the Treebank-3 disfluency transcription (TextGrid included)
    • mickey0/sw_disTime-1.0.0 (TextGrid NOT included)

Parallel Computing

The Installation notes for ifp-32 are available.

Sun Grid Engine on ifp-32

Bowon's brief introduction to SGE is here. A detailed SGE document, including job dependencies, is also available.[1]
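The job-dependency mechanism mentioned above is SGE's `-hold_jid` option. A minimal sketch, assuming a two-stage job (the job names `prep`/`train` and script names are placeholders, not actual lab conventions):

```shell
# Submit a first job, then a second that is held until the first finishes.
qsub -cwd -N prep prep.sh
qsub -cwd -N train -hold_jid prep train.sh   # -hold_jid accepts job names or IDs

qstat            # held jobs show state 'hqw' until the dependency completes
qdel <jobid>     # delete a job by its numeric ID if needed
```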

MPI

Perl MPI Simple


FAQ

Two jobs I submitted this afternoon remain in "dr" status. I tried to delete them with "qdel -f <jobid>", but it didn't work.
This happens because the compute node the jobs were running on (compute-1-9 in this case) crashed. You can confirm it is down at https://ifp-32.ifp.uiuc.edu/ganglia/. Someone needs to physically go to the lab and reboot the 9th machine from the BOTTOM of the rack. (Plug the keyboard and monitor from the back of the rack into the suspect machine first, just to check that it really is the right machine.)

Applications

  • Acoustic model training:
    • HTK hidden Markov modeling toolkit: ifp-32-1/hasegawa/programs/htk-3.4
    • GMTK Dynamic Bayesian Nets/Graphical Models: nibbler0/speech_apps/GMTK
    • Sphinx speech recognizer
    • LIUM speech tools, including speaker segmentation
  • Decoding:
    • Julius LVCSR decoder - /workspace/ifp-32-1/hasegawa/programs/julius-4.1
    • AT&T DCD LVCSR decoder - nibbler0/speech_apps/dcd-2.0
  • Language model training:
    • SRILM Big N-gram counts and backoff, lattices: fluffy0/programs/srilm
    • AT&T FSM Library: fluffy0/programs/fsm-4.0
    • OpenFST: fluffy0/programs/OpenFst/
  • Spectrograms and Waveform Viewing
    • XKL (MIT): nibbler0/speech_apps/xkl-2.3.1
    • ESPS (Entropic Systems, now Microsoft)
    • Praat

Installing / Arranging Software

If you download Linux software from the internet and find it useful, please put it where others can also use it! Here's how.

  1. Type `umask 022` or `umask 000`. If you use 022, you are volunteering to manage the package; if you use 000, you are inviting others to help manage it.
  2. Download the tarfile to /workspace/ifp-32-1/hasegawa/programs; untar it to create $PACKAGE_DIR; remove the tar file (important!); configure; make all.
  3. Decide where you want the binaries. Reasonable places for programs are /workspace/ifp-32-1/hasegawa/programs/...
    • scripts = executes on any machine (e.g., perl, bash scripts)
    • bin.`uname` (i.e., bin.Linux) = executes on both ifp-32 and mickey. PLEASE CHECK: ssh mickey; execute code; see if it gives you "cannot execute binary file".
    • bin.`arch` = executes only on machines of type `arch`. Type `arch` to see what machine you're on.
    • $PACKAGE_DIR/bin.Linux = packages with many binaries should remain in $PACKAGE_DIR, to avoid over-writing similarly-named programs in ../bin.Linux.
  4. Change the installdir variable in your Makefile, according to your decision in part (3). Type "make install" to install, then "make clean" to remove object files and such.
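The umask choice in step (1) controls whether others can modify the installed tree. A small sketch of the effect (using /tmp and placeholder directory names rather than the real programs directory):

```shell
# umask 022: group/others can read and traverse, but only you can modify.
umask 022
mkdir /tmp/demo_pkg
stat -c %a /tmp/demo_pkg          # -> 755: you are volunteering to manage it

# umask 000: anyone can modify, i.e. others may help manage the package.
umask 000
mkdir /tmp/demo_pkg_shared
stat -c %a /tmp/demo_pkg_shared   # -> 777
```

A typical build then follows the steps above: untar into the programs directory, remove the tarball, `./configure`, `make all`, set `installdir` in the Makefile, `make install`, `make clean`.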

Backups

If you have personal working directories that should be regularly backed up, outside of your own home directory, list them here.

  • Art
    • mickey0/akantor
    • rizzo1/akantor is itself a backup of svn because it cannot be backed up in the normal way.
  • Sarah
    • nibbler0/data
    • rizzo0/sborys
    • spot1/sborys
    • tico0/sborys
  • Xiaodan
    • /workspace/tico0/AED/
  • Camille
    • /workspace/ifp-32-2/hasegawa/data/multimodal/nonspeech/FODAVA/

SVN

Our server is svn://mickey.ifp.uiuc.edu

On Windows, download TortoiseSVN.

On Linux, the client is svn, and it should be installed everywhere.

For Linux command-line help, see this simple tutorial (ignore the svnadmin commands, and replace file:///home/user/svn with svn://mickey.ifp.uiuc.edu).

Compiling

gcc is used by default, but I (Arthur) am getting good results with Intel's compiler, which is free for non-commercial use and is installed in /workspace/ifp-32-1/hasegawa/programs/intel (we got the Fortran and C/C++ compilers and the Intel math library).

Benchmarking quicknet in 4-thread mode with every combination of the Intel/gcc compilers and the ATLAS/Intel implementations of the BLAS library gives the following:

logs/smallGccCompilerIntelMath.log:     CV speed: 4351.14 MCPS, 3107.8 presentations/sec.
logs/smallGccCompilerIntelMath.log:     Train speed: 2056.95 MCUPS, 1469.2 presentations/sec.
logs/smallGccCompilerIntelMath.log:     CV speed: 4691.55 MCPS, 3351.0 presentations/sec.

logs/smallIntelCompilerIntelMathLib.log:CV speed: 3984.39 MCPS, 2845.9 presentations/sec.
logs/smallIntelCompilerIntelMathLib.log:Train speed: 2140.31 MCUPS, 1528.7 presentations/sec.
logs/smallIntelCompilerIntelMathLib.log:CV speed: 4034.74 MCPS, 2881.9 presentations/sec.

logs/smallIntelCompilerATLASMathLib.log:CV speed: 3508.69 MCPS, 2506.1 presentations/sec.
logs/smallIntelCompilerATLASMathLib.log:Train speed: 1961.05 MCUPS, 1400.7 presentations/sec.
logs/smallIntelCompilerATLASMathLib.log:CV speed: 3553.22 MCPS, 2537.9 presentations/sec.

logs/smallGccCompilerATLASMathLib.log:  CV speed: 4219.30 MCPS, 3013.7 presentations/sec.
logs/smallGccCompilerATLASMathLib.log:  Train speed: 1954.73 MCUPS, 1396.2 presentations/sec.
logs/smallGccCompilerATLASMathLib.log:  CV speed: 4133.10 MCPS, 2952.1 presentations/sec.

The train speed is the interesting number because training takes the longest, and there we get almost a 10% speedup. Strangely, CV (testing) speed is best with gcc and the Intel math library.

Using the Intel compiler and math library from the setup above, running on the shiny new PCs that Mark got for us:

CV speed:    3828.51 MCPS, 2734.6 presentations/sec.
Train speed: 2932.56 MCUPS, 2094.6 presentations/sec.
CV speed:    4093.49 MCPS, 2923.8 presentations/sec.

Going from gcc to the Intel toolchain, switch tools as follows:

gcc   Intel
---   -----
gcc   icc
g++   icpc
ar    xiar
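One way to apply that substitution is a Makefile conditional, so the toolchain is switched in one place. A hypothetical fragment (the `COMPILER` variable name is an assumption, not an existing convention here):

```makefile
# Select the toolchain with: make COMPILER=intel
ifeq ($(COMPILER),intel)
  CC  = icc
  CXX = icpc
  AR  = xiar
else
  CC  = gcc
  CXX = g++
  AR  = ar
endif
```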