Transcription Guidelines

From SpeechWiki

Have the Praat transcription interface installed
Before starting an utterance
  • Have the phone-to-feature mappings, and these instructions handy.
  • Write down the time.
  • Open the utterance in Praat:
    • If you have not created a textgrid file yet, do the following:
      • Open the .wrd file that contains the word transcription: in the Praat Objects window, Read -> Read from special tier file... -> Read IntervalTier from xwaves...
      • Open the corresponding .wav file: in the Praat Objects window, Read -> Read from file...
      • While the sound file is highlighted in the Praat Objects window, go to New -> Create Gesture Textgrid...
      • Open the Articulatory Feature Manual: in the Praat Objects window, Help -> Articulatory Feature Manual.
      • Now you are ready to transcribe.
    • If you already have a textgrid file for the utterance that you want to transcribe:
      • Open the .wav file and its corresponding .textgrid file in the Praat Objects window.
      • Select both, and click Edit.
      • Open the Articulatory Feature Manual page: in the Praat Objects window, Help -> Articulatory Feature Manual.
      • Now you are ready to transcribe.
To keep in mind during transcription
  • We recommend that transcribers do the canonical phone transcription and conversion first, and then do articulatory feature transcription for non-canonical pronunciations. That is why the Gesture manual page lists the Tier 2 symbols first, followed by the conversion button, and then all the other feature symbols.
  • The boundaries in the initial word transcription may have errors. Do not feel bound by them.
  • The final transcription should give enough information so that the speaker, by looking at the transcription, could recreate the acoustics exactly. More on this in the detailed guidelines below.
  • Go for accuracy over speed. But if you are very unsure about a segment, use "?" or multiple labels (e.g. "FRIC/APP").
  • Don't worry about exact boundaries up to +/- 20ms. I.e. if you find yourself unsure about a boundary location, but the uncertainty is within +/- 20ms, place the boundary somewhere in the appropriate range and move on.
  • Use the comment tier to mark anything problematic or that should be discussed.
  • Use the menus to label feature tiers, rather than typing in, unless you are marking an unusual/uncertain segment (e.g. with "?" or multiple labels).
  • It's OK to use some deductive reasoning. E.g. if the intended sound is [b] but the actual segment is a fricative, it is probably a LAB, FRIC and not a [v] (L-D, FRIC).
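As a rough illustration of the +/- 20ms tolerance above, the following sketch treats two boundary placements as equivalent when they differ by no more than 20ms. The function name and the use of milliseconds are assumptions made for this example; it is not part of the Praat interface or of any project script.

```python
TOLERANCE_MS = 20.0  # the +/- 20 ms tolerance from the guideline above


def boundaries_agree(t_a_ms: float, t_b_ms: float,
                     tolerance_ms: float = TOLERANCE_MS) -> bool:
    """Return True if two boundary times (in ms) differ by at most the tolerance."""
    return abs(t_a_ms - t_b_ms) <= tolerance_ms


if __name__ == "__main__":
    print(boundaries_agree(1250.0, 1265.0))  # 15 ms apart
    print(boundaries_agree(1250.0, 1290.0))  # 40 ms apart
```

A check like this could be useful during a 2nd pass, to separate boundary differences worth discussing from those that fall inside the tolerance.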
When finishing an utterance
  • Write down the time.
  • Save the transcription.
  • Do these sanity checks:
    • Re-skim the guidelines below to make sure the transcription follows them.
    • Listen to each segment to make sure you believe the label for that segment.
    • Run the check_correctness script to see if there are any mistakes. If there are any, fix them.
  • Send the finished textgrid file to Heejin.
  • Stretch, get a drink...


When doing a 2nd pass
  • Fix mistakes, not disagreements.
  • Once the 2nd pass is done, you should be able to "defend" each difference between your transcription and the other transcribers', i.e. "I chose this over that because...".

Detailed guidelines

In no particular order...

When using a phone label to generate a feature vector
  • If a phone has an unspecified feature value ([hh], [q], [r], [er], [axr], [sh], [zh], [ch], [jh]), that feature must be entered explicitly.
  • E.g. an [hh] must have its vowel tier specified by hand; it should be easy to tell what vowel shape the [hh] is in based on formants/listening. If it does not have any formant structure and you don't hear it, then mark it N/A. Similarly, when [q] is realized as IRR, the vowel shape is usually easy to tell and should be labeled in the .vow tier (if it's not easy to tell, label it '?').
  • For some phones with unspecified feature values, it may be hard to tell what the actual value is; e.g. an [r] may be rounded or unrounded; label it '?' if you can't tell.
[ah] vs. [ax], [ih] vs. [ix], [er] vs. [axr]
  • Use a schwa if the segment is unstressed and 50ms or shorter.
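The schwa rule can be written as a small decision function. This is only a sketch: the function name, the dictionary, and the assumption that stress and duration are already known are all hypothetical, and the choice of the full vowel itself remains a listening judgment.

```python
# Full vowel -> reduced (schwa-like) counterpart, from the heading above.
SCHWA_FOR = {"ah": "ax", "ih": "ix", "er": "axr"}
MAX_SCHWA_MS = 50  # "50ms or shorter"


def choose_vowel_label(full_label: str, duration_ms: float, stressed: bool) -> str:
    """Return the reduced label if the segment is unstressed and short enough."""
    if not stressed and duration_ms <= MAX_SCHWA_MS:
        # Labels without a reduced counterpart are returned unchanged.
        return SCHWA_FOR.get(full_label, full_label)
    return full_label
```

For example, `choose_vowel_label("ah", 40, stressed=False)` gives `"ax"`, while the same segment in a stressed syllable stays `"ah"`.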
APP degree
  • Used for both glides ([y], [w]) and other sounds realized as approximants. If there is any gesture towards an intended consonant, even if small, use APP degree. E.g. if "probably" is produced almost like [p r ay] but with some evidence of lip narrowing in the middle of the [ay]-like region, mark that as APP, LAB.
Two stop closures in a row
  • If you can't tell when the place of closure has changed (e.g. "woul*d g*o"), just mark the boundary in the middle.
"GLO" place
  • Used only for glottal stops.
VOI vs. IRR in the .glo tier
  • If there are regular pitch periods, even with very low pitch, label them as VOI. Use IRR only when the pitch periods are not at regular intervals.
Creaky voice vs. IRR
  • Creaky voice should be labeled as VOI as long as intervals between pulses are regular.
  • Creaky voice receives the IRR label ONLY if the segment shows irregular pulse spacing. That is, creaky voice is NOT always marked differently from other voicing, mainly because creaky voicing isn't phonemically significant in English.
  • To determine the beginning and end boundaries of an IRR portion, you can use the PULSES function in the Praat editor.
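The VOI/IRR distinction depends on whether the intervals between glottal pulses are regular. The sketch below shows one possible way to quantify that from a list of pulse times (for instance, times read off the pulse display). The 20% deviation threshold, the function name, and the "?" fallback for too few pulses are assumptions made for illustration; the guidelines themselves give no numerical criterion, and the decision is made by eye and ear.

```python
from statistics import median


def glo_label_from_pulses(pulse_times_s, max_rel_dev=0.2):
    """Guess VOI vs. IRR from pulse times (seconds).

    max_rel_dev is an assumed threshold: every inter-pulse interval must lie
    within this fraction of the median interval to count as regular.
    """
    if len(pulse_times_s) < 3:
        return "?"  # too few pulses to judge regularity
    intervals = [b - a for a, b in zip(pulse_times_s, pulse_times_s[1:])]
    typical = median(intervals)
    regular = all(abs(iv - typical) <= max_rel_dev * typical for iv in intervals)
    # Regular spacing is VOI regardless of how low the pitch is.
    return "VOI" if regular else "IRR"
```

Note that the check ignores the pitch itself: evenly spaced pulses at a very low rate still come out as VOI, in line with the creaky-voice guideline.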
Voiceless vowels
  • Mark them as "VL", not "ASP", in the glottal tier (to differentiate from [hh])
  • A reminder about when to use A+VO
      • voiced [h]
      • aspirated vowels/liquids/glides
      • the aspirated part of a stop burst, when voicing for the following vowel starts before the onset of the vowel
The vowel rule
  • If the .pl/.dg tiers are both "NONE/VOW", then there should be a vowel label in the .vow tier; otherwise, the .vow tier should be "N/A".
  • The only exception is for rhoticized vowels ([er], [axr]) and syllabics ([el], [em], [en]); these get a vowel label in the .vow tier and a constriction in the .pl1/.dg1 tier ("RHO/APP" for [axr], [er]; "LAT/CLO", "LAB/CLO", or "ALV/CLO" for [el], [em], [en]).
Laterals
  • Use "LAT" place for both light and dark [l]s.
  • The default dg1 is CLO, but make sure to check whether the [l] has a full closure or is an approximant. For an [l] with incomplete tongue tip closure, transcribers should manually change CLO to APP in the degree tier. (Example: the [l]s at the end of the words "all", "feel" would usually be marked as a "LAT/APP" segment followed by a "LAT/CLO" segment.)
Stops
  • For unaspirated (voiced or voiceless) stops, the .dg is CLO during the closure, then BUR during the release.
  • All stop bursts will be canonically voiceless (glo = VL). Voiced vs. voiceless stops will only be distinguished by their VOTs (as it should be).
  • Canonical stop closures pcl, tcl, kcl, bcl, dcl, gcl will be: glo = VOI for voiced stops, and glo = VL for voiceless stops. However, if you don't see a voice bar, that should of course be changed to glo = VL.
  • For voiceless stops at the beginning of a stressed syllable, there may be aspiration that is clearly distinct from a burst portion. If so, use CLO for closure, BUR for burst, APP during the aspiration. The aspiration should also be indicated as ASP in the .glo tier.
    • For example, to label aspirated obstruents like in the word "invi[tʰ]ation" (0012), do the labeling in the following sequence:
      • Label the burst portion as 't' in tier 2 and click the phone-to-gesture conversion button.
      • For the aspirated portion, 1) add a new interval in tier 4 (dg1) and label as APP, and 2) add a new interval in tier 8 (glo) and label as ASP. You don't have to change the other features such as place.
      • If voicing extends into the ASP section, then label it A+VO (aspirated with voice) as necessary.
  • If there is an utterance-initial or -final stop closure, it may not be possible to tell where the boundary between closure and silence is. In that case, mark a 100ms-long closure.
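The stop guidelines above can be summarized in a short sketch. The first function lists the dg1/glo labels for the closure, burst and aspiration of an aspirated voiceless stop; the other two apply the 100ms default closure. All function names are hypothetical, and clamping the closure to the file boundaries is an assumption not stated in the guidelines.

```python
DEFAULT_CLOSURE_S = 0.100  # 100 ms default when the closure edge is not visible


def aspirated_stop_labels(voicing_in_aspiration: bool = False):
    """(dg1, glo) pairs for closure, burst and aspiration of a voiceless stop."""
    return [
        ("CLO", "VL"),   # closure of a voiceless stop
        ("BUR", "VL"),   # bursts are canonically voiceless
        ("APP", "A+VO" if voicing_in_aspiration else "ASP"),  # aspiration
    ]


def initial_closure_start(burst_start_s: float, file_start_s: float = 0.0) -> float:
    """Start of an utterance-initial closure: 100 ms before the burst."""
    # Assumption: never place the boundary before the start of the file.
    return max(file_start_s, burst_start_s - DEFAULT_CLOSURE_S)


def final_closure_end(closure_start_s: float, file_end_s: float) -> float:
    """End of an utterance-final closure: 100 ms after it begins."""
    # Assumption: never place the boundary after the end of the file.
    return min(file_end_s, closure_start_s + DEFAULT_CLOSURE_S)
```

For a burst starting at 0.35s in an utterance-initial stop, the closure would be marked from about 0.25s; if the burst starts only 0.04s into the file, the closure simply starts at the beginning of the file.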
Transitional periods between steady states
  • Don't label them as separate segments if they are natural/necessary transitional periods, e.g. the formant transitions between vowels and consonants. If there is an "extra" transitional sound beyond what is necessary, like "feel" --> [f iy ax l], label it as such.
Voicing onset/offset
  • When in doubt, open a waveform blow-up; the onset/offset of voicing is the point at which periodicity starts/stops.
Vowel onset
  • The vowel after a stop burst starts when the upper formant structure starts, not at the beginning of voicing. The onset of glo = VOI is still to be marked at the actual onset of voicing, i.e. at the first sign of periodicity in the signal.
Vowel qualities of the onset and offset of diphthongs
  • [aw1] should look/sound more like an [ae] (or [aa] in some dialects), [aw2] like an [uh] or [w]
  • [ay1] <--> [aa], [ay2] <--> [ih] or [y]
  • [ey1] <--> [eh], [ey2] <--> [ih] or [y]
  • [ow1] doesn't have a non-diphthong correlate, [ow2] <--> [uh] or [w]
  • [oy1] <--> [ao] or [ow1], [oy2] <--> [ih] or [y]
  • Boundaries between 1 and 2 in diphthongs should be determined by identifying these qualities.
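For quick reference, the correspondences listed above can be kept as a lookup table. The dictionary and function names are made up for this example; the contents simply restate the list, with an empty entry for [ow1] because it has no non-diphthong correlate.

```python
# Diphthong half -> sounds it should look/sound like (from the list above).
DIPHTHONG_TARGETS = {
    "aw1": ["ae", "aa"],   # [aa] in some dialects
    "aw2": ["uh", "w"],
    "ay1": ["aa"],
    "ay2": ["ih", "y"],
    "ey1": ["eh"],
    "ey2": ["ih", "y"],
    "ow1": [],             # no non-diphthong correlate
    "ow2": ["uh", "w"],
    "oy1": ["ao", "ow1"],
    "oy2": ["ih", "y"],
}


def similar_sounds(diphthong_half: str) -> list[str]:
    """Return the reference sounds for a diphthong half, or [] if none/unknown."""
    return DIPHTHONG_TARGETS.get(diphthong_half, [])
```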
Diphthongs realized as monophthongs
  • Label them as the monophthong, not as 1, when possible. For example, an underlying [aw] that's produced as an [ae] should be marked [ae], not [aw1]. Similarly for [ey]. For [ow], there's no monophthong label corresponding to the first part, so mark it [ow1]. For [oy], the initial part of it may sound like an [ao], in which case label it as such; or it may sound more like the sound at the beginning of [ow] or "or", in which case label it [oy1].
Diphthongs realized as diphthongs but with non-canonical pronunciation
  • Label using monophthong symbols. E.g. when the offset of the word "I" is much more reduced than a typical [ay2], label it [aa]-[ix] rather than [ay1]-[ix].
How to deal with H#
  • For H# at the beginning and end of a file, label it as SIL.
  • For H# in the middle of file, decide whether it is SIL or xxx (non-speech), and mark it accordingly.
The "recreation" rule
  • The final transcription should give enough information so that the speaker, by looking at the transcription, could recreate the acoustics exactly (assuming he/she could actually read the transcription).
  • For example, if a word is very reduced but with a hint of the original gestures, that should be indicated somehow in the transcription. E.g. if the word is "probably" and is produced like "pry" but with a hint of labial/lateral gestures in the middle, don't transcribe it as /pcl p r ay1 ay2/. Use place=LAB or LAT and degree=APP to indicate these hints of gestures.

Miscellaneous tips

  • If an utterance is particularly long or difficult, remember that you can always save a partial transcription and return to it later.
  • Most people seem to find it easier to label all tiers from left to right, rather than one tier at a time. However, feel free to label however you see fit.