Florida Voices
Florida Voices is an initiative of the Florida Electronic Library to support all types of libraries and cultural heritage organizations in Florida.

IV. Synchronizing transcripts and audio

  1. Multimedia integration
  2. Synchronizing text with audio using SMILs
  3. Letting text drive the audio
  4. Additional reading
1. Multimedia integration

It is not necessary to synchronize transcripts with audio or video recordings.  Many users of oral history will be happy to have the full transcript available in one piece.  However, if the budget allows, synchronizing audio and text may benefit some listeners by helping them maintain their place in the transcript and by making it easier to skip around in the interview.

Synchronization is most commonly accomplished using SMIL or SAMI.  SMIL (Synchronized Multimedia Integration Language) is a markup language for multimedia presentations, published as a Web standard by the World Wide Web Consortium (W3C).  SAMI (Synchronized Accessible Media Interchange) is a similar language developed by Microsoft.  SMIL is supported by QuickTime and RealPlayer, while SAMI is used by Windows Media Player.

Both SAMI and SMIL are HTML-like languages that can be written "by hand" or generated by a program.  Although few programs are designed specifically for oral history, there are many products that can be adapted to the task.  Captioning tools designed to increase Web accessibility for the hearing impaired work fine for oral history as well.  These include MAGpie, a free tool developed by the CPB/WGBH National Center for Accessible Media (NCAM), and Hi Caption Studio, which costs about $500.  Both of these products can output either SMIL or SAMI.

Web accessibility sites are good sources of "how to" information about captioning for common media players.  See: 

2. Synchronizing text with audio using SMILs

SMIL 2.0, published in 2001, is a complicated language not yet fully supported by any media player.  However, a simple SMIL (pronounced "smile") file that associates sections of transcript with the corresponding audio can easily be made by hand.  Like HTML, a SMIL file is just tagged text with a <head> section and a <body> section.  The basic format of a SMIL file is:

<smil>
<head>
</head>
<body>
</body>
</smil>

The <head> section defines the layout of the presentation, for example:

<head>
<layout>
<root-layout width="800" height="800" background-color="white"/>
<region id="banner" top="0" left="0" height="42"/>
<region id="a" top="43" left="0" height="500"/>
</layout>
</head>

Here we define a screen of 800 x 800 pixels with two regions.  The region called "banner" starts at the top left-hand corner and is 42 pixels high.  The region called "a" starts right below the banner region and is 500 pixels high.  Because neither region specifies a width, each is assumed here to span the full width of the screen.
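When a layout has more than a couple of regions, the top offsets can be computed rather than typed.  The following is only a minimal sketch in Python, not part of the original example: the function name stacked_regions and the one-pixel gap between regions are assumptions chosen to reproduce the two <region> lines shown above.

```python
def stacked_regions(regions, gap=1):
    """Return SMIL <region> lines for regions stacked top to bottom.

    regions: list of (id, height_in_pixels) pairs, topmost first.
    gap: blank pixels left between regions (an assumption of this sketch).
    """
    lines = []
    top = 0  # vertical offset of the next region, in pixels
    for region_id, height in regions:
        lines.append(
            f'<region id="{region_id}" top="{top}" left="0" height="{height}"/>'
        )
        top += height + gap
    return lines


if __name__ == "__main__":
    for line in stacked_regions([("banner", 42), ("a", 500)]):
        print(line)
```

The output can be pasted between the <layout> and </layout> tags.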

<body>
<par>
<img src="floridaVoices.jpg" region="banner" dur="165s"/>
<audio src="Gross.MP3" />
<text id="pt1" dur="25s" src="Gross-1.txt" region="a"/>
<text id="pt2" begin="+25s" dur="24s" src="Gross-2.txt" region="a" />
<text id="pt3" begin="+49s" dur="25s" src="Gross-3.txt" region="a" />
<text id="pt4" begin="+74s" dur="80s" src="Gross-4.txt" region="a" />
<text id="pt5" begin="+154s" dur="11s" src="Gross-5.txt" region="a" />
</par>
</body>

The <body> section defines the presentation.  The section above identifies seven media files: one image, one audio file, and five text files.  The image will be displayed in the previously defined region called "banner", and the text files will display in the region called "a".  The transcript is synchronized with the audio by the begin and dur attributes.  For example, the section of transcript called "pt2" will begin displaying after 25 seconds and remain on the screen for 24 seconds.  Presumably, that corresponds to the start and end time of the appropriate audio segment.  The <par> tag specifies that everything within it will be played or displayed in parallel (while following the timing instructions).  Without it, the elements would play in sequence, and the text would not appear until the audio had completed.
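Each begin value is simply the sum of the durations of the segments before it, so the <text> lines can be generated from a list of durations instead of being added up by hand.  The sketch below is one possible way to do this in Python.  The function name text_elements and its parameters are hypothetical, and the file naming pattern (Gross-1.txt, Gross-2.txt, and so on) is taken from the example above.

```python
def text_elements(durations, stem="Gross", region="a"):
    """Return one SMIL <text> line per transcript segment.

    durations: seconds each segment stays on screen, in playback order.
    stem: file name prefix; segment N is assumed to live in stem-N.txt.
    """
    lines = []
    start = 0  # running offset from the start of the <par> block, in seconds
    for number, dur in enumerate(durations, start=1):
        attrs = [f'id="pt{number}"']
        if start > 0:
            # The first segment starts immediately, so it needs no begin.
            attrs.append(f'begin="+{start}s"')
        attrs.append(f'dur="{dur}s"')
        attrs.append(f'src="{stem}-{number}.txt"')
        attrs.append(f'region="{region}"')
        lines.append("<text " + " ".join(attrs) + "/>")
        start += dur
    return lines


if __name__ == "__main__":
    # Durations of the five segments in the example above.
    for line in text_elements([25, 24, 25, 80, 11]):
        print(line)
```

The five durations add up to 165 seconds, which matches the dur given to the banner image.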

To execute the full SMIL file, click here (you may have to install RealPlayer).   

This is a very simplistic example designed to show one technique -- breaking up a transcript into multiple files each containing small-ish segments of text, and timing the display of each file to correspond to the spoken audio.  This can be encoded in SMIL in several different ways, and the layout designed in the <head> section can be far more sophisticated than that shown here.  For more information see the official SMIL web page on the World Wide Web Consortium website.  This includes links to different versions of the specification, a list of authoring tools, and several SMIL manuals and tutorials.
3. Letting text drive the audio

The technique described above is useful for audiences who want to listen to an entire interview while reading along.  Often, however, researchers will skim the text of the transcript looking for particular topics, and will want to play the audio corresponding to a selected section of text.  In other words, they will want the text to drive the audio, rather than the other way around.

This can be done by breaking up the audio of the interview into smaller files and using MS Word or another text editor to insert links to the audio segments at the appropriate places in the transcript.  There are many inexpensive programs that will split audio files, including MP3 Splitter and Joiner (about $20 from EZ Softmagic, very easy to use) and Audacity (a free, general-purpose audio editor).
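If the transcript is destined for a Web page, the links can also be generated by a short script rather than inserted by hand.  The following Python sketch assumes the audio has already been split, and that each segment of transcript is paired with the name of its audio file.  The function name linked_transcript, the "Listen" link label, and the file names are all hypothetical.

```python
from html import escape


def linked_transcript(segments):
    """Return HTML with one paragraph per segment, each linked to its audio.

    segments: list of (audio_file, text) pairs in interview order.
    """
    parts = []
    for audio_file, text in segments:
        # Escape both the file name and the text so that characters such
        # as & and < in the transcript do not break the page.
        link = f'<a href="{escape(audio_file, quote=True)}">Listen</a>'
        parts.append(f"<p>{link} {escape(text)}</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(linked_transcript([
        ("Gross-1.mp3", "Q: Where were you born?"),
        ("Gross-2.mp3", "A: In a small town in north Florida."),
    ]))
```

Each paragraph begins with a link that plays the matching audio file, so a reader skimming the text can jump straight to the recording of any passage.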

Here is an example of a page where each question and answer is a separate audio segment.   In reality, however, you would probably create longer segments; some projects have recommended three minutes.  For a real-world application of this technique, see Oral Histories of the American South.  This grant-funded project developed a tool for synchronizing audio and text that might in the future be made available to other oral history projects.
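One way to arrive at longer segments is to merge consecutive question-and-answer pairs until each group reaches a target length.  The Python sketch below illustrates the idea under stated assumptions: the function name group_segments is hypothetical, the durations are invented for illustration, and the 180-second default reflects the three-minute recommendation mentioned above.

```python
def group_segments(durations, target=180):
    """Group consecutive segments until each group lasts at least target seconds.

    durations: length in seconds of each question-and-answer pair, in order.
    Returns a list of groups, each a list of indexes into durations.
    The final group may be shorter than target.
    """
    groups = []
    current = []  # indexes collected for the group being built
    total = 0     # seconds accumulated in the current group
    for index, dur in enumerate(durations):
        current.append(index)
        total += dur
        if total >= target:
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)  # leftover segments form a final short group
    return groups


if __name__ == "__main__":
    print(group_segments([60, 90, 50, 120, 70, 30]))
```

Each group then becomes one audio file and one linked section of transcript.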

4. Additional reading