Graduate Program

Technology

Degree Name

Master of Science (MS)

Semester of Degree Completion

2008

Thesis Director

Peter Ping Liu

Thesis Committee Member

Rigoberto Chinchilla

Thesis Committee Member

Sam Guccione

Abstract

Captioning provides accessibility to media resources for deaf and hearing-impaired persons and is mandated by federal and state laws such as the Americans with Disabilities Act. However, the process of creating synchronized captions for media is time-consuming and labor intensive. Consequently, many content providers still do not incorporate captions into media presentations on the web.

In this research, algorithms were developed to automate part of the captioning process by estimating the timing of captions for web-based audio and video files using plain-text transcripts and their corresponding audio recordings. Recordings used in this study were of professional speakers/readers from American English radio and television broadcasts and of non-professional speakers reading text from a novel. The text transcripts were divided into sentences. The duration of each sentence was initially estimated from the number of characters in each sentence as a proportion of the total recording time. The locations and durations of pauses in the audio recordings were compiled by scanning for regions of low amplitude. It was found that the RMS amplitude of each audio file performs adequately as a threshold between silence and speech for captioning.
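The two estimation steps described above can be sketched in Python. This is an illustrative reconstruction, not the thesis's actual implementation: the function names, window size, and minimum-pause length are assumptions, but the core ideas match the abstract, i.e. apportioning the recording time to sentences by character count, and using the file's overall RMS amplitude as the silence/speech threshold.

```python
import numpy as np

def estimate_sentence_durations(sentences, total_seconds):
    """Apportion total recording time to sentences by character count."""
    lengths = [len(s) for s in sentences]
    total_chars = sum(lengths)
    return [total_seconds * n / total_chars for n in lengths]

def find_pauses(samples, rate, window=0.05, min_pause=0.1):
    """Locate low-amplitude regions, using the whole file's RMS amplitude
    as the threshold between silence and speech (per the abstract).
    Returns a list of (start_time, end_time) tuples in seconds.
    """
    samples = samples.astype(np.float64)
    threshold = np.sqrt(np.mean(samples ** 2))  # file-level RMS amplitude
    hop = int(rate * window)                    # frame size in samples
    pauses, start = [], None
    for i in range(0, len(samples) - hop, hop):
        frame_rms = np.sqrt(np.mean(samples[i:i + hop] ** 2))
        if frame_rms < threshold:
            if start is None:                   # entering a quiet region
                start = i
        elif start is not None:                 # leaving a quiet region
            if (i - start) / rate >= min_pause: # keep only real pauses
                pauses.append((start / rate, i / rate))
            start = None
    return pauses
```

For example, a two-sentence transcript of 2 and 6 characters over an 8-second recording would be assigned 2 s and 6 s respectively, before the pause-matching stage refines those boundaries.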

Statistically, pause durations at the ends of sentences are significantly greater than those within sentences. This observation was used to match the ends of sentences in the text to pauses in the audio track. In order to successfully distinguish between within-sentence pauses and end-of-sentence pauses on the basis of duration, it is advantageous to utilize data over localized portions of a file rather than over the entire audio file. When the results of the automated matching were ambiguous, a manual feedback mechanism was utilized to further improve the accuracy of the algorithms. For the media files tested in this study, the algorithm accurately estimated the timing of 96% of the captions within 0.5 seconds.
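A minimal sketch of the localized pause-classification idea, under stated assumptions: the thesis does not specify the statistic or window used, so the sliding-window median and the threshold factor below are hypothetical choices that only illustrate comparing each pause against its neighbors rather than against file-wide statistics.

```python
def classify_sentence_ends(pause_durations, window=10, factor=2.0):
    """Flag pauses that are long relative to nearby pauses as likely
    sentence boundaries. A local window (rather than global statistics)
    follows the abstract's observation that localized comparisons better
    separate within-sentence from end-of-sentence pauses.
    """
    flags = []
    for i, d in enumerate(pause_durations):
        lo = max(0, i - window // 2)
        hi = min(len(pause_durations), i + window // 2 + 1)
        neighborhood = sorted(pause_durations[lo:hi])
        local_median = neighborhood[len(neighborhood) // 2]
        flags.append(d > factor * local_median)  # long vs. local context
    return flags
```

Given pause durations like `[0.1, 0.12, 0.5, 0.1, 0.11, 0.6, 0.09]`, the two long pauses stand out against their neighbors and would be matched to sentence ends in the transcript.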

harvey2008.zip (36065 kB)
Supplemental materials
