Voiced and Unvoiced Speech Overview

In this experiment you use the concept of the energy of a sequence in order to classify speech into voiced and unvoiced frames. This is accomplished by dividing a speech signal of your choice into short frames and by computing the average power of each frame. The speech in a particular frame is then declared to be voiced if its average power exceeds a threshold level that is chosen by the user. Otherwise it is declared unvoiced.

More specifically, if a frame has N samples, then the squared values of these samples are added and the sum (which is the energy of the frame) is divided by N. The result is taken as the average power of the frame and it is compared with the threshold level that is set by the user.
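The computation described above can be written compactly. The symbols x[n] (the n-th sample of the frame), E (frame energy), P (average power), and \gamma (the user's threshold) are notation introduced here for illustration and do not appear in the original description.

```latex
E = \sum_{n=0}^{N-1} x[n]^2 ,
\qquad
P = \frac{E}{N} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]^2 ,
\qquad
\text{frame is }
\begin{cases}
\text{voiced}   & \text{if } P > \gamma , \\
\text{unvoiced} & \text{otherwise.}
\end{cases}
```

For instance, a frame with N = 4 samples (0.5, -0.5, 0.5, -0.5) has energy E = 1 and average power P = 0.25.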

This experiment provides a simple example of an application of processing of sequences. The processing we perform here is rather trivial since it involves only segmentation and averaging of a discrete-time signal or sequence. Still, the operation is useful and allows us to classify speech into voiced or unvoiced.

Voiced and unvoiced speech are defined as follows. Speech is composed of phonemes, which are produced by the vocal cords and the vocal tract (which includes the mouth and the lips). Voiced signals are produced when the vocal cords vibrate during the pronunciation of a phoneme. Unvoiced signals, by contrast, do not entail the use of the vocal cords. For example, the only difference between the phonemes /s/ and /z/, or between /f/ and /v/, is the vibration of the vocal cords.

Voiced signals, such as the vowels /a/, /e/, /i/, /u/, /o/, tend to be louder. Unvoiced signals, such as the stop consonants /p/, /t/, /k/, tend to be more abrupt.

In this experiment, we divide the waveform of a speech signal into frames of duration 20 ms each and compute the average power of each frame. This average power is an indication of the loudness of the frame. We thus expect higher average powers for voiced signals than for unvoiced signals. The user picks a threshold level between 0 and 1. A suitable value is closer to 0, and the user should be able to find it by experimentation.
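The framing and thresholding steps can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's actual implementation: the function names, the 8000 Hz sampling rate, the use of non-overlapping frames, the dropping of a trailing partial frame, and the synthetic two-tone signal standing in for speech are all assumptions made here.

```python
import math


def frame_average_powers(samples, sample_rate, frame_ms=20.0):
    """Split samples into consecutive frames and return each frame's average power."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))  # N samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    powers = []
    # Assumption: frames do not overlap, and trailing samples that do not
    # fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)   # energy of the frame
        powers.append(energy / frame_len)    # average power = energy / N
    return powers


def classify_frames(powers, threshold):
    """Return True (voiced) where the average power exceeds the threshold."""
    return [p > threshold for p in powers]


# Synthetic stand-in for speech (assumption): a loud tone followed by a quiet one.
sample_rate = 8000  # Hz, hypothetical
loud = [0.8 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
quiet = [0.05 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]

powers = frame_average_powers(loud + quiet, sample_rate)
labels = classify_frames(powers, threshold=0.01)
```

With samples confined to the range -1 to 1, squaring makes the average powers small, which is consistent with a workable threshold lying close to 0.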

Three plots are shown. The top plot is the original speech signal. The middle plot shows only those frames whose average power exceeds the threshold level. The bottom plot shows the remaining frames, whose average power does not exceed the threshold level. A proportion value is shown next to each of the middle and bottom plots. This value indicates the percentage of the original signal's frames that appear in that plot.
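One way the data behind the middle and bottom plots, together with their proportion values, might be derived is sketched below. The function name is hypothetical, and filling the frames that are not shown with zeros is an assumption, since the description does not say how omitted frames are drawn.

```python
def split_by_threshold(samples, frame_len, threshold):
    """Separate a signal into above-threshold and remaining frames.

    Returns (voiced, unvoiced, voiced_pct, unvoiced_pct). The two returned
    signals keep the original time alignment; frames that belong to the
    other class are filled with zeros (an assumption made for illustration).
    """
    n_frames = len(samples) // frame_len  # trailing partial frame is ignored
    voiced = [0.0] * (n_frames * frame_len)
    unvoiced = [0.0] * (n_frames * frame_len)
    voiced_count = 0

    for i in range(n_frames):
        start = i * frame_len
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len  # average power
        if power > threshold:
            voiced[start:start + frame_len] = frame
            voiced_count += 1
        else:
            unvoiced[start:start + frame_len] = frame

    if n_frames == 0:
        return voiced, unvoiced, 0.0, 0.0
    voiced_pct = 100.0 * voiced_count / n_frames
    return voiced, unvoiced, voiced_pct, 100.0 - voiced_pct


# Tiny example: one loud frame followed by three quiet frames of 4 samples each.
signal = [0.5] * 4 + [0.01] * 12
voiced, unvoiced, voiced_pct, unvoiced_pct = split_by_threshold(signal, 4, 0.1)
```

The two percentages always sum to 100 because every complete frame lands in exactly one of the two plots.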