Temporal characteristics of spoken consonants as discriminants in automatic speech recognition

Green, Philip Duncan

Authors

Philip Duncan Green



Abstract

Three time-varying functions, which can be extracted directly from the raw speech waveform, are of importance in the field of automatic speech recognition. These functions are the zero-crossing rate, the turnaround (local maximum or minimum) rate and the amplitude of the speech wave envelope. The aim of the work described here was to assess the feasibility of using these three variables to distinguish between the various consonant phonemes in English speech.
The investigation was confined to consonants spoken in isolated consonant-vowel syllables, with the consonant in the initial position. All the consonant phonemes which occur in the initial position in English were spoken with each of ten vowel phonemes by four male speakers.
The three functions mentioned above were extracted from the speech wave by computer routines and displayed simultaneously using an on-line C.R.T. display. On these traces, the consonant part of the syllable could be readily distinguished by eye from that of the vowel, and the consonant was normally represented by a single peak on each trace. Further computer routines were evolved to identify these consonant peaks and extract recognition parameters describing the form of the peaks. Mistakes made by these programmes could be corrected manually from observation of the display.
An attempt was then made to identify the consonant phoneme, using the values of the recognition parameters. The recognition algorithms took the form of modified binary threshold decision trees, and the task of designing these algorithms to fit new data was mostly automated.
Separate algorithms were constructed to recognise the utterances of each of the four speakers. For the appropriate speakers, the performances of these algorithms were very similar, about 65% of the utterances being classified correctly, with a further 25% of 'possibly' or tentatively correct identifications. The algorithms were, however, greatly speaker dependent, and performance fell off sharply when the speaker was changed.
The performance of the algorithms was independent of the vowel spoken after the consonant sound. For each speaker, satisfactory means were found to identify most of the consonant phonemes except the semi-vowel and nasal sounds.
Many similarities could be seen between the four recognition algorithms, and it was concluded that the speaker dependence might be reduced by the use of a different type of recognition algorithm coupled with normalisation of the recognition parameters.
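
The three waveform functions named in the abstract (zero-crossing rate, turnaround rate and envelope amplitude) can be illustrated with a minimal sketch. This is not the thesis's own routine: the function names, the fixed non-overlapping framing and the use of peak absolute amplitude as the envelope estimate are all assumptions made for illustration.

```python
def zero_crossing_count(frame):
    """Count sign changes between consecutive non-zero samples."""
    signs = [1 if s > 0 else -1 for s in frame if s != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def turnaround_count(frame):
    """Count local maxima and minima, i.e. points where the slope changes sign."""
    diffs = [b - a for a, b in zip(frame, frame[1:])]
    slopes = [1 if d > 0 else -1 for d in diffs if d != 0]
    return sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)


def envelope_amplitude(frame):
    """Peak absolute amplitude in the frame (an assumed, crude envelope estimate)."""
    return max(abs(s) for s in frame)


def frame_features(samples, sample_rate, frame_len):
    """Return one (zero-crossing rate, turnaround rate, envelope) tuple per frame.

    Rates are in events per second. Frames are non-overlapping (an assumption),
    and any trailing samples that do not fill a whole frame are ignored.
    """
    duration = frame_len / sample_rate
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        features.append((
            zero_crossing_count(frame) / duration,
            turnaround_count(frame) / duration,
            envelope_amplitude(frame),
        ))
    return features
```

Plotting the three values against frame index would give traces loosely comparable to those described above, on which the consonant typically appears as a single peak.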
