last updated: 10/22/2010

Sound: A Simple Introduction to Its Production

Acoustic and Perceptual Worlds

     It may be commonplace to point out that acoustic reality and perceptual reality are different. In a live performance situation, for example, no matter how still the audience, the environment will be full of sounds extraneous to the music. If a tape recorder were positioned somewhere in the midst of such a situation, and a segment of the resulting tape were submitted to digital sound analysis, the results would highlight the difference between what one heard during the performance and what the tape actually contains. Sound analysis reveals the behavior of sound in the physical world. In this case, analysis would show that the acoustic elements making up the sounds of all the sources in the environment -- the various instruments of the performance, perhaps the stirring of the audience, or the sound of vehicles passing beyond the confines of the performance context -- do not remain conveniently grouped by source. Rather, the components of all these sounds mix together, combining into a single, very complex waveform, which is what the tape records and analysis reveals. This is because sound waves are additive: like waves in water, they sum into one composite disturbance, growing in complexity rather than in number.
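
This additivity is simple enough to sketch numerically. In the sketch below, the two "sources" and their frequencies are arbitrary illustrative choices, not measurements of any real performance:

```python
import math

def sample_wave(components, t):
    """Pressure contributed at time t (seconds) by a set of sinusoidal
    components, given as (frequency_hz, amplitude) pairs."""
    return sum(a * math.sin(2 * math.pi * f * t) for f, a in components)

# Two hypothetical sources -- say, an instrument and a passing vehicle.
instrument = [(440.0, 1.0)]
vehicle = [(95.0, 0.6)]
rate = 8000  # samples per second

# The microphone records one composite waveform, not two separate ones:
composite = [sample_wave(instrument + vehicle, n / rate) for n in range(800)]

# Additivity: the composite is the pointwise sum of each source alone.
alone = [sample_wave(instrument, n / rate) + sample_wave(vehicle, n / rate)
         for n in range(800)]
assert all(abs(c - s) < 1e-9 for c, s in zip(composite, alone))
```

Nothing in the composite list marks which sample values came from which source; recovering that grouping is precisely the perceptual problem discussed later under auditory scene analysis.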

In the simplest possible terms, what digital analysis uncovers are the acoustic features of the sounds captured by the tape recorder; what one actually hears are the perceptual features of the same sounds. The acoustic and perceptual characteristics of sound are not the same, nor in many cases is there a one-to-one correspondence between them.

Parameters of Sound

     In a very general sense, sounds in a normal environment consist of acoustic elements, each characterized by a specific frequency, amplitude, and duration. In the perceptual world, these parameters correspond to the sensations of pitch, loudness, and existence over time, respectively. Again in a general sense, sounds in the real world can be categorized as periodic or aperiodic. Periodic sounds are those emitted by a source that produces regular vibrations over time, resulting in a collection of frequencies called harmonics, partials, or overtones. Harmonic frequencies originating from the same source are related in that they occur in multiples of the lowest frequency, referred to as the fundamental frequency. Thus, in a collection of harmonically related frequencies whose fundamental is 200 Hz, the higher harmonics would occur at 400 Hz, 600 Hz, 800 Hz, 1000 Hz, 1200 Hz, etc.
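
The integer-multiple relation just described can be stated directly as a minimal sketch:

```python
def harmonic_series(fundamental_hz, count):
    """Frequencies of the first `count` harmonics of a periodic tone:
    integer multiples of the fundamental, which is itself harmonic 1."""
    return [fundamental_hz * n for n in range(1, count + 1)]

# The example from the text: a fundamental of 200 Hz.
assert harmonic_series(200.0, 6) == [200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0]
```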

Noise and Tone

     Perceptually, a periodic sound is one that usually produces a distinct sensation of tonal pitch; the sustained portion of many musical instrument tones and much of speech consists of periodic sound. Aperiodic sounds are those which are most often perceived as noise. Acoustically, noise is defined as a random collection of frequencies from a single source which are not harmonically related and whose waveform is therefore irregular. Different kinds and sensations of noise can be distinguished by the bandwidth (or frequency range) in which the random frequencies of the noisy signal occur. Sporadic noise is a component of most natural musical instruments, especially in the attack phase; it has been shown, in fact, that the noisy onset of an instrumental tone is very important, in some cases necessary, for listener recognition of familiar instruments. In speech, noise is a primary component of many consonants.
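
The acoustic definition above -- harmonically unrelated frequencies confined to a bandwidth -- can be sketched as follows. The band limits and the number of components are arbitrary illustrative choices:

```python
import math
import random

def band_noise(low_hz, high_hz, partials=50, seed=1):
    """Noise modeled as a sum of sinusoids at random, harmonically
    unrelated frequencies confined to the band [low_hz, high_hz]."""
    rng = random.Random(seed)
    components = [(rng.uniform(low_hz, high_hz), rng.uniform(0.0, 2 * math.pi))
                  for _ in range(partials)]

    def sample(t):
        """Irregular (aperiodic) waveform value at time t (seconds)."""
        return sum(math.sin(2 * math.pi * f * t + ph)
                   for f, ph in components) / partials

    return components, sample

components, noise = band_noise(500.0, 1500.0)
assert all(500.0 <= f <= 1500.0 for f, _ in components)  # confined to the band
```

Narrowing or widening the `[low_hz, high_hz]` interval is what distinguishes, say, a hiss from a rumble in this toy model.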

Sustained and Impulsive Sounds

     In addition to noise (aperiodic) and tone (periodic), another basic division of sounds is based on the difference between sustained and impulsive stimulus of vibrating material. An impulsive, or percussive, sound is one for which the vibrating part of an instrument is excited discontinuously or in pulses, so that with each excitation, a tone is produced and immediately begins to decay until the next excitation starts the process again. Common examples of impulsive sounds are those produced from plucked or struck instruments, such as the guitar, most percussion instruments, the piano, etc.

     A sustained sound is one for which the vibrating part of an instrument is excited continuously, so that the sound continues in a more or less steady state as long as the excitation continues. Instruments producing sustained tones are those which are bowed or blown, such as bowed chordophones and most aerophones.

Pitch, Timbre, and Vowel Quality

     For periodic sounds such as those emitted by musical instruments or the voice, the fundamental frequency - the lowest harmonic - usually corresponds to the sensation of pitch. When the players of an orchestra tune their instruments to a standard A-440, each is producing a soundwave with a fundamental frequency of approximately 440 Hz. Since the fundamental is very often the loudest as well as the lowest frequency, it was long thought that this frequency itself "sounded" the pitch above the frequencies of the other harmonics. It is now understood that the fundamental does not sound the pitch; rather, pitch is determined by the repetition period of the entire waveform, which corresponds to the frequency distance between any two consecutive harmonics. It has been shown, in fact, that the pitch sensation of a tone remains the same even if the fundamental, or the fundamental and several of the lower harmonics, are removed from the tone.
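
The missing-fundamental effect can be illustrated with a small sketch, under the assumption that the analysed partials include at least one adjacent pair of harmonics:

```python
def implied_pitch(partials_hz):
    """Infer the pitch of a harmonic tone from the smallest spacing
    between adjacent partials, which equals the fundamental frequency
    whenever at least two consecutive harmonics are present."""
    ordered = sorted(partials_hz)
    return min(b - a for a, b in zip(ordered, ordered[1:]))

# The pitch implied by the spacing survives removal of the fundamental
# (200 Hz) and even of several of the lower harmonics:
assert implied_pitch([200.0, 400.0, 600.0, 800.0]) == 200.0
assert implied_pitch([600.0, 800.0, 1000.0, 1200.0]) == 200.0
```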

     If a tuning orchestra consists of a multitude of instruments all theoretically producing notes of the same frequency with the same harmonics (440 Hz, 880 Hz, 1320 Hz, 1760 Hz, etc.), what distinguishes the sounds of the instruments from each other? Or if two speakers produce different vowels on the same pitch, and therefore with the same harmonics, what distinguishes those sounds? In most musical instruments, the action of the musician on the instrument produces a source wave consisting of the vibrations of a string, a column of air, or a membrane, depending on the classification of the instrument. This source wave, like other periodic waves, has a full complement of harmonics, and when its vibrations are conveyed to the instrument's resonator, they undergo what is called a transfer function, illustrated schematically in the Glossary under the definition for "transfer function".

     Central to the transfer function is the fact that all resonators have certain characteristic resonances, or frequencies at which they vibrate most efficiently. When the source wave hits the resonator, therefore, the resonator's characteristic resonances act to filter the harmonics of the source wave. Source harmonics that are close in frequency to one of the more efficient resonant frequencies of the resonator will vibrate with greater energy, thus increasing in amplitude. The transfer function changes none of the frequencies of the source harmonics; rather it selectively amplifies the harmonics close to its characteristic resonances. In the steady-state portion of a tone, the sensation of timbre depends on the relative amplitudes of the harmonics over time. Thus, if two instruments play a tone on the same pitch, they produce all the same harmonics; the difference between the two instruments lies in the pattern of high- and low-amplitude harmonics.
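
A toy version of this can be written directly from the description: frequencies pass through unchanged while amplitudes are scaled by the resonator's gain near its characteristic resonances. The resonance value and the bell-shaped gain model below are illustrative assumptions, not a physical model of any instrument:

```python
import math

def apply_transfer(source, resonances_hz, bandwidth_hz=50.0):
    """Scale each (frequency, amplitude) source harmonic by a gain that
    peaks at the resonator's characteristic resonances; the frequencies
    themselves are left untouched."""
    def gain(f):
        # Illustrative bell-shaped boost around each resonance.
        return 1.0 + sum(math.exp(-((f - r) / bandwidth_hz) ** 2)
                         for r in resonances_hz)
    return [(f, a * gain(f)) for f, a in source]

# A source wave with six equal-amplitude harmonics of 200 Hz...
source = [(200.0 * n, 1.0) for n in range(1, 7)]
# ...passed through a resonator that happens to resonate near 610 Hz.
shaped = apply_transfer(source, [610.0])

assert [f for f, _ in shaped] == [f for f, _ in source]  # frequencies unchanged
amps = dict(shaped)
assert amps[600.0] > amps[200.0]  # the harmonic near the resonance is boosted
```

Two such sketches with the same `source` but different `resonances_hz` give identical harmonic frequencies and different amplitude patterns, which is precisely the account of timbre given above.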


     In speech, the functions of the lungs and vocal cords correspond to the action of a musician on an instrument. The pressure of air emitted from the lungs sets the cords into motion, emitting "pulses" of air. These pulses set up a periodic source wave consisting of a fundamental vibration and its harmonics. When the vocal cords are tightened by the speaker, the frequency of the pulses emitted by the cords increases, and the fundamental and all its harmonics rise in frequency. Above the vocal cords, the vocal tract can be thought of as a constantly changing resonator, as the speaker alters the place and manner of articulation with each phoneme. The source wave passes through the vocal tract, undergoing the same transfer function that occurs with the resonator of a musical instrument.

     Like the instrumental resonator, the vocal tract acts as a selective filter, reinforcing the energy of some of the source wave's frequencies, depending on the resonant frequencies of the vocal tract in any particular conformation. Because of the relative softness of the vocal tract as resonator, its resonant frequencies are much broader in bandwidth (cover a wider range of frequencies) than those of instrumental resonators constructed of wood or metal. Therefore, rather than emphasizing one or several individual harmonics as instrumental resonators do, the vocal tract emphasizes entire bands of harmonics, called formants. The result is that each vowel sound has characteristic formants, consisting of bands of higher-intensity harmonics.

Vocal v. Instrumental Timbre

     In both music and vowel sounds, then, the distinguishing quality between two sounds of the same pitch is the timbre or tone quality. And in both music and vowel sounds, timbre is dependent on the relative amplitudes of harmonics. As a musical instrument, however, the voice differs from other instruments in several ways, of which a few have already been mentioned. The first of these is that the voice is an immensely versatile instrument because of its variable resonator, whose resonant frequencies change with the articulation of each vowel.

     In fact, a strict analogy between the singing voice and other musical instruments might insist that the voice is comparable to a collection of instruments, whose pitch ranges are the same, but whose timbres are different. However, even if one considers each vowel to constitute a separate timbre (defining a separate instrument), there are still several characteristics of the voice as instrument that act not so much to set it apart from all other musical instruments as to position it at one end of a continuum of timbre characteristics. The defining characteristics of the timbre continuum - those which the voice possesses to a greater degree than other instruments - revolve around the difference between a harmonic and a formant.

This distinction has been the source of a longstanding argument in auditory research as to whether perception of instrumental timbre depends on the existence of formant frequencies or harmonic structure. Whereas harmonics are individual frequencies that are perceptually fused together into a unitary sensation of tone quality, formants are broader-band regions of intense energy. Understood in this way, the argument centers on two issues: 1) whether the salient perceptual features of a tone consist of individual high-intensity harmonics or high-intensity bands of harmonics; 2) whether the tone quality of an instrument is uniform over its entire pitch range, indicating that formant structure remains constant, regardless of the harmonics that fill it.

     The first of these issues concerns the bandwidth of harmonics affected by the enhancing properties of the resonator; as mentioned above, the softer the material from which a resonator is constructed, the broader its resonant frequencies, and the broader the band of harmonics that will be amplified. Vocal timbre is characterized by wide formants due to the softness of the vocal tract, while the timbre of a flute, for example, whose resonator is comparatively hard, depends on irregularly spaced, single harmonics emphasized over the others. The second issue, the constancy of an instrument's timbral quality over its pitch range, depends on the relationship, or coupling, between the source vibration and resonator of the instrument. As a vibrating system, the voice displays the loosest coupling between source and resonator of all instruments, and is thus capable of the most independent variation of pitch and timbre; tone languages, in particular, depend greatly on this division of labor. A given vowel will maintain its timbral quality no matter what pitch it is pronounced or sung on; a more tightly coupled system, as is typical of many reed instruments, for instance, is characterized by significant timbre variation from its lowest to highest pitch. In addition to the voice, other formanted, timbre-constant instruments include stringed instruments, especially those constructed of softer wood; these might follow the voice on the timbre continuum. At the other end of the continuum might be instruments that depend on overblown harmonics for certain pitches, like the flute and many reed instruments, especially those made from harder wood or metal.

     Whether or not one can speak of a true continuum of timbre types, the general opinion at present is that some instruments exhibit formant structure and some instruments depend on the salience of individual harmonics. Though these characteristics are purely acoustic in nature, and ought in theory to be directly measurable, they are musically important as well. Those instruments that sound more "voicelike" are inevitably positioned on the voice end of the continuum. Traditional musicians appear fascinated with the quality of formanted instruments: many "talking" instruments fall into the formanted category, and most instruments that use overtones to produce a second pitch above the drone of the fundamental rely on the presence of formants from which a single harmonic can be extracted.

Auditory Scene Analysis

     As noted above, digital analysis of a recorded live performance would show a conglomerate waveform consisting of all frequencies, at differing amplitudes, from all sources contained in the environment. This signal is identical to the one entering the ears of any listener sitting in the audience. Unlike most systems of computer analysis of sound, which are incapable of separating the complex waveform into its individual source components, a human listener is able to parse or resolve the signal, thus determining which sources have emitted sounds contributing to that conglomerate. To do this, the auditory processing mechanisms of the listener must first isolate from the signal individual acoustic elements, in most cases single frequencies, which are then recombined according to source. This resolution will result in various source components, consisting partly of periodic waves, each with a fundamental frequency and appropriate harmonics, and partly of aperiodic waves, arriving from noise-producing sources, perhaps extraneous to the performance.

     The separation of a conglomerate or complex waveform into its individual elements and the recombination of these elements into source-dependent units and then into sound sensations is a process described by S. McAdams as the formation of "auditory images" (1982), and by A. Bregman as "auditory scene analysis" (1990). Both of these phrases convey something of the interpretive quality of the organization process as McAdams and Bregman conceive it. Part of this interpretive quality results from the fact that the listener must engage in a "conversion process" that translates acoustic features into perceptual features. The regrouping of individual acoustic elements into "sound units" by the central auditory system is thought to entail "hypotheses" as to the nature of the source emitting the signal to be processed; the importance of a source hypothesis, formed on the evidence of incoming signals, is that it can then direct a continuing search for acoustic evidence to confirm the hypothesis. According to this theory of auditory processing, if there are competing hypotheses, then the grouping with the greatest amount of evidence for a possible source will be chosen. It is the competition between possible percepts that is responsible for the interpretive quality of the process, and it is the final translation that allows the listener to identify sources.

     Because the possible combination of sounds that may occur at any one time is virtually unlimited, the auditory processing system has developed what appear to be principles of organization to assist in hypothesis formation. Research has uncovered an immense amount of information on the "cognitive logic" that informs a source hypothesis. A generally accepted tenet about the nature of this logic is that it appears to depend on the characteristics of real sources in the acoustic world. For example, real world sources often emit frequencies that occur in harmonic relation, as described above. Similarly, when real world sources emit consecutive tones over time, these tones are often close together in frequency. If, therefore, the auditory processing mechanism determines that among the acoustic elements of an incoming signal there are frequencies that are harmonically related, or that there are tones close in frequency that recur over time, then these two facts constitute evidence for two groupings of elements from the signal.
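
The first of these principles -- collecting harmonically related frequencies -- can be caricatured in code. This is a crude sketch of the grouping cue, not a model of auditory processing; the tolerance value and the greedy first-match strategy are illustrative assumptions:

```python
def group_by_harmonicity(freqs_hz, tolerance_hz=1.0):
    """Greedily assign each frequency to the first group whose candidate
    fundamental it is (within tolerance) an integer multiple of;
    anything left over founds a new group."""
    groups = []  # list of (candidate_fundamental, member_frequencies)
    for f in sorted(freqs_hz):
        for fundamental, members in groups:
            ratio = f / fundamental
            if abs(ratio - round(ratio)) * fundamental <= tolerance_hz:
                members.append(f)
                break
        else:
            groups.append((f, [f]))
    return groups

# Two interleaved harmonic series are pulled apart into two "sources":
mixture = [200.0, 310.0, 400.0, 600.0, 620.0, 930.0]
grouped = group_by_harmonicity(mixture)
assert len(grouped) == 2
assert grouped[0][1] == [200.0, 400.0, 600.0]
assert grouped[1][1] == [310.0, 620.0, 930.0]
```

The real auditory system weighs this cue against many others (common onset, frequency proximity over time, spatial location), which is exactly where the competition between hypotheses described above arises.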

      If auditory perception in most situations is guided by an inherent understanding of source characteristics in the acoustic world, in a music performance situation auditory perception is notable for its frequent disregard of sources. Efforts to blend the instruments of an ensemble, for example, act to disrupt the normal source-orientation of auditory grouping. Many genres of traditional music function by confounding sources, encouraging the inclusion into a single perceived sound unit of elements emitted by more than one source, or by provoking listeners to hear the same combination of sound in two different ways. Techniques such as these result in what Bregman has called an auditory chimera (1990), or in more subtle "anomalies" like the augmented timbres described by G. Sandell (personal communication). Because musical sound may disrupt the normal source-orientation of auditory processing, and especially because the cognitive reasoning that replaces source-orientation is often culturally conditioned, field researchers ought ideally to investigate the actual perceptions experienced by a listener. It is not enough for a researcher, foreign to the music he/she is studying, to register his/her own perceptions of the music under the assumption that there is only one way to hear the sound components of the music; nor is it enough to consider the acoustic information supplied by digital analysis without having established the perception - both indigenous and nonindigenous - that corresponds to the acoustic elements uncovered by analysis.

Visual Representation of Sound

     The papers presented here use two primary forms of representation to deduce acoustic information, and to demonstrate that information for readers. The two forms differ in the point of view each takes to sound and the amount of detail each supplies. Each can show two dimensions of sound simultaneously: the spectrogram looks at time (horizontal axis) by frequency (vertical axis), and the spectrum looks at frequency (horizontal axis) by amplitude (vertical axis), as shown in the schematic versions included in the Glossary listings for "spectrogram" and "spectrum". Thus a spectrogram shows, for a variable duration of time, the harmonic composition of a sound; it cannot indicate amplitude in any precise fashion, though degree of contrast in color is intended to indicate relative degrees of intensity.

     Spectrograms can be narrowband or wideband. The mechanical differences between these are not important for our purposes here, but the visual difference is that the narrowband spectrogram shows more vertical or frequential detail, including individual harmonics, while the wideband spectrogram shows more horizontal or temporal detail, so that while durations are easily measured, individual harmonics are visually merged into formant regions. Narrowband spectrograms thus demonstrate more efficiently the movement of harmonics, while wideband spectrograms demonstrate more efficiently the movement and duration of formants. Spectra, on the other hand, show the relative amplitudes of formant regions or individual harmonics at a single instant in time. If a spectrum is accurate, it is particularly useful in determining the harmonicity of partials, as well as their high-intensity regions; a spectrum, then, gives a more precise representation of a sound's timbre at any particular moment.
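
The spectrum view is easy to compute for a synthetic tone. The sketch below uses a plain (and deliberately slow) direct DFT, with the test signal's frequencies chosen to fall exactly on analysis bins so the amplitudes read off cleanly; real recorded sounds rarely cooperate this way:

```python
import cmath
import math

def magnitude_spectrum(samples, rate):
    """Amplitude (vertical axis) against frequency (horizontal axis),
    via a direct DFT -- the single-instant view a spectrum plot gives."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append((k * rate / n, 2.0 * abs(coeff) / n))
    return spectrum

# 0.2 s of a tone with harmonics at 200 Hz (amp 1.0) and 400 Hz (amp 0.5).
rate, n = 2000, 400
tone = [math.sin(2 * math.pi * 200 * t / rate)
        + 0.5 * math.sin(2 * math.pi * 400 * t / rate) for t in range(n)]
amps = dict(magnitude_spectrum(tone, rate))

assert abs(amps[200.0] - 1.0) < 0.01   # strongest partial
assert abs(amps[400.0] - 0.5) < 0.01   # weaker second harmonic
```

A spectrogram is essentially this computation repeated on short successive windows of the signal, with the resulting amplitudes rendered as color contrast rather than as a second axis.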

It is important to remember, however, that often the information provided by either form of representation is an approximation at best, limited by the resolution capabilities of both the digitizer and the analyzer, as well as by the fineness of detail possible in the graphic display of the software. It is also important to be cautious in considering which details of the visual representation of a sound sample are salient to the sound as perceived; often the picture of a sound will include clearly visible elements which are acoustically present in the sound but too short in duration, or too low in intensity, to register perceptually. A useful maxim in this regard is the following: if a discrete element can be filtered from a sound with no difference to the resulting tonal sensation, then the element is unimportant to the final percept and need not be considered in interpreting the data, no matter how prominently it appears in the analysis.
