How Does Audio Work?


Have you ever wondered what happens inside the hardware and software whenever we listen to a song or record audio? If so, let’s look at the mechanism.

The basic idea is simple. Sound is a pressure wave: it alternately compresses and releases the air it travels through, making anything it touches vibrate. When we listen to audio, the wave makes our eardrum vibrate, and that vibration is converted into electrical signals that the brain interprets as sound.

A microphone works in a similar way; it is like an artificial eardrum. The vibration of the mic is turned into an analog electrical signal, which a piece of hardware called an analog-to-digital converter (ADC) transforms into a digital signal, a stream of 0s and 1s that the computer can understand.

This article covers the following topics:

  1. How does digital audio work?
  2. The parameters that audio quality depends on.

Before we dive into the sea of audio, let’s go over some simple terminology first:

  1. Analog signals: These signals are continuous in nature; for example, plot any sinusoidal wave and you will get the idea.
  2. Digital signals: These signals are discrete, taking only a fixed set of values; for example, plot any square wave and you will get the idea.
  3. Humans can only perceive sound at frequencies from about 20 Hz to approximately 20 kHz.

The mic generates an analog signal whenever sound of sufficient intensity hits it. By sufficient intensity I mean that different mics have different sensitivities to sound. A condenser mic, for example, contains a capacitor that produces an analog electrical signal when a sound wave hits it. This signal passes through a low-pass filter, which removes any component with a frequency greater than about 22 kHz. After that, the signal is sampled at a rate of 44.1 kHz, 48 kHz, or 96 kHz, depending on the use case.

Note: The sampling rate is the rate at which the ADC hardware captures audio samples, measured in samples per second (Hz).

The sampling rate can also be understood by drawing vertical lines on the graph of the analog audio signal. These vertical lines are spaced according to the sampling frequency, and they intersect the graph at various points. Reading off the signal at those points is called sampling. Each point is then given the nearest value from the set of audio levels that we have chosen. This second step is called quantisation.
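
The "vertical lines" picture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_sine`, the 440 Hz test tone, and the 10 ms duration are made up for the example, and a real ADC reads a voltage rather than evaluating a formula.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Return the values of a sine wave read at evenly spaced instants.

    Each list entry corresponds to one "vertical line" drawn on the graph,
    spaced 1 / sample_rate_hz seconds apart. (Hypothetical helper.)
    """
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

# Assumed example values: a 440 Hz tone sampled at 44.1 kHz for 10 ms.
samples = sample_sine(freq_hz=440.0, sample_rate_hz=44_100, duration_s=0.01)
print(len(samples))   # number of samples taken in 10 ms
print(samples[:3])    # first few sampled values, still unquantised floats
```

At this stage the values are still continuous numbers; assigning them to a fixed set of levels is the quantisation step.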

Note: These audio levels are defined by the bit depth, which is generally chosen from 8, 16, 24, or 32 bits. The greater the bit depth, the finer the steps between levels and the better the quality. Bit depth can be understood by drawing horizontal lines on the graph of the analog audio signal. A 16-bit depth has a range of -32,768 (-1 × 2^15) through 32,767 (2^15 - 1); that is, the points we got by drawing vertical lines can take integer values in this range. In other words, this is the pattern in which audio intensity is stored. In short, bit depth determines the dynamic range of the signal: 24-bit digital audio has a maximum dynamic range of about 144 dB, compared to about 96 dB for 16-bit. Dynamic range means the difference between the softest and the loudest sound that can be represented.
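
As a rough sketch of the note above, the following Python maps a sample to one of the "horizontal lines" and computes the theoretical dynamic range. The helper names `quantize` and `dynamic_range_db` are hypothetical, the input is assumed to be normalised to the range -1.0 to 1.0, and the dynamic range uses the common idealised formula 20 × log10(2^bits), which is where the 96 dB and 144 dB figures come from.

```python
import math

def quantize(sample, bit_depth=16):
    """Map a sample in [-1.0, 1.0] to the nearest signed integer level.

    Hypothetical helper: scales by the largest positive level and clips
    anything outside the representable range.
    """
    max_level = 2 ** (bit_depth - 1) - 1    # 32767 for 16-bit
    min_level = -(2 ** (bit_depth - 1))     # -32768 for 16-bit
    level = int(round(sample * max_level))
    return max(min_level, min(max_level, level))

def dynamic_range_db(bit_depth):
    """Idealised dynamic range in dB: 20 * log10(number of levels)."""
    return 20 * math.log10(2 ** bit_depth)

print(quantize(0.5))                      # a mid-scale positive sample
print(quantize(-2.0))                     # out-of-range input is clipped
print(round(dynamic_range_db(16), 1))     # roughly 96 dB
print(round(dynamic_range_db(24), 1))     # roughly 144 dB
```

Each extra bit doubles the number of levels, which adds about 6 dB of dynamic range.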

Note: The sampling frequency has to be at least double the highest audio frequency that we are interested in capturing. Let’s see why.

We’ve noted that it’s necessary to take at least twice as many samples per second as the highest frequency we wish to record. This was shown by Harry Nyquist, and is known as the Nyquist theorem. Stated another way, the computer can only accurately represent frequencies up to half the sampling rate. One half the sampling rate is referred to as the Nyquist frequency (it is also loosely called the Nyquist rate, although strictly that term means twice the highest frequency in the signal).
If we take, for example, 16,000 samples of an audio signal per second, we can only capture frequencies up to 8,000 Hz. Any frequencies higher than the Nyquist frequency are perceptually ‘folded’ back down into the range below it. So, if the sound we were trying to sample contained energy at 9,000 Hz, which is 1,000 Hz above the Nyquist frequency, the sampling process would misrepresent that frequency as 7,000 Hz, which is 1,000 Hz below it and might not have been present at all in the original sound. This effect is known as foldover or aliasing. The main problem with aliasing is that it can add frequencies to the digitized sound that were not present in the original sound, and unless we know the exact spectrum of the original sound there is no way to know which frequencies truly belong in the digitized sound and which are the result of aliasing. That’s why it’s essential to use the low-pass filter before the sample-and-hold process, to remove any frequencies above the Nyquist frequency.
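
The folding arithmetic described above can be written as a small function. This is a minimal sketch for a single pure tone; the name `alias_frequency` is made up for the example, and it assumes no low-pass filter has removed the tone before sampling.

```python
def alias_frequency(freq_hz, sample_rate_hz):
    """Return the frequency a sampled pure tone appears to have.

    Sampling cannot tell a frequency f apart from f plus any multiple of
    the sampling rate, so first reduce f modulo the sampling rate, then
    fold anything above the Nyquist frequency back below it.
    """
    folded = freq_hz % sample_rate_hz
    nyquist = sample_rate_hz / 2
    if folded > nyquist:
        return sample_rate_hz - folded
    return folded

print(alias_frequency(9_000, 16_000))   # 7000: folded back below 8,000 Hz
print(alias_frequency(5_000, 16_000))   # 5000: below Nyquist, unchanged
```
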
To understand why this aliasing phenomenon occurs, let’s take the example of a film camera, which shoots 24 frames per second. If we’re shooting a movie of a car, and the car wheel spins at a rate greater than 12 revolutions per second, it’s exceeding half the ‘sampling rate’ of the camera. The wheel completes more than 1/2 revolution per frame. If, for example, it actually completes 18/24 of a revolution per frame, it will appear to be going backward at a rate of 6 revolutions per second. In other words, if we don’t witness what happens between samples, a 270° revolution of the wheel is indistinguishable from a -90° revolution. The samples we obtain in the two cases are precisely the same.
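
The wheel example can be checked numerically. The sketch below is an illustration only: the function name `apparent_revs_per_second` is hypothetical, and it assumes the viewer always perceives the smallest rotation consistent with two consecutive frames.

```python
def apparent_revs_per_second(true_revs_per_second, frames_per_second=24):
    """Rotation speed a viewer would infer from the filmed frames.

    The rotation per frame is wrapped into the range [-0.5, 0.5)
    revolutions, because a 3/4 turn forward looks identical to a
    1/4 turn backward when only the frames are seen.
    """
    revs_per_frame = true_revs_per_second / frames_per_second
    wrapped = (revs_per_frame + 0.5) % 1.0 - 0.5
    return wrapped * frames_per_second

print(apparent_revs_per_second(18))   # -6.0: appears to spin backward
print(apparent_revs_per_second(6))    # 6.0: slow enough to look correct
```
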
For audio sampling, the phenomenon is practically identical. Any frequency that exceeds the Nyquist frequency is indistinguishable from a negative frequency the same amount below the Nyquist frequency. (And we do not distinguish perceptually between positive and negative frequencies.) To the extent that a frequency exceeds the Nyquist frequency, it is folded back down from the Nyquist frequency by the same amount.
For a demonstration, consider two examples. First, picture a 4,000 Hz cosine wave (energy only at 4,000 Hz) being sampled at a rate of 22,050 Hz. 22,050 Hz is half the CD sampling rate, and is an acceptable sampling rate for sounds that do not have much energy in the top octave of our hearing range.
In this case the sampling rate is quite adequate because the maximum frequency we are trying to record is well below the Nyquist frequency.
Now consider the same 4,000 Hz cosine wave sampled at an inadequate rate, such as 6,000 Hz. The wave completes more than 1/2 cycle per sample, and the resulting samples are indistinguishable from those that would be obtained from a 2,000 Hz cosine wave.
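
One way to convince yourself of this is to compute the samples directly. The sketch below is an illustration rather than a proof; the helper name `cosine_samples` and the choice of 12 samples are arbitrary.

```python
import math

SAMPLE_RATE_HZ = 6_000   # the inadequate sampling rate from the text

def cosine_samples(freq_hz, count=12):
    """Return `count` samples of a unit cosine wave taken at SAMPLE_RATE_HZ."""
    return [
        math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ)
        for n in range(count)
    ]

samples_4k = cosine_samples(4_000)
samples_2k = cosine_samples(2_000)

# Compare sample by sample, allowing for tiny floating-point error.
identical = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(samples_4k, samples_2k)
)
print(identical)   # True: the two sets of samples match
```

Once sampled at 6,000 Hz, nothing in the data distinguishes the 4,000 Hz wave from a 2,000 Hz one.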
The simple lesson to be learned from the Nyquist theorem is that digital audio cannot accurately represent any frequency greater than half the sampling rate. Any such frequency will be misrepresented by being folded over into the range below half the sampling rate.
