The position of each audio source within the audio signal is called a channel. Each channel contains a sample indicating the amplitude of the audio being produced by that source at a given moment in time. For instance, in stereo sound, there are two audio sources: one speaker on the left, and one on the right. Each of these is represented by one channel, and the number of channels contained in the audio signal is called the channel count.
While recording or generating multi-channel audio files, the channels are assembled into a series of audio frames, each consisting of one sample for each of the audio's channels. An individual sample is a numeric value representing the amplitude of the sound waveform at a single moment in time, and may be represented in various formats. Stereo audio is probably the most commonly used channel arrangement in web audio, and 16-bit samples are used for the majority of day-to-day audio in use today.
For 16-bit stereo audio, each sample taken from the analog signal is recorded as two 16-bit integers, one for the left channel and one for the right. That means each sample requires 32 bits of memory. At the common sample rate of 48 kHz (48,000 samples per second), this means each second of audio occupies 192 kB of memory. Therefore, a typical three-minute song requires about 34.5 MB of storage. That's a lot of storage, but worse, it's an insane amount of network bandwidth to use for a relatively short piece of audio.
That's why most digital audio is compressed. Over the years, a large variety of codecs have been developed, several of which are commonly used on the web. For details about the most important and useful ones for web developers to be familiar with, see the article Guide to audio codecs used on the web.
There are two types of audio channel. Standard audio channels are used to present the majority of the audible sound. The sound for the left and right main channels, as well as all of your surround sound speakers (center, left and right rear, left and right sides, ceiling channels, and so forth), are all standard audio channels.
Special Low Frequency Enhancement (LFE) channels provide the signal for special speakers designed to produce the low frequency sounds and vibration that create a visceral sensation when listening to the audio. The LFE channels typically drive subwoofers and similar devices. Monophonic audio has one channel, stereo sound has two channels, 5.1 surround sound has six channels (five standard and one LFE), and so forth. Each audio frame is a data record that contains the samples for all of the channels available in an audio signal.
The size of an audio frame is calculated by multiplying the sample size in bytes by the number of channels, so a single frame of stereo 16-bit audio is 4 bytes long and a single frame of 5.1 16-bit audio is 12 bytes long (2 bytes per sample multiplied by 6 channels). Note: Some codecs will actually separate the left and right channels, storing them in separate blocks within their data structure.
However, an audio frame always contains the data for all available channels. The number of frames that make up a single second of audio varies depending on the sample rate used when recording the sound. Since the sample rate corresponds to the number of "slices" a sound wave is divided into for each second of time, it's sometimes thought of as a frequency (in the sense that it's a description of something that repeats periodically, not in terms of actual audio frequency), and the samples-per-second measurement therefore uses hertz as its unit.
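To make the frame-size arithmetic described above concrete, here is a minimal TypeScript sketch; the function name is purely illustrative and not part of any web API:

```ts
// Size of one audio frame = bytes per sample × number of channels.
function frameSizeBytes(bytesPerSample: number, channelCount: number): number {
  return bytesPerSample * channelCount;
}

console.log(frameSizeBytes(2, 2)); // 16-bit stereo: 4 bytes per frame
console.log(frameSizeBytes(2, 6)); // 16-bit 5.1 audio: 12 bytes per frame
console.log(frameSizeBytes(4, 6)); // 32-bit float 5.1 audio: 24 bytes per frame
```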
The international G.711 standard for telephone audio uses a sample rate of 8000 Hz (8 kHz). This is enough for human speech to be comprehensible. Audio CDs provide uncompressed 16-bit stereo sound at 44.1 kHz. Computer audio also frequently uses this frequency by default. There is a reason why 44.1 kHz was chosen: the Nyquist-Shannon sampling theorem dictates that to reproduce a sound accurately, it must be sampled at twice the rate of the sound's highest frequency.
Since the range of human hearing is from around 20 Hz to 20,000 Hz, reproducing the highest-pitched sounds people can generally hear requires a sample rate of more than 40,000 Hz. To provide additional room for a low-pass filter in order to avoid distortion caused by aliasing, an additional 2.05 kHz transition band is added to the pre-sampling frequency (resulting in 22,050 Hz). Doubling that per the Nyquist theorem results in a final minimum frequency of (you guessed it) 44.1 kHz. High-resolution 96 kHz audio is used in some high-end audio systems, and it and ultra-high resolution 192 kHz audio are useful for audio mastering, where you need as much quality as possible while manipulating and editing the sound before downsampling to the sample rate you will use for the final product.
This is similar to how photographers will use high resolution images for editing and compositing before presenting the customer with a JPEG suitable for use on a web site. Once you know the size of a single audio frame and how many frames per second make up your audio data, you can easily calculate how much space the raw sound data itself will occupy and therefore how much bandwidth it would consume on a network.
For example, consider a stereo audio clip (that is, two audio channels) with a sample size of 16 bits (2 bytes), recorded at 48 kHz:

2 channels × 2 bytes per sample × 48,000 samples per second = 192,000 bytes (192 kB) per second
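The same calculation can be sketched in TypeScript and extended to estimate the size of a whole clip; the helper name is illustrative, not a standard API:

```ts
// Raw (uncompressed) PCM data rate in bytes per second, and the size of a clip.
function bytesPerSecond(channels: number, bytesPerSample: number, sampleRate: number): number {
  return channels * bytesPerSample * sampleRate;
}

const rate = bytesPerSecond(2, 2, 48_000); // 192,000 bytes/s (192 kB per second)
const threeMinuteSong = rate * 180;        // 34,560,000 bytes, roughly 34.5 MB
console.log(rate, threeMinuteSong);
```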
At 192 kB per second, lower-end networks are already going to be strained just by a single audio stream playing. If the network is also doing other things, the problem strikes even on higher-bandwidth networks. With so much competition for network capacity, especially on slower networks, this amount of data may be too much to viably transmit during any kind of real-time application. Note: Network bandwidth is obviously not the same thing as audio bandwidth, which is discussed in Sampling audio, above.
Unlike text and many other kinds of data, audio data tends to be noisy, meaning the data rarely consists of a series of exactly repeated bytes or byte sequences. As a result, audio data is difficult to compress using traditional algorithms such as those used by general-purpose tools like zip, which usually work by replacing repeating sequences of data with a shorthand representation.
There are several techniques which can be applied when compressing audio. Most codecs use a combination of these, and may use other techniques as well. The simplest thing you can do is to apply a filter that removes hiss and quiet sounds, converting any quiet sections into silence and smoothing out the signal. This can produce stretches of silence as well as other repeating or nearly repeating signals that can be shortened.
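As a rough illustration of that first technique (not how any real codec implements it), the following TypeScript sketch zeroes out samples whose amplitude falls below a threshold, so quiet passages become literal runs of zeros; the function name and threshold value are illustrative:

```ts
// Toy "noise gate": samples quieter than the threshold become true silence,
// producing long runs of zeros that a later compression stage can shorten.
function gateSilence(samples: Float32Array, threshold = 0.01): Float32Array {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = Math.abs(samples[i]) < threshold ? 0 : samples[i];
  }
  return out;
}
```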
You can apply a filter that narrows the audio bandwidth, removing any audio frequencies that you don't care about. This is especially useful for voice-only audio signals. Doing this removes data, making the resulting signal more likely to be easy to compress. If you know what kind of audio you're most likely to handle, you can potentially find special filtering techniques applicable specifically to that kind of sound that will optimize the encoding. The most commonly used compression methods for audio apply the science of psychoacoustics.
This is the science that studies how humans perceive sound, and what parts of the audio frequencies we hear are most important to how we respond to those sounds, given the context and content of the sound.
Factors such as the ability to sense the change in frequency of a sound, the overall range of human hearing versus the frequencies of the audio signal, audio localization, and so forth all can be considered by a codec. By using a sound (no pun intended) understanding of psychoacoustics, it's possible to design a compression method that will minimize the compressed size of the audio while maximizing the perceived fidelity of the sound. An algorithm employing psychoacoustics may use any of the techniques mentioned here, and will almost certainly apply others as well.
All of this means there is a fundamental question that has to be asked and answered before choosing a codec: Given the content of the sound, the usage context, and the target audience, is it acceptable to lose some degree of audio fidelity, and if so, how much; or is it necessary that, upon decoding the data, the result be identical to the source audio?
Similar to the Nyquist frequency, 24 FPS happens to be the magic number for making a series of pictures look like a fluid moving image. The audio sample rate must be a multiple of the frame-rate in order to stay in sync. Higher sample rates are also widely used, but their necessity is debated.
Higher sample rates are always multiples of 44.1 kHz or 48 kHz: for example, 88.2 kHz and 96 kHz. The bit depth of a file determines its dynamic resolution, similar to a digital photograph. Each additional bit doubles the number of possible amplitude values, so more bits per sample means greater dynamic range.
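A small TypeScript sketch of that relationship, using the common rule-of-thumb of roughly 6.02 dB of theoretical dynamic range per bit (the function names are illustrative):

```ts
// An n-bit sample can represent 2^n distinct amplitude values; each extra bit
// adds about 20 * log10(2) ≈ 6.02 dB of theoretical dynamic range.
function quantizationLevels(bits: number): number {
  return 2 ** bits;
}

function dynamicRangeDb(bits: number): number {
  return 20 * Math.log10(2 ** bits);
}

console.log(quantizationLevels(16), dynamicRangeDb(16).toFixed(1)); // 65536, "96.3"
console.log(quantizationLevels(24), dynamicRangeDb(24).toFixed(1)); // 16777216, "144.5"
```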
Here's a rundown of common sample rates: 44.1 kHz (the CD standard), 48 kHz (the standard for video and film audio), and their multiples, such as 88.2 kHz, 96 kHz, and 192 kHz.

PCM audio can be encoded in many formats for the end user, and these formats fall into two categories: lossless and lossy. Lossless formats perfectly preserve whatever information was captured at the time of recording but can take up a lot of hard drive space. Lossy formats create compressed files (note: data compression is different from audio compression) which take up significantly less hard drive space but can sacrifice some audio quality or result in unpleasant artifacts.
We're much more sensitive to frequencies in the range from a few hundred hertz up to about 7,000 Hz than we are to frequencies outside that range. This may be due, evolutionarily, to the importance of hearing speech and many other important sounds, which lie mostly in that frequency range. Nevertheless, there is a correlation -- even if not perfectly linear -- between amplitude and loudness, so it's certainly informative to know the relative amplitude of two sounds.
As mentioned earlier, the softest sound we can hear has about one millionth the amplitude of the loudest sound we can bear. Rather than discuss amplitude using such a wide range of numbers, from 0 to 1,000,000, it is more common to compare amplitudes on a logarithmic scale. The ratio between two amplitudes is commonly discussed in terms of decibels (abbreviated dB). A level expressed in terms of decibels is a statement of a ratio relationship between two values -- not an absolute measurement.
If we consider one amplitude as a reference (which we call A0), then the relative amplitude of another sound in decibels can be calculated with the equation: level in decibels = 20 log10 (A/A0). If we consider the maximum possible amplitude as a reference with a numerical value of 1, then a sound with amplitude 0.5 has a level of 20 log10 (0.5/1), or about -6 dB. Each halving of amplitude is a difference of about -6 dB; each doubling of amplitude is an increase of about 6 dB. So, if one amplitude is 48 dB greater than another, one can estimate that it's about 2^8 (that is, 256) times as great.
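Here is a minimal TypeScript sketch of that formula; amplitudes are assumed to be normalized so the reference amplitude A0 defaults to 1, and the function name is illustrative:

```ts
// Relative level in decibels of amplitude a with respect to reference amplitude a0:
// level = 20 * log10(a / a0)
function amplitudeToDb(a: number, a0 = 1): number {
  return 20 * Math.log10(a / a0);
}

console.log(amplitudeToDb(0.5));      // ≈ -6.02 dB (halving the amplitude)
console.log(amplitudeToDb(0.000001)); // ≈ -120 dB (one millionth of the reference)
console.log(amplitudeToDb(2 ** 8));   // ≈ 48.2 dB (2^8 = 256 times the reference)
```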
A theoretical understanding of sine waves, harmonic tones, inharmonic complex tones, and noise, as discussed here, is useful to understanding the nature of sound. However, most sounds are actually complicated combinations of these theoretical descriptions, changing from one instant to another. For example, a bowed string might include noise from the bow scraping against the string, variations in amplitude due to variations in bow pressure and speed, changes in the prominence of different frequencies due to bow position, changes in amplitude and in the fundamental frequency and all its harmonics due to vibrato movements in the left hand, etc.
A drum note may be noisy but might evolve so as to have emphases in certain regions of its spectrum that imply a harmonic tone, thus giving an impression of fundamental pitch. Examination of existing sounds, and experimentation in synthesizing new sounds, can give insight into how sounds are composed.
The computer provides that opportunity. To understand how a computer represents sound, consider how a film represents motion. A movie is made by taking still photos in rapid sequence at a constant rate, usually twenty-four frames per second. When the photos are displayed in sequence at that same rate, it fools us into thinking we are seeing continuous motion, even though we are actually seeing twenty-four discrete images per second.
Digital recording of sound works on the same principle. We take many discrete samples of the sound wave's instantaneous amplitude, store that information, then later reproduce those amplitudes at the same rate to create the illusion of a continuous wave.
The job of a microphone is to transduce (convert one form of energy into another) the change in air pressure into an analogous change in electrical voltage.
This continuously changing voltage can then be sampled periodically by a process known as sample and hold. At regularly spaced moments in time, the voltage at that instant is sampled and held constant until the next sample is taken. This reduces the total amount of information to a certain number of discrete voltages. A device known as an analog-to-digital converter (ADC) receives the discrete voltages from the sample and hold device, and ascribes a numerical value to each amplitude.
This process of converting voltages to numbers is known as quantization. Those numbers are expressed in the computer as a string of binary digits (1 or 0). The resulting binary numbers are stored in memory — usually on a digital audio tape, a hard disk, or a laser disc. To play the sound back, we read the numbers from memory, and deliver those numbers to a digital-to-analog converter (DAC) at the same rate at which they were recorded. The DAC converts each number to a voltage, and communicates those voltages to an amplifier to increase the amplitude of the voltage.
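The quantization and playback steps can be sketched as follows; this is a deliberately simplified uniform quantizer, not the exact behavior of any particular converter, and all names are illustrative:

```ts
// ADC side: map an amplitude in [-1, 1] to a signed n-bit integer.
function quantize(amplitude: number, bits: number): number {
  const maxLevel = 2 ** (bits - 1) - 1; // e.g. 32767 for 16-bit samples
  const clamped = Math.max(-1, Math.min(1, amplitude));
  return Math.round(clamped * maxLevel);
}

// DAC side: map the stored integer back to an amplitude.
function dequantize(value: number, bits: number): number {
  return value / (2 ** (bits - 1) - 1);
}

const original = Math.sin(2 * Math.PI * 440 / 48_000); // one sample of a 440 Hz wave at 48 kHz
const stored = quantize(original, 16);
console.log(stored, dequantize(stored, 16)); // close to, but not exactly, the original value
```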
In order for a computer to represent sound accurately, many samples must be taken per second — many more than are necessary for filming a visual image. In fact, we need to take more than twice as many samples per second as the highest frequency we wish to record. For an explanation of why this is so, see Limitations of Digital Audio on the next page. If we want to record frequencies as high as 20,000 Hz, we need to sample the sound at least 40,000 times per second.
The number of samples taken per second is known as the sampling rate. This means the computer can only accurately represent frequencies up to half the sampling rate.
Any frequencies in the sound that exceed half the sampling rate must be filtered out before the sampling process takes place. This is accomplished by sending the electrical signal through a low-pass filter which removes any frequencies above a certain threshold. Also, when the digital signal the stream of binary digits representing the quantized samples is sent to the DAC to be re-converted into a continuous electrical signal, the sound coming out of the DAC will contain spurious high frequencies that were created by the sample and hold process itself.
Therefore, we need to send the output signal through a low-pass filter, as well. The digital recording and playback process, then, is a chain of operations, as represented in the following diagram. We've noted that it's necessary to take at least twice as many samples as the highest frequency we wish to record.
This was proven by Harry Nyquist, and is known as the Nyquist theorem. Stated another way, the computer can only accurately represent frequencies up to half the sampling rate. One half the sampling rate is often referred to as the Nyquist frequency or the Nyquist rate. If we take, for example, 16,000 samples of an audio signal per second, we can only capture frequencies up to 8,000 Hz. So, if the sound we were trying to sample contained energy at 9,000 Hz, the sampling process would misrepresent that frequency as 7,000 Hz -- a frequency that might not have been present at all in the original sound.
This effect is known as foldover or aliasing. The main problem with aliasing is that it can add frequencies to the digitized sound that were not present in the original sound, and unless we know the exact spectrum of the original sound there is no way to know which frequencies truly belong in the digitized sound and which are the result of aliasing.
That's why it's essential to use the low-pass filter before the sample and hold process, to remove any frequencies above the Nyquist frequency. To understand why this aliasing phenomenon occurs, think back to the example of a film camera, which shoots 24 frames per second. If we film a wheel that is spinning slightly faster than 12 rotations per second (half the frame rate), each frame catches it in the same positions it would occupy if it were spinning slightly slower than 12 rotations per second in the opposite direction. The samples we obtain in the two cases are precisely the same. For audio sampling, the phenomenon is practically identical.
Any frequency that exceeds the Nyquist rate is indistinguishable from a negative frequency the same amount less than the Nyquist rate. And we do not distinguish perceptually between positive and negative frequencies. To the extent that a frequency exceeds the Nyquist rate, it is folded back down from the Nyquist frequency by the same amount. For a demonstration, consider the next two examples. The following example shows a graph of a 4,000 Hz cosine wave (energy only at 4,000 Hz) being sampled at a rate of 22,050 Hz.
In this case the sampling rate is quite adequate because the maximum frequency we are trying to record is well below the Nyquist frequency. Now consider the same 4,000 Hz cosine wave sampled at an inadequate rate, such as 6,000 Hz. The simple lesson to be learned from the Nyquist theorem is that digital audio cannot accurately represent any frequency greater than half the sampling rate.
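To make the folding arithmetic concrete, here is a small TypeScript sketch; the function name is illustrative, and the modulo step simply accounts for the fact that the sampled spectrum repeats at every multiple of the sampling rate:

```ts
// Apparent (aliased) frequency of a tone at freq Hz when sampled at sampleRate Hz.
// Anything above the Nyquist frequency (sampleRate / 2) folds back down by the
// amount it exceeds that limit.
function aliasedFrequency(freq: number, sampleRate: number): number {
  const nyquist = sampleRate / 2;
  const wrapped = freq % sampleRate;
  return wrapped > nyquist ? sampleRate - wrapped : wrapped;
}

console.log(aliasedFrequency(4_000, 22_050)); // 4000 Hz: below Nyquist, represented correctly
console.log(aliasedFrequency(9_000, 16_000)); // 7000 Hz: the misrepresentation described earlier
console.log(aliasedFrequency(4_000, 6_000));  // 2000 Hz: the inadequately sampled cosine wave
```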
Any such frequency will be misrepresented by being folded over into the range below half the sampling rate.

Each sample of an audio signal must be ascribed a numerical value to be stored in the computer. The numerical value expresses the instantaneous amplitude of the signal at the moment it was sampled. The range of the numbers must be sufficiently large to express adequately the entire amplitude range of the sound being sampled.
The range of possible numbers used by a computer depends on the number of binary digits (bits) used to store each number.
A bit can have one of two possible values: either 1 or 0. Two bits together can have one of four possible values: 00, 01, 10, or 11.