Thursday 21 May 2015

How does Sample Rate Conversion work?

I wrote this post mainly to address the following question: If I have a choice of sample rates available, which one should I choose?  I get asked this often, and the answer, like all things pertaining to digital audio, is both simple and complicated depending on how deeply you want to look into it.  So here is a quick primer on the technical issues that underpin sample rate conversion (SRC).  I have not attempted to sugar-coat the technical aspect, so feel free to go away and read something else if you are easily intimidated :)

First of all, what, exactly is Sample Rate Conversion?  Well, digital audio works by encoding a waveform using a set of numbers.  Each number represents the magnitude of the waveform at a particular instant in time, so in principle, each time we measure (or ‘sample’) the waveform we need to store two numbers.  One number is the magnitude of the waveform itself and the other number is the exact point in time at which the number was measured.  That’s a lot of numbers, but we can cut them in half if we can eliminate having to store all the timing numbers.  Suppose we measure the waveform using a very specific regular timing pattern determined in advance?  If we can do that, then we don’t have to store the timing information because we can simply use a very accurate clock to regenerate it during playback.  This is how all digital audio is managed for consumer markets.
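
If it helps to see that bookkeeping laid out, here is a tiny sketch in Python (my choice of language, purely for illustration, and nothing to do with how any particular player or DAC is implemented).  Only the amplitude values are stored; the timing grid is implied by the agreed-upon sample rate and is regenerated on playback.

    # A minimal sketch: sampling a 1kHz sine wave at a fixed, known rate.
    # Because the sample times are implied by the rate, only the amplitude
    # values need to be stored.
    import numpy as np

    sample_rate = 44100                              # samples per second, agreed in advance
    duration = 0.001                                 # one millisecond of signal
    t = np.arange(0, duration, 1.0 / sample_rate)    # the implied sample times
    samples = np.sin(2 * np.pi * 1000 * t)           # only these numbers get stored

    # On playback, the timing grid is regenerated from an accurate clock:
    playback_times = np.arange(len(samples)) / sample_rate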

The “Sample Rate” is the rate at which we sample (or measure) the waveform.  Provided we know exactly what the sample rate is, we can relatively easily reconstruct the original waveform using those stored numbers.  The chosen sample rate imposes some very specific restrictions on the waveforms that we can encode in this manner.  Most particularly, we must observe the Shannon-Nyquist criterion.  This states that the signal being sampled must contain no frequencies above one half of the sample rate.  If any such frequencies are present in the signal, they must be filtered out very strictly before being sampled.  Also, it is one of the simpler tenets of audio that human hearing is restricted to the frequency range below 20kHz.  Based on those two things, we can derive the commonly-quoted requirement that, in order to achieve high quality, digital audio must have a sample rate of at least 40kHz.  For that reason, the standard which has been chosen for CD audio, and widely adopted for digital audio in general, is 44.1kHz.  Interestingly, for DVD (and for most professional digital audio), a slightly different sample rate of 48kHz was adopted.  These numbers have important consequences.
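
To see why that pre-filtering is not optional, here is a small Python illustration (again just a sketch for this post): a 25kHz tone sampled at 44.1kHz produces exactly the same stored numbers as a 19.1kHz tone, so without the filter it would masquerade as, or ‘alias’ into, an audible signal.

    # Why the Shannon-Nyquist criterion matters: a 25kHz tone sampled at
    # 44.1kHz yields samples identical to those of a 19.1kHz tone
    # (44.1kHz minus 25kHz), i.e. it aliases into the audible band unless
    # it is filtered out before sampling.
    import numpy as np

    fs = 44100
    n = np.arange(64)
    tone_25k  = np.cos(2 * np.pi * 25000 * n / fs)
    tone_19k1 = np.cos(2 * np.pi * (fs - 25000) * n / fs)

    print(np.allclose(tone_25k, tone_19k1))    # True: indistinguishable once sampled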

Of course, the above is not the whole story, and there are various good reasons why you might want to consider sampling your audio signal at sample rates significantly higher than 44.1kHz.  As a result, audio recordings exist at all sorts of different sample rates, and for distribution or playback compatibility purposes you may well have a good reason to want to convert existing audio data from one sample rate to another.

If you convert from a lower sample rate to a higher one, the process is called up-conversion.  In the opposite case, conversion from a higher to a lower sample rate is called down-conversion.  The alternative terms up-sampling and down-sampling can be used interchangeably.  I tend to use both, according only to whim.

We’ll start with a simple case.  Let’s say I have some music sampled at 44.1kHz and I want to convert it to a sample rate of 88.2kHz (which is a factor of exactly 2x the original sample rate).  This is a very simple case, because the 88.2kHz data stream comprises all of the 44.1kHz samples with one additional sample inserted exactly half way between each of the original 44.1kHz samples.  The process of inserting those additional samples is called interpolation.  In effect, what I have to do is (i) figure out what the original analog waveform was, and then (ii) sample it at points in time located at the mid-points between each of the existing samples.  Are you with me so far?
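
Here is a little Python sketch of that structure, just to make the bookkeeping concrete.  The mid-point values are filled in with simple linear interpolation, purely as a placeholder guess; as we will see shortly, a real converter makes a much better job of step (ii) than this.

    # Sketch of the 44.1kHz -> 88.2kHz structure: every original sample is
    # kept, and one new sample is interpolated at the mid-point between each
    # pair.  Linear interpolation is used here only as a crude placeholder.
    import numpy as np

    def naive_upsample_2x(x):
        y = np.zeros(2 * len(x) - 1)
        y[0::2] = x                          # original samples, untouched
        y[1::2] = 0.5 * (x[:-1] + x[1:])     # crude mid-point guesses
        return y

    x_44k1 = np.array([0.0, 0.5, 1.0, 0.5, 0.0, -0.5])
    print(naive_upsample_2x(x_44k1))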

Obviously, the key point here is to recreate the original waveform, and I have already said that “we can relatively easily reconstruct the original waveform using the stored numbers”.  However, like a lot of digital audio, once you start to look closely at it you find that what is easy from a mathematical perspective is often mightily tedious from a practical one.  For example, Claude Shannon (he of the Shannon-Nyquist sampling theorem) proved that a perfect recreation of the analog signal involves ‘simply’ the convolution of the sampled data with a continuous sinc() function.  However, if you were to set about performing such a convolution, and evaluating the result at the interpolation points, you would find that it involves a truly massive amount of computation, and is not something you would want to do on any sort of routine basis.  Nonetheless, convolution with a sinc() function does indeed give you a mathematically precise answer, and interpolations performed in this manner would in principle be as accurate as it is possible to make them.
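
For the curious, here is what that looks like in a few lines of Python: a sketch of the Whittaker-Shannon formula, truncated to a finite record, and emphatically not how any real-world converter operates.  Notice that every single interpolated point has to touch every single stored sample, which is where the truly massive amount of computation comes from.

    # The 'mathematically exact' approach: reconstruct the waveform at an
    # arbitrary time t by summing the samples weighted by a sinc() function
    # (truncated here to the finite record, so only approximately exact).
    import numpy as np

    def sinc_interpolate(samples, fs, t):
        """Evaluate the band-limited reconstruction at time t (in seconds)."""
        n = np.arange(len(samples))
        return np.sum(samples * np.sinc(fs * t - n))   # np.sinc(x) = sin(pi*x)/(pi*x)

    fs = 44100
    x = np.sin(2 * np.pi * 1000 * np.arange(32) / fs)  # a short 1kHz burst

    # Interpolate at the mid-point between samples 15 and 16:
    print(sinc_interpolate(x, fs, 15.5 / fs))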

So if a convolution is not practical, how else can we recreate the original analog signal?  The answer is that we can follow the process that happens inside a DAC (at least inside a theoretical DAC) and do something similar to recreate the original waveform in the digital domain.  Inside a DAC we pass the digital waveform through what is called a brick-wall filter, which blocks all of the frequencies above one-half of the sample rate while letting through, as untouched as possible, all of the frequencies below one-half of the sample rate.

This brick-wall approach is the basis of the type of interpolation filter most commonly used.  What we do is make a sensible guess at what the interpolated value ought to be, and then pass the result through a digital brick-wall filter to filter out any errors we may have introduced via our guesswork.  If we have made a good guess, then the filter will indeed filter out all of the errors.  But if our guess is not so good, then the errors can contain components which fold down into our signal band and degrade the signal.  This filtering method has the disadvantage (if you want to think of it that way) of introducing phase errors into the signal, and it also means that if you look closely at the resulting data stream you will see that most of the original 44.1kHz samples will have been modified by the filter.  There is some debate as to whether such phase errors are audible, and here at BitPerfect we believe that they actually may be.  So your choice of filter may indeed have an impact upon the resulting sound quality of the conversion.
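
To make that concrete, here is a sketch of one textbook realization of the idea in Python, using SciPy.  The ‘guess’ is simply a zero inserted between each pair of original samples, and a windowed-sinc brick-wall filter then removes the errors (the spectral image) that those zeros create.  The filter length is an illustrative choice of mine, and none of this should be read as a description of the particular filter BitPerfect uses.

    # One textbook way to up-sample 44.1kHz -> 88.2kHz: zero-stuffing followed
    # by a digital brick-wall low-pass filter at 22.05kHz.
    import numpy as np
    from scipy.signal import firwin

    def upsample_2x(x, fs_in=44100, taps=255):
        fs_out = 2 * fs_in
        # Step 1: the crude "guess": insert a zero between each pair of samples.
        stuffed = np.zeros(2 * len(x))
        stuffed[0::2] = x
        # Step 2: brick-wall filter at the original Nyquist frequency; the
        # factor of 2 restores the level lost to the zero-stuffing.
        h = firwin(taps, cutoff=fs_in / 2, fs=fs_out)
        return 2.0 * np.convolve(stuffed, h, mode="same")

    x = np.sin(2 * np.pi * 1000 * np.arange(1000) / 44100)
    y = upsample_2x(x)
    print(len(x), len(y))    # 1000 samples in, 2000 samples out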

Up-conversion in this manner is usually performed by a specialized filter which in effect combines the job of making the good guess and doing the filtering.
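
SciPy’s resample_poly is one readily available example of such a combined (polyphase) filter; it folds the zero-insertion and the filtering into a single operation and never wastes any effort multiplying by the zeros.  I use it here purely as a convenient stand-in for whatever specialized filter a given converter actually employs.

    # 44.1kHz -> 88.2kHz in one call to an off-the-shelf polyphase resampler.
    import numpy as np
    from scipy.signal import resample_poly

    x_44k1 = np.sin(2 * np.pi * 1000 * np.arange(1000) / 44100)
    x_88k2 = resample_poly(x_44k1, up=2, down=1)
    print(len(x_88k2))    # 2000 samples: twice as many as went in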

When up-converting by factors which are not nice numbers (for example when converting from 44.1kHz to 48kHz, a factor of approximately 1.088x) the same process applies.  However, it is further complicated by the fact that you can no longer rely on a significant fraction of the original samples being reusable as samples in the output.  By contrast, when converting from 44.1kHz to 88.2kHz, every second sample in the output stream is derived from an interpolated value.  Those interpolated values, which contain the errors, alternate with original 44.1kHz sample values which, by definition, contain no errors.  It can be seen, therefore, that the resultant error signal will be dominated by higher frequencies that were not present in the original music signal, and can therefore be easily eliminated with a filter.  I hope that is clear.

On the other hand, if I am converting from 44.1kHz to 48kHz, then only 1 in every 160 samples of the 48kHz output stream will correspond directly to original samples from the 44.1kHz data stream (the largest factor common to 44,100 and 48,000 is 300, and 48,000 ÷ 300 = 160).  In other words, 159 out of every 160 samples in the output stream will start off life as an interpolated value.  The quality of this conversion is going to be very dependent on the accuracy of those initial interpolation guesses.  Again, the process of making a best guess and doing the filtering is typically combined into a specialized filter, but the principle of operation remains the same.
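
The arithmetic behind that 1-in-160 figure, and the conversion itself, can be sketched in a few lines of Python (once again leaning on resample_poly as a stand-in for the specialized filter):

    # 48000/44100 reduces to 160/147, so the sample instants of the two
    # streams only line up once every 160 output samples (equivalently,
    # once every 147 input samples).
    from math import gcd
    import numpy as np
    from scipy.signal import resample_poly

    f_in, f_out = 44100, 48000
    g = gcd(f_in, f_out)                 # 300
    up, down = f_out // g, f_in // g     # 160, 147
    print(up, down)

    x_44k1 = np.sin(2 * np.pi * 1000 * np.arange(f_in) / f_in)   # 1 second of tone
    x_48k = resample_poly(x_44k1, up, down)
    print(len(x_48k))                    # 48000 samples, i.e. 1 second at 48kHz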

Down-conversion is very similar, but with an additional wrinkle.  Let’s start with a very simple down-conversion from 88.2kHz to 44.1kHz.  It ought to be quite straightforward - just throw away every second sample, no?  No!  Here is the problem:  With a 44.1kHz sample rate you cannot encode any frequencies above 22.05kHz (i.e. one-half of the 44.1kHz sample rate).  On the other hand, if you have a music file sampled at 88.2kHz you must assume that it has encoded frequencies all the way up to 44.1kHz.  So before you can start throwing samples away you first have to put it through a brick-wall filter to remove everything above 22.05kHz.  Once you’ve done that then, yes, it is just a question of throwing away every second sample (a process often referred to as decimation).
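
In Python that might be sketched as follows (the filter length, once again, is nothing more than an illustrative choice):

    # 88.2kHz -> 44.1kHz: brick-wall filter first, so that nothing above
    # 22.05kHz survives, then keep every second sample (decimation).
    import numpy as np
    from scipy.signal import firwin

    def downsample_2x(x, fs_in=88200, taps=255):
        h = firwin(taps, cutoff=fs_in / 4, fs=fs_in)   # cutoff at 22.05kHz
        filtered = np.convolve(x, h, mode="same")
        return filtered[0::2]                          # throw away every second sample

    x_88k2 = np.sin(2 * np.pi * 1000 * np.arange(2000) / 88200)
    x_44k1 = downsample_2x(x_88k2)
    print(len(x_44k1))    # 1000 samples: half as many as went in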

This additional wrinkle makes the process of down-sampling by non-integer factors rather more complicated.  In fact, there are two specific complications.  First, how in the name of heck do you decimate by a non-integer fraction?  Second, because you’re now interpolating a signal which may contain frequencies that would be eliminated by the brick-wall filter, you need to do the interpolation first, before you do the brick-wall filtering, and then the decimation last of all (I’m sorry if that’s not immediately obvious - you’ll just have to stop and think it through).  Therefore, to get around these two issues, the process of down-sampling by a non-integer factor will usually involve (i) interpolative up-sampling to an integer multiple of the target sample rate; (ii) applying the brick-wall filter (which would not be the same filter that you would use if you were just up-sampling for its own sake); and finally (iii) performing decimation.  That is quite a lot to swallow, but I couldn’t see an easy way to simplify it without making it way too long (and I think this post is quite long enough as it is).
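
Here is a sketch of that recipe for one concrete case, 96kHz down to 44.1kHz.  The ratio reduces to 147/320, so the interpolative up-sampling takes you to 320 times 44.1kHz; in the sketch all three steps are fused into a single polyphase call, but the arithmetic is exactly as described above.

    # 96kHz -> 44.1kHz: up-sample by 147 (to 14.112MHz, an integer multiple
    # of the target rate), brick-wall filter at 22.05kHz, then decimate by 320.
    # resample_poly performs all three steps in one fused polyphase operation.
    from math import gcd
    import numpy as np
    from scipy.signal import resample_poly

    f_in, f_out = 96000, 44100
    g = gcd(f_in, f_out)                 # 300
    up, down = f_out // g, f_in // g     # 147, 320

    x_96k = np.sin(2 * np.pi * 1000 * np.arange(f_in) / f_in)    # 1 second at 96kHz
    x_44k1 = resample_poly(x_96k, up, down)
    print(len(x_44k1))                   # 44100 samples, i.e. 1 second at 44.1kHz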

I hope you have followed enough of what I just wrote to at least enable you to understand why I always recommend sample rate conversions between members of the same “family” of sample rates.  One family includes 44.1kHz, 88.2kHz, 176.4kHz, 352.8kHz, DSD64, DSD128, etc.  The other includes 48kHz, 96kHz, 192kHz and 384kHz.  If you feel the need to up- or down-sample (for any number of good reasons), try to stay within the same family.  In other words, convert from 44.1kHz to 88.2kHz rather than 96kHz.  But in any case, SRC does involve a substantial manipulation of the signal, and the principle that generally guides me is that if you can avoid it you are usually better off without it.
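
If you want to check which family a given rate belongs to, a trivial helper like this one (my own, purely for illustration, and treating the DSD rates simply as multiples of 44.1kHz) will do it:

    # Return the base rate (44100 or 48000) that `rate` is a power-of-two
    # multiple of, or None if it belongs to neither family.
    def family(rate):
        for base in (44100, 48000):
            r = rate
            while r > base and r % 2 == 0:
                r //= 2
            if r == base:
                return base
        return None

    for r in (44100, 88200, 176400, 2822400, 48000, 96000, 192000):
        print(r, family(r))
    # 88.2kHz, 176.4kHz and DSD64 (2.8224MHz) all share the 44.1kHz base,
    # while 96kHz and 192kHz share the 48kHz base.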

And when you buy digital downloads, if 88.2kHz or 176.4kHz are available as format options, choose them in a heartbeat over 96kHz and 192kHz.