I get asked about drawing waveforms from time to time. Over the years, I came to realize that this is a black art of sorts, and it requires a combination of some audio and drawing know-how on the Mac to get it right.
But first, a little story.
Once upon a time I used to write audio software for BeOS while I was in university. As almost every audio software author eventually does, I came to a point where I needed to render audio waveforms to the screen. I hacked up a straightforward drawing algorithm, and it worked well.
When I started working on a follow-on project, I decided to re-use the algorithm I wrote for the first application, but it didn’t work so well. The trouble is, when I originally wrote that algorithm, the audio clips in question were all very tiny—less than 2s. Now I was dealing with much longer clips (up to a few minutes, in practice), and the algorithm didn’t scale well at all.
Around this time, I interviewed with Sonic Foundry, with the hopes of joining the Vegas team. During my interview, I asked, “How do you guys draw waveforms on-screen for large audio clips, and so quickly!?”
“That’s proprietary information, sorry.”
At the time, I figured the guys were just avoiding a long, drawn-out response. After all, I had coded this up myself (even if my version wasn’t so fast), so it couldn’t be that difficult, right? Unfortunately, I got similar responses from other people I asked afterwards.
Regardless of whether you’re new to audio, or you’ve been doing it for a while, you are aware that there aren’t too many books on the topic. Furthermore, you probably aren’t going to find too much in the way of detailed algorithms, or even pseudocode, to help you out.
I’m starting to realize that the reason is two-fold.
First off, there really aren’t a lot of people out there who need to draw audio waveforms (or large data sets, for that matter) to screen. Second, it’s really not all that hard once you think about it for a while.
Overview
Drawing waveforms boils down to a few major stages: acquisition, reduction, storage, and drawing.
For each of the stages, you have many implementation options, and you’ll choose the simplest one that’ll serve your application. I don’t know what your application is, so I’ll use Capo as the main example for this post, and throw around some hypothetical situations where necessary.
Early on, you have to set some priorities: Speed, Accuracy, and Display Quality. The order of those priorities will help you decide how to build your drawing algorithm, down to the individual stages.
In Capo, I wanted to make Display Quality the top priority, followed by Speed, and then Accuracy. Because Capo would never be used to do sample-precise edits, I could throw away a whole lot of data, and then make the waveform look as good as possible in a short time frame.
If I were writing an audio editor, my priorities might be Accuracy, followed by Speed, and then Display Quality. For a sequencer (like GarageBand), I’d choose Speed, Display Quality, then Accuracy, because you’re only viewing the audio at a high level, and it’s part of a larger group of parts. Make sense?
Once you have an idea of what you need, you will have a clear picture of how to proceed.
Acquisition
This is almost worth a post of its own. I like using the ExtAudioFile{Open,Seek,Read,Close} API set from AudioToolbox.framework to open various audio file formats, but you may choose a combo of AudioFile+AudioConverter (ExtAudioFile wraps these for you), or QuickTime’s APIs, or whatever else floats your boat.
Your decision of API to get the source data is entirely up to your application. You can’t extract movie audio with (Ext)AudioFile APIs, for instance, so they might not help much when writing a video editing UI. Alternatively, you may have your own proprietary format, or record short samples into memory, etc.
Given the above, I’m going to assume you’re working with a list of floating-point values representing the audio, because that’ll be helpful later on. Using ExtAudioFile, set the client data format (kExtAudioFileProperty_ClientDataFormat) to floats; with an AudioConverter, make the output format floats, and you should be good.
When you’re pulling data from a file, keep in mind that it’s not going to be very quick, even on an SSD, thanks to format conversions. I’d advise doing all this work in an auxiliary thread, no matter how you get your audio, because it’ll keep your application responsive.
In Capo’s case, there is a separate thread that walks the entire audio file, doing the acquisition, reduction, and storage steps all at once. Because Display Quality and Speed were high on the priority list, the drawing step is done only when needed.
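To make the "all at once" idea concrete, here is a minimal sketch in plain C of a reducer that can be fed chunks as they come back from the file reader on a worker thread. It is not Capo's actual code: the `StreamReducer` type and the `reducer_*` names are hypothetical, and it assumes mono float samples and a caller-owned output buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical streaming reducer: keeps a partially filled bin between reads,
   so chunk sizes do not need to line up with the bin size. */
typedef struct {
    size_t bin_size;      /* samples folded into one overview value */
    size_t in_bin;        /* samples seen so far in the current bin */
    float  bin_peak;      /* largest magnitude seen in the current bin */
    float *out;           /* overview storage, owned by the caller */
    size_t out_capacity;  /* number of floats available in out */
    size_t out_count;     /* number of overview values written so far */
} StreamReducer;

static void reducer_init(StreamReducer *r, size_t bin_size,
                         float *out, size_t out_capacity)
{
    assert(r && out && bin_size > 0);
    r->bin_size = bin_size;
    r->in_bin = 0;
    r->bin_peak = 0.0f;
    r->out = out;
    r->out_capacity = out_capacity;
    r->out_count = 0;
}

/* Store the current bin (if there is room) and start a new one. */
static void reducer_flush_bin(StreamReducer *r)
{
    if (r->out_count < r->out_capacity)
        r->out[r->out_count++] = r->bin_peak;
    r->in_bin = 0;
    r->bin_peak = 0.0f;
}

/* Feed one chunk of freshly decoded samples (acquisition + reduction + storage). */
static void reducer_feed(StreamReducer *r, const float *chunk, size_t count)
{
    assert(r && (chunk || count == 0));
    for (size_t i = 0; i < count; i++) {
        float m = chunk[i] < 0.0f ? -chunk[i] : chunk[i];
        if (m > r->bin_peak)
            r->bin_peak = m;
        if (++r->in_bin == r->bin_size)
            reducer_flush_bin(r);
    }
}

/* Call once at end of file so a trailing partial bin is not lost. */
static void reducer_finish(StreamReducer *r)
{
    assert(r);
    if (r->in_bin > 0)
        reducer_flush_bin(r);
}
```

The worker thread would call `reducer_feed` after every read and `reducer_finish` at the end, so the full-resolution audio never has to sit in memory all at once.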
Reduction
Audio contains tons of delicious data. Unfortunately, when accuracy isn’t the top priority, it’s far too much data to be shown on the screen. With 44,100 samples/second, a second of audio would span ~17 30″ Cinema Displays (2,560 pixels wide each) if you displayed one sample value per horizontal pixel.
If accuracy is your top priority, you’re still going to be throwing lots of data away most of the time, except when your user wants to maintain a 1:1 sample:pixel ratio (or, in some cases, I’ve seen a sample take up more than 1 pixel, for very fine editing). If you’re writing an editor, or some other application that needs high-detail access to the source data, you will have to re-run the reduction step as the user changes the zoom level. When the user wants to see 1:1 samples:pixels, you won’t throw anything away. When the user wishes to see 200:1 samples:pixels, you’ll throw away 199 samples for every pixel you’re displaying.
In the case of Capo, I chose to create an overview data set for the ‘maximum zoom’ level, and keep that on the heap (a 5 minute song should take ~1MB RAM). In my case, I chose a maximum resolution of 50 samples per pixel, and created a data set from that. As the user zooms out, I then sample the overview data set to get the lower-resolution versions of the data. Accuracy isn’t great, but it’s pretty fast.
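As a rough check of those numbers, and one possible way to "sample" the overview when zooming out, here is a small sketch in plain C. The function names are hypothetical, it assumes mono audio with one 4-byte float stored per bin, and folding neighbouring overview values by keeping the largest is an assumption, not necessarily what Capo does.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for an overview that stores one float per bin (mono assumed). */
static size_t overview_bytes(double seconds, double sample_rate,
                             size_t samples_per_bin)
{
    assert(seconds >= 0.0 && sample_rate > 0.0 && samples_per_bin > 0);
    size_t total_samples = (size_t)(seconds * sample_rate);
    size_t bins = (total_samples + samples_per_bin - 1) / samples_per_bin;
    return bins * sizeof(float);
}

/* Zoom out by folding `factor` neighbouring overview values into one,
   keeping the largest. Returns the number of values written to out. */
static size_t zoom_out_overview(const float *overview, size_t count,
                                size_t factor, float *out, size_t out_capacity)
{
    assert(overview && out && factor > 0);
    size_t written = 0;
    for (size_t start = 0; start < count && written < out_capacity; start += factor) {
        size_t end = (start + factor < count) ? start + factor : count;
        float peak = overview[start];
        for (size_t i = start + 1; i < end; i++) {
            if (overview[i] > peak)
                peak = overview[i];
        }
        out[written++] = peak;
    }
    return written;
}
```

For a 5 minute song at 44,100 samples/second and 50 samples per bin, `overview_bytes(300.0, 44100.0, 50)` works out to 264,600 bins, or 1,058,400 bytes with 4-byte floats, which is where the ~1MB figure comes from.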
Now, when I talk about “throwing away”, or “sampling” the data set, I’m not simply discarding data. In some cases, randomly choosing samples to include in the final output will work just fine. However, you may encounter some pretty annoying artifacts (missing transients, jumping peaks, etc) when you change zoom levels or resize the display. If Display Quality is low on your list—who cares?
If you do care, you have a few options. Within each “bin” of the original audio, you can take a min/max pair, just the maximum magnitude, or an average. I have found the maximum magnitude to work well for the majority of cases. Here’s an example of what I do in Capo (in pseudocode, of sorts):
    // source_audio contains the raw sample data
    // sample_count is the number of samples in source_audio
    // overview_waveform will be filled with the 'sampled' waveform data
    // N is the 'bin size' determined by the current zoom level
    for ( i = 0; i < sample_count; i += N ) {
        // the last bin may hold fewer than N samples
        overview_waveform[i/N] = take_max_value_of( &(source_audio[i]), min( N, sample_count - i ) );
    }
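For comparison, here is a runnable sketch in plain C of the min/max pair option, with a helper that collapses a pair into the maximum magnitude. The `MinMax` type and both function names are made up for illustration, and the code assumes mono float samples.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest and largest sample value found in one bin. */
typedef struct {
    float min;
    float max;
} MinMax;

/* Reduce source[0..sample_count) to one MinMax per bin of bin_size samples.
   The last bin may be shorter. Returns the number of bins written. */
static size_t reduce_min_max(const float *source, size_t sample_count,
                             size_t bin_size, MinMax *out, size_t out_capacity)
{
    assert(source && out && bin_size > 0);
    size_t bins = 0;
    for (size_t start = 0; start < sample_count && bins < out_capacity; start += bin_size) {
        size_t end = (start + bin_size < sample_count) ? start + bin_size : sample_count;
        MinMax mm = { source[start], source[start] };
        for (size_t i = start + 1; i < end; i++) {
            if (source[i] < mm.min) mm.min = source[i];
            if (source[i] > mm.max) mm.max = source[i];
        }
        out[bins++] = mm;
    }
    return bins;
}

/* Collapse a min/max pair into a single maximum magnitude. */
static float max_magnitude(MinMax mm)
{
    float lo = mm.min < 0.0f ? -mm.min : mm.min;
    float hi = mm.max < 0.0f ? -mm.max : mm.max;
    return lo > hi ? lo : hi;
}
```

Keeping the pair costs twice the storage, but it preserves asymmetric waveforms; the maximum magnitude throws that away in exchange for a symmetric shape that is cheaper to store and draw.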
Once you have your reduced data set, then you can put it on the screen.
Display
Here's where you have the most leeway in your implementation. I use the Quartz API to do my drawing. I prefer the family of C CoreGraphics CG* calls, because they're portable to CoreAnimation/iPhone coding, the most feature-rich, and generally quicker than their Cocoa equivalents. I won't get into any alternatives here (e.g. OpenGL), to keep it simple.
If we stick with the Capo example, then we've chosen to use the maximum magnitude data to draw our waveform. By doing so, we can exploit the fact that the waveform is going to be symmetric about the X axis, and only create one half of the final waveform path using some CGAffineTransform magic.
In the past, developers would create waveforms in pixel buffers using a series of vertical lines to represent the magnitudes of the samples. I like to call this the "traditional waveform drawing". It's still used quite a bit today, and in some cases it works great (especially when showing very small waveforms, and pixels are scarce like in a multitrack audio editor).
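A minimal sketch of that traditional approach, in plain C, might look like the following. It assumes an 8-bit grayscale buffer stored row by row and one magnitude in [0,1] per pixel column; the function name and buffer layout are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Draw one vertical line per column, centered on the middle row.
   pixels is width * height bytes, row-major; magnitudes holds width values. */
static void draw_columns(unsigned char *pixels, size_t width, size_t height,
                         const float *magnitudes)
{
    assert(pixels && magnitudes && height >= 2);
    memset(pixels, 0, width * height);   /* clear to black */
    size_t center = height / 2;          /* row of the zero line */
    for (size_t x = 0; x < width; x++) {
        float m = magnitudes[x];
        if (m < 0.0f) m = 0.0f;          /* clamp to the expected range */
        if (m > 1.0f) m = 1.0f;
        size_t half_len = (size_t)(m * (float)center);
        size_t top = center - half_len;
        size_t bottom = center + half_len;
        if (bottom >= height)
            bottom = height - 1;
        for (size_t y = top; y <= bottom; y++)
            pixels[y * width + x] = 255; /* white pixel */
    }
}
```

A silent column still gets a single pixel on the zero line, which keeps the waveform visually continuous. There is no anti-aliasing here, which is the main thing the path-based approach buys you.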
I personally prefer to utilize Quartz paths so that I get some nice anti-aliasing to the waveform edge. Because Capo features the waveform so prominently in the display, I wanted to ensure I got top-notch output. Quartz paths gave me that guarantee.
To build the half-path, we'll also be exploiting the fact that both CoreAudio and Quartz represent points using floating-point values. Sadly, this code is slightly less awesome in 64-bit mode, since CGFloats become doubles, and you have to convert the single-precision audio floats over to double-precision pixels. Luckily there are quick routines for that conversion in Accelerate.framework (A whole 'nother blog post, I know...).
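The conversion itself is simple; here is a plain C loop that shows what it has to do. `PointD` is a hypothetical stand-in for a 64-bit CGPoint (two doubles), and `widen_points` is an illustrative name. In a real application you would probably reach for Accelerate's vDSP conversion routines (vDSP_vspdp, if I recall the name correctly) instead of a hand-written loop.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a 64-bit CGPoint: two doubles, x then y. */
typedef struct {
    double x;
    double y;
} PointD;

/* Widen point_count {sampleIndex, value} float pairs into double-precision points. */
static void widen_points(const float *pairs, size_t point_count, PointD *out)
{
    assert(pairs && out);
    for (size_t i = 0; i < point_count; i++) {
        out[i].x = (double)pairs[2 * i];      /* sample index */
        out[i].y = (double)pairs[2 * i + 1];  /* magnitude */
    }
}
```

Because the float pairs are interleaved exactly like the doubles in the point array, a vectorized version can treat both buffers as flat arrays of `2 * point_count` values.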
    - (CGPathRef)giveMeAPath {
        // Assume mAudioPoints is a float* with your audio points
        // (with {sampleIndex,value} pairs), and mAudioPointCount
        // contains the # of points in the buffer.
        CGMutablePathRef path = CGPathCreateMutable();
        // The cast is only valid while CGPoint is a pair of floats (32-bit builds).
        CGPathAddLines( path, NULL, (const CGPoint *)mAudioPoints, mAudioPointCount ); // magic!
        return path; // the caller is responsible for releasing the path
    }
Because magnitudes are represented in the range [0,1], and we're using Quartz, we can build a transform that'll scale the waveform path to fit inside half the height of the view, and then append another transform that'll translate/scale the path so it's flipped upside-down, and appears below the X axis line (which corresponds to a sample value of 0.0). Here's a zoomed in example of what I'm talking about.
And here's some code to give you an idea of what's going on to create the whole path:
    // Get the overview waveform data (taking into account the level of detail to
    // create the reduced data set)
    CGPathRef halfPath = [waveform giveMeAPath];

    // Build the destination path
    CGMutablePathRef path = CGPathCreateMutable();

    // Transform to fit the waveform ([0,1] range) into the vertical space
    // ([halfHeight,height] range)
    double halfHeight = floor( NSHeight( self.bounds ) / 2.0 );
    CGAffineTransform xf = CGAffineTransformIdentity;
    xf = CGAffineTransformTranslate( xf, 0.0, halfHeight );
    xf = CGAffineTransformScale( xf, 1.0, halfHeight );

    // Add the transformed path to the destination path
    CGPathAddPath( path, &xf, halfPath );

    // Transform to fit the waveform ([0,1] range) into the vertical space
    // ([0,halfHeight] range), flipping the Y axis
    xf = CGAffineTransformIdentity;
    xf = CGAffineTransformTranslate( xf, 0.0, halfHeight );
    xf = CGAffineTransformScale( xf, 1.0, -halfHeight );

    // Add the transformed path to the destination path
    CGPathAddPath( path, &xf, halfPath );

    CGPathRelease( halfPath ); // clean up!

    // Now, path contains the full waveform path.
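If the transforms feel opaque, it may help to see the arithmetic they boil down to. With a translate followed by a scale built this way, each point is scaled first and then translated, so only the Y coordinate changes. The two helpers below are a plain C illustration of that mapping (hypothetical names, no CoreGraphics involved), not a replacement for the code above.

```c
#include <assert.h>

/* Y position of a magnitude in the half above the zero line:
   maps [0,1] onto [half_height, 2 * half_height]. */
static double above_axis_y(double magnitude, double half_height)
{
    assert(half_height >= 0.0);
    return half_height + magnitude * half_height;
}

/* Y position of the mirrored point below the zero line:
   maps [0,1] onto [half_height, 0], i.e. the same shape flipped. */
static double below_axis_y(double magnitude, double half_height)
{
    assert(half_height >= 0.0);
    return half_height - magnitude * half_height;
}
```

In a 200-pixel-tall view (`half_height` of 100), a magnitude of 0.25 lands at y = 125 in the top half and y = 75 in the mirrored half, and a magnitude of 0 sits on the zero line at y = 100 in both.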
Once you have this path, you have a bunch of options for drawing it. For instance, you could fill the path with a solid color, turn the path into a mask and draw a gradient (that's how Capo does it), etc.
Keep in mind, though, that a complex path with lots of points can be slow to draw. Be certain that you don't include more data points in your path than there are horizontal pixels on the screen—they won't be visible, anyway. If necessary, draw in a separate thread to an image, or use CoreAnimation to ensure your drawing happens asynchronously.
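One way to enforce that limit is to work out, before building the path, how many overview values to fold into each point. This is a small sketch in plain C; the `PathBudget` type and `path_budget` function are hypothetical names, and it assumes you want at most one point per horizontal pixel.

```c
#include <assert.h>
#include <stddef.h>

/* How to thin an overview so the path has at most one point per pixel. */
typedef struct {
    size_t stride;       /* overview values folded into each path point */
    size_t point_count;  /* resulting number of path points */
} PathBudget;

static PathBudget path_budget(size_t overview_count, size_t width_px)
{
    assert(width_px > 0);
    PathBudget b;
    /* Round up so the point count never exceeds the pixel width. */
    b.stride = (overview_count + width_px - 1) / width_px;
    if (b.stride == 0)
        b.stride = 1;
    b.point_count = (overview_count + b.stride - 1) / b.stride;
    return b;
}
```

For example, a 264,600-value overview shown in a 1,200-pixel-wide view gives a stride of 221 and 1,198 points. When the overview already has fewer values than there are pixels, the stride stays at 1 and nothing is thrown away.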
Use Shark/Instruments to help you decide whether this needs to be done—it's complicated work, and tough code to get working correctly with very few drawing artefacts. You don't even want to know the crazy code I had to get working in TapeDeck to have chunks of the waveform paged onto the screen. (Well, you might, but that's proprietary information, sorry. ;))
In Conclusion
People have suggested to me in the past that Apple should step up and hand us an API that would give waveform-drawing facilities (and graphs, too!). I disagree, and if Apple were to ever do this, I'd probably never use it. There are simply far too many application-specific design decisions that go into creating a waveform display engine, and whatever Apple would offer would probably only cover a small handful of use cases.
Hopefully the above information can help you build a waveform algorithm that suits your application well. I think that by breaking the problem up into separate sub-problems, you can build a solution that'll work best for your needs.