Rust Audio

How do we describe audio buffers in the ecosystem

One thing that I’m concerned about is that different audio crates can handle audio buffers very differently. In some crates, an audio buffer looks approximately like this

&[i16] // interleaved

whereas in other crates, an audio buffer looks approximately like this

&[&[i16]] // slice of channels

The reason I’m concerned about this is that I’m afraid it may be very complicated to “abstract away” this difference in “higher abstraction crates”.

I had a look at how various audio libraries handle audio buffers and I came to the following conclusion:

Summary

Array of channels:

  • Vst

Each channel is a different thing:

Interleaved:

  • PortAudio (Edit: see below.)

Interleaved/non interleaved

I would like to have a strategy for having an ecosystem of audio crates that can easily be combined. The “easiest” solution would be to agree on one audio format, but, at least in my eyes, there’s no clear winner. In rsynth, I’ve been using the “slice of channels” approach, and I can tell that it has the following advantages and disadvantages:

Advantages of "slice of channels"

  • No “preprocessing” in the form of interleaving needed for the currently supported and planned back-ends (vst, jack, lv2). This one is in fact very important for me, as I want the ‘rsynth’ abstraction to be as “low cost” as possible since CPU power is a very valuable resource in real-time audio processing.
  • Easier support for SIMD (speculative advantage as I have not implemented this yet). I’m concerned that using SIMD with interleaved audio is hard when the audio processing is quite different per channel (e.g. there’s a little delay on one channel).
  • Smoother transition to supporting a variable number of channels (speculative advantage as I have not implemented this yet)
  • In theory support for channels with different sample rates. Though I think a “push-based approach” where the buffer is passed as an argument to the callback is not well-suited in this case, and one should opt for a “pull based approach”, where the callback function has to query the host for the available channels.

Disadvantages of "slice of channels"

  • Rust lifetime issues. Basically, the problem is that the audio host allocates memory for the samples, but not for the slice of channels, so you need to allocate memory for that yourself and reuse it for sample slices with different lifetimes (see the sketch after this list). This creates its own type of challenges. The vecstorage crate is part of the answer (and was developed specially for that purpose), but it’s not the whole answer. This disadvantage may be a “one-off”, though I’m not sure.
  • In many cases, a slice of frames is easier to handle: one does not need to assert that all channels have the same length etc. (speculative disadvantage as I have no experience with this yet).
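To make the lifetime issue from the first bullet more concrete, here is a minimal sketch of the problem. The function names are hypothetical and no real back-end API is used; it only illustrates why an allocation for the “slice of channels” is needed on our side.

fn host_callback(channel_ptrs: &[*mut f32], frames: usize) {
    // The host owns the sample memory and hands us one raw pointer per channel.
    // Building the "slice of channels" needs an allocation of our own. Doing this
    // on every callback is what we want to avoid in real-time code, and reusing a
    // single allocation across callbacks is hard because the `&mut [f32]` slices
    // have a different lifetime on every call; that is the gap that `vecstorage`
    // is meant to fill.
    let mut channels: Vec<&mut [f32]> = channel_ptrs
        .iter()
        .map(|&ptr| unsafe { std::slice::from_raw_parts_mut(ptr, frames) })
        .collect();
    render_audio(&mut channels);
}

fn render_audio(_channels: &mut [&mut [f32]]) {
    // ... the actual audio processing ...
}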

What are your thoughts and opinions on this?
I’m afraid my own thoughts are not yet that fleshed out, but I want to have this discussion as soon as possible.

Sorry to say that, but channel-interleaved audio doesn’t work that way in Rust:

Slices are fat pointers, which means that in memory, an &[T] slice is the same as a (*const T, usize) tuple, where the first value is the start of the memory location and the second value is the number of values in the slice. Some simple code to prove that:

use std::mem::*;

fn main() {
    // A slice has the size of two pointers.
    assert_eq!(2 * size_of::<usize>(), size_of::<&[i16]>());

    // A slice is a tuple of a pointer and a number of elements.
    let data: Box<[i16]> = Box::new([0i16; 32]);
    let data_slice: &[i16] = data.as_ref();
    let data_location: *const i16 = data_slice.as_ptr();
    let data_tuple: (*const i16, usize) =
        unsafe { transmute::<&[i16], (*const i16, usize)>(data_slice) };
    assert_eq!(data_location, data_tuple.0);
    assert_eq!(32, data_tuple.1);
}

This, however, means that the type &[&[i16]] is equivalent to the type (*const &[i16], usize) = (*const (*const i16, usize), usize). It basically is a slice of other fat pointers, which may point to wherever they like.

“Channel-interleaved audio”, however, means that you have a single slice of i16s in which, for stereo, every second value belongs to one channel. In Rust, you would describe this with &[(i16, i16)] or &[[i16; 2]], or you would build a custom smart pointer that does the index calculations for you. I’m sorry to say that, but the “slice of channels” approach is wrong.

Nonetheless, this is a very interesting question: From a software’s point of view, having a slice for every channel is the most natural data layout, but some devices and standards provide/require the audio to be interleaved. One solution would be keeping software generic over Index or Iterator. Slices implement these traits naturally, and you can easily create a smart pointer to extract single channels from an interleaved data slice:

use std::ops::Index;

pub struct InterleavedSlice<'a, T> {
    data: &'a [T],
    channels: usize,
    offset: usize,
}

impl<'a, T> InterleavedSlice<'a, T> {
    pub fn new(data: &'a [T], channels: usize, offset: usize) -> Self {
        Self {
            data,
            channels,
            offset,
        }
    }

    pub fn iter(&self) -> impl Iterator<Item = &'a T> {
        let mut bound: usize = self.data.len() / self.channels;
        if self.data.len() % self.channels > self.offset {
            bound += 1;
        }
        // Copy the fields out so that the closure does not borrow `self` itself.
        let (data, channels, offset) = (self.data, self.channels, self.offset);
        (0..bound).map(move |i| &data[i * channels + offset])
    }
}

impl<'a, T> Index<usize> for InterleavedSlice<'a, T> {
    type Output = T;

    fn index(&self, index: usize) -> &T {
        &self.data[index * self.channels + self.offset]
    }
}
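For illustration, a small usage sketch with arbitrary sample values, extracting the individual channels from an interleaved stereo buffer:

fn main() {
    // Interleaved stereo: left, right, left, right, ...
    let data: [i16; 8] = [0, 100, 1, 101, 2, 102, 3, 103];
    let left = InterleavedSlice::new(&data, 2, 0);
    let right = InterleavedSlice::new(&data, 2, 1);
    assert_eq!(left.iter().copied().collect::<Vec<_>>(), vec![0, 1, 2, 3]);
    assert_eq!(right[2], 102);
}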

The index calculations done here are the same as if you did them on your own, so this is almost a zero-cost abstraction. Software that was written for Index implementors can therefore be used both for individual channel slices and for interleaved data. What do you think about that?

As I’ve mentioned in a comment to my answer there, “interleaved” is not the only option in PortAudio:

The underlying PortAudio library supports a flag paNonInterleaved to change that, but this is typically not available in the Python wrappers. Note that paNonInterleaved uses separate pointers for each channel, i.e. the whole audio data is not necessarily contained in a single contiguous block of memory.

Oops. My bad. Corrected.


Hmm, very valuable input. I’ve had some bad experience trying to put the buffers behind an interface (for<'a> popping up and propagating until the compiler said: “sorry, not supported in this context”). But the way you sketch it gives me hope again. I think it can work. I’ll give it a try and let you know.

Yeah, lifetimes can be a nasty thing. The thing that makes this sketch smooth is the fact that the returned reference only lives for the lifetime of the smart pointer, not longer. Therefore, most lifetimes that would follow can be elided. But believe me, I know your pain. LV2 Atom is a lifetime nightmare! :wink:

Maybe I should note that in standard LV2, there are no interleaved audio channels either. The whole thing is more software-oriented; for example, f32 values between -1.0 and 1.0 are used instead of i16s to make coding easier. However, you could of course define an extension that includes channel-interleaved, signed 16-bit audio. There’s nothing stopping you!

Yeah. Because of this, and for maximum performance, it’s important, I think, to not only support interleaved audio.

Ok, so I did some thinking on how to put the audio buffers behind a trait and it turns out that it’s … well … not so easy, to say the least.

If you think about it, in fact, we’re trying to come up with a collection trait, tailored to our needs. Now, Rust doesn’t have a collection trait because a collection trait would require Generic Associated Types (GATs). So basically, our AudioBuffer trait would also require GATs. I see three ways to get around this:

  1. “hoist the lifetime to the trait”: trait AudioBuffer<'a> {type FrameIter; type ChannelIter; /* ... */ }. This works, but leads to the for<'a> (higher ranked trait bound) that I mentioned before. The good news here is that I couldn’t find a context where a higher ranked trait bound was not supported. This approach is worked out in detail below (as a solution to the “challenge”).
  2. Don’t use an associated type, use a specific type. Downside: performance hit.
    struct FrameIter<'a, B: AudioBuffer + ?Sized> {
        buffer: &'a B,
        frame_index: usize,
    }
    /* Implement `Iterator` for `FrameIter` by using the `get` method. */
    trait AudioBuffer {
        // Only supporting f32 here as an oversimplification.
        fn get(&self, frame_index: usize, channel_index: usize) -> Option<f32>;
        fn frame_iter<'a>(&'a self) -> FrameIter<'a, Self>;
        /* ... */
    }
    
  3. Wait until GATs are stabilized (a sketch of how the trait could look with GATs follows after this list).
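For completeness, here is a rough sketch of how option 3 could look once GATs are available. This does not compile on stable Rust at the time of writing, and the names are only illustrative:

trait AudioBuffer {
    // A generic associated type: the iterator borrows the buffer for 'a.
    type ChannelIter<'a>: Iterator<Item = &'a [f32]>
    where
        Self: 'a;

    fn channel_iter(&self) -> Self::ChannelIter<'_>;
}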

If this is too much abracadabra for you, here’s a more down-to-earth formulation of the problem, which I will formulate as a “challenge”.

Challenge

Flesh out the traits AudioRenderer and AudioBufferMut that are roughly defined as follows:

trait AudioBufferMut {
    type Iter;
    fn iter_mut(self) -> Self::Iter; // or fn iter_mut(&mut self) -> ???
}

trait AudioRenderer<ABM: AudioBufferMut> {
    fn render_audio(&mut self, buffer: ABM); // or render_audio_(&mut self, buffer: &mut ABM) or ???
}

Do this in a way such that the following works:

struct DualAudioRenderer<R1, R2> {
    renderer1: R1,
    renderer2: R2
}

impl<ABM, R1, R2> AudioRenderer<ABM> for DualAudioRenderer<R1, R2> 
where
    R1: AudioRenderer<ABM>, 
    R2: AudioRenderer<ABM>,
    ABM: AudioBufferMut
{
    fn render_audio(&mut self, buffer: ABM) {
        // render silence
        self.renderer1.render_audio(buffer);
        self.renderer2.render_audio(buffer);
    }
}

Solution #1

Full details follow. I had hidden them behind a collapsible section so that you can try it independently; maybe you have a better solution than I have.

Note: I typed this out of my head and didn’t check for compile errors. The full solution at the very end compiles.
First problem: AudioBufferMut does not extend Copy, so

self.renderer1.render_audio(buffer);
self.renderer2.render_audio(buffer); // Error: `buffer` has moved here

We can’t let AudioBufferMut implement Copy because it contains an &mut reference. This can be solved by letting it borrow as mutable:

self.renderer1.render_audio(&mut buffer);
self.renderer2.render_audio(&mut buffer); // No more error!

But this means that the signature of render_audio needs to change:

fn render_audio(&mut self, buffer: &mut ABM);

Now let’s look at how AudioBufferMut was defined:

trait AudioBufferMut {
    type Iter;
    fn iter_mut(self) -> Self::Iter;
}

This requires a value, not just a reference, which would mean that it’s useless if you only have a reference. So we have to change this as well:

trait AudioBufferMut {
    type Iter;
    fn iter_mut(&mut self) -> Self::Iter;
}

This looks nice in theory, but if you try to implement it, you will run into an error:

struct MyAudioBuffer { /* ... */ }

struct MyAudioBufferMutIterator<'b> {
    buffer: &'b mut MyAudioBuffer,
}

impl AudioBufferMut for MyAudioBuffer {
    type Iter = MyAudioBufferMutIterator<'b>;  // Error: unknown lifetime <'b>
    // ...
}

In order to solve this, we “hoist the lifetime to the trait”:

trait AudioBufferMut<'b> {
    type Iter;
    fn iter_mut(&'b mut self) -> Self::Iter;
}

struct MyAudioBuffer { /* ... */ }

struct MyAudioBufferMutIterator<'b> {
    buffer: &'b mut MyAudioBuffer,
    // ...
}

impl<'b> AudioBufferMut<'b> for MyAudioBuffer {
    type Iter = MyAudioBufferMutIterator<'b>;  // Lifetime is known now!
    fn iter_mut(&'b mut self) -> Self::Iter {
        // ...
    }
}

All’s well that ends well? Not yet! We haven’t seen the for<'a> yet. What would the definition of AudioRenderer and its implementation for DualAudioRenderer look like?
One attempt is the following:

trait AudioRenderer<ABM> {
    fn render_audio<'b>(&mut self, buffer: &'b mut ABM)
    where ABM: AudioBufferMut<'b>;
}

impl<ABM, R1, R2> AudioRenderer<ABM> for DualAudioRenderer<R1, R2> 
where
    R1: AudioRenderer<ABM>, 
    R2: AudioRenderer<ABM>,
{
    fn render_audio<'b>(&mut self, buffer: &'b mut ABM) 
    where ABM: AudioBufferMut<'b>
    {
        // render silence
        self.renderer1.render_audio(buffer);
        self.renderer2.render_audio(buffer);
    }
}

But this gives an interesting error:

error[E0499]: cannot borrow `*buffer` as mutable more than once at a time
  --> testje.rs:41:37
   |
36 |     fn render_audio<'b>(&mut self, buffer: &'b mut ABM) 
   |                     -- lifetime `'b` defined here
...
40 |         self.renderer1.render_audio(buffer);
   |         -----------------------------------
   |         |                           |
   |         |                           first mutable borrow occurs here
   |         argument requires that `*buffer` is borrowed for `'b`
41 |         self.renderer2.render_audio(buffer);
   |                                     ^^^^^^ second mutable borrow occurs here

Why is this? Because ABM only implements AudioBufferMut<'b> for this specific lifetime 'b, the borrow on line 40 must last for 'b, which encloses the whole body of the function, so it also includes the borrow on line 41; hence the buffer is borrowed as mutable more than once. So basically, what we want is that ABM also implements AudioBufferMut<'smaller> for every smaller lifetime, so that we can call render_audio on line 40 with a smaller borrow. And this is where for<'a> comes into play. Full solution below:

trait AudioBufferMut<'b> {
    type Iter;
    fn iter_mut(&'b mut self) -> Self::Iter;
}

trait AudioRenderer<ABM> {
    fn render_audio(&mut self, buffer: &mut ABM)
    where for<'b> ABM: AudioBufferMut<'b>;
}

struct MyAudioBuffer { /* ... */ }

struct MyAudioBufferMutIterator<'b> {
    buffer: &'b mut MyAudioBuffer,
    // ...
}

impl<'b> AudioBufferMut<'b> for MyAudioBuffer {
    type Iter = MyAudioBufferMutIterator<'b>;  // Lifetime is known now!
    fn iter_mut(&'b mut self) -> Self::Iter {
        // ...
        unimplemented!();
    }
}

struct DualAudioRenderer<R1, R2> {
    renderer1: R1,
    renderer2: R2
}

impl<ABM, R1, R2> AudioRenderer<ABM> for DualAudioRenderer<R1, R2> 
where
    R1: AudioRenderer<ABM>, 
    R2: AudioRenderer<ABM>,
{
    fn render_audio(&mut self, buffer: &mut ABM) 
    where for<'b> ABM: AudioBufferMut<'b>
    {
        // render silence
        self.renderer1.render_audio(buffer);
        self.renderer2.render_audio(buffer);
    }
}

Question

How do we proceed with this? I see the three options described above:

  1. Use HRTB (for<'a>) as described in more detail above, at least until GATs are stabilized.
  2. Take the performance hit and use a concrete iterator type (I didn’t describe this in full detail).
  3. Wait until GATs are stabilized (if ever).

Question to all: do you see other options? What would you prefer?

I personally don’t fully like the second option, because we would want to have full performance one day, and we’re just postponing the challenge.

I think it would be best to experiment with the first option. I would like to know upfront what you think about it because I want to spend my (and others) time on implementing ideas that are supported in the community.


Okay, I’ve had some thought iterations on this problem. At first, your AudioBufferMut and AudioRenderer traits felt a bit too purpose-built, too specific. Therefore, I started to think: what is it exactly that we want to do? My answer: we want to express a stream of items (frames) that are generated by one or multiple producers (input buffers) and feed one or multiple consumers (output buffers). There’s no need to narrow that down to audio frames; we can talk about any type of item at all.

I have previously done something like that with iterators. However, you can not properly express something like a consumer with the std iterator. Therefore, I’ve only used iterators to “pre-process” the data and did the actual work in a for loop.

I’ve begun my efforts by introducing a consumer trait that would work in tandem with std's iterator, but it soon became quite ugly. Then, a friend pointed out what I actually want: Pipes!

A pipe is something like an iterator, but it also has an input type:

trait Pipe {
    type InputItem;
    type OutputItem;

    fn next(&mut self, item: Self::InputItem) -> Self::OutputItem;
}

You can put one item into the pipe and receive another item. Then, in theory, a producer would be a Pipe with InputItem=() and a consumer would be a Pipe with OutputItem=(). However, there is no actual need to express that in code, which is why I dropped these concepts.

A more useful concept, however, is that of a Pipeline: A Pipe with InputItem=() and OutputItem=bool. You feed a () into it and it returns whether you may continue or not; It is executable, you can run it until it is depleted.
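As a minimal sketch (assuming the Pipe trait defined above), running a pipeline could look like this:

fn run_pipeline<P: Pipe<InputItem = (), OutputItem = bool>>(pipeline: &mut P) {
    // Keep feeding `()` into the pipeline until it reports that it is depleted.
    while pipeline.next(()) {}
}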

You can also concatenate pipes: Given two Pipes P0 and P1 with P0::OutputItem = P1::InputItem, you can simply take an input item for P0, generate the output from P0, feed it into P1 and then return the output from P1. There are also other operations possible which let you create rich and powerful pipes.
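Here is a minimal sketch of such a concatenation, again assuming the Pipe trait defined above (the actual crate may structure this differently):

pub struct Connector<P0, P1>
where
    P0: Pipe,
    P1: Pipe<InputItem = P0::OutputItem>,
{
    pipe0: P0,
    pipe1: P1,
}

impl<P0, P1> Pipe for Connector<P0, P1>
where
    P0: Pipe,
    P1: Pipe<InputItem = P0::OutputItem>,
{
    type InputItem = P0::InputItem;
    type OutputItem = P1::OutputItem;

    fn next(&mut self, item: Self::InputItem) -> Self::OutputItem {
        // Feed the item through the first pipe and pass its output on to the second.
        self.pipe1.next(self.pipe0.next(item))
    }
}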

Another possibility of this concept is software-pipelining: Given multiple pipes with compatible input and output types, you can run these pipes in parallel and pass them the values that were previously calculated. If these pipes are properly balanced, you can dramatically boost hardware usage and performance using these very abstract, powerful tools.
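To sketch that idea (again assuming the Pipe trait from above; this is only an illustration, not how the crate does it), two pipe stages could be run on separate threads connected by a channel, so that the first stage already computes the next item while the second stage is still processing the previous one:

use std::sync::mpsc;
use std::thread;

fn run_two_stages<P0, P1>(
    mut stage0: P0,
    mut stage1: P1,
    inputs: Vec<P0::InputItem>,
) -> Vec<P1::OutputItem>
where
    P0: Pipe + Send + 'static,
    P1: Pipe<InputItem = P0::OutputItem>,
    P0::InputItem: Send + 'static,
    P0::OutputItem: Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    // The first stage runs on its own thread and sends its results over a channel.
    let handle = thread::spawn(move || {
        for item in inputs {
            if sender.send(stage0.next(item)).is_err() {
                break;
            }
        }
    });
    // The second stage consumes the results on the current thread,
    // overlapping its work with the first stage.
    let outputs = receiver.into_iter().map(|item| stage1.next(item)).collect();
    handle.join().unwrap();
    outputs
}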

There are many other great things you can do with pipes that I haven’t thought of yet, but they have a downside too: there is always a 1:1 relationship between input items and output items. This means that you cannot take one input item and generate multiple output items, or consume multiple input items to generate one output item; basic and beloved operations like filter or filter_map simply aren’t possible on pipes. If you need to express something like that, you need to create a complete pipeline in advance and wrap it in an iterator or a producer. It will probably work and look quite good, but it solves the problem by working around it.

I’ve sketched the whole thing up in a repository and also added some sugar. You should take a look at the example first; I think it looks quite neat. What do you think about it? Maybe there is already something like this and I just couldn’t find it, but if there isn’t, I would love to continue the work on them. These pipes just feel great!

Hi Janonard, thanks for thinking about this and sharing your thoughts.
You’re answering a different question here, but I think that’s OK: people often get stuck because they don’t answer the right question. In this case, I already have an answer in the form of the AudioRenderer and ContextualAudioRenderer traits. Before I explain the reasoning behind the different choices that I made there, I would like to let you know that I like the Pipe trait that you describe, because it’s very easy to understand what’s going on.

Here are my opinions about the main differences:

Associated types vs generics

The Pipe trait that you describe has associated types, but one could also use type parameters as follows:

trait Pipe<I, O> {
    fn next(&mut self, input: I) -> O;
}

The advantage of using generics is that one struct can implement the generic trait multiple times, with different type parameters, whereas with an associated type the trait can only be implemented once, with one particular associated type. This difference is relevant for VSTs, where the floating point type (f32 or f64) is only known at run time. With generics, one can have a single struct instance and then decide at run time which method to call. With associated types, one struct can render f32 audio or f64 audio, but not both, so you would need two struct instances.
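To make that concrete, here is a minimal sketch with a hypothetical Render trait and HostBuffer enum (not actual rsynth or vst types), showing how a generic trait lets a single struct instance handle both sample types at run time:

// Hypothetical generic trait, analogous to a generic version of `Pipe`.
trait Render<S> {
    fn render(&mut self, buffer: &mut [S]);
}

struct MySynth;

impl Render<f32> for MySynth {
    fn render(&mut self, buffer: &mut [f32]) {
        for sample in buffer.iter_mut() {
            *sample = 0.0; // render silence
        }
    }
}

impl Render<f64> for MySynth {
    fn render(&mut self, buffer: &mut [f64]) {
        for sample in buffer.iter_mut() {
            *sample = 0.0; // render silence
        }
    }
}

// The host only tells us at run time whether it uses f32 or f64 buffers.
enum HostBuffer<'a> {
    F32(&'a mut [f32]),
    F64(&'a mut [f64]),
}

fn process(synth: &mut MySynth, buffer: HostBuffer) {
    match buffer {
        HostBuffer::F32(b) => synth.render(b),
        HostBuffer::F64(b) => synth.render(b),
    }
}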

Single frames vs multiple frames

The Pipe trait is mostly usable in practice for rendering audio frame by frame (more about an important exception to that below). One could theoretically work as follows:

impl Pipe for MyStruct {
    type InputItem = ([f32; 128], [f32; 128]);
    type OutputItem = ([f32; 128], [f32; 128]);
    // ...
}

This inevitably implies holding large structs on the stack, and then you have to take care somehow to avoid copying them around too many times in order to keep the plugin fast. But why would one want to use such large buffers in the first place? In order to be able to use SIMD: if you work frame by frame on mono audio, there’s no easy way to use SIMD.

The Pipe approach that you describe can be used in a SIMD-oriented way as follows:

impl Pipe for MyStruct {
    // f32x4 being a SIMD vector type, e.g. from the `packed_simd` crate.
    type InputItem = (f32x4, f32x4);
    type OutputItem = (f32x4, f32x4);
    // ...
}

I think it would be worth the effort to explore such an approach (maybe with generics instead).

Context

A minor thing: the render_buffer method in the ContextualAudioRenderer trait has an extra parameter context: this allows for instance accessing the host, …

Some other notes:

I feel your enthusiasm here :slight_smile:

I went down this road (or a variation of this road), but when I mixed in event handling (e.g. midi events), the API became monstrous and I removed the whole thing. It looked nice in theory: chaining “middleware” where each middleware takes responsibility for one aspect, but it turned into a nightmare that led me to write very obscure macros.

I met a JUCE maintainer at Fosdem and I asked him how interleaved audio is handled in JUCE. He told me that JUCE just converts it to separate channels and that, while it seems to be a CPU-consuming thing to do, it is not a real problem in practice.

I’ve been thinking about this topic some more, and I think part of the problem is that we (or at least I) do not have a clear view on what the requirements are. I think that, instead of trying to solve everything in theory (which can easily lead me to a wrong path), I’ll simply wait, see what incompatibilities between crates emerge and try to fix these on a case-by-case basis before trying to find a general solution.

Thanks for your reply, I’m glad you like the concept! :relaxed:. It’s true, I didn’t directly answer the question. Let’s do that now:

Assuming that the question is “How do we describe input and output buffers in a general, portable way that includes both interleaved and non-interleaved audio?”, my answer is: “We should describe input buffers as producer pipes and output buffers as consumer pipes”. I hope this is more satisfying! :wink:

Associated types vs generics

Interesting point. At first, I just used associated types simply because Iterator does so too and I especially didn’t know about the f32 and f64 thing in VST. However, associated types have another advantage: They can be elided. For example, the pipe connector is defined as

pub struct Connector<P0, P1>
where 
    P0: Pipe,
    P1: Pipe<InputItem=P0::OutputItem>,
{
    pipe0: P0,
    pipe1: P1,
}

If pipes were generic, you would need to write something like this:

pub struct Connector<I, M, O, P0, P1>
where
    P0: Pipe<I, M>,
    P1: Pipe<M, O>,
{
    pipe0: P0,
    pipe1: P1,
    items: PhantomData<(I, M, O)>,
}

Also, if you have a pipe that internally stores frames (e.g. because it’s a delay), it has to be generic over the type of frame it stores. Then, the whole pipeline has to be generic over the frame type too and you’ve won nothing. :confused: Also, if a type implemented Pipe multiple times, it would be harder to determine which type the result of a >> should have, maybe even impossible.

In the end, I think that associated types are the better option because the resulting trait is only implementable once.

Single frames vs multiple frames

I also wanted to write about that but forgot that too!

In most situations, a pipe’s next method is only called from one place in the entire binary (by another pipe or the root loop). Therefore, it is a perfect candidate for inlining. In the end, the generated LLVM IR should be the same as if you were manually iterating over a slice. Then, and only then, the LLVM code generator can use SIMD instructions when translating the IR to x86_64 machine code. Maybe you need to explicitly tell the compiler to do so, and one should definitely verify this by looking at the generated assembly code, but as far as I’m concerned, pipes are a zero-cost abstraction.

Event Handling

Yep, event handling (or, more abstractly, 1:n item relationships) is bothering me as well as the colleague I talked to. We’ve also experimented with a definition that receives a FnMut to retrieve an arbitrary number of items instead of a single item, but it didn’t quite work out. I hope to handle this problem by wrapping pipes into another consumer, but I haven’t ventured into that too far.

Wrap-Up

I guess your point is quite right: We should simply try to continue our efforts and eliminate problems when they arise. I hope to continue with the pipes and maybe, if it works out, more projects can depend on it.