Rust Audio

Creating a simple audio player, mixer with fade in/out

Hi,

I want to create a simple audio player that can mix multiple audio tracks and fade in and out. I want to be able to control the fading curves (mathematically).

My final goal is to create smooth fades that sound like walking away from something (e.g. a concert), so I guess an LFO would be nice.

So far I have played around with rodio: every few milliseconds I set a new volume with sink.set_volume, but I am not sure this is the best way to go. I would prefer to control the volume for each audio sample. Or is that a bad idea (I am very new to audio programming)? Is this possible with rodio?

Can you recommend a way to proceed or which crates to use?

In the end it should run on a Raspberry Pi.

Some background to my project:

Hi @Tobx, welcome to this forum!

I have to admit I have no experience with rodio. There have been quite a few questions about it on this forum, and it seems that it’s sometimes a bit too high-level for what people want to do. In your case, I think you can perfectly well give it a try (but see below).

If I look at the implementation of set_volume I see it uses a Mutex, which is in general a no-go in real-time audio, but there may be reasons why it’s ok in this context (I think the locking does not run in the real-time thread).

Another aspect is that if the volume (or rather the amplitude) jumps from one value to another, this may cause clicks, and stepping it every millisecond can cause an artifact around 1000 Hz. Now if you do a step every millisecond and transition from silence to full volume in 60 seconds = 60000 milliseconds, the steps are very small and you will probably not notice anything (unless you have a very good ear).

A third aspect is that if the amplitude transitions from silence to full volume linearly, the perceived volume will evolve on a logarithmic scale and you will have the impression that the volume increases a lot at the beginning, but not so much near the end. You can fix this by having the amplitude evolve exponentially. But then the jumps every millisecond will be bigger, so maybe you will hear an artifact around 1000 Hz.
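To make this concrete, here is a small sketch of the two fade-in shapes (the 60 dB range is just an example choice on my part, nothing prescribed), with t running from 0.0 to 1.0 over the fade duration:

// Linear amplitude ramp: perceived loudness grows quickly at first and then
// seems to flatten out towards the end.
fn linear_fade_in(t: f32) -> f32 {
    t
}

// Exponential amplitude ramp covering 60 dB (0.001 = 10^(-60/20)): the
// perceived loudness grows much more evenly.
fn exponential_fade_in(t: f32) -> f32 {
    0.001_f32 * 1000.0_f32.powf(t)
}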

So… I would give it a try and see if it works for your needs. If you want very smooth transitions in volume, I think rodio is not a good match and I would use cpal directly (rodio uses cpal under the hood). Disclaimer: I have no experience using cpal either. You will typically want to use cpal in the real-time thread and read the audio from disk (and maybe also adjust the volume) in another thread. Now, communicating the data between these two threads was a problem … until recently the real-time ring buffer rtrb was announced. You can see why I’m so enthusiastic about this one!
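To sketch the idea (untested, assuming cpal 0.15 and a recent rtrb; older cpal versions have no timeout parameter on build_output_stream, and older rtrb versions return a RingBuffer that has to be .split() into the two ends), the wiring could look roughly like this:

use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use rtrb::RingBuffer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ring buffer for the interleaved samples travelling between the threads.
    let (mut producer, mut consumer) = RingBuffer::<f32>::new(96_000);

    let host = cpal::default_host();
    let device = host.default_output_device().ok_or("no output device")?;
    // Assumes the device's default sample format is f32; otherwise the
    // stream creation below will fail and the format has to be handled.
    let config: cpal::StreamConfig = device.default_output_config()?.into();

    // Real-time side: only non-blocking pops, no locking, no allocation.
    let stream = device.build_output_stream(
        &config,
        move |data: &mut [f32], _: &cpal::OutputCallbackInfo| {
            for sample in data.iter_mut() {
                // If the other thread falls behind, output silence.
                *sample = consumer.pop().unwrap_or(0.0);
            }
        },
        |err| eprintln!("stream error: {err}"),
        None, // timeout (cpal 0.15+ only)
    )?;
    stream.play()?;

    // Non-real-time side (here simply the main thread) keeps the ring buffer
    // topped up; a constant value stands in for audio read from disk.
    loop {
        while producer.push(0.0).is_ok() {}
        std::thread::sleep(std::time::Duration::from_millis(10));
    }
}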

Hi @PieterPenninckx,

thank you for your helpful answer. I have already switched to using cpal directly, because of resampling issues and because I wasn’t happy that per-sample fading isn’t easily possible.

I have now tried dasp for linear resampling and it works great. Sinc resampling still has some issues, but I don’t think I can hear the difference anyway, so I’ll keep it linear for the time being.

That would be my next step, so your comment comes right on time. I found audio_thread_priority and it looks like the way to go for the audio thread. I do not really understand the point of the ring buffer, though. What is the difference compared to std::sync::mpsc::sync_channel, for example? Is it all about memory allocation, as described in your link?

You’ve discovered a cool crate with audio_thread_priority! That’s about the OS giving priority to the audio thread. Honestly, I don’t know how important that is and to what extent it helps, given that the operating systems we use are not designed as real-time operating systems. But I think it doesn’t hurt to use it.

The difference between rtrb and std::sync::mpsc::sync_channel is about locking: sync_channel may use a mutex or something similar in certain situations. I don’t know the precise details (it was in a reddit post that I can no longer find). rtrb is entirely lock-free.

I think the best approach would be to consider your audio player a library (rather than a stand-alone program).

This way, you don’t have to commit to a certain audio backend up front. You only have to decide if you want to provide the audio data to the backend in blocks or frame-by-frame. And whether you want to have a fixed number of channels or not.

Then you can implement a function/method that provides a frame or a block of data, which is supposed to be called repeatedly by the audio backend.

This way you can use it with rodio, cpal, JACK, PortAudio, or whatever backend you happen to find in the future. If it turns out that one backend doesn’t suit your needs, you can simply switch to another one.
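For example (just a sketch with made-up names, not a finished design), the core of such a library could be a method that whatever backend you use (a rodio Source, a cpal callback, a JACK process callback, …) calls repeatedly:

pub struct Player {
    channels: usize,
    // ... decoded tracks, fade state, etc.
}

impl Player {
    /// Fill `output` (interleaved, `self.channels` channels) with the next
    /// block of audio. Called repeatedly by the audio backend.
    pub fn fill_block(&mut self, output: &mut [f32]) {
        for frame in output.chunks_mut(self.channels) {
            for sample in frame {
                *sample = 0.0; // real implementation: mix tracks, apply fades
            }
        }
    }
}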

You could even provide a C API for your player library, which would make it usable in many more situations.

Apart from being more flexible in its usage, it might also become easier to test. For example, you can use some plotting library to plot the generated signals for inspection. This is likely much harder if you hard-code everything in a monolithic program.

To find the right fading curves, it might even be better to step away from Rust for a moment and explore your options with a faster-to-iterate language like Python (which would be my choice), Lua, JavaScript, Julia, or whatever. Once you have found the perfect fading curves/parameters, it should be easy to transfer them to a Rust program/library.


@mgeier Very good point, I think I did it kind of this way already. The backend is currently a struct that receives information like the sample rate and number of channels and provides a Producer<f32> channel (rtrb) to send audio information to (interleaved samples). The receiver runs in a thread with the proposed ring buffer. I could maybe make that a trait to exchange it later if needed. Or were you thinking more of a “real” Rust library (crate)?

@PieterPenninckx I tried rtrb and sync_channel, and the first issue was that, since rtrb is not blocking, the sender somehow has to try repeatedly to send new data when the ring buffer is full. At first I set the sleep duration between retries to 1 ms, but that required a buffer size of at least 512 audio frames to avoid glitches. I currently run my card at 96 kHz, but that seemed quite a lot (> 5 ms latency). Setting the sleep to 1 µs let me reduce the buffer to only 16 frames, but it used almost 10 times more CPU cycles. Anyway, sync_channel used just as many CPU cycles, so rtrb seems to be a good solution. Do you have any experience regarding buffer sizes and sleep times, or am I doing something very bad here?

Is it perhaps better to send bigger blocks between threads rather than single samples?

I care about CPU cycles because I initially tried something with Python. It worked on my laptop, but glitched on the Raspberry Pi. I guess an interpreted language is too slow to calculate exponential fading per audio sample on slow CPUs.

For the fading part I already tried some curves, my current favourite is exp(-6 * t) :slight_smile:

Wow, only now that I have written that down do I realize that I only need to multiply the volume in each frame by something like 0.99 instead of using an expensive exp :woozy_face:
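In code that would be something like this (just a sketch; the factor only has to be computed once per fade):

// exp(-6 * t), with t going from 0 to 1 over the fade, is the same as
// multiplying a running gain by a constant factor once per sample.
fn fade_out_gains(sample_rate: f32, fade_seconds: f32) -> Vec<f32> {
    let num_samples = (sample_rate * fade_seconds) as usize;
    let factor = (-6.0_f32 / (sample_rate * fade_seconds)).exp();
    let mut gain = 1.0_f32;
    let mut gains = Vec::with_capacity(num_samples);
    for _ in 0..num_samples {
        gains.push(gain);
        gain *= factor;
    }
    gains
}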

I have no experience with that. I believed that there was a blocking method as well that would only block the thread that uses it, but it turns out that there isn’t (maybe for good reason). I would send bigger blocks, not just samples.

The thing with timers is that they can run out of sync after a while, especially since the disk-reading thread also does some other work that may take some time. So you need some mechanism to ensure that they stay in sync and that the audio-playing thread always has enough samples. I think you can use slots to get an idea (in the disk-reading thread) of how much work the audio-playing thread still has left. I think it’s also important to keep a decent amount of “spare frames”, just in case the disk-reading thread needs more time than expected to do its job.
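Something along these lines perhaps (just a sketch; read_more_audio_from_disk is a made-up placeholder, and the chunk size and sleep time are arbitrary):

use rtrb::Producer;

fn feeder_loop(mut producer: Producer<f32>, chunk: usize) {
    loop {
        // slots() tells us how much room is left, i.e. roughly how much audio
        // the audio-playing thread still has queued up.
        while producer.slots() >= chunk {
            for sample in read_more_audio_from_disk(chunk) {
                let _ = producer.push(sample); // cannot fail: we checked slots()
            }
        }
        // Plenty of audio buffered; sleep for a while before checking again.
        std::thread::sleep(std::time::Duration::from_millis(50));
    }
}

// Made-up stand-in for actual file reading/decoding.
fn read_more_audio_from_disk(n: usize) -> Vec<f32> {
    vec![0.0; n]
}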

No reason to be ashamed. Enjoy your own discovery of how beautiful math can be :slight_smile:

No, I guess I was talking about the conceptual idea of a library with some (more or less) well defined API. You can move that into a “real” crate later (if desired). You could also implement the library part in lib.rs and the application part (using a concrete audio backend) in main.rs. Or you can use a “workspace” to separate the sub-crates. Either way, I don’t think you have to decide that up front. The important thing is that you have an idea how the different parts are supposed to communicate with each other.

Just to make sure we are talking about the same thing: I assume you are talking about a disk-reading thread that is supposed to write (interleaved) audio data into the ring buffer, right?

When you talk about “buffer size”, do you mean the size of the ring buffer or the buffer size of the audio backend?

If you are talking about the size of the ring buffer, 512 frames seems very small if you are planning to read the audio data from a hard disk (let alone from a network drive!).

Another important question is: are the audio files and their starting times known in advance?

If yes, you can use a very big ring buffer and just pre-load several seconds of audio data.
If the starting time is not known, you might want to consider using multiple ring buffers, one for each audio file. This way you can pre-load the audio file without knowing when its playback will actually start. When it’s time to start playback, you can move the appropriate Consumer<f32> to the audio thread with a RingBuffer<Consumer<f32>>, where it can be read from the realtime thread with a Consumer<Consumer<f32>>. Isn’t it fun to combine multiple ring buffers like this?
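A sketch of that idea (assuming a recent rtrb where RingBuffer::new returns a (Producer, Consumer) pair; all names and sizes are just examples):

use rtrb::{Consumer, RingBuffer};

fn main() {
    // Channel used to hand ready-to-play sounds to the audio thread.
    let (mut sound_tx, mut sound_rx) = RingBuffer::<Consumer<f32>>::new(16);

    // Pre-load one sound into its own ring buffer (normally from disk).
    let (mut audio_tx, audio_rx) = RingBuffer::<f32>::new(48_000);
    while audio_tx.push(0.0).is_ok() {}

    // When it's time to start playback, move its Consumer to the audio thread.
    let _ = sound_tx.push(audio_rx);

    // In the real-time callback: pick up newly started sounds and mix them.
    let mut active: Vec<Consumer<f32>> = Vec::with_capacity(16); // pre-allocated
    if let Ok(consumer) = sound_rx.pop() {
        active.push(consumer); // no allocation thanks to the reserved capacity
    }
    let mut out = 0.0_f32;
    for c in &mut active {
        out += c.pop().unwrap_or(0.0);
    }
    let _ = out;
}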

If your ring buffer is very large, you can also make the sleep duration larger.
If your ring buffer contains several seconds worth of audio, it might be enough to check every second or so whether new audio data is needed.

Here again the “library” view helps: just expose the ring buffer size and the sleep duration in your library API. Let the user worry about the concrete values. It might be nice to suggest some reasonable default values in the documentation, though; the “ideal” values will differ between use cases.

This happens to be quite similar to the use case where I’m using the rtrb::RingBuffer: https://github.com/AudioSceneDescriptionFormat/asdf-rust/blob/master/src/streamer.rs. This is very much work in progress, but you can have a look anyway if you are interested. My code doesn’t use interleaved samples, though. Instead, it uses a fixed block size with the contiguous channel data one after the other. I chose that because the number of channels can be quite high (tens, hundreds?). If you are targeting stereo playback, I think interleaved samples are a good choice.

You are not really “sending blocks”, if I understood correctly.
You have a fixed size ring buffer, one thread is writing to it and another thread is reading from it.

I think it makes sense to write to the ring buffer in kinda large blocks, because the OS will read the file in blocks anyway.
That doesn’t mean you have to read from the ring buffer with the same block size. You can still read the data frame-by-frame if that suits your needs.

OK, that’s a totally different, off-topic, but still interesting problem.
Sample-by-sample processing is known to be slow in Python (PyPy might help, but that might introduce other problems). In Python (especially CPython), you should use NumPy operations that calculate a whole audio block at once. This way the calculations can become orders of magnitude faster.

I don’t have experience with Raspberry Pi, but I’m quite sure you can do a few exponential fades on it if you use NumPy operations.

For a concrete comparison, you can check out these two examples of mine: https://github.com/spatialaudio/jackclient-python/blob/master/examples/midi_sine.py and https://github.com/spatialaudio/jackclient-python/blob/master/examples/midi_sine_numpy.py.

But back to the topic of this forum …

If you do that very often though (millions of times?), you might get accuracy problems, because the rounding errors of all operations might accumulate. Floating point numbers are often quite good at averaging those errors out, but you should still be aware of the potential problem.

Anyway, you should not overestimate how expensive exp is. You should run a few benchmarks to avoid premature optimization.

Another possible optimization (if needed) would be to store the fading curves in lookup tables.
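For example (a sketch, with an arbitrary table size, an arbitrary curve and no interpolation):

// Precompute the fade curve once and index it per sample
// (nearest-neighbour lookup; requires len >= 2).
struct FadeTable {
    values: Vec<f32>,
}

impl FadeTable {
    fn new(len: usize) -> Self {
        let values = (0..len)
            .map(|i| {
                let t = i as f32 / (len - 1) as f32; // t in [0, 1]
                (-6.0 * t).exp()
            })
            .collect();
        Self { values }
    }

    /// Gain for a position `t` in [0, 1] through the fade.
    fn gain(&self, t: f32) -> f32 {
        let i = (t.clamp(0.0, 1.0) * (self.values.len() - 1) as f32) as usize;
        self.values[i]
    }
}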

Well the reason is that the whole thing is supposed to be non-blocking.
You might think that one side (in this case the side of the file-reading thread) could be blocking, but to wake up that thread, the real-time thread would potentially have to make an OS call, which is what we want to avoid in the first place.

If you want to have something similar to a blocking call, you can just check repeatedly in a tight loop. You can call std::thread::yield_now() to avoid maxing out the CPU, or, as discussed above, just std::thread::sleep() a little time between repetitions.

@mgeier Yes you understood me correctly so far.

RingBuffer<Consumer<f32>>: Awesome! I think this solves a lot of headaches.

Preloading music from disk and keeping sound effects (for audio feedback) in memory should do :+1:

Regarding floating-point errors, I just tried it with a billion steps: f32 is completely off, f64 is surprisingly accurate. Anyway, you are right: if the code is simpler with exp, I should not optimize until it’s necessary.
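This is roughly what I tried (run in release mode; with a billion steps the f32 factor exp(-6 / 1e9) apparently rounds to 1.0, so the f32 recurrence never decays at all):

fn main() {
    let steps: u64 = 1_000_000_000;
    let factor64 = (-6.0_f64 / steps as f64).exp();
    let factor32 = factor64 as f32;

    let (mut g64, mut g32) = (1.0_f64, 1.0_f32);
    for _ in 0..steps {
        g64 *= factor64;
        g32 *= factor32;
    }

    println!("closed form   : {}", (-6.0_f64).exp());
    println!("f64 recurrence: {g64}");
    println!("f32 recurrence: {g32}");
}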

Now this took me a while, but I’m still not sure I got it. I am using stereo only, because I have not yet tried to solve the [f32; variable] // is not possible issue. I would like to have a variable channel count, though. Do I understand this correctly:

If you have an interleaved stereo audio stream:
[L1, R1, L2, R2, ...]

You would map it to:
[L1, L2, L3, R1, R2, R3, L4, ...]

for block size 3?

Then you would read chunks of 6 samples (chunk size = channels * block size) and create a &mut “slice of slices” with rsor::Slice that looks like:
[[L1, L2, L3], [R1, R2, R3]]

But why is that different than:
[[L1, R1], [L2, R2], [L3, R3]]

In any case, am I correct that I do not want to iterate over frames like [f32; 2], because the channel count cannot easily be generic, and should instead only iterate over f32 and edit or read the stream using those slice-of-slices references?

Thanks for checking this! TBH, my statement was purely theoretical, it’s nice to get some real data to (at least partially) confirm it.

Well, yes and no.

Yes, that is the ordering I’m using, but my use case is not limited to two-channel signals.

And no, I don’t have interleaved data to begin with. If I had interleaved data, I would probably just keep using it as interleaved.
It’s probably not relevant here, but just to help understand my use case: I have an rtrb::RingBuffer<f32> which I use to transport a multi-channel signal (the number of channels is not known at compile time).
The disk-reading thread is responsible for reading a certain number of files, each of them potentially having a different number of channels (potentially many more than 2) and different file formats. Those files are supposed to be played back at different times, and the channels of those files are somehow mapped to the channels of my ring buffer (in a way that two files never overlap).
Now I could of course store the channels of the ring buffer in an interleaved way, but I wouldn’t gain any advantage from that, because even if the original files are stored as interleaved, I still have to de-interleave them and then re-interleave them with the rest of the ring buffer channels. This sounds complicated.

On the Consumer side of the ring buffer interleaved data wouldn’t help me either, because my main application expects each channel to be contiguous.
My library can also be used as a Pure Data external (https://github.com/AudioSceneDescriptionFormat/asdf-rust/tree/master/pure-data), where the outlets are also provided as contiguous memory.

Long story short, using interleaved data would force me to do additional conversions on both sides of the ring buffer.

But that doesn’t mean that you shouldn’t use interleaved data!

You have to decide for yourself whether to use interleaved data or not. One advantage of interleaved data, compared to my aforementioned use case, is that you don’t have to limit yourself to a fixed block size.

Exactly.

This way it would be more work to extract the channels, which I need to be separated in the end.
If you don’t need them separated, it might be less work this way.

That depends on whether you need it to be generic or not.

You can use a RingBuffer<[f32; 2]> if that makes sense in your situation (but I’ve actually never tried this).
You can also use a RingBuffer<f32> and treat its contents however you see fit, interleaved or not.
You could theoretically also use one RingBuffer<f32> for each channel, but that probably doesn’t make too much sense …

You don’t have to use the “slice of slices” approach, but I find it appropriate for my case.
You can also use split_at() and split_at_mut() to get access to your contiguous channel data, or use chunks_exact() and iterate over the channels.
If you use interleaved data, you don’t need any of this: you can just read it sample-by-sample as f32 or frame-by-frame as &[f32] (or as [f32; 2] if you did use a RingBuffer<[f32; 2]>). Both layouts are sketched below.
In future Rust versions you can probably also use array_chunks().
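Here is a tiny sketch contrasting the two layouts for one stereo block of 3 frames (made-up sample values):

fn main() {
    // Interleaved: [L1, R1, L2, R2, L3, R3] -> iterate frame-by-frame.
    let interleaved = [0.1_f32, -0.1, 0.2, -0.2, 0.3, -0.3];
    for frame in interleaved.chunks_exact(2) {
        let (left, right) = (frame[0], frame[1]);
        let _ = (left, right);
    }

    // Contiguous channels: [L1, L2, L3, R1, R2, R3]
    // -> split into one contiguous slice per channel.
    let mut blocked = [0.1_f32, 0.2, 0.3, -0.1, -0.2, -0.3];
    let (left, right) = blocked.split_at_mut(3);
    for sample in left.iter_mut().chain(right.iter_mut()) {
        *sample *= 0.5; // e.g. apply a gain per channel
    }
}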

I guess what I’m trying to say is that there isn’t a single “best” way of doing things, it depends on your exact requirements.