Ah, I see. Yes, ignoring discretization, an FIR filter corresponds to a spectral filter with a window of infinite size, where the spectral filter is the FFT of the convolution kernel, zero-padded to give it an infinite length as well. The relationship with an STFT (short-time Fourier transform) with overlapping windows is still a little unclear to me, unfortunately.
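To convince myself of the FIR/spectral equivalence in the finite case, here is a small sketch in plain Python. The helper names (dft, convolve_direct, convolve_spectral) and the sample values are made up for illustration, and the DFT is a naive O(N^2) one rather than a real FFT. Zero-padding both signal and kernel to the full output length makes the multiplication of spectra match the ordinary convolution.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT; good enough for a short demonstration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def convolve_direct(x, h):
    """Time-domain FIR filtering: plain linear convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def convolve_spectral(x, h):
    """Same result via the frequency domain, with zero-padding to the output length."""
    n = len(x) + len(h) - 1
    X = dft(list(x) + [0.0] * (n - len(x)))
    H = dft(list(h) + [0.0] * (n - len(h)))
    Y = [a * b for a, b in zip(X, H)]
    return [v.real for v in dft(Y, inverse=True)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0]   # arbitrary test input
kernel = [0.25, 0.5, 0.25]            # arbitrary 3-tap smoothing kernel
direct = convolve_direct(signal, kernel)
spectral = convolve_spectral(signal, kernel)
max_diff = max(abs(a - b) for a, b in zip(direct, spectral))
print(max_diff)  # difference is at rounding-error level
```

Without the zero-padding, the product of spectra would give a circular convolution instead, with the tail of the output wrapping around onto its start.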

However, we can already figure out why the windowing is there. If you take an infinite sinc convolution kernel and chop off the ends to make it finite, chopping off the ends is the same as multiplying with a rectangular window. So the corresponding spectral kernel is the convolution of the FFT of the sinc function (an ideal brick-wall response) with the FFT of the rectangular window. The transform of the rectangular window is itself sinc-shaped, with side lobes that decay only slowly, so the spectral kernel is not perfect: it gets ripple and a transition band of finite width.
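A rough numerical check of the truncation effect, in plain Python. Everything here is an assumption chosen for the demo: 65 taps, a normalized cutoff of 0.1 cycles per sample, a Hann window as the smoother alternative, and a stopband measured from 0.15 upward. It compares the worst-case stopband level of the sinc kernel when it is simply chopped off against the same kernel tapered by the window.

```python
import cmath
import math

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def rectangular(n, num_taps):
    return 1.0  # plain truncation

def hann(n, num_taps):
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))

def lowpass_kernel(num_taps, cutoff, window):
    """Windowed-sinc low-pass; cutoff is in cycles per sample (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    return [2 * cutoff * sinc(2 * cutoff * (n - mid)) * window(n, num_taps)
            for n in range(num_taps)]

def stopband_peak_db(kernel, f_start, steps=400):
    """Largest magnitude response between f_start and 0.5, in dB."""
    peak = 0.0
    for i in range(steps + 1):
        f = f_start + (0.5 - f_start) * i / steps
        response = sum(c * cmath.exp(-2j * math.pi * f * n) for n, c in enumerate(kernel))
        peak = max(peak, abs(response))
    return 20 * math.log10(peak)

NUM_TAPS = 65   # assumed kernel length
CUTOFF = 0.1    # assumed cutoff, cycles per sample
rect_db = stopband_peak_db(lowpass_kernel(NUM_TAPS, CUTOFF, rectangular), 0.15)
hann_db = stopband_peak_db(lowpass_kernel(NUM_TAPS, CUTOFF, hann), 0.15)
print(rect_db, hann_db)  # the Hann-tapered kernel leaks less into the stopband
```

The price for the lower stopband level is a wider transition around the cutoff, which is why the stopband is measured some distance away from it here.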

In the case of an STFT, you can also explain it differently. For an STFT, each finite block of the input signal does not really have a start or end: the transform treats it as periodic, so it wraps around. This means that if there’s a jump from the last sample to the first one, it’s perceived by the STFT as a signal with quite some high-frequency content, even if it’s not really there. The windowing is there to mitigate that.
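The wrap-around effect is easy to see numerically. The sketch below (plain Python, naive DFT, block length and tone frequencies picked arbitrarily) takes a cosine that fits the block a whole number of times, one that fits 4.5 times and therefore jumps at the wrap point, and the same 4.5-cycle tone after a Hann window. For each it reports how large the strongest bin far from the tone is, relative to the peak bin.

```python
import cmath
import math

N = 64  # block length, arbitrary choice for this demo

def dft_magnitudes(block):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(N^2) DFT)."""
    n = len(block)
    return [abs(sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def far_leakage(cycles, use_hann):
    """Largest bin from 16 upward, relative to the peak bin."""
    block = [math.cos(2 * math.pi * cycles * t / N) for t in range(N)]
    if use_hann:
        block = [s * (0.5 - 0.5 * math.cos(2 * math.pi * t / N))
                 for t, s in enumerate(block)]
    mags = dft_magnitudes(block)
    return max(mags[16:]) / max(mags)

leak_whole = far_leakage(4.0, use_hann=False)  # wraps around seamlessly
leak_jump = far_leakage(4.5, use_hann=False)   # jump from last sample to first
leak_hann = far_leakage(4.5, use_hann=True)    # same tone, tapered to zero at the ends
print(leak_whole, leak_jump, leak_hann)
```

The whole-cycle tone shows essentially no energy far from its own bin, the tone with the jump shows a lot, and the window brings it back down because the block now fades to zero at both ends.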

Thinking about it some more, if you want a low-pass filter with a cutoff frequency of 200 Hz at a sampling frequency of 40 kHz, you will need an FFT window of at least 128 samples, probably 256 samples. With this length, it’s probably already more computationally efficient to do it in the frequency domain than in the time domain.
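Some back-of-the-envelope numbers for this, as a sketch. The bin spacing is simply the sampling frequency divided by the FFT size. The cost model is a crude assumption of mine, not a benchmark: a direct FIR costs one multiply per tap per output sample, while a block-based FFT filter costs two radix-2 FFTs plus one spectral product per block, at 4 real multiplies per complex multiply, and each block only yields fft_size - num_taps + 1 new samples.

```python
import math

SAMPLE_RATE = 40_000.0  # Hz, from the example above
CUTOFF = 200.0          # Hz, from the example above

def bin_spacing(fft_size):
    """Frequency resolution of an FFT of the given size, in Hz."""
    return SAMPLE_RATE / fft_size

def direct_cost_per_sample(num_taps):
    """Real multiplies per output sample for time-domain FIR filtering."""
    return num_taps

def fft_cost_per_sample(num_taps, fft_size):
    """Rough real multiplies per output sample for block FFT filtering (assumed model)."""
    # forward + inverse FFT at (N/2) * log2(N) complex multiplies each, plus N for the product
    complex_mults = fft_size * math.log2(fft_size) + fft_size
    new_samples = fft_size - num_taps + 1
    return 4 * complex_mults / new_samples

samples_per_period = SAMPLE_RATE / CUTOFF
print(samples_per_period)   # 200.0 samples in one period of the cutoff frequency
print(bin_spacing(128))     # 312.5 Hz per bin
print(bin_spacing(256))     # 156.25 Hz per bin
print(direct_cost_per_sample(256), fft_cost_per_sample(256, 512))
```

So with 128 samples the 200 Hz cutoff falls between DC and the first bin, and 256 is the first power of two that covers a full period of the cutoff. Under this (very rough) model, a 256-tap kernel with 512-sample blocks already comes out cheaper in the frequency domain.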

Also, an interesting link: https://ccrma.stanford.edu/~jos/sasp/Example_1_Low_Pass_Filtering.html