Demonstration on Music Source Separation Using Sampling-Frequency-Independent Convolutional Layer

Koichi Saito, Tomohiko Nakamura (The University of Tokyo), Kohei Yatabe, and Hiroshi Saruwatari (The University of Tokyo)

Abstract

Audio source separation is often used as preprocessing for various tasks, and one of its ultimate goals is to construct a single versatile preprocessor that can handle every variety of audio signal. One of the most important properties of a discrete-time audio signal is its sampling frequency. Since the sampling frequency is usually task-specific, a versatile preprocessor must handle all the sampling frequencies required by possible downstream tasks. However, conventional models based on deep neural networks (DNNs) are not designed to handle a variety of sampling frequencies, and thus they may not work appropriately for unseen sampling frequencies.
In this paper, we propose sampling-frequency-independent (SFI) convolutional layers capable of handling various sampling frequencies. The core idea of the proposed layers comes from our finding that a convolutional layer can be viewed as a collection of digital filters and thus inherently depends on the sampling frequency. To overcome this dependency, we propose an SFI structure that features analog filters and generates the weights of a convolutional layer from those analog filters. By utilizing time- and frequency-domain analog-to-digital filter conversion techniques, we can adapt the convolutional layer to various sampling frequencies. As an example application, we construct an SFI version of a conventional source separation network. Through music source separation experiments, we show that the proposed layers enable separation networks to work consistently well for unseen sampling frequencies in terms of both objective and perceptual separation quality. We also demonstrate that the proposed method outperforms a conventional method based on signal resampling when the sampling frequencies of the input signals are significantly lower than the trained sampling frequency.
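To make the weight-generation idea concrete, the following is a minimal PyTorch sketch of an SFI convolutional layer. It stores analog-filter parameters (here, a hypothetical modulated-Gaussian prototype with a learnable center frequency, bandwidth, and gain per channel) and converts them into digital FIR weights for any requested sampling frequency by sampling the continuous-time impulse response over a fixed physical duration. The class name SFIConv1d, the filter parametrization, and the time-domain conversion shown here are illustrative assumptions rather than the exact design described in [1].

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SFIConv1d(nn.Module):
    """Single-input-channel SFI convolution: digital weights are generated
    from analog-filter parameters for whatever sampling frequency is given."""

    def __init__(self, out_channels: int, kernel_duration: float = 2.5e-3):
        super().__init__()
        self.kernel_duration = kernel_duration  # filter length in seconds (kept fixed)
        # Learnable analog parameters per channel: center frequency [Hz],
        # bandwidth [Hz], and gain. These replace ordinary digital weights.
        self.center_freq = nn.Parameter(torch.linspace(100.0, 8000.0, out_channels))
        self.bandwidth = nn.Parameter(torch.full((out_channels,), 200.0))
        self.gain = nn.Parameter(torch.ones(out_channels))

    def digital_weights(self, sample_rate: float) -> torch.Tensor:
        """Time-domain analog-to-digital conversion: sample each analog
        impulse response at interval 1/sample_rate over the fixed duration."""
        n_taps = max(1, round(self.kernel_duration * sample_rate))
        t = torch.arange(n_taps, dtype=torch.float32) / sample_rate  # seconds
        envelope = torch.exp(-(self.bandwidth[:, None] * t[None, :]) ** 2)
        carrier = torch.cos(2.0 * math.pi * self.center_freq[:, None] * t[None, :])
        weights = self.gain[:, None] * envelope * carrier
        return weights[:, None, :]  # (out_channels, in_channels=1, n_taps)

    def forward(self, x: torch.Tensor, sample_rate: float) -> torch.Tensor:
        # x: (batch, 1, time); the same learned parameters serve any sample rate.
        w = self.digital_weights(sample_rate)
        stride = max(1, w.shape[-1] // 2)  # keep ~50% frame overlap regardless of rate
        return F.conv1d(x, w, stride=stride)


if __name__ == "__main__":
    layer = SFIConv1d(out_channels=64)
    for fs in (8000, 32000, 48000):
        x = torch.randn(1, 1, fs)  # one second of audio at each sampling frequency
        print(fs, tuple(layer(x, fs).shape))

Because the kernel duration is fixed in seconds, the number of taps and the stride scale with the sampling frequency, which is what allows the same learned parameters to be reused at rates never seen during training.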


Separation examples

Separation results obtained by the following three methods are available. The mixture and ground-truth signals are from the MUSDB18-HQ dataset [2].
  • Conv-TasNet [3]: Directly separate mixture audio signals at untrained sampling frequencies
  • Signal Resampling: Resample mixture audio signals to the trained sampling frequency (32 kHz) before separation (a sketch of this pipeline appears after this list)
  • Proposed SFI Conv-TasNet [1]: Use the proposed SFI convolutional and transposed convolutional layers in Conv-TasNet
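As a concrete illustration of how the second and third methods differ at inference time, below is a minimal Python sketch of both pipelines, assuming a separator pretrained at 32 kHz. The separator objects, the four-source output shape, and the sample_rate argument of the SFI model are illustrative placeholders rather than the interface of the released code; the resampling itself uses torchaudio.functional.resample.

import torch
import torchaudio.functional as AF

TRAINED_FS = 32_000  # sampling frequency the baseline separator was trained at


def separate_with_resampling(separator, mixture: torch.Tensor, fs: int) -> torch.Tensor:
    """Resampling baseline: fs -> 32 kHz -> separate -> back to fs."""
    resampled = AF.resample(mixture, orig_freq=fs, new_freq=TRAINED_FS)
    estimates = separator(resampled)  # (sources, time) at 32 kHz
    return AF.resample(estimates, orig_freq=TRAINED_FS, new_freq=fs)


def separate_with_sfi(sfi_separator, mixture: torch.Tensor, fs: int) -> torch.Tensor:
    """Proposed approach: the SFI network regenerates its filters for fs
    and processes the mixture at its native sampling frequency."""
    return sfi_separator(mixture, sample_rate=fs)  # hypothetical interface


if __name__ == "__main__":
    # Stand-in separator that simply copies the mixture into four "sources".
    dummy_separator = lambda x: x.repeat(4, 1)
    mixture = torch.randn(1, 8000)  # 1 s of a mixture at 8 kHz
    print(tuple(separate_with_resampling(dummy_separator, mixture, fs=8000).shape))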


Conv-TasNet vs. Proposed SFI Conv-TasNet

[Audio examples: for Song 0 and Song 1, each instrument (vocals, bass, drums, other) is provided at 8, 20, 32, 40, and 48 kHz, with columns for the mixture, the ground truth, the Conv-TasNet [3] estimate, and the proposed SFI Conv-TasNet [1] estimate.]


Signal Resampling vs. Proposed SFI Conv-TasNet

[Audio examples: for Song 0 and Song 1, each instrument (vocals, bass, drums, other) is provided at 8, 12, 16, 20, and 24 kHz, with columns for the mixture, the ground truth, the signal-resampling estimate, and the proposed SFI Conv-TasNet [1] estimate.]

References

[1] Koichi Saito, Tomohiko Nakamura, Kohei Yatabe, and Hiroshi Saruwatari, "Sampling-frequency-independent convolutional layer and its application to audio source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2928--2943, Sep. 2022.
[2] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, and Rachel Bittner, "MUSDB18-HQ - an uncompressed version of MUSDB18," Aug. 2019.
[3] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256--1266, May 2019.