Demonstration of Music Source Separation Using Multiresolution Deep Layered Analysis (MRDLA)

Tomohiko Nakamura, Shihori Kozuka, and Hiroshi Saruwatari (The University of Tokyo)

On this demo page, we present music source separation results obtained with our proposed MRDLA [1] and with conventional time-domain audio source separation methods. The mixture and ground-truth signals of the musical instruments (vocals, bass, drums, and other) are taken from the MUSDB18 dataset [2].
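For readers who want to reproduce the setup, below is a minimal sketch of loading a MUSDB18 track with the `musdb` Python package; the dataset root path is a hypothetical placeholder.

import musdb

# Load the MUSDB18 test subset (the root path below is hypothetical).
mus = musdb.DB(root="path/to/MUSDB18", subsets="test")
track = mus.tracks[0]

mixture = track.audio                   # stereo mixture, shape (samples, 2)
vocals = track.targets["vocals"].audio  # ground-truth vocals stem
print(track.name, track.rate)           # track title and 44.1 kHz sampling rate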

In addition to the separated audio signals of each method, we provide so-called minus-one audio estimates, computed by subtracting each separated signal from the corresponding mixture signal. These minus-one estimates help listeners check how much of the target source leaks into the remaining accompaniment.
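As a minimal sketch of this computation (the file names below are hypothetical), the minus-one estimate is simply the sample-wise difference between the mixture and a separated estimate:

import numpy as np
import soundfile as sf

# Read the mixture and one separated estimate (hypothetical file names).
mixture, sr = sf.read("mixture.wav")             # shape: (samples, channels)
estimate, sr_est = sf.read("vocals_estimate.wav")
assert sr == sr_est, "sampling rates must match"

# Trim to a common length in case one signal is slightly shorter.
n = min(len(mixture), len(estimate))
minus_one = mixture[:n] - estimate[:n]

# With perfect separation, minus_one would contain no vocals at all;
# any audible vocals here indicate leakage of the target source.
sf.write("vocals_minus_one.wav", minus_one, sr)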


[Audio demo table. For each of Song 1, Song 2, and Song 3, and for each instrument (vocals, bass, drums, other), the page provides Separated and Minus-one audio clips for the Ground Truth, WaveNet [3], Wave-U-Net [4], Conv-TasNet [5], and the proposed MRDLA [1], alongside the mixture.]

References

[1] Tomohiko Nakamura, Shihori Kozuka, and Hiroshi Saruwatari, “Time-domain audio source separation with neural networks based on multiresolution analysis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1687–1701, Apr. 2021.
(Recipient of the 17th Itakura Prize Innovative Young Researcher Award, Acoustical Society of Japan.)
[2] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, and Rachel Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017.
[3] Francesc Lluís, Jordi Pons, and Xavier Serra, “End-to-end music source separation: Is it possible in the waveform domain?,” in Proc. INTERSPEECH, Sep. 2019, pp. 4619–4623.
[4] Daniel Stoller, Sebastian Ewert, and Simon Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” in Proc. International Society for Music Information Retrieval Conference, Sep. 2018, pp. 334–340.
[5] Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, May 2019.