As I did two years ago with my DrummerNet paper, I’m open-sourcing the reviews my submission received. I did it back then, and I’m doing it again now, because when I had no paper at ISMIR, I was always very curious about how ISMIR reviewing is done. By not making this information available, I believe, we’d be causing some survivorship bias – only those who have submitted get this useful information.
So here we go. The submitted version of the paper is here. The title of the paper is “Listen, Read, and Identify: Multimodal Singing Language Identification of Music”.
I received many “Strongly Agree” ratings from this reviewer. I’ll introduce only some of the comments here.
11. Please explain your assessment of reusable insights in the paper.
This paper provides reproducible work in singing language identification by using the public Music4All dataset. Preprocessing steps for the audio and text inputs are well presented and based on open-source Python libraries. The model structure is also based on well-known deep learning architectures.
18. Main review and comments for the authors
This paper presented LRID-Net, a deep learning model for singing language identification that takes multimodal data, including an audio input and a text input which combines track title, album name, and artist name. A series of experiments were done using a public dataset, which helps the community reproduce the work. Although the audio and text branches reuse model structures such as ResNet-50 and MLP, this paper presented detailed experiments to demonstrate the performance and properties of LRID-Net under different use cases of input modalities. This provides insights for applying SLID in real-world scenarios.
There are three places that could be clarified further by the authors.
1. As seen in Figure 1, audio drop was applied just after the audio signal. Given that the audio clips are 30 seconds long and the dropout rate was fixed at 0.2, I understand such audio drop to “mute” 6 seconds of the clip. This may not cope with real-world cases where the audio content is partially missing. Why not apply audio drop after the melspectrogram?
2. In 5.4.1 from line 378 to 382, the reasons for “the effects of modality dropout are better reflected on weighted-average scores” need to be rephrased.
3. The authors mentioned they cannot provide a satisfying explanation about some experiment results. If possible, please add any new explanations for camera-ready.
Here, I seem to have confused the reviewer a bit. The audio dropout does *not* mute 20% of the signal – it drops the entire audio signal with a probability of 20%. In other words, what Reviewer #1 is suggesting is already happening.
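To make the distinction concrete, here is a minimal sketch of modality dropout as I use it (the function name and list-based inputs are mine for illustration; the actual implementation differs): with probability r = 0.2 an *entire* modality input is replaced by a zero vector, and unlike standard dropout there is no 1/(1−r) rescaling of the inputs that survive.

```python
import random

def modality_dropout(audio, text, r=0.2, training=True):
    """Drop each *entire* modality with probability r during training.

    Unlike standard dropout, there is no 1/(1 - r) scaling of the
    surviving inputs; at test time a missing modality is simply fed
    to the model as a zero vector of the same size.
    """
    if training:
        if random.random() < r:
            audio = [0.0] * len(audio)  # whole audio input zeroed out
        if random.random() < r:
            text = [0.0] * len(text)    # whole text input zeroed out
    return audio, text
```

So “muting 6 seconds out of 30” never happens: each forward pass either keeps a modality intact or zeros it entirely.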
18. Main review and comments for the authors
This paper proposes a solution for the task of detecting the lyrics language of a given song. It does so in a multimodal fashion, incorporating both raw audio data as well as textual data in the form of track metadata (artist, album and track names).
The proposed model is a relatively simple neural network architecture with separate branches for the audio and the textual features. The authors further propose to adopt the concept of modality dropouts to give their model the ability to handle missing features – e.g., still be able to classify a song based solely on audio data, if no textual metadata is available.
The paper is generally easy to read and understand. The authors stress reproducibility as one of their main contributions, and in my opinion, they reach that goal. They use an openly available dataset, and the information on their model architecture given in the paper seems sufficient to re-implement it, also owing to its relative simplicity. That being said, it would be great if the authors went one step further and made their implementation available as open source.
The experiments described by the authors seem sufficient for reaching their conclusions, and I can’t find any fault with their setup. One further potential limitation of their results is that the ground truth labels they use are also just estimates computed by langdetect (on lyrics data, as opposed to the metadata their model uses). Maybe this could/should be discussed?
As a note on the paper structure, I would suggest moving the Dataset section before the LRID-NET section. This would make some information in the LRID-NET section clearer when reading the paper from top to bottom.
Finally, here are some more detailed points:
– 135: What separator character is used for joining, if any?
– 163: What is the reason for the output size of 11? langdetect supports 55 languages, as you have mentioned before – did you decide to disregard some of those and reduce the set of languages to 11? Edit: I see this is explained later. Maybe shift the order around to explain this earlier. In general, I would suggest moving the Dataset section to before the LRID-NET section.
– 177: “One more difference of modality dropout is that there is no 1/(1−r) scaling when the input is not dropped.”
Maybe this could be explained in a bit more detail.
– 179: “During test time, a system with LRID-Net inputs an arbitrary zero vector to the model if a modality is missing.”
What is an “arbitrary zero vector”? Isn’t there only one zero vector (for every given size)?
– 277: No comma after “which”.
– 296: “German”
– 297: “Third, as summarized in Figure 3, in every metric and averaging strategy, TO model outperformed AO model”
But not in every language – do you have a potential explanation for this?
– 300: “[…] audio _is_ less […]”
– 313 and following: The use of + and – in parentheses is not consistent.
– 329: double “model”
– 358: Was there a particular reason for choosing a rate of 0.2?
– 379: “they”
– 444: I think you mean “eliminate”/”alleviate” the need?
Reviewer #2 gave me a good suggestion: clarifying that the ground truth also comes from langdetect applied to the lyrics. I’ll add this to the camera-ready version.
The reviewer also kindly fixed many grammar issues that I appreciate a lot!
I got a Weak Accept here, but with nice and careful comments throughout. Given the positive comments, though, I guess the reason for the Weak Accept is the limited expected impact.
10. The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.
18. Main review and comments for the authors
*** Novelty of the paper + Stimulation potential ***
> The major contribution of this study is its reproducibility in terms of using a publicly accessible dataset for SLID works.
*** Appropriateness of topic + Importance ***
> While SLID is reportedly not widely studied in the MIR literature, there could have been more implications for future MIR research and practices. For example, how do SLID techniques contribute to user-centered MIR system design and evaluation?
> The term “multimodal” might give the impression to readers that other less typical sources of information have been taken into account, in addition to text and audio. It would be better to give a clear definition early in the paper.
*** Scholarly / Scientific quality ***
> Is there any literature support for justifying the three selected metadata elements (i.e., track title, album name, artist name)?
> It is excellent that the missing data scenario has been considered. Some songs would be in the form of singles from unreleased albums or simply independent singles not affiliated with any albums.
> Why are multilingual lyrics treated as negligible cases? They might reveal important information for the SLID analysis. In relation to the theme of ISMIR this year (“Cultural Diversity in MIR”), it might be common for songs from some countries to contain lyrics from more than one language — this might help identify the unique characteristics of music from particular cultures.
> It is appreciated that the study prevented the problem of having artist-dependent information confound the SLID analysis. This would allow more leeway in that an artist (e.g., from Canada) has songs from more than one language (e.g., English, French).
*** Reusable insights ***
> Given the rising markets of pop music in Greater China (i.e., China, Taiwan, Hong Kong), the absence of specific mentions of Chinese songs merits some interpretation from the authors.
*** Readability and paper organization ***
> Thanks for spelling out the three main contributions of this study (Lines 85–97), which could also have been added to the Abstract to build up readers’ momentum.
One thing: I still think it’s fine to assume multilingual songs are negligibly rare. My definition of lyrics language is not naively literal – e.g., I’m fine with calling a dominantly Korean song with some English words like “baby”, “oooh”, “yes”, and “oh my god” a Korean song. That’s because of the context in which pop music is consumed, which is the (assumed) target scenario. That kind of song can be 100% perceived as Korean by any listener. An example of a “true” multilingual song would be this one.
And this kind of song is pretty rare.
The meta review is also absolutely helpful. But on reproducibility, the meta reviewer seems to have misunderstood a little – which is quite likely and understandable given the amount of work meta reviewers are given (and also, anyone can misunderstand anything). Let me explain it a bit.
18. (Initial) Main review and comments for the authors
The paper addresses the interesting problem of singing language identification (SLID). It proposes a machine learning model that combines textual metadata with audio features, and which is also capable of dealing with cases where some inputs are missing. Results of various experimental evaluations are reported.
The paper is generally well prepared and easy to follow. A number of language issues in particular in the later parts should however be fixed. The topic is relevant to the conference. The approach may have some novelty.
I am not entirely convinced by the work for a number of reasons.
* The authors claim to provide a reproducible work, but do not share the code they used in the experiments and they do not share the datasets that they used. Also, they do not report any model hyperparameters, and specifics regarding the data splitting procedure are missing.
- I specified that I used ResNet-50 with base_dim=64. This is actually enough information to fully reproduce the model, since ResNet-50 is a well-known, precisely defined architecture.
- The data splitting information is 100% transparent, since I open-sourced the code and the resulting splits.
But all the other concerns are actually true!
* The authors do not report if their observed improvements or deteriorations are statistically significant. Some differences seem rather small.
* To me it was furthermore difficult to interpret the obtained precision and recall values on an absolute scale. Are these results good enough to be used in a practical application? The dataset is also highly imbalanced, which makes the interpretation even more difficult. I was also wondering if other metrics than F1/precision/recall could have been used for the multi-class classification problem.
* There are some unexpected observations for which the authors have no explanation. This is worrying as these observations could be the result of a technical error or a design error.
* From a technical perspective, the authors do not provide indications in which ways the audio model contains features that are suited to predict the singing language. Some background should be provided why we expect that such features exist.
This is it. Um… bye!