ISMIR2004
 
Melody Extraction Contest

 

 

Table of contents

 

1.      Purpose

2.      Organizing committee

3.      Calendar

4.      Tuning set

5.      Test set

6.      Format for submissions

7.      Submission procedure

8.      Evaluation metrics

8.1.   Option 1

8.2.   Option 2

8.3.   Option 3

9.      List of participants

10.    Evaluation results

11.    Acknowledgments

12.    References

13.    Annex 1: Tables with results

1.    Purpose

The purpose of the ISMIR2004 Melody Extraction Contest is to compare state-of-the-art algorithms for melody detection in polyphonic audio where the melody is carried by a singing voice or a solo instrument.

2.    Organizing committee

The ISMIR2004 Melody Extraction Contest has been organized by the MTG-UPF. The organizing committee for this contest consists of, in alphabetical order:

-          Emilia Gómez (emilia.gomez@upf.edu)

-          Beesuan Ong (beesuan@iua.upf.es)

-          Sebastian Streich (sstreich@iua.upf.es)

3.    Calendar

-          June 28: Final definition of the contest rules

-          July 1-5: Tuning data made available to participants

-          September 7: Deadline for participants to submit their algorithms (anonymous participation is allowed)

-          October 10-14: Publication of the results of the tests. Prizes will be delivered during the ISMIR 2004 conference in Barcelona.

4.    Tuning set

10 audio excerpts with a melodic transcription of the predominant voice are available for participants to tune their algorithms:

 

-          2 items consisting of a MIDI-synthesized polyphonic sound with a predominant voice.

-          2 items of saxophone melodic phrases plus background music.

-          2 items generated using a singing voice synthesizer plus background music.

-          2 items of opera singing, one with a male voice and one with a female voice.

-          2 items of pop music with singing voice.

 

The tuning set enables the participants to use the same evaluation algorithm that will be used for the final evaluation.

5.    Test set

A total of 20 audio excerpts will be used to evaluate the algorithms: the 10 excerpts from the tuning set plus 10 new excerpts.

Now that the contest is over, the full test set with the reference transcriptions (28.6 MB) is available for download in order to enable comparisons.

6.    Format for submissions

1.      Executable file with the following arguments:

-    Input: wav file, mono and 44.1 kHz sampling rate.

-    Output: txt file with estimated monophonic melody.

 

Example: >> computeMelody in.wav out.txt

 

Format for the output text file:

-    Options 1 and 2: a list of float values representing the predominant F0 in Hz, according to the examples provided in the tuning set.

-    Option 3: a three-column list in which each line holds two float values representing the onset and offset times in seconds and one integer value representing the MIDI note number, according to the examples provided in the tuning set.

 

2.      Output text files for the tuning set. This data will be used to verify that the algorithm is working properly within the testing environment.
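As a rough illustration of the two output formats described above, the following Python sketch reads an option 1/2 listing (taking the F0 in Hz from the last column, as the evaluation code in section 8 does) and an option 3 listing (onset, offset, MIDI note number). The function names and the sample strings are made up for this example; the examples in the tuning set remain the authoritative description of the format.

```python
def parse_f0_lines(text):
    """Return one F0 value in Hz per non-empty line (last column; 0 = unpitched)."""
    return [float(line.split()[-1]) for line in text.splitlines() if line.strip()]


def parse_note_lines(text):
    """Return (onset_s, offset_s, midi_note) tuples, one per non-empty line."""
    notes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        onset, offset, midi = line.split()
        notes.append((float(onset), float(offset), int(midi)))
    return notes


# Hypothetical file contents, not taken from the tuning set.
f0_text = "0.000 0\n0.006 220.0\n0.012 221.5\n"
note_text = "0.50 1.25 57\n1.25 2.00 59\n"

print(parse_f0_lines(f0_text))
print(parse_note_lines(note_text))
```

Reading only the last column means the same parser accepts files with or without a leading time column.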

7.    Submission procedure

Submissions were made via an email interface. Participants sent a message to melody-contest-submit@iua.upf.es with the files attached and the following text in the subject line: SUBMIT <title of your submission>. The message body contained any additional information needed to run the algorithm. The system then sent a reply message with further instructions, including how to delete or update a submission.

8.    Evaluation metrics

Three evaluation metrics have been provided:

 

8.1.             Option 1

Option 1 is a frame-based comparison of the estimated F0 and the reference F0 on a logarithmic scale. The reference was obtained by analyzing the isolated leading voice[1] and then performing some manual checking[2]. A value of 0 Hz was assigned to unpitched frames. The concordance was measured as the average absolute difference in cents, with the error of each frame capped at 1 semitone (= 100 cents); the score is 100 minus this average. Each frame contributed to the final result with the same weight. This measure was computed in MATLAB with the following function:

function [totalMatch, pitchMatch, unpitchMatch] = evalOption1(extracted, reference)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% [totalMatch, pitchMatch, unpitchMatch] = evalOption1(extracted, reference)
% algorithm for the evaluation of melody extractors according to option 1
%
% input:
% extracted     string with path/filename of the extracted melody
% reference     string with path/filename of the reference melody
%
% Both files are assumed to be ASCII files containing data at the same frame rate.
% Unpitched frames are coded as 0Hz pitch.
% The algorithm assumes that the pitch information in Hz for each frame is stored in the
% last column of the files.
%
% output:
% pitchMatch    Concordance measure for the pitched frames (in reference) only
% unpitchMatch  Concordance measure for the unpitched frames (in reference) only
% totalMatch    Combined concordance measure
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

disp('-------------------------------------');

% load text files and crop frequency values (last column)
mel1 = load(extracted);
[frames1, cols] = size(mel1);
mel1 = mel1(:,cols);
mel2 = load(reference);
[frames2, cols] = size(mel2);
mel2 = mel2(:,cols);

% check length of files
if frames1<frames2
    disp(' Warning! Extracted melody is shorter than the reference!');
    disp(' Zeros are appended.');
    mel1(frames1+1:frames2)=0;
end
if frames1>frames2
    disp(' Warning! Extracted melody is longer than the reference!');
    disp(' Melody is truncated.');
    mel1(frames2+1:frames1)=[];
end

%%% compute unpitchMatch
%%% (percentage of frames without pitch in the reference that also have
%%% no pitch in the extracted melody)
unpitched = mel2==0;
nopitchdet = mel1==0;
unpitchMatch = 100*sum(nopitchdet(unpitched))/sum(unpitched);
disp([' unpitched frame accordance:  ',num2str(unpitchMatch),'%']);

%%% compute absolute errors on log frequency scale
% scale conversion to cents for pitched frames (unpitched frames stay at 0)
mel1(~nopitchdet) = 1200*(log2(mel1(~nopitchdet)/13.75)-0.25);
mel2(~unpitched) = 1200*(log2(mel2(~unpitched)/13.75)-0.25);
errCent = abs(mel1-mel2);

% 1 semitone is error threshold
errCent(errCent>100) = 100;

% compute pitchMatch
pitchMatch = 100 - mean(errCent(~unpitched));
disp([' pitched frame accordance:  ',num2str(pitchMatch),'%']);

% compute totalMatch
totalMatch = 100 - mean(errCent);
disp(['     TOTAL ACCORDANCE:  ',num2str(totalMatch),'%']);
disp('-------------------------------------');

%%% plot melody lines (reference: blue circles, extracted: red crosses)
figure;
plot(mel2,'ob');
hold on
plot(mel1,'xr');
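For readers without MATLAB, the same measure can be sketched in Python. This is an unofficial illustration, not the code used in the contest: it takes in-memory lists instead of file names, omits the plot, assumes all F0 values are non-negative, and the names `hz_to_cents` and `eval_option1` are chosen here for the example.

```python
import math


def hz_to_cents(f_hz):
    """Map a frequency in Hz to cents, using the same formula as the MATLAB code."""
    return 1200.0 * (math.log2(f_hz / 13.75) - 0.25)


def _mean(values):
    # Mirror MATLAB's NaN result for an empty selection.
    return sum(values) / len(values) if values else float("nan")


def eval_option1(extracted, reference):
    """Return (total_match, pitch_match, unpitch_match) in percent.

    Both inputs are per-frame F0 lists in Hz; 0 marks an unpitched frame.
    """
    n = len(reference)
    # Append zeros or truncate so the estimate has the reference length.
    extracted = (list(extracted) + [0.0] * n)[:n]

    errors = []
    for est, ref in zip(extracted, reference):
        est_cents = hz_to_cents(est) if est > 0 else 0.0
        ref_cents = hz_to_cents(ref) if ref > 0 else 0.0
        # Absolute error in cents, capped at one semitone.
        errors.append(min(abs(est_cents - ref_cents), 100.0))

    pitched = [e for e, ref in zip(errors, reference) if ref > 0]
    unpitched_hits = [1.0 if est == 0 else 0.0
                      for est, ref in zip(extracted, reference) if ref == 0]

    total_match = 100.0 - _mean(errors)
    pitch_match = 100.0 - _mean(pitched)
    unpitch_match = 100.0 * _mean(unpitched_hits)
    return total_match, pitch_match, unpitch_match


reference = [0.0, 220.0, 440.0, 0.0]
extracted = [0.0, 220.0, 880.0, 0.0]  # octave error in the third frame
print(eval_option1(extracted, reference))
```

In this toy input the octave error is capped at 100 cents, so one of the two pitched frames counts as fully wrong while both unpitched frames are matched.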

 

8.2.             Option 2

Option 2 is the same as option 1, except that the F0 values are mapped into the range of one octave before the absolute difference is computed, so that octave errors are not penalized. This measure was computed in MATLAB with the following function:

function [totalMatch, pitchMatch, unpitchMatch] = evalOption2(extracted, reference)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% [totalMatch, pitchMatch, unpitchMatch] = evalOption2(extracted, reference)
% algorithm for the evaluation of melody extractors according to option 2
%
% input:
% extracted     string with path/filename of the extracted melody
% reference     string with path/filename of the reference melody
%
% Both files are assumed to be ASCII files containing data at
% the same frame rate. Unpitched frames are coded as 0Hz pitch.
% The algorithm assumes that the pitch information in Hz for each
% frame is stored in the last column of the files.
%
% output:
% pitchMatch    Concordance measure for the pitched frames (in reference) only
% unpitchMatch  Concordance measure for the unpitched frames (in reference) only
% totalMatch    Combined concordance measure
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

disp('-------------------------------------');

% load text files and crop frequency values (last column)
mel1 = load(extracted);
[frames1, cols] = size(mel1);
mel1 = mel1(:,cols);
mel2 = load(reference);
[frames2, cols] = size(mel2);
mel2 = mel2(:,cols);

% check length of files
if frames1<frames2
    disp(' Warning! Extracted melody is shorter than the reference!');
    disp(' Zeros are appended.');
    mel1(frames1+1:frames2)=0;
end
if frames1>frames2
    disp(' Warning! Extracted melody is longer than the reference!');
    disp(' Melody is truncated.');
    mel1(frames2+1:frames1)=[];
end

%%% compute unpitchMatch
%%% (percentage of frames without pitch in the reference that also have
%%% no pitch in the extracted melody)
unpitched = mel2==0;
nopitchdet = mel1==0;
unpitchMatch = 100*sum(nopitchdet(unpitched))/sum(unpitched);
disp([' unpitched frame accordance:  ',num2str(unpitchMatch),'%']);

%%% compute absolute errors on log frequency scale with octave mapping
% scale conversion to cents for pitched frames (unpitched frames stay at 0)
mel1(~nopitchdet) = 1200*(log2(mel1(~nopitchdet)/13.75)-0.25);
mel2(~unpitched) = 1200*(log2(mel2(~unpitched)/13.75)-0.25);

% mapping into octave range; the offset of 100 keeps pitched frames
% at least 100 cents away from the value 0 used for unpitched frames
mel1(~nopitchdet) = 100 + mod(mel1(~nopitchdet),1200);
mel2(~unpitched) = 100 + mod(mel2(~unpitched),1200);
errCent = abs(mel1-mel2);

% circular distance -> max error is half an octave
swap = find((errCent>600) & ~nopitchdet & ~unpitched);
errCent(swap) = 1200 - errCent(swap);

% 1 semitone is error threshold
errCent(errCent>100) = 100;

% compute pitchMatch
pitchMatch = 100 - mean(errCent(~unpitched));
disp([' pitched frame accordance:  ',num2str(pitchMatch),'%']);

% compute totalMatch
totalMatch = 100 - mean(errCent);
disp(['     TOTAL ACCORDANCE:  ',num2str(totalMatch),'%']);
disp('-------------------------------------');

%%% plot melody lines (reference: blue circles, extracted: red crosses)
figure;
plot(mel2,'ob');
hold on
plot(mel1,'xr');
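The only step that differs from option 1 is the octave folding, which can be sketched per frame in Python as follows. This is an unofficial illustration under the same conventions (0 Hz marks an unpitched frame, non-negative inputs); the name `chroma_error_cents` is chosen here, and the pitched-versus-unpitched case is handled explicitly instead of through the offset of 100 used in the MATLAB code.

```python
import math


def hz_to_cents(f_hz):
    """Map a frequency in Hz to cents, using the same formula as the MATLAB code."""
    return 1200.0 * (math.log2(f_hz / 13.75) - 0.25)


def chroma_error_cents(est_hz, ref_hz):
    """Octave-folded absolute error in cents for one frame, capped at 100."""
    if est_hz == 0 and ref_hz == 0:
        return 0.0      # both unpitched: perfect agreement
    if est_hz == 0 or ref_hz == 0:
        return 100.0    # voicing disagreement: maximal error
    # Fold both pitches into a single octave.
    est = hz_to_cents(est_hz) % 1200.0
    ref = hz_to_cents(ref_hz) % 1200.0
    err = abs(est - ref)
    # Distance on a circle: never more than half an octave.
    if err > 600.0:
        err = 1200.0 - err
    return min(err, 100.0)


print(chroma_error_cents(880.0, 440.0))  # octave error, ignored by option 2
print(chroma_error_cents(261.0, 262.0))  # small error across an octave boundary
```

The second call shows why the circular step matters: the two frequencies fold to opposite ends of the octave range, yet the reported error is the few cents that actually separate them.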

 

8.3.             Option 3

Option 3 is the edit distance between the estimated melody and the correct melody. The correct melody was obtained by manual score alignment. The edit distance computation is implemented in GUILE for Linux, and the source code is available for download. More information on the edit distance and its implementation can be found in Grachten et al. (2002).
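To give a feel for the idea only, here is a textbook edit distance over sequences of MIDI note numbers in Python, with unit costs for insertion, deletion and substitution. It is not the contest implementation: the GUILE code follows Grachten et al., whose cost model is not reproduced here, and the non-integer values in Table 2 show that the contest costs are not simple unit costs.

```python
def edit_distance(estimated, correct):
    """Unit-cost edit distance between two note sequences (illustrative only)."""
    # prev[j] holds the distance between the processed prefix of `estimated`
    # and the first j notes of `correct`.
    prev = list(range(len(correct) + 1))
    for i, est_note in enumerate(estimated, start=1):
        curr = [i]
        for j, ref_note in enumerate(correct, start=1):
            substitution = prev[j - 1] + (0 if est_note == ref_note else 1)
            deletion = prev[j] + 1       # drop a note from `estimated`
            insertion = curr[j - 1] + 1  # add a missing note from `correct`
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# Hypothetical note sequences (MIDI numbers), not contest data.
correct = [60, 62, 64, 65]
estimated = [60, 64, 65]  # one note missing
print(edit_distance(estimated, correct))
```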

9.    List of participants

ID  Name             Institution                                                Email
1   Rui Pedro Paiva  University of Coimbra                                      ruipedro@dei.uc.pt
2   Sven Tappert     Berlin Technical University                                s_tappert@yahoo.de
3   Graham Poliner   Columbia University                                        graham@ee.columbia.edu
4   Juan P. Bello    Centre for Digital Music, Queen Mary University of London  juan.bello-correa@elec.qmul.ac.uk

10.          Evaluation results

The results of the evaluation for each of the audio excerpts of both the tuning (training) set and the test set are presented in Annex 1. Results were also computed using a monophonic pitch tracker developed in the context of the SMSTools (implemented by the MTG at UPF), in order to establish a baseline.

 

 

The evaluation results show that the best-performing algorithm is algorithm 1, by Rui Pedro Paiva. Congratulations!

 

We also estimated the computation time of each algorithm. This gives an idea of the speed of the different methods, although it was not taken into account when deciding which algorithm is the best. The algorithms were run on two machines: a Windows PC (Pentium, 1.2 GHz, 1 GB RAM) and a Linux PC (Pentium, 2 GHz, 500 MB RAM). Results are presented in the following table:

 

Participant ID                               1        2               3       4
Operating system                             Windows  Linux (MATLAB)  Linux   Linux (MATLAB)
Average time per audio excerpt (in seconds)  3346.67  60.00           470.00  82.50
Average time per audio excerpt (in minutes)  55.78    1.00            7.83    1.38
Average time per audio excerpt (in hours)    0.93     0.02            0.13    0.02

 

These estimates show that algorithms 2 and 4 are the fastest and that algorithm 1 is the slowest.

11.          Acknowledgments

Thanks to Maarten Grachten for providing the algorithm for computing the edit distance between two melodies.

 

Thanks to all the participants and members of the MTG for their contributions.

12.          References

M. Grachten, J. Ll. Arcos, R. López de Mántaras: A Comparison of Different Approaches to Melodic Similarity. ICMAI02.
 http://www.iiia.csic.es/~maarten/articles/MelSim.pdf

 

13.          Annex 1: Tables with results

 

ID                     1                        2                        3                        4                        Baseline
Option                 1      2      Average12  1      2      Average12  1      2      Average12  1      2      Average12  1      2      Average12
TRAINING  daisy2       75.23  75.23  75.23      38.65  69.06  53.86      78.22  78.74  78.48      78.13  78.66  78.40      68.52  71.38  69.95
          daisy3       91.10  91.10  91.10      80.15  80.48  80.31      86.87  87.18  87.03      79.61  79.61  79.61       1.21  29.39  15.30
          jazz2        67.82  68.56  68.19      21.05  55.74  38.40      74.99  74.99  74.99      59.70  67.86  63.78      46.16  57.90  52.03
          jazz3        56.10  56.10  56.10      63.31  65.80  64.56      80.84  80.84  80.84      73.87  73.87  73.87      34.43  43.06  38.74
          midi1        74.77  77.58  76.17      37.80  41.79  39.80      66.60  66.79  66.69      15.79  33.63  24.71       2.58  16.40   9.49
          midi2        74.03  74.03  74.03      75.46  76.43  75.94      78.53  78.53  78.53      77.68  77.68  77.68      17.28  34.38  25.83
          opera_fem2   35.46  35.49  35.48      45.00  46.51  45.75      35.68  35.68  35.68      44.73  44.76  44.74      38.17  44.36  41.26
          opera_male3  26.07  27.09  26.58      13.28  35.59  24.44      33.84  33.94  33.89      14.64  28.77  21.70      44.93  52.91  48.92
          pop1         60.92  61.10  61.01      17.16  39.26  28.21      55.43  55.43  55.43      25.95  34.74  30.35      14.40  18.29  16.35
          pop4         70.81  70.84  70.83      31.81  43.86  37.83      70.82  70.89  70.86      73.08  73.08  73.08      27.44  34.06  30.75
TEST      daisy1       66.55  66.55  66.55      50.71  62.52  56.61      60.38  62.72  61.55      77.23  77.23  77.23      58.18  64.57  61.37
          daisy4       89.58  89.58  89.58      69.22  79.94  74.58      65.04  67.67  66.36      61.94  66.15  64.04      42.63  53.31  47.97
          jazz1        61.46  61.82  61.64      39.37  57.87  48.62      49.67  50.11  49.89      65.66  66.51  66.08      49.74  58.49  54.12
          jazz4        78.26  78.26  78.26      32.83  56.77  44.80      46.41  47.61  47.01      61.11  67.06  64.08      25.12  34.24  29.68
          midi3        64.20  64.22  64.21      61.47  64.37  62.92      50.93  51.42  51.17      42.22  58.30  50.26      32.59  38.63  35.61
          midi4        71.97  74.54  73.25      47.21  52.91  50.06      35.83  41.58  38.71      20.78  37.87  29.33       2.85  13.91   8.38
          opera_fem4   46.96  46.96  46.96      55.84  56.36  56.10      20.04  23.51  21.77      44.40  44.40  44.40      23.44  38.94  31.19
          opera_male5  46.51  47.19  46.85      18.42  49.74  34.08      29.43  30.43  29.93       8.58  34.32  21.45      70.25  74.18  72.21
          pop2         63.94  64.08  64.01      18.89  38.98  28.93      57.67  58.04  57.86      28.96  36.25  32.61      31.70  34.95  33.33
          pop3         73.02  73.73  73.37      26.11  43.56  34.83      45.64  46.69  46.17      62.85  73.17  68.01      23.31  31.24  27.27
Average                64.74  65.20  64.97      42.19  55.88  49.03      56.14  57.14  56.64      50.85  57.70  54.27      32.75  42.23  37.49

Table 1: Evaluation results in % for each of the audio excerpts and the average for all of them, considering Options 1 and 2. Average12 is the mean of the Option 1 and Option 2 results. A zip file with the transcriptions of each participant is available for download.

 

ID                     1      4
TRAINING  daisy2        4.94   6.92
          daisy3        0.49   0.56
          jazz2         6.80   9.66
          jazz3         6.74   6.09
          midi1         6.58  26.78
          midi2         7.66   7.26
          opera_fem2   13.16  13.72
          opera_male3  19.41  26.70
          pop1         11.69  26.19
          pop4          8.25   9.98
TEST      daisy1        8.37  10.24
          daisy4        6.01   8.42
          jazz1         9.80   6.64
          jazz4         1.96   4.56
          midi3         5.30  19.43
          midi4         5.12  24.82
          opera_fem4    9.22   8.64
          opera_male5  22.79  31.39
          pop2         10.30  21.72
          pop3          8.10  12.63
Average                 8.63  14.12

Table 2: Evaluation results as edit distance (option 3) for each of the audio excerpts. A zip file with the transcriptions of each participant is available for download. MIDI files synthesized from the transcriptions are also included for comparison by listening.

 



[1] The analysis window size for the reference was set to 2048 samples and the hop size to 256 samples (with a sampling rate fs = 44.1 kHz). All participants were requested to use the same frame rate in their contributions, because no interpolation was performed.

[2] We only corrected frames that should be considered unpitched but got a pitch assigned by our reference algorithm. We could not guarantee that the F0 was estimated 100% correctly in every single frame.
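The frame rate implied by footnote [1] follows directly from the hop size and the sampling rate, as the short Python sketch below works out. The constant and function names are chosen here for illustration, and measuring frame times from the start of each hop is an assumption; the reference files in the tuning set define the actual time stamps.

```python
SAMPLE_RATE_HZ = 44100  # sampling rate from footnote [1]
HOP_SIZE = 256          # hop size in samples from footnote [1]

frames_per_second = SAMPLE_RATE_HZ / HOP_SIZE
frame_period_ms = 1000.0 * HOP_SIZE / SAMPLE_RATE_HZ


def frame_start_time(index):
    """Start time in seconds of frame `index` (0-based), assuming hop-aligned frames."""
    return index * HOP_SIZE / SAMPLE_RATE_HZ


print(round(frames_per_second, 2))  # frames per second
print(round(frame_period_ms, 3))    # milliseconds between frames
```

A submission produced at a different hop size would drift against the reference frame by frame, which is why the same frame rate was requested.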