Table of contents
12. Annex 1: Tables with results
The purpose of the ISMIR2004 Melody Extraction Contest is to compare state-of-the-art algorithms for melody detection in polyphonic audio, where the melody is carried by a singing voice or a solo instrument.
The ISMIR2004 Melody Extraction Contest has been organized by the MTG-UPF. The organizing committee for this contest consists of, in alphabetical order:
- Emilia Gómez (emilia.gomez@upf.edu)
- Beesuan Ong (beesuan@iua.upf.es)
- Sebastian Streich (sstreich@iua.upf.es)
- June 28: Final definition of the contest rules
- July 1-5: Tuning data made available to participants
- September 7: Deadline for participant submission of the algorithms (anonymous participation is allowed)
- October 10-14: Publication of the results of the tests. Prizes will be delivered during the ISMIR 2004 conference in
10 audio excerpts with a melodic transcription of the predominant voice are available for participants to tune their algorithms:
- 2 items consisting of a
- 2 items of saxophone melodic phrases plus background music.
- 2 items generated using a singing voice synthesizer plus background music.
- 2 items of opera singing, one with a male and one with a female voice.
- 2 items of pop music with singing voice.
The tuning set enables the participants to use the same evaluation algorithm that will be used for the final evaluation.
A total of 20 audio excerpts (10 new excerpts in addition to the 10 excerpts from the tuning set) will be used to evaluate the algorithms.
Now that the contest is over, we provide the full test set with the reference transcriptions (28.6 MB) for download in order to enable comparisons.
1. Executable file with the following arguments:
- Input: wav file, mono, 44.1 kHz sampling rate.
- Output: txt file with the estimated monophonic melody.
Example: >> computeMelody in.wav out.txt
Format for the output text file:
- Option 1,2: list of float values representing the predominant F0 in Hz, according to the examples provided in the tuning set.
- Option 3: three-column list with two float values representing onset and offset time in seconds, and one integer value representing the
2. Output text files for the tuning set. This data will be used to verify that the algorithm is working properly within the testing environment.
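As an illustration of the Option 1,2 output format, the sketch below writes one F0 value per frame to a text file and reads it back. The two-column layout (time stamp, then F0) is an assumption; the evaluation functions in this document only require that the F0 in Hz is stored in the last column. The F0 values and the temporary file name are hypothetical, and the hop size of 256 samples at 44.1 kHz is the frame rate stated for the reference transcriptions.

```matlab
% Frame rate of the reference transcriptions (see footnote [1]).
fs  = 44100;   % sampling rate in Hz
hop = 256;     % hop size in samples

% Hypothetical per-frame F0 estimates in Hz; 0 marks an unpitched frame.
f0 = [0 0 220.0 221.3 440.0];

% Assumed time stamp of each frame in seconds.
t = (0:numel(f0)-1) * hop / fs;

% Write one frame per line: time stamp, tab, F0.
outFile = [tempname() '.txt'];   % hypothetical output path
fid = fopen(outFile, 'w');
fprintf(fid, '%.6f\t%.3f\n', [t; f0]);
fclose(fid);

% Read the file back the same way the evaluation functions do
% and keep only the last column.
data = load(outFile);
f0Read = data(:, end);
delete(outFile);
```

Writing the F0 last keeps the file compatible with the `mel1(:,cols)` column selection used by the evaluation functions, whether or not a time column is present.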
Submissions were made via an email interface. Participants sent a message to melody-contest-submit@iua.upf.es with the files attached and the following text in the subject line: SUBMIT <title of your submission>. The message body contained any additional information necessary to run the algorithm. The system then sent a reply message with further instructions, including how to delete or update a submission.
Three evaluation metrics have been provided:
Option 1 is a frame-based comparison of the estimated F0 and the reference F0 on a logarithmic scale. The reference was obtained by analyzing the isolated leading voice[1] and then performing some manual checking[2]. A value of 0 Hz was assigned to unpitched frames. The concordance was measured as the average absolute difference, with the error of each frame capped at 1 semitone (= 100 cents). Each frame contributed to the final result with the same weight. This measure was computed in MATLAB with the following function:
function [totalMatch, pitchMatch, unpitchMatch] = evalOption1(extracted, reference);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% [totalMatch, pitchMatch, unpitchMatch] = evalOption1(extracted, reference)
% algorithm for the evaluation of melody extractors after option 1
% input:
% extracted string with path/filename of the extracted melody
% reference string with path/filename of the reference melody
% Both files are assumed to be ASCII files containing data at the same frame rate.
% Unpitched frames are coded as 0Hz pitch.
% The algorithm assumes that the pitch information in Hz for each
% frame is stored in the last column of the files.
%
% output:
% pitchMatch Concordance measure for the pitched frames (in reference) only
% unpitchMatch Concordance measure for the unpitched frames (in reference) only
% totalMatch Combined concordance measure
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
disp('-------------------------------------');
% load text files and crop frequency values
mel1 = load(extracted);
[frames1, cols] = size(mel1);
mel1 = mel1(:,cols);
mel2 = load(reference);
[frames2, cols] = size(mel2);
mel2 = mel2(:,cols);
% check length of files
if frames1<frames2
disp(' Warning! Extracted melody is shorter than the reference!');
disp(' Zeros are appended.');
mel1(frames1+1:frames2)=0;
end
if frames1>frames2
disp(' Warning! Extracted melody is longer than the reference!');
disp(' Melody is truncated.');
mel1(frames2+1:frames1)=[];
end
%%% compute unpitchMatch
%%% (percentage of frames unpitched in the reference that are also unpitched in the extracted melody)
unpitched = mel2==0;
nopitchdet = mel1==0;
unpitchMatch = 100*sum(nopitchdet(unpitched))/sum(unpitched);
disp([' unpitched frame accordance: ',num2str(unpitchMatch),'%']);
%%% compute absolute errors on log frequency scale
% scale conversion for pitched frames
mel1(~nopitchdet) = 1200*(log2(mel1(~nopitchdet)/13.75)-0.25);
mel2(~unpitched) = 1200*(log2(mel2(~unpitched)/13.75)-0.25);
errCent = abs(mel1-mel2);
% 1 semitone is error threshold
errCent(errCent>100) = 100;
% compute pitchMatch
pitchMatch = 100 - mean(errCent(~unpitched));
disp([' pitched frame accordance: ',num2str(pitchMatch),'%']);
% compute totalMatch
totalMatch = 100 - mean(errCent);
disp([' TOTAL ACCORDANCE: ',num2str(totalMatch),'%']);
disp('-------------------------------------');
%%% plot melody lines
figure;
plot(mel2,'ob');
hold on
plot(mel1,'xr');
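To make the scoring concrete, the following sketch applies the same cent conversion and 100-cent cap as evalOption1 to a few hand-made frames instead of files. The frame values are hypothetical and chosen only to show each case; the variable names mirror the function above, but the snippet is not part of the contest code.

```matlab
% Hypothetical frame-wise F0 values in Hz; 0 marks an unpitched frame.
ref = [440 440 220 220 0 0];
est = [440 440*2^(50/1200) 440 0 0 110];

% Same mapping as in evalOption1: cents above 13.75*2^0.25 Hz (about 16.35 Hz).
toCents = @(f) 1200*(log2(f/13.75) - 0.25);   % toCents(440) is 5700

unpitched  = (ref == 0);
nopitchdet = (est == 0);
refC = ref;  refC(~unpitched)  = toCents(ref(~unpitched));
estC = est;  estC(~nopitchdet) = toCents(est(~nopitchdet));

% Absolute error per frame, capped at one semitone (100 cents).
errCent = min(abs(estC - refC), 100);          % approx. [0 50 100 100 0 100]

unpitchMatch = 100*sum(nopitchdet(unpitched))/sum(unpitched);  % 50
pitchMatch   = 100 - mean(errCent(~unpitched));                % approx. 37.5
totalMatch   = 100 - mean(errCent);                            % approx. 41.67
```

In this example an octave error (frame 3), a missed pitched frame (frame 4) and a false detection in an unpitched frame (frame 6) all receive the full 100-cent penalty, while the 50-cent deviation in frame 2 costs half of that.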
Option 2 is the same as option 1, except that the F0 values are mapped into the range of one octave before the absolute difference is computed. This measure was computed in MATLAB with the following function:
function [totalMatch, pitchMatch, unpitchMatch] = evalOption2(extracted, reference);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% [totalMatch, pitchMatch, unpitchMatch] = evalOption2(extracted, reference)
% algorithm for the evaluation of melody extractors after option 2
%
% input:
% extracted string with path/filename of the extracted melody
% reference string with path/filename of the reference melody
%
% Both files are assumed to be ASCII files containing data at
% the same frame rate. Unpitched frames are coded as 0Hz pitch.
% The algorithm assumes that the pitch information in Hz for each
% frame is stored in the last column of the files.
%
% output:
% pitchMatch Concordance measure for the pitched frames (in reference) only
% unpitchMatch Concordance measure for the unpitched frames (in reference) only
% totalMatch Combined concordance measure
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
disp('-------------------------------------');
% load text files and crop frequency values
mel1 = load(extracted);
[frames1, cols] = size(mel1);
mel1 = mel1(:,cols);
mel2 = load(reference);
[frames2, cols] = size(mel2);
mel2 = mel2(:,cols);
% check length of files
if frames1<frames2
disp(' Warning! Extracted melody is shorter than the reference!');
disp(' Zeros are appended.');
mel1(frames1+1:frames2)=0;
end
if frames1>frames2
disp(' Warning! Extracted melody is longer than the reference!');
disp(' Melody is truncated.');
mel1(frames2+1:frames1)=[];
end
%%% compute unpitchMatch
%%% (percentage of frames unpitched in the reference that are also unpitched in the extracted melody)
unpitched = mel2==0;
nopitchdet = mel1==0;
unpitchMatch = 100*sum(nopitchdet(unpitched))/sum(unpitched);
disp([' unpitched frame accordance: ',num2str(unpitchMatch),'%']);
%%% compute absolute errors on log frequency scale with octave mapping
% scale conversion for pitched frames
mel1(~nopitchdet) = 1200*(log2(mel1(~nopitchdet)/13.75)-0.25);
mel2(~unpitched) = 1200*(log2(mel2(~unpitched)/13.75)-0.25);
% mapping into octave range
mel1(~nopitchdet) = 100 + mod(mel1(~nopitchdet),1200);
mel2(~unpitched) = 100 + mod(mel2(~unpitched),1200);
errCent = abs(mel1-mel2);
% circular distance within the octave -> max error is half an octave
swap = find((errCent>600) & ~nopitchdet & ~unpitched);
errCent(swap) = 1200 - errCent(swap);
% 1 semitone is error threshold
errCent(errCent>100) = 100;
% compute pitchMatch
pitchMatch = 100 - mean(errCent(~unpitched));
disp([' pitched frame accordance: ',num2str(pitchMatch),'%']);
% compute totalMatch
totalMatch = 100 - mean(errCent);
disp([' TOTAL ACCORDANCE: ',num2str(totalMatch),'%']);
disp('-------------------------------------');
%%% plot melody lines
figure;
plot(mel2,'ob');
hold on
plot(mel1,'xr');
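The effect of the octave mapping can be seen on two hand-made frames. The sketch below compares the option 1 and option 2 errors for an octave error and for a small error that straddles an octave boundary. All values are hypothetical, `fromCents` is a helper introduced here only to construct the inputs, and every frame is pitched, so the pitched/unpitched masks of evalOption2 are left out.

```matlab
% toCents is the mapping used in the evaluation functions;
% fromCents is a hypothetical inverse used only to build the inputs.
toCents   = @(f) 1200*(log2(f/13.75) - 0.25);
fromCents = @(c) 13.75*2.^(0.25 + c/1200);

% Frame 1: octave error (5700 vs. 6900 cents, i.e. 440 Hz vs. 880 Hz).
% Frame 2: 30-cent error across an octave boundary (5990 vs. 6020 cents).
ref = fromCents([5700 5990]);
est = fromCents([6900 6020]);

% Option 1: plain difference in cents, capped at 100.
err1 = min(abs(toCents(est) - toCents(ref)), 100);   % approx. [100 30]

% Option 2: fold into one octave (offset 100 as in evalOption2), then
% take the circular distance so the error never exceeds 600 cents.
refChroma = 100 + mod(toCents(ref), 1200);
estChroma = 100 + mod(toCents(est), 1200);
err2 = abs(estChroma - refChroma);
wrap = err2 > 600;
err2(wrap) = 1200 - err2(wrap);
err2 = min(err2, 100);                                % approx. [0 30]
```

The octave error disappears under option 2, while the 30-cent error is preserved only because of the wrap-around step: without it, the folded values 1290 and 120 would differ by 1170 cents. Judging from the code, the offset of 100 keeps every folded pitch at least 100 cents away from the value 0 that marks unpitched frames, so a pitched/unpitched mismatch still receives the full penalty.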
Option 3 is the edit distance between the estimated melody and the correct melody. The correct melody was obtained by manual score alignment. The edit distance computation is implemented in GUILE for Linux. The source code is available for download. More information on the edit distance and its implementation can be found in Grachten et al. 2002.
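For readers unfamiliar with edit distances, the sketch below computes a plain unit-cost Levenshtein distance between two short note sequences. It is only an illustration of the general idea and is not the contest measure: the GUILE implementation follows Grachten et al. and its cost model is not reproduced here. The note sequences are hypothetical MIDI note numbers.

```matlab
% Hypothetical note sequences as MIDI note numbers.
a = [60 62 64 65 67];      % reference melody
b = [60 62 63 65 67 69];   % extracted melody

n = numel(a);
m = numel(b);

% D(i+1,j+1) holds the distance between the first i notes of a
% and the first j notes of b.
D = zeros(n+1, m+1);
D(:,1) = (0:n)';   % deleting i notes
D(1,:) = 0:m;      % inserting j notes

for i = 1:n
  for j = 1:m
    subCost = double(a(i) ~= b(j));            % 0 if the notes match, else 1
    D(i+1,j+1) = min([D(i,j+1) + 1, ...        % deletion
                      D(i+1,j) + 1, ...        % insertion
                      D(i,j) + subCost]);      % substitution or match
  end
end

editDistance = D(n+1, m+1);   % 2: one substitution (64 -> 63), one insertion (69)
```

A melodic edit distance typically replaces the unit costs with weights that depend on pitch and duration, which would explain why the option 3 results are not integers.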
ID | Name | Institution | email |
1 | | | |
2 | Sven Tappert | Berlin Technical University | |
3 | Graham Poliner | Columbia University | |
4 | Juan P. Bello | Centre for Digital Music, Queen Mary University of London | |
The results of the evaluation, expressed as percentages, for each of the audio excerpts of both the training and the test set are presented in Annex 1. Results were also computed using a monophonic pitch tracker developed in the context of the SMSTools (implemented by the MTG of the UPF), in order to establish a baseline.
Evaluation results show that the best-performing algorithm is algorithm 1 by
We also estimated the computation time for each of the algorithms. This gives an idea of the performance of the different methods, although it is not a measure to be taken into account when deciding which algorithm is best. The algorithms were run on two machines: a Windows PC (Pentium 1.2 GHz, 1 GB RAM) and a Linux PC (Pentium 2 GHz, 500 MB RAM). Results are presented in the following table:
Participant ID | 1 | 2 | 3 | 4 |
Operating system | Windows | Linux (MATLAB) | Linux | Linux (MATLAB) |
Average Time Per Audio Excerpt (seconds) | 3346,67 | 60,00 | 470,00 | 82,50 |
Average Time Per Audio Excerpt (minutes) | 55,78 | 1,00 | 7,83 | 1,38 |
Average Time Per Audio Excerpt (hours) | 0,93 | 0,02 | 0,13 | 0,02 |
This estimate shows that the fastest algorithms are algorithms 2 and 4. Algorithm 1 is the slowest.
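The three "Average Time Per Audio Excerpt" rows in the table above differ from one another by a factor of 60, which is consistent with the same measurement expressed in seconds, minutes and hours. The sketch below checks this reading; the unit assignment is an assumption inferred from the numbers and is not stated in the source.

```matlab
% Values of the first timing row for participants 1 to 4,
% assumed here to be seconds.
tSec  = [3346.67 60.00 470.00 82.50];
tMin  = tSec / 60;     % assumed minutes
tHour = tMin / 60;     % assumed hours

% Round to two decimals for comparison with the table.
tMinRounded  = round(tMin  * 100) / 100;   % [55.78 1.00 7.83 1.38]
tHourRounded = round(tHour * 100) / 100;   % [0.93 0.02 0.13 0.02]

% Ratio between the slowest (1) and the fastest (2) participant.
slowdown = tSec(1) / tSec(2);
```

Since the algorithms were run on different machines and operating systems, the ratio should be read as a rough indication only.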
Thanks to Maarten Grachten for providing the algorithm for computing the edit distance between two melodies.
Thanks to all the participants and members of the MTG for their contributions.
M. Grachten, J. Ll. Arcos, R. López de Mántaras: A Comparison of Different Approaches to Melodic Similarity. ICMAI02. http://www.iiia.csic.es/~maarten/articles/MelSim.pdf
ID |
Baseline |
|||||||||||||||
Option | 1 | 2 | Average12 | 1 | 2 | Average12 | 1 | 2 | Average12 | 1 | 2 | Average12 | 1 | 2 | Average12 |
TRAINING |
daisy2 | 75,23 | 75,23 | 75,23 | 38,65 | 69,06 | 53,86 | 78,22 | 78,74 | 78,48 | 78,13 | 78,66 | 78,40 | 68,52 | 71,38 | 69,95 |
daisy3 | 91,10 | 91,10 | 91,10 | 80,15 | 80,48 | 80,31 | 86,87 | 87,18 | 87,03 | 79,61 | 79,61 | 79,61 | 1,21 | 29,39 | 15,30 |
jazz2 | 67,82 | 68,56 | 68,19 | 21,05 | 55,74 | 38,40 | 74,99 | 74,99 | 74,99 | 59,70 | 67,86 | 63,78 | 46,16 | 57,90 | 52,03 |
jazz3 | 56,10 | 56,10 | 56,10 | 63,31 | 65,80 | 64,56 | 80,84 | 80,84 | 80,84 | 73,87 | 73,87 | 73,87 | 34,43 | 43,06 | 38,74 |
midi1 | 74,77 | 77,58 | 76,17 | 37,80 | 41,79 | 39,80 | 66,60 | 66,79 | 66,69 | 15,79 | 33,63 | 24,71 | 2,58 | 16,40 | 9,49 |
midi2 | 74,03 | 74,03 | 74,03 | 75,46 | 76,43 | 75,94 | 78,53 | 78,53 | 78,53 | 77,68 | 77,68 | 77,68 | 17,28 | 34,38 | 25,83 |
opera_fem2 | 35,46 | 35,49 | 35,48 | 45,00 | 46,51 | 45,75 | 35,68 | 35,68 | 35,68 | 44,73 | 44,76 | 44,74 | 38,17 | 44,36 | 41,26 |
opera_male3 | 26,07 | 27,09 | 26,58 | 13,28 | 35,59 | 24,44 | 33,84 | 33,94 | 33,89 | 14,64 | 28,77 | 21,70 | 44,93 | 52,91 | 48,92 |
pop1 | 60,92 | 61,10 | 61,01 | 17,16 | 39,26 | 28,21 | 55,43 | 55,43 | 55,43 | 25,95 | 34,74 | 30,35 | 14,40 | 18,29 | 16,35 |
pop4 | 70,81 | 70,84 | 70,83 | 31,81 | 43,86 | 37,83 | 70,82 | 70,89 | 70,86 | 73,08 | 73,08 | 73,08 | 27,44 | 34,06 | 30,75 |
TEST |
daisy1 | 66,55 | 66,55 | 66,55 | 50,71 | 62,52 | 56,61 | 60,38 | 62,72 | 61,55 | 77,23 | 77,23 | 77,23 | 58,18 | 64,57 | 61,37 |
daisy4 | 89,58 | 89,58 | 89,58 | 69,22 | 79,94 | 74,58 | 65,04 | 67,67 | 66,36 | 61,94 | 66,15 | 64,04 | 42,63 | 53,31 | 47,97 |
jazz1 | 61,46 | 61,82 | 61,64 | 39,37 | 57,87 | 48,62 | 49,67 | 50,11 | 49,89 | 65,66 | 66,51 | 66,08 | 49,74 | 58,49 | 54,12 |
jazz4 | 78,26 | 78,26 | 78,26 | 32,83 | 56,77 | 44,80 | 46,41 | 47,61 | 47,01 | 61,11 | 67,06 | 64,08 | 25,12 | 34,24 | 29,68 |
midi3 | 64,20 | 64,22 | 64,21 | 61,47 | 64,37 | 62,92 | 50,93 | 51,42 | 51,17 | 42,22 | 58,30 | 50,26 | 32,59 | 38,63 | 35,61 |
midi4 | 71,97 | 74,54 | 73,25 | 47,21 | 52,91 | 50,06 | 35,83 | 41,58 | 38,71 | 20,78 | 37,87 | 29,33 | 2,85 | 13,91 | 8,38 |
opera_fem4 | 46,96 | 46,96 | 46,96 | 55,84 | 56,36 | 56,10 | 20,04 | 23,51 | 21,77 | 44,40 | 44,40 | 44,40 | 23,44 | 38,94 | 31,19 |
opera_male5 | 46,51 | 47,19 | 46,85 | 18,42 | 49,74 | 34,08 | 29,43 | 30,43 | 29,93 | 8,58 | 34,32 | 21,45 | 70,25 | 74,18 | 72,21 |
pop2 | 63,94 | 64,08 | 64,01 | 18,89 | 38,98 | 28,93 | 57,67 | 58,04 | 57,86 | 28,96 | 36,25 | 32,61 | 31,70 | 34,95 | 33,33 |
pop3 | 73,02 | 73,73 | 73,37 | 26,11 | 43,56 | 34,83 | 45,64 | 46,69 | 46,17 | 62,85 | 73,17 | 68,01 | 23,31 | 31,24 | 27,27 |
Average | 64,74 | 65,20 | 64,97 | 42,19 | 55,88 | 49,03 | 56,14 | 57,14 | 56,64 | 50,85 | 57,70 | 54,27 | 32,75 | 42,23 | 37,49 |
Table 1: Evaluation results in % for each of the audio excerpts and the average for all of them, considering Options 1 and 2. Best results appear in green. You can download a zip-file with the transcriptions of each participant (click on the ID numbers in the table).
|
ID |
||
TRAINING |
daisy2 | 4,94 | 6,92 |
daisy3 | 0,49 | 0,56 |
jazz2 | 6,80 | 9,66 |
jazz3 | 6,74 | 6,09 |
midi1 | 6,58 | 26,78 |
midi2 | 7,66 | 7,26 |
opera_fem2 | 13,16 | 13,72 |
opera_male3 | 19,41 | 26,70 |
pop1 | 11,69 | 26,19 |
pop4 | 8,25 | 9,98 |
TEST |
daisy1 | 8,37 | 10,24 |
daisy4 | 6,01 | 8,42 |
jazz1 | 9,80 | 6,64 |
jazz4 | 1,96 | 4,56 |
midi3 | 5,30 | 19,43 |
midi4 | 5,12 | 24,82 |
opera_fem4 | 9,22 | 8,64 |
opera_male5 | 22,79 | 31,39 |
pop2 | 10,30 | 21,72 |
pop3 | 8,10 | 12,63 |
Average | 8,63 | 14,12 |
Table 2: Evaluation results as edit distance (option 3) for each of the audio excerpts. Best results appear in green. You can download a zip-file with the transcriptions of each participant (click on the ID numbers in the table). We also included synthesized MIDI files from the transcriptions for comparison by listening.
[1] The analysis window size for the reference was set to 2048 samples and the hop size to 256 samples (at a sampling rate of fs = 44.1 kHz). All participants were requested to use the same frame rate in their contributions, because no interpolation was performed.
[2] We only corrected frames that should be considered unpitched but were assigned a pitch by our reference algorithm. We could not guarantee that the F0 was estimated 100% correctly in every single frame.