classified instances
3. Feature selection and design of localized models us-
ing those M instances
4. Classification of the N segments using the localized
models
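The four-step procedure above can be sketched as follows; every function name here is a hypothetical placeholder for illustration, not the paper's implementation:

```python
# Illustrative sketch of the four-step pipeline; all callables are
# hypothetical placeholders supplied by the caller.

def localized_classification(segments, general_model, rank, top_m,
                             select_features, train_localized):
    # 1. Classify all N segments with the general models.
    labels = [general_model(s) for s in segments]
    # 2. Rank the classified instances and keep the M most reliable.
    reliable = rank(segments, labels)[:top_m]
    # 3. Select features and train localized models on those M instances.
    features = select_features(reliable)
    model = train_localized(reliable, features)
    # 4. Reclassify the N segments with the localized models.
    return [model(s) for s in segments]
```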
As it turns out, our attempts at automatic ranking and selection have not yet provided satisfactory results.
Therefore, we manually corrected the output of steps 1
and 2 and provided only correct percussion sound instances
to the feature selection step. Consequently, in this pa-
per, we present a more focused evaluation of the local-
ized sound model design, as well as a proper comparison
between general sound model and localized sound model
performances. Using the corrected instance subsets, we
investigate how the performance of the localized models
evolves as increasingly smaller proportions of the data are
used for feature selection and training.
Automatic techniques for instance ranking are currently
being evaluated. Together with the evaluation of the fully
automatic system, they will be the subject of a forthcoming
paper.
2. METHOD
2.1. Data and features
The training data set for general model building consists
of 1136 instances (100 ms long): 1061 onset regions taken
from 25 CD-quality polyphonic audio recordings and 75
isolated drum samples. These were then manually anno-
tated, assigning category labels for kick, snare, cymbal,
kick+cymbal, snare+cymbal and not-percussion. Other
percussion instruments, such as toms and Latin percussion,
were left out. Cymbals denote hi-hats, rides and crashes.
Annotated test data consists of seventeen 20-second ex-
cerpts taken from 17 different CD-quality audio record-
ings (independent from the training data set). The total
number of manually annotated onsets in all the excerpts is
1419, an average of 83 per excerpt.
Training and test data are characterized by 115 spec-
tral features (averages and variances of frame values) and
temporal features (computed on the whole regions), see
[9] and [12] for details.
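As an illustration of this characterization (the full 115-descriptor set is defined in [9] and [12]; the frame length, hop size and the particular descriptors below are assumptions), the following sketch computes one spectral feature, the frame-wise spectral centroid summarized by its average and variance over the region, and one whole-region temporal feature, the zero-crossing rate:

```python
# Illustrative feature extraction for a 100 ms onset region; the actual
# 115 descriptors are detailed in [9] and [12].
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    # Split the region into overlapping frames.
    n = 1 + max(0, len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def region_features(x, sr=44100):
    frames = frame_signal(x) * np.hanning(512)
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(512, 1.0 / sr)
    # Spectral feature: frame-wise centroid, summarized over the region
    # by its average and variance.
    centroids = (mags * freqs).sum(axis=1) / (mags.sum(axis=1) + 1e-12)
    # Temporal feature computed on the whole region: zero-crossing rate.
    zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2.0
    return np.array([centroids.mean(), centroids.var(), zcr])
```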
The experiments described in the remainder of this paper were conducted with Weka¹.
2.2. Classification with general models
In order to design general drum sound models, we first
propose to reduce the dimensionality of the feature space
by applying a Correlation-based Feature Selection (CFS)
algorithm (Section 2.4) on the training data. From the total
of 115, an average of 24.67 features are selected for each
model.
This data is then used to induce a collection of C4.5
decision trees using the AdaBoost meta-learning scheme.
Bagging or boosting approaches have turned out to yield
better results than other, more traditional machine
learning algorithms [9].

¹ http://www.cs.waikato.ac.nz/ml/weka/
2.3. Instance ranking and selection
The instances classified by the general models must be
parsed in order to derive the most likely correctly classi-
fied subset. Several rationales are possible. For instance,
we can use instance probability estimates assigned by some
machine learning algorithms as indicators of correct clas-
sification likelihood. Another option is to use clustering
techniques. Instances of the sought percussion instrument
would form the most populated and compact clusters, while
other drum sounds and non-percussive instances would be
outliers.
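The probability-estimate rationale can be sketched as follows; the probability matrix here is a hypothetical classifier output:

```python
# Rank instances by the class-probability estimates a classifier
# assigns, so the most confident classifications can be kept.
import numpy as np

def rank_by_confidence(proba):
    """proba: (n_instances, n_classes) probability estimates.
    Returns instance indices ordered from most to least confident."""
    return np.argsort(-proba.max(axis=1))
```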
However, as mentioned above, we went for a “safe” op-
tion: manually parsing the output of the general classifica-
tion schemes. Using the corrected output, we investigated
how the performance of the localized models evolved as
increasingly smaller proportions of the instances selected
from a recording were used to classify the remaining sound
instances of the recording. Since dependence between
training and test data sets is known to yield overly optimistic
results, these tests were performed by doing randomized,
mutually exclusive splits on the complete collection of
instances for each recording.
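A minimal sketch of such randomized, mutually exclusive splits (the instance count and the swept fractions are illustrative):

```python
# Randomized, mutually exclusive train/test splits over the instances
# of one recording, with an increasingly smaller training share.
import random

def exclusive_split(n_instances, train_fraction, seed=0):
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = int(n_instances * train_fraction)
    return indices[:cut], indices[cut:]  # disjoint train/test index sets

# Sweep over decreasing training proportions (values illustrative).
splits = {frac: exclusive_split(100, frac) for frac in (0.75, 0.5, 0.25, 0.1)}
```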
2.4. Feature selection
The collection of correctly classified instances from a recording
is then used to build new, localized sound models.
Relevant features for the localized sound models are
selected using a Correlation-based Feature Selection (CFS)
algorithm that evaluates attribute subsets on the basis of
both the predictive abilities of each feature and feature
inter-correlations [6]. This method yields a set of features
with good recording-specific class separability and noise
independence. The localized models may differ from the
general models in two respects: they may be based on
1) a different feature subset (feature space) and 2) differ-
ent threshold values (decision boundaries) for specific fea-
tures. As a general comment, the features showed signifi-
cantly better class-homogeneity for localized models than
for the general models.
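The subset-merit heuristic at the heart of CFS [6] can be sketched as follows; the correlation values passed in are assumed to be precomputed feature–class and feature–feature correlations:

```python
# CFS merit heuristic: a feature subset scores high when its features
# correlate with the class but not with each other.
import math

def cfs_merit(class_corrs, mean_inter_corr):
    """class_corrs: per-feature |feature-class| correlations of the k
    subset features; mean_inter_corr: mean |feature-feature|
    correlation within the subset."""
    k = len(class_corrs)
    mean_class_corr = sum(class_corrs) / k
    return k * mean_class_corr / math.sqrt(k + k * (k - 1) * mean_inter_corr)
```

Note that a redundant subset (high inter-correlation) scores lower than an equally predictive but mutually independent one, which is what drives the selection.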
2.5. Classification with localized models
Finally, the remaining instances must be classified with
the recording-specific (localized) models.
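This step can be sketched with the 1-NN rule proposed below, using scikit-learn as an illustrative stand-in; the two-dimensional training data here is hypothetical:

```python
# 1-NN classification of the remaining instances of a recording with
# a localized model; data values are placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# M correctly classified instances from one recording, in the
# recording-specific feature space.
X_train = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y_train = np.array(["kick", "snare", "kick", "snare"])

localized = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
pred = localized.predict(np.array([[0.15, 0.15], [0.85, 0.85]]))
```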
For this final step, we propose to use instance-based
classifiers, such as 1-NN (k-Nearest Neighbors, with k =
1). Instance-based classifiers are usually quite reliable and
give good classification accuracies. However, usual criti-
cisms are that they are not robust to noisy data, they are
memory consuming and they lack generalization capabil-
ities. Nevertheless, in our framework, these are not issues
we should be worried about: by the very nature of our