Initially, our dataset was composed of the iTunes library of one member of the research team. ffmpeg was used to convert the files to the .wav format, since Python (via scipy.io.wavfile) handles that format well.
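As a minimal sketch of this conversion and loading step (assuming ffmpeg is available on the PATH; the file names and target sample rate are illustrative):

```python
import subprocess
from scipy.io import wavfile

def convert_to_wav(src_path, dst_path, sample_rate=44100):
    """Convert an audio file to .wav using the ffmpeg command-line tool."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ar", str(sample_rate), dst_path],
        check=True,
    )

# scipy.io.wavfile returns the sample rate and a NumPy array of samples
# (shape (n,) for mono, (n, 2) for stereo).
convert_to_wav("track.m4a", "track.wav")
rate, samples = wavfile.read("track.wav")
```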
When this was found to be lacking in 'art' and 'traditional' music, the library was expanded in those directions by gathering public-domain recordings in both categories where available.
The files were renamed to a standardized format (year.artist.album.title.wav) and stored in a single location. A separate library file was generated to hold additional metadata, where appropriate.
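A hypothetical sketch of the renaming and indexing step; the metadata dictionary, the library.csv index file, and the field handling shown here are assumptions rather than the exact tooling used:

```python
import csv
import os
import shutil

def standardized_name(year, artist, album, title):
    """Build the year.artist.album.title.wav filename used for the library."""
    parts = [str(year), artist, album, title]
    # Strip the separator character from each field so the name splits cleanly later.
    return ".".join(p.replace(".", "_") for p in parts) + ".wav"

def add_to_library(src_path, library_dir, meta, index_path="library.csv"):
    """Copy a .wav file into the central library and record its metadata."""
    dst = os.path.join(library_dir, standardized_name(
        meta["year"], meta["artist"], meta["album"], meta["title"]))
    shutil.copy2(src_path, dst)
    with open(index_path, "a", newline="") as f:
        csv.writer(f).writerow([dst, meta["year"], meta["artist"],
                                meta["album"], meta["title"], meta.get("genre", "")])
```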
Training was performed on the full dataset, with between 10 and 25 percent of the data (the exact fraction varied between trials) held aside as validation data.
Saving of an evaluation split was implemented later, with a similarly sized portion of the dataset held aside as evaluation-only data. Both splits served to identify overfitting.
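A minimal sketch of such a three-way split, assuming the dataset is held as an in-memory list; the fractions and shuffling scheme shown here are illustrative:

```python
import numpy as np

def split_dataset(examples, val_fraction=0.15, eval_fraction=0.15, seed=0):
    """Hold out a validation slice and an evaluation-only slice from the dataset.

    val_fraction was varied between roughly 0.10 and 0.25 across trials;
    the evaluation-only split was added later and sized similarly.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(examples))
    n_val = int(len(examples) * val_fraction)
    n_eval = int(len(examples) * eval_fraction)
    val_idx = order[:n_val]
    eval_idx = order[n_val:n_val + n_eval]
    train_idx = order[n_val + n_eval:]
    return ([examples[i] for i in train_idx],
            [examples[i] for i in val_idx],
            [examples[i] for i in eval_idx])
```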
Several different windows over the audio data were used.
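The exact window parameters are not reproduced here; purely as an illustration, fixed-length sample windows could be cut from each waveform roughly as follows (window_size and hop are hypothetical parameters):

```python
import numpy as np

def fixed_length_windows(samples, window_size, hop=None):
    """Slice a 1-D sample array into fixed-length windows (hypothetical helper).

    window_size and hop are given in samples; hop defaults to
    non-overlapping windows.
    """
    hop = hop or window_size
    if len(samples) < window_size:
        return np.empty((0, window_size), dtype=samples.dtype)
    n = (len(samples) - window_size) // hop + 1
    return np.stack([samples[i * hop:i * hop + window_size] for i in range(n)])
```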
A variety of neural network architectures were tried; the resulting models and weights were stored, and the training-reported accuracies were logged.
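A sketch of one such trial, assuming Keras; the layer sizes, optimizer, epoch count, and file names are illustrative rather than the exact configuration used:

```python
import json
from tensorflow import keras

def build_model(hidden_sizes, input_dim, n_categories):
    """Build a simple fully connected classifier (sizes are illustrative)."""
    model = keras.Sequential()
    model.add(keras.layers.InputLayer(input_shape=(input_dim,)))
    for size in hidden_sizes:
        model.add(keras.layers.Dense(size, activation="relu"))
    model.add(keras.layers.Dense(n_categories, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def run_trial(name, hidden_sizes, x_train, y_train, x_val, y_val,
              input_dim, n_categories, epochs=20):
    """Train one architecture, then store the model, weights, and accuracy log."""
    model = build_model(hidden_sizes, input_dim, n_categories)
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=epochs)
    model.save(name + ".h5")  # model structure and weights together
    with open(name + ".accuracy.json", "w") as f:  # training-reported accuracies
        json.dump({k: [float(v) for v in vals]
                   for k, vals in history.history.items()}, f)
```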
For some trials, we performed an in-depth analysis: the trained network was run over the entire library, logging the 'correct' (labelled) category and the softmax output from the network for each track. These logs were used to identify the kinds of mistakes the network was making.
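A sketch of this per-track audit, assuming a Keras-style model and a library held as (path, feature vector, label) tuples; the CSV layout is an assumption:

```python
import csv
import numpy as np

def audit_library(model, library, categories, out_path="softmax_log.csv"):
    """Run the trained network over every track, logging the labelled category
    and the full softmax output for later error analysis."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "correct_category", "predicted_category"]
                        + list(categories))
        for path, features, label in library:
            probs = model.predict(features[np.newaxis, :], verbose=0)[0]
            predicted = categories[int(np.argmax(probs))]
            writer.writerow([path, label, predicted]
                            + [float(p) for p in probs])
```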