Convolutional neural network deriving similarity-based predictability scores from audio
I would like to divide this job into multiple milestones. After each milestone is complete, there should be a functioning product, and the pay for that milestone will be released. In the full details below, I have struck through future goals for this project that do not pertain to this job, which covers the first milestone.
Milestone 1
For this job I would like to keep it as simple as possible, so the script will only need to
- return scores on the smallest windows of data, something like 100-1000 ms at a time
- pre-train (no real-time training, although we can try to get the training time down as much as possible)
- display scores in the script itself, without returning them to the rest of my project in any way
The full details
I most likely need a neural network created, perhaps in TensorFlow/PyTorch.
I'm looking to analyze audio in an unsupervised fashion to extract whatever features/patterns might emerge. My first thought would be to use a convolutional neural network, the same kind that's used for image recognition but turned on audio, pulling out low-level features (an individual beat being on or off) all the way up to high-level ones (a repeated melody). That said, I am not at all set on that route if something else would accomplish my goals better.
In the same way that a visual CNN learns to recognize specific patterns at each node in the network, I want to see if an auditory CNN could learn to recognize repeated motifs in music.
All that being said, what I'm looking for here may just be a clustering algorithm, such as K-means or KNN, or something else entirely, like self-organizing maps. Our exact strategy is something we will need to discuss, so what do you think will work best?
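To make the clustering idea concrete, here is a rough sketch of the kind of thing I'm imagining (purely illustrative; a real implementation would probably use scikit-learn, and every name here is my own placeholder, not code from my project). Each analysis window would be flattened into a vector, and K-means would learn a handful of centroid "motifs":

```python
import numpy as np

def kmeans_fit(windows, k=8, n_iters=50):
    """Tiny K-means sketch: learn k centroid 'motifs' from flattened windows.

    windows: array of shape (n_windows, window_len), e.g. flattened
    Bark-coefficient frames. Returns the (k, window_len) centroids.
    """
    windows = np.asarray(windows, dtype=float)
    # Farthest-point initialisation: deterministic, and it spreads the
    # starting centroids across the data instead of picking at random.
    centroids = [windows[0]]
    for _ in range(k - 1):
        d = np.min(
            np.linalg.norm(windows[:, None, :] - np.asarray(centroids)[None, :, :], axis=2),
            axis=1)
        centroids.append(windows[int(d.argmax())])
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iters):
        # Assign every window to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(windows[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the windows assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = windows[labels == j].mean(axis=0)
    return centroids
```

The centroids would stand in for the "patterns it has learned"; how well a new window matches its nearest centroid is where a similarity score could come from.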
The goal of this analysis would be to return similarity scores for different sized windows. In other words, given a new sample timestep of data, how similar is it to the patterns it has already learned to identify? These scores could range from, say, 0 to 1, with 1 meaning it knew exactly how to classify what it heard and 0 meaning it was completely caught off guard by something unfamiliar.
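As a sketch of what I mean by a score (an assumption on my part, not a requirement for how you do it): the distance from a new window to the nearest learned pattern could be squashed into the 0-1 range, for example with an exponential:

```python
import numpy as np

def similarity_score(window, patterns, scale=1.0):
    """Map 'distance to the nearest learned pattern' onto a 0-1 score.

    window:   1-D array of features for the new window of audio.
    patterns: 2-D array of learned reference patterns (e.g. cluster
              centroids), one per row.
    scale:    tuning knob; roughly the distance at which the score
              has fallen to about 0.37 (1/e).

    Returns 1.0 for an exact match, approaching 0.0 for anything far
    from every learned pattern.
    """
    d = np.min(np.linalg.norm(patterns - window, axis=1))
    return float(np.exp(-d / scale))
```

The exact squashing function doesn't matter much to me, as long as familiar material scores high and unfamiliar material scores low.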
I would like to use Bark coefficients as the analysis data, which would be 25 numbers representing roughly every 10 ms of audio (think of the EQ bars on a stereo; that's basically what these numbers would be, bass on one side and treble on the other). If possible, the smallest windows might be something like 100-1000 ms long, while the largest might be 6000 or even upwards of 20,000 ms. The use case will be jam sessions that may provide anywhere from 5 minutes to a couple of hours of material. I already have a system gathering this training data in the form of text files on my computer.
I would like to aim for the goal of having it not only run and return similarity scores in real-time, but periodically retrain in real-time as well. However, it would probably be wise to get it working with non-real-time training first. The goal will be to identify what patterns are unique to a given jam session, so that will be the data that constitutes the ultimate training set.
It's important to also point out that I'm not looking for 99% accuracy here or anything. Not at all. The similarity scores it ultimately returns simply need to be something better than random and they will be useful to me once implemented in my project as a whole. As long as it can identify patterns with any sort of accuracy, my system will gradually tend towards a desirable outcome and that's what I'm looking for.
I would consider it a benefit if the strategy we chose was able to function on limited training examples as well as more plentiful ones. I don't know if a clustering algorithm, for example, might be more adept at working quickly on a very small training set than a full neural net would, so that's something we should discuss. I would also like to talk about how the method you have in mind would handle larger windows of audio in addition to the smaller ones we will be focusing on in this job.
Possible future milestones/goals
- Get the Python script sending and receiving from my main project, probably by reading/writing text files continually or maybe via UDP?
- Get it to derive some sort of BPM measurement from its analysis. This would basically be the most common interval between beats (spikes in the data)
- Rather than just a single similarity score, have it produce one for each sized window/layer in the network, possibly combining those in some way to output the final score
- Have it continually gather new data from my program and periodically retrain on it in real-time
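For the BPM goal above, here is a sketch of what I mean by "most common interval between beats" (this assumes the beat/spike times have already been detected somehow; the detection step itself is the hard part and is not shown, and all names are placeholders):

```python
import numpy as np

def bpm_from_beats(beat_times_s, bin_ms=10):
    """Estimate tempo as the most common interval between detected beats.

    beat_times_s: sorted beat/spike times in seconds.
    bin_ms:       intervals are rounded to this resolution before voting,
                  so slightly jittery beats still land in the same bin.
    """
    intervals = np.diff(np.asarray(beat_times_s))
    if len(intervals) == 0:
        return None
    # Round each interval to the nearest bin and take the most frequent one.
    binned = np.round(intervals * 1000.0 / bin_ms).astype(int)
    values, counts = np.unique(binned, return_counts=True)
    common_s = values[counts.argmax()] * bin_ms / 1000.0
    return 60.0 / common_s
```

Again, this is a future milestone, so treat it only as an indication of where the project is headed.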
You're the expert here, so I need your opinions. Do you think the approach I describe will be able to achieve my goals? Do you have any better ideas, ways to refine the strategy or a completely different suggestion?
There is no particular deadline as long as we stay in constant communication about the progress of the project.
All code will need to be clean, organized and very well-commented. Please contact me if you have any questions or would like clarification about anything! If you see a better way to do something than what I'm suggesting, please bring it up!