Machine learning competitions
Machine learning competitions are often designed to encourage new approaches to real-life problems. Sometimes they are used to test new ideas in a scientific context. Either way, despite being effective and novel, winning algorithms are rarely easy to apply in production environments: the code is often messy, untested, and full of hardcoded values.
But does that mean competition-level code has no practical value? Not necessarily. It may take some time to mold it into a production-ready shape, but it should be possible.
A new type of challenge
Recently a competition started that aims to bring bleeding-edge Kaggle algorithms into actual production. It is called Concept to Clinic and targets early detection of lung cancer. Building on top-notch Kaggle solutions, its goal is a genuinely usable, open-source suite that will help medical professionals diagnose patients.
I really like this idea. I work professionally as an engineer responsible for continuous integration and continuous delivery pipelines, and I often meet developers and other engineers with plenty of ideas about how to build such systems and run scripts. But reality quickly verifies those ideas, and many of them cannot even be implemented as a proof of concept.
We may sometimes face the same problems when bringing machine learning algorithms to production. I don't have much experience with introducing proof-of-concept algorithms into business environments, but I'm really looking forward to participating in this challenge.
The challenge
The Concept to Clinic challenge is divided into three sequential time blocks, and each of those is split into four parallel logical blocks, giving 12 categories in total. The time blocks are:
- MVP
- Feature Building
- Packaging
and each of them contains the following logical blocks:
- Prediction
- Interface Frontend
- Interface Backend
- Community
These divisions are designed to let everyone participate according to their experience and the moment they join the challenge. Data scientists should focus on prediction tasks, software engineers can look into the backend, and UI/UX developers will find the frontend section interesting. There is also a place for people less experienced in programming or design: coordinating community efforts and writing documentation. Anyone genuinely interested in this challenge should find something for themselves.
Organization
The competition is organized around a DrivenData forum and a GitHub repository. If you would like to participate, register and claim the task or bug you want to work on. And that's it: you could literally start fighting cancer and building open-source software in minutes. Why not try?
Issue of the week: #1
The issue I'm currently thinking about is Feature: Implement identification algorithm. It is about locating possible nodules across a CT image, not just classifying the image as a whole. Why bother identifying nodules, which are indicators of possible cancer risk, instead of focusing on whole-image classification? The answer is quite simple: by pointing to potentially cancerous nodules we are peeking into the black-box CNN model, and we can show those areas of interest to radiologists, who can combine their own expertise with the machine learning classifications.
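To make the idea of "areas of interest" concrete, here is a minimal sketch of classical candidate detection: threshold a CT slice at an assumed Hounsfield-unit cutoff and keep small connected components as rough nodule candidates. This is not the challenge's actual algorithm; the function name and all threshold values are illustrative assumptions, and a real pipeline would use far more sophisticated filtering.

```python
import numpy as np
from scipy import ndimage

def find_nodule_candidates(ct_slice, hu_threshold=-400, min_px=9, max_px=500):
    """Return centroids of small bright regions as rough nodule candidates.

    ct_slice: 2D array of Hounsfield units (HU). All thresholds here are
    illustrative assumptions, not clinically tuned values.
    """
    mask = ct_slice > hu_threshold            # keep tissue-density pixels
    labels, n_regions = ndimage.label(mask)   # connected-component labeling
    candidates = []
    for region in range(1, n_regions + 1):
        size = np.sum(labels == region)
        if min_px <= size <= max_px:          # drop noise and large structures
            candidates.append(ndimage.center_of_mass(labels == region))
    return candidates

# Synthetic slice: air background (-1000 HU) with one small bright blob.
slice_ = np.full((64, 64), -1000.0)
slice_[30:34, 40:44] = 50.0                   # soft-tissue-density "nodule"
print(find_nodule_candidates(slice_))         # one centroid near (31.5, 41.5)
```

The centroids returned by a detector like this could be overlaid on the original scan, which is exactly the kind of output a radiologist can inspect and compare against their own reading.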
If any of you have experience with finding potential areas of interest in images, I would love to discuss this problem. In the meantime, I encourage you to get familiar with this publication: The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans.