The goal of this contest is to "predict an employee's access needs, given his/her job role."
According to the details provided by Amazon:
> There is a considerable amount of data regarding an employee’s role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. These auto-access models seek to minimize the human involvement required to grant or revoke employee access.
We are given the following information:
- Action (whether the access was granted or denied)
- Role Rollup info
- Title (role code)
- Role family
According to the data provided by Amazon, employees are granted the access they request 94.2% of the time. So a stupid algorithm that simply guesses "yes" every time will have 94.2% accuracy!
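To see why accuracy is so misleading here, consider a toy label set with the same 94.2% grant rate (the labels below are made up to match that base rate, not real contest data):

```python
# Toy labels mirroring the contest's ~94.2% grant rate: 942 grants, 58 denials.
labels = [1] * 942 + [0] * 58

# The "stupid" baseline: predict "grant" for every request.
predictions = [1] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"accuracy of always-grant: {accuracy:.3f}")  # 0.942
```

The baseline gets 94.2% accuracy while denying exactly zero of the requests it should deny, which is the whole problem.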
Given the skewness of the data, algorithms are measured based on the area under the ROC curve, or AUC, rather than accuracy. Intuitively, the AUC is the probability that an algorithm will rank a randomly chosen positive datapoint higher than a randomly chosen negative one.
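That probabilistic definition can be computed directly by comparing every positive/negative pair (with ties counting half), which is a useful sanity check even though real evaluation code integrates the ROC curve instead:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0]
print(auc([0.9, 0.8, 0.4, 0.5, 0.2], labels))  # one positive ranked below a negative: 5/6
print(auc([1.0, 1.0, 1.0, 1.0, 1.0], labels))  # constant scores tie everything: 0.5
```

Note that the always-grant baseline, which scores everything identically, gets an AUC of exactly 0.5 despite its 94.2% accuracy, which is why the contest uses AUC.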
At the time of this writing, the top algorithm on the leaderboard has a score of 0.906. I don't think that's good enough to fully automate access control, because the cost of a false positive (granting access that shouldn't be granted) can be very high. My guess is that Amazon will use the winning model to derive a set of rules mapping roles to access, have someone review that list by hand to make sure it's still up to date, and then update their standard permissions by role. But if you need access that's not in the standard set for your role, I think you're still going to have to request it manually, and a live person will have to approve or deny it (maybe two people will need to review). A 0.906 AUC just isn't good enough to automate new requests.
I used Vowpal Wabbit and was able to get a score of 0.876, which puts me at 31/178 at the time of this post.
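Vowpal Wabbit takes plain-text input lines rather than numeric matrices, so the categorical IDs have to be turned into sparse indicator features by writing each one as a `name_value` token. A minimal sketch of that conversion (the column names are from the contest's data dictionary; the single `|f` namespace is my own choice, not anything the contest prescribes):

```python
def to_vw_line(action, features):
    """Convert one training row to Vowpal Wabbit's input format.

    VW's logistic loss expects labels in {-1, 1}; each categorical ID
    becomes a distinct sparse feature via a name_value token.
    """
    label = 1 if action == 1 else -1
    tokens = " ".join(f"{name}_{value}" for name, value in features.items())
    return f"{label} |f {tokens}"

# A hypothetical row; values are illustrative, not real contest data.
row = {"resource": 39353, "mgr_id": 85475, "role_rollup_1": 117961}
print(to_vw_line(1, row))
# 1 |f resource_39353 mgr_id_85475 role_rollup_1_117961
```

Putting features in named namespaces like `|f` also lets VW build quadratic interactions between them with its `-q` flag, which matters a lot on purely categorical data like this.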
It's a good deal for Amazon, because for only a $5,000 prize they can have a couple hundred people hack on their data and submit algorithms. It's useful for me, because I can compare my algorithms to those created by people around the world and improve my skillset.
Only I can't see this algorithms-as-a-contest model sticking around, because I don't think Kaggle is making any money. If you look at the prizes for the current active competitions, the total prize amount is $26,500. I don't know what fee they charge on top of that, but let's say it's 50%. If each contest runs for about two months, that's six cycles of $13,250 in fees, putting their revenue around $79,500/year. They also have a sort of matchmaking service for data science consultants, but I can't see that being too lucrative either. On top of that, each contest has to be set up manually by a trained data scientist on Kaggle's end, who packages/encrypts/anonymizes the data, chooses the evaluation metric, and posts the instructions. It just doesn't look like a profitable model.