Domain: kaggle.com
Stories and comments across the archive that link to kaggle.com.
Comments · 25
-
Re:This is fantastic
How easy is it to use Python and Pandas to do these kinds of analytics? I have been exploring the Chicago crime data set as an example (see: https://www.kaggle.com/boldy71...) and I am interested to know how much expertise and time does it take to do something like this. I am building a data analytics tool that will allow non-programmers the ability to do simple analysis of large data sets using a point-and-click interface. I use this crime data set to test things out, but I want to explore more in-depth analysis to see if it can help even more than it already does. A 4 minute video demonstrates our tool. https://www.youtube.com/watch?...
-
Re:pre-trained machine learning
C'mon. As much as I despise them, I'm sure there are script-kiddies who can code Keras or PyTorch at Microsoft.
https://www.kaggle.com/gaborfo...
https://github.com/Cadene/pret...
-
Kaggle
There is a dataset of recorded heartbeats at Kaggle free for everyone to download. Not that many results there, though. I wonder how much better the algorithms have gotten and if Apple can actually do something useful with their device.
-
Re:This is tailor made for public funding
Even better, just put the raw anonymized recidivism data on Kaggle and let everyone compete to come up with the best model.
-
The dataset appears to be missing
The article links this as being the dataset "consist[ing] of six downloadable zip files, with four containing around 10,000 profile photos each and two files with sample sets of around 500 images per gender."
https://www.kaggle.com/scolian...
Which gives a 404.
-
Re:Not one example?
Here is a page with some examples.
Here is a PDF of the paper, which has more examples.
I don't think it means much. Instead of showing that humans see better than computers, it really just shows that this one researcher is bad at programming computer vision systems. If he took his dataset, and made it a Kaggle Competition, I think someone would design a computer vision system that would do much better than his.
-
Kaggle
The single most motivating thing for me, personally, was to find real problems to solve and real examples and help on how to solve them. Bonus points for variety and competition and even prizes.
Enter Kaggle -- data mining competitions with an absurd amount of examples, datasets, community posts, forums, curated examples. I really cannot emphasize how much I've learned in this community. Join and try one of the example competitions -- the Titanic one is popular, follow the getting started guides and go from there.
I'm sure there are many other ways, and it may not be for everyone, but this has really been a great resource for me.
-
Re:Great experience
They asked me a bunch of graph theory / number theory problems; if you didn't know the algorithms and couldn't implement them quickly, you'd be SoL.
For me programming challenges were my way to get a foot in the door without a degree. I remember getting interviews from Google after their first Code Jam. And Facebook after I solved a bunch of programming puzzles they released. But TopCoder was my real salvation from my parents' basement.
Even though I have a career now, I find solving them lots of fun. Sometimes I come to the solutions when I'm drowsy before bed and have to get up and write it down before I forget!
-
Re:So misleading.
This is so misleading. No program can do anything outside what it is explicitly programmed to do.
You are the misleading one.
Machine learning and Optimization are the science of getting programs to do things they are not explicitly programmed to do.
Evidence:
- The Merck molecular activity challenge was won by data scientists who did not themselves have the capacity to perform the task.
  http://blog.kaggle.com/2012/10...
- As described on Wikipedia: "Machine learning is a subfield of computer science (CS) and artificial intelligence (AI) that deals with the construction and study of systems that can learn from data, rather than follow only explicitly programmed instructions".
  http://en.wikipedia.org/wiki/M...
- Artificial evolution, for instance, is a special kind of optimization algorithm.
  http://en.wikipedia.org/wiki/E...
The whole point of machine learning is to program learning rules, not the explicit final program. The behaviour of the program is then determined by the data used to train it.
-
Re:Something else?
While I agree with parent in the case you actually are interested in newt farming, I actually code mostly just for the fun of coding, and focus on the type of code rather than the end product. To give an alternate approach, then, depending on what type of code you like there's probably a hackathon or a set of "challenges" or some competition that can provide motivation if you just want random problems to solve. I'm mostly an algorithms guy, so I do a lot in Kaggle, and Project Euler. Project Euler for example has hundreds of problems that more or less increase in difficulty, making it relatively easy to find something that will increase your skill, and the Kaggle forums are full of code examples from past projects to help you get on your way.
If you're interested in graphics or UI programming these examples may be less help, but I'm sure there are similar things out there. The results of hackathons are great places to start because the code is generally written by competent programmers but they have no time to do clean up nor to build the spaghetti that years of updates often brings... bug fixes and hacks are common, so the code needs some TLC, but it typically has very few hands in it and so has some good consistency. iosDevCamp (from a quick google search), has links to github code for some of its results.
-
Re: Your (excellent) questions.
My questions:
(1) What are some interesting computational neuroscience simulation problems that an individual with a workstation class PC can work on?
** These come up more frequently than you might think. Even what you'd think of as a regular home or office PC can do a lot with 8-16 gigs of memory, let alone amounts beyond that. I'd suggest that you start looking at http://www.kaggle.com/ as a place to start. Also, start looking at the discussion groups that you can find on (I hate it, but use it) LinkedIn. I prefer the discussion groups that you can get at the American Statistical Association, and even the listserve discussion groups for various statistical software packages (e.g., R, Stata, SAS).
(2) Is it easy for a non-academic to get the required data?
** It depends on the problem being examined, and who "owns" the data. For Kaggle competitions, the data is given to you. For other projects, a lot of data is becoming "open sourced" so that people can get to it publicly. So, that's a qualified yes for some things, and a no for others.
(3) I am familiar with (but not used extensively) simulators like Neuron, Genesis etc. Other than these and Matlab, what other software should I get?
** I tend to lean on Stata and R. Will be moving over to R after finishing current research project. It depends on the areas you want to examine. If you're willing to deal with the "learning curve" for R, I'd go with that. It's free and has a fantastic community.
(4) Where online or offline, can I network with other DIY Computational Neuroscience enthusiasts?
** I hate LinkedIn, but I use it in my own field. You might try that, as well as G+ initially. I'd also be looking at the American Statistical Association and related professional groups. The listserves for various statistical software packages are good, but they get nasty about off topic posts (tangential to the use of the software)
** I think that the related StackOverflow forums would be very good.
I've had good results with them.
-
Competitions, trading
You could try your hand at various programming competitions such as those offered on TopCoder or Kaggle. Some of the prizes in these competitions amount to serious dough.
Alternatively, you could try algorithmic trading. Several online brokerages offer an API, such as Interactive Brokers and TradeStation.
-
Automatic creation of features
I wonder how much of these improvements in accuracy are due to fundamental advances
I was wondering the same thing, and just now found this interview on Google. Perhaps someone can fill in the details.
But basically, machine learning is at its heart hill-climbing on a multi-dimensional landscape, with various tricks thrown in to avoid local maxima. Usually, humans determine the dimensions to search on -- these are called the "features". Well, philosophically, everything is ultimately created by humans because humans built the computers, but the holy grail is to minimize human involvement -- "unsupervised learning". According to the interview, this one particular team (the one mentioned at the end of the Slashdot summary) actually rode the bicycle with no hands: to demonstrate how strong their neural network was at determining its own features, they did not guide it, even though it meant their also-excellent conventional machine learning at the end of the process would be handicapped.
The last time I looked at neural networks was circa 1990, so perhaps someone writing to an audience more technically literate than the New York Times general audience could fill in the details for us on how a neural network can create features.
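To make the "hill-climbing with tricks to avoid local maxima" picture concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `landscape` function is a made-up bumpy surface, and random restarts stand in for the "various tricks". It is not how the team in the interview trained their network.

```python
import math
import random

def landscape(x, y):
    # Hypothetical surface: a broad peak near (2, 1) plus ripples
    # that create many local maxima.
    return -((x - 2) ** 2 + (y - 1) ** 2) + 2 * math.sin(3 * x) * math.cos(3 * y)

def hill_climb(f, start, step=0.1, iters=2000, rng=random):
    # Greedy local search: propose a small random move and keep it
    # only if it improves the score.
    best = start
    best_val = f(*best)
    for _ in range(iters):
        cand = (best[0] + rng.uniform(-step, step),
                best[1] + rng.uniform(-step, step))
        val = f(*cand)
        if val > best_val:
            best, best_val = cand, val
    return best, best_val

def with_restarts(f, n_restarts=20, seed=0):
    # One simple trick against local maxima: climb from many random
    # starting points and keep the best summit found.
    rng = random.Random(seed)
    results = [
        hill_climb(f, (rng.uniform(-5, 5), rng.uniform(-5, 5)), rng=rng)
        for _ in range(n_restarts)
    ]
    return max(results, key=lambda r: r[1])

if __name__ == "__main__":
    point, value = with_restarts(landscape)
    print(point, value)
```

A single climb can stall on whichever ripple is nearest its starting point; restarting from several places only raises the odds of landing near the tallest peak. The two coordinates here play the role of hand-picked "features".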
-
Re:Not a problem with resolution
The Kinect Gesture challenge over at Kaggle was a competition where the goal was to match gestures with a specified dictionary of previously-recorded gestures.
The problem isn't the resolution, it's the recognition algorithm.
It's a little bit of both, actually. The problem isn't resolution, from a hardware standpoint -- it's the point density on the IR projector and the lens on the IR camera that limit how close you can be to a Kinect and still have any accuracy. Once your depth cues go wonky, gesture recognition becomes much harder.
Gesture recognition, while not trivial, is not intrinsically more complicated than whole body tracking. The way Kinect does it is very clever, knowing basically "where can the body have moved from where it last was", which makes the matching process very efficient computationally. Gestures are the same thing. Each of your joints can only move in one of a limited set of ways from where it was. Just like handwriting recognition is dramatically easier for computers when they can see the order of strokes, the same is true of gestures.
-
Not a problem with resolution
The Kinect Gesture challenge over at Kaggle was a competition where the goal was to match gestures with a specified dictionary of previously-recorded gestures.
The problem isn't the resolution, it's the recognition algorithm.
A human looking at the videos could easily distinguish between gestures and interpret the meaning. The problem was even easier for a human because you only had to choose the closest match from within the dozen-or-so gestures in the dictionary. This leads me to believe that it's not a problem with the resolution, or the hardware in general.
Despite this, gesture recognition is a very difficult problem. Aspects which humans would naturally interpret as similar can be wildly different for the computer. Hold your hand up and wave - if the hand is in a different position (relative to the torso), the angle of waving is different, the body is waving back and forth instead of still, the number of waves is different, the time cadence of the waving is different... all of these confuse the heck out of a match algorithm.
(One video had curtains in the background, apparently waving ever so slightly in the breeze - causing lots of motion for the camera. Another video (color channel) contained an intricate flower pattern, which was very complex to match against.)
Finger position and motion have limited resolution (they form only a small part of the input field), but a human could still interpret various ASL hand signs to a large extent. Perhaps very similar hand signs would be difficult to discriminate, but certainly many of the ones shown were recognizable.
This is pretty-much an aspect of hard AI. We're not that close to solving this problem, and breakthroughs are not expected any time soon.
-
Kaggle is unverifiable
I've entered a couple of Kaggle competitions, but I'm kinda put off by the opaque results.
After the first one ended (predict HIV progression), the released full dataset indicated that the data had been sorted before it was separated into train and test sets. IOW, after being sorted by length, all the short sequences were put into the training set, and the longer ones into the test set. This mistake may have invalidated the competition, and I strongly suspect it would have invalidated any paper written about the results.
More recently, the organizers of one competition stated flatly in the forums that they would release the entire data set once the competition had ended, but then didn't. I inquired about this, and a Kaggle data scientist replied saying "we almost never release the test data".
I'm not sure that Kaggle is all that scientific. If the full dataset can't be examined after the competitions close, there's no way to verify the results.
-
And if you want to join their data science team...
... Facebook is running an open call data science competition to win an interview/job on their data science team.
(Disclosure: My work is running the competition for them)
-
Got any ideas on how to do this. Win $$ in a cont
No joke! If you can write an algorithm to do this, you can win $60,000. See:
http://www.kaggle.com/c/asap-aes
... but time is running out...
-
Re:Welcome to real world
The Firefox makelink plugin is great for pasting links with actual link text. It will use the page title by default, or a text selection as the link text if you make one. You can customize it for any forum or BB syntax in under 60 seconds.
At a casino, no one considers a call of the big pot to be a failure if you got your chips in with 70% odds and then got burned on the river. Likewise, if it costs you $100 to participate in a $1000 pot and you got in at well better than 10% odds, how does that count as a failure? Real failures are small companies that burn through a ton of money on an idea that never could have worked. True failure is when people foolishly misallocate capital on blind hope or to bilk the investment base. Far from failure is when you come out on the wrong side of a well-judged risk.
At the same time, it's becoming a very common business model to create a forum for ambitious aspirants and profit from the vast majority who go away empty handed.
We're making data science a sport
So they admit it.
The 100 Greatest Hockey Arguments by Bob McCown and David Naylor
Repeating their own excerpt:
Of those 30,000 [Ontario players], just 232 were eventually drafted by an OHL team in their mid-teens, the first major cutoff for players hoping to stream towards the NHL. Less than half of those players, 105, actually played in an OHL game. Another 42 played in the top tier of U.S. college, which is another viable route to the NHL.
Overall, just 47 wound up with NHL contracts after being drafted in 1993 or 1994, or signing later as a free agent.
They then continue to summarize in their own words:
Ultimately, the sum total of players with more than one NHL season ends up being just 15, and only six had played 400 NHL games nine years later. Jason Allison and Todd Bertuzzi were the only names of note among the 30,000.
And this was considered to be a particularly strong crop.
-
Re:So...what's the answer?
Mandate all rich people give poor people everything every other generation?
I've already written once today (in partial jest) that there are two ways to obtain a benefit you haven't earned: through social programs and through inheritance--let's kill both.
There's a raging debate going on in the discussion thread at Richard Wilkinson: How economic inequality harms societies
I'm an R programmer IRL. I don't have much formal training in statistics, but when I need a second opinion, my bookshelf is stacked with the highest grade of bullshit detector. In the machine learning sector, that's a high grade indeed. You don't ascend to the top of the Kagglestalk by being full of shit. (I have not yet formed an opinion about Kaggle in general.)
My investigations quickly led me to The Spirit Level Delusion: Chapter 10
I quickly came to the conclusion that the spousal unit of Richard Wilkinson and Kate Pickett have way oversold their analysis as an input to public policy. Nevertheless, it ought to be troubling how readily these slopes tip in an ugly direction. In data mining, most of what you get is suggestive. I find their approach closer to data mining than proper statistics. Human cognition for the most part is closer to data mining than proper statistics, so I'm not saying that suggestive signals are slight or worthless. I'm saying that juicy things you pick up off the floor should not enter mouth without second inspection.
From Snowdon's mad dog supplemental chapter:
It is fantastically implausible to think that Wilkinson and Pickett are not aware of the importance of outliers in statistics.
There's a certain type of thinker who loves to stop thinking at the invocation of a categorical word. Outlier is a word of many meanings in statistics. It's not an automatic red flag to invoke the purity reflex (conservatives are sometimes painted as having more intense purity/disgust pathways). An outlier due to a DRAM memory error is best discarded. When the outlier is a big fat juicy data point, you need to engage your brain. Your signal naturally shows up most intensely at the extremes. If you don't want to find a signal, by all means, terminate outliers with extreme prejudice, as Snowdon imprecates the vagrant bastards.
But if they really wished to "avoid being accused of picking and choosing" they would have used the same official measure throughout.
By page 200 or so, he's wound himself up to where he leaves his brain behind. Too bad, because his brain was useful when he used it. He's gone completely insane on the decision process of prudence: trying your best not to shop for the desired outcome, while also trying to step around contaminated inputs. One of the inputs W&P sensibly step around are self-reported psychiatric states. These are known to be dirtier than Netflix ratings. Snowdon by the end is promoting the merest sign of discretion as a hanging offence. I would also like to know why these small acts of discretion were invoked, but I don't immediately fear the worst. W&P could do much better in the scholarship department.
Snowdon loses it completely on race as a confound. Confounds aren't all that important until you get into causative interpretation, often a necessary step on the road to public policy. I don't think W&P is anywhere close to providing a solid foundation for public policy, so this whole causative rebuke leaves me cold. Attack dogs never weary of citing error, long after there was any point. If he's not an attack dog, why does he act like one?
Since there is no relationship between race and mental health, they cannot find a relationship with inequality. But since there are relationships between race and many o
-
Re:Sounds good to me, in my dreams
5 != 5!
Why don't they submit their prediction competition to Kaggle. There are quite a lot of "impossible" prediction competitions over there already.
-
Submission error
Already three teams have managed to create systems that make more accurate predictions than the official Elo approach.
1 EdR* 0.729125
2 whiteknight* 0.731656
3 Elo Benchmark* 0.738107 <-- The "official Elo approach"
Maybe we're counting from zero and they forgot to put it on the leaderboard?
-
Fontanelles is not in the running
From the forum page over at Kaggle:
The Fontanelles is a group which does HIV research professionally and so has some specialized information in this area. We're disqualifying our entry, but have put it in just for fun as a target. We may be back if someone beats it.
So despite the appearance of the professional entry, this looks like an interesting contest that anyone might enter.