Alternate Baseball Universes
Jamie found a NYTimes op-ed by a grad student and a professor from Cornell, outlining some research they did into alternate baseball universes. The goal was to find out how unlikely in fact was Joe DiMaggio's 56-game hitting streak, played out in the 1941 season. No one since has even come close to that record. The math guys ran simulations of the entire history of baseball from 1885 on — 10,000 of them. For each simulation they put each player up to the plate for each at-bat in each game in each year, just like it happened; and they rolled the dice on him, based on his actual hitting stats for that season. (Their algorithm sounds far simpler than whatever the Strat-O-Matic guys use.) The result: Joltin' Joe's record is not merely likely, it's basically a sure thing. Every alternate universe produced a streak of 39 games or better; one reached 109 games. Joe DiMaggio was not the likeliest player in the history of the game to accomplish the record, not by a long shot.
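For anyone who wants to play along at home, the dice-rolling described above can be sketched in a few lines of Python. This is only a guess at the general idea, not the Cornell authors' actual code. The function names are made up, the sample player's numbers are invented for illustration, and it assumes every game has the same number of at-bats and every at-bat is an independent coin flip weighted by the season batting average.

```python
import random

def game_hit_probability(avg, at_bats_per_game):
    # Chance of at least one hit in a game, treating each at-bat as an
    # independent trial with success probability `avg` (an assumption).
    return 1.0 - (1.0 - avg) ** at_bats_per_game

def longest_streak(avg, at_bats, games, rng):
    # Roll the dice once per game and track the longest run of games with a hit.
    p = game_hit_probability(avg, at_bats / games)
    best = current = 0
    for _ in range(games):
        if rng.random() < p:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

def simulate_seasons(avg, at_bats, games, n_seasons, seed=0):
    # One "alternate universe" per season; a fixed seed keeps runs repeatable.
    rng = random.Random(seed)
    return [longest_streak(avg, at_bats, games, rng) for _ in range(n_seasons)]

if __name__ == "__main__":
    # Hypothetical player: .350 average, 550 at-bats, 140 games (made-up numbers).
    streaks = simulate_seasons(0.350, 550, 140, n_seasons=10_000, seed=1)
    print("longest streak over all simulated seasons:", max(streaks))
```

A full-history version would loop this over every player-season and keep the longest streak per universe, which is where the real work (and the real data) would come in.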
The most likely reason is that statistics isn't the appropriate method by which to study this problem.
This sort of study is really more about curiosity; it doesn't deal with things like changes to the way the game is played. For instance, early on, and for quite a while afterward, it was common for a pitcher to pitch all 9 innings of every game, and in many cases to pitch both games of a doubleheader. That means more opportunity for errors, and since batters get time to rest up, that style of play gives the batter a bit of an edge that doesn't exist today.
That also doesn't include the variety of pitching which players see today or the fact that a player might get to see 3 different pitchers in a single game.
Even the length of the season has an effect on how players play. None of those things are easily quantified, much less analyzed by statisticians.
From reading the article (which is light on the details) it seems like they used nothing but batting average, at bats, and games played.
The problem is that this doesn't control for variance in the quality of pitching. The chance of going that many games without running into a hot pitcher isn't accounted for.
Imagine you average a 75% chance of getting a hit in any individual game. If you face three average pitchers, your chance of hitting in all three games is (.75)^3 ≈ 0.42, but if you face a good pitcher, an average pitcher, and a bad pitcher, it might be (.5)(.75)(1.0) = 0.375, which gives a different probability despite the same average chance of a hit per game.
In order to be realistic, the calculation would need to account for the deviation from average in the ability of the pitchers (which would likely be higher 100 years ago because of fewer players and segregation, and now because of expansion, as compared to the 1950s).
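The arithmetic above is easy to check. Here is a small Python sketch; the function name is made up, and the per-game probabilities are just the illustrative numbers from this comment, not real pitcher data. For a fixed average, any spread in the per-game chances can only lower the product (that is the AM-GM inequality), so ignoring pitcher variation would tend to overstate streak odds.

```python
from math import prod

def streak_probability(per_game_hit_probs):
    # Probability of getting a hit in every game, assuming games are independent.
    return prod(per_game_hit_probs)

# Three average pitchers vs. a good, an average, and a bad pitcher.
uniform = streak_probability([0.75, 0.75, 0.75])
mixed = streak_probability([0.5, 0.75, 1.0])

if __name__ == "__main__":
    print("uniform pitching:", uniform)
    print("mixed pitching:  ", mixed)
```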
What they don't report is how often there are long (but not record) streaks in their model, so there is no way of knowing how accurately it reproduces reality.
You don't understand. Baseball is so boring, the fans find the statistics exciting!
I was once at a friend's BBQ and a lot of the other guests were really into sports and talking a lot about their various sporting events etc. I made a comment about how baseball was one of those sports that is fun to play but boring as hell to watch. One of the guys responded with, simply, "I disagree". To which I replied "You're right. It's pretty boring to play too." He wasn't very amused.
:(
Talk about a great way to make an awkward social event even more awkward
Because baseball players aren't dice?
In every simulation, a ground ball went between Bill Buckner's legs in the 1986 World Series.
-- Of course I'm paranoid. I'm a sysadmin.
A good illustration of this is the so-called "birthday paradox", which asks what's the probability of having duplicate birthdays in a group of n people (whose birthdays are independent of each other). Think of adding the people to the room one by one. The first person doesn't have any chance of having a duplicate birthday, because there's nobody else in the room. The second person has 1/365 chances of duplicating, 364/365 of missing the first one. Let's follow up on the misses, they're easier to work with. In general, if we've got k people in the room without a duplicate, that means they've used up k of the 365 days in the year, and the next person we introduce to the room has to miss all of those days to avoid a duplication. So the probability of everybody missing everybody else, by the time we get up to n people in the room, is (365/365)*(364/365)*(363/365)*...*((365-n+1)/365), which starts diving towards zero really fast. The probability of having one or more duplicates is 1 - P(no duplicates), which correspondingly climbs to one really fast. If you write a short program to do the exact calculations, you'll find that by the time you have 23 people in the room the probability is greater than 0.5 of having a duplicate, and by the time you get 57 people it's greater than 0.99!
If you pick one particular person and ask what's the probability of duplicating that birthday, it remains quite small. That's the difference between a particular rare event happening and some rare event happening. For a large enough group, some pair of people will almost surely share a birthday, but the odds of it being you (or any other designated person) remain quite small.
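Here is one way the "short program" could look, as a Python sketch. The function names are made up, and it assumes 365 equally likely birthdays with no leap years. The first function is the product described above; the second is the chance that somebody else in the room matches one designated person.

```python
def p_shared_birthday(n, days=365):
    # 1 - P(no duplicates): multiply the chance that each new arrival
    # misses every birthday already in the room.
    p_none = 1.0
    for k in range(n):
        p_none *= (days - k) / days
    return 1.0 - p_none

def p_match_specific(n, days=365):
    # Chance that at least one of the other n - 1 people shares the
    # birthday of one designated person.
    return 1.0 - ((days - 1) / days) ** (n - 1)

if __name__ == "__main__":
    for n in (10, 23, 57):
        print(n, round(p_shared_birthday(n), 4), round(p_match_specific(n), 4))
```

With 23 people the "some pair" probability has passed one half, while the "matches you" probability is still only a few percent.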
Just to preserve my computing geek cred, this is why you need collision resolution for hashing algorithms. You don't know which entries will share hash values, but collisions are almost certain to happen by the time you've loaded 3 * sqrt(Hash Table Capacity) values, e.g., if your hash table has capacity 10000 you will almost surely see a duplicate within the first 300 entries.
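The 3 * sqrt(capacity) rule of thumb can be checked the same way. This Python sketch assumes an idealized hash function that spreads keys uniformly and independently over the slots, which real hash functions only approximate; the function names are made up.

```python
import math

def p_collision(entries, capacity):
    # Exact chance of at least one collision when `entries` keys land
    # uniformly at random in `capacity` slots (idealized hashing).
    p_none = 1.0
    for k in range(entries):
        p_none *= (capacity - k) / capacity
    return 1.0 - p_none

def p_collision_approx(entries, capacity):
    # Standard approximation: 1 - exp(-n(n-1) / (2N)).
    return 1.0 - math.exp(-entries * (entries - 1) / (2 * capacity))

if __name__ == "__main__":
    capacity = 10_000
    entries = 3 * math.isqrt(capacity)  # 300 for a table of 10000 slots
    print(entries, p_collision(entries, capacity), p_collision_approx(entries, capacity))
```

For a table of 10000 slots and 300 entries, both versions come out at roughly 0.99, which is about what "almost surely" means here.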
This seems relevant:
http://abcnews.go.com/Technology/WhosCounting/story?id=3694104&page=1
Disclaimer: I'm not an American, so I know next to nothing about baseball - and care less!
Otherwise, buddy, you're way off base.
NL year-by-year stats.
Look at those ERAs pre-1920. Before 1920, the ERA in the NL never significantly exceeded 3.00. After 1920, it never dropped below 3.3 or so, with the exception of a 2.99 in 1968, after which MLB made changes to the rules, amongst them lowering the acceptable height of the pitcher's mound.
The time prior to 1920 was marked by pitchers such as Cy Young, Mordecai Brown, Walter Johnson, Ed Walsh, and Christy Mathewson. You've probably heard of most of them.
Here are the single-season MLB ERA leaders. Outside of Bob Gibson in the aforementioned 1968, you have to go all the way to Greg Maddux in 1994 at #48 all time to find a season after 1920 on the list. Barely 10 of the 100 lowest single-season ERAs in MLB history occurred after 1920. And that's only because Pedro Martinez in 2000 and Ron Guidry in 1978 tied with 9 others for #100 on the list. So only 8 of the best single-season ERAs happened after 1920.
You need to research "dead ball era", and the response by baseball to "Black Sox". (Hint: just like the response to the 1994 strike, it involves the ball...)
The fact that you got a +5 out of such a demonstrably incorrect post is a major indictment of the baseball knowledge of the Slashdot faithful.