Alternate Baseball Universes
Jamie found a NYTimes op-ed by a grad student and a professor from Cornell, outlining some research they did into alternate baseball universes. The goal was to find out just how unlikely Joe DiMaggio's 56-game hitting streak, set in the 1941 season, really was. No one since has even come close to that record. The math guys ran 10,000 simulations of the entire history of baseball from 1885 on. For each simulation they put each player up to the plate for each at-bat in each game in each year, just like it happened, and they rolled the dice on him, based on his actual hitting stats for that season. (Their algorithm sounds far simpler than whatever the Strat-O-Matic guys use.) The result: Joltin' Joe's record is not merely likely, it's basically a sure thing. Every alternate universe produced a streak of 39 games or better; one reached 109 games. Joe DiMaggio was not the likeliest player in the history of the game to accomplish the record, not by a long shot.
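To make the dice-rolling idea concrete, here is a minimal sketch of one player-season simulated many times over. It is not the researchers' actual code: the function names, the fixed number of at-bats per game, and the sample inputs (a .350 hitter, 4 at-bats a game, 140 games) are all illustrative assumptions, and every at-bat is treated as an independent coin flip weighted by the season batting average.

```python
import random

def game_hit_probability(batting_avg, at_bats_per_game):
    # Chance of at least one hit in a game, assuming independent at-bats.
    return 1.0 - (1.0 - batting_avg) ** at_bats_per_game

def longest_streak(p_hit_game, n_games, rng):
    # Longest run of consecutive games with a hit in one simulated season.
    best = current = 0
    for _ in range(n_games):
        if rng.random() < p_hit_game:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

def simulate_seasons(batting_avg, at_bats_per_game, n_games, n_seasons, seed=0):
    # Replay the same season n_seasons times with a seeded pseudo-random source.
    rng = random.Random(seed)
    p = game_hit_probability(batting_avg, at_bats_per_game)
    return [longest_streak(p, n_games, rng) for _ in range(n_seasons)]

# Illustrative inputs only, not any real player's stat line.
streaks = simulate_seasons(0.350, 4, 140, 1000)
print("longest streak across 1000 simulated seasons:", max(streaks))
```

Extending this to every player in every season, and taking the longest streak in each simulated history, would give the kind of "alternate universe" tally the op-ed describes.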
I know the statisticians among you are going to bash me with a cluestick for such a naive question, but I'll ask anyway - if this event is so likely to occur, then why hasn't it happened again?
We all know what to do, but we don't know how to get re-elected once we have done it
This doesn't take into account that once a player achieves an impressive hit streak he gets more media attention, people start asking him about DiMaggio's record, and every time he steps up to the plate he's a bit more nervous about it than the last time, making it slightly less likely that he'll get a hit.
The global economy is a great thing until you feel it locally.
unfortunately, not many of my comments are insightful, so with my batting average, you will have to refer to a parallel universe
there you will find that this comment contains something worthwhile reading. sorry
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
One of the key points mentioned in this article is when the long hitting streaks occur. They mention that a streak was much more likely to occur during the early 1900s, which is known as the dead-ball era. The baseball wasn't as springy and they tended to use the same ball during the entire game. During that time it was more efficient to try to knock the ball through the gaps between the fielders and get a single or double than to try to hit it out of the park.
I think it would be more impressive to take a subset of the data, and compare from 1930 up until the present. Of course, there have been other major changes too: glove sizes, the introduction of the slider as a pitch, steroid use.
From reading the article (which is light on the details) it seems like they used nothing but batting average, at bats, and games played.
The problem is this doesn't control for variance in the quality of pitching. The chance of going that many games without running into a hot pitcher isn't accounted for.
Imagine you average a 75% chance of getting a hit in any individual game. If you face three average pitchers, your chances of a hit in all three games are (.75)^3, but if you face a good pitcher, an average pitcher, and a bad pitcher it might be (.5)(.75)(1.0), which gives a different probability despite the same average chance of a hit per game.
In order to be realistic the calculation would need to account for the deviation from average in the ability of the pitchers (which would likely be higher 100 years ago because of fewer players and segregation, and now because of expansion, as compared to the 1950s).
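The two products above are easy to check. This is only a quick sketch of the commenter's three-game example; the list names are made up for illustration.

```python
from math import prod

# Per-game chance of getting a hit against each of three pitchers.
uniform = [0.75, 0.75, 0.75]   # three average pitchers
mixed = [0.50, 0.75, 1.00]     # one good, one average, one bad pitcher

# Both lineups average 0.75 per game, but the chance of a hit in all three differs.
print(sum(uniform) / 3, prod(uniform))   # average 0.75, product 0.421875
print(sum(mixed) / 3, prod(mixed))       # average 0.75, product 0.375
```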
What they don't report is how often there are long (but not record) streaks in their model, so there is no way of knowing how accurately it reproduces reality.
From the descriptions I've seen of their research, it seems that they're treating all games identically for the purpose of determining a typical season's behavior. While this may be necessary to make the computation tractable, it's not realistic, and introduces a sizable bias towards long hitting streaks.
In reality, a league is typically very imbalanced from team to team and from pitcher to pitcher (probably even more so in the game of the early 20th century than now). It's easier to get hits off of two successive average pitchers than it is to get hits both off of a very good and a very bad pitcher. For example (to oversimplify a good deal):
Say the league is split 50/50 between "good" pitchers (pitchers you'll get a hit off of 50% of games) and "bad" pitchers (pitchers you'll get a hit off of 80% of games). In a typical 20 game stretch, you'll encounter 10 good pitchers and 10 bad ones, and your odds of getting a hit in all 20 games would be (0.50)^10(0.80)^10, about 1/9537.
Under their analysis as I understand it, they'd replace all the pitchers by mediocre pitchers who you'd get a hit off of 65% of the time, and your odds would be (0.65)^20, about 1/5517.
This one assumption almost doubled your chances of getting a hit in all 20 games.
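For anyone who wants to check the arithmetic, a short sketch using the same made-up league (exactly 10 "good" and 10 "bad" pitchers in the 20-game stretch); the variable names are just for illustration.

```python
# Chance of a hit in all 20 games against exactly 10 good and 10 bad pitchers.
mixed = 0.50 ** 10 * 0.80 ** 10

# Same stretch with every pitcher replaced by a 65% "average" pitcher.
averaged = 0.65 ** 20

print("mixed:    1 in", round(1 / mixed))
print("averaged: 1 in", round(1 / averaged))
print("averaging inflates the odds by a factor of", round(averaged / mixed, 2))
```

The ratio comes out to roughly 1.73.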
There are other biases as well going the other way (ignoring the effect of hitting slumps, for example), but this one jumped out at me.
You don't understand. Baseball is so boring, the fans find the statistics exciting!
I was once at a friend's BBQ and a lot of the other guests were really into sports and talking a lot about their various sporting events etc. I made a comment about how baseball was one of those sports that is fun to play but boring as hell to watch. One of the guys responded with, simply, "I disagree". To which I replied "You're right. It's pretty boring to play too." He wasn't very amused.
:(
Talk about a great way to make an awkward social event even more awkward
In every simulation, a ground ball went between Bill Buckner's legs in the 1986 World Series.
-- Of course I'm paranoid. I'm a sysadmin.
This seems relevant:
http://abcnews.go.com/Technology/WhosCounting/story?id=3694104&page=1
Disclaimer: I'm not an American, so I know next to nothing about baseball - and care less!
Computers do not actually generate random numbers
That'll be a surprise to the multiple true random number generators built into most operating systems. There are many sources of random data in a computer: timing between keystrokes, timing of mouse movements, network latency between packets, and of course hardware random number generators that use thermal noise as their source.
So to put it mildly, computers can, and DO generate truly random numbers that are completely unpredictable and free from bias.
(Oh, BTW, to do a Monte-Carlo simulation (which the referenced article is) you actually don't need true random numbers, you only need a pseudo-random source that's free from bias. Those pseudo-random sources do exist, and aren't even that difficult to code.)
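A small sketch of that last point, using Python's standard library rather than anything from the article: random.Random is a seeded pseudo-random generator (Mersenne Twister), while random.SystemRandom draws from the operating system's entropy source. The function name and the 65% per-game figure are illustrative assumptions.

```python
import random

def estimate_hit_rate(p_hit, n_trials, rng):
    # Monte Carlo estimate of a known probability, to check the source for bias.
    hits = sum(1 for _ in range(n_trials) if rng.random() < p_hit)
    return hits / n_trials

# Seeded pseudo-random source: fully deterministic, so runs are repeatable.
pseudo_a = estimate_hit_rate(0.65, 100_000, random.Random(42))
pseudo_b = estimate_hit_rate(0.65, 100_000, random.Random(42))

# OS-backed source: not repeatable, but it should land on the same answer.
system = estimate_hit_rate(0.65, 100_000, random.SystemRandom())

print(pseudo_a, pseudo_b, system)
```

Both sources should land close to 0.65; the seeded one has the added benefit that a simulation can be rerun and checked.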
AccountKiller