And the Pulitzer Prize For SQL Reporting Goes To... (padjo.org)
theodp writes: Over at the Stanford Computational Journalism Lab, Dan Nguyen's Exploring the Wall Street Journal's Pulitzer-Winning Medicare Investigation with SQL is a pretty epic post on how one can use SQL to learn about Medicare data and controversial practices in Medicare billing, giving the reader a better appreciation for what was involved in the WSJ's Medicare Unmasked data investigation. So, how long until a journalist wins a Pulitzer for SQL reporting? And for all you amateur and professional Data Scientists, what data would you want to SELECT if you were a Pulitzer-seeking reporter?
Little Bobby Tables!
Wait, where did the award go...?
And for all you amateur and professional Data Scientists, what data would you want to SELECT if you were a Pulitzer-seeking reporter?
SELECT convert_style(story, MY_WRITING_STYLE) FROM all_the_stories WHERE interest_score >= PULITZER_LEVEL;
Though I'd probably put a LIMIT on there so I don't publish too many Pulitzer winning stories at once.
I can certainly appreciate a well-written, complex piece of SQL. Writing major summary reports in SQL can be unbelievably complex. However, a query doesn't need to be complex in order to impress me. It just has to answer the correct question. That is particularly true when querying a data warehouse, where it is all about getting sums and averages over time periods, right? You take those results, throw them into a crosstab engine, and start spitting out charts and looking for trends. Then you can start to see the anomalous ones.
An award-winning SELECT statement, in my opinion, would simply be one that asks an insightful question.
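For anyone who wants to see the "sums and averages over time periods, then crosstab, then look for anomalies" workflow in miniature, here is a rough sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the `billing` table, its columns, the toy rows, and the 2x year-over-year threshold are all invented, and none of it comes from the Medicare data or the tutorial.

```python
import sqlite3

# Hypothetical toy data: (provider, bill_year, amount). Not real Medicare figures.
ROWS = [
    ("A", 2012, 100.0), ("A", 2012, 120.0), ("A", 2013, 110.0),
    ("B", 2012, 90.0),  ("B", 2012, 95.0),  ("B", 2013, 105.0),
    ("C", 2012, 100.0), ("C", 2013, 900.0), ("C", 2013, 700.0),
]

def yearly_crosstab(rows):
    """Return one row per provider: (provider, total_2012, total_2013, avg_claim)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE billing (provider TEXT, bill_year INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO billing VALUES (?, ?, ?)", rows)
    # Conditional aggregation pivots the years into columns (a poor man's crosstab).
    query = """
        SELECT provider,
               SUM(CASE WHEN bill_year = 2012 THEN amount ELSE 0 END) AS total_2012,
               SUM(CASE WHEN bill_year = 2013 THEN amount ELSE 0 END) AS total_2013,
               AVG(amount) AS avg_claim
        FROM billing
        GROUP BY provider
        ORDER BY provider
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

def flag_jumps(crosstab, ratio=2.0):
    """List providers whose 2013 total is at least `ratio` times their 2012 total."""
    return [p for p, t12, t13, _ in crosstab if t12 > 0 and t13 / t12 >= ratio]

CROSSTAB = yearly_crosstab(ROWS)
FLAGGED = flag_jumps(CROSSTAB)

if __name__ == "__main__":
    for row in CROSSTAB:
        print(row)
    print("flagged:", FLAGGED)
```

The flag only says "look here"; it doesn't say why provider C jumped, which is where the insightful question comes in.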
The SQL tutorial looks at the numbers but doesn't emphasize two kind of glaring omissions in the WSJ article:
a) Dr. Weaver is charging for a procedure _labeled_ 'cardiac', but there is no mention of what the procedure is, its relevance to cardiology (if the label is accurate), or its relevance to internal medicine (Dr. Weaver's _labeled_ current specialty). For all we know, Dr. Weaver is an ex-cardiologist, now practicing internal medicine, who has found this procedure to be extremely useful for the patients he treats. For all we know, the procedure was mislabeled (esp. since it is pointed out that the data is noisy, incl. spelling errors, multiple labels for the same thing, etc.)
b) At one point, Dr. Weaver's _statistical_ use of the procedure (99.5%) is compared to a raw numerical value (6) by Cleveland Clinic cardiologists. For all we know, the clinic cardiologists only saw 6 patients for whom the procedure was relevant, or they never use the procedure because they have other more relevant/current techniques, or patients who are seen by the clinic are at a point where the procedure isn't required.
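To make point b) concrete, here is a small sketch (Python's sqlite3 again) that reports the raw count and the percentage side by side for every provider, so like is compared with like. The `claims` table, the provider names, the procedure code `proc_x`, and every number are hypothetical and chosen only to show the effect; they are not taken from the WSJ article or the Medicare data.

```python
import sqlite3

# Hypothetical toy rows: (provider, proc_code, patients). Invented for illustration.
ROWS = [
    ("solo_doc", "proc_x", 199), ("solo_doc", "other", 1),
    ("big_clinic", "proc_x", 6), ("big_clinic", "other", 594),
    ("small_clinic", "proc_x", 6), ("small_clinic", "other", 2),
]

def count_and_share(rows, proc_code):
    """Return (provider, proc_patients, all_patients, share_pct) per provider."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims (provider TEXT, proc_code TEXT, patients INTEGER)"
    )
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)
    # Report the numerator, the denominator, and the percentage together,
    # so a raw count is never compared against someone else's percentage.
    query = """
        SELECT provider,
               SUM(CASE WHEN proc_code = ? THEN patients ELSE 0 END) AS proc_patients,
               SUM(patients) AS all_patients,
               ROUND(100.0 * SUM(CASE WHEN proc_code = ? THEN patients ELSE 0 END)
                     / SUM(patients), 1) AS share_pct
        FROM claims
        GROUP BY provider
        ORDER BY provider
    """
    result = conn.execute(query, (proc_code, proc_code)).fetchall()
    conn.close()
    return result

SHARES = count_and_share(ROWS, "proc_x")

if __name__ == "__main__":
    for row in SHARES:
        print(row)
```

In this toy data the same raw count of 6 is 1% of one clinic's patients and 75% of another's, which is why a count without its denominator tells you very little.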
While the SQL tutorial is an interesting look at how to verify the accuracy of the statistics in an article, it tacitly validates what is still poor reporting, i.e., the statistics need explanation and validation beyond simple numbers.
If you assume that most people are pretty honest (statistically they are), then the SQL queries are a neat way to highlight that the billing system (not the practitioners) is in need of a second or third look.
"Consensus" in science is _always_ a political construct.