Slashdot Posting Bug Infuriates Haggard Admins
Last night we crossed over 16,777,216 comments in the database. The wise amongst you might note that this number is 2^24, or in MySQLese an unsigned mediumint. Unfortunately, like 5 years ago we changed our primary keys in the comment table to unsigned int (32 bits, or 4.1 billion) but neglected to change the index that handles parents. We're awesome! Fixing is a simple ALTER TABLE statement... but on a table that is 16 million rows long, our system will take 3+ hours to do it, during which time there can be no posting. So today, we're disabling threading and will enable it again later tonight. Sorry for the inconvenience. We shall flog ourselves appropriately. Update: 11/10 12:52 GMT by J : It's fixed.
Taken from http://franksworld.com/blog/archive/2005/01/04/600 .aspx
Chapter 2: Destructional Patterns
2.4 Detonator
The Detonator is extremely common, but often undetected. A common
example is the calculations based on a 2 digit year field. This bomb
is out there, and waiting to explode!
Some of you are asking which comment it was that got the cid 16,777,216. The answer is that none did. For redundancy, Slashdot is now running multiple-master replication which skips values for auto-increment. Our db-1 assigns odd-numbered primary key IDs, and db-2 assigns even-numbered. Right now writes are going to db-1 so newly created rows will have only odd IDs.
The comment that got 2**24-1 was this one, if anyone cares :)
Sorry about the inconvenience, everyone.
I used to work at Comair. Remember, that airline that stranded about 10,000 people in the airport a couple of Christmases ago? Same deal. Program was capable of handling only a certain number of changes. Hopefully your president won't have to resign.
Actually, comment 16,777,217 couldn't break it, because that comment's parent cid could have only been 16,777,215. Up until then, there wouldn't have been an overflow value put into the db.
There's no telling which comment it is, because (16,777,217 + 2n) might not have been a reply, meaning it would come up correctly.
:(){
Reply to comment number 16786251:
...but it's probably not zero-referenced. Typically, ids in SQL start at 1.
``You claim that the 16,777,216th comment would have broke it but I contest that actually the 16,777,217th comment poster would be the culprit. Since it should be able to handle that many comments if it is zero referenced''
Please correct me if I got my facts wrong.
http://slashdot.org/~themusicgod1/journal/137880 ...ok, so it was obvious...
poot_rootbeer asks why all the comments are in one table, when the data access pattern is such that 90% of our hits are on only the most recent entries in that table.
The answer is that we used to do it this way but it's a huge pain. In 2000 we converted from having two tables for 'stories', recent and archived, and merged them together. The performance hit was not big, and it made the code so much simpler it was a no-brainer.
It's the database's job to cache properly whether we split the table or not, and the database does that just fine. The only performance problem could be when there is a rush of inserts, or updates to the same sets of rows, spanning both newer and older portions of the table, and that just doesn't happen.
If we did want to do this we wouldn't split the tables manually; the code complexity is too high a price to pay. In MySQL 5.0 we would use a MERGE engine, which has issues of its own but would involve smaller changes to our code. That's still not worth it for us. What we're probably going to do is wait for MySQL 5.1 to get out of beta and then do some performance testing on tables partitioned by date and see if that gains us anything. For example, a SELECT on our comments table could be limited with a WHERE clause to only retrieve rows with a date >= the discussion object's date, which for 90% of our queries MySQL 5.1 could optimize to only look at the most recent partition. If the gains turn out to be significant, then since partitioning involves very limited code changes, we'll probably do that. Generally speaking, though, database performance is not a problem for us. So far our main bottlenecks have been CPU and RAM on the webheads. As long as we don't do anything stupid our database performance has been fine, though, as today proves, we are quite capable of being stupid.
[ Parent ]
Ah, MySQL. Where trying to insert a row with a column value larger than the column can actually store results in MySQL clipping it to the max value.
Fortunately, as of MySQL 5, you can fix this problem.
So, yes, early versions of MySQL had a brain-dead default SQL mode that simply "corrected" invalid column values, but MySQL 5 fixes this.
Now if only they would add column constraints...
You are in a maze of twisty little relative jumps, all alike.
I'd really like to see it. I bet it goes something like, "what's this stupid web thingy anyway? I bet it'll never make it to version 2.0..."
The Kai's Semi-Updated Website Thingy
Sorry, I think index was a bad choice of words.
Whenever a post is made and it has a parent cid, that number must be stored in the table.
If MySQL saturates instead of rolling over (see this comment), then all replies after comment 16,777,215 will have the wrong parent cid, and I don't think there's any way to fix it.
:(){
I guess this is another thing to add to the MySQL gotchas page. Of course, in a decent database engine, like PostgreSQL, if you alter a column data type then the indexes are updated to reflect this.
SIGSEGV caught, terminating
wait... not that kind of sig.
Before they did the fix the parents were pointing at 16777215 (which is why clicking parent displayed a dodgy template.
..... where cid=yourparent and sid=yourstory"
The reason why the thread for comment 16777215 is not massive is because the comments now use a double key, ParentID(cid) and StoryID (sid) - displaying a thread is now "select
liqbase