Slashdot Mirror


How Facebook Keeps Messenger From Crashing On New Year's Eve (ieee.org)

Wave723 quotes IEEE Spectrum: On New Year's Eve, millions of people will use Facebook's Messenger app to wish friends and family a 'Happy New Year!' If everything goes smoothly, those messages will reach recipients in fewer than 100 milliseconds, and life will go on. But if the service stalls or fails, a small team of software engineers based in the company's New York City office will have to answer for it.
The article says the team "tested and tweaked the app throughout the year and will soon face their biggest annual performance exam," since Messenger's 1.3 billion monthly active users send more messages on New Year's Eve than any other day of the year. Many of them hit "send" at the exact moment when their clock strikes midnight, "and people often try to resend messages that don't appear to make it through right away, which piles on more requests."

The solution appears to be load testing, re-directing traffic, message batching, and discarding "read receipts" and temporarily disabling other minor Facebook functions -- or, more generally, what their engineering manager describes as "graceful degradation."
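
As a rough illustration of the "graceful degradation" idea, here is a minimal sketch of priority-based load shedding. The request tiers, the `admit` function, and the 70%/85% thresholds are all assumptions made for the example; the summary does not describe how Facebook actually implements this.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Lower value = more important. These tiers are illustrative assumptions.
    MESSAGE = 0        # the chat message itself
    READ_RECEIPT = 1   # "seen" indicator
    MINOR_FEATURE = 2  # other non-essential extras


@dataclass
class Request:
    kind: Priority
    payload: str


def admit(request: Request, load: float) -> bool:
    """Decide whether to process a request at the given load (0.0 to 1.0).

    Hypothetical thresholds: above 70% load, minor features are dropped;
    above 85%, read receipts are dropped too. Messages are always admitted.
    """
    if request.kind == Priority.MESSAGE:
        return True
    if request.kind == Priority.READ_RECEIPT:
        return load < 0.85
    return load < 0.70


if __name__ == "__main__":
    for load in (0.50, 0.75, 0.90):
        kept = [p.name for p in Priority if admit(Request(p, ""), load)]
        print(f"load={load:.2f} keeps {kept}")
```

The point of the design is that the core function (delivering the message) never competes for capacity with features users can live without for an hour.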

69 comments

  1. So... by Anonymous Coward · · Score: 2

    "The solution appears to be ..." Stuff we've known since 1999?

    1. Re: So... by Anonymous Coward · · Score: 1

      Release the hounds, Smithers

    2. Re: So... by edris90 · · Score: 1

      They could gain some humanitarian PR simply by announcing that Facebook Messenger is shutting down temporarily to prompt people to pay attention to the real people they are spending time with in person. It saves a buttload of money, avoids a shit ton of embarrassment, and makes them look like they care about people's psychological health.

    3. Re:So... by jrumney · · Score: 1

      I find it hilarious that the fix for overloading caused in part by people resending messages which don't appear to have gone through, is to discard read receipts. Did anyone in the New York office that is going to be held responsible for the inevitable outage think this through? I wish you luck, Facebook.

  2. Facebook's New Slogan by Anonymous Coward · · Score: 1

    That should be Facebook's new corporate slogan: "Graceful Degradation."

  3. Simple by Sebby · · Score: 3, Insightful

    How Facebook Keeps Messenger From Crashing On New Year's Eve

    Simple, do awful things that will make people avoid using any of your services.

    --

    AC comments get piped to /dev/null
    1. Re:Simple by KiloByte · · Score: 1

      Simple, do awful things that will make people avoid using any of your services.

      Facebook has been trying that for years; doesn't help.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    2. Re: Simple by Anonymous Coward · · Score: 0

      Software engineers in New York City - please - how difficult is it to press a few keys. Good lord I literally weep for Mark Zuckerberg, of the little silicon valley firm of Facebook

  4. How glib by SuperKendall · · Score: 4, Insightful

    "The solution appears to be ..." Stuff we've known since 1999?

    It's one thing to say you know how to do it...

    Quite another when literally BILLIONS of people are using your services all at once - especially around NYE where it's not even spread through the day, it's a huge DDOS equivalent with billions of messages at midnight exactly...

    Planning for that kind of load and super-extreme bursting is not easy, at all. No matter how much you "know".

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re:How glib by Anonymous Coward · · Score: 0

      You do realise that those billions of users are actually spread out over multiple timezones, so midnight isn't the same time for all of them

    2. Re: How glib by datavirtue · · Score: 1

      Yeah, it's called money. All you need is money.

      --
      I object to power without constructive purpose. --Spock
    3. Re:How glib by KiloByte · · Score: 1

      The vast majority of messages avoid that peak: hardly anyone waits for the exact midnight to send a message. So the load gets smeared onto quite a chunk of time.

      The engineering problem boils down to: send short messages between pairs of arbitrary sources and destinations (although usually the source and destination are close to each other), with message size usually within 50-100 bytes. Let's be generous and say that with metadata they fit within 1500 bytes. Hmm... I wonder, have we seen such a problem before?

      Let's estimate the flow: after everyone raises the toast, exchanges hugs and kisses, says greetings, then sits down with the phone -- sending, let's say, 10 messages. This should take around half an hour. You get 300K messages per second. Not so impressive...
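
The comment does not list its inputs, but one set of assumptions that reproduces the 300K figure is the summary's 1.3 billion monthly active users, 10 messages each, a 30-minute window, and an even split across 24 time zones. The even split in particular is an assumption of this sketch, not something the commenter states.

```python
# Back-of-the-envelope check of the "300K messages per second" estimate.
MONTHLY_ACTIVE_USERS = 1.3e9   # from the article summary
MESSAGES_PER_USER = 10         # from the comment
WINDOW_SECONDS = 30 * 60       # "around half an hour"
TIME_ZONES = 24                # assumption: users evenly split across zones


def peak_rate(users: float, per_user: int, window_s: int, zones: int) -> float:
    """Average messages per second within one time zone's post-midnight window."""
    return users * per_user / window_s / zones


rate = peak_rate(MONTHLY_ACTIVE_USERS, MESSAGES_PER_USER, WINDOW_SECONDS, TIME_ZONES)
print(f"{rate:,.0f} messages per second")  # prints "300,926 messages per second"
```

Without the division by 24, the same inputs give roughly 7.2 million messages per second, so the estimate leans heavily on the time-zone assumption.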

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    4. Re:How glib by Anonymous Coward · · Score: 0

      Well, pasting the traditional clichés about it for karma whoring doesn't exactly make you knowledgeable about it either.

    5. Re:How glib by Anonymous Coward · · Score: 0

      Well, Karma whoring is the thing these days.

      So obviously the numbers work for bandwidth, timing of delivery, etc.
      What happens when it's NOT New Year's Eve? Devs work all year long and they reuse/copy pasta stuff from these messaging systems because they are the most robust and have been code reviewed by these core teams.

      So I usually ask, ok, how does this work at all these other times? And the answer is pretty simple: If one piece of software and hardware, along with maintenance devs, can handle NYE, then that same or similar combination of HW/SW/Devs can handle any messaging bandwidth problem throughout the year.

      That is how facebook thinks, at least the guys I know in NYC.

      Lastly, if you don't want all that specialized code in every project, how and when do you decide whether or not to include it? Who is good at deciding that do you suppose? And the biggest question is: Do you make that decision before you deploy the first instance or afterward?

      Food for thought?

    6. Re:How glib by Anonymous Coward · · Score: 1

      One DDoS per hour then

    7. Re: How glib by Anonymous Coward · · Score: 0

      He probably realizes that the earth is round and rotates if that is what you are asking

    8. Re: How glib by Anonymous Coward · · Score: 0

      Don't remember ever having this problem with SMS. Maybe people shouldn't be so stupid and should ditch FB and go back to the original services that still work better, and don't sell their personal data. Maybe this article's just slashvertising for FB?

    9. Re: How glib by donstenk · · Score: 1

      Actually, it _is_ spread throughout the day. It's not NYE at the same time everywhere in the world ;-)

      --
      Dennis Onstenk
    10. Re: How glib by Anonymous Coward · · Score: 0

      The problem exists for SMS as well. Last year my wife sent a batch message to about 20 people, including me, right at midnight. I was standing next to her, it took 10 minutes for me to get the SMS. Texts from other friends continued to trickle in for about an hour. At 4am I got a flurry of about 10 texts from a buddy who was in Vegas, we were in the same timezone and he'd gone back to his hotel and passed out by 2am.

    11. Re: How glib by Anonymous Coward · · Score: 0

      Why are you texting people who are standing right next to you? Sounds suspect as fuck.

    12. Re:How glib by Anonymous Coward · · Score: 0

      The vast majority of messages avoid that peak: hardly anyone waits for the exact midnight to send a message. So the load gets smeared onto quite a chunk of time.

      That is the easy part. Of course you can calculate load based on that scenario.

      The engineering problem boils down to: send short messages between pairs of arbitrary sources and destinations (although usually the source and destination are close to each other), with message size usually within 50-100 bytes. Let's be generous and say that with metadata they fit within 1500 bytes. Hmm... I wonder, have we seen such a problem before?

      The other problem, besides coordinating all that, is that since you have these short message codes (HNY for Happy New Year), you have to repurpose these. How many are there? Since NYE only happens occasionally, I bet nobody knows exactly how many are used and how many are optimal.

      Let's estimate the flow: after everyone raises the toast, exchanges hugs and kisses, says greetings, then sits down with the phone -- sending, let's say, 10 messages. This should take around half an hour. You get 300K messages per second. Not so impressive...

      How long would it take if one of those FB-generated movies about you and your friend were sent? I bet that would be an interesting challenge.

    13. Re:How glib by KiloByte · · Score: 1

      How long would it take if one of those FB-generated movies about you and your friend were sent?

      I'm pretty sure there's not a single device with Fecesbook not blocked on multiple levels within ten meters from my current position. On the other hand, what friend? I spent the NYE fitting a dual-slot external-power-needed graphics card into a board on which the PCIe x4 slot takes most of the board's length -- the stereotypes about our kind won't reinforce themselves :p

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    14. Re: How glib by Anonymous Coward · · Score: 0

      That's the magic of Facebook dontcha know? :)

  5. Just don't by jwhyche · · Score: 2

    Want to keep it from crashing on New Year's Eve? Just to load the damn thing. There, simple. Problem solved.

    --
    I read at +2. If your post doesn't reach that level I will not see or respond to it.
    1. Re:Just don't by Anonymous Coward · · Score: 0

      learn how to type dipshit.

  6. "graceful degradation." by grep+-v+'.*'+* · · Score: 1

    what their engineering manager describes as "graceful degradation."

    If they'd just use SystemD their problems would be solved! For that matter though, I wish FaceBook would gracefully degrade to /dev/null.

    Good luck to them though, it's a good engineering textbook problem. Stupid, yet necessary. (We have specific peak load times because we just do. Same thing with water supply and SuperBowl breaks, or 8AM/5PM rush-hour traffic.)

    FB should also offer a "delivery within 100ms or your money back!" guarantee. See? The timestamp says it was _delivered_ to _our_ servers in 100ms; it's not OUR fault that the carrier couldn't get thru ... for a week, it's that missing Net Neutrality thing that routinely hits and throttles NetFlix. Yeah, that's the ticket.

    OTOH they could use one of the internet broadcast functions -- "Happy New Year" simulcast everywhere. And actually, I bet they've got an embedded HNY compression bit somewhere to slightly lessen the transfer load, the same for a few other extremely common phrases.

    --
    If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
    1. Re:"graceful degradation." by Anonymous Coward · · Score: 0

      [...] And actually, I bet they've got an embedded HNY compression bit somewhere to slightly lessen the transfer load, the same for a few other extremely common phrases.

      HNY also stands for "honeytrap". Watch out for those.

  7. OH yea by Anonymous Coward · · Score: 0

    These boys are clever

  8. Loadbalancing... by shabble · · Score: 4, Funny

    ... couldn't they simply split their users up into, say, 24 groups, and reduce the load that way?

    1. Re:Loadbalancing... by mermeid007 · · Score: 0

      I'm familiar with this problem because I've dealt with it before where you get these crazy big spikes all at one time. The math is what they use to solve it, often with custom web engines. The math underneath the math is what matters. If you set it up right, do some calculations, discuss with experts, and then cross your fingers it can work, but only for the best and bravest internet firms. One thing not to forget is that as the HNY messages slam the network over the course of three hours in the US, there will be a whole hell of a lot of ads. It would be a lot smarter for them to test this tonight and tomorrow night, but I imagine their vendors are not as sophisticated as they are about testing and verifying etc.

    2. Re:Loadbalancing... by Anonymous Coward · · Score: 0

      That’s pretty much already handled by the fact that people tend to live in different time zones. Maybe you were joking, it wasn’t clear.

      Of course, users are not evenly distributed into those time zones, but at least the whole set of users aren’t all trying to post simultaneously.

    3. Re:Loadbalancing... by sj26 · · Score: 1
      https://infiniteundo.com/post/...

      56. There are only 24 time zones

  9. and Erlang: by Anonymous Coward · · Score: 0

    A modified version of ejabberd, including what they got through the Whatsapp acquisition. Guess PHP/C++ never solved all of Facebook's problems.

  10. translation: by Lehk228 · · Score: 1

    facebook messenger is brittle, poorly tuned garbage that cannot handle an ordinary upsurge in human use. AIM never had to be reengineered to survive new years eve without crashing, and it wasn't really all that good; it just wasn't a flaming heap of shit

    --
    Snowden and Manning are heroes.
    1. Re: translation: by Anonymous Coward · · Score: 0

      It is true AIM never had to be re-engineered, but the AIM devs never understood the capacity of their platform, so instead of leapfrogging the competition they remained a cult product for the faithful.

      I think if you have little experience in marketing and software, you would likely vastly underestimate the advertising demand during this period

    2. Re:translation: by Actually,+I+do+RTFA · · Score: 1

      FB Messenger is harder to handle loads, because they need to run machine learning on all the messages to build better profiles of their users.

      --
      Your ad here. Ask me how!
    3. Re:translation: by helpfulcorn · · Score: 1

      No, it just crashed at other points. The service where buddies were stored (feedbag) had a disastrous set of databases and there were times they all just went down *or* the whiscer (and others, essentially presence) services went down, and no buddies would show, and so people would sign off and on repeatedly trying to fix it slowing down the entire thing. It was all held together with gun tape and faith, at least in the late 90s and very early 2000s.

      There was never a test like that for AIM because people weren't on their PCs at home during new year typically and mobile IM was not that huge. I can assure you it was a flaming heap of shit of a ridiculous amount of services that somehow managed to work most of the time. AIM probably could have never gotten much bigger than it did at its height and surely never could have handled Facebook level traffic. Facebook is just slightly less poorly engineered.

  11. By stealing dimes from the Elves by Anonymous Coward · · Score: 1

    By stealing dimes from the Elves?

  12. Money doesn't (fully) help by SuperKendall · · Score: 1

    Yeah, it's called money. All you need is money.

    How many years did we all have to suffer through Twitter Fail Whales while they were flush with cash?

    There are plenty of examples of giant well funded enterprises with websites that utterly suck and can handle just about no load - especially if you look at websites where tech is secondary, any kind of unexpected load and BAM they are usually down.

    Money can indeed help to buy the servers you may really need to handle load. Money can even help hire the people that understand how to handle load.

    But money does not ENSURE you will have either thing.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re: Money doesn't (fully) help by Anonymous Coward · · Score: 0

      Yeah that was also in its infancy. Facebook used to shit the bed all the time too. It probably wasn't until after I stopped using it in 2011 that I heard it wasn't a hot pile of shit.

      They also went public around then... huh...

    2. Re: Money doesn't (fully) help by Anonymous Coward · · Score: 0

      This is true. I also never send text messages - I always send 10-12 anime vampires I drew myself. I always put them in a sci-fi setting and put a celebrity in the picture as kind of a Where's Waldo thing. But I'm weird so maybe that won't impact the bandwidth.

  13. Everything old is new again by Anonymous Coward · · Score: 0

    I remember my father dealing with the same issue (everyone wants to communicate at midnight) for New York Telephone half a century ago.

    1. Re:Everything old is new again by AndyKron · · Score: 1

      I still have an old rotary phone hanging in my basement complete with a Mr. Yuk sticker.

  14. The solution by Anonymous Coward · · Score: 0

    I thought that facebook would remedy the problem by censoring half of them.

  15. I don't understand by AndyKron · · Score: 1

    There's something like 26 midnights (timezones) around the world. Where's the problem?

    1. Re: I don't understand by Anonymous Coward · · Score: 0

      It is a good point. The International Date Line is in the Pacific, so the first country of any size will be Japan proper and then so on. This is a big problem for WhatsApp as well because they are not built to scale. They are mostly about encryption and group chats.

    2. Re:I don't understand by Anonymous Coward · · Score: 0

      Lies! There is only the big crystal New York ballsack!

    3. Re: I don't understand by Anonymous Coward · · Score: 0

      So what is your estimate of the number of FB users in the ET timezone? I would not be surprised if it is over 200 million.

  16. Think again, your numbers are absurdly low by SuperKendall · · Score: 2

    The vast majority of messages avoid that peak: hardly anyone waits for the exact midnight to send a message. So the load gets smeared onto quite a chunk of time.

    Look around you at the next NYE party and you will see just how wrong you are. Most people queue them up ahead of time and lots of people are hitting Send as the ball drops... (hint to devs, if someone has typed a partial message transmit that to the server in case they come back and hit send later - course Facebook was just screwed by that recently when it was found they had cached images on the server from never sent messages...).

    At least it is spread across time zones but that is still a LOT of people, especially from the U.S. coasts.

    The engineering problem boils down to: send short messages between pairs of arbitrary sources and destinations (although usually the source and destination are close to each other), with message size usually within 50-100 bytes. Let's be generous and say that with metadata they fit within 1500 bytes

    Come on man, you know that modern web API's are not that compact, and we are talking Facebook here. You are off by an order of magnitude at least, way more when you stop to think that on NYE way more people are sending images also... One single response to a post on Facebook I just did with 14 words had a 9.5kb body going out, and a 21.2 k response.

    Let's estimate the flow: after everyone raises the toast, exchanges hugs and kisses, says greetings, then sits down with the phone -- sending, let's say, 10 messages. This should take around half an hour. You get 300K messages per second. Not so impressive...

    Think MILLIONS, possibly BILLIONS and you might be closer to the mark. On a *normal* day, Messenger and WhatsApp process over 60 billion messages a day... so that is 2.5 billion messages every hour *normally*.

    And that was from 2016. Do you think people send more, or fewer, messages now than then?
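
As a quick check of the figures quoted above, 60 billion messages a day converts to hourly and per-second rates as follows. The input is the commenter's number, which covers Messenger and WhatsApp combined; treating traffic as evenly spread over the day is a simplifying assumption of this sketch.

```python
# Convert the quoted daily volume into average hourly and per-second rates.
MESSAGES_PER_DAY = 60e9        # figure quoted in the comment (2016)
HOURS_PER_DAY = 24
SECONDS_PER_DAY = 24 * 60 * 60

per_hour = MESSAGES_PER_DAY / HOURS_PER_DAY
per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

print(f"{per_hour:,.0f} messages per hour")      # prints "2,500,000,000 messages per hour"
print(f"{per_second:,.0f} messages per second")  # prints "694,444 messages per second"
```

On these numbers the everyday average across both services, about 694K messages per second, is already more than double the 300K-per-second peak estimated earlier in the thread.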

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re:Think again, your numbers are absurdly low by KiloByte · · Score: 1

      Look around you at the next NYE party and you will see just how wrong you are. Most people queue them up ahead of time and lots of people are hitting Send as the ball drops...

      Ouch. At least there are no "smart"phone zombies so bad anywhere near me, neither among low-tech nor high-tech friends.

      hint to devs, if someone has typed a partial message transmit that to the server in case they come back and hit send later

      And that'll speed up that 14 words message... how?

      On a *normal* day, Messenger and WhatsApp process over 60 billion messages a day

      Thanks for the correction, I based my estimates on numbers in the article's summary.

      Come on man, you know that modern web API's are not that compact, and we are talking Facebook here. You are off by an order of magnitude at least, way more when you stop to think that on NYE way more people are sending images also... One single response to a post on Facebook I just did with 14 words had a 9.5kb body going out, and a 21.2 k response.

      I'm talking about the problem to solve not their implementation. Of course Facebook runs a PHP script that runs a bunch of NPM modules to produce a 1.5MB response, but after you cut down that bloat, you can get the same result with orders of magnitude less resources.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    2. Re: Think again, your numbers are absurdly low by Anonymous Coward · · Score: 0

      Well, PHP is more of a liability than a good thing these days.
      Speaking of HW from another post, I would imagine that, although FB is not a HW shop, it is large enough to dictate terms to suppliers. They need hardware that can handle the strain when devs hit deploy code and apply changes.
      I will be interested to see what happens on NYE. I am curious enough now to go to some different sites and see if I can manually push a bunch of messages through.
      Funny thing. A system may look a certain way in the UI but it could have internal components that adjust to changes in user behavior and bandwidth needs.

    3. Re:Think again, your numbers are absurdly low by SuperKendall · · Score: 1

      Ouch. At least there are no "smart"phone zombies so bad anywhere near me, neither among low-tech nor high-tech friends.

      I am highly doubtful the people around you are as pure as you claim.

      And that'll speed up that 14 words message... how?

      Read again about actual message sizes instead of fixating on content size alone. You want to get that traffic up to the server ASAP and send only a trigger signal. Even if it WERE just 14 words it would still be... rather nice to have a few billion 14 word messages already transmitted to the server and ready for destination ahead of time.

      I'm talking about the problem to solve not their implementation.

      Article is literally about Facebook. That is the problem to solve for, given how it is built.

      Of course Facebook runs a PHP script that runs a bunch of NPM modules to produce a 1.5MB response

      Not really.

      --
      "There is more worth loving than we have strength to love." - Brian Jay Stanley
    4. Re:Think again, your numbers are absurdly low by kriston · · Score: 1

      Of course Facebook runs a PHP script that runs a bunch of NPM modules to produce a 1.5MB response

      You honestly don't believe each request to a server spools up a new PHP script instance like it's still 2006, do you?

      --

      Kriston

    5. Re:Think again, your numbers are absurdly low by KiloByte · · Score: 1

      They forked PHP as HHVM, optimized the hell out of it, and do some recompilation to C++, yeah. But you can optimize it only so much.

      PHP is a shithouse -- and I don't mean a building you defecate in, I mean one whose structural material is dried excrement (as still done by some tribes). It might have been adequate for a literal "Personal Home Page" with little traffic, but trying to throw more hardware at it to get it to scale to modern Facebook workloads is a fool's errand. It's kind of like banks running unmanageable COBOL code from 1960 on emulators.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    6. Re:Think again, your numbers are absurdly low by KiloByte · · Score: 1

      Article is literally about Facebook. That is the problem to solve for, given how it is built.

      That's an XY problem -- if the transport has scaling issues, instead of throwing more hardware at it at some point it's good to take a step back and see if there are better approaches. And the core functionality is so simple that replacing just that part while keeping parts of the several-hundreds-of-megabytes-per-phone bloat they insist so much on having intact is a viable proposition.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    7. Re:Think again, your numbers are absurdly low by kriston · · Score: 1

      My point is that every click doesn't spawn a new PHP instance in their architecture unlike regular PHP.

      COBOL is running just fine--its only problem is the lack of human knowledge in the marketplace.

      And, at least with IBM COBOL, today's IBM z/OS runs COBOL programs originally compiled on the System/360 in the 1960s.

      --

      Kriston

  17. WTF by ledow · · Score: 4, Interesting

    Millions of people.. sending a small TCP packet... containing a couple of hundred characters...

    Wow. Gosh. The infrastructure that must take to handle...

    Like... a couple of servers in a rack and a few gigabits of uplink at worst.

    Honestly, has modern technology come to this?

    One single YouTube video probably has more bandwidth, more data transferred, more CPU usage and less latency.

    1. Re:WTF by Anonymous Coward · · Score: 0

      It's all about the PHPs, Pythons and NodeJS'. You old fart boomers just wouldn't understand how interpreted, dynamically-typed languages are in because our salaries are more expensive than CPUs.

  18. WTF Commenters by Anonymous Coward · · Score: 1

    There's no way to reply to all the misinformed commenters here, but I'm really surprised at how naive the majority are. Clearly most of you have never worked with problems at this scale...this is a far more difficult problem to solve than you all think.

  19. It's stupid easy by Anonymous Coward · · Score: 0

    Ever hear of time zones?

    1. Re: It's stupid easy by Anonymous Coward · · Score: 0

      It is more about the population in these time zones and whether they can afford computers; that tells you how many NYE messages they will send

    2. Re: It's stupid easy by Anonymous Coward · · Score: 0

      Why not just send an extra copy of every message every time and modify the recipient code to toss duplicate messages? Everything gets to where it needs to go no fuss just using available bandwidth you might not even have been aware of. I guess worst case you create some unnecessary overhead for the software engineers but why not?
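
The "toss duplicate messages" half of this suggestion can be sketched as below, assuming every message carries a unique ID chosen by the sender. The `Deduplicator` class, its `accept` method, and the bounded memory size are hypothetical names and choices for illustration only, not a description of any real Messenger component.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember recently seen message IDs and reject repeats.

    Memory is bounded: once more than `max_ids` IDs are stored, the oldest
    one is forgotten, so a very late duplicate could slip through.
    """

    def __init__(self, max_ids: int = 10_000) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_ids = max_ids

    def accept(self, message_id: str) -> bool:
        """Return True the first time an ID is seen, False for a duplicate."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True


if __name__ == "__main__":
    dedup = Deduplicator()
    for msg_id in ["m1", "m1", "m2"]:  # "m1" arrives twice
        print(msg_id, "delivered" if dedup.accept(msg_id) else "dropped")
```

The same mechanism also makes user retries harmless, provided a resend reuses the original message ID. The cost of sending every message twice is the doubled traffic during exactly the period when capacity is scarcest.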

  20. for any other company by Anonymous Coward · · Score: 0

    this wouldn't be worthy of an explanation, but i suppose given the shit quality of facebook's code it's nothing short of a miracle. don't believe me? look at what they have on github. it's trash. don't believe that's representative of the quality of the code inside the company? if you can find one, ask any decent software engineer that works there and they'll let you know it is in fact worse.

    if this type of load takes them any more than a few machines, they should be embarrassed.

  21. a facebook puff piece? by Anonymous Coward · · Score: 0

    Facebook having a bad year... let's finish it off with a feel good about how facebook is the only company that helps all these billions communicate...

  22. Graceful degradation by Bengie · · Score: 4, Insightful

    "Graceful degradation" is the unsung hero of properly engineered systems.

    1. Re: Graceful degradation by Anonymous Coward · · Score: 0

      Excellent point. And I would agree except that it isn't the same for distribution of hardware. That is more like a restaurant on New Year's Eve. You need people to serve tables. That is an industry that has been around a very long time (how many centuries?) and would not follow the processes of software very well

  23. let's see how it glows by Anonymous Coward · · Score: 0

    well obviously not every device with its own messenger app gets a dedicated physical connection to the server. somewhere along the line, those 10 devices send to an antenna-node and the tower maybe has 10 elements and then the tower aggregates these into ONE physical link to a sub-station, which itself is connected to another 10 towers. from this sub-station another ONE physical link aggregates to a county sub-station etc etc.
    what i would look out for (and marvel at) is how these aggregation routers spin-and-weave all these single connections into bigger and bigger SINGLE connections ...

  24. By pissing people off? by BishopBerkeley · · Score: 1

    Another factor may be that FB pissed so many people off by abusing their privacy that they deleted Messenger altogether. I did, anyway. Come on, people. Invite your friends for a gathering or accept another friend or family member's invitation. A messenger greeting blast has about as much impact and is about as memorable as a highway billboard encountered at 80 miles per hour. Do something meaningful.

    --
    "...who search the reason of things
    Are those who bring the most sorrow on themselves." --Euripides, The Medea