Large Scale Web Apps Built on Open Source
prostoalex writes "Brad Fitzpatrick presented at OSCON with an overview of his little project. Interesting facts about the evolution of the LiveJournal back-end architecture."
It's all LAMP.
you can do a lot with LAMP, just look at....SLASHDOT!
CB$@#--C
free ipod and free gmail!
LiveJournal? Not anymore...
OMG! Today I had CEREAL!!!!!
With MILK!!!! OMG!!
My company's backend is mostly Java.
:)
We are using Oracle as the database, and Solaris as the UNIX, but we could be using MySQL and Linux.
In fact, we are investigating that right now
comment directly in my journal
Uh, like, you mean the Web itself? That's large scale, certainly was built, and is most certainly built on open source.
So, yeah, I reckon it can be done. I'm using the proof-of-concept to submit this comment.
Anyone know what that document format is since it's roughly half the size of the pdf?
Karma: Chameleon (mostly due to the fact that you come and go).
Why is there a password on this sxi file (star office presentation)... is the file not open source?
...right here.
It's powered by GForge, so it's backed by PHP and PostgreSQL.
There are a bunch of other sites running GForge listed here...
The Army reading list
Yes, the empty string.
It's better to be the foot on the boot than the face on the pavement. ~~ tkx Kadin2048
Maypole is a Perl framework for MVC-oriented web applications, similar to Jakarta's Struts. Maypole is designed to minimize coding requirements for creating simple web interfaces to databases, while remaining flexible enough to support enterprise web applications.
Ok, so most of the Journals lack even a scrap of entertainment value... but the data feeds are normally fun. Is there anyone left that hasn't wasted a few bytes on the following url?
http://www.livejournal.com/stats/latest-img.bml
Hint - it's a constantly updating list of all the new images posted to journals. After a while you give up waiting for a hot chick to post and decide crazy survey graphics are as good as it gets. And then some hot chick posts her birthday party pictures, but she's only 14 and suddenly you wish you'd spent the day doing something else.
0daymeme.com: Great stuff.
Back in the .com days, I worked at a huge (now defunct) porn site. We had about 50,000 active hosted sites, 500,000 hit counters and a bunch of other stuff. We were getting tens of millions of page views daily, maxing out two 100 megabit circuits at times. It was all FreeBSD, a little Redhat, Perl, mysql, squid, apache, mod_perl and C. The only real closed stuff we used were BigIPs and traffic monitoring software.
The web is really a mixed bag that allows a mix of open standards, and proprietary software. To claim it is all open source is misleading. It is a dynamic network that allows development on multiple layers.
The most important aspect of the web is that the interfaces between the different layers were well defined and exposed...not that each line of code in the different layers is exposed.
It's a pervasive belief among the suddenly famous. IBM, MS, or Sun doesn't need this. It's the small website with a bright idea that is all of a sudden gaining popularity which goes through almost each of the stages described in this document.
This is for people with absolutely no budget and infinite traffic. This is how to live through that and come out winning like Brad apparently has.
I guess Amazon.com is one of those not-properly-designed websites that doesn't do anything real?
GForge really is great. We're using it internally at my workplace for request tracking and project management. Now, if only 4.0 would come out soon... :)
Who said Freedom was Fair?
About a month or so ago, slashdot was regularly dying while fetching pages. Anybody know what was actually causing the problem? I suspected it was Mysql, but don't know.
In any case, it seems to have quieted down some.
How is this "large scale?" Maybe it's medium-scale as far as the web goes, but otherwise, it's very much a lightweight app. From livejournal.org:
Per Hour: 6818
Per Minute: 114
That's 2 inserts a second, and maybe a hundred queries a second. Quite honestly, that could be handled by MySQL & PHP. Definitely not what I'd call "large scale".
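The arithmetic behind that estimate is easy to check. A rough sketch, taking the 6818-per-hour figure quoted above as the only input; the function and variable names are made up for illustration.

```python
def per_second(per_hour: float) -> float:
    """Convert an hourly rate to a per-second rate."""
    return per_hour / 3600.0


posts_per_hour = 6818  # figure quoted from the stats page above
posts_per_minute = posts_per_hour / 60
posts_per_sec = per_second(posts_per_hour)

# Prints: 114 per minute, 1.9 per second
print(f"{posts_per_minute:.0f} per minute, {posts_per_sec:.1f} per second")
```

So "2 inserts a second" is the hourly figure rounded up; the read rate of "maybe a hundred queries a second" is the commenter's guess and is not derived here.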
I don't respond to AC's.
If you are looking for scalable OSS solutions, also look into Zope with Zope Enterprise Objects (ZEO).
A little harsh considering the guy's starting point, but it is true that most people / companies don't think things through. I put in a lot of startup web sites in the 90's, and used to give lectures on, among other things, why replicating databases doesn't scale. Looks like people still think that replicating databases is a solution, almost ten years later. It makes me glad I opted out of the e-com performance world, or I'd still be solving exactly the same problems.
Simple lessons:
-replicating database all over the place doesn't work
-adding lots of servers doesn't work unless the apps are designed to work that way
-object-relational and object databases are useful for a narrow class of problems, and Do Not Scale
-java/perl/etc. are great, but you have to learn some SQL because doing things like sorting data in code is stupid when the database is 10x faster doing it on retrieval than your code
There's the material I used to get $2000 for in a 1-hour lecture. Share and enjoy.
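The last lesson, sorting in the database rather than in application code, can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module with an in-memory database; the `posts` table, its columns, and the sample rows are hypothetical, and the sketch makes no attempt to reproduce the "10x" figure, which is the commenter's own estimate.

```python
import sqlite3

# In-memory database with a hypothetical "posts" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, posted_at INTEGER)"
)
conn.executemany(
    "INSERT INTO posts (author, posted_at) VALUES (?, ?)",
    [("alice", 300), ("bob", 100), ("carol", 200)],
)
# An index on the sort column lets the database return rows already ordered.
conn.execute("CREATE INDEX idx_posts_posted_at ON posts (posted_at)")

# Approach 1: fetch everything, then sort and trim in application code.
rows = conn.execute("SELECT author, posted_at FROM posts").fetchall()
sorted_in_code = sorted(rows, key=lambda r: r[1], reverse=True)[:2]

# Approach 2: let the database sort and trim on retrieval.
sorted_in_db = conn.execute(
    "SELECT author, posted_at FROM posts ORDER BY posted_at DESC LIMIT 2"
).fetchall()

print(sorted_in_db)
```

Both approaches return the same two rows; the difference is that the second never ships the rows it does not need to the application.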
Milk
But I think their 'Ballad of Michael Hutchens' off Bigger Than The Devil was the absolute best.
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
Not large enough scale to survive a Slashdotting...
I can't speak to the "Perl Sux!" allegation but I would say that MySQL is at least partially at fault, too, especially considering the limited clustering, partitioning, replication, and locking schemes it has.
They could/should have moved to a much better DBMS. Although the DBMS licensing fee would've been non-trivial it would have meant SIGNIFICANTLY reduced hardware costs and much much less application code development. I even suggested this several years ago but I was told that licensing costs were prohibitive even as they were throwing away $40K on useless hardware.
Thanks,
--
Matt
You'll note that large websites who actually do real things besides logging people's daily problems don't use mod_perl and a thousand servers.
*Cough* amazon.com. *Cough* ticketmaster.com.
I have used livejournal for some time to communicate and record various things with the lady in my life and I think it is very valuable as an imperfect effort to learn from.
When I initially started using it I found it to be relatively responsive, but over the past year things seem to be getting slower and slower.
It is clear that his design isn't scaling well even without reading the presentation, but after reading it I now see it as a sort of 'case study' to learn from.
But beyond that it has reminded me why any blog you actually want people to read should be elsewhere. Then again, the quality of LJ blogs is infamous...
OMG! He's got a goat link right on the front page.
Hey, Windows users, there is no such thing as "forward" slash, there is only slash and backslash.
If you fixed that non-threaded code I hope you sent in a patch to the relevant people!
You need to get over your favorite language/technology/term you read in the trade-rag you read last week. And then you need to get over yourself.
Give it up slashdot crowd. mod_perl is not a valid technology for a large scale website! Perl was designed for a task, and that task was NOT enterprise application development.
Spoken like someone who has never had to build a very large site (doing "real" work) completely in Perl/mod_perl. I can tell you that it most certainly can scale to enterprise needs. Did this guy do it right? I don't think so either but he most certainly learned a valuable lesson. Hopefully other people will study what he has done and improve their own systems based on his work.
For the record, Java wasn't built for enterprise application development either. As with Perl, people discovered that Java had a future there and here we are today.
A properly designed website with n-tier separation will be able to handle a large load and scale infinitely. You'll note that large websites who actually do real things besides logging people's daily problems don't use mod_perl and a thousand servers. There's a reason for this.
You're assuming two dangerous things... (1) That you can't have n-tier and Perl. And (2) that large mod_perl sites require lots of servers. To believe either of these things is to demonstrate your horrific misunderstanding of computer science in general. I pity the company that lets you design their architecture. Wait, no I don't.... I'll gladly take their money for fixing your mistakes.
Oh yeah, and let us not forget some other languages that are showing promise... specifically Python+Zope. In fact, I know of several people implementing n-tier applications with PHP on the front, Python in the middle and PostgreSQL in the back with much success.
And for the record, here are some large companies and sites heavily using mod_perl.
Want more?
As an employee, I can tell you that this comment is somewhat full of shit.
It still is a very segregated system with tons and tons of front-end boxes that each do specific things. All the "magic" of Amazon happens in Java and C++ anyway.
I was talking about the original base of the web, not its current state. And I didn't make a joke, merely a humorous remark, and a very subtle one at that. Subtle humor often has to be explained before people even realize what hit them.
As a paying subscriber of Livejournal, I can say the only reason I even have an account is because of the friends that I have who use it. I would never use it as a case study for any technology. It's got huge performance problems, data loss issues, and usability issues. This may not be the fault of using OSS, but it definitely doesn't help it look good.
There is no longer anything that can be done with computers that is nontrivial and clearly legal. -- Paul Phillips
Is there something interesting here?
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Oh, yes it is.
This sig no verb.
Also PHP -> PL/PgSQL -> Postgres
I count PL/PgSQL and postgres different tiers because they have different functions and in the case of one system I'm working on all database interactions are moderated by PL/PgSQL stored procedures. They could just as easily be PL/Python or PL/Perl stored procedures if I wanted them to be.
I count PL/PgSQL and postgres different tiers because they have different functions and in the case of one system I'm working on all database interactions are moderated by PL/PgSQL stored procedures.
That's interesting, what you have certainly provides the ideal MVC separation but I'm not sure that it would technically qualify as 3-tier. Only because you couldn't scale up or swap-out the PL/PgSQL without also affecting Postgres.
<crazy mode>
That being said, it might be possible though. (And this is probably a really bad idea.... lol) but you could deploy middle-tier Postgres installations that held no data and used the dblink contrib package to do the real work. It would probably work. Albeit slower and maybe breaking atomicity.. But you would then be able to scale the stored procs without touching the database. The whole thing would be purely academic since most SP time is spent dealing with data anyway... Oh well.
</crazy mode>
Some may find it interesting that Wikipedia (covered earlier today on Slashdot) uses some code that came out of LiveJournal for caching: memcached.
Simpy
It's somewhat amusing that in the first load balancing example, one of the points of failure was Kenny. Especially since Kenny ALWAYS DIES.
Karma: It's all a bunch of tree-huggin' hippy crap!
Would you consider, say, a billion hits a day to be large scale? There's at least a couple mod_perl applications at that scale, and a dozen or so over 100 million.
I'm not sure from your comments that you actually understand what scaling means, though. It doesn't mean that a fixed number of machines can serve unlimited requests. It means that the ratio of machines (or cost) to requests is constant. So, at a certain level, yeah, you'll need a thousand servers. (And the ability to manage them.)
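That definition of scaling, a constant ratio of machines to requests, can be put into numbers. A back-of-the-envelope sketch only: the per-server capacity of 100 requests per second is an assumed figure chosen for illustration, not something taken from the talk or from any real deployment, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def requests_per_second(hits_per_day: float) -> float:
    """Average request rate implied by a daily hit count."""
    return hits_per_day / SECONDS_PER_DAY


def servers_needed(rps: float, capacity_per_server: float) -> int:
    """Servers required if each one handles a fixed request rate."""
    return math.ceil(rps / capacity_per_server)


ASSUMED_CAPACITY = 100.0  # requests/sec per server; an illustrative assumption

for hits in (100_000_000, 1_000_000_000):
    rps = requests_per_second(hits)
    print(f"{hits:>13,} hits/day -> {servers_needed(rps, ASSUMED_CAPACITY)} servers")
```

Under this model ten times the traffic needs roughly ten times the servers, which is what linear scaling means; a design that does not scale would need disproportionately more.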
How do they make the PHP front end talk to the Python middle layer?
I love Python and I've been trying to use it on the front end too, which it isn't too good at. PHP+Python sounds interesting.
Inside LiveJournal's Backend or, "holy hell that's a lot of hits!"
Believe me, it is taking all my strength to avoid making a certain obvious joke about this title..
exactly. try using it at 7 pm on a weeknight, it's all but useless.
I agree wholeheartedly. PostgreSQL and FireBird would suit LiveJournal way better than MySQL. However, PostgreSQL's replication is not exactly fail-safe (not sure if that's a requirement here) nor automatic, nor does it have the kind of partitioning features that some of the 'bigger' boys have.
I was thinking mostly of Sybase Replication Server combined with Sybase ASE or Oracle 10g/Oracle Clustering, things that would go really, really nicely in the environment and workload the LiveJournal folk are experiencing.
Thanks,
--
Matt
Thanks, appreciate the words of support. I was getting tired of being nit-picked to pieces.
The PHP layer issues either a SOAP or XML-RPC call to a Python server. You can either write a stand-alone server or use Zope to handle the requests.
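The stand-alone XML-RPC option might look roughly like this on the Python side. A minimal sketch using only the standard library; the `recent_entries` function and its return data are hypothetical, and a Python client stands in for the PHP front end here, which would make the same call through whatever XML-RPC client library it uses.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def recent_entries(user: str, limit: int) -> list:
    """Hypothetical middle-tier call: return placeholder entry titles."""
    return [f"{user} entry {i}" for i in range(1, limit + 1)]


# Bind to port 0 so the OS picks any free port on the loopback interface.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(recent_entries, "recent_entries")
host, port = server.server_address

# Serve in a background thread so the client below can run in this process.
threading.Thread(target=server.serve_forever, daemon=True).start()

# Stand-in for the PHP front end: any XML-RPC client can make this call.
client = ServerProxy(f"http://{host}:{port}/")
result = client.recent_entries("brad", 2)
print(result)

server.shutdown()
server.server_close()
```

Because the wire format is plain XML over HTTP, the front end and the middle tier need to agree only on method names and argument types, not on a language.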
The P in LAMP refers to any of the following: Perl, Python, PHP.
The one that you are referring to uses Java.
Moderators, please verify your sources before you mod informative.
;-)
It's kind of sad that moderators tend to be so biased at times on /. that it can be so easy to be a karma whore. Anyways, on to what I have to say...
Her explanation, of course, is not that she has a greatly inflated opinion of her abilities but that her teacher is anti-Christian.
Yes, this young lady may be on the ditzy side, as are most teens these days, regardless of their beliefs. I've seen the same type of drivel from a Marilyn Manson worshipping, black makeup-wearing gay boy as I have from the bubble-headed, bible-thumping cheerleader, except that the teacher was "a homophobic Christian bigot" instead of ANTI Christian.
At any rate, the little Christian cheerleader is right about one thing: a lot of teachers ARE biased against students--particularly in the humanities (English, and in Canada high school Social Studies in particular). I experienced this first hand in my senior year of high school. My beliefs tend towards libertarian ideology and conservative/free market economics. Social Studies teachers tend to be more socialist. I wrote a position paper in support of reducing government welfare programs to a minimum (whether it be corporate or personal). The resulting mark was 78 percent if I remember right. The highest mark I ever received from that teacher was 83 percent.
At the end of high school where I live, final exams are standardised, government-issued tests marked by a panel of teachers independent of the local high school--your teacher cannot influence your grade on the exam (I believe they aren't even permitted to see your completed test before it is marked). By chance, I could write about the same subject as the above-mentioned paper (you had a choice of three topics). Of course I couldn't remember the paper word for word, however I used the same arguments, in close to the same order, as I did on the paper that originally scored 78 percent. Later I learned that I scored NINETY PERCENT on the essay (and 95 on the multiple choice/short answer...woo hoo!).
I think it's fairly safe to say that a twelve percent difference indicates that there is quite a lot of bias and subjectivity in grading there...
You continue on stating opinion without making any strong argument by saying:
these zealots will continue to try to take control of this country
I've heard almost the same exact statement made a couple times before. One time it was coming out of the mouth of a hooded, cross-burning man to a news reporter in reference to Jewish people. The other time was quite recently, except the insult wasn't "zealot"--it was "pervert". That was from a demonstrator holding a cross and marching in a demonstration against gay rights. Fact is, most evangelical Christians are not zealots that want to take control of their country--they just want to live their lives free of persecution and with the respect of others around them. This is no different from Muslims, or Jews or even atheists or gay couples who wish to have their relationships acknowledged by the state. Certainly, within ALL of those groups, there are fundamentalist/extreme minority factions that would indeed love to take control.
You are free to state whatever beliefs you may have and I'll go to my grave to defend your right to do so, however I'd like to give you some advice: think a bit before making a blanket statement about a large group of people, whether it be positive or negative. You are likely to come across as closed-minded and even offensive to more than a few people.
Something the Slashdot coders could learn from, perhaps?
404: Situation Not Found.
(I'm sorry, did you suggest /.'s coders could actually *learn*?)
---
Mod me down, you fucking twits. Go ahead. I dare you.
(I read with sigs off.)
LiveJournal wasn't adequately planned from a business perspective either. Like many .com era companies, they went for massive uncontrolled growth.
Because of the ballooning user population, they've ended up in a situation where they've had to install a bunch of anonymous moderators to "control" abusive users, apparently using inadequate tools for the task and with little guidance. And as everyone knows, anonymity + power + no oversight = abusive behavior. See my signature link.
Brad admits he basically has no idea what the "abuse" team is doing, so the whole LJ organization is dysfunctional.
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
I didn't realize that it had automatic expiration. I must have missed that somewhere while reading the documentation.
The cache going away does not lead to data loss. It does lead to really shitty performance while the cache repopulates, but all of the data will still be in the database which is completely separate from the cache. If it was considered necessary, it wouldn't really be hard to load up a bunch of key objects into the cache from a script but that would be guessing which objects are going to be needed while just letting it repopulate and suffering some slowness for a few hours gets the right objects into the cache. Different applications have different needs.
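The point that a lost cache costs performance rather than data can be shown with a small read-through cache. This is a sketch only: plain dictionaries stand in for both memcached and the database, and the `CachedStore` class and its key names are invented for illustration rather than taken from LiveJournal's code.

```python
class CachedStore:
    """Read-through cache in front of an authoritative store."""

    def __init__(self, database: dict):
        self.database = database  # authoritative data (stand-in for the DB)
        self.cache = {}           # disposable copy (stand-in for memcached)
        self.db_reads = 0         # counts the slow path

    def get(self, key: str) -> str:
        if key in self.cache:
            return self.cache[key]  # fast path: cache hit
        self.db_reads += 1          # slow path: go to the database
        value = self.database[key]
        self.cache[key] = value     # repopulate for later readers
        return value

    def lose_cache(self) -> None:
        """Simulate the cache process dying and restarting empty."""
        self.cache.clear()


store = CachedStore({"user:1": "brad", "user:2": "kenny"})
store.get("user:1")  # miss, reads the database
store.get("user:1")  # hit, no database read
store.lose_cache()   # the cache goes away
value_after_restart = store.get("user:1")  # miss again, data still there

print(value_after_restart, store.db_reads)
```

After the simulated restart the value is still available; the only cost is one more database read per key until the cache is warm again, which is the slow period the comment describes.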
Don't feed me bullshit. memcached dies and so does your entire cache. That's significant data loss no matter how you want to spin it.
I don't know what HA-NFS and AFS are, but I know that using Squid (assuming you're talking about the HTTP proxy) would be caching at the wrong level. Caching constructed pages is pointless because most pages are completely different for each logged in user. memcache caches the atoms of data necessary to build the page, such as information about users and journal templates.
Squid doesn't just cache pages, you know. It can cache a wide range of data that's served over HTTP. Sound familiar? If you've read the memcached protocol documentation, it should.
As for the others: OpenAFS and HA-NFS. So much for "evaluated other solutions". These are both lightning fast high-availability NFS replacements - AFS sports numerous features such as client-side caches. And yes, they are open source.
Whoop de doo. Slashdot is looking at memcached. Their DBMS is notorious for corrupting itself, so that tells me quite a bit about their availability concerns.
Like I said - this may work great for LJ and Slashdot, but there are enormous e-commerce sites (that believe it or not, use a heckuva lot of OSS) that have a little more to worry about than losing ad revenue for the 10 minutes it takes to repopulate memcached. Having that kind of downtime simply is not possible. You not only lose sales, depending on your caching strategy, you can get unrecoverable orders, or just outright lose customers because your site is slow. It's not uncommon, either, it's pretty much a guarantee if your site gets slow or goes down for any extended period of time - your full-service uptime directly correlates to sales for sometimes several months, and god knows you're fucked if it happens during the christmas season.
Hehehe...Karma still "Excellent."
:D
See ya in the M2 Buddy
WTF?!? I got modbombed!!!
***WHINE***
err, you missed one...
slashdot.