RMS was onto something: "I cannot in good conscience sign a nondisclosure agreement or a software license agreement". We need more of that sort of thinking.
When consuming an average of 60 megabits per second of outgoing bandwidth, I have some interest in seeing that reduced to 50 megabits per second through better JPEG compression. A difference of $700-$1500 per month in recurring costs isn't insignificant. Average outgoing bandwidth use for the last 10 minutes is 90 megabits/s. Transfer between 1 Jan and 11 Jan inclusive is 6.3TB. Last month was 13TB.
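The figures above can be cross-checked with a little arithmetic. This is a rough sketch only: it assumes 1 TB = 10**12 bytes and a 31-day "last month", neither of which is stated in the comment, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def average_mbps(terabytes: float, days: float) -> float:
    """Average rate in megabits/s for a transfer volume spread over a period.

    Assumes 1 TB = 10**12 bytes (an assumption, not stated in the comment).
    """
    bits = terabytes * 1e12 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


def cost_per_mbps(monthly_saving: float, mbps_saved: float) -> float:
    """Monthly cost attributed to each megabit/s of sustained bandwidth."""
    return monthly_saving / mbps_saved


# 6.3 TB between 1 Jan and 11 Jan inclusive is 11 days of transfer.
january_rate = average_mbps(6.3, 11)
# 13 TB "last month", assuming a 31-day month.
last_month_rate = average_mbps(13, 31)

# Going from 60 to 50 megabits/s saves 10 megabits/s for $700-$1500 a month.
low_cost = cost_per_mbps(700, 60 - 50)
high_cost = cost_per_mbps(1500, 60 - 50)

if __name__ == "__main__":
    print(f"1-11 Jan average: {january_rate:.1f} Mbit/s")
    print(f"Last month average: {last_month_rate:.1f} Mbit/s")
    print(f"Implied cost: ${low_cost:.0f}-${high_cost:.0f} per Mbit/s per month")
```

Under those assumptions the quoted saving works out to roughly $70-$150 per megabit per second per month.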
It can often be slow, most visibly to those who view while logged in. Apache load balancing isn't close to as even as we want it. Many of us are looking at Foundry gear and wishing...:) Not going to happen, barring a donation.
You may have noticed the comment above about Squid cache servers near Paris serving their first production requests yesterday. That's been in work for some six months, with assorted delays along the way to frustrate us.
We're looking for the same in other places - notably countries distant in internet response time from Florida. Someone in Australia want to save some trans-Pacific transfer? Love to hear from you. Korea, Japan, South Africa, Iceland? Same thing. Not only those places.
Longer term we're looking to have 3-4 server farms with squids, apaches and database server slaves in different places around the world. Think in terms of a couple or more racks and a few hundred megabits per second.
You won't hear about these things until they are almost ready to happen. We have the confidentiality of those we talk with to consider and they might have things like stock market rules to abide by, so we take that seriously. We're VERY interested in approaches from companies with significant bandwidth and server farm capability, either those who are big in their own country or who are big in world terms.
What won't interest us much is one megabit in the US. Too small for the cost to manage it. The same in Iceland with three oldish boxes to serve as squids? Probably welcomed with joy.
James isn't exaggerating.:) We really are planning for 200-500 servers in one year and could easily need more to keep up with demand. Really strongly depends on how the projects continue to grow and when we serve all desired requests after everyone interested has discovered the sites.
Take traffic and consider doubling times of 8-16 weeks and you end up with some really big numbers. Just keeping up takes a lot of work. Caching will help more and more as traffic grows, though, because we'll have an increasing percentage of requests in cache. We hope.:)
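To put numbers on "really big", here is a small sketch of what a constant doubling time does over a year. It assumes clean exponential growth for 52 weeks, which is a simplification; `growth_factor` is just an illustrative name.

```python
def growth_factor(weeks: float, doubling_weeks: float) -> float:
    """Traffic multiplier after `weeks` if traffic doubles every `doubling_weeks`."""
    return 2 ** (weeks / doubling_weeks)


fast = growth_factor(52, 8)    # doubling every 8 weeks for a year
slow = growth_factor(52, 16)   # doubling every 16 weeks for a year

if __name__ == "__main__":
    print(f"8-week doubling:  about {fast:.0f}x traffic after one year")
    print(f"16-week doubling: about {slow:.1f}x traffic after one year")
```

Even the slower end of that range is close to a tenfold increase in a year, which is why the server count estimates swing so widely.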
Holes today were an unruly crawler. The appropriate /25 is now firewalled at the squids. Yesterday two of the five database slaves were down for a while. Site was available but it was slower than usual on the database side.
Performance issues these days are mostly due to uneven apache load balancing. We're working on it.
Servers? We have servers? You must mean the 40 servers which donations purchased in 2004. Thanks to those who donated.:) Now, if someone could just tell me when we'll stop growing so I can work out whether I need to plan for 200 or 500 by this time next year...:)
"Just as one day kids will wonder if there was life before Google". Well, I'd say it is good that Wikipedia is in the company of Google.:) And also in the top 100 English language web sites according to Alexa. I suppose it's certain that this experiment is doomed to be a flop.:)
I'm biased, since I'm one of the roots for the Wikipedia/Wikimedia servers.
I suppose I should ask: any interest in a Slashdot interview on the capacity planning and technical side of Wikipedia? That's my area... of course, that also means I'll say what we'd love to have donated (anyone got a couple of racks and 100 megabits/s spare?:)) Oh, sorry, I'm supposed to have a neutral point of view...:) Or is that I'm supposed to be serious in public? Never can get that straight...:)
Consider that at present there are generally about 250,000 changes to just the English language encyclopedia each week.
Rather than explicit moderation, stability and viewer/reviewer counting may provide a suitable automatic quality measure for each version. If 50 or 100 people have viewed it and it's been unchanged for 3 months, it's probably in pretty good shape. If it's been changed 5 minutes ago and viewed twice, all bets are off.:) It won't provide a measure of actual article quality though - that has to be more systematic.
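A minimal sketch of that kind of stability check, purely as an illustration: the thresholds (50 views, 90 days) come from the example above, while the function and parameter names are hypothetical and nothing like this is claimed to exist in MediaWiki.

```python
from datetime import datetime, timedelta


def looks_stable(
    last_changed: datetime,
    views_since_change: int,
    now: datetime,
    min_views: int = 50,
    min_age: timedelta = timedelta(days=90),
) -> bool:
    """True if a version has gone unchanged long enough and been viewed enough.

    This only measures stability, not actual article quality.
    """
    unchanged_for = now - last_changed
    return views_since_change >= min_views and unchanged_for >= min_age


if __name__ == "__main__":
    now = datetime(2005, 1, 12, 12, 0)
    # Viewed 60 times and untouched for 100 days: probably in decent shape.
    print(looks_stable(now - timedelta(days=100), 60, now))
    # Changed 5 minutes ago and viewed twice: all bets are off.
    print(looks_stable(now - timedelta(minutes=5), 2, now))
```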
I tend to think of the project as being the former with a good attempt at the latter continually under way.
Either is undermined largely by two groups: those who want to delete specialised knowledge (often claiming it's insignificant because they don't know the field, a growing problem) and those who want to eliminate multimedia coverage of events in the last 75 years because it's not GFDL-pure. The latter want things not matching their views gone, instead of keeping them around until they are replaced or the growing market power persuades the copyright holder to grant a GFDL license. The petty vandalism is insignificant by comparison when considering the HG or EG as objectives: it takes systematic short-sightedness to do real harm and vandals lack that.
Expect to see a pretty good approximation to the HG on most cell phones within a few years. A cell phone isn't quite as big as the Hitchhiker's Guide from the TV series but smaller is progress.:)
The views expressed here are mine, not any official policy of the Wikimedia Foundation or necessarily of anyone else. So there.:)
Half of the last funding drive target was "pushed through" by me, when I suggested raising it from $25,000 to $50,000.
My motivations are very simple: I estimate what I think reasonable growth based on past performance will require and project roughly what it will cost to buy the equipment to keep up, then suggest a sufficient target to cover those needs.
For the quarter now ending, that estimate was three database slaves and 15 Apache web servers as the reasonable maximum we'd need based on past growth, with 2/10 (two slaves and ten Apaches) more likely. 2/10 was just about sufficient; we've been discussing, and I'm now preparing, the last of the three anticipated orders for the quarter. Performance suffered for a while because of equipment failures (more than 5 still out of service) and delays getting those computers (largely compatibility issues the vendor sorted out, bits of bureaucracy and timing issues). So we're preparing to handle a larger number of failures as well...:)
For the next quarter I'm looking at something higher. I'm expecting to be in the top 100 sites on the net during the spring quarter, with a fair probability of the top 50. Not at all bad for a place funded solely by donations from well-meaning people who want and like the resource.:)
The "big" item coming soon is ordering a new master database server to handle the English and Japanese encyclopedias, so we'll have it in test service for two months before switching to it. Followed soon by similar very capable database slaves for them. If anyone knows a place willing to donate 12-40 15K SCSI drives...?:) Or, for that matter, any fairly fast drives, including drive maker refurbs, since everything is RAID. Or anything in the way of quite high end disk systems or high capacity RAM modules, for that matter. It's a fine opportunity for high profile public good PR.
Japanese is paired with English because Japanese load is falling while English is rising and vice-versa.
Yes, that TiVO ad move told me I don't want TiVO (on "The VHS is Dead")
I like the product I buy to keep on working as it's supposed to at the time I bought it, not have the company reduce its value later. I'll pick something trustable instead of TiVO.
It's rapidly becoming the leading reference work in terms of number of users. How many pros are going to be content with the most used work having an inaccurate description when it's so easy for them to correct it?
You seem to have assumed that I was only interested in responding to your points, not raising one of my own.:)
Yes, it's certainly true that 3 years after being started the Wikipedia is sometimes not as well rounded and at times not as accurate as EB is. In other areas, that balance is reversed, in part because keeping the Wikipedia current is easier.
Yes, much of EB is a pay site. As you can see from the traffic, that means that most readers aren't being served at all by it. The accuracy or completeness of EB doesn't matter very much to those who aren't using it because of its chosen business model.
When discussing standards it's good also to remember who is actually meeting the information needs of the readers, for that tells you where time intended to help readers will best be spent.
It appears that the Wikipedia is doing a rather better job of meeting the needs of readers to find information than the Encyclopedia Britannica is online. To the tune of ten times the number of page views as Britannica.com. Readers are voting with their feet and the winner isn't EB.
What do you make of the hope that a Wikipedia article will eventually be more authoritative than a single human source work because it must survive review by all of those in a field?
It'll probably be more authoritative because it doesn't suffer from a key structural flaw of most encyclopedias: selection of editors who may have a specific viewpoint. Because a Wikipedia article has to survive the review of all of those in a field it has little choice but to eventually cover all of the major views within the field. That's likely to be particularly significant for rapidly changing fields, where those who are well known may also be out of touch with the newest developments.
However, that's a rather different definition of authoritative from that traditionally used with print sources, a single individual staking a personal reputation on the contents of some piece of writing. The Encyclopedia Britannica has already removed that personal reputation aspect. Today simply being in the EB is sufficient, without the vast majority of readers looking to see who wrote or reviewed the EB article.
Add Memcached for session state and storing some parsed pages. It significantly offloads the Apaches and database.
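Roughly how caching parsed pages offloads the web servers, as a cache-aside sketch. A plain dict stands in for the memcached client so the example runs on its own; the class, the key format and the `render` callback are all hypothetical and are not MediaWiki's actual code.

```python
from typing import Callable, Dict


class ParsedPageCache:
    """Cache-aside wrapper: look in the cache first, parse only on a miss."""

    def __init__(self, render: Callable[[str, int], str]) -> None:
        self._store: Dict[str, str] = {}  # stand-in for a memcached client
        self._render = render             # the expensive parse step
        self.misses = 0

    def get(self, title: str, revision: int) -> str:
        # Keying on the revision means an edit naturally produces a new key.
        key = f"parsed:{title}:{revision}"
        html = self._store.get(key)
        if html is None:
            self.misses += 1
            html = self._render(title, revision)
            self._store[key] = html
        return html


def slow_render(title: str, revision: int) -> str:
    """Placeholder for the CPU-heavy wikitext-to-HTML parse."""
    return f"<h1>{title}</h1><!-- rev {revision} -->"


if __name__ == "__main__":
    cache = ParsedPageCache(slow_render)
    cache.get("Main_Page", 1)
    cache.get("Main_Page", 1)  # served from the cache, no second parse
    print("parses performed:", cache.misses)
```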
For capacity: 24 machines did 1100 requests per second and were responding slower than we liked. That's close to the non-surge capacity. For surge to a few pages, the Squids would handle most of it and could go a lot higher.
The Squids handle 70%+ of the requests, so the Apaches were dealing with about 15 requests per second. Not a lot because some of them are quite expensive in CPU power - PHP doing parsing, not an optimal choice. We do it that way because MediaWiki needs to run on a shared hosting safe mode PHP setup. For ourselves and those with greater control, we're introducing a PHP plugin which offloads much of the CPU-intensive work from PHP.
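The arithmetic behind that, as a quick sketch. The number of Apaches is not given here, so the last figure is only what the quoted numbers would imply if the Squid hit rate were exactly 70% and every Apache took exactly 15 requests per second.

```python
def apache_requests_per_second(total_rps: float, squid_hit_fraction: float) -> float:
    """Requests per second that miss the Squids and reach the Apaches."""
    return total_rps * (1 - squid_hit_fraction)


def implied_apache_count(backend_rps: float, per_apache_rps: float) -> float:
    """Number of Apaches implied by a per-box request rate (an inference only)."""
    return backend_rps / per_apache_rps


backend = apache_requests_per_second(1100, 0.70)
implied = implied_apache_count(backend, 15)

if __name__ == "__main__":
    print(f"Requests reaching the Apaches: about {backend:.0f}/s")
    print(f"Implied Apache count at 15/s each: about {implied:.0f}")
```

Since the Squids handle "70%+", the real backend load would be somewhat lower than this estimate.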
While the hardware is commodity rackmount boxes, most of us are interested in using the home PC type of box for the raw CPU work done by the Apaches. We don't really notice the failure of a few Apache boxes (unless it happens to do something like hit most of the memcached machines). Not cost-effective until we've filled the second rack and switch to a room instead of going to a third rack.
We use DNS round robin and many virtual IPs per box for load balancing the Squids. That's a pain when one fails and we have to manually switch IPs around, so we're thinking of switching to a pair of LVS boxes in front of them in a failover configuration. This should both help perceived site reliability when a box fails and be a bit more even in load balancing.
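A sketch of why many virtual IPs per box help: when a Squid fails, its IPs can be spread across the survivors instead of landing on one machine. This only models the bookkeeping we currently do by hand; the box names, the 192.0.2.x documentation addresses and the `reassign` function are all made up for illustration.

```python
from collections import Counter
from typing import Dict


def reassign(ip_owner: Dict[str, str], failed: str) -> Dict[str, str]:
    """Move the failed box's virtual IPs to the least-loaded surviving boxes."""
    survivors = sorted({box for box in ip_owner.values() if box != failed})
    if not survivors:
        raise ValueError("no surviving boxes to take over the virtual IPs")

    # Count how many virtual IPs each survivor already answers for.
    load = Counter({box: 0 for box in survivors})
    for box in ip_owner.values():
        if box != failed:
            load[box] += 1

    result = dict(ip_owner)
    orphaned = sorted(ip for ip, box in ip_owner.items() if box == failed)
    for ip in orphaned:
        target = min(survivors, key=lambda box: (load[box], box))
        result[ip] = target
        load[target] += 1
    return result


# Twelve virtual IPs in DNS round robin, four per Squid (hypothetical layout).
layout = {f"192.0.2.{i}": f"squid{(i % 3) + 1}" for i in range(1, 13)}
after_failure = reassign(layout, "squid1")

if __name__ == "__main__":
    print(Counter(layout.values()))
    print(Counter(after_failure.values()))
```

DNS keeps handing out the same twelve addresses throughout; only the box answering for each one changes.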
The Apache load balancing does the job but it's not as even as we'd like (sensitive to both network topology and Apache speed), so we're also contemplating LVS as a load balancing layer between them and the Squids.
What we have now has worked well enough to get us to about the top 200-250 sites range. It'd go further but we're conscious that people now depend on us being there all the time, so we're spending increasing amounts of attention and sometimes money on reliability things.
The limits. Squids: cache miss penalty, storing the missed page to their disk cache. We're investigating more fancy disk setups than single disk SATA to try to increase the capacity per squid - may end up with lots of old and tiny disks per Squid. Or not - will depend on benchmark results. Apaches: pure CPU power and PHP. Database: full text search is by far the biggest load, in part because we're currently using a query which is less efficient than normal MySQL full text search - that'll change soon.
Sorry, you do need that testing and experience. Nobody here can answer your question without that. Your application is too different from all others for anyone to make really sensible specific recommendations. However, here's a generic try:
1. Database server with room for lots of disks and RAM, starting out with a pair of drives in RAID 1 and a gigabyte or two of RAM. Add disks and/or RAM as required.
2. Software written to handle lots of web servers/ page builders. Start out with one and add more as needed.
Beyond that, you're stuck unless you have some test data. It's not currently good enough to give anyone a meaningful quote but it might be good enough for a customer who understands your situation to work with you on it, if you're willing to replace money with the time to work with the customer.
A sensible consultant should turn your business down. You're setting them up for a failure unless you're willing to do the required testing or they have the equipment and budget to do it on their own.
How long ago did you try it? There definitely are issues with some builds. We're fine on our master and a SCSI slave but one of the SATA slaves has regular relay log damage (easy to fix but a pain). Too early for those who don't have a pressing need for it, I think, even though it can work.
  PID USER  PR NI nFLT  VIRT SWAP  RES SHR S %CPU %MEM  TIME #C COMMAND
23983 mysql 15  0  559 6621m 353m 6.1g 18m S  0.2 78.9 11:44  0 mysqld
That's the master server of wikipedia.org, a few minutes before I posted this reply. Dual Opteron, 8GB, 6x15K SCSI in RAID 10. FC2. 5GB+ for InnoDB. Been that way for many months now, without any major troubles. Not that there haven't been some, but it does the job. A pair of 4GB slaves are FC2 as well.
2.6/FC2 has issues in some builds on some systems, so it's not entirely smooth going. Just to complete the picture:
MySQL on localhost (4.*) up 0+18:41:33 [06:25:18]
 Queries: 151.4M qps: 2359 Slow: 57.5k Se/In/Up/De(%): 60/00/00/00
 qps now: 1075 Slow qps: 0.2 Threads: 92 ( 12/ 179) 74/00/00/00
 Cache Hits: 26.7M Hits/s: 416.6 Hits now: 183.0 Ratio: 29.3% Ratio now: 23.0%
 Key Efficiency: 98.9% Bps in/out: 29.4k/51.3k Now in/out: 109.2k/486.9k
You might have intended to exclude Opteron systems from X86 but it's worth clarifying.
Lots of excellent points in your post. Like forget RAID 5 for the database servers.
Contributor reliability ratings have traditionally been somewhat discouraged for social reasons. More of that is likely to happen and you'd certainly fulfil some people's wish-lists if you did something along those lines and merged it with head. Do discuss on the mailing list first though - no telling what someone else is working on and might do for you.:) Trying to work out good reliability (rather than popularity) metrics is an interesting social problem.
Colored contributions might be quite interesting. You do seem to have a viable approach, though it'll be interesting to see how well you manage to retain the info after a complicated edit. I have a feeling at least some other places will like this.
LDAP may be fairly easy - single signon support for multiple wikis is on the way (for 1.4). Nobody has coded LDAP support yet though. Likely to be appreciated by others.
Any capable devs who are interested are very welcome - it's one of the resource shortages.
Like the protections provided by the MediaWiki software used by Wikipedia/Wikimedia. The Wikimedia Foundation wiki requires an account to edit it (and access is not casually granted). The Wikipedia main page is locked against editing by non-admins. Too big a troll target.
It's just part of the feature set good wiki software should have, to reflect the range of uses its users are likely to need.
I know I'm writing for the /. audience, so I get all technical sometimes.:) If anyone needs a translation, just ask.:)
Certainly there are no guarantees.
Thanks.