I'm on the publisher side and I think there are some bits you're missing about this. First of all, I work in the poetry and academic spaces, where a print run of a thousand is a solid seller. We see 40% of retail cost in the actual printing, and easily another 10% for shipping. Now, you can easily object that we do much higher quality paper (acid free, thick) than the average paperback, and also have a lot of diagrams and photo inserts and crap that drive up cost. Fair enough. But your experience is also pretty atypical--paperbacks don't wholesale for $20 or retail for $50, so printing and shipping costs are a much larger relative share.
The other side of it though is that publishers frequently eat an enormous number of bookstore returns, frequently eat the interest between when they pay to print it and the bookstore gets around to selling it to someone else and actually paying the publisher for it, and often pay bookstores for prominent shelf placement. Those things can easily amount to 20-40% of the retail of a book and obviously aren't relevant for ebooks. So 20% for printing and 20% for logistics should equal a solid 40% off.
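To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The $15 list price is a made-up figure for illustration, and the function name is hypothetical; the two 20% shares are the ones quoted above.

```python
def cost_based_ebook_price(list_price, printing_share=0.20, logistics_share=0.20):
    """Price a title would carry if print-only costs were simply subtracted.

    The shares are fractions of the retail price, as in the comment above.
    """
    return list_price * (1 - printing_share - logistics_share)


paperback_price = 15.00  # hypothetical list price, for illustration only
ebook_price = cost_based_ebook_price(paperback_price)
print(f"cost-based ebook price: ${ebook_price:.2f}")
```

This is only the cost-basis number; it says nothing about what the market will actually bear.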
The reason it doesn't is that goods are, as a general rule, sold on a market basis rather than a cost basis. People who read ebooks are already shelling out for a Kindle, so they're probably a market segment which is getting more consumer surplus to begin with, and should be segmented by the pricing model in a more expensive slot. Last minute airline tickets don't cost more because the user is a bigger chunk of the fuel price, after all...
Expedia was spun out of Microsoft over a decade ago...Microsoft is worried about the Bing Travel site, which is rather different.
I manage to pass for an intellectual without knowing who Lautreamont is, but I share your lament on the divorce. And ten points for Euler's formula...proving that rocked my world and took my breath away.
1. It's philosophers like Rene Descartes and Francis Bacon who came up with the idea that mathematics should be used to control nature, thus making life better for man.
2. Philosophy as a field doesn't punish sloppy work? Are you crazy? You've clearly never been to a thesis defense, read peer reviewers' comments on an article, or sat in on a tenure review. Philosophy graduates have some of the highest GRE scores and lowest unemployment numbers for a reason. I used to study philosophy, now I'm a successful sysadmin, and the latter is vastly, vastly easier and more tolerant of error. Yes, I said that--running a site that loses thousands and thousands of dollars during any downtime is easier and more tolerant of error than being a philosopher in U.S. universities today.
3. That is in fact one of the major objections to Searle's Chinese Room story, and was made almost immediately by a large number of people. It hasn't been laughed out precisely because he has given careful answers that, while possibly wrong, certainly stimulate further thought. Your ignorance of those debates doesn't make their practitioners dumb.
Because what would the 2LD in your 3LD be, your ISP? That stinks. The whole reason everyone wants their own domain to begin with is to avoid being tied to their ISP. The people who didn't care about that are just me.wordpress.com or me.googlepages.com right now, and never were in the domain name market in the first place.
Hence my allusion to fobs late in the post, which of course many banks are adopting. But that still leaves no good alternative to SSL for ecommerce.
I'm sorry to say it, but if you want privacy, this is wrong. You can have authentication without encryption (digital signatures), but encryption without authentication = Man in the Middle. PGP and SSH don't get around this in any way, shape, or form--they just seed trust differently, with PGP using the web-of-trust model and SSH a repeatability model. Neither of those works very well for the classic "online banking" use case, however--average users are not going to seed their trust webs, and they expect to be able to bank from computers at cafes, work, and friends' houses--none of which would have connected previously, making the SSH model unworkable.
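A minimal sketch of the repeatability (trust-on-first-use) idea, in Python. This is not how any real SSH client is implemented; the class name, host name, and key bytes are all invented for illustration.

```python
import hashlib


class KnownHosts:
    """Remembers the key fingerprint seen on first contact with each host."""

    def __init__(self):
        self._fingerprints = {}

    def check(self, host: str, public_key: bytes) -> str:
        fingerprint = hashlib.sha256(public_key).hexdigest()
        known = self._fingerprints.get(host)
        if known is None:
            # Nothing to compare against, so the key is accepted blindly.
            # A man in the middle on this first connection goes unnoticed.
            self._fingerprints[host] = fingerprint
            return "first-contact"
        if known == fingerprint:
            return "match"
        return "mismatch"


home_pc = KnownHosts()
print(home_pc.check("bank.example", b"real-key"))      # first-contact
print(home_pc.check("bank.example", b"real-key"))      # match
print(home_pc.check("bank.example", b"attacker-key"))  # mismatch

cafe_pc = KnownHosts()  # a machine that has never connected before
print(cafe_pc.check("bank.example", b"attacker-key"))  # first-contact
```

The last line is the banking problem in miniature: on a machine with no history, the attacker's key looks exactly like a legitimate first connection.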
That's not to say there's nothing here--extensions to the SSL model like EV certs, DNSSEC, and phishing databases have all made these attacks harder. Perhaps browsers will implement web-of-trust or trust-history type extensions to make it harder yet. And it may well be the case that you simply cannot safely bank at computers you don't own, though with pre-shared keys and time-generated PINs both embedded into mailed fobs, the possibilities open up enormously as long as the execution is correct.
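For the fob idea, a minimal sketch of a time-generated PIN derived from a pre-shared key. It assumes the HOTP/TOTP construction from RFC 4226 and RFC 6238; real bank fobs may well use a proprietary algorithm, and the function name and example key here are only for illustration.

```python
import hashlib
import hmac
import struct


def time_based_pin(shared_key: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Derive a short PIN from a pre-shared key and the current time window."""
    counter = unix_time // step  # both sides compute the same window number
    mac = hmac.new(shared_key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation, as in RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# The key below is the RFC test key, not something to use in practice.
key = b"12345678901234567890"
print(time_based_pin(key, 59))
```

The bank runs the same function with its copy of the key and compares the result, so the secret itself never crosses the wire. Whether that counts as "execution is correct" depends on everything around it: clock drift, key distribution, and what the PIN actually authorizes.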
But at the end of the day there's no true privacy without authentication built-in and for the core e-commerce use case, SSL is probably the best model.
I'm also on the ops side, and I think a lot of people not running dozens of SQL servers really underestimate the pain of this. I do worry a bit about the visibility with NoSQL though--if you trust it to manage that there are 3 copies of your data at all times or whatever, how do you really guarantee that? And how do you know which bits to back up? The promise is good, but I definitely worry about whether we're there yet as we start to deploy Cassandra in production.
That is, for better or worse, just not true. I work for a $400MM company, nearly half profit, and there's no way we'd invest in Oracle & DBAs for our OLTP systems. It's not the total dollar volume, it's that since you always try to grab the most profitable niches first, you tend to grow by eking profit out of places with less room for it, e.g. less profit per user/transaction/whatever, and thus anything but open source on commodity hardware means that your costs grow faster than your profits and the tech department is unpopular. You might think that's silly, but in the consumer online space that's how the business thinks.
Strongly agreed, though I do worry that many NoSQL projects' websites are overly blasé about runtime issues, including crash safety and online schema changes, as well as upgrade safety. Now this is really all about using alpha software rather than anything conceptual/design related, but it is a real issue at the present time.
What would be better for keeping every user's profile thumbnail in memory than memcache?
And would it have gotten off the ground in the first place if it weren't written in a scripting language? Probably not. Now that they have a million lines of PHP code, would it survive a rewrite? Probably not (see: Netscape). So it's ugly to be sure, but it's almost certainly rational.
Hear, hear. NoSQL is all about running into the write performance limits of commodity hardware and realizing that moving from ACID to BASE is loads cheaper than sharding.
It's 2010. "GB in size" no longer means something big. That said, MySQL and PostgreSQL both handle data sizes up to a terabyte and several-billion-row tables just fine with mostly standard SQL using the usual tricks. If you're talking petabytes, now that's a separate grade of mess. But if you follow the people using this in production at scale and talking about it in public (e.g. Facebook, Google) you'll see that the issues for MySQL are really around update/insert performance, replication speed, and replication transaction safety. I don't see anybody talking about Postgres scaling quite that publicly, but in my own experience Slony also has some issues with replication speed (and hot standby is great, but until you can query the slave it's not solving a huge class of real-world problems--and that capability has been forthcoming for a long time, but I'm not sure we're really any closer). Anyway, that frustration aside, PostgreSQL is a damn fine database, and I wish I didn't have to deal with MySQL at all.
Really? Because I can almost saturate a gigabit pipe for around $100k/yr these days. Say I've got MySQL on almost 100 cores...now sure you'll say Oracle is more efficient, and it probably is, but I still need at least two boxes in case of hardware failure. I haven't seen anything suggesting I can get Oracle on say 8 cores for $10k. Now maybe this just means I don't "need" Oracle, which in some sense is trivially true since it's running on MySQL and it works, but that doesn't seem very helpful.
For systems that can be stateless, this is always the best approach. Master-master replication with conflict resolution isn't always that easy, however, especially when you think about something like the way Wikipedia edits can potentially interact. So developing a conflict resolution scheme can be extraordinarily expensive, and MySQL isn't the most stable in multi-master anyway. Thus while you're right in principle, the expense can be prohibitive.
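A toy sketch of why the conflict resolution part is the expensive bit, in Python. The `Edit` record and both functions are hypothetical; this is not how MySQL replication or MediaWiki actually handles conflicts.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    master: str         # which master accepted the write
    base_revision: int  # revision the author was looking at
    timestamp: float
    text: str


def last_write_wins(a: Edit, b: Edit) -> Edit:
    """The cheap policy: keep the later write, silently drop the other."""
    return a if a.timestamp >= b.timestamp else b


def is_conflict(a: Edit, b: Edit) -> bool:
    """Two masters accepted edits made against the same base revision."""
    return a.master != b.master and a.base_revision == b.base_revision


# Two people edit revision 41 of the same page on different masters.
east = Edit("east", 41, 1000.0, "fixes the date in paragraph 1")
west = Edit("west", 41, 1000.5, "adds a citation to paragraph 3")

print(is_conflict(east, west))           # True
print(last_write_wins(east, west).text)  # the date fix is lost
```

Detecting the conflict is the easy half. Deciding what to do with two edits that both deserve to survive (merge them, or kick one back to a human) is application-specific, and that is where the cost goes.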
I wasn't quite able to figure out the attitudes there. Where I was (Chengdu), everyone used anonymous proxies like crazy, and while they were quickly blocked more would spring up, with DNS/IPs often distributed on email lists. It was treated a bit like speeding in the U.S. I guess--technically illegal, best to avoid the cops, but everyone does it. I was using my corporate VPN as an easier access method, and even though VPNs are, as best I can tell, in the same sort of legal grey area, my usage really freaked people out. The very idea of encryption (even used to view the same exact material) gave them visions of visitors in the night.
No, you're not crazy, though there are tradeoffs:
The bad: with email instead of dedicated knowledge management, you'll pay a lot more in licensing, hardware, and maintenance for each bit of bolt-on functionality that you need/want, and even then you won't end up with as much functionality embedded in as slick an interface.
The good: email is a huge industry, so you really can find some provider to add functionality for each line-item requirement (traceability, search, archiving, even workflow), and if you stuff those things transparently into their existing clients/servers they might actually use the stuff. The return on investment of the unused product is always zero.
Just wanted to concur with this. Bought KnowledgeTree at my last job and thought it was just fantastic. Beware that if you go with the purely open source version, Windows users will have to upload documents individually through a web interface, which is not usually a big hit. But great security, auditing, workflow, delegation...and yet actually simple to admin unlike your average Microsoft product. Integrates well with Active Directory.
Some concerns I would have if I were you (and these are mostly not specific to Knowledge Tree):
1. If you don't have a shared infrastructure, how will you handle authentication to the system? KT will allow you to enter standalone users in addition to/instead of connecting to LDAP, but users don't tend to remember those passwords, etc.
2. Fileshares and email tend to be the existing backbones of office workflow. Just because you introduce something more with features that some people care about (search, auditing, workflow) and maybe even make it easy to use and train on it, doesn't mean anyone will actually switch. You have to have an end-user incentive, and even then some people won't switch unless you make the old options impossible. There has to be real management buy-in for success.
We're building those tools, right now. Puppet for configuration management, Func for scalable scripting, Capistrano for deployments, RANCID for switch configs, Splunk for log slurping. This is what the Visible Ops and/or Infrastructure as Code movement is up to. We even have a conference: O'Reilly Velocity. Adam Jacob, the creator of Chef, likes to say "if it doesn't have an API, then it doesn't exist."
I totally agree with you about the problem, but devtools didn't exist until developers got tired of assembly and/or Notepad and built them. Automated infrastructure tools, allowing us to focus on design and business requirements rather than logging into 100 boxes to do this and that all day long, are just now coming into being, because those of us who would rather design than firefight are building them.
Right, but that's really nothing to do with it. The point is that you're treating a bunch of data that's probably ordinal and non-normal as if it's ratio and normal (http://en.wikipedia.org/wiki/Level_of_measurement). With the distribution of your data, you can either go with a high-effort approach like estimating an appropriate parametric distribution or subsampling, or you can just use some stock non-parametric tests. It's not just the "grades" -- you can still compute a "standard deviation", but its usual interpretation (roughly 68% of values within one deviation of the mean) doesn't hold for distributions that are substantially different from Gaussian. Basically you've got the statistical equivalent of a divide-by-zero error.
It's not just sample size. If the distribution isn't approximately normal, it's still not going to work. Your professors were basically deciding that the distribution of grades *ought* to be normal, whether the distribution of learning was or not. You may be able to do something like that in a work environment too, obviously, but it's important to recognize that it's a normative assumption not driven by the data.
Think of normal curves with standard deviations as something like the curve of adult human heights. So if you have some other kind of data which clusters very differently (say, incomes, where the high end of the curve goes on for a long time) or like your data, where you might have a team where everyone is very nearly at the median (what we call "narrow-tailed") and you just define the distribution as normal (which is what you're doing when you look at 'standard' deviations) then you're basically using statistics to remap small differences as if they're as large as the difference in human heights. And yes, for normal distributions you generally want sample sizes of 40 or so depending on the size of the effect (but in my experience that's reasonable for organizational study planning, where you obviously don't know the effect size beforehand). If what you're trying to do is just rank people in a statistically robust way, then you want something like Spearman's rank correlation (rho), which only assumes a rank ordering, not a normal distribution, and is less sensitive to outliers than Pearson's r. Of course, this will not result in grade-like scoring, but I'm not sure what can be done about that. Hope that helps a bit.
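To make the rank-based point concrete, a small Python sketch comparing a rank correlation (Spearman's rho, computed here as Pearson's r on the ranks) against Pearson's r on the raw values. The scores are invented for illustration; with real data you would more likely reach for a statistics library than hand-rolled functions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: assumes interval-scale data and a linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def average_ranks(values):
    """Rank from 1 upward; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks, so only order matters."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores: two reviewers agree on the ordering of five people,
# but the second reviewer gives one person a wildly larger number.
reviewer_a = [1, 2, 3, 4, 5]
reviewer_b = [2, 4, 6, 8, 100]

print(round(pearson(reviewer_a, reviewer_b), 3))
print(round(spearman(reviewer_a, reviewer_b), 3))
```

The two reviewers rank everyone identically, which the rank-based number reflects; the raw-value correlation is dragged down by the single extreme score.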
I'll second that. Y! defers everybody, it's just a fact of life. But I send about 2 million emails a day to the big-5 and don't use DKIM. I hope we don't end up having to, b/c it sounds like a real PITA.
The abuse being...a lower cache hit rate on caching DNS servers? We're talking about Akamai here, not wildcarding. DNS service just isn't that expensive to provide, and when you consider that ISPs actively encourage Akamai to have caching servers inside the cages on their head ends, I think "more DNS queries" vs "lower upstream bandwidth usage and better latency for our customers" doesn't seem like a tradeoff they're complaining about.
Title hardly makes for argument (note I wasn't the one throwing around the ad-homs here); I just wanted to point out that I was speaking from experience.
I don't understand how this is a problem with HTTP...connecting TCP around the world takes an enormous amount of time compared to UDP. That's just reality. Remember the issue here isn't what my servers can deliver, but rather latency, which is a function of the global network I don't control. Using Akamai for DNS allows me to use Akamai for midgress and mostly avoid this.
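A rough sketch of the round-trip arithmetic, in Python. It assumes a plain UDP DNS query (one round trip) versus an HTTP request over a fresh TCP connection (a handshake round trip, then the request/response), and ignores TLS, slow start, and server time. The 250 ms round-trip time is a made-up intercontinental figure.

```python
def udp_dns_lookup_ms(rtt_ms: float) -> float:
    """One query out, one answer back: a single round trip."""
    return rtt_ms


def http_over_new_tcp_ms(rtt_ms: float) -> float:
    """SYN/SYN-ACK handshake (one round trip), then request/response (another)."""
    return 2 * rtt_ms


rtt = 250  # hypothetical round-trip time in milliseconds
print(udp_dns_lookup_ms(rtt), http_over_new_tcp_ms(rtt))
```

None of that is under the origin server's control, which is why moving the far end closer to the user helps more than making the server faster.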