Also, is the quality of the framework consistent across the whole system? For instance, if you have network class libraries and gui class libraries, are they both equally good? Or are you sacrificing on one side to get the benefit of another?
What I don't understand about this question is: why would you have a framework that covers both network operations and the GUI? Aren't those separate concerns? Wouldn't you use a specialized framework for each of those operations? Example: In Java, I'd use a network framework (RMI/web services/etc) to communicate between client and server (if necessary), a framework to access databases (Hibernate, iBATIS, Spring) and a presentation framework for whatever kind of client I wanted (Struts, JSF, Swing, SWT, etc).
If frameworks exist that do "all of the above", I've gotta wonder how well they do any of them. Seems like modular use of frameworks (picking 1-2 for each type of task) buys you the most ease of development and flexibility. Besides, most good frameworks even talk about their interoperability with other frameworks, and brag about how loosely coupled they are to the other layers.
Until fairly recently, biology (with exceptions for some subfields such as ecology) was, to put it bluntly, the science you went into if you wanted to do science but weren't very good at math. And I think it's fair to say that most "wet-lab" biologists still think more qualitatively than quantitatively.
That's funny...where I work (the Broad Institute) we have a LARGE number of scientists who are incredibly good at both math and statistics. You need those tools to play the game, if you're into identifying mutations that are associated with phenotypes. The same goes for expression data, and a lot of other high-throughput assay techniques. How do you examine hundreds of thousands of datapoints (or more) for answers without being able to model null distributions, carry out permutation tests, etc?
Perhaps in some wet labs where people are doing functional work, they aren't using math - but then, they don't need to, for the most part. They might need basic statistics, and can rely on other kinds of evidence where conclusive answers are easier to spot (cell-based assays, etc.)
Maybe my work isn't 'the usual', but we place incredibly heavy weight on statistics and analysis methodology. That might be why we're a world-class institution.
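For anyone wondering what a permutation test looks like in practice, here's a minimal sketch in Python. The function name, the expression values and the carrier/non-carrier grouping are all made up for illustration; this is not our actual pipeline, just the basic idea of building a null distribution by shuffling labels.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values simulates "the labels mean nothing".
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression values for carriers vs. non-carriers of a variant.
carriers = [5.1, 5.8, 6.2, 5.9, 6.5, 6.1]
non_carriers = [4.2, 4.9, 4.4, 5.0, 4.6, 4.8]
p_value = permutation_p_value(carriers, non_carriers)
print(f"permutation p-value: {p_value:.4f}")
```

The nice part is that you don't have to assume any particular distribution for the data; the shuffled labels give you the null distribution directly.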
And what's especially funny is that most of the commenters here on Slashdot have no idea what this software does, and they shoot their uneducated, ignorant opinions into the whole issue.
You're new to Slashdot, huh? As a biologist/bioinformatics guy, every time I read articles on Slashdot that involve my field of research, I see that 90%+ of the comments rated 3 or better are crap.
This leads me to believe that in areas that are not my speciality, slashdotters are equally full of shit. Sure, it's just a hypothesis right now, but I'm sure with a little help I could gather convincing evidence...
Some of us started out on the bench, and now work as full time bioinformaticians. We still do research projects collaborating with the scientists, and still deal with the data created by the bench folk. You can be tightly coupled to the other groups pretty easily.
Actually, if you don't know all the caveats to how the data was generated, you may not be able to write analysis software successfully (recent example: many genotyping platforms fail to call an inordinate number of heterozygotes when they run assays of poor quality. This has a dramatic effect on things like HWE (Hardy-Weinberg equilibrium), but also on association studies, etc.)
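To make the HWE point concrete, here's a rough Python sketch. The genotype counts are invented, and I'm assuming the missed heterozygotes simply drop out as no-calls (rather than being miscalled as homozygotes). It computes the usual 1 d.f. chi-square statistic against Hardy-Weinberg proportions and assumes both alleles are actually present in the sample.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 d.f.) for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical sample in exact Hardy-Weinberg proportions (allele frequency 0.5).
clean = hwe_chi_square(250, 500, 250)

# Same sample, but 20% of the heterozygotes failed to be called and were dropped.
dropout = hwe_chi_square(250, 400, 250)

print(f"clean: {clean:.2f}, with heterozygote dropout: {dropout:.2f}")
```

Anything above roughly 3.84 is significant at the 5% level with 1 degree of freedom, so a marker that is really fine can look badly out of equilibrium just because of the calling problem.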
What if your software design is a work in progress?
Let's say you're writing analysis software. Every time you do analysis, your results inform you as to what you have learned, what you haven't, and what else you can do to take the next step.
This is the world of scientific computing, and it is always in flux. Analysis changes, because you're designing new methods as you go. You may add completely different sets of data as you go. Your needs may change dramatically over the course of the work.
How can I treat this sort of design as hardware? None of it's set in stone, or anything even close. All I can do is provide flexible 'plug in' solutions, use smart design for the more obvious parts (databases, ORM frameworks, numerical frameworks, etc), and try to write each part to be as reusable and flexible as possible.
Not everyone has to write petstore over and over. Sometimes, the only way to find out where you're going is to take a step forward.
"It should come with a fundamental re-evaluation of what counts as verification or falsification."
What do you think peer review is? Why do you think experts in the field review other people's work? Why is it that scientists don't bother to read papers that have not been peer reviewed?
I think you're ignoring the fact that we're gathering new data in science all the time. You have the body of available data, and you apply different methodologies to that data to try and generate evidence to support a hypothesis. As you gather more data, you may show that a hypothesis that previously appeared true is now false.
An example: In genetics, you use sets of patients to correlate genes and disease. The larger the number of patients you can test, the more likely you are to positively correlate a gene (or a mutation, really) with a disease.
Say you run a test on 100 patients. You might be able to say "If that mutation increases the odds of having the disease by 300%, we're able to see it!" Now, you don't observe anything. Does that mean the gene does not have an effect? NO! It just means you didn't have the data you needed to see that effect.
What if (similarly) you're using models that don't account for OTHER data, like gene pathway information (an upstream gene mutation makes this mutation 3x as strong). Without that info, you also can't say anything.
End result: You declare what you know with the available data. With more data, you may change your mind. That doesn't mean you got it wrong. That's just the limits of what you can currently observe.
That's how science works. If you want to punish people for that, then frankly, you're an ass. If you want to punish people because they missed something that was "simple", then why in God's name did nobody see it? Perhaps because you have the benefit of additional data and HINDSIGHT?
I don't think scientists ever claim to "KNOW" anything. We all just have our best guesses based on the data we have available to us.
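If you want to see the sample size argument in numbers, here's a back-of-the-envelope power calculation in Python. It uses a normal approximation for comparing carrier frequencies between cases and controls. The 20% carrier frequency in controls, the odds ratios and the group sizes are assumptions I picked for illustration (I'm reading "increases the odds by 300%" as an odds ratio of 4).

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_power(odds_ratio, n_per_group, control_freq=0.2, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test at the 5% level."""
    p0 = control_freq
    # Carrier frequency in cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    se = math.sqrt(p0 * (1 - p0) / n_per_group + p1 * (1 - p1) / n_per_group)
    return normal_cdf(abs(p1 - p0) / se - z_alpha)


# 100 patients total: 50 cases and 50 controls.
strong_small = approx_power(odds_ratio=4.0, n_per_group=50)
weak_small = approx_power(odds_ratio=1.2, n_per_group=50)
# Same weak effect, 100 times more patients.
weak_large = approx_power(odds_ratio=1.2, n_per_group=5000)

print(f"OR 4.0, n=50:   {strong_small:.2f}")
print(f"OR 1.2, n=50:   {weak_small:.2f}")
print(f"OR 1.2, n=5000: {weak_large:.2f}")
```

Under those assumptions the big effect is easy to see with 100 patients, while the small effect is almost always missed until you have thousands. A null result in the small study says very little about whether the weak effect exists.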
The GP didn't say data wasn't available. "No historical statistical correlation" sounds like a test was performed with available data, and the null hypothesis couldn't be rejected.
Perhaps, with increased data size, the correlation might be found, but just because a correlation isn't there doesn't mean a test HASN'T been attempted.
I agree completely! Learning biology opened the door to a career in bioinformatics for me. Without that background, I wouldn't be able to easily understand and anticipate the users' needs. Hell, the programming language in question is just the tool that allows me to expose new data to the user, analyse their information, or generate new hypotheses (or do more interesting mathematical analysis.)
When I walk into a room of scientists, and they throw around all the vocabulary of the industry, it's good to be able to understand exactly what they are talking about, and add to their conversation about what they are doing, how they are doing it, and what direction to take next. On top of that, the programming side gives you the ability to understand 'process', which the scientists may not see as easily. The reason scientists don't always see 'process' is that many labs have 'their own way' of doing things, and rarely do many different labs look at their methods, and realize exactly how much they have in common.
Interestingly, if you falsify your data, it's far worse than never publishing anything in the first place.
IAAS, and if I falsified anything (and anyone found out), I'd never expect to work in the field again. Science is based on trust, and once you are proven untrustworthy, you might as well get your ass to the deep fryer.
Well, as a bioinformaticist who has been following this work for a while (both the first and last authors, along with most of the others present at our weekly group meeting), I'd say that the work isn't sloppy.
It is controversial, as it doesn't match with the fossil record. But if you knew the guys involved (and the internal vetting process at the Broad), you'd understand that this work has gone through massive peer review by some of the most gifted individuals in genetics I've seen.
I'd guess that John Hawks isn't a genetics specialist (just as David isn't an anthropologist), so when data starts conflicting, it's hard for anyone to give ground. I think it's exciting, because it allows for more experiments to be devised on both ends, and for more clarity to be arrived at.
In other words, the scientific process.
You know, the chromosomes are of different size. This goes from about (logs into database and checks) 246 Mb to 47 Mb.
You'd guess (and be mostly right) that genes are randomly placed on chromosomes, so chromosome length largely determines the gene count. You also have to remember that there are (or were) gaps in the genome where data isn't mapped. Also, not all genes have been predicted yet, and some genes are predicted, but not yet proved to exist. So, these numbers are in flux.
Here's the length of all the chromosomes in base pairs, given in order of chromosomes from 1-22, then X and Y (this is based on build HG16):
If you do 30 seconds of bioinformatics, you can generate the chromosome, number of genes per chromosome, and average number of genes per 1M bases:
1 2262 9.19
2 1469 6.03
3 1252 6.28
4 812 4.23
I officially have no idea why 17 and 19 are unusually rich in gene count. However, this is from 2 builds back in the genome, so it would be interesting to see how data has changed (but it's outside the scope of this slashdot post for me to work up that analysis when I'm just wasting time here.)
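For the curious, the "genes per 1M bases" column is just gene count divided by chromosome length in megabases. Here's a tiny Python sketch using the chromosome 1 gene count from the table and the rough 246 Mb figure from above; the exact HG16 length is a bit different, so the result only approximately matches the table.

```python
def genes_per_megabase(gene_count, length_bp):
    """Average number of genes per 1,000,000 bases."""
    return gene_count / (length_bp / 1_000_000)


# Chromosome 1: 2262 genes, length taken as roughly 246 Mb (approximate).
chr1_density = genes_per_megabase(2262, 246_000_000)
print(f"chr1: {chr1_density:.2f} genes per Mb")
```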
You can't patent sequence info. You haven't been able to since 2000. Get with the times.
Finally, my knowledge of statistics helps me understand a joke on slashdot.
I guess that makes us both HUGE NERDS. Bravo!
I have a dell d800 with a 1920x1200 screen as well, and it's 3 years old. I love the screen resolution, and hate everything else about the laptop..
This sounds like bioinformatics.
The um...field I've been working in for the last 6 years.
Programming + Biology + Statistics + Algorithm development.
You have committed the following logical fallacy:
Appeal to Misleading Authority.
Chemistry professors are not climatologists.
You didn't mention motherboard manufacturers, and those have a ton of influence on system stability. Those cheap-ass Via motherboards are just that - cheap.
If you buy a decent mb (nvidia nForce 4 is my personal pick, in a mini configuration to fit in a shuttle xpc), then you're good to go with a rock solid system.
I think the actual processor is rarely the problem, unless you have cooling issues.
I'm playing at 1600x900.
AMD 4200 dual core, 2 gigs ram, nvidia 7800gt.
Not exactly the fastest system, but decent.
PS: Resolution doesn't mean too much unless you know what options are turned on...
Also, as with anything else, direct causation is almost impossible to prove
Obviously, you don't work in biology.
It's not impossible to show that an attribute (environmental or genetic) can affect relative risk for any phenotype (in this case, heart disease.) You need a strong understanding of statistics, a well designed study, and a large enough number of patients to see weaker effects.
You can bolster your information by doing things like twin studies (you can negate differences in genetics, and / or differences in environment up to a certain time, depending on the type of twins.)
Take some time to look at a journal or two, like Science or Nature. You might not understand all (or much) of it, but you might get an idea of the statistical work and the intense study design required to do a study correctly.
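As a taste of the arithmetic involved, here's a small Python sketch of relative risk with a 95% confidence interval (the standard log-scale approximation). The patient counts are hypothetical, picked so the true relative risk is 1.5 in both studies and only the sample size differs.

```python
import math


def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Return (relative risk, lower 95% bound, upper 95% bound)."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR).
    se = math.sqrt(1 / exposed_cases - 1 / exposed_total
                   + 1 / unexposed_cases - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - 1.96 * se)
    upper = math.exp(math.log(rr) + 1.96 * se)
    return rr, lower, upper


# Hypothetical small study: 30/1000 exposed vs 20/1000 unexposed with heart disease.
small = relative_risk(30, 1000, 20, 1000)
# Same risks, ten times as many patients.
large = relative_risk(300, 10000, 200, 10000)

print("small study: RR=%.2f, 95%% CI %.2f to %.2f" % small)
print("large study: RR=%.2f, 95%% CI %.2f to %.2f" % large)
```

In the small study the interval still includes 1 (no effect), while the larger one excludes it. Same effect, more patients, and now you can actually see it.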
Who do you believe more, peer reviewed abstracts, or wikipedia?
Introns haven't been considered junk in quite a while. It's been known that there are some regulatory regions hiding out in introns.
For more interesting 'proof', see my paper in the February issue of Nature Genetics about conserved noncoding regions under selection - one of the strongest signals for selection was seen in intronic regions. We found parts of introns that were under as strong selection as coding regions.
It's nowhere NEAR junk DNA.
I just saw a talk at MIT by a grad student who was looking at conserved sequences and repeat regions. Looks like he found a repeat region that actually provides regulation in the genome. That's sequence that was inserted by a 'genetic parasite' (repeats make copies of themselves in the genome), and wound up getting 'lucky' and taking on function. Once it took up useful function, it could be selected for, and the frequency of that repeat region could rise.
It blew my mind, because we always filter out things like repeat regions when we're doing analysis.
Oh, and if this makes it any more believable, I just had a paper published in Nature Genetics last month about Conserved Noncoding Regions under selective constraint...
The problem is that the doc is 8 years old, back when poor old Java was maybe hitting its 3rd birthday. It's come a long way since then, and the VM has changed dramatically. Hell, was BigDecimal (etc) in the API back in '98?
List l = new ArrayList();
That was hard, how?
-Jim