As a former Computational Linguistics student, I'd say the main problem is either the lack of computational power or the lack of manual labour. I.e., even a very well-defined linguistic area needs to be defined with too many rules (in a complex system) or needs too much data and CPU time (in a brute-force approach) to be feasible, commercially viable, interesting in the Turing sense... too much effort to just make it work.
This is a very popular opinion, confirmed by the amount of money corporations throw at manual processes like taxonomy maintenance and training, but it's out of date. There are several scalable, commercially viable approaches that do not require manual labor or prohibitive processing to be feasible.
Google News is an example of a large-scale application built with automated NLP. Think Tank 23 makes an NLP-based, ad-hoc categorization engine that powers, among other sites, the Waypath Project.
I was in the same spot a month ago and just finished evaluating all the options (including Wikis). The Wikis are good, but not enough for everything you'll need. I'm a few weeks into evaluating infocetera and think I'll wind up keeping it. It's got a Wiki, plus a host of default databases (contacts, calendar, etc.), plus the ability to build your own, plus messaging and attachments, plus ACLs...all with a GUI for admin. Payment is on the 'honor' system (I haven't paid yet, but expect to soon--really).
If that were the case--that it's learned from parent to kitten--then you'd expect cat cultures to evolve, where different communities of cats would have different vocabularies. This is seen in chimps right in the same neighborhood. With cats being so widespread, it wouldn't be hard to demonstrate that cats in Albania have developed a different culture of vocalizations than their cousins in Brazil. (Has anyone done this? I couldn't find any evidence.)
Also, just because it isn't Darwinian evolution, depending on differential reproduction to pass traits, doesn't mean it's not evolution. Long before Darwin, there was Lamarck, who recognized evolution and gave his own theory as to the mechanism. After a century of ridicule by Darwin advocates (not including Darwin, who seems to have thought highly of Lamarck's work), a large number of findings in cellular biology beginning in the 1970s show support for both natural selection and environmental influence in passing new traits to offspring.
What does that have to do with cats? I dunno. But there's more to this evolution thing than they're arguing in southern courts. Don't discount new ideas just because they show up in the media (though it's not any sort of endorsement, for sure).
A company called ION Systems makes a browser plugin designed to make Web browsing easier for people with vision impairment. It's called Web Eyes. It won't help you program, but it'll make reading /. a little easier. (That, and the threshold filter...)
Re:Differences between Squid and Octopus? (on Giant Octopus; Score: 3, Informative)
Where to begin?
Squid and octopi come from the same branch of the tree as oysters, snails, and chitons, meaning that they're mushy, non-segmented (for segmented, think millipede or vertebrae), have a shell, and taste yummy. Of all the molluscs, squid and octopi are most closely related to each other, but there are several key differences.
In squid, the shell is reduced to a beak and a thin, flexible support called the pen. The pen lets them have that long, tubular body. Octopi have a beak, but no pen, making them pretty mushy (and able to fit through really small holes).
For the most part, squid live in the water column and hunt, while octopi live on the bottom and scavenge. (You could probably call picking on clams hunting, but really...) Squid are fast; octopi slow.
Here's a key difference if you want to keep one behind glass. Squid have to keep moving. Put them in a tank and they die quickly. You can keep an octopus for years; just throw it a raw crab once in a while. Because of this, we have been able to find out that many octopus species are intelligent. Some squid may be just as smart (they haven't caught a live Architeuthis yet, so they're smarter than tuna!), but we have no way of knowing because we can't really do tests on them.
Both are cephalopods, meaning they have their feet on their heads. Octopi have eight arms, more or less identical. Squid have eight stubby arms and two long ones with grabby pads on the end, the tentacles. Inside, they're pretty much the same: gut, ink sac...nothing you really want to eat.
There you have it, the highlights, at least. God, I'm a nerd.
The problem with this kind of approach is that it doesn't scale well to growing repositories of content where the conceptual span changes over time. For job listings and resumes, this works well because the set of concepts encoded in the content changes very slowly. If I recall correctly, WhizBang Labs recently partnered with LexisNexis to classify legal stuff. It'll probably work there too, as long as they've got a room full of monkeys to keep the training up to date.
But for dynamic environments (email, Usenet, news/weblog RSS feeds, knowledge bases, etc.), the WhizBang approach, and just about all approaches that rely on sample-based training or handbuilt taxonomies, just don't scale.
They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.
I just attended the Knowledge Technologies conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio, for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.
[Skip next section to avoid my self-promotion]
I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 are. Check out the demo of our technology, Waypoint 2.0, which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.
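As a rough illustration only: the post doesn't say how Waypoint 2.0 extracts concepts or relates documents, so the sketch below is NOT that technology. It swaps in the crudest possible stand-in (most frequent non-stopword terms as "concepts", shared terms as "relationships") just to show the shape of a mining layer that sits on top of untouched text. The function names, the stopword list, and `top_n` are all made up for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}

def extract_concepts(text, top_n=3):
    """Return the top_n most frequent non-stopword terms as crude 'concepts'."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {term for term, _ in counts.most_common(top_n)}

def related_documents(docs):
    """Map each pair of document ids to the concepts they share, if any."""
    concepts = {doc_id: extract_concepts(text) for doc_id, text in docs.items()}
    links = {}
    for a, b in combinations(sorted(concepts), 2):
        shared = concepts[a] & concepts[b]
        if shared:
            links[(a, b)] = shared
    return links
```

Nothing here needs a taxonomy or training samples, which is the point being argued; a real system would obviously need something much smarter than term counts.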
I'm in a very similar situation. There are a lot of ideas in these replies, many good, but many bad. In a nutshell...
0. A lot of the posts assume you're selling them the technology, not licensing it. You've got more flexibility in a licensing situation--in most cases there is no need for them to see the code at all, just to determine if it meets their needs. (see 2)
1. A patent is a very good idea, if you really have anything patentable. A provisional patent can be filed in a day and costs less than $100. It lets you say "patent pending" and serves as a 12-month placeholder for a formal, expensive patent application. Even if you never follow through, or if your designs turn out to be unpatentable, if they don't know which parts are patent pending and which aren't, they'll be less likely to reproduce any of it.
2. Are they licensing your code as an engine or piece of a larger application? In other words, can you give them just an API? If so, obfuscate the code and give them a watered-down API doc that just gives them the methods they need to integrate with their systems. The goal is to hide as much of the internal architecture and actual methodology as possible.
3. A lawyer is essential. But, you'll never be able to prevent them, contractually, from creating a similar technology if they don't buy yours. Obviously, they need one, otherwise they wouldn't be thinking of licensing yours. You can, and should, use a contract to:
force them to acknowledge that the information you provide is confidential, proprietary and intended only for a specific use;
spell out what that specific use is; and
restrict them from unobfuscating the code (which is not enforceable, but if you ever have to sue them, finding unobfuscated code on their servers is a lot more incriminating than just the code that they "forgot" to delete).
4. Combine the review contract with the licensing contract. If they're serious about licensing, then make them go through the trouble of agreeing on and drafting all the terms of the licensing contract before they can touch anything. If you can swing it, spell out what specific, measurable conditions must be met during the evaluation. If they're met, the contract should allow for two options: licensing or consolation payment. If they're not met, then the contract could allow you some time (10 days, 30 days, whatever) to address the deficiency and satisfy the requirements--otherwise, you've got nothing to squawk about. While there's no such thing as a "standard" license, this approach is pretty standard.
5. Most importantly, trust your instincts. If someone (or a company) wants to screw you, they're going to find a way to screw you. If it doesn't feel right, don't do it. Even if it does feel okay, be prepared for anything. Either way, you'll be much more savvy next time. There's always a next time.
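Point 2 above (hand over just an API and keep the internals out of sight) might look something like this in miniature. It's a sketch under assumptions: `_Engine`, `PublicAPI`, and `analyze` are hypothetical names, and the "scoring" is a meaningless placeholder. Note that Python's underscore convention only hides things from the docs, not from a determined reader, so the real protection still comes from obfuscation and the contract.

```python
class _Engine:
    """Stand-in for the proprietary internals (hypothetical placeholder logic)."""

    def _score(self, text):
        # Placeholder: count distinct lowercase words.
        return len(set(text.lower().split()))

    def _normalize(self, score):
        # Placeholder: squash the raw score into the range 0.0 to 1.0.
        return min(score / 10.0, 1.0)


class PublicAPI:
    """The only surface the watered-down API doc would describe."""

    def __init__(self):
        self._engine = _Engine()

    def analyze(self, text: str) -> float:
        """Return a single score; callers never see how it is produced."""
        return self._engine._normalize(self._engine._score(text))


# Only the facade is exported from the module.
__all__ = ["PublicAPI"]
```

The licensee integrates against `PublicAPI.analyze` and nothing else, which is enough for them to determine whether the thing meets their needs.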
Format migration is required by almost every commercial content- or knowledge-management system, and the more structured and metadata-laden your content, the better as far as these products are concerned. But as others have pointed out, getting content into new formats is only part of the battle; then you have to put an interface on top of it, as well as reorient users to create content in the new system.
It's a fight, but I haven't given up. I do a lot of consulting around structuring new approaches to information management and migrating content from old formats to new formats. Rather than look for another job, I've been looking for ways to change this one. One approach we've hit on lately is to stop beating ourselves silly migrating format A into highly structured format B. Instead, we leave things the way they are (if open format isn't important) or migrate to a minimally structured format, such as html or very loose xml. Then, instead of relying on a CMS to dictate a production system that produces well-formed documents, we develop systems that provide high-precision navigation across highly unstructured document sets. After all, for most companies, the point of all this work is to improve document access and navigation.
If you've got a lot of content (and it sounds like you do), it's often better to put as little work as possible into fixing past content and spend the effort on developing systems to handle future content. It's a lot cheaper, at least. The minimal conversion approach, combined with a good navigational overlay, can save a lot of time and money without compromising document access. Done properly, you can start creating new docs in the open format of choice, leave the old stuff alone, and actually improve document access in the process.
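To make the "navigational overlay" idea a bit more concrete, here is a minimal sketch, assuming the legacy documents are loose HTML/XML strings keyed by filename. It builds a plain inverted index over the text and leaves the documents themselves alone. This is only the simplest possible overlay, nowhere near the high-precision navigation described above, and `build_index` and `search` are names invented for the example.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Build a word -> set-of-document-ids index without altering the documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Crudely strip tags so loose HTML/XML and plain text are handled alike.
        plain = re.sub(r"<[^>]+>", " ", text)
        for word in re.findall(r"[a-z0-9]+", plain.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Old and new documents go through the same indexing step, so nothing has to be migrated before it becomes findable.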
Talk about a killer app!
Done. Next?
:)
But at least you found a job
Look, you can make ball lightning in your own home. Note, don't try this at home.
http://www.angelfire.com/electronic/cwillis/microwave.html