...the problem with rule-based grammars that lack any statistical weights is that they come up with an unbelievably large number of parses for many real-world sentences.
Generative grammars suffer from that problem and scale very poorly, and may indeed be impractical to use for real-world text. Our constraint grammars and finite-state analysers do not have that problem. With CG, we inject all the possible ambiguity into the very first analysis phase, then use contextual constraints to whittle it down, where context is the whole sentence or even multiple sentences. This means performance scales linearly with the number of rules.
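To make the "inject all ambiguity, then whittle it down" idea concrete, here is a minimal C++ sketch. The `Word` struct, the tag names, and the single remove-style rule are all made up for illustration; this is not VISL's code or the real CG rule formalism, just the general shape of it.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One token plus every reading the lexicon allows for it (hypothetical tags).
struct Word {
    std::string form;
    std::vector<std::string> readings;
};

bool has_reading(const Word& w, const std::string& tag) {
    return std::find(w.readings.begin(), w.readings.end(), tag) != w.readings.end();
}

// Illustrative rule: remove a verb reading if the previous word is
// unambiguously a determiner. It never removes the last remaining reading.
void remove_verb_after_determiner(std::vector<Word>& sentence) {
    for (std::size_t i = 1; i < sentence.size(); ++i) {
        const Word& prev = sentence[i - 1];
        Word& cur = sentence[i];
        bool prev_is_det = prev.readings.size() == 1 && prev.readings[0] == "DET";
        if (prev_is_det && cur.readings.size() > 1 && has_reading(cur, "V")) {
            cur.readings.erase(
                std::remove(cur.readings.begin(), cur.readings.end(), std::string("V")),
                cur.readings.end());
        }
    }
}

// Build "the run ended" with all ambiguity present up front, then apply the rule.
std::vector<Word> disambiguate_example() {
    std::vector<Word> sentence = {
        {"the", {"DET"}},
        {"run", {"N", "V"}},
        {"ended", {"V"}},
    };
    remove_verb_after_determiner(sentence);
    return sentence;
}
```

In this toy version each rule is one pass over the sentence, so adding a rule adds a pass instead of multiplying the number of candidate parse trees. A real grammar has thousands of such rules with much richer context conditions.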
So the 96% accuracy claim is suspect, not to mention that a comparison of the Google system is already difficult because Spanish =/= English. (Spanish has more morphology on verbs, it's pro-drop, it has relatively free word order compared to English,...)
The paper is for Spanish, because that's what I could find. Our other parsers, including English, are also at the 96% or better stage, but because it's mindbogglingly boring to do a formal evaluation, we don't have up-to-date numbers.
So I don't believe you can say that "Google is hopelessly behind the state of the art."
Given that we had 96% in 2006, 10 years ago, and Google only now has reached 94% (90% for other domains), I feel confident in saying Google is very far behind.
Who said they're giving away their best stuff?
The nature of machine learning does. All they're giving away is an algorithm and a system trained using that algorithm. Linguistic machine learning is a field where even a 0.5% improvement takes years to get and is worth a paper. So even if they aren't giving away their top algorithm, their best one can't be much better.
which seams much more expensive than
It'd seem that way, but it's really not if you factor in the whole chain.
Machine learning needs high quality annotated treebanks to train from. Creating those treebanks takes many, many years. It is newsworthy when a new treebank of a mere 50k words is published. Add to that the fact that each treebank likely uses different annotations, and you need to adjust your machine learner for that, or add a filter. Plus each treebank is for a specific domain, so your finished parser is domain-specific. If you want to work with other kinds of text, you need to produce a treebank for that domain and then train on it.
Thus, the bulk of the work is in annotation and mathematical models. Google skipped the step of creating a treebank, and instead uses available ones. There aren't any usable treebanks for smaller languages, making the whole machine learning endeavor useless for all but the large languages.
Rule-based parsers are the opposite of that. You can put the same amount of man-hours into creating rules as you otherwise would into a treebank plus mathematical model, but you can do so on any old laptop with almost zero data to work from. You just need to know the language. A parser produced in this way is not domain-specific, but can be easily specialized for a domain if needed. And a rule-based parser can be used as a bootstrap engine for creating high quality treebanks, because the rules are upwards of 99% accurate, meaning humans only need to put a fraction of the work on top of it.
And as I wrote, rules are debuggable. You can figure out exactly why a word was misanalyzed, and fix it. Machine learning can't do that. The edit-compile-test loop of machine learning takes hours or weeks - with rules it takes seconds or minutes.
94% syntax is definitely good, for a machine learning parser. Now if you were to come to the land of rule-based parsers, 94% is the norm.
Google loves machine learning, and it's easy to see why. That's how they made their whole stack. They have the huge amounts of data to train on, and the hardware to do so. It's so seductive to just throw a mathematical model at huge amounts of data and let it run for a few weeks.
Rule-based systems don't need any data to work with - they just need a computational linguist to spend a year writing down the few thousand rules. But the end result is vastly better, fully debuggable, easily updatable, understandable, and domain independent. That last bit is really important. A system trained for legalese won't work on newspapers, but a rule-based system usually works equally well for all domains.
In 2006, VISL had a rule-based parser doing 96% syntax for Spanish (PDF) - our other parsers are also in that range, and have naturally improved since then. Google is hopelessly behind the state of the art.
I've used iTunes on Windows for many years as a music player only, and while it definitely has some annoyances, nothing else seems to do all the things that I want:
- auto-organize its own folder
- not reorganizing external folders
- volume normalization
- smart playlists
It is oddly lacking support for Ogg Vorbis and FLAC, but you can install 3rd party support for those.
I've tried several other music players, but none seem to do all of the above. The most promising ones unfortunately lack the expressive power of iTunes smart playlists, such as a playlist of "matching album Diablo or Torchlight, rated 3+, limited to the 25 least often played items, auto-update list after each play".
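For anyone wondering what that smart playlist amounts to, here is a rough C++ sketch of the filter. The `Track` struct and its field names are my own invention for illustration and have nothing to do with how iTunes actually stores its library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical track record; field names are illustrative, not iTunes' schema.
struct Track {
    std::string title;
    std::string album;
    int rating;      // 0-5 stars
    int play_count;
};

// Albums containing "Diablo" or "Torchlight", rated 3+, the `limit` least-played.
std::vector<Track> smart_playlist(const std::vector<Track>& library,
                                  std::size_t limit = 25) {
    std::vector<Track> out;
    for (const Track& t : library) {
        bool album_match = t.album.find("Diablo") != std::string::npos ||
                           t.album.find("Torchlight") != std::string::npos;
        if (album_match && t.rating >= 3) {
            out.push_back(t);
        }
    }
    // stable_sort keeps library order among equally-played tracks.
    std::stable_sort(out.begin(), out.end(),
                     [](const Track& a, const Track& b) {
                         return a.play_count < b.play_count;
                     });
    if (out.size() > limit) {
        out.resize(limit);
    }
    return out;
}
```

The "auto-update list after each play" part would simply mean calling `smart_playlist` again whenever a play count changes.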
Customers would be able to see a map of 'risk zone' data for places they want to go, such as stores, restaurants and roads. They could then plan the day 'with an eye toward how risky such endeavors may be,' according to the patent application.
Want to drive a competitor out of business? Stage some "risky" things in his area.
And who gets to decide what's risky anyway? This could blow up tiny incidents into something that causes droves of people to avoid a store.
And yes, while this is already somewhat possible with today's internet, we don't have a central authority who decides what's risky, and certainly not one with money invested in inventing riskiness.
Depends on what you mean by fail, but nothing new ever really beats C++. Sure, fancy new languages keep popping up with features, but each of them lacks portability or performance or control or higher level constructs or reliability or something else that C++ can provide. And eventually C++ gains the language feature anyway, but without sacrificing efficiency for it. Many languages have tried to take C++'s place, but so far none have gotten close. And even where other languages have a good hold (Java, C#, ObjC, etc), when you want an efficient library shared between those ecosystems, you'll write that in C++.
And no, C doesn't really count in the comparison, because it doesn't grow - while WG14 publishes new standards, it's still mostly C89 used in the wild, with a few extensions.
So popular, and yet they still haven't fixed the hugely annoying core issue of emulating magic quotes, even years after PHP itself completely threw out the feature.
I'm curious - what sorts of data at home do you store that contain lots of duplication?
I should've qualified that. The home backup system is the part of it that I have here at home, but the data is from several servers around the world, plus my personal files. And of course there's other backup sites so it doesn't 100% rely on my house or connection. And I have since improved my part with a dedicated machine rather than VirtualBox, though still USB attached storage because I had the disks anyway.
Er, snapshots should be immutable. They're used as sources for backups and replication, allowing them to be mutable would defeat the main purpose.
zfs clone if you want a writable copy. What's wrong with that?
The problem with zfs clone is that "clones can only be created from a snapshot" which means that deleting a file from a clone does not delete the file from the underlying snapshot, so the space is never actually freed. So when I accidentally have a very large temporary file in my backup set, it's stuck taking up space until it cycles out of history.
I used to use ZFS on my hacky home backup solution (Linux in VirtualBox with USB storage - yes, I know), but it would corrupt the disks once per month or so. Switched to btrfs, and it just works.
Features that btrfs has over ZFS, and I use:
- Mutable snapshots. It is infuriating that ZFS's snapshots are immutable. Mind you, I very rarely modify snapshots, but I damn well want to be able to without having to dump+reload all data. This alone is reason enough that I'll never again use ZFS where btrfs is available.
- Offline on-demand deduplication. Being able to dedup files when I want is very nice. cp --reflink is also super.
- Sane hardware requirements. ZFS is designed for extremely high quality hardware (and lots of RAM) that doesn't lie to the OS, which is just not what most of us are running. btrfs is designed for everyday use.
Features that I miss from ZFS:
- Online live deduplication. But it's sooo sloooow and requires so much memory, that I don't miss it much.
Aside from that, they're pretty equal in my experience. They both offer transparent compression, which is what I really want.
Every story is excessively shrill in support of Uber...
There have certainly been a lot of Uber stories, but my reading of them is that yet again Uber is caught doing something shady or just illegal. Looking over http://slashdot.org/?fhfilter=... I see 10 negative stories about them (shady practices, illegal activities, etc), 4 positive (SA women, SF drunk drivers, etc), and 2 neutral. This particular story counts as negative, since they're doing something illegal.
So, even if Slashdot is paid to put up Uber content, they're not praising them. But I guess any publicity is good publicity.
I read the headline as "Samsung Pays Lunches in the US" and wondered what weird out-of-court settlement they now had agreed to.
How is this 1) legal, 2) accepted? Doesn't this directly fall under false advertising?
Here in Denmark, smear campaigns generally don't happen. You do not badmouth other people or products - you instead talk about what you're doing better. And if you do smear competitors, you will lose face in the public eye.
It seems that in the US, it's entirely the opposite. So bizarre.
Many years ago in a Donald Duck comic, Gyro Gearloose's autonomous cars were panned by the mayor and city council of Duckburg for not having windows.
Prescient writers.
Qt's containers expose STL-compatible iterators, making it quite easy to use with any other well designed library. Sure there's some baggage due to history, but you can mix Qt into other code very nicely these days.
Unfortunately, Indian schools still require learning to code with Turbo C++, which is ancient and incompatible with any modern open source code.
Sad but true: http://google.com/search?q=Ind...
That brought a tear to my eye. Hit the style quite nicely.
its inclusion in the National Defense Authorization Act (the park legislation wouldn't pass otherwise)
You guys seriously need to fix your shit. Having bill riders is a fundamental government fail.
In the civilized world, a bill has a strictly defined topic, and anything not directly pertaining to that topic simply isn't allowed to be attached.
...just because you don't see it doesn't mean it's not there.
True, but so what? I can still write less C++ code but get the same performance as C. I don't care that it temporarily expands during compilation and then shrinks during optimization. I care about my code and the output - the IR is less interesting.
There is no standard ABI in C, the ABI is platform dependent and always has been.
I meant, the C Standard introduces new features in ways that won't break anyone's ABI. C11's _Generic is so bloody ugly compared to C++'s overloading, but it preserves the ABI. The fear of introducing ABI-breaking features is making C even more ugly to read.
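A small illustration of that contrast, as a sketch only. The function name `describe` is made up, and the C11 macro in the comment is just one typical way to spell a `_Generic` dispatch.

```cpp
#include <string>

// C++: one name, and the compiler picks the overload from the argument type.
std::string describe(int)         { return "int"; }
std::string describe(double)      { return "double"; }
std::string describe(const char*) { return "string"; }

// The C11 counterpart needs separately named functions plus a dispatch macro,
// roughly:
//   #define describe(x) _Generic((x), int: describe_int, double: describe_double)(x)
```

The C++ version works because overloaded functions get distinct mangled symbol names, which is exactly the sort of ABI-visible change that C avoids by doing the dispatch in a macro.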
I suspect you have very little experience with C and this is why you think C++ is always the right answer
Well, you suspected wrong. I have quite a lot of experience with C, C++, PHP, Perl, JavaScript, Java, Python, etc. I even started with C as my first real language. I just vastly prefer C++.
I do not think C++ is always the right answer, except when asked whether to use C or C++. The only cases where I'd say use C is when there is no choice, such as cross-language APIs.
Well, the example I have lying around is one I made, just so my bias is clear: http://tinodidriksen.com/2010/...
C: http://tinodidriksen.com/uploa... - 77 lines, 2.4k
C++: http://tinodidriksen.com/uploa... - 33 lines, 1k
The C code is a mess of extra code and checks, compared to the C++ version. The C++ version is as readable as the scripting languages. And C vs. C++ perform almost identically.
Now, that is obviously a tiny example, but I find this to be true for larger codebases as well. If you use all that C++ offers you, you can make the code read like elegant functional or scripting code (or a mix), and still have all the performance of C.
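To give a feel for the "reads like a script" point, here is a tiny word counter in standard C++. This is not the linked example, just a hypothetical sketch, and `count_words` is a name I picked for it.

```cpp
#include <map>
#include <sstream>
#include <string>

// Count how often each whitespace-separated word occurs in the text.
std::map<std::string, int> count_words(const std::string& text) {
    std::map<std::string, int> counts;
    std::istringstream in(text);
    for (std::string word; in >> word;) {
        ++counts[word];  // missing keys start at 0
    }
    return counts;
}
```

There is no manual allocation, no buffer size bookkeeping, and no cleanup code, which is most of what the extra lines in a C version of this would be spent on.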
The FQA? Really? Read it, laughed at it, moved on.
Modern C++ is as easy to learn as Python or Java or C# or similar. The code will even look superficially the same.
C++ is only hard to learn for those who insist on learning "C with Classes" and mix in all sorts of silly C'isms, like insisting on starting with raw memory management. If you start out with the high level stuff and containers, it's easy, safe, fast.
Standard C++ has never been a strict superset of Standard C. You have never been able to take all C89 code and just compile it as C++98, primarily due to C++'s stricter type system (e.g. malloc() will need a cast added).
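The malloc() case looks like this in practice. This is a minimal sketch with made-up function names, showing the line that is fine in C89 but rejected by a C++ compiler, the cast that fixes it, and what you'd normally write in C++ instead.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Valid C89, but not C++: void* does not convert implicitly to int*.
//   int *p = malloc(n * sizeof *p);

// The same allocation made acceptable to a C++ compiler with a cast.
int* make_buffer_c_style(std::size_t n) {
    int* p = static_cast<int*>(std::malloc(n * sizeof *p));
    return p;  // may be null; caller must std::free(p)
}

// What modern C++ would normally use instead: no cast, no manual free.
std::vector<int> make_buffer_cpp(std::size_t n) {
    return std::vector<int>(n);  // n zero-initialised ints
}
```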
Anyway, sure they share a lot of syntax and basic concepts, but when you compare modern C++ with C, it's like comparing a high level scripting language with, well, C. If you take code in Python, Perl, PHP, Java, C#, and C++, they'll all look mostly alike. If you code the same in C, it'll look foreign and require many more explicit sanity checks. But the C++ and C code will have the same performance, while the other languages will suffer lots of overhead.
I always tell people who want to learn to code to go for modern C++ (preferably C++11 or newer) first, and then if necessary learn some C afterwards.
It is crazy how much more C code is needed to get the same level of performance and security that equivalent C++ has, and C coders know it. Just look at all the extensions that C compilers, and even the C11 Standard, borrow from C++ (generics, RAII) - but in a convoluted ugly way to preserve the precious ABI for 50 years.
And for all those who will say that C++ can't fit in the tight spaces that C can...well, you're wrong. Just disable the parts of C++ that you don't want (usually exceptions), and you can still get most of the benefits of clean code and RAII, with the same or better performance.
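As a rough sketch of what that looks like: a minimal RAII wrapper around a C `FILE*` that reports errors by return value and never throws. The `File` class and `write_line` function are hypothetical names for illustration. Since nothing here uses exceptions, it should also build with them switched off (e.g. `-fno-exceptions` on GCC and Clang).

```cpp
#include <cstdio>

// Minimal RAII wrapper around a C FILE*. No exceptions are used anywhere.
class File {
public:
    File(const char* path, const char* mode) : fp_(std::fopen(path, mode)) {}
    ~File() { if (fp_) std::fclose(fp_); }

    // Non-copyable, so the handle is closed exactly once.
    File(const File&) = delete;
    File& operator=(const File&) = delete;

    bool ok() const { return fp_ != nullptr; }
    std::FILE* get() const { return fp_; }

private:
    std::FILE* fp_;
};

// Errors are reported by return value; the file is closed on every return path.
bool write_line(const char* path, const char* text) {
    File f(path, "w");
    if (!f.ok()) {
        return false;
    }
    return std::fputs(text, f.get()) >= 0;
}
```

The C equivalent needs an explicit fclose() on each exit path, which is where the extra code and the forgotten-cleanup bugs tend to come from.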
I fiddled with Firebug on the Beta site, and made a few changes that amazingly improve the look'n'feel of it:
- Remove article images.
- Remove the CSS line-height property from both submission and comments.
- Distinguish where submitter intro ends and submission begins. "Quotes" are not enough - the old blockquote worked nicely.
- Make the submission text color black. It feels hazy as it is now.
- Let comments flow full-width. Having them constrained by the huge sidebar is awful.
In general, it seems like you're turning Slashdot from a community driven site to a more modern publisher/aggregator style site, which won't work. If the comments aren't the primary focus, Slashdot loses what makes it Slashdot.
I can get up to date news everywhere - I can't get quality commentary anywhere but Slashdot.