Mining Neologisms from Wikipedia
holy_calamity writes "Natual Language Programming researchers have developed a tool called Zeitgeist that can discover the meaning of new words for itself using Wikipedia. It looks for entries for words not in the WordNet database and works out their meaning by looking for known words linked to them. Development of the tool is focusing on using it to understand what bloggers (using slang and neologisms) are saying about companies' products."
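Editor's sketch of the idea in the summary: take an entry that the known lexicon lacks and describe it by the known words its article links to. This is a minimal illustration under stated assumptions, not Zeitgeist's actual code. The function name `infer_meaning`, the `LINKS` table and the `KNOWN_WORDS` set (a stand-in for WordNet) are all hypothetical, and the data is invented.

```python
from collections import Counter


def infer_meaning(entry, links, known_words, top_n=3):
    """Return the known words most often linked from an unknown entry."""
    if entry.lower() in known_words:
        return []  # already in the lexicon, nothing to infer
    counts = Counter(
        target.lower()
        for target in links.get(entry, [])
        if target.lower() in known_words  # ignore links to other unknown words
    )
    return [word for word, _ in counts.most_common(top_n)]


# Toy data, invented for illustration only (hypothetical stand-ins for
# WordNet's vocabulary and for the links found in a Wikipedia article).
KNOWN_WORDS = {"web", "diary", "journal", "community", "internet"}
LINKS = {
    "blogosphere": ["Blog", "Web", "Community", "Internet",
                    "Web", "Community", "Web"],
}

print(infer_meaning("blogosphere", LINKS, KNOWN_WORDS))
```

A real system would have to weight links rather than just count them, since a heavily linked article mentions many words that say little about its subject.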
...one entity gathers what another entity spills...
All pass beyond reach of medicine. None pass beyond the reach of love.
if they pointed it at slashdot...
"ass-hat" and "tard" could take on a whole new meaning
Imagine the chaos and reboots as the program analyzes a George W. Bush speech
Infiltrated dot Net
Figuring out what people on the net say about your products is the "new" thing, apparently. IBM has its own engine for the task too. Kind of makes you wonder how much power the net community will in fact have in day-to-day decision making in the corporate headquarters' marketing strategy departments.
George W. Bush
n.
1. 43rd president of the United States.
2. miserable failure.
Put it back in the circuits and you're fine.
The trouble is that Wikipedia has a policy of not writing about (or using) Neologisms:
http://en.wikipedia.org/wiki/WP:Neologism
Many articles about neologisms *do* get created in violation of this policy - but they are generally put up for deletion via the Wikipedia process for deleting inappropriate material - so they only exist briefly.
So, for example, the article entitled "Windows Rot" is being debated today. Although it looks like this one will be merged into an existing article, it won't survive as the name of an article, so Zeitgeist presumably won't be able to find it.
It may be that enough of these kinds of articles slip through the system to be useful to Zeitgeist but that is not by design - so coverage will be patchy at best.
A further consequence of this is that the articles that Zeitgeist does find will most likely be so new that only one person will have worked on them - which will make for poor quality.
Also, it is very common for people such as bloggers who come up with what they consider to be clever new words to try to wedge them into common usage by writing about the word in Wikipedia. This 'vanity word' problem is one of the main reasons that Wikipedia seeks to avoid articles on neologisms.
www.sjbaker.org
For example, in French slang, the same person could use the word "batard" as either an insult or a display of respect, and neither of these meanings is related to the target's father.
I wish them good luck...
31g 3r0+her iz wa+ch1ng U!
This is what the Urban Dictionary is for.
Intron: the portion of DNA which expresses nothing useful.
and started creating its own gazornaplatting words that no-one but the program itself could middlybundy? It could eat up bibblys of disk space as all the new words chimmdudlied in a grawn.
I want a list of atrocities done in your name - Recoil
This sounds like a great way to locate (and sue) walmartsuck.com type sites.
Corporate censorship. Now Automated with "Zeitgeist".
Think I'm a nut.
Call me back in 5 years...
Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
Sounds like an excellent chance to inject some new perfectly cromulent words into wide use.
-- 3 events that reshaped the world in the 20th century: WW1, WW2, and WWW
Time for step two: deliver a mild electric shock to neologism users. Then I won't have to hear "blogosphere" ever again.
Use the Firehose to mod down Second Life stories!
What is with people using the term Zeitgeist? Google uses it for its end-of-the-year search roundup. It is used even more heavily by others with no connection to the internet.
OK, why not the term DefMiner? Then get an old guy to be the site mascot? On second thought, never mind. Just don't be surprised when people confuse you with another product.
Procrastinating life away at a rapid rate of speed.
One of my personal favorites is the word Santorum.
Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
If I see a new word in text, I hypothesize its meaning from its context rather than look it up. Recently, however, dictionary lookup has become easier when reading online, now that Google's define: operator is available.
This usually only works in languages I know fairly well. If there are two or three unknown terms in a paragraph I'll have less success in understanding them.
You do not need a fancy program to do this. I can do it for you, without even reading the blogs in question.
Watch.
They are saying your products suck, and that your customer support is worthless.
See how easy that was? Now, you might be wondering how I know this. Simple. They don't use made up words to say good things about you. I'm not sure why (maybe they aren't worried about being sued for saying good things?), but the pattern is very consistent. If somebody goes to the trouble of writing about you in their blog using made up words, they don't like you or the horse you rode in on.
Likewise, if you are a journalist, they call you funny names (Steno Sue, Laura Dildo, Kneepads Miller, "Dollar a Word" Armstrong, etc.) because they've noticed that you consistently write to favour a certain party, position, politician, company, or lifestyle, even when this requires ignoring a pile of facts the size of Paraguay, any one of which would shred your position.
And if you're a politician, it means that someone noticed that what you say in speeches is so unconnected to what you do with the office you hold that the only link between them is the way in which they combine to mollify your nominal constituents while maximizing the benefit to your corporate sponsors.
If you are an industry association, they are saying they hate you, period, and that you are evil incarnate.
See how easy this is? If you still don't get it, I am willing to come out of retirement as a consultant to explain it to you, provided the price is right.
--MarkusQ
paq8hp3 is the current Hutter Prize lead contender and has compressed the first 100M of Wikipedia to just over 17M. WordNet's .exe file is just over 17M. One wonders what would happen if the "cream" of WordNet's vocabulary were compressed using paq8hp3 and then incorporated into paq8hp3 to make it a better compressor by inferring which words are more likely than others to appear near various combinations of words. You wouldn't have to go very deep to generate a large temporary file of word associations. Identifying the "cream" of WordNet would take more than just frequency of usage. Some refactoring of the definitions may be in order to find which words are the most powerful descriptors of other words. How much of that sort of work has been done?
Seastead this.
... enables companies to automate the litigation process with Java AutoCeaseAndDesist email modules.
If you needed any more proof that the slashdot font sucks, here you go.
It's a sad day when
is mistaken for
Next thing you know, pom enthusiasts stray into the wrong conversation, and you can never go back from that.
Public use of any portable music system is a virtually guaranteed indicator of sociopathic tendencies. -- Zoso
TFA says it looks at the links and uses the keywords on those links, and doesn't even look at the content of the link or the context of the word in the paragraph. Remember Mark V Shaney from Usenet?
It'd be better off using Markov chains and a good index of all the content where the word is found, plus the content on the outgoing links, and determining the word's meaning from its context within the article. Google would really be the people to ask for this kind of info, given how huge their index is; I'd love to let a semantic program chew through some of that data.
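Editor's sketch of the Markov-chain suggestion above: count which word follows which across a set of texts, then ask what usually comes after an unfamiliar word. It is a first-order chain over adjacent words only, far simpler than indexing whole articles and their outgoing links. The function names and the sample sentences are hypothetical and invented for illustration.

```python
import re
from collections import Counter, defaultdict


def build_transitions(texts):
    """Count word -> next-word transitions (a first-order Markov chain)."""
    transitions = defaultdict(Counter)
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions


def likely_successors(transitions, word, top_n=2):
    """Return (next_word, probability) pairs for the most common successors."""
    counts = transitions.get(word.lower(), Counter())
    total = sum(counts.values())
    return [(nxt, n / total) for nxt, n in counts.most_common(top_n)]


# Invented sample sentences, for illustration only.
SAMPLE_TEXTS = [
    "The blogosphere reacted quickly.",
    "The blogosphere reacted badly.",
    "The blogosphere exploded.",
]

chain = build_transitions(SAMPLE_TEXTS)
print(likely_successors(chain, "blogosphere"))
```

Successor counts alone say little about meaning; the comment's point is that surrounding context carries information that the link keywords do not.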
So, I tried WordNet and it didn't work! Natual! Indeed!
Shouldn't they also crawl through something like the urban dictionary which will have ten times more slang definitions?
Wikipedia greatly endorses the Neologism (or perhaps Protologism according to their page) "initialism".
For some reason, someone decided to redefine acronym and make up a new word to cover what acronym covered before. And Wikipedia uses it constantly, despite the pointlessness of it and the fact that the word hasn't caught on widely, thus making it a protologism. Although protologism isn't a word that has caught on widely either, thus making it a protologism itself at best, more likely a vanity word.
http://lkml.org/lkml/2005/8/20/95
That's ALMOST as bad as when people use "i.e." instead of "e.g.", e.g. "Ie: 'I would of been rich by now if i hadn't...'" (explanation).
"hip" in the beginning?? I doubt "hip" would have been used to describe it in the 16th and 17th centuries, when the word was much more prevalent. It is German and means spirit (geist) of the time (zeit) and is used to refer to, you guessed it, the spirit of a particular time/era/century/etc. It is a show of arrogance for Google to use it, but it makes clear what this program actually does. Personally, I think the loop would be much more gezuhnderfast.
Good link. I've tried to avoid those, not always being able to remember which was which. Thanks.
There is much cruelty in the universe, John.
Yeah, we seem to have the tour map.
And thus protologism would then be true of itself, making it homological. Reminds me of this quote:
If a homological adjective is one that is true of itself, e.g., "polysyllabic", and a heterological adjective is one which is not true of itself, e.g., "bisyllabic", then what about "heterological?" Is it heterological or not?
- Grelling's Paradox
It doesn't get promoted to neologism just because of its age.
Zeitgeist today decided that it was a perfectly cromulent product.
I wish there were a better feedback system for sites that could be useful slang dictionaries, like urbandictionary.com (I think that is the url). Some entries reflect actual usage, some are obvious inventions on the spot, but get ranked highly anyway, because someone thinks they are funny or useful enough.
this page randomly combines prefixes and suffixes to create neologisms.
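Editor's sketch of what such a generator might look like. The page's real word lists are not given in this thread, so `PREFIXES`, `SUFFIXES` and `coin_word` are hypothetical stand-ins.

```python
import random

# Hypothetical word parts, invented for illustration only.
PREFIXES = ["blogo", "cyber", "info", "wiki"]
SUFFIXES = ["sphere", "naut", "tainment", "fication"]


def coin_word(rng):
    """Join one randomly chosen prefix with one randomly chosen suffix."""
    return rng.choice(PREFIXES) + rng.choice(SUFFIXES)


rng = random.Random()  # pass a seed, e.g. random.Random(42), for repeatable output
print([coin_word(rng) for _ in range(5)])
```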
Zeitgeist is German for 'Ghost of time' or 'Time spirit'
what does this have to do with this technology?
odd name...
May I offer my heartiest contrafibularities!
I am leaving now, but I shall return interfrastically.
(5 points to whoever places the origin of this bastardized quote)
This space for rent
Excellent insight by Larry Niven.
To extend, the lack of huge crowds of time-travelling tourists at events such as the WTC collapse is the best evidence that time travel from the future into our present cannot happen. Either that, or time travellers are required to cloak themselves. Otherwise, the streets and skies of NYC would have been packed with tricked-out Deloreans on 9-11-01.
Stephen Hawking thinks this may be because the furthest you can travel back in time is to the invention of the time machine (and it hasn't been invented yet). Even if there is no technological limitation, it might also be that going back to the time before the invention of the time machine is societally forbidden, in order to preserve the timeline that created the possibility of time travel. But that's another can of wormholes.