jrtom · Slashdot Mirror

He's not lying; you're not reading carefully. on The Future of Google Search and Natural Language Queries · 2007-12-18 11:01 · Score: 1

Quoting from the summary:

'We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords ... understanding how words go together is important ... That's a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence.'

That is, he explicitly says that _most_ (that is, not all) of their work is at the word/phrase level. This implies that some is at levels of abstraction above that. They may not be "concentrating on the sentence" but that doesn't mean that they're ignoring it entirely. Furthermore, there are well-known ways of creating good approximations of the meaning of a document that don't consider word order at all. The classic is the TF-IDF model, but there are others (Latent Semantic Analysis, other types of topic models) that are richer and more descriptive. No, they don't capture everything about the semantics or pragmatics of a document, but they do well enough to (for instance) provide good predictors of the grade of an essay as assigned by a panel of human graders.
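To make the "no word order" point concrete, here is a minimal TF-IDF sketch in Python. It is an illustration only: the whitespace tokenizer, the particular weighting variant (term frequency times log inverse document frequency), and the three toy sentences are all assumptions of mine, not anything a real search engine is known to use.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (bag of words, order ignored)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "the dog chased the cat",
    "the cat chased the dog",          # same words, different order
    "stocks fell as markets closed",   # unrelated vocabulary
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # the two reordered sentences look identical
print(cosine(vecs[0], vecs[2]))  # no shared terms, so no similarity
```

The first two sentences mean different things but get the same vector, which is exactly the information such a model throws away; the third is still cleanly separated by topic, which is the part that works.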

Re:If you can't store it, you can't count on it on Google Goes Green · 2007-11-28 06:07 · Score: 1

electricity can't be stored in any useful quantity

This may be true; it's outside my area of expertise.

However, it does not necessarily follow that the associated energy, or at least a large fraction of it, cannot be stored. Wikipedia quotes the energy efficiency of water electrolysis as >= 50% (depending on whose numbers you believe, but at least that). I assume that storing sufficiently large quantities of hydrogen and oxygen is more feasible than ginormous electrical storage batteries. 50% efficiency clearly isn't as much as one might like, but it at least gives you a form of persistent storage.

link to Boldewin's page incorrect on Malware Hijacks Windows Update · 2007-05-16 02:49 · Score: 1, Redundant

It's http://reconstructer.org/ not http://reconstruction.org/.

Also, Brian Krebs' blog has an informative post on the phenomenon.

computers in the '30s? on Open Source Car on the Horizon · 2006-12-08 11:15 · Score: 1

"I think Packard and Deusenberg had computers in the 30's."

Last I checked, the computer hadn't been invented yet in the 1930s, so it seems unlikely. Unless you mean something rather different than is typically meant by "computer" these days.

("I know! We'll control certain aspects of our automobile's operation with a computer!" "Hey, what a great idea--of course, the computer will be several times the size of the rest of the car and will require its own staff of expert operators, but why not?") ;)

Genesis (and originality) of PageRank on Microsoft Releases Book Search · 2006-12-07 11:38 · Score: 1

PageRank, published in 1998, is an application of eigenvector centrality.

HITS ("hubs and authorities") is another eigenvector-based method of ranking nodes in a network, also published in 1998 (in this case by Jon Kleinberg).

Eigenvector centrality itself was proposed as early as 1949 (Seeley, "The net of reciprocal influence") as a means of ranking nodes in a network. There were plenty of papers on this topic in the intervening 49 years. (The concept of eigenvectors, of course, is considerably older than this.)

[moranar] "I wouldn't call PageRank a "simple tweak" of anything."

No offense intended, but how much do you know about the details of how PageRank works? (There was a link on /. to a rather nice overview earlier today; it's worth reading.)

The specific difference that PageRank has from standard eigenvector centrality is the addition of 'virtual edges' from each node to each other node, which are collectively traversed with probability beta (a parameter of the model), which does two things:
(1) it gives the algorithm something reasonable to do in the case where it runs into a sink (node with no outgoing edges)
(2) it supplies a way of "smoothing" the rank values over the network according to the value of beta.

That's it. Personally, I consider this a relatively "simple tweak".
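For anyone who wants to see how small the tweak is, here is a rough power-iteration sketch in Python. The function name, the toy graph, the value beta = 0.15, and the choice to spread a sink's rank uniformly over all nodes are my assumptions for illustration; this is the textbook formulation as I understand it, not Google's production code.

```python
def pagerank(out_links, beta=0.15, iterations=100):
    """Power iteration over a dict {node: [nodes it links to]}.

    beta is the probability of taking a 'virtual edge' to a uniformly
    random node instead of following a real link.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # A sink has no real edge to follow, so its rank is spread
        # uniformly over all nodes (one common convention, assumed here).
        sink_mass = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: beta / n + (1 - beta) * sink_mass / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            for t in targets:
                # Real edges share the remaining (1 - beta) of v's rank equally.
                new_rank[t] += (1 - beta) * rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical five-page web; "e" is a sink and "d" has no inbound links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": []}
ranks = pagerank(graph)
for node, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(value, 3))
```

Set beta to 0 and drop the sink handling and you are back to plain eigenvector centrality on the link matrix, which is the whole point.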

Credit where due: it was clever of Page and Brin to apply it to the web graph because of the particular semantics of hyperlinks, but as Kleinberg's simultaneous publication suggests, it's a pretty obvious thing to try to use the information inherent in the hyperlinks.

Furthermore, it's worth noting that the reason pure PageRank worked so well initially is that no one who was creating web pages at the time was thinking "hey, let's boost our rankings by manipulating our link structure!". At the time, SEO consisted of inserting bogus text on your web pages, because textual similarity was where it was at. The use of hyperlink information, which was totally orthogonal, revolutionized web search precisely because no one had tried to game the system in that way yet. Now that such gaming is commonplace, either PageRank or the underlying interpretation of the hyperlink data (probably both, I'd guess) has been radically modified from the original algorithm.

As to whether this is an innovation "in Computer Science", this isn't really computer science, as such: it's (applied) mathematics. (This is coming from someone who has degrees in each of them.) Google does a lot of computer science in the context of implementing its methods, refining them to take account of additional information (e.g. adjusting the transition probabilities according to the anchor text content), and most especially in scaling them to the web, but the original method is just applying a known mathematical operator.

ask Cory Doctorow... on Intellectual Property Discussion in the Classroom? · 2006-10-26 08:55 · Score: 1

...he's giving a lecture series at USC on this. Do a search on "cory usc" on http://boingboing.net/ and you'll get links to short posts on what he's been lecturing on.

Re:brief explanation of the method on Text Mining the New York Times · 2006-08-02 15:46 · Score: 1

The Author-Topic model is actually due to Steyvers et al. at UC Irvine. McCallum's contribution was the Author-Recipient-Topic model, which extended the AT model to the domain of directed communications. The AT model is actually very closely related to Steyvers' topic model. I recommend reading the summaries on his page referenced above (in my original comment).

monumental task? naah... on Halving Half Lives · 2006-08-02 05:25 · Score: 1

Courtesy of the public service announcements so kindly provided by David Crosby and Graham Nash and by Fred Small, this is already a solved problem. ;)

Re:brief explanation of the method on Text Mining the New York Times · 2006-07-30 11:25 · Score: 1

If you want to read the papers I pointed you to, and become specifically acquainted with the techniques and their advantages, then we can talk. Otherwise, in the absence of any technical papers that describe your technology (of which there seem to be none on your website), I don't see any particular reason why I should pay further attention to your anonymous claims.

brief explanation of the method on Text Mining the New York Times · 2006-07-29 17:30 · Score: 4, Informative

I'm a PhD student in the research group that worked on this. My research is somewhat different (machine learning and data mining on social network data sets) but I've gone to a lot of meetings and presentations on this work, and I've used the model they're describing in my own research. Certainly people have worked on document classification before, but posters who are suggesting that this isn't new don't understand what this method accomplishes. For example:

- basically, the model assigns a probability distribution over topics to each document
  - i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA))
- topics are learned from the documents automatically, not pre-defined
  - this means, incidentally, that they're not automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well.
- the technique can learn which authors are likely to have written various pieces of a given document, or which cited documents are likely to have contributed most to this document
  - side benefit: you can also discover misattributions (e.g., authors with the same name)

For a good high level description of what these models are doing, see Mark Steyvers' research page (MS is one of the authors); that page also has links to a number of the preceding papers. Those interested in seeing what the output of a related model looks like might like to check out the Author-Topic Browser.
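To illustrate just the first point in the list above (a document gets a distribution over topics rather than a single label), here is a toy Python sketch. Big caveat: it holds two hand-written topics fixed and only estimates the per-document mixture by EM, whereas the actual model learns the topics from the corpus as well. The topic definitions, the example document, and the function name are all hypothetical.

```python
def infer_topic_mixture(words, topic_word_probs, iterations=50):
    """Estimate P(topic | document) by EM, holding the topics fixed."""
    k = len(topic_word_probs)
    theta = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iterations):
        counts = [0.0] * k
        for w in words:
            # E-step: how responsible is each topic for this word token?
            # Words a topic never emits get a tiny floor probability.
            joint = [theta[t] * topic_word_probs[t].get(w, 1e-9) for t in range(k)]
            total = sum(joint)
            for t in range(k):
                counts[t] += joint[t] / total
        # M-step: the mixture is the normalised expected topic counts.
        theta = [c / len(words) for c in counts]
    return theta

topics = [
    {"game": 0.4, "team": 0.4, "score": 0.2},    # hypothetical "sports" topic
    {"vote": 0.4, "senate": 0.4, "score": 0.2},  # hypothetical "politics" topic
]
doc = "team game score vote team game".split()
theta = infer_topic_mixture(doc, topics)
print([round(p, 2) for p in theta])  # mostly sports, partly politics
```

The document comes out as a blend of the two topics (roughly four parts sports to one part politics), and the ambiguous word "score" is divided between them in proportion to that blend.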

Re:authors' analysis doesn't just miss the boat... on Metcalfe's Law Refutation Explained · 2006-07-14 08:22 · Score: 1

I read the article. I stand by my positions.

Specifically:

(1) You do indeed need a metric to model growth, because unless you can agree on what you're measuring, you don't know how to tell whether (or how much) it's growing. That is: we know how fast the _network_ is growing; the question is how fast the _value_ is growing, and that's not meaningful unless you define "value". Your objection is not on point because you appear to implicitly assume that it's obvious what's being measured (which is true of a person's height, and definitely not true in the case of a network).

(2) What, exactly, do you mean by saying "we don't interact with the internet on our own all the time"? Of course I look at other people's content via search engines, but in that context--the web--the question is, IMO, what the value of the web is, and how that changes as more people participate by adding sites and links. In my opinion, you don't add value to the web by browsing it, so your observations of the network via a search engine are not relevant.

The internet is one way of enabling certain types of social interactions. It does not inherently increase the human capacity for interaction.

As for the O(n lg n) model: the authors assume a power law distribution is involved, but they don't justify this assumption, and they don't provide supporting evidence (throwing in a few factoids does not count). Again, in the absence of an actual model, their article is just so much speculation, and not in fact an effective refutation of anything.

authors' analysis doesn't just miss the boat... on Metcalfe's Law Refutation Explained · 2006-07-14 06:23 · Score: 1

...it never even realizes that the boat existed.

That is: the authors' analysis is fundamentally flawed in a couple of different respects.

(1) They don't even attempt to establish an actual metric for the value of a network. Without that, any counterarguments to any previous assertions regarding network value that one might make are basically so much handwaving. (One can of course make the same objection to Metcalfe's Law, but saying "My hand-wavy claim is better than yours!" isn't much of an argument.)

On a related note, any discussion of network value that doesn't even take into account the semantics of the network (i.e., what the edges represent) is even more useless.

(2) Let's say that we will approximate the value of a network as being proportional to the number of edges (links); we'll ignore, for the moment, the possibility that edge weights might differ.

_In practice_, the number of individuals that one can be meaningfully connected to in a network--i.e., the number of incident edges to a node--will be limited. _This_ is the real problem with Metcalfe's Law.

For example, suppose that the edges represent friendships (as identified by the individuals, i.e., A is connected to B if A and B each agree that they are friends). The number of friends that I can have is limited: I can't even _meet_ six billion people for one minute each (even assuming that I could remember them as individuals) before my personal clock runs out. This same argument applies (although the limitations will be different) to financial transactions, telephone calls, IM buddies, and so forth. Sure, if all six billion people were signed into Your Favorite IM Client and I could open a group chat with them, they could all read my words...but that's not a meaningful relationship, it's a broadcast--and it has really nothing to do with any network topology.
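A quick back-of-the-envelope version of the argument in (2), in Python. The degree cap of 150 is a number I picked purely for illustration, and "value" is just the edge count, as assumed above.

```python
def metcalfe_edges(n):
    """Edges if every pair of n nodes is connected: n(n-1)/2."""
    return n * (n - 1) // 2

def capped_edges(n, max_degree):
    """Upper bound on edges when no node can hold more than max_degree links."""
    return n * min(n - 1, max_degree) // 2

# How long would it take to meet six billion people for one minute each?
MINUTES_PER_YEAR = 60 * 24 * 365.25
years_to_meet_everyone = 6_000_000_000 / MINUTES_PER_YEAR

for n in (10, 1_000, 1_000_000):
    # max_degree=150 is an illustrative assumption, not a measured limit.
    print(n, metcalfe_edges(n), capped_edges(n, max_degree=150))
print(round(years_to_meet_everyone), "years of nonstop one-minute meetings")
```

Once n passes the cap, the edge count grows linearly in n rather than quadratically, and the meet-everyone figure comes out at over eleven thousand years, which is why the all-pairs assumption doesn't survive contact with actual people.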

nothing to do with social networks on Barcodepedia - a Social Network Barcode DB · 2006-07-06 06:38 · Score: 2, Insightful

At least as it now is, this site doesn't appear to have anything to do with social networks--nor does it claim to. Apparently the submitter either (a) knows something about the site that the admins haven't chosen to release or (b) assumes that any community site must automatically be a "social network" thing.

physicists rediscovering network analysis. again. on Web Game Helps Predict Spread of Epidemics · 2006-01-25 19:49 · Score: 1

Epidemiological studies such as this are not exactly new, although I'm sure that this is (in some ways) a nice data set for investigating this sort of problem. I've skimmed the paper; it appears to be the case that the authors are completely unaware of the body of work that people in the field of social network analysis (SNA) have generated on this problem over the past few decades.

Unfortunately, this is not particularly new, either: for the past several years, physicists have been "discovering" problems, and models, that the SNA folks have known about for quite some time. To give credit where due, physicists' quantitative models for these problems are generally well-constructed, and I appreciate the fact that their entrance into the field has placed more emphasis on quantitative methods. However, their assumptions are not necessarily well-explored, and so their conclusions are not infrequently invalid.

It's true that I haven't checked out this paper in detail, and it's possible that they really did come up with something new. But Nature (the publication, that is :) ) has a way of publishing physicists' papers in the field of SNA without checking to see whether the authors have done their homework...and a cursory check of the references suggests that in this case they may not have.

Re:It's not built yet on New Uses For LCD Technology · 2006-01-14 17:56 · Score: 1

Been done, more or less: You're In Control

mod parent down: misinformative on Computers, Long Hours and Vision Problems? · 2006-01-09 11:40 · Score: 1

It's a well-known fact that wearing corrective lenses causes the eye to learn to depend on the lens

Actually, my optometrists told me for years that rigid gas-permeable contact lenses would slightly _decrease_ the prescription that I would require (because they shape the cornea a little bit, I believe, which soft lenses don't do). I can verify that when I stopped wearing such lenses for a year or so prior to getting LASIK surgery, my eyes became measurably (if not dramatically) more myopic.

"Legally blind" is a bit misleading, as at least one other has pointed out; it's usually only applied to those with uncorrectable vision. "Legally blind without corrective lenses", sure.

Side note: 20/600 is poor vision but not inherently uncorrectable to 20/20. My vision was considerably worse (appx. -12 diopters in each eye, corresponding to a focal length of ~3.25", or about 20/1200); it was correctable to 20/20 with the aforementioned contacts, or with glasses (although apparently not with soft lenses). If you're only getting 20/45 corrected vision, you either have some other vision problem which is not being addressed, or you need a new optometrist and/or optician.
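For the curious, the ~3.25" figure is just the standard optics relation: focal length in meters is 1 divided by the lens power in diopters. A small Python check follows; the function name is mine, and it deliberately does not attempt the 20/x conversion, which is only a rough rule of thumb.

```python
METERS_TO_INCHES = 39.3701

def focal_length_inches(diopters):
    """Magnitude of the focal length for a lens of the given power (f = 1/D meters)."""
    return METERS_TO_INCHES / abs(diopters)

print(round(focal_length_inches(-12), 2))  # a little over three and a quarter inches
```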

article summary is misleading on Fatal Flaw Weakens RFID Passports · 2005-11-04 04:16 · Score: 2, Interesting

From the summary:

The passports will also include a 'Tin Hat' that limits the RFID signal to only a few inches, but a demonstration has been made that using specialized hardware, the signal can be intercepted from up to 69 feet.

The poster apparently did not carefully RTFA (skipped page 2, is my guess). The 69-foot detection range does not apply to the RFID chips in this case, because of that 'Tin Hat' (the passport is radio-shielded when closed); Schneier was referring to RFID chips in general when he brought that statistic up, not this particular instance. Arguably (if you're going to put RFID chips in passports) this is one of the few things that they've actually fixed.

(I personally think that the whole thing is a bad idea...but let's attack the system on its demerits, not on no-longer-relevant bugs.)

Inkscape on 29 Vector Drawing Programs · 2005-08-01 19:54 · Score: 1

You probably haven't seen it mentioned because it's one of those listed in TFA. Which, you know, might be worth R-ing before you post. :)

Actually, HP does still do "blue-sky" R&D. on HP Fires Father of OOP · 2005-07-21 16:05 · Score: 2, Interesting

HP's not in the blue-sky R&D business, and hasn't been for many years now.

Not true at all. I worked for HP Labs last summer in their Information Dynamics Lab. Much of the research that this group, and others that I'm personally aware of, does is of a distinctly speculative nature and doesn't directly lead to a product. This is fine by HP, because pure research generally pays off in one way or another in the long run.

Corporate blue-sky R&D doesn't generally make the papers until it's no longer blue-sky, i.e., just because you don't see it happening doesn't mean it's not there. If you want to know who's doing research, try reading the scientific literature instead.

this is a news article, not a technical paper on The Evil in E-Mail · 2005-06-12 07:26 · Score: 1

If you want to criticize what Skillicorn is doing on a technical basis, try reading the actual technical report that he wrote on the subject, rather than basing your conclusions on a news article. Heck, you might even learn something.

Skillicorn's papers, including this one, can be found on his Queen's U. website.

Re:Two Keys: Data Mining and Delay on Cracking the Google Code... Under the GoogleScope · 2005-05-10 13:28 · Score: 2, Interesting

The parent post is largely composed of misinformation, ignorance and irrelevance. I'd suggest to its author that it might be a good idea to do some basic research before posting on a subject which is, I suspect, outside his area of expertise.

(1) What you have described as Google's "algorithm" is a distortion of one particular technique used in data mining (actually machine learning, but we'll let the vocabulary slide); furthermore, no one other than a first-year AI/machine learning student would use exhaustive search in parameter space ("brute force") to come up with a solution. In fact, a very brief search on your favorite search engine (for, say, "PageRank algorithm") would reveal that the basic algorithm is actually very simple, and does not in fact involve learning from labeled examples, as you suggest. (More recent versions of the Google ranking mechanism may safely be assumed to be more sophisticated, but I'd bet serious cash that they're nothing like what you describe.)

(2) PageRank--the basic algorithm, that is--is not, and never has been, based, even in part, on inbound link count. This can also be easily verified by a few minutes' research as above.

(3) Your refrigerator example doesn't actually support your point. If Google's ranking algorithm is continually changing, as you suggest, then you can never know whether any change you made had any effect on your ranking. (And "algorithm can vary by the relative date of various things"? Say what?)

Re:An Idea on Computer Cracks 5x5 Go · 2005-02-21 16:59 · Score: 1

Something like this has been tried; Pierre Baldi has worked on neural network approaches to playing Go. I don't know how much progress his group has made on the problem, though.

Re:Election "incidents" on Verified Voting · 2004-10-28 10:51 · Score: 1

No, it's not the only way. Redundancy works, too.

desirable properties of voting systems on Researchers And Registrars Debate E-Voting · 2004-10-13 14:43 · Score: 1

In order to get anywhere with Shakrai's question--which is a good one--we need to try to agree on the essential principles and desirable qualities for a system. A related point I'll make briefly is that it's worth considering the (de)merits of both the voting machines themselves, and the system that makes use of them. Good designs for each are necessary in order to get good results, so it's not sufficient to just evaluate the machines.

So here's a link to a short essay that I've written, in response to this post:

http://www.livejournal.com/users/jrtom/1007.htm

from which a brief excerpt:

I haven't used the lever-operated machines that Shakrai describes, so my analysis is based on my best guesses and his brief description, and I might have missed something. In any event, it sounds like it covers the essentials fairly well, although it's not clear how well it provides security. As for the desirable qualities, if we score them on a 1 (bad) to 5 (excellent) scale, I'd guess clarity: 3 (no pictures by the names), flexibility: 1 (changing scheme probably requires replacing hardware), transparency: 4+, convenience: 2 (sounds difficult for blind, disabled, illiterate, or non-English-speaking voters), efficiency: 2, interoperability: 2, privacy: 4 (not 5 because the low convenience may, as you pointed out later, require a voter to get help to make their choices). So: a workable system, but one that has room for improvement.

Comments · 24