An Algorithm To Stop Joke Plagiarists

Posted by timothy on Thursday September 10, 2015 @04:12AM from the stop-me-if-you've-heard-this-one-before dept.

Bennett Haselton writes: The comedy world crucified Josh "Fat Jew" Ostrovsky for building his career on re-tweeting other people's jokes without attribution. But Twitter, or whichever company rises as their successor, could easily implement an algorithm that could stop plagiarists from building a following, while still rewarding joke writers who come up with original content. Read on for Bennett's take on how such a system could work.

The basic algorithm is very similar to the random-sample-voting algorithm that I've advocated as a way to stop vote manipulation on Digg, to handle abuse reports in a scalable way on Twitter and on Facebook, and to identify the best ideas submitted to the White House's "We The People" petition site. The algorithm can be used to rate the best jokes (at least according to the average rating of users, not according to some Platonic ideal), while still flagging plagiarized jokes and preventing anyone from building up a following by using them.

Under the algorithm, suppose a subset of users -- let's say, 1 million -- signs up to receive tweets in the general humor category. When a would-be amateur comedian comes up with a funny tweet, then in addition to tweeting it to their followers (if they have any), they can submit it to the humor category generally. The joke is first pushed to the feeds of, say, 1,000 randomly selected users, who have the option of rating it (independently of each other, without seeing the opinions of other raters). Once the joke has acquired enough ratings to constitute a statistically significant sample -- so that the average rating really does reflect the community's "opinion" of the joke -- then the joke gets released into the general pool of jokes available to all 1 million users subscribed to the "humor" category. Those users can decide what threshold of quality they want to set for the jokes that show up in their feed -- for example, if you only want to see jokes that got an average rating of 9 out of 10 or higher, you might only see 50 a day, but if you can lower your standards down to an 8, you might see 100 or 200. And if a user really likes a particular joke that they see in their "threshold feed," they can browse the other jokes in that author's Twitter feed and decide whether to follow them.
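To make the flow concrete, here is a minimal sketch of the sample-then-release steps in Python. All of the names (Joke, pick_raters, maybe_release, threshold_feed) are made up for illustration, and the min_ratings cutoff is an assumption standing in for "statistically significant," since no specific number is given above.

```python
import random
import statistics
from dataclasses import dataclass, field


@dataclass
class Joke:
    author: str
    text: str
    ratings: list = field(default_factory=list)  # 1-10 scores from sampled raters
    released: bool = False


def pick_raters(subscribers, sample_size, rng):
    """Choose the random 'focus group' that sees a new joke first."""
    return rng.sample(subscribers, min(sample_size, len(subscribers)))


def maybe_release(joke, min_ratings):
    """Release the joke to the general pool once enough ratings are in.

    min_ratings is an assumed stand-in for 'statistically significant'.
    """
    if len(joke.ratings) >= min_ratings:
        joke.released = True
    return joke.released


def threshold_feed(jokes, min_average):
    """Return released jokes whose average rating meets a user's threshold."""
    return [
        j for j in jokes
        if j.released and j.ratings and statistics.mean(j.ratings) >= min_average
    ]
```

A caller would pass the humor subscribers and a sample size of 1,000 to pick_raters, append each rating to joke.ratings as it arrives, call maybe_release after each one, and build each user's feed with threshold_feed using whatever threshold that user picked (9, 8, and so on).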

So if your joke sucks, it will only end up wasting the time of about 1,000 people, but if it gets a high rating, it will be available in the feeds of up to 1 million people. Thus from the user's point of view, only about 0.1% of the jokes that they see in their feed are sucky jokes that were pushed to them as part of an initial "focus group" to measure their quality; the other 99.9% is made up of jokes that met whatever threshold they set for the average rating.
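The 0.1% figure is just the sample size divided by the pool size: 1,000 out of 1,000,000. Strictly speaking, that is the chance that any one newly submitted joke gets pushed to you for rating; how many unvetted jokes that works out to per day depends on how many jokes are submitted, which isn't specified here. A rough sketch of the arithmetic, where the function names and the 5,000-submissions-per-day figure are assumptions for illustration only:

```python
def sample_probability(sample_size, pool_size):
    """Chance that a given user lands in the sample for any one new joke."""
    return sample_size / pool_size


def expected_sample_jokes(submissions_per_day, sample_size, pool_size):
    """Expected number of unvetted jokes pushed to one user per day."""
    return submissions_per_day * sample_size / pool_size


# 1,000 raters drawn from 1,000,000 subscribers.
per_joke_chance = sample_probability(1_000, 1_000_000)

# Hypothetical volume: 5,000 jokes submitted per day (an assumed number).
unvetted_per_day = expected_sample_jokes(5_000, 1_000, 1_000_000)
```

With those assumed numbers, per_joke_chance is 0.001 and a user would be asked to rate about 5 unvetted jokes a day.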

As I've stressed in the case of other applications of the random-sample-voting algorithm, this system is scalable, because the number of available reviewers grows as the community grows. It's also non-gameable -- because the raters are randomly selected, even if you create a large number of zombie accounts to try and upvote your own joke, the zombies won't constitute a significant portion of the raters, if the raters are selected from the entire pool of 1 million users.
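As a back-of-the-envelope check on the non-gameable claim, the sketch below estimates how much of a random sample a zombie army would occupy and how far it could drag the average. The function names are hypothetical, and the simplifying assumptions are mine: the pool size counts the zombies too, every sampled account actually votes, and the zombies all vote the same score.

```python
import random


def expected_zombie_share(zombies, pool_size):
    """Expected fraction of a uniformly random sample made up of zombies."""
    return zombies / pool_size


def max_rating_shift(zombie_share, honest_average, zombie_rating=10):
    """How far zombies move the sample average if they all vote zombie_rating."""
    return zombie_share * (zombie_rating - honest_average)


def simulate_zombies_in_sample(zombies, pool_size, sample_size, rng):
    """Draw one random sample and count how many zombie accounts land in it."""
    # Accounts 0..zombies-1 are treated as the zombies (an arbitrary labeling).
    sample = rng.sample(range(pool_size), sample_size)
    return sum(1 for account in sample if account < zombies)


# Hypothetical attacker: 10,000 zombie accounts in a pool of 1,000,000.
share = expected_zombie_share(10_000, 1_000_000)
shift = max_rating_shift(share, honest_average=5)
```

Under those assumptions, even 10,000 zombie accounts make up about 1% of a sample (roughly 10 of the 1,000 raters), and voting a straight 10 would lift a joke that honest raters score at 5 by only about 0.05 points.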

Still, even under this system, it would be possible to take a highly rated joke and re-word it slightly (to fool any text filters looking for blatant copy-and-paste jobs), and pass it off as your own, hoping that your re-worded version will also get pushed out to a wide audience and net you some extra followers. To prevent this, you can implement a "duplicate" flagging feature that also relies on the random-sample-voting system:

1. If a user recognizes a joke as a re-worded version of someone else's tweet, they can flag it as a "duplicate", with a link to the earlier tweet that they think is similar. (Flagging it as intentional "plagiarism" would be a bit harsh, since it's quite common for multiple comedians to come up with the same joke.)
2. The flagged joke, along with a copy of the earlier joke, would once again be sent out to a random sample of subscribers to the humor category, who are then asked to vote on whether the two jokes are substantially similar.
3. If a statistically significant majority of those users vote that the two jokes are essentially duplicates, then the second tweet gets displayed with a flag icon (shorthand for "our users have identified this as a duplicate of an earlier joke") with a link back to the other tweet that was identified as an earlier version of essentially the same joke.
4. If a majority votes that the two jokes are not similar, then nothing happens. Optionally, if an overwhelming majority of the users vote that the two jokes are not at all similar, then some kind of reputation point penalty could be applied to the user who flagged the second joke as a "duplicate". This discourages people from frivolously duplicate-flagging a joke.
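The voting outcomes above could be resolved along these lines. This is only a sketch: the names Outcome, z_score and resolve_flag are hypothetical, the significance test is a standard one-sided normal approximation that I've swapped in for "statistically significant majority" (1.645 is the usual 5% cutoff), and the 10% figure for an "overwhelming majority" saying "not similar" is an arbitrary assumption.

```python
import math
from enum import Enum


class Outcome(Enum):
    FLAG_DUPLICATE = "flag the later tweet as a duplicate"
    NO_ACTION = "nothing happens"
    PENALIZE_FLAGGER = "no flag; reputation penalty for the flagger"


def z_score(yes_votes, total_votes):
    """Normal-approximation z statistic for the 'yes' share versus 50%."""
    share = yes_votes / total_votes
    return (share - 0.5) / math.sqrt(0.25 / total_votes)


def resolve_flag(yes_votes, total_votes, z_cutoff=1.645, frivolous_share=0.10):
    """Decide what happens to a duplicate flag after the sample has voted.

    z_cutoff and frivolous_share are assumed values, not taken from the post.
    """
    if total_votes == 0:
        return Outcome.NO_ACTION
    if z_score(yes_votes, total_votes) >= z_cutoff:
        return Outcome.FLAG_DUPLICATE
    if yes_votes / total_votes <= frivolous_share:
        return Outcome.PENALIZE_FLAGGER
    return Outcome.NO_ACTION
```

With these assumed cutoffs, 60 "duplicate" votes out of 100 would flag the tweet, 52 out of 100 would be too close to call and nothing would happen, and 5 out of 100 would count against the person who raised the flag.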

This does have the unfortunate result that if you unintentionally write a joke that duplicates someone else's, it will still end up with the "duplicate" flag after users recognize the similarity to the earlier version. This is, however, something that I don't think any algorithm can solve, because it's impossible to tell the difference between someone copying another person's joke and someone coming up with it independently. A comedian whose joke ends up being labeled with the "duplicate" flag, just because someone else came up with the same gag first, could leave the joke in their feed, but they might consider the flag to be a mild embarrassment.

On the other hand, if you're just a full-time plagiarist like the Fat Jew, and virtually all of your jokes end up being flagged as clones of other people's work, then your entire feed will be littered with "duplicate" flags that mark you as a hack. Depending on whether Twitter's terms of service prohibit serial plagiarism, your account could even get suspended.

Meanwhile, anybody could still set themselves up as a curator who re-tweets other people's jokes with the original attribution intact. Many users would find that they wouldn't need curators at all, when they can just subscribe to all jokes that get an average rating of, say, 8.5 or higher, but if your humor happens to align very closely with the kind of jokes picked out by a particular curator, you could subscribe to get jokes re-tweeted directly from them. And since the original attribution would be intact, any time you saw a joke that you really liked, you could subscribe to updates directly from that author. Curating can still serve a valuable function that plagiarism does not.

In addition to dealing with plagiarists, though, what I think is interesting about this system is how it would overturn everything we know about what it takes to build a reputation. In the current ecosystem, to build a following, it helps to have good content, but what really matters is hustle -- making friends in high places who might be able to give you a boost with a re-tweet or a shout-out, looking out for opportunities for free publicity, etc. Well, I admire the people who have the energy to keep that up. But from an economic standpoint, "hustling" is a non-productive activity, because it doesn't actually make your content better, it's just an attempt to crowd out someone else's content with your own, which may be better or worse, and it's a zero-sum game. The "hustling" ecosystem is also non-optimal from the user's point of view -- if Joe is better at writing jokes, but Bob is better at hustling, then you as the user are more likely to be exposed to Bob's sub-optimal content, and may never even hear about Joe.

The random-sample rating system, however, makes the entire notion of "hustling" obsolete. The only way to get your content in front of lots of people is to write content that gets a high average rating from the initial sample of people who see it.

If such a system ever gets implemented, by Twitter or any other company, maybe the Fat Jew can find out if any of his own original material meets the bar. But don't hold your breath -- the marquee joke currently displayed on his Twitter feed is "You can't get an STD if you never get tested."

7 of 128 comments

tl:dr by AndyKron · 2015-09-10 04:40 · Score: 5, Informative

TL:DR

I got a joke for you by Anonymous Coward · 2015-09-10 04:47 · Score: 5, Informative

Did you hear the one about the Bennett Haselton post that wasn't tedious bullshit?
Me neither.

Bennett Post Blocking Script by Anonymous Coward · 2015-09-10 05:13 · Score: 2, Informative

aardvarkjoe's Bennett blocking script

An algorithm to stop B.H. by mishehu · 2015-09-10 05:14 · Score: 4, Informative

Seriously, Slashdot should look into the development of such a system...

Re:GODDAMIT, I THOUGHT THIS HAD FINALLY ENDED! by AmiMoJo · 2015-09-10 05:21 · Score: 2, Informative

It's like he misses the abuse, and just couldn't stay away.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

As Usual by Sarten-X · 2015-09-10 05:24 · Score: 3, Informative

Yet again, Bennett Haselton inspires us with a short-sighted solution, having never considered whether his system will actually work.

1. If a user recognizes a joke as a re-worded version of someone else's tweet, they can flag it as a "duplicate", with a link to the earlier tweet that they think is similar.
Right there in step 1 is the problem. By requiring a link to a sentence someone read months ago, the burden on the user is raised unacceptably. Users won't bother policing when it's difficult, unless the case is severe enough to stir up an outrage - which would already result in more damage than just flagging a user's tweets.
Of course, the potential for abuse is also high. Changing a single word can parody an original post, yet changing a different single word may not avoid plagiarizing. An automated algorithm won't likely be able to tell the difference, so it will fall to manual effort to identify which flagged duplicates are actually malicious. In context, even an identical phrase may be making a very different statement, so taking the tweet out of context for manual review makes false positives very likely.
Shakespeare plagiarized. Plato plagiarized. Tom Lehrer penned many verses praising plagiarism. The bottom line is that plagiarism goes hand-in-hand with creation, and it should always be evaluated only in the entire context of both works - the plagiarizing and the plagiarized. What is being said is often not what's being written.

--
You do not have a moral or legal right to do absolutely anything you want.

Duplicate of Bennet on Digg, Bennet on petitions, by raymorris · 2015-09-10 06:24 · Score: 1, Informative

Flagged as a duplicate of Bennett's thoughts on Digg, which is a duplicate of Bennett's thoughts on petitions.
Rated -1, lame and unoriginal.