Search
Search the archive with full-text matching across story titles, bodies,
and comments. Phrases are quoted; or, -word,
and parentheses behave as in a web search. Queries must be at least
3 characters.
Search the archive with full-text matching across story titles, bodies,
and comments. Phrases are quoted; or, -word,
and parentheses behave as in a web search. Queries must be at least
3 characters.
The fact is, LK, that Sodomy laws by and large (if not entirely) are unneeded legislation if its not hurting another person, in which case I think we can all agree it would be considered rape. If its two consenting adults, regardless of the sex, how is this hurting anything? Maybe those icky sodomy-germs can ooze through closed doors and infect passersby as they walk by the evil sodomy houses across America. Ew! Sodomy! Ew, it makes me wanna puke, lets make legislation! Ewewewewew! Frankly... its immature to think that something should be considered illegal because it grosses others out. But I bet it doesn't gross LK out. I'm sure he has a very intelligent, well-read, well-researched, well-founded reason why sodomy is bad which doesn't fall back on emotion or that pesky zip-zaggin' morality meter. Right?
You're not going to get away with playing that word game with me. Fetus means baby. What you're doing is equivalent to saying that "It's not a baby, it's a baby!".If we want to be really anal about it (hee hee, get it? Anal? I just mentioned Sodomy and then I said anal. LK, I hope that's ok with you. Need something to calm ya down?) the definition reads thusly from Merriam-Webster:
So... if we decide that "developing" doesn't mean "developED" for your argument, a fetus isn't a fetus until the third month. So.. you don't have a problemw ith abortions before its a fetus, do ya? Or is ther a sub-fetus or sub-sub-fetus that I'm missing somewhere?
Science is able to tell us that the baby is a separate and distinct person from the mother because they have different genetic codes. Science is able to tell us that the mother's body recognizes the baby as a foreign entity because of her immunological response to the presence of the baby.Ok, I'm gonna get extreme here.. 'cause our pal LK loves extremity. Fine.. the gestating sub-sub-sub-sub-fetus has a different genetic code and elicits a immunological response. But is it really a "person?" Don't get me wrong.. I'm not an evil gal with an axe out there for babies (although I'm sure you think I am). At that point its a bundle of cells, at the whim of that body's ability to sustain it. If its a person, and if that body rejects the baby? Is it then murder? Or rather, manslaughter.. since the woman really wasn't directly responsible for her child's slaughter? Or what if the gestating child somehow is hurting the mother, and endangering her life? Whatr if that baby, when brought to term (against the wishes of the mother), ends up with that mother's death? If that baby was a person, and is considered to be equal in the eyes of the law... shouldn't IT be charged with something? I mean.. as you said, its a person. Or maybe the baby being charged in the 5th precinct is too extreme... could the church that pressured the woman to birth the child be charged with manslaughter?
The fact of the matter is, for good or for ill, our laws are meant to protect those beings which have found its way OUT of the womb and are a part of society. If that sounds harsh and cold and heartless to you, I apologize.
Why should people be trusted to do the right thing when it comes to any type of murder? Would fvor trusting people to do the right thing when it comes to deciding when Grandma has lived long enough and should be put out of her misery? After all it's a moral and value judgement as to whether an elderly person's life is still worth living.Well, again.. it would depend on your definition of living. Once grandma is unable to make decisions for herself because se's in an unrevivable coma and can be kept alive against her body's will through the use of technology alone... I think grandma would agree that its time to exit stage left.
At my current gig (OnGame ie. PokerRoom.com) we've been using a kind of modified Scrum. The idea is to start "El Project Grande", spend a day breaking the project into parts and deciding what depends on what, then spend another day in groups breaking down the parts into parts until you have tiny pieces which ideally will take less than 16 man hours a piece.
You now have a big pile of tasks, sub-tasks, sub-sub-tasks and so forth with dependencies and time estimates (all set by the programmers themselves.) It's then up to the Scrum Master aka. "Get-those-assholes-off-our-back-person" to show the results to the customer and decide priorites.
The priorities are passed to programming team, and they then plan the first "iteration". Now, the goal of every "iteration", which typically spans max one month, is to produce a fully implemented, tested and launchable subset of the complete product.
Every day, all the coders have a *short* meeting (10-30 min). Each individual states what he is working on, what he's done since the last meeting, what he's planning to do until the next meeting and most importantly he's strongly encouraged to cry for help if he's blocked in any way.
In addition to this there's an evil, buggy Excel sheet with all the (sub-sub-sub)tasks along the vertical, and days along the horizontal. It's each programmers responsibility to update estimated hours left for tasks he has been working on each day. NB! *hours left* It's prefectly alright to up the time remaining if you have unexpected problems. So a certain task may go "16 12 4 56" hours left.
The main points of Scrum (for a programmer):
Each separate knot has six variables with two possible values (I gave three from memory, there are three more) and one with 24 values - colour. Therefore each knot represents a kind of "super byte" which can hold a single value from 0 to 1535. The data is not in the string but in the joins between strings. From a base string, there are multiple strings, with each not having 1536 possible values. From each string, substrings, sub-sub-strings and sub-sub-sub strings may be appended. Each append operation adds another "superbyte". The number of strings is indetermiante, because, as the article says, effectively the data spreads out in 3-space.
Of course, I don't reckon they used it to maximum density, and the use of the bits may well have been representational. But it might be that it in facts encodes a chapter/paragraph/sentence/word structure. Simple sentences ("Fred owes Bill 5 goats") would be base plus one level of attached strings - a fairly simple level of encoding, with a super-byte at each knot. But it would not be diffcult to generalise from this onec it became common. In fact, this would tend to happen automatically if Bill tied all his IOUs onto one "backing string": from spine, substring identifies debtors, sub-substrings identify multiple debts.
The inability of people to get domains which are safe from either squatters or big business. The idea behind the TLD's was to allow separation, for example, of two companies that do things in different spheres or places. This has failed. For example, try going to Tuvalu and setting up a database of vulpine observations. I think you'll find that your "fox.tv" will get a very sharp letter and your ISP will drop you like a stone about 4 days later! The number of cases like this has got out of hand and can only get worse over time. In addition the limited number of TLDs and the way they are doled out gives ICANN far too much power.
Are you proposing that people type in "66.35.250.150"
Possibly, if 66 TLDs were sold and then the owner of #65 sold 35 subdomains and the owner of #35 sold 250 sub-subdomains and the owner of the 250th of those sold 150 sub-sub-subdomains and you wanted the site on the that sub-sub-sub domain, then yes. In reality I doubt that the sub divisions would ever get that deep since there would be no 255 limit; http://86625000/ would be totally legal and would allow as many individual sites as you would need in the example you gave. Would you care if the URLs in your bookmarks or on Google looked like this? How many sites do you regularly type the URL for? How many phone numbers have you memorised that are even longer than 86625000?
What happens when IP addresses change?
I specifically said that this is an abstraction on top of IP's and that there would be no change. If you like, I'm sort of suggesting that DNS servers accept numbers as legal name characters. There is absolutely no connection between web numbers and IP numbers.
TWW
I bought a new TV last Christmas, and my recent experience tends to agree. It wasn't even a smart TV, just a standard one. I still had a whole bunch of problems.
* Terrible interface for trying to figure out how to adjust color/brightness
* Terrible interface for trying to scan for over the air channels. Ran through this probably 5 times to get it right.
* Discovered a firmware bug which would turn the TV on once every 24 hours. This could not be disabled. Spent two weeks trying to upload patched firmware, which included a web site that said the model number was wrong even when it wasn't (no ability to browse, you've just got to know and type it from the box), multiple calls to the main vendor and then sub-vendors, finally getting the firmware patch with no instructions, calling back to find out totally unintuitive process for uploading the firmware, part of which includes "wait for 5-20 minutes while it takes care of itself in the background, and if you interrupt this invisible process you may brick your TV".
* Found that patched firmware didn't actually fix the bug, but that it at least allowed for a sub-feature that, if there's no signal to the TV, it will turn back off 15 minutes later.
* TV had terrible sound. Tried multiple versions of traditional (audio jack) and USB speakers, none of which worked for inexplicable reasons. Eventually took a big risk spending $80 on a soundbar that would handle digital audio, hoping it would work, and got lucky. (Sub-issue: soundbar goes to sleep if the TV is paused for a while, and when you wake it back up, the TV doesn't recognize it. You've got to turn the TV on/off to get sound working again. Sub-sub issue: sometimes Netflix loses track of sound, even when the TV had located it; same fix.)
* I've got 3 remotes: TV, streaming device, and sound bar. The wife and kids get it, but none of our visitors or relatives can figure anything out. I'm *this* close to printing a laminated cheat sheet of instructions, which they probably won't use because it's too complicated.
* Relatives tend to leave the TV either tuned to an over the air signal or a powered-on streaming device, so that when the firmware bug kicks in, then the TV stays on until someone realizes it was accidental and turns it off hours later.
I want to double down upon what you are saying as you know far better than the parent what is going on.
Just today I had sub-sub contractors from Comcast trying to fix the cable from the box across the street to my home for an issue I first reported the issue in mid-December but after a few months of nonsense things finally got worked out.
After first having a visit from a person who appeared to be a Comcast employee declaring the connection between my home and the distribution box across the street bad (I was seeing .1 mbs upload(should have been closer to 10mbs) yet semi-normal downloads), he wrote it up for replacement... and so began a multi-month process.
A week later received a note from a sub-contractor of Comcast (though with the Comcast letterhead on the door hanger and the sub-contractors name in the fine print) which said my cable needed to be replaced. Over the next couple of months I'd call them to check on the status with the work order # on the tag as things slowly worked their way through the Comcast and local city bureaucracies.
Eventually they told me that the work had been issued to a 'sub-contractor' (really a sub-sub-contractor) who took about a month to get things worked out as well between the city and them (which included two paintings of the paths of various utility lines under and around the street (much to the annoyance of the neighbors who didn't like the paint on their property)).
Finally the day of repair arrived (today) and they did their digging... alas they hit a rock when tunneling under the driveway of the neighbor in front of me (and right next to the distribution box) so they had to fill in most of what they did (amazingly professional in this way) and say that another team from the same company would have to come out in a week with a different boring machine to complete the work.
The pathetic thing about this whole process was that as far as Comcast is concerned, my issue has been resolved months ago by virtue of it being sent to an outside vendor... in the close out email even citing the fact that my signal strength had returned to normal (hint: it hadn't fully).
There is a part of me that is considering dropping Comcast service once this whole repair effort is complete (costing them $5-10k)... however they (unfortunately) provide the fastest internet for the price... when it works.
This IT project did NOT fail. In the UK, this is the primary form of political corruption. Government ministers give massive projects to their cronies, receiving astonishing kick-backs and the 'employment' of their family members at the heart of these fake endeavours. The 'leaders' of these projects are people who are ALWAYS promoted to the House of Lords at some point. Meanwhile, the project activates endless levels of sub-contracting, so that the people at the bottom doing the real work are separated by as many sub-sub-sub-sub contracting shell companies, ensuring they neither care nor have the money or infrastructure to stand any chance of getting the job done.
After a few such failures, Blair's scum then justify massive sub-contracting to places like India, where some form of working system is spewed out, albeit of the lowest possible quality imaginable.
In China they shoot officials for engaging in corrupt practices a thousandfold less significant than what is commonplace with almost every Westminster politician. However, in the West, it is considered a perk of being in political office.
Other than basic format icons, which have been pretty standard for 20 years, every item in the ribbon has a text label. You can look for text or icons as you see fit. Whatever works for you. It confuses the hell out of me how people like yourself bitch about having to find things by icon when the text is plainly visible.
Menus do even worse in confined spaces. You either have to scroll, or you have to make them stack up. And don't even get started on the submenus, and sub-sub-menus and sub-sub-sub-menus. Plus, once you click an item, they have to disappear so you can see your work again. If you need to do the action again, you have to find it again in the menu, possibly several levels deep. Or if you need a similar action, you have to do the same thing. With the ribbon, the item is there with a single click.
Menus have so many flaws. The more functions you add, the harder they are to navigate. They're difficult for people with poor dexterity to use. And it's a sea of text that you're lucky if it has an icon next to the command to help you pick it out.
Of course I didn't RTFA but we've had discussion about this kind of thing in Finland recently related to the new nuclear plant construction site and the worker benefit/paycheck violations made by sub-sub-sub-sub-subcontractors. AFAIK in Finland the buyer [of services] should already be responsible for that the contractor she grants a job to is playing by the law.
It's only reasonable that sub-contractors using pirated software would be regarded as the contractor using pirated software, as you can most likely assume that sub-contractor has lower operating costs because they use illegally obtained software, thus possibly lowering the costs of their job for the contractor.
Now, if someone would dare to pursue worker rights in the same fashion *ducks* :) And the other away around, this software/illegal activities by contractors should be reflected upon the buyer in all around fashion, at least to some reasonable extent of requiring the buyer to monitor and react to detected anomalies..
I'm not convinced in the slightest that a multiverse exists (in any sense of the word), but I agree that assuming things like Brane cosmology are true, the logical conclusion is that these other universes would, based on probability, have something recognizable to us as 'stars' and even 'life'
Possibility always wins when we play the probability game.
As I said above, I think the multiverse theories are a pantload of stink. When it comes to the very big and the very small, we can't ever seem to reconcile with infinity. There's always a smaller particle or a bigger cosmological super-structure...how long before we find that Higgs Bosons actually have a sub-particle or that there are actually multiple multi-verses? It's a reductive zero sum game...when in doubt, just add another layer of complexity.
I submit that instead of spinning our wheels thinking up bigger and smaller structures so we can get research grants, we should instead do whatever needs to be done to nail gravity down cold. Once we understand gravity as well as other forces, things will start making more sense such that these unprovable multi-verses and sub-sub-sub-sub particles will be unnecessary.
BT is the equivalent of Bell/AT&T in the US. It's impossible to sue them into oblivion. The best you can hope for is that one of the sub-sub-sub-sub-sub-CEOs gets a slap on the wrist and won't be invited to the next golf tournament.
Mozilla Dev1: OMFG, we haven't been mentioned on the front page of slashdot in like 20 minutes or something!
Mozilla Dev2: No problem. We'll just release yet another pointless sub-sub-sub-sub-sub version.
Me: (looking at yet another firefox recompile from FreeBSD ports) Screw it, I'm going back to seamonkey.
Would you care to list some? The only thing that comes to my mind besides security/bugfixes is support for hard drives bigger than 8GBs (SP4, IIRC). I guess SKEY too, but it was junk and saw no use by anyone.
No, Active Desktop did not come in any service pack. The Active Desktop update came with the upgrade to Internet Explorer version 4.
I have to disagree with you completely here. Windows 2000 changed greatly. Everything was in a different place, and almost everything had to be operated differently. Windows NT4 didn't come with a device manager or PnP automatic hardware dialog, but with Windows 2000, these were built-in and the ONLY way to get your hardware working. It was FAR more like 98 than NT4 (even with SP6), even though they managed to obscure and bastardize the interface and organization to make it much more miserable to use.
I have to disagree on everything here. The security model stayed nearly the same, the stability was no better in my experience, and the UI changes were quite major, and very negative. Things that were straight forward options in the control panel in NT4, became nestled in sub-sub-sub-sub-sub menus, of the system control subsystem.
The advantages to Windows 2000 were all on the users end. It made a better desktop because of updated DirectX, Media Player, Windows98-like interface, standby/hibernate, PnP, etc.
Signed drivers were there in Windows 2000, and the built-in firewall has been there at least since the beginning of NT4.0. I don't recall if it was there in NT3.51 (or NT3.1), but I wouldn't be surprised one bit.
Critical Update Notification (didn't install) was introduced in NT4. Automatic updates were in Windows 2000 when it was released, so the feature isn't new to XP by any means.
You act like this is something new. After a couple years, Microsoft stops releasing new versions of programs for older OSes. DirectX, Windows Media Player, Internet Explorer, etc. This happened to Win95, NT4, etc. It just happened to be at the same time as XP SP2 was released. Saying that it was somehow a catch is really misreading the situation.
I've tried MythTV, and found it to be terribly lowsy. The interface is poor, it uses it's own stupid format, seeking is quite slow, your local media files have to be accessed through a different sub-sub-menu (and MythTV has a lot of sub-sub-sub-menus), and it's conflict resolution/priority system seems massively over-complex to me.
In addition, if you are going to be doing software encoding, MythTV has terribly lowsy quality at even very high bitrates, and yet eats up tons of CPU power to do it. It's been some time since I tried MythTV, but I doubt much has changed.
Freevo might be a much better bet, but I was turned-off at the channel setup step, and decided it would be far quiker and easier for me to write a few shell scripts using mencoder for recording, and a simple file manager that opens video files with mplayer.
I was doing this with a cheapo analog capture card (and eating up CPU cycles encoding to MPEG4 in realtime with mencoder), but hearing the rave reviews about the Hauppauge PVR 250/350, I spent the money to get a PVR250, and found the quality to be no better than what I was getting, and the hardware encoder requires massive bitrates (4000K ie. 2GBs/hr) to produce video without artifacts, whereas mencoder would easily produce video at 1/4 the bitrate, at about the same quality.
If you want to roll your own, instead of using Freevo/MythTV, here's a few tips:
webvcr+ provides a nice web interface to schedule recordings, and uses mysql and xmltv to get listings, just as MythTV does. Also install Links 2.0, so you have a web browser that you can navigate easily using your remove. Just bind a key on your remote to launch links (in GUI mode) and open a page to your webvcr+ interface.
I recomend using any old filemanager, and a bare-minimal window manager (I use blackbox). Then something like bbkeys to bind a remote-control button to open the filemanager to your "videos/" folder, and you're pretty-much set. Just go up/down to your files, and hit "OK" to play them.
I also have some very basic shell-scripts, which allow me to edit my files in avidemux2, re-encode them to smaller bitrates (if I want to save them), make a data CD out of any number of video files, or re-encode them to VCDs/SVCDs/DVDs any of which can be played on most DVD players. etc, etc. These are things you can't do from within MythTV's interface, but any simple filemanager will have no problem with.
Wikipedia - controversial, contradictory, sometimes extreme points of view on current politics.
I would politely argue that "sometimes" is too weak a word: as I noted before, Wikipedia has a wide-spread, highly distributed and non-trivial problem with bias in it's articles. From my own sampling of controversial articles, I would say that most, if not all, of these pages contain at least one or two paragraphs that state opinion as undisputed fact. That doesn't cover the numerous stubs and "hidden" articles (sub-sub and sub-sub-sub-etc articles that hide amidst the forest of articles in Wikipedia). Short of a complete slash-and-burn approach I don't see a solution to this.
I guess it does not matter too much to me since I don't think I would ever use wikipedia as a source for political information. I think most of the people concerned about politics are concerned with the information provided to posterity. I just want facts and some popular hypothesis.
But, as the article states, the average reader probably does not have the means to resolve factual errors (nor should he/she be expected to). Similarly, he/she has no means to recognise the the polical bias. Whilst you may not go to Wikipedia for political information, others may do so and that is my concern.
Finally, bias may not simply extend to controversial articles. I noticed "bias creep", where particularly dedicated authors modify any and all articles even loosely related to a controversial topic so as to better push their agenda.
CSC was the prime contractor; they built a 12-story building across the street from one of their big DC datacenters with over 1000 people in it, and surrounding offices. Have their own stop on the Orange train line.
And about one in one hundred did jack crap. It was a complete pig trough. I was a sub-sub-sub-sub-sub contractor, working for a company with 5 employees.
Ideally, all pages of a website should be accessible from one menu system (I realize that this is impossible for really large sites) and that menu system should exist on every page of the site, enabling users to drill down or up to sub, sub-sub, sub-sub-sub, etc. site levels from any page.
Milonic's DHTML menu system does this quite nicely. There are other examples out there, I only use this one because I am quite familiar with it.
People don't seem to see how great this is. Maybe it's because most people don't have all that much data.
On my home systems, I have over 250GB online. That doesn't even count my music or videos/movies, which I keep on seperate, removable, optical storage.
I can tell you from experience, that managing that much data is a huge hassle. Let's say you've got your files organized well. You probably have hundreds of folders for each subject, and you have to broswe to each one with each new file you save. I have a folder (several actually, for various subjects) where I save thing that I've haven't taken a look at yet. Let's say it's a program that I haven't installed. Well once I do install it, I need to clean up all the temporary files, then browse around to another folder (takes a minute or two when you have hundreds of folders), where I save installed programs, and browse to the appropriate sub-folder, and save it. But then I end up doing the same thing with a video clip... Watching it, deciding to delete or save it, then browsing to a sub-sub-sub-sub folder to move it.
Of course, that's enough of a hassle, but things get complicated when I want to move things to another systems, which obviously isn't going to have the same filesystem. Merging each individual folder, into each different folder is seriously time-consuming, and teedious. Without fail, there always ends up being a couple folders in the wrong place, because they were a sub-folder of something else, that I did happen to see when I coppied the contents of the folder.
Then matters are even further complicated, because I may choose to delete older content months later or so, and locating everything is a huge mess.
Personally, I would like to save everything in one place, not having to change folder to folder for each file. When saving something, I could just enter a handful of keywords (eg. "picture penguin snow") which would be much less work than moving to directories or even typing in a long filename. From there, a simple database system would be be able to know what type of file it is, how large it is, and how old it is. That would make it incredibly easy to manage. Whenever I want a file, I type-in "images older than 1 years" or "programs marked as archived" and I get EVERYTHING I'm looking for in a fraction of the time. Not only that, but it makes pruning out old data as easy as it could possibly be. Just search for "linux" and delete older version, no worries about what folder it's in... If it's in a temporary folder and you haven't used it yet, or if it's archived and been in-use on your system forever. Obviously you'll be able to see that information, but unlike in our current systems, it won't stand in your way when you want to find things.
It's absoultely no work at all to transfer files, since the info should stay with them, and it will automatically integrate perfectly with your local file management/organization scheme. What's more, data like marking something as "archived" is great in that your system could automatically move it over the network where you archive your files. Since your filesystem would be a smart database, when you search for the file, it could still turn up in the search results, and be automatically moved back where you need it, when you need it.
Personally, I think this would not only save time and effort, but money as well, because so many people wouldn't be dealing with their file problems by just throwing more space in their systems, instead of spending time on figuring out where every file is, what they can get rid of, dupilcate files, and junk like that.
With this, I should be able to say "tar -xjf 'newest version of mplayer'" However, this will need to be in the actual filesystem to be useful, not just supported for GNOME applications.
Bursty and Hierarchical Structure in Streams * Jon Kleinberg # Abstract A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise -- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them. *A version of this work appears as Cornell Computer Science Technical Report 02-1863 (March 2002). #Department of Computer Science, Cornell University, Ithaca NY 14853. Email: kleinber@cs.cornell.edu. Supported in part by a David and Lucile Packard Foundation Fellowship, an ONR Young Investigator Award, NSF ITR/IM Grant IIS-0081334, and NSF Faculty Early Career Development Award CCR-9701399. 1. 1 Introduction Documents can be naturally organized by topic, but in many settings we also experience their arrival over time. E-mail and news articles provide two clear examples of such document streams: in both cases, the strong temporal ordering of the content is necessary for making sense of it, as particular topics appear, grow in intensity, and then fade away again. Over a much longer time scale, the published literature in a particular research field can be meaningfully understood in this way as well, with particular research themes growing and diminishing in visibility across a period of years. Work in the areas of topic detection and tracking [2, 3, 5, 61, 62], text mining [36, 56, 57, 58], and visualization [26, 43, 60] has explored techniques for identifying topics in document streams comprised of news stories, using a combination of content analysis and time-series modeling. Underlying a number of these techniques is the following intuitive premise -- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. At one level, the approach presented here can be viewed as drawing an analogy with models from queueing theory for bursty network traffic (see e.g. [32]). In addition, however, the analysis of the underlying burst patterns reveals a latent hierarchical structure that often has a natural meaning in terms of the content of the stream. My initial aim in studying this issue was a very concrete one: I wanted a better organizing principle for the enormous archives of personal e-mail that I was accumulating. Abundant anecdotal evidence, as well as academic research [6, 42, 59], suggested that my own experience with "e-mail overload" corresponded to a near-universal phenomenon -- a consequence of both the rate at which e-mail arrives, and the demands of managing volumes of saved personal correspondence that can easily grow into tens and hundreds of megabytes of pure text content. And at a still larger scale, e-mail has become the raw material for legal proceedings [34] and historical investigation [8, 38, 44] -- with the National Archives, for example, agreeing to accept tens of millions of e-mail messages from the Clinton White House [45]. In sum, there are several settings where it is a crucial problem to find structures that can help in making sense of large volumes of e-mail. An active line of research has applied text indexing and classification to develop e-mail interfaces that organize incoming messages into folders on specific topics, sometimes recommending further actions on the part of a user [4, 9, 13, 29, 30, 39, 46, 47, 49, 50, 51, 53, 54] -- in effect, this framework seeks to automate a kind of filing system that many users implement manually. There has also been work on developing query interfaces to fully-indexed collections of e-mail [7]. My interest here is in exploring organizing structures based more explicitly on the role of time in e-mail and other document streams. Indeed, even the flow of a single focused topic 2. is modulated by the rate at which relevant messages or documents arrive, dividing naturally into more localized episodes that correspond to bursts of activity of the type suggested above. For example, my saved e-mail contains over a thousand messages relevant to the topic "grant proposals" -- announcements of new funding programs, planning of proposals, and correspondence with co-authors. While one could divide this collection into sub-topics based on message content -- certain people, programs, or funding agencies form the topics of some messages but not others -- an equally natural and substantially orthogonal organization for this topic would take into account the sequence of episodes reflected in the set of messages -- bursts that surround the planning and writing of certain proposals. Indeed, certain subtopics (e.g. "the process of gathering people together for our large NSF ITR proposal") may be much more easily characterized by a sudden confluence of message-sending over a particular period of time than by textual features of the messages themselves. One can easily argue that many of the large topics represented in a document stream are naturally punctuated by bursts in this way, with the flow of relevant items intensifying in certain key periods. A general technique for highlighting these bursts thus has the potential to expose a great deal of fine-grained structure. Before moving to a more technical overview of the methodology, let me suggest one further perspective on this issue, quite distant from computational concerns. If one were to view a particular folder of e-mail not simply as a document stream but also as something akin to a narrative that unfolds over time, then one immediately brings into play a body of work that deals explicitly with the bursty nature of time in narratives, and the way in which particular events are signaled by a compression of the time-sense. In an early concrete reference to this idea, E.M. Forster, lecturing on the structure of the novel in the 1920's, asserted that . . . there seems something else in life besides time, something which may conveniently be called "value," something which is measured not by minutes or hours but by intensity, so that when we look at our past it does not stretch back evenly but piles up into a few notable pinnacles, and when we look at the future it seems sometimes a wall, sometimes a cloud, sometimes a sun, but never a chronological chart [17]. This role of time in narratives is developed more explicitly in work of Genette [19, 20], Chatman [11], and others on anisochronies, the non-uniform relationships between the amount of time spanned by a story's events and the amount of time devoted to these events in the actual telling of the story. Modeling Bursty Streams. Suppose we were presented with a document stream -- for concreteness, consider a large folder of e-mail on a single broad topic. How should we go about identifying the main bursts of activity, and how do they help impose additional structure on the stream? The basic point emerging from the discussion above is that such 3. bursts correspond roughly to points at which the intensity of message arrivals increases sharply, perhaps from once every few weeks or days to once every few hours or minutes. But the rate of arrivals is in general very "rugged": it does not typically rise smoothly to a crescendo and then fall away, but rather exhibits frequent alternations of rapid flurries and longer pauses in close proximity. Thus, methods that analyze gaps between consecutive message arrivals in too simplistic a way can easily be pulled into identifying large numbers of short spurious bursts, as well as fragmenting long bursts into many smaller ones. Moreover, a simple enumeration of close-together sets of messages is only a first step toward more intricate structure. The broader goal is thus to extract global structure from a robust kind of data reduction -- identifying bursts only when they have sufficient intensity, and in a way that allows a burst to persist smoothly across a fairly non-uniform pattern of message arrivals. My approach here is to model the stream using an infinite-state automaton A, which at any point in time can be in one of an underlying set of states, and emits messages at different rates depending on its state. Specifically, the automaton A has a set of states that correspond to increasingly rapid rates of emission, and the onset of a burst is signaled by a state transition -- from a lower state to a higher state. By assigning costs to state transitions, one can control the frequency of such transitions, preventing very short bursts and making it easier to identify long bursts despite transient changes in the rate of the stream. The overall framework is developed in Section 2. It can be viewed as drawing an analogy to the use of on-off Markov sources in modeling bursty network traffic (see for example the overview article by Kelly [32]), as well as drawing on the formalism of hidden Markov models [48]. Using an automaton with states that correspond to higher and higher intensities provides an additional source of analytical leverage -- the bursts associated with state transitions form a naturally nested structure, with a long burst of low intensity potentially containing several bursts of higher intensity inside it (and so on, recursively). For a folder of related e-mail messages, we will see in Sections 2 and 3 that this can provide a hierarchical decomposition of the temporal order, with long-running episodes intensifying into briefer ones according to a natural tree structure. This tree can thus be viewed as imposing a fine-grained organization on the sub-episodes within the message stream. Following this development, Section 4 focuses on a case in which the document stream is comprised not of e-mail messages but of computer science conference paper titles over the past several decades; the set of bursts in this stream corresponds roughly to the appearance and disappearance of certain terms of interest in the papers. Section 5 discusses the connections to related work in a range of areas, particularly the striking recent work of Swan, Allan, and Jensen [56, 57, 58] on overview timelines, which forms the body of research closest to the approach here. Finally, Section 6 discusses some further applications of the methodology -- how burstiness in arrivals can help to identify certain messages as "landmarks" in a large corpus of e-mail; and how the overall framework can be applied to logs of Web traffic. 4. 2 A Weighted Automaton Model Perhaps the simplest randomized model for generating a sequence of message arrival times is based on an exponential distribution: messages are emitted in a probabilistic manner, so that the gap x in time between messages i and i + 1 is distributed according to the "memoryless" exponential density function f (x) = ffe-ffx, for a parameter ff > 0. (In other words, the probability that the gap exceeds x is equal to e-ffx.) The expected value of the gap in this model is ff-1, and hence one can refer to ff as the rate of message arrivals. Intuitively, a "bursty" model should extend this simple formulation by exhibiting periods of lower rate interleaved with periods of higher rate. A natural way to do this is to construct a model with multiple states, where the rate depends on the current state. Let us start with a basic model that incorporates this idea, and then extend it to the models that will primarily be used in what follows. A two-state model. Arguably the most basic bursty model of this type would be constructed from a probabilistic automaton A with two states q0 and q1, which we can think of as corresponding to "low" and "high." When A is in state q0, messages are emitted at a slow rate, with gaps x between consecutive messages distributed independently according to a density function f0(x) = ff0e-ff0x When A is in state q1, messages are emitted at a faster rate, with gaps distributed independently according to f1(x) = ff1e-ff1x, where ff1 > ff0. Finally, between messages, A changes state with probability p 2 (0, 1), remaining in its current state with probability 1 - p, independently of previous emissions and state changes. Such a model could be used to generate a sequence of messages in the natural way. A begins in state q0. Before each message (including the first) is emitted, A changes state with probability p. A message is then emitted, and the gap in time until the next message is determined by the distribution associated with A's current state. One can apply this generative model to find a likely state sequence, given a set of messages. Suppose there is a given set of n + 1 messages, with specified arrival times; this determines a sequence of n inter-arrival gaps x = (x1, x2, . . . , xn). The development here will use the basic assumption that all gaps xi are strictly positive. We can use the Bayes procedure (as in e.g. [14]) to determine the conditional probability of a state sequence q = (qi1, . . . , qin); note that this must be done in terms of the underlying density functions, since the gaps are not drawn from discrete distributions. Each state sequence q induces a density function fq over sequences of gaps, which has the form fq(x1, . . . , xn) = Qnt=1 fit(xt). If b denotes the number of state transitions in the sequence q -- that is, the number of indices it so that qit 6= qit+1 -- then the (prior) probability of q is equal to ( Y it6=it+1 p)( Yit=it+1 1 - p) = p b(1 - p)n-b = p 1 - p ! b (1 - p)n. 5. (In this calculation, let i0 = 0, since A starts in state q0.) Now, Pr [q | x] = Pr [q] fq(x)P q0 Pr [q0] fq0 (x) = 1Z p1 - p ! b (1 - p)n nY t=1 f it(xt), where Z is the normalizing constant Pq0 Pr [q0] fq0 (x). Finding a state sequence q maximizing this probability is equivalent to finding one that minimizes - ln Pr [q | x] = b ln 1 - pp ! + nX t=1 - ln f it(xt)! - n ln(1 - p) + ln Z. Since the third and fourth terms are independent of the state sequence, this latter optimization problem is equivalent to finding a state sequence q that minimizes the following cost function: c (q | x) = b ln 1 - pp ! + nX t=1 - ln f it (xt)! Finding a state sequence to minimize this cost function is a problem that can be motivated intuitively on its own terms, without recourse to the underlying probabilistic model. The first of the two terms in the expression for c (q | x) favors sequences with a small number of state transitions, while the second term favors state sequences that conform well to the sequence x of gap values. Thus, one expects the optimum to track the global structure of bursts in the gap sequence, while holding to a single state through local periods of non-uniformity. Varying the coefficient on b controls the amount of "inertia" fixing the automaton in its current state. The next step is to extend this simple "high-low" model to one with a richer state set, using a cost model; this will lead to a method that also extracts hierarchical structure from the pattern of bursts. An infinite-state model. Consider a sequence of n + 1 messages that arrive over a period of time of length T . If the messages were spaced completely evenly over this time interval, then they would arrive with gaps of size g = T /n. Bursts of greater and greater intensity would be associated with gaps smaller and smaller than g. This suggests focusing on an infinite-state automaton whose states correspond to gap sizes that may be arbitrarily small, so as to capture the full range of possible bursts. The development here will use a cost model as in the two-state case, where the underlying goal is to find a state sequence of minimum cost.
Thus, consider an automaton with a "base state" q0 that has an associated exponential density function f0 with rate ff0 = g-1 = n/T -- consistent with completely uniform message arrivals. For each i > 0, there is a state qi with associated exponential density fi having
6.
0 1 32
g ln n per state
20 1 3 tree representation
0 1 32 bursts
b)
time
optimal state sequence
a) q q q q0 1 2 3 qi
transition cost transition cost 0 emissions at rateg-1 s i
Figure 1: An infinite-state model for bursty sequences. (a) The infinite-state automaton A*s,fl; in state qi, messages are emitted at a spacing in time that is distributed according to f(x) = ffie-ffix, where ffi = g-1si. There is a cost to move to states of higher index, but not to states of lower index. (b) Given a sequence of gaps between message arrivals, an optimal state sequence in A*s,fl is computed. This gives rise to a set of nested bursts: intervals of time in which the optimal state has at least a certain index. The inclusions among the set of bursts can be naturally represented by a tree structure.
rate ffi = g-1si, where s > 1 is a scaling parameter. (i will be referred to as the index of the state qi.) In other words, the infinite sequence of states q0, q1, . . . models inter-arrival gaps that decrease geometrically from g; there is an expected rate of message arrivals that intensifies for larger and larger values of i. Finally, for every i and j, there is a cost o/ (i, j) associated with a state transition from qi to qj. The framework allows considerable flexibility in formulating the cost function; for the work described here, o/ (*, *) is defined so that the cost of moving from a lower-intensity burst state to a higher-intensity one is proportional to the number of intervening states, but there is no cost for the automaton to end a higher-intensity burst and drop down to a lower-intensity one. Specifically, when j > i, moving from qi to qj incurs a cost of (j - i)fl ln n, where fl > 0 is a parameter; and when j 0, since all gaps are positive.) If q* is an optimal state sequence in Aks,fl, then it is also an optimal state sequence in A*s,fl .
Before proceeding to the proof, here are two key points to note. First, in all the experiments here, an optimal state sequence in A*s,fl can be found by restricting to a number of states k that is a very small constant, always at most 25.
Second, some condition requiring gaps to be positive is necessary in order for the theorem to hold, as the following example shows. Suppose that x were to consist of n gaps, each equal to 0, where n is large enough that sn > nfl . Then the state sequence q(j) in which all n states are equal to qj has cost
c iq(j) | xj = j(fl ln n) - n ln fj(0) = j(fl ln n) - n ln ffj
= j(fl ln n) - nj ln s + n ln g = j(fl ln n - n ln s) + n ln g.
8.
For increasing values of j, these costs c iq(j) | xj form a sequence of negative numbers tending to -1, and hence there is no state sequence in A*s,fl that achieves a cost less than or equal to that of all others.
When all gaps are positive, however, no such example is possible, since Theorem 2.1 establishes that there is a state sequence in A*s,fl achieving the minimum cost.
Proof of Theorem 2.1. Let q* = (q`1 , . . . , q`n) be an optimal state sequence in Aks,fl, and let q = (qi1, . . . , qin) be an arbitrary state sequence in A*s,fl. As always, set `0 = i0 = 0, since both sequences start in state q0; for notational purposes, it is useful to define `n+1 = in+1 = 0 as well. The goal is to show that c (q* | x) = j0 >= j* + 1, then - ln fj00 (xt) >= - ln fj0 (x).
Since k = d1 + logs T + logs ffi(x)-1e, one has
ffk-1 = g-1sk-1 = nT * sk-1 >= 1T * slogs T +logs ffi(x)-
1 = 1
T
T ffi(x) =
1 ffi(x).
Since ffi(x)-1 >= x-1t for any t = 1, 2, . . . , n, the index k - 1 is at least as large as the j for which - ln fj(xt) is minimized. It follows that for those t for which it 6= it0 one has-
ln fi0t(xt) it0 = k - 1.
Combining these inequalities for the state transition costs and the gap costs, one obtains
c (q0 | x) =
n-1X
t=0 o/ (i
0t, i0t+1)!+ nX
t=1 - ln f
i0t(xt)! 0.
Note that although the final computation of an optimal state sequence is carried out by recourse to a finite-state model, working with the infinite model has the advantage that a number of states k is not fixed a priori; rather, it emerges in the course of the computation, and in this way the automaton A*s,fl essentially "conforms" to the particular input instance.
3 Hierarchical Structure and E-mail Streams Extracting hierarchical structure. From an algorithm to compute an optimal state sequence, one can then define the basic representation of a set of bursts, according to a hierarchical structure.
For a set of messages generating a sequence of positive inter-arrival gaps x = (x1, x2, . . . , xn), suppose that an optimal state sequence q = (qi1, qi2, . . . , qin) in A*s,fl has been determined. Following the discussion of the previous section, we can formally define a burst of intensity j to be a maximal interval over which q is in a state of index j or higher. More precisely, it is an interval [t, t0] so that it, . . . , it0 >= j but it-1 and it0+1 are less than j (or undefined if t - 1 n).
It follows that bursts exhibit a natural nested structure: a burst of intensity j may contain one or more sub-intervals that are bursts of intensity j + 1; these in turn may contain subintervals that are bursts of intensity j + 2; and so forth. This relationship can be represented by a rooted tree \Gamma , as follows. There is a node corresponding to each burst; and node v is a child of node u if node u represents a burst Bu of intensity j (for some value of j), and node v represents a burst Bv of intensity j + 1 such that Bv ` Bu. Note that the root of \Gamma corresponds to the single burst of intensity 0, which is equal to the whole interval [0, n].
Thus, the tree \Gamma captures hierarchical structure that is implicit in the underlying stream. Figure 1(b) shows the transformation from an optimal state sequence, to a set of nested bursts, to a tree.
Hierarchy in an e-mail stream. Let us now return to one of the initial motivations for this model, and consider a stream of e-mail messages. What does the hierarchical structure of bursts look like in this setting?
I applied the algorithm to my own collection of saved e-mail, consisting of messages sent and received between June 9, 1997 and August 23, 2001. (The cut-off date is chosen here so as to roughly cover four academic years.) First, here is a brief summary of this collection. Every piece of mail I sent or received during this period of time, using my cs.cornell.edu email address, can be viewed as belonging to one of two categories: first, messages consisting of one or more large files, such as drafts of papers mailed between co-authors (essentially, e-mail as file transfer); and second, all other messages. The collection I am considering here consists simply of all messages belonging to the second, much larger category; thus, to a
10.
rough approximation, it is all the mail I sent and received during this period, unfiltered by content but excluding long files. It contains 34344 messages in UNIX mailbox format, totaling 41.7 megabytes of ascii text, excluding message headers.1
Subsets of the collection can be chosen by selecting all messages that contain a particular string or set of strings; this can be viewed as an analogue of a "folder" of related messages, although messages in the present case are related not because they were manually filed together but because they are the response set to a particular query. Studying the stream induced by such a response set raises two distinct but related questions. First, is it in fact the case that the appearance of messages containing particular words exhibits a "spike," in some informal sense, in the (temporal) vicinity of significant times such as deadlines, scheduled events, or unexpected developments? And second, do the algorithms developed here provide a means for identifying this phenomenon?
In fact such spikes appear to be quite prevalent, and also rich enough that the algorithms of the previous section can extract hierarchical structure that in many cases is quite deep. Moreover, the algorithms are efficient enough that computing a representation for the bursts on a query to the full e-mail collection can be done in real-time, using a simple implementation on a standard PC.
To give a qualitative sense for the kind of structure one obtains, Figures 2 and 3 show the results of computing bursts for two different queries using the automaton A*2. Figure 2 shows an analysis of the stream of all messages containing the word "ITR," which is prominent in my e-mail because it is the name of a large National Science Foundation program for which my colleagues and I wrote two proposals in 1999-2000. There are many possible ways to organize this stream of messages, but one general backdrop against which to view the stream is the set of deadlines imposed by the NSF for the first run of the program. Large proposals were submitted in a three-phase process, with deadlines of 11/15/99, 1/5/00, and 4/17/00 for letters of intent, pre-proposals, and full proposals respectively. Small proposals were submitted in a two-phase process, with deadlines of 1/5/00 and 2/14/00 for letters of intent and full proposals respectively. I participated in a group writing a proposal of each kind.
Turning to the figure, part (a) is a plot of the raw input to the automaton A*2, showing the arrival time of each message in the response set. Part (b) shows a nested interval representation of the set of bursts for the optimal state sequence in A*2; the intervals are annotated with the first and last dates of the messages they contain, and the dates of the NSF deadlines are lined up with the intervals that contain them. Note that this is a schematic representation, designed to show the inclusions that give rise to the tree \Gamma ; the lengths and centering of the intervals in the drawing are not significant. Part (c) shows a drawing of the resulting tree \Gamma . The root corresponds to the single burst of intensity 0 that is present in any state sequence. One sees that the two children of the root span intervals surrounding the
1These figures reveal that I receive less e-mail per day than many of my colleagues; one contributing factor is that I do not subscribe to any high-volume mailing lists based outside Cornell.
11.
2/1410/28-2/21/0010/28/99-
11/1610/28- 11/1611/2- 11/15
11/9-
7/10/00-10/31/00 7/10-7/14
1/2-2/4 1/2-1/5
0 20 40 60 80 100 120 140
1.4e+06 1.5e+06 1.6e+06 1.7e+06 1.8e+06 1.9e+06 2e+06 2.1e+06 2.2e+06 2.3e+06 2.4e+06 2.5e+06 a)
c)
Minutes since 1/1/97
Message #
b)
2/14
10/28 11/16 1/2/00
11/16 11/15
11/2 11/9
2/4 7/10
2/21
7/10
7/14 10/31
10/28/9910/28
(large proposals) 11/15: letter of intent deadline
2/14: full proposal deadline
0 1 2 3 4 5
1/2 1/5
1/5: pre-proposal deadline (large proposals)
4/17: full proposal deadline(large proposals)
7/11: unofficial notification
9/13: official announcementof awards
intensities
(small proposals) (small proposal)
Figure 2: The stream of all messages containing the word "ITR," analyzed using the automaton A*2. (a) The raw input data: the x-axis shows message arrival time; the y-axis shows message sequence number. (b) The set of bursts in the optimal state sequence for A*2, drawn schematically to show the inclusions that form the tree \Gamma . (Lengths of intervals are standardized and hence not to scale.) Intervals are annotated with starting and ending dates, and the dates of the NSF ITR program deadlines are lined up with the intervals that contain them. (c) A representation of the tree \Gamma , showing inclusions among the bursts.
submission deadlines and notification dates, respectively. Moreover, the sub-tree rooted at the first of these children splits further into two sub-trees that are concentrated over a week leading up to the deadline for letters of intent (11/15/99), and four days leading up to the pre-proposal deadline (1/5/00). Finally, note that there is no burst of positive intensity over the final deadline for large proposal, since we did not continue our large submission past the pre-proposal stage.
Figure 3 shows an analysis of the stream of all messages containing the word "prelim," which is the term used at Cornell for (non-final) exams in undergraduate courses. One sees that the raw data in this example (part (a) of the figure) exhibits an arguably more regular structure than in the previous example. I taught undergraduate courses in four of the eight semesters covered by the collection of e-mail, and each of these courses had two prelims.
12.
prelim 24/11/00 prelim 12/24/00 prelim 24/15/99 prelim 12/25/99
11/13/00prelim 2
1 2 3 40 5 6 7 8
prelim 110/4/00 intensities 0 50 100 150 200 250 300 350 400
200000 400000 600000 800000 1e+06 1.2e+06 1.4e+06 1.6e+06 1.8e+06 2e+06 2.2e+06 2.4e+06
Minutes since 1/1/97
Message #
a)
c)
b)
Figure 3: The stream of all messages containing the word "prelim," analyzed using the automaton A*2. Parts (a), (b), and (c) are analogous to Figure 2, but date annotations are omitted. In part (b), the dates of prelims (exams) are lined up with the intervals that contain them.
For the first of these courses, correspondence with students was restricted almost exclusively to a special course e-mail account, and hence very little appears in my own saved e-mail. The remaining three courses are captured very cleanly by the tree \Gamma computed from the optimal state sequence of A*2 (parts (b) and (c) of the figure) -- each course corresponds to a long burst, and each contains two shorter, more intense bursts for the particular prelims. Specifically, the three children of the root are centered over the semesters in which the three undergraduate courses were taught (Spring 1999, Spring 2000, and Fall 2000); and the subtrees below these children split further into two sub-trees each, concentrated either directly over or slightly preceding the two prelims given that semester.
Overall, these structures suggest how a large folder of e-mail might naturally be divided into a hierarchical set of sub-folders around certain key events, based only on the rate of message arrivals. The appropriateness of Forster's comments on the time-sense in narratives is also fairly striking here: when organized by burst intensities, the period of time covered
13.
in the e-mail collection very clearly "piles up into a few notable pinnacles" [17], rather than proceeding uniformly.
4 Enumerating Bursts Given a framework for identifying bursts, it becomes possible to perform a type of enumeration: for every word w that appears in the collection, one computes all the bursts in the stream of messages containing w. Combined with a method for computing a weight associated with each burst, and for then ranking by weight, this essentially provides a way to find the terms that exhibit the most prominent rising and falling pattern over a limited period of time. This can be applied to e-mail, and it can be done very efficiently even on the scale of the e-mail corpus from the previous section; roughly speaking, it can be performed in a single pass over an inverted index for the collection. Here, however, I consider a different application of this technique: extracting bursts in term usage from the titles of conference papers. Two distinct sources of data will be used here: the titles of all papers from the database conferences SIGMOD and VLDB for the years 1975-2001; and the titles of all papers from the theory conferences STOC and FOCS for the years 1969-2001.
The first issue that must be addressed concerns the underlying model: unlike e-mail messages, which arrive continuously over time, conference papers appear in large batches -- essentially, twenty to sixty new papers appear together every half year. As a result, the automaton A*s,fl is not appropriate, since it is fundamentally based on analyzing the distribution of inter-arrival gaps. Instead, one needs to model a related kind of phenomenon: documents arrive in discrete batches; in each new batch of documents, some are relevant (in the present case, their titles contain a particular word w) and some are irrelevant. The idea is thus to find an automaton model that generates batched arrivals, with particular fractions of relevant documents. A sequence of batched arrivals could be considered bursty if the fraction of relevant documents alternates between reasonably long periods in which the fraction is small and other periods in which it is large.
Suppose there are n batches of documents; the tth batch contains rt relevant documents out of a total of dt. Let R = Pnt=1 rt and D = Pnt=1 dt. Now, define an automaton B*s,fl as follows, by close analogy with the construction of A*s,fl. For each state qi of B*s,fl , for i >= 0, there is an expected fraction of relevant documents pi. Set p0 = p = R/D, and pi = p0si. Since it does not make sense for pi to exceed 1, the state qi will only be defined for i such that pi = 1; thus, B*s,fl will be a finite-state automaton. One can further restrict B*s,fl to k states, resulting in the automaton Bks,fl. Viewed in a generative fashion, one can imagine state qi in these models as producing a mixture of relevant and irrelevant documents according to a binomial distribution with probability pi.
The cost of a state sequence q = (qi1, . . . , qin) in B*s,fl is defined as follows. If the automa14.
Word Interval of burst data 1975 SIGMOD -- 1979 SIGMOD base 1975 SIGMOD -- 1981 VLDB application 1975 SIGMOD -- 1982 SIGMOD bases 1975 SIGMOD -- 1982 VLDB design 1975 SIGMOD -- 1985 VLDB relational 1975 SIGMOD -- 1989 VLDB model 1975 SIGMOD -- 1992 VLDB large 1975 VLDB -- 1977 VLDB schema 1975 VLDB -- 1980 VLDB theory 1977 VLDB -- 1984 SIGMOD distributed 1977 VLDB -- 1985 SIGMOD data 1980 VLDB -- 1981 VLDB statistical 1981 VLDB -- 1984 VLDB database 1982 SIGMOD -- 1987 VLDB nested 1984 VLDB -- 1991 VLDB deductive 1985 VLDB -- 1994 VLDB transaction 1987 SIGMOD -- 1992 SIGMOD objects 1987 VLDB -- 1992 SIGMOD object-oriented 1987 SIGMOD -- 1994 VLDB parallel 1989 VLDB -- 1996 VLDB object 1990 SIGMOD -- 1996 VLDB mining 1995 VLDB -- server 1996 SIGMOD -- 2000 VLDB sql 1996 VLDB -- 2000 VLDB warehouse 1996 VLDB -- similarity 1997 SIGMOD -- approximate 1997 VLDB -- web 1998 SIGMOD -- indexing 1999 SIGMOD -- xml 1999 VLDB --
Figure 4: The 30 bursts of highest weight in B22, using titles of all papers from the database conferences SIGMOD and VLDB, 1975-2001.
ton is in state qi when the tth batch arrives, a cost of
oe(i, t) = - ln "dtr
t!p
rti (1 - pi)dt-rt#
is incurred, since this is the negative logarithm of the probability that rt relevant documents would be generated using a binomial distribution with probability pi. There is also a cost of o/ (it, it+1) associated with the state transition from qit to qit+1 , where this cost is defined precisely as for A*s,fl. A state sequence of minimum total cost can then be computed as in Section 2.
In the analysis of conference paper titles here, the main goal is to enumerate bursts of
15.
Word Interval of burst grammars 1969 STOC -- 1973 FOCS automata 1969 STOC -- 1974 STOC languages 1969 STOC -- 1977 STOC machines 1969 STOC -- 1978 STOC recursive 1969 STOC -- 1979 FOCS classes 1969 STOC -- 1981 FOCS some 1969 STOC -- 1980 FOCS sequential 1969 FOCS -- 1972 FOCS equivalence 1969 FOCS -- 1981 FOCS programs 1969 FOCS -- 1986 FOCS program 1970 FOCS -- 1978 STOC on 1973 FOCS -- 1976 STOC complexity 1974 STOC -- 1975 FOCS problems 1975 FOCS -- 1976 FOCS relational 1975 FOCS -- 1982 FOCS logic 1976 FOCS -- 1984 STOC vlsi 1980 FOCS -- 1986 STOC probabilistic 1981 FOCS -- 1986 FOCS how 1982 STOC -- 1988 STOC parallel 1984 STOC -- 1987 FOCS algorithm 1984 FOCS -- 1987 FOCS graphs 1987 STOC -- 1989 STOC learning 1987 FOCS -- 1997 FOCS competitive 1990 FOCS -- 1994 FOCS randomized 1992 STOC -- 1995 STOC approximation 1993 STOC -- improved 1994 STOC -- 2000 STOC codes 1994 FOCS -- approximating 1995 FOCS -- quantum 1996 FOCS --
Figure 5: The 30 bursts of highest weight in B22, using titles of all papers from the theory conferences STOC and FOCS, 1969-2001.
positive intensity, but not to emphasize hierarchical structure. Thus, the two-state automaton B22 is used; given an optimal state sequence, bursts of positive intensity correspond to intervals in which the state is q1 rather than q0. For such a burst [t1, t2], we can define the weight of the burst to be
t2X
t=t1(oe(0, t) - oe(1, t)). In other words, the weight is equal to the improvement in cost incurred by using state 1 over the interval rather than state 0. Observe that in an optimal sequence, the weight of every burst is non-negative. Intuitively, then, bursts of larger weight correspond to more prominent periods of elevated activity. (This notion of weight can be naturally extended to
16.
larger numbers of states, as well as to the automaton model from Section 2.)
In Figure 4, this framework is applied to the titles of SIGMOD and VLDB papers for the years 1975-2001. For each word w (including stop-words), an input to B22 is constructed in which rt is the number of titles at the tth conference (chronologically) that contain the word w, and dt is the total number of titles at the tth conference. The 30 bursts with the highest weight, over all possible words w, are then depicted in the figure, sorted by year of appearance. The bursts with no given ending date (`mining', `warehouse', `similarity', `approximate', `web', `indexing', and `xml') are those for which the interval extends to the most recent conference, suggesting terms that are in the middle of a large-weight burst at present. Note that no pre-processing is done on the titles, other than to convert each word to lower-case. One observes that the words in Figure 4 are almost all quickly recognizable as carrying technical content, even though they are the top results in an enumeration where bursts were computed and ranked for all words, including stop-words.2
Figure 5 shows the results of the same computation on the titles of STOC and FOCS papers for the years 1969-2001. For both these collections, it is important to note that the number of occurrences of a word w is in general a quantity that, at a local scale, changes very rapidly from one conference to the next; thus, many of the intervals depicted in the figures span conferences in which the indicated word did not appear at all, and omit ones with large numbers of occurrences. The non-trivial cost of state transitions in B22 is crucial in making it possible for intervals of any reasonable length to form in the presence of this data.
5 Related Work The Topic Detection and Tracking (TDT) study [2, 3, 61, 62] articulated the problem of extracting significant topics and events from a stream of news articles, thereby framing the type of document stream analysis questions considered here. Much of the emphasis in the TDT study was on techniques for the on-line version of the problem, in which events must be detected in real-time; but there was also a retrospective version in which the whole stream could be analyzed. Similar issues have recently been addressed in the visualization community [26, 43, 60], where the problem of visualizing the appearance and disappearance of themes in a sequence of news stories has been explored.
Following on the TDT work, Swan, Allan, and Jensen [56, 57, 58] developed a method for constructing overview timelines of a set of news stories. For each named entity and noun phrase in the corpus, they perform a O/2 test to identify days on which the number of occurrences yields a value above a certain threshold; contiguous sets of days meeting this condition are then grouped into an interval that is added to the timeline. Thus, the high-level structure of their approach is parallel to the enumerative method in Section 4. However, the
2The bursts for `data,' `base,' and `bases' in the years 1975-1981 arise in large part from the fact that the term "database" was written as two words in a significant number of the paper titles during this period.
17.
underlying methodology is quite different from the present work in two key respects. First, Swan et al. note that the use of thresholds makes it difficult to construct long intervals of activity for a single feature -- such intervals are often broken apart by brief gaps in which the feature does not occur frequently enough, and subsequent heuristics are needed to piece them together. The present work, by modeling a burst as a state transition with costs, allows for a long interval to naturally persist across such gaps; essentially, in place of thresholds, the optimization problem inherent in finding a minimum-cost state sequence adaptively groups nearby high-intensity intervals together when it is advantageous to do so. Second, the work of Swan et al. does not attempt to infer any type of hierarchical structure in the appearance of a feature.
Lewis and Knowles analyze the dynamics of message-sending over a very short time scale, searching for features that can determine whether one message is a response to another [37]. This is applied to develop robust techniques for identifying threads, a popular metaphor for organizing e-mail and newsgroup postings [15, 22]. In a very different context, Grosz and Sidner develop structural models for discourse as a means of analyzing communication [21]; their use of stack models in particular results in a nested organization that bears an intriguing, though distant, relationship to the nested structure of bursts studied here.
The present work clearly overlaps with the large areas of time series analysis and sequence mining [10, 25]; connections to related probabilistic frameworks such as bursty on-off sources [32] and hidden Markov models [48] have already been discussed above. Ehrich and Foith [16] proposed a method for constructing a tree from a one-dimensional time series, essentially by introducing a branch-point at each local minimum and a leaf at each local maximum (see also [55]). In the context of the applications here, such an approach would yield trees of enormous complexity, due to the ruggedness of the underlying temporal data, with many local minima and maxima.
The search for a minimum-cost state sequence in the automata of Section 2 and 4 can also be viewed as a search for approximate level sets in a time series, and hence related to the large body of work on piece-wise function approximation in both statistics and data mining (see e.g. [23, 24, 27, 31, 33, 35, 41]). In a discrete framework, work on mining episodes and sequential patterns (e.g. [1, 12, 25, 40]) has developed algorithms to identify particular configurations of discrete events clustered in time, in some cases obeying partial precedence constraints on their order. Finally, there is an interesting general relationship to work on traffic analysis in the areas of cryptography and security [52]; in that context, temporal analysis of a message stream is crucial because the content of the messages has been explicitly obscured.
6 Extensions and Conclusions In the settings discussed above, the analysis has made use of both the temporal information and the underlying content. The role of temporal data is clear; but the content of course
18.
plays an integral role as well: Section 3 deals with streams consisting of the response set for a particular query to a larger stream; and Section 4 considers streams with batched arrivals, in which a particular subset of each batch is designated as relevant. And in fact, there is strong evidence that the interplay between content and time is crucial here -- that an arbitrary set of messages with same sequence of arrival times would not exhibit an equally strong set of bursts. Adapting a permutation test from Swan and Jensen [58], one can start with a complete e-mail corpus having arrival times t1, t2, . . . , tN , choose a random permutation ss, and shuffle the corpus so that message ss(i) arrives at time ti (instead of message i), for i = 1, 2, . . . , N . The resulting shuffled corpus has the same set of arrival times and the same messages, but the original correspondence between the two is broken; do equivalently strong "spurious" bursts appear in this new sequence? In fact, they clearly do not: when the weight of bursts for all words (with respect to A*2) is computed using the e-mail corpus in Section 3, the total weight associated with the true corpus is more than an order of magnitude greater than the average total weight over 100 randomly shuffled versions (369,980 versus 25,141). Moreover, the shuffled versions exhibit almost no non-trivial hierarchical structure; the average total number of words generating bursts of intensity at least 2 (i.e. inducing trees \Gamma with two or more levels below the root) is 16.7 over the randomly shuffled versions, compared with 3865 in the true corpus.
I have also applied the overall framework developed here to Web clickstream data collected by Gay et al. [18]. The dataset in [18] was compiled as part of a study of student usage of wireless laptops: The browser clicks of roughly 80 undergraduate students in two particular classes at Cornell were collected (with consent) from wireless laptops over a period of two and a half months in Spring 2000. Bursts with respect to A*s,fl can be computed by an enumerative method, as in Section 4: for every URL w, all bursts in the stream of visits to w are determined; the full set of bursts is then ordered by weight. Each burst, associated with a URL w, now has an additional quantity associated with it: the number of distinct users who visited w during the interval of the burst. This allows one to distinguish between collective activity involving much of the class and that of just a single user. As it turns out, if one focuses on bursts that involve at least 10 distinct users, then many of those with the highest weight involve the URLs of the on-line class reading assignments, centered on intervals shortly before and during the weekly sessions at which they were discussed.
A final observation is that the use of a model based on state transitions leads to bursts with sharp boundaries; they have clear beginnings and ends. In particular, this means that for every burst, one can identify a single message on which the associated state transition occurred. This is akin to the TDT study's notion of (retrospective) first story detection [2], although in the automaton model of the present work, identifying initial messages does not constitute a separate problem since it follows directly from the definition of the state transitions. In the context of e-mail, the contents of such an initial message can often serve as a concentrated summary of the circumstances precipitating the burst -- in other words, there is frequently something in the message itself to frame the flurry of message19.
sending that is about to occur. And for messages on which bursts for several different terms are initiated simultaneously, this phenomenon is even more apparent; such messages often represent natural "landmarks" at the beginning of a long-running episode.
In many domains, we accumulate extensive and detailed records of our own behavior -- in the e-mail we send and receive, the Web pages we visit, the queries we issue to search engines. An underlying theme, of which several aspects have been developed here, is that a great number of these settings have a fundamental temporal aspect; they are punctuated by the sharp and sudden onset of particular episodes, and can be organized around rising and falling patterns of activity. There is a great amount of complexity underpinning such a picture. But by developing a better understanding of it, one can hope ultimately to find structure in the raw data that we generate through the basic process of interacting and communicating.
Acknowledgements. I thank Lillian Lee for valuable discussions and suggestions throughout the course of this work.
... of course not openly, but through a maze of sub-sub-sub-sub-contractors ultimately handling the "cloud" hardware the NHS information will reside on.
And I am sure they will keep that data safe, and well back-up-ed, given how valuable it might become when tinkering with the next election or blackmailing the next politician.