Internet Data Mining for Investment Analysis
CaroKann writes "Reuters is reporting on a Wall Street investment research company, Majestic Research, that is using web crawling techniques to track business performance. Instead of attempting to estimate business conditions by talking to company management, or pounding the pavement visiting stores, this company uses data mining systems to collect real-time sales data and other information on companies that have a web presence. Using this data, Majestic attempts to estimate company earnings more accurately than traditional research outfits."
Economics and future fiscal predictions are completely theoretical. There are just too many variables involved, folks.
My work here is dung.
They can create bogus pages to feed to the Majestic bot like in the BMW vs. Google case.
We can expect yet another huge rise in fake blogs, fake product reviews on Amazon and such, and paid shills in chats and message boards. Swell.
Slashdot Burying Stories About Slashdot Media Owned
Based on manually mining (i.e., reading) Slashdot, I predict a spike in Majestic's share price about now...
So who said you can't track business performance with a regular expression, you pointy haired noob?!
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
TFA mentions data about drug prescriptions by hundreds of physicians. Is that lying around unorganised on the net? Tell me which algorithm you are going to use to predict, by web crawling, how many Xbox 360s are going to be sold next month. Do you think supermarkets post their sales figures to public web pages? Wal-Mart is said to have more data offline than is available on the entire public section of the net. Now give me access to that. On the other hand, if you work for the sales-tax administration (in Europe) and all the big companies file their invoices weekly, that is also a good starting point...
10 ?"Hello World" life was simple then
I wrote a project in Perl some years ago that would download online financial news stories, count the critical words, weight them by connotation, and compare the result to the direction of the stock market. For example, if the words "stocks" and "down" started showing up a lot in sentences in online news stories, you might expect a downward trend.
I posted the preliminary code online in the Perl newsgroup.
google "data mining" "news" "perl" etc
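A minimal Python sketch of the word-counting idea described above, for anyone who wants to play with it. This is not the original Perl code: the word list, the weights, the sample headlines, and the name `sentiment_score` are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical word weights: negative words score -1, positive words +1.
WORD_WEIGHTS = {
    "down": -1, "fall": -1, "loss": -1,
    "up": 1, "gain": 1, "rally": 1,
}

def sentiment_score(text):
    """Sum the weights of every weighted word that appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return sum(weight * counts[word] for word, weight in WORD_WEIGHTS.items())

# Made-up headlines standing in for downloaded news stories.
stories = [
    "Stocks down again as tech shares fall",
    "Markets rally; stocks up on earnings gain",
]
scores = [sentiment_score(story) for story in stories]
print(scores)
```

A negative score suggests bearish wording and a positive score bullish wording; whether that says anything about where the market goes next is exactly the open question.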
eat shiat and bark at the moon
A friend of mine has developed software that goes even further. It parses streaming news stories for good/bad news and executes orders before humans even finish reading. That advantage is enough to make this company a mint.
But the real problem with everything like this is that, even if it works well for many things, there will be those who try to misuse it, and finding all of them will be very hard. Further, it only takes one major problem case for your nice product to become a laughing stock.
Computers should be able to give a much more unbiased assessment of the economy than any person ever could. People are essentially incapable of interpreting economic data in a straightforward way; political agendas always seem to work their way into economists' opinions about the economy. By using algorithms to do the analysis (and allowing market forces to refine those algorithms), we should be able to get a much better understanding of the REAL economy.
This is a good thing for mankind.
This is interesting stuff. I would like to learn more about the algorithms they use to analyze their data - the article has very few details. It is neat how systems like this are becoming favored over traditional human analysts (or at least reducing the need for people).
I remember back in grad school in the late 90s I worked on a major project to design an intelligent agent-based system including the same functionality, but, in addition to pulling information off the internet, it could also take into account whatever other information could be gathered and interfaced into it (for example, there is also a lot of content on TV which could be fed into a system, in addition to the online data). It was a design project, though, and was never implemented; perhaps I will need to resurrect it!
I do think the whole area of quantitative or at least semi-quantitative analysis of information, both textual and numerical, is going to explode over the next few years, driven by vast amounts of incredibly cheap computing power and bandwidth. Computer applications do amazing stuff right now, but five years from now truly "intelligent" applications will exist. The term "artificial intelligence" has fallen out of fashion, perhaps a sign of how commonplace these systems have now become.
As an example, our local phone company has a voice recognition system which actually works reasonably well, much, much better than anything 5-10 years ago. We are certainly making progress.
Train it for a particular stock automatically using the actual direction of the stock. Set the filter as one of the inputs among many others (Yahoo data) to a genetic algorithm system and then give the lot away free. Bankrupt the big financial advice firms. :)
Hmm, might be worth using it as an excuse to play with Ruby.
So you wrote a program that would read some stories that said the stock market was going down, and it told you the market was down? Did your program also see if weather news reports contained words like "rain" and "downpour" and hence "predict" rain?
beware the jabberwock, my son! the jaws that bite, the claws that catch!
Does anyone know whether Majestic Research has any connections to Majestic 12 (http://www.majestic12.co.uk/)? For those who don't know, Majestic 12 is a distributed search engine. The distributed part is that they have a bunch of people donate CPU cycles and bandwidth to run a web crawler in a SETI@home fashion. Now I thought this was a good thing to join, because we kind of need some independent alternatives to Google. But if it turns out I'm sponsoring some marketing firm, well... I'd feel pretty stupid.
Good things like the crash of 1987? Computer-run mutual funds appear to do no better than fund managers, either. I suspect that most of the larger mutual funds have relatively strict rules about when to buy and sell, in order to minimize emotional choices.
BEHOLD! The Power of the Meme!
http://www.realmeme.com/Main/miner/stock.jsp?startup=http://www.realmeme.com:80/Main/miner/investment/AMZNDejanews.png
Nope, it won't, because even if it does, everyone will start using it and render it useless. There is only one trend in the stock market that is backed up by statistics over the long run, and that is that the stock market drifts upward over time. My professor did an experiment using computer modeling: basically using a random number generator to decide whether the stock market goes up or down, adding an 'upward drift' factor derived from historical data, and comparing the result to the actual data over the last 75 years. The two data sets look almost identical. I know it doesn't "prove" my point, but it does show that playing the stock market short term is basically a flip of a coin.
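For the curious, here is a rough Python sketch of that kind of experiment: a coin flip decides up or down each period, and a small constant drift is added on top. The function name, the step size, and the drift value are my own assumptions, not the professor's actual model or numbers.

```python
import random

def simulate_index(periods, step, drift, start=100.0, seed=0):
    """Coin-flip random walk with a constant upward drift per period."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    level = start
    path = [level]
    for _ in range(periods):
        # The coin flip decides whether this period moves up or down.
        direction = 1 if rng.random() < 0.5 else -1
        level *= 1.0 + drift + direction * step
        path.append(level)
    return path

# Illustrative parameters only: 75 yearly periods, 15% swings, 7% drift.
path = simulate_index(periods=75, step=0.15, drift=0.07)
print(round(path[-1], 2))
```

Plot a few of these with different seeds next to a real long-run index chart and see how easy it is to tell them apart.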
I once interviewed with a group in San Francisco that did stuff like this. They weren't clear about who they were working for, but I do remember some of the techniques they mentioned during the interview. Some of these were actually implemented, others were just ideas:
- An eBay crawler that could estimate the number of auctions and average selling price to predict whether eBay would make their earnings target or not. eBay quickly blacklisted their IP space, so they started using a bunch of open proxies they found.
- By analyzing client/server communication for the Sims Online, they discovered that each connection was assigned a sequentially incrementing connection ID number. By looking at the rate at which the connection ID numbers were increasing each time they logged in, they determined that the Sims Online wasn't going to be nearly as popular as Electronic Arts was forecasting.
- They talked about placing a camera somewhere in Union Square (in SF) to monitor the entrance to Tiffany's during the holiday shopping season, and doing image analysis to determine what percentage of shoppers left the store with a Tiffany's bag in hand.
- Monitoring wireless carriers' spectrum to determine what percentage of GSM/CDMA channels were in use for data vs. voice. The communication itself is encrypted of course, but you can still tell whether a channel is carrying voice or data. They wanted to determine if wireless carriers' forecasts about revenue from data services were accurate.
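The Sims Online trick in the list above is just arithmetic on a sequential counter: note the connection ID each time you log in and divide the increase by the elapsed time. A small Python sketch, with a hypothetical function name and made-up observations (I have no idea what their actual tooling looked like):

```python
def connections_per_hour(observations):
    """Estimate the connection rate from (timestamp_seconds, connection_id)
    pairs, sorted by time, assuming IDs are handed out sequentially."""
    (first_time, first_id) = observations[0]
    (last_time, last_id) = observations[-1]
    elapsed_hours = (last_time - first_time) / 3600.0
    if elapsed_hours <= 0:
        raise ValueError("need observations spanning a positive time interval")
    return (last_id - first_id) / elapsed_hours

# Made-up log of three logins, one hour apart.
observed = [(0, 1000), (3600, 1250), (7200, 1600)]
print(connections_per_hour(observed))  # 600 new IDs over 2 hours
```

Compare that rate with what the publisher's forecast would imply and you have your estimate.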
This is old news. Data mining systems like this have been around for years. Some are even actual real data mining systems (i.e. a SELECT statement against two tables is not data mining!! argh... I digress)
Forget trying to analyze companies for optimum performance. For real performance, dump the whole batch and buy precious metals, because the fact is that the US economy has more debt than can ever be paid off at face value. IMHO, gold is practically guaranteed to outperform every investment class out there.
Seriously, just watch what happens when the Fed decides to print money to try to stave off a cascading credit collapse. They will print some, but that will make things worse because it will drive up costs without driving up pay or driving down personal debt. So they will print more, and that will make things worse still for the same reasons, and so on. When it is all over, costs will likely be 10x higher while pay stays about the same. I wouldn't be surprised if the dollar stopped being a currency.
So, did it work?
The firm's methods differ from traditional Wall Street research, where analysts make forecasts based on conversations with company executives, advertisers, suppliers and mall visits to forecast company results and make recommendations.
Yes, over a limited time, it did seem to work, at least to some degree. I used to have a webpage up on GeoCities that had a Java display of my data. I doubt it is still around.
One thing that would need to be done is collect a set of the most important words. I never had time to do that (this was a senior project for my comp sci degree).
I just came up with a set of critical words on my own (probably 30 words or so, of negative, positive and neutral connotational value).
But if you had completed a collection and study of words and then tracked the appearance and occurrence of those words against the financial index values, THEN you could build the program from there.
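One simple way to do that tracking is a plain correlation between a daily word score and the daily index change. Here is a Python sketch; the numbers are invented for illustration and are not my project data, and Pearson correlation is just one reasonable choice of measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented example: one word score and one index change (in percent) per day.
daily_word_scores = [-2, 3, 1, -1, 2]
daily_index_changes = [-0.8, 1.2, 0.3, -0.5, 0.9]
r = pearson(daily_word_scores, daily_index_changes)
print(round(r, 3))
```

A value near +1 or -1 means the word score moves with (or against) the index; a value near 0 means the word list is telling you nothing. To test prediction rather than description, you would shift the index series by a day before correlating.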
IDL have been doing this for years - http://www.investor-dynamics.com/
Hmm, I can't find any trace of my project on the net anywhere. I know I have the code and data and my paper at home, though...
So, is this code still up somewhere?
Wow, I just searched the net and it is NOWHERE to be found. A 70-page paper and a bunch of (pretty bad) Perl code. But I do have the code and data at home. Some Perl coder named clarkson fixed the code for me, but I think my own code worked better.
But here was the url:
http://www.geocities.com/uhdseniorproject
now gone, and no cache.
But I still think the idea is a great one...
Maybe they'll soon announce a deal with Google?
Cheers!! Abdul Aziz
If I can rephrase the skepticism expressed in the parent post: most publicly available information posted on web sites will not yield any startling analysis that can't simply be gained by reading annual and quarterly reports.
There was some comment in response to the parent post that the data mining company was licensing data. This also sounds suspicious. There is an SEC regulation called FD (for Fair Disclosure). This regulation states that you cannot preferentially provide investor information to someone. FD was put in place to stop companies from selective disclosure to their pet stock market analysts who always write good things about the company, while freezing out those who write anything critical. FD says that if you provide material information about the company, you have to make it available to everyone. The only way to comply with FD while providing confidential information would be to provide the information with the understanding that it will be aggregated and that companies will not be identified. For example, oil industry companies could provide information on reserves, which could be aggregated and reported as an industry figure. Even in this case, the aggregated information would have to be available without prejudice (e.g., anyone could buy it).
There has even been some concern about violating Regulation FD if companies allow analysts to travel around and talk to their sales outlets or customer base. The result of this would be material non-public information that could be used by the analyst and was obtained through the cooperation of the company.
This does not mean that you cannot derive advantage from information. For example, if you have a market model that finds predictive features from data that is either publicly available or that anyone can buy, then you can leverage this information.
Hello all:
I'd like to highlight that there is a difference between a Prediction and a Summary. From what I have read so far, the tool described in the article generates a summary, which may be used as a prediction.
Let s(t) be the Summary of a system (in this case, the economy) at any given time t. Then:
A prediction, p(S), would be a prediction based on a set of summaries S, where: S = {s(t), s(t-1), s(t-2), ..., s(t-n)}
One can always make a prediction based on a very small number of summaries. |S| = 0 is a guess. |S| = 1 means that no past summaries are considered in the prediction, just the most up-to-date one. Presumably, the bigger |S| is, the more information is considered in that prediction.
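To make the three cases concrete, here is a toy Python sketch of p(S). The function name, the default guess, and the extrapolation rule (latest summary plus the average change between summaries) are purely my assumptions; nothing in the article says how the real tool turns summaries into predictions.

```python
def predict(summaries, guess=0.0):
    """Toy p(S): summaries are numbers ordered from oldest to newest."""
    if not summaries:
        # |S| = 0: nothing to go on, so this is just a guess.
        return guess
    if len(summaries) == 1:
        # |S| = 1: no history, so assume the latest summary persists.
        return summaries[0]
    # |S| > 1: extend the latest summary by the average past change.
    average_change = (summaries[-1] - summaries[0]) / (len(summaries) - 1)
    return summaries[-1] + average_change

print(predict([]))               # 0.0
print(predict([5.0]))            # 5.0
print(predict([1.0, 2.0, 3.0]))  # 4.0
```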
The usefulness of such a tool lies in the value of t. Web-crawling allows one to collect much data in a small amount of time. If one is able to collect a summary quicker than everyone else, then presumably, someone using this summary tool would be able to stay ahead of the trend.
That being said, one of the inputs of s(t) is actually publicly available data. Financial reports describe events after the fact. Information based on actual financial transactions (which you could collect if you planted a spybot at the central booth of a major retailer, for example) is much better. At the end of the day, if you want to play a really cut-throat, high-profit game of stock trading, I think you are better off having insider info.
Cheers.
B. Pascal
This is a pretty common trick, and used in one form or another on Wall Street for many years.
I have seen ones that scanned EDGAR filings (canceled when the company was destroyed in the 9/11 attack), campaign contributions (works wonderfully for the telco and other highly regulated industries), patent filings (works surprisingly well in general, though no one knows why), job ads, and many others.
I even heard of one that analyzed free internet porn...(insert your favorite joke here, but it actually was a fairly good predictor. The cognitive psychology behind it was fascinating).
Using search engines and NORA text mining is basically a form of technical investing. If you have a data store of any kind whose contents influenced or are influenced by members of the market niche of a company, it can tell you something about the future of that niche. That's just plain marketing 101...
I am curious to know where market research ends and industrial espionage begins. Data mining seems, in some respects, to go beyond mere research when specific companies are selected for "research". In essence there is no aggregation of information to present industry-wide assessments and forecasts. Or am I drawing a rather long bow here?
I guess whether "public" information available on the Internet proves useful depends on what the information is and how it is analyzed. If, for instance, by monitoring the major retail channels for a company, one saw online that the company was slashing prices across an important category of products, and noticed this before the company said anything at the end of the quarter, then it could be argued that value was added. As for Reg. FD, the companies the data is being licensed from are not the companies being analyzed; the data is gathered somewhere in the channel, or the food chain so to speak, and is completely independent of the companies being analyzed. IMS is the perfect example: if you buy POS data from the pharmacies about which drugs are being dispensed, that data is not coming from a large pharmaceutical company, but it can give you good insight into, say, how Pfizer is doing in the depression market.