New Continuous Support System
An anonymous reader writes "eWeek is reporting on a new continuous open-source support system that helps to keep tabs on your mission-critical applications by providing constant diagnostic monitoring. The system is designed to match specific 'signatures' from your applications to a database of over 200,000 possible 'problem' signatures and alert the user for correction or analysis. From the article: 'SourceLabs' Continuous Support System features what Sebastian calls "adaptive diagnostic probes" that are fully integrated and configured for customer environments. The probes identify production issues and begin to gather diagnostic information to help get to the root of the problem, he said. Indeed, the probes can be configured so that as soon as a problem occurs, the SourceLabs support team extracts system information to find and resolve the problem. And the system includes a database of more than 200,000 signatures of problems that might occur.'"
How is this different from Splunk? Now if it fixed problems for me...
http://igotyoursidekick.spreadshirt.com
You may not be the first customer to hit the problem. Also, the problem can manifest itself in a non-signature-dependent manner, like throwing an exception. Then, if you are not the first to see it, signatures may come into play in telling you why the exception happened.
Bruce Perens.
Anyway, it was a very interesting and difficult problem. One of the biggest rubs was the level of assurance you had to provide. In other words, can you let the system make changes on its own, or should it just recommend changes? If the system misdiagnoses even one problem, it might break more stuff than it fixes. Most monitoring tools have big problems with 'false positives'. Add to that the fact that the system can't necessarily 'undo' all changes. Our solution was to allow the administrator to run the system in a variety of modes, so they could choose whether the system applied the fix automatically, applied it with approval, or just suggested how to fix the problem.
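To make the three modes concrete, here is a minimal sketch of how that kind of switch might look. It is not the code from the system described; `Mode`, `Fix`, `handle_fix`, and the `approve` callback are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    AUTOMATIC = "automatic"      # apply the fix without asking
    WITH_APPROVAL = "approval"   # apply only if the administrator agrees
    SUGGEST_ONLY = "suggest"     # never apply, just report the suggestion


@dataclass
class Fix:
    description: str             # human-readable summary of the repair
    apply: Callable[[], None]    # hypothetical action that performs the repair


def handle_fix(fix: Fix, mode: Mode,
               approve: Optional[Callable[[Fix], bool]] = None) -> str:
    """Apply, ask about, or merely suggest a fix, depending on the mode."""
    if mode is Mode.AUTOMATIC:
        fix.apply()
        return "applied"
    if mode is Mode.WITH_APPROVAL:
        # With no approval callback, err on the side of not touching anything.
        if approve is not None and approve(fix):
            fix.apply()
            return "applied"
        return "declined"
    # SUGGEST_ONLY: the system makes no changes at all.
    return "suggested: " + fix.description
```

Defaulting to "do nothing" when no approver is supplied reflects the concern above: a misdiagnosis that only produces a suggestion is much cheaper than one that changes a production server.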
As for how the system actually works, it takes a middle approach between ML (machine learning) and KR (knowledge representation). Basically, you can either hard-code all the types of problems you have in a KR language, or set up some big neural net (or other ML algorithm) and let the system 'learn' problems. We split the difference and added some domain knowledge. Certain types of 'features' (parts of a diagnosis, such as 'the disk is slow') were detected by ML algorithms, but ultimately KR rules written by Exchange experts actually diagnosed the problems and suggested repairs. A very time-consuming, but more reliable, solution (but less cool).
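A rough sketch of that split, under heavy assumptions: a simple z-score check stands in for the ML feature detector (the real thing used actual ML algorithms, which are not shown here), and a small hand-written rule table stands in for the expert KR rules. The feature names, diagnoses, and repairs are all invented for illustration.

```python
from statistics import mean, stdev


def disk_is_slow(history_ms, latest_ms, z_threshold=3.0):
    """Stand-in for a learned feature detector: flag a latency reading
    that sits far above the historical baseline (simple z-score test)."""
    mu = mean(history_ms)
    sigma = stdev(history_ms)
    if sigma == 0:
        return latest_ms > mu
    return (latest_ms - mu) / sigma > z_threshold


# Hypothetical expert rules: (required features, diagnosis, suggested repair).
# More specific rules come first so they win over more general ones.
RULES = [
    ({"disk_slow", "queue_growing"}, "storage bottleneck",
     "move the database to faster disks"),
    ({"disk_slow"}, "degraded disk", "run disk diagnostics"),
]


def diagnose(features):
    """Return (diagnosis, repair) for the first rule whose required
    features are all present, or None if no rule matches."""
    for required, diagnosis, repair in RULES:
        if required <= features:
            return diagnosis, repair
    return None


# Usage: the detector produces features, the rules produce the diagnosis.
features = set()
if disk_is_slow([10, 11, 9, 10, 12, 10], latest_ms=40):
    features.add("disk_slow")
result = diagnose(features)
```

The point of the split is that a detector can be noisy about individual features while the final diagnosis and repair still come from rules a human expert wrote and can audit.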
"Those that start by burning books, will end by burning men."