MIT Offers Picture-Centric Programming To the Masses With Sikuli
coondoggie writes "Computer users with rudimentary skills will be able to program via screen shots rather than lines of code with a new graphical scripting language called Sikuli that was devised at the Massachusetts Institute of Technology. With a basic understanding of Python, people can write programs that incorporate screen shots of graphical user interface (GUI) elements to automate computer work. One example given by the authors of a paper about Sikuli is a script that notifies a person when his bus is rounding the corner so he can leave in time to catch it."
Here's a video demo of the technology, and a paper explaining the concept (PDF).
Sounds like the Microsoft FrontPage of coding software. Why do with text what you can do with pictures? And we all know FrontPage went on to become the de facto standard for web development... that had to be fixed by a real web developer later.
But on the upside, dedicated FTEs for "reinstalling corrupted FrontPage extensions" did skyrocket during the FrontPage era.
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
Especially for Testing your GUI.
This seems like AutoIT but with image recognition (instead of having to input mouse coordinates).
This looks like a powerful tool for gold / isk / whatever farming. I'm tempted to resurrect my EVE account and see if I can make an auto-miner script.
My patience is infinite, my time is not.
Actually I think this is more interesting than either FrontPage or LabView, because it allows you to script GUI apps that were not designed to be scriptable. Even for apps that are scriptable, it provides an increase in user efficiency as you don't have to learn the API commands to do things that you already know how to do in the GUI.
How useful it is will depend on how well the image pattern matching deals with corner cases. Consider a case where you need to click on a text field, but there are many identical-looking (empty) text fields, the only distinguishing factor is the label beside each one, and clicking on the label does not select the text field. Like screen scraping, it is also somewhat fragile to UI changes (although not as much as other GUI scripting tools that rely on pixel location).
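One way around the identical-fields case would be to match the label and then click at a fixed offset from it. The following is only a plain-Python sketch of that arithmetic, not Sikuli's actual API; the function name, the `(x, y, width, height)` box layout, and the 30-pixel gap are all assumptions made for illustration.

```python
# Hypothetical helper: the name, the box layout (x, y, width, height) and the
# default gap are assumptions, not anything taken from Sikuli itself.
def click_point_right_of(label_box, gap=30):
    """Return an (x, y) point `gap` pixels to the right of a matched label,
    vertically centred on that label."""
    x, y, w, h = label_box
    return (x + w + gap, y + h // 2)

# Suppose an image matcher reported the "Username:" label at this box.
username_label = (100, 40, 60, 20)
print(click_point_right_of(username_label))
```

This only helps while the field keeps the same position relative to its label, so it is still fragile to layout changes, just less so than absolute coordinates.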
"Computer users with rudimentary skills"..... "with a basic understanding of Python"?
For years I have been asking for a software development tool that allows me to write PHP code by throwing cow-pats at the screen with the Wiimote.
And my colleagues want a tool that allows dispatching my bugs with the Wii gun attachment they use in "Quantum of Solace".
Sent from my ASR33 using ASCII
The subtitles were a bit of a surprise. Can MIT not afford better than built-in microphones on cheap laptops? Between her vaguely Asian accent, the poor quality of the audio (seriously, you're TELLING people how to do something, the audio is important here - did they record this in a shower stall or something? my netbook's audio sounds 100x better than this), and what is apparently some sort of wacky audio encoding, she is basically impossible to understand. People who speak English as a second language aren't going to be able to understand this; thank god they did the subtitles.
Neat concept though.
moox. for a new generation.
FTFA: "Sikuli -- which means God's eye in the language of the Huichol Indians in Mexico". Mexican Indians love their hallucinogenic Peyote. On the other hand, MIT researchers want the masses to program with the mouse. Well, I know about "correlation is not causation", but MIT sure is an interesting place to be.
Yeah, this might work until the icons change. I don't see this working too well in practice. I don't know about Mac, but on my Ubuntu system the icons got updated last week. And it happens often enough that keeping these scripts updated would be a serious pain and expense. It isn't like an ordinary user could figure this stuff out either. Despite it being so simple, you're still going to need an IT person to create these scripts. Now you just have dumber IT people. Probably people who COST you more money in practice too, because they "can" do it; it's just that the results of their work take more maintenance. It reminds me of a .bat file written for a video store that backs up a database to a flash drive. If it had only had a statement to check whether the flash drive was present and alert the user, they wouldn't have wasted $80 calling me to come and find out why the backup program wasn't working. Seriously dumb programmer. In the right hands this kind of thing is good. In the wrong hands it is bad.
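For what it's worth, the missing check is only a couple of lines. Here is a rough sketch in Python rather than batch; the function name, the paths, and the message wording are invented for illustration and have nothing to do with the original .bat file.

```python
import os
import shutil

def backup(db_path, drive_dir):
    """Copy the database file to the drive, but only if the drive is present.
    Returns a message for the user instead of failing silently."""
    if not os.path.isdir(drive_dir):
        return f"Backup NOT done: {drive_dir} not found. Is the flash drive plugged in?"
    shutil.copy2(db_path, drive_dir)
    return f"Backup copied to {drive_dir}"
```

Calling `backup("store.db", "E:/")` with no drive at `E:/` would return the warning text, which the script can then show to whoever is at the counter.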
From what I've seen, this is a macro program that can use screenshots rather than key/mouse data to automate tasks. So you PROGRAM your PC in the same way you PROGRAM a VCR to record a show. It is NOT the same as writing an application.
But it seems very interesting once you get past this difference. Macros are very handy for testing in my experience, but they often have a problem because a tiny misalignment can ruin it all. If this program is smarter because it can recognize where data is supposed to go... well, that would certainly make automated tests a bit easier.
Interesting stuff. Just don't think you will be writing software with this.
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
There are far easier ways to commit click fraud than actually looking at the screen to do it. The ad companies tend to ignore the same request multiple times from the same IP so this changes nothing.
People who commit 'click fraud' aren't writing crappy little screen scrapers to do it; it's far easier and faster to write a plugin for Firefox to do what you're saying and just find the text of your ad on the page and trigger the link. No need to futz with what's displayed or 'moving the mouse' to the right spot, you just tell Firefox to find the link and trigger it.
A relatively simple WebKit wrapper would work equally well.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
I'm suddenly reminded of horrible apps written in VB97, with no concern for the back end, horrible input kludge, etc.
Sent from my PDP-11
Wow, no one has watched the movie Swordfish have they?
Have you seen his wife recently?
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Otherwise it's just not complete, IMHO.
How many more years will slashdot have an off-by-one error on your Score in your profile?
... but does anyone know if the program is always that slow?
I understand that it has to visually find the button and this is computationally expensive, but the 2-3 second lag didn't seem compatible with the task.
On a sidenote, the video states that there's no "internal API" dependence, but it clearly has to send "click" and "type" signals. Is that really OS independent or was it just an overstatement?
This is the same sort of scripting you can do with many already existing languages. AutoHotkey, for example. The only new feature would be the ability to copy the screenshot directly into the program as opposed to taking it outside the program and referencing the file directly. I'd say that this scripting language is actually weaker because of it. As far as using this inside a game... they are already hardened against this sort of thing. For example, next time you're in EVE look at the buttons you use. They are semi-transparent. This is not just for aesthetics. If you take a screenshot of the button, and then change your camera angle, the button looks different because what's behind it is different. That doesn't mean you can't script inside EVE, you just have to be a lot more clever than using a script to click on a static image of the GUI. This language would be almost completely useless in any GUI that has any transparency. Which I'd think would include Vista, Win7 and even Macs with the right stuff turned on.
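To make the transparency point concrete, here is a toy sketch in plain Python (not Sikuli code) of the same semi-transparent button drawn over two different backgrounds. The brightness values, the 0.8 alpha, and the 0.7 threshold are arbitrary assumptions; whether a real matcher tolerates this much variation is not something this shows.

```python
def blend(fg, bg, alpha):
    """Alpha-blend one grayscale value (0-255) over a background value."""
    return round(alpha * fg + (1 - alpha) * bg)

def similarity(a, b):
    """1.0 for identical patches, lower as the mean absolute difference grows."""
    diffs = [abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return 1 - sum(diffs) / (255 * len(diffs))

def render_button(bg, alpha=0.8, size=2):
    """A flat button of brightness 200 drawn semi-transparently over `bg`."""
    return [[blend(200, bg, alpha) for _ in range(size)] for _ in range(size)]

template = render_button(bg=0)      # screenshot taken over a dark background
on_screen = render_button(bg=255)   # same button later, over a bright background

exact_match = template == on_screen                  # pixel-for-pixel comparison
fuzzy_match = similarity(template, on_screen) >= 0.7  # tolerant comparison
print(exact_match, fuzzy_match)
```

In this toy case the pixel-for-pixel comparison fails while the tolerant one passes, but a lower alpha or a busier background would push the similarity below any sensible threshold, which is the poster's objection.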
We're trying to repress those memories, you insensitive clod!
Sorry, there are some things even Sikuli can't process.
How can I believe you when you tell me what I don't want to hear?
How would it not break? You don't expect your regular program to work if the API it's using changes, do you?
It can script GUI actions in much the same way. Granted it's not a very nice environment for more complicated work, but still.
Come on, let's cut through the default Slashdot snark. The image capture aspect of Sikuli is brilliant! I don't like the tagline "program anything with Sikuli" because 99% of software should be written in something else. But think of writing test scripts that can use the image matching features. If the software works as advertised, then you could throw together UI test cases way faster than anything else I've seen. System administration tasks should be a good match too. The resulting code would be brittle and hard to maintain, but for quick one-off scripts, sure... I can see it.
The script may not work if the UI style is different from the one recorded or if the UI language is different from the one recorded. Generally, any option that can change the UI from computer to computer will create a problem for Sikuli.
This time for sure!
if NOT understand logic then
loop
talkTo (self, "Don't program!")
Look (@ Pretty pictures)
endloop
endif
Where are we going and why are we in a handbasket?
I'd be curious to see how they handle the back end, especially as some others pointed out it does make calls that seemingly require some hook into the OS. As for its usefulness, I doubt it will really take off beyond being a decent prototype. It relies on image matching so if you use and change a custom icon set all your scripts would be kinda worthless. Same goes if the programs you are "screenshot scripting" receive a major overhaul in the GUI department. Until it can address those issues, I doubt it will really take off.
Sikuli is certainly not commercial-grade UI testing software. It was never intended to be; this is academic software written to explore ideas, rather than to polish them to perfection. Also, it is not a "general" programming language. The previous posters that compared it to programming a VCR are right: not all programs have to target complicated algorithms and data structures, and there is plenty of space for automating "simple stuff".
As an idea, I find the readability of the code particularly interesting. Sikuli code is about the closest you can come to self-explanatory, step-by-step instructions on how to achieve whatever a particular program does. Add a few comments to the most arcane steps, publish those programs to an online repository, and presto! executable step-by-step tutorials.
Yes, the developers may have to address the variability of themes on people's desktops. It is certainly possible to do so (for instance, by keeping a list of mappings from any of a set of "supported" themes to a "canonical" theme, which would be used in all examples), but, as far as ideas go, I really think that Sikuli is a very refreshing idea.
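The theme-mapping idea could look something like the sketch below: scripts refer to elements by their "canonical" screenshot, and a lookup table swaps in the screenshot for whatever theme the user is running. Every theme name, element name, and file name here is made up, and nothing suggests Sikuli actually works this way.

```python
# All theme names and file names below are invented for illustration.
CANONICAL = {"ok_button": "canonical/ok.png", "close_button": "canonical/close.png"}

THEME_IMAGES = {
    "dark": {"ok_button": "dark/ok.png"},
    "high_contrast": {"ok_button": "hc/ok.png", "close_button": "hc/close.png"},
}

def image_for(element, theme):
    """Pick the screenshot for `element` under `theme`, falling back to the
    canonical image when the theme has no entry of its own."""
    return THEME_IMAGES.get(theme, {}).get(element, CANONICAL[element])

print(image_for("ok_button", "dark"))
print(image_for("close_button", "dark"))
```

The fallback means an unsupported theme degrades to the canonical image rather than failing outright, though the match may of course still fail on screen.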
The last time I tried to use Applescript on windows or linux, it wouldn't even start up.
I mostly agree with you, it's always silly to automate a sequence of GUI actions.
However I can see where they're going here; the program examines your screen and finds the widget to click on or enter data into, much like a human looking at the screen and deciding what to do next. Extend that to the real world, a robot that looks around your room for the remote control and turns on the TV, then surfs through the channels until it recognizes something you like to watch. By then it will also be capable of understanding speech and making decisions autonomously. Computers will be thinking like humans within just a few years. Oh wait.
I may just be opening a can of worms here, but the first thing I thought of after seeing the demo was, "Can I push a button on a Flash page?"
Some accountants seem to think everyone needs to learn accounting in order to function in society. But people have other jobs. Some of us like our dumbed down tools because they fill a need. My tax software lets me do my taxes without learning "proper" accounting. Similarly, I know some people who benefit greatly from a little passing knowledge of high-level scripting languages like VB, JavaScript, or even Python.
For those kinds of people, Sikuli looks pretty cool because they can do things that would be pretty difficult otherwise. Hey, even for a lot of experienced programmers, capturing a region of the screen and doing fuzzy pattern matching might be a significant task. I haven't tried Sikuli yet, but it looks like it would be very helpful for some things, and a lot easier to deal with than AutoIt or AutoHotkey.
(BTW, TurboTax was just an example. I actually use something I like better, but you get the idea.)
Wow they just created the old VB SendKeys command. I was actually doing stuff like this 12-14 years ago with the SendKeys command in VB. In "practical" use back then it sucked and I am certain that has not changed.
Got Code?
I did this exact same thing in AutoIt, except that it needs exact matches of images instead of a fuzzy recognizer. (Plus, I also had rule triggers and state vs just a single list of imperative commands)
The fuzzy match is a nice addition, but this automation concept has been available for years.
man ifconfig
Got Code?
Just great... all the spammers need now is a few CAPTCHA-deciphering Sikuli plug-ins.
Once that's done we can all go back to manually removing spam from our web forums and in-boxes.
How do you sanitize your inputs in a language that checks what is displayed on the screen? Instead of XSS or SQL injection, you could end up being hacked just by viewing an ordinary picture attached to an email, if that kind of programming becomes popular.
It is basically an Expect script for GUIs.
There's no place I can be, since I found Serenity.
Yeah, and the last time I tried to run Logic Pro 8 on Windows or Linux, it wouldn't even start up.
The idea is cool and innovative, and makes automating a point-and-click interface a breeze. It certainly has applications.
But overall, it just seems like a Bad Idea. It will be as reliable as screen-scraping in browsers, and it would therefore be wise to avoid it, for the same reasons.
Even just changing the theme of your OS or the icon sizes could well be enough to confuse the image processing. The code won't be portable, and in the end, for anything but the most simple tasks, the person using it would still require some programming skills. Because of this, I think between Sikuli and command-line scripting, command-line scripting has more staying power.
Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
Does it run on Linux? Doesn't seem so, and that's sad.
This is where we get when everything is a GUI. As long as I have a decent shell & environment, I think I prefer shell scripting.
I was thinking about this same thing. I don't know about Mac or Windows, but I'm able to do anything from the CLI on Linux. This macros toy would be written faster using bash; just google for "bash change IP address". I believe my grandma would find it easier to just copy/paste the answer than knowing how to take a screenshot.
I tried playing with it and found it really impressive. The implementation is still beta-ish, but good enough to give it a try.
The first thing I made it do for me was launch the VNC viewer on my Mac laptop, connect to my local headless Windows 7 machine, and click the iTunes play button to play some music there. It just worked (amazingly), and I found it to be a pretty good use case for a tool like this. A task like that cannot be easily automated. At least, that has not been the case with a tool that you have only been trying for 5 minutes.
I can imagine that, if the image pattern matching can be extended to do recognition, such as face recognition / text OCR, and pass the recognized info back, or if it adds a webcam as its input device (instead of keyboard / mouse IO), it'll be overwhelming.
Besides, it really is an inspiring way of coding.
I have to say I am impressed. I have had a play with some of the demos and I like what I see. Whilst I agree that there are limitations this project seems fantastic.
Having tried and failed to use "win runner" in the past due to the complexity of the GUI application I was testing, this scripting would get past the problems we were having.
I can envisage sending canned scripts to my folks for doing maintenance on their own machine, even just some diagnostics that I find hard to do over the phone.
I have a couple of itches of my own that I reckon I could scratch with this. For example, I have a MacBook that I sometimes attach to an external display. Sometimes the external is on the left of my laptop, sometimes the right, sometimes directly above. It would be cool to have a script that allowed me to just click an icon to arrange the displays appropriately. Sikuli is close. I am about to go off and see if that will work.
I mean they have associative arrays indexed by a freaking picture. That is simply, well, paradigm shifting. I am less concerned about the actual efficacy of Sikuli than I am about the ability to hook applications together through their GUI. I am thinking about something like "GUI pipes" which is something I have been thinking about for some time. Mark III of this stuff could be amazing.
I honestly think this project is potentially awesome. In the olden days, before the net was quite so pervasive, we used to talk about using the RussTerm, which was basically getting our guy on the ground in a foreign country (Russell) to type stuff on the machine he was looking at whilst we talked him through it over the phone, mostly because we could not automate the stuff we wanted to do. This would address many of the use cases for which similar requirements might exist today. That's just one idea that occurs off the top of my head.
Many posters have noted that much of this functionality exists already in tools like AutoIt and AutoHotKey; some numpty even mentioned SendKeys in VB. But these people have missed the point: until now it's all been very "Goto X,Y -> Click", not "find(Thing).click()". Even things like WinRunner or RationalTest seem, in my experience, to be far too rigid to be useful. I can see how I would have used this tool to do much good work for our software back when I was demoing, developing and testing stuff.
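The difference between the two styles shows up even in a toy model. In the plain-Python sketch below (not Sikuli code), the "screen" is a grid of numbers, the "button" is a 2x2 block of 7s, and `find` is a naive exact-match search written for this example; real image matching is fuzzier and far more expensive.

```python
def find(screen, template):
    """Return (row, col) of the first exact occurrence of `template` in
    `screen`, or None if it does not appear."""
    th, tw = len(template), len(template[0])
    for r in range(len(screen) - th + 1):
        for c in range(len(screen[0]) - tw + 1):
            if all(screen[r + i][c:c + tw] == template[i] for i in range(th)):
                return (r, c)
    return None

button = [[7, 7], [7, 7]]

def make_screen(top, left, height=6, width=8):
    """Blank screen (zeros) with the button drawn at (top, left)."""
    screen = [[0] * width for _ in range(height)]
    for i, row in enumerate(button):
        screen[top + i][left:left + len(row)] = row
    return screen

old_layout = make_screen(1, 1)
new_layout = make_screen(3, 5)   # the UI was rearranged

hard_coded = (1, 1)              # "Goto X,Y -> Click", recorded on the old layout
still_there = new_layout[hard_coded[0]][hard_coded[1]] == 7
print(still_there)               # does the recorded position still hit the button?
print(find(new_layout, button))  # "find(Thing).click()" looks it up again
```

The recorded coordinates point at empty space after the rearrangement, while the search still locates the button wherever it ended up.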
That it is wrapped in a nice scripting language as well just makes it even better.
I'm off to see how good it actually is....
"The first thing to do when you find yourself in a hole is stop digging."
It's an interesting idea, but if you're serious about automating Windows, I heartily recommend AutoIT3. http://www.autoitscript.com/autoit3/
Increase my killing power, eh?
Okay, I have done a fair amount of programming and yet with a new Mac I have not yet dived into the SDKs, etc. I once wanted to do some batch resizing of photos and yet couldn't get it done in Automator easily without being scared of losing the original photos, on my first dive into it. Yes, I actually wrote a great auto-compositing and resizing program once driving the Gimp on Linux. It was awesome. But that was years ago and now I have a nice new computer. And where did that code go? Yes, I'm sure Automator, Quartz Composer, my shiny new Xcode system and whatever else works on a Mac will be great. But I haven't had time to learn it.
Enter Sikuli. I wrote a hello world and it worked fast. I don't know if I could do it to do batch photo processing still but it just seems cool. I'd rather it was decoupled from a language and the editor was open sourced (maybe it is?) though, so others could build on that. For example if there was a binding to Perl and you could just use the IDE, then maybe someone could add Perl bindings and someone else might add use of CPAN modules for downloading web pages, etc.
Also the vision algorithm looks a bit slow.
There was once an experimental system created that allowed you to program graphic drawings drawn as if on a napkin which would animate in 2D, which is how the program would run. A true graphic language. Maybe someone can find it probably in the ACM SIGGRAPH proceedings several years ago. Maybe "graphic shell" and "napkin drawings" would find it.
Also see VizDraw (pdf) where recognition is done on drawing with a pen tablet.
http://www.eng.uwaterloo.ca/~akmishra/VizDraw2.pdf
Anyway, Sikuli is spectacular for using computer vision techniques to allow for slight changes, and for being immediately useful. I'd like to see it linked to Xcode for RAD of Objective-C apps; Apple should definitely license it or hire the developers for research on it. There is a vast field opened up by this, finally an a-hah experience and not just Apple but many developers should now consider how to get the computer to be smart and find out what you want to do.
This ought to make it possible to do easy mechanized data extraction from the web, analysis of webcam feeds, acting on audio and other types of sensor cues, accessing data and devices over networks, and taking action based on feeds from other devices that are minimally enhanced like my cellphone telling my mac and maybe my mail server when its battery is about to die. It could forward mail to another device, etc. This kind of thing even could work in video cameras and household devices. Even if you just consider it a way to turn people on to programming it is invaluable and fun. I'd like to see Sikuli's functional pieces broken off into standalone services that can be used by other things. As for the comment about window manager themes or operating system versions changing and breaking the script due to icon changes, I think the vision detection of a gui button actually is finding the button and window ids in Sikuli and ought to be able to hand those back.
The editor should also be broken off, of course it needs to be able to launch a screenshot capturing action but that does not mean it must be the sole application allowed to do this. And you could write (snap?) a Sikuli script to run a screenshot capture. Finally I think the Sikuli scripts ought to allow being compiled or otherwise optimized since obviously once it is run, Sikuli knows what the ID of the graphic element it finds is and thereafter need not do vision recognition, it seems.
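Whether Sikuli has any notion of element IDs is not established here; assuming it only ever gets pixel locations, the "find once, then skip the vision step" idea could be sketched as a cache that is verified cheaply before falling back to a full search. The class, the callback signatures, and the dictionary standing in for a screen are all hypothetical.

```python
class CachedFinder:
    """Remember where a target was last found; re-search only if it moved."""

    def __init__(self, search, check):
        self.search = search    # slow: scans the whole screen, returns a location or None
        self.check = check      # fast: is the target still at this location?
        self.cache = {}
        self.full_searches = 0

    def locate(self, screen, name):
        loc = self.cache.get(name)
        if loc is not None and self.check(screen, name, loc):
            return loc          # cached location is still valid
        self.full_searches += 1
        loc = self.search(screen, name)
        if loc is None:
            self.cache.pop(name, None)
        else:
            self.cache[name] = loc
        return loc

# Toy stand-ins: the "screen" is a dict of location -> widget name.
def slow_search(screen, name):
    return next((loc for loc, widget in screen.items() if widget == name), None)

def quick_check(screen, name, loc):
    return screen.get(loc) == name

finder = CachedFinder(slow_search, quick_check)
screen = {(10, 20): "play", (10, 60): "stop"}
finder.locate(screen, "play")
finder.locate(screen, "play")
print(finder.full_searches)
```

The second lookup is served from the cache, and a full search only happens again if the quick check at the remembered location fails, for example after the window is moved.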