MIT Offers Picture-Centric Programming To the Masses With Sikuli
coondoggie writes "Computer users with rudimentary skills will be able to program via screen shots rather than lines of code with a new graphical scripting language called Sikuli that was devised at the Massachusetts Institute of Technology. With a basic understanding of Python, people can write programs that incorporate screen shots of graphical user interface (GUI) elements to automate computer work. One example given by the authors of a paper about Sikuli is a script that notifies a person when his bus is rounding the corner so he can leave in time to catch it."
Here's a video demo of the technology, and a paper explaining the concept (PDF).
Sounds like the Microsoft FrontPage of coding software. Why do with text what you can do with pictures? And we all know FrontPage went on to become the de facto standard for web development... that had to be fixed by a real web developer later.
But on the upside, dedicated FTEs for "reinstalling corrupted FrontPage extensions" did skyrocket during the FrontPage era.
I judt got a nre Kinesis keybiartf so please excusr ant egregiou typos.
Especially for Testing your GUI.
This seems like AutoIt but with image recognition (instead of having to input mouse coordinates).
This looks like a powerful tool for gold / isk / whatever farming. I'm tempted to resurrect my EVE account and see if I can make an auto-miner script.
My patience is infinite, my time is not.
Actually I think this is more interesting than either FrontPage or LabView, because it allows you to script GUI apps that were not designed to be scriptable. Even for apps that are scriptable, it provides an increase in user efficiency as you don't have to learn the API commands to do things that you already know how to do in the GUI.
How useful it is will depend on how well the image pattern matching deals with corner cases. Suppose you need to click on a text field, but there are many identical-looking (empty) text fields, the only distinguishing factor is the label beside each one, and clicking on the label does not select the text field. Like screen scraping, it is also somewhat fragile to UI changes (although not as much as other GUI scripting tools that rely on pixel location).
Sounds like it would be a great program to commit massive click fraud. Just take a screenshot of a particular Google ad link in your browser and ask it to click it. Install the script on hundreds of computers, run it thousands of times, and you have a great way to commit click fraud.
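One way around the identical-fields problem is to match the label, which is unique, and then aim at a fixed offset from it. The following is a minimal sketch of that idea on a toy "screen" made of numbers. It is not Sikuli's API; `find_all`, `field_next_to`, the pixel values and the offset are all made up for illustration.

```python
# Toy "screen": a 2D grid of grayscale values. Hypothetical data, not Sikuli's API.
def find_all(screen, template):
    """Return (row, col) of every exact occurrence of template in screen."""
    th, tw = len(template), len(template[0])
    hits = []
    for r in range(len(screen) - th + 1):
        for c in range(len(screen[0]) - tw + 1):
            if all(screen[r + i][c:c + tw] == template[i] for i in range(th)):
                hits.append((r, c))
    return hits

FIELD = [[9, 9, 9]]        # an empty text field (looks the same everywhere)
NAME_LABEL = [[1, 2]]      # pixels of the "Name" label
EMAIL_LABEL = [[3, 4]]     # pixels of the "Email" label

screen = [
    [1, 2, 0, 9, 9, 9],    # "Name" label, gap, field
    [0, 0, 0, 0, 0, 0],
    [3, 4, 0, 9, 9, 9],    # "Email" label, gap, field
]

def field_next_to(screen, label, dx):
    """Locate the unique label, then aim dx columns right of its left edge."""
    hits = find_all(screen, label)
    if len(hits) != 1:
        raise LookupError("label not found or ambiguous")
    r, c = hits[0]
    return (r, c + dx)

ambiguous = find_all(screen, FIELD)                  # the field alone matches twice
target = field_next_to(screen, EMAIL_LABEL, dx=3)    # the label pins down one field
```

Searching for the field image alone returns two candidates, while going through the label gives exactly one click target. The offset is still a hard-coded distance, so a layout change between label and field would break it.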
"Computer users with rudimentary skills"..... "with a basic understanding of Python"?
For years I have been asking for a software development tool that allows me to write PHP code by throwing cow-pats at the screen with the Wiimote.
And my colleagues want a tool that allows dispatching my bugs with the Wii gun attachment they use in "Quantum of Solace".
Sent from my ASR33 using ASCII
The subtitles were a bit of a surprise. Can MIT not afford better than built-in microphones on cheap laptops? Between her vaguely Asian accent, the poor quality of the audio (seriously, you're TELLING people how to do something, the audio is important here - did they record this in a shower stall or something? my netbook's audio sounds 100x better than this), and what is apparently some sort of wacky audio encoding, she is basically impossible to understand. People who speak English as a second language aren't going to be able to understand this; thank god they did the subtitles.
Neat concept though.
moox. for a new generation.
FTFA: "Sikuli -- which means God's eye in the language of the Huichol Indians in Mexico". Mexican Indians love their hallucinogenic Peyote. On the other hand, MIT researchers want the masses to program with the mouse. Well, I know about "correlation is not causation", but MIT sure is an interesting place to be.
This is a GUI version of Expect. Nothing really groundbreaking. It will also break as soon as the app changes how it looks, just like Expect. I hate Expect passionately.
Yeah, this might work until the icons change. I don't see this working too well in practice. I don't know about Mac, but on my Ubuntu system the icons got updated last week. And it happens often enough that keeping these scripts updated would be a serious pain and expense. It isn't like an ordinary user could figure this stuff out either. Despite it being so simple you're still going to need an IT person to create these scripts. Now you just have dumber IT people. Probably people who COST you more money in practice too because they "can" do it; it's just that the results of their work take more maintenance. It reminds me of this .bat file written for this video store that backs up a database to a flash drive. If it had only had a statement to check if the flash drive were present and alert the user, they wouldn't have wasted $80 calling me to come and find out why the backup program wasn't working. Seriously dumb programmer. In the right hands this kind of thing is good. In the wrong hands it is bad.
From what I've seen, this is a macro program that can use screenshots rather than key/mouse data to automate tasks. So you PROGRAM your PC in the same way you PROGRAM a VCR to record a show. It is NOT the same as writing an application.
But it seems very interesting once you get past this difference. Macros are very handy for testing in my experience but often have a problem because a tiny misalignment can ruin it all. If this program is smarter because it can recognize where data is supposed to go... well, that would certainly make automated tests a bit easier.
Interesting stuff. Just don't think you will be writing software with this.
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
I'm suddenly reminded of horrible apps written in VB97, with no concern for the back end, horrible input kludge, etc.
Sent from my PDP-11
Sikuli velly nice. Near Itari. Parelmo, velly nice. Except warret got storen.
Wow, no one has watched the movie Swordfish have they?
Have you seen his wife recently?
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
This is where we get when everything is a GUI. As long as I have a decent shell & environment, I think I prefer shell scripting.
Otherwise it's just not complete, IMHO.
How many more years will slashdot have an off-by-one error on your Score in your profile?
... but does anyone know if the program is always that slow?
I understand that it has to visually find the button and this is computationally expensive, but the 2~3 seconds lag didn't seem compatible with the task.
On a sidenote, the video states that there's no "internal API" dependence, but it clearly has to send "click" and "type" signals. Is that really OS independent or was it just an overstatement?
This is the same sort of scripting you can do with many already existing languages. AutoHotkey for example. The only new feature would be the ability to copy the screenshot directly into the program as opposed to taking it outside the program and referencing the file directly. I'd say that this scripting language is actually weaker because of it. As far as using this inside a game... they are already hardened against this sort of thing. For example, next time you're in EVE look at the buttons you use. They are semi-transparent. This is not just for aesthetics. If you take a screenshot of the button, and then change your camera angle, the button looks different because what's behind it is different. That doesn't mean you can't script inside EVE, you just have to be a lot more clever than using a script to click on a static image of the GUI. This language would be almost completely useless in any GUI that has any transparency. Which I'd think would include Vista, Win7 and even Macs with the right stuff turned on.
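To put rough numbers on the transparency point, here is a toy sketch comparing an exact match with a thresholded fuzzy match when the same semi-transparent button is drawn over two different backdrops. The similarity measure (one minus the normalized mean absolute difference), the 0.8 opacity and the 0.7 threshold are all assumptions picked for illustration; this is not a claim about how Sikuli scores matches.

```python
def similarity(patch, template):
    """1.0 for identical pixels, falling toward 0.0 as the mean absolute difference grows."""
    diffs = [abs(p - t)
             for prow, trow in zip(patch, template)
             for p, t in zip(prow, trow)]
    return 1.0 - sum(diffs) / (255.0 * len(diffs))

def blend(button, background, alpha):
    """Alpha-blend a semi-transparent button over whatever is behind it."""
    return [[round(alpha * b + (1 - alpha) * g) for b, g in zip(brow, grow)]
            for brow, grow in zip(button, background)]

button = [[200, 50], [50, 200]]                               # made-up 2x2 button pixels
captured = blend(button, [[0, 0], [0, 0]], alpha=0.8)         # screenshot taken over a dark backdrop
later = blend(button, [[255, 255], [255, 255]], alpha=0.8)    # same button over a bright backdrop

exact_hit = (later == captured)       # pixel-for-pixel comparison
score = similarity(later, captured)   # fuzzy score between 0 and 1
fuzzy_hit = score >= 0.7              # assumed threshold
```

In this toy case the exact comparison fails while the fuzzy score still clears the threshold. With a more transparent button or a stricter threshold the fuzzy match would fail too, so how bad the problem is depends on those two numbers.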
We're trying to repress those memories, you insensitive clod!
Sorry, there are some things even Sikuli can't process.
How can I believe you when you tell me what I don't want to hear?
It can script GUI actions in much the same way. Granted it's not a very nice environment for more complicated work, but still.
Come on, let's cut through the default Slashdot snark. The image capture aspect of Sikuli is brilliant! I don't like the tagline "program anything with Sikuli" because 99% of software should be written in something else. But think of writing test scripts that can use the image matching features. If the software works as advertised, then you could throw together UI test cases way faster than anything else I've seen. System administration tasks should be a good match too. The resulting code would be brittle and hard to maintain, but for quick one-off scripts, sure... I can see it.
The script may not work if the UI style is different from the one recorded or if the UI language is different from the one recorded. Generally, any option that can change the UI from computer to computer will create a problem for Sikuli.
Good libertarian / Objectivist / Anarcho-Capitalist trolls at least try to post on topic... Watch me and learn, grasshopper. ;-)
Anyway, did MIT just figure out a way to make computers slower and GUI script kiddies more arrogant?! Yuck! C, perl, and OpenBSD FTW!
Has anyone tried writing a Sikuli script that finds the Sikuli IDE window and clicks the green run button?
This time for sure!
if NOT understand logic then
loop
talkTo (self, "Don't program!")
Look (@ Pretty pictures)
endloop
endif
Where are we going and why are we in a handbasket?
This might have potential, depending on how flexible the pattern match is when looking for thumbnails of, ahhh, things...
Now why would you want to do that?
I'd be curious to see how they handle the back end, especially as some others pointed out it does make calls that seemingly require some hook into the OS. As for its usefulness, I doubt it will really take off beyond being a decent prototype. It relies on image matching so if you use and change a custom icon set all your scripts would be kinda worthless. Same goes if the programs you are "screenshot scripting" receive a major overhaul in the GUI department. Until it can address those issues, I doubt it will really take off.
Sikuli is certainly not commercial-grade UI testing software. It was never intended to be, this is academic software written to explore ideas, rather than to polish them to perfection. Also, it is not a "general" programming language. The previous posters that compared it to video-programming are right: not all programs have to target complicated algorithms and data-structures, there is plenty of space for automating "simple stuff".
As an idea, I find the readability of the code particularly interesting. Sikuli code is about the closest you can come to self-explanatory, step-by-step instructions on how to achieve whatever a particular program does. Add a few comments to the most arcane steps, publish those programs to an online repository, and presto! executable step-by-step tutorials.
Yes, the developers may have to address the variability of themes on people's desktops. It is certainly possible to do so (for instance, by keeping a list of mappings from any of a set of "supported" themes to a "canonical" theme, which would be used in all examples), but, as far as ideas go, I really think that Sikuli is a very refreshing idea.
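The theme-mapping idea could look something like the sketch below: scripts refer to images by a canonical name, and a lookup table picks the file that matches the user's theme. Every theme name, image name and path here is hypothetical, and nothing in the thread says Sikuli ships anything like this.

```python
# (theme, canonical image name) -> themed image file. All entries are hypothetical.
THEME_MAP = {
    ("aqua", "ok_button"): "aqua/ok.png",
    ("classic", "ok_button"): "classic/ok.png",
    ("aqua", "close_box"): "aqua/close.png",
}

def image_for(theme, canonical_name):
    """Translate a canonical image name into the file for the active theme."""
    try:
        return THEME_MAP[(theme, canonical_name)]
    except KeyError:
        raise LookupError("no %r image for theme %r" % (canonical_name, theme))

path = image_for("classic", "ok_button")
```

The obvious cost is that somebody has to capture and maintain one image per control per supported theme, and an unsupported theme fails loudly instead of silently clicking the wrong thing.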
The last time I tried to use Applescript on windows or linux, it wouldn't even start up.
If you have to write a script to automate GUI applications you're undermining the purpose of computers. I'm sitting here imagining people automating deletion.
I may just be opening a can of worms here, but the first thing I thought of after seeing the demo was, "Can I push a button on a Flash page?"
Some accountants seem to think everyone needs to learn accounting in order to function in society. But people have other jobs. Some of us like our dumbed down tools because they fill a need. My tax software lets me do my taxes without learning "proper" accounting. Similarly, I know some people who benefit greatly from a little passing knowledge of high-level scripting languages like VB, JavaScript, or even Python.
For those kinds of people, Sikuli looks pretty cool because they can do things that would be pretty difficult otherwise. Hey, even for a lot of experienced programmers, capturing a region of the screen and doing fuzzy pattern matching might be a significant task. I haven't tried Sikuli yet, but it looks like it would be very helpful for some things, and a lot easier to deal with than AutoIt or AutoHotkey.
(BTW, TurboTax was just an example. I actually use something I like better, but you get the idea.)
Wow they just created the old VB SendKeys command. I was actually doing stuff like this 12-14 years ago with SendKeys command in VB. In "practical" use back then it sucked and I am certain that has not changed.
Got Code?
I did this exact same thing in AutoIt, except that it needs exact matches of images instead of a fuzzy recognizer. (Plus, I also had rule triggers and state vs just a single list of imperative commands)
The fuzzy match is a nice addition, but this automation concept has been available for years.
man ifconfig
Got Code?
Just Great... all the spammers need now is a few CAPTCHA deciphering Sikuli plug ins.
Once that's done we can all go back to manually removing spam from our web forums and in-boxes.
How do you sanitize your inputs in a language that checks what is displayed on the screen? Instead of XSS or SQL injection, you could end up being hacked just by viewing an ordinary picture attached to an email, if that kind of programming becomes popular.
For some reason this suddenly reminded me of HyperCard. Anyway, I think there's definitely a desire for this sort of thing out there. From the Wikipedia article on HyperCard...
HyperCard has been described as a "software erector set." It integrates a software development environment with a run-time environment in a simple, easily accessible way. The tools required to write an application, principally the creation and configuration of screen objects like buttons, fields and menus, are part and parcel with the ability to add programmed functionality to those objects. ... "Empowerment" became a catchword as this possibility was embraced by the Macintosh community, as was the phrase "programming for the rest of us", that is, anyone, not just professional programmers.
It is basically expect script for GUIs.
There's no place I can be, since I found Serenity.
How is this any different than Automate? That has been around for many years, and based on the MIT video it appears that Automate is much easier to use.
http://www.networkautomation.com/automate/7/
Yeah, and the last time I tried to run Logic Pro 8 on Windows or Linux, it wouldn't even start up.
The idea is cool and innovative, and makes automating a point-and-click interface a breeze. It certainly has applications.
But overall, it just seems like a Bad Idea. It will be as reliable as screen-scraping in browsers, so it would be wise to avoid it for the same reasons.
Even just changing the theme of your OS or the icon sizes could well be enough to confuse the image processing. The code won't be portable, and in the end, for anything but the most simple tasks, the person using it would still require some programming skills. Because of this, I think between Sikuli and command-line scripting, command-line scripting has more staying power.
Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
What happens when the luser changes his theme? Or when Apple updates the system software and controls change places/colors a little tiny bit? (is it still called "system software"?)
I tried playing with it and found it really impressive. The implementation is still beta-ish, but good enough to give it a try.
The first thing I made it do for me was to launch my VNC viewer on my Mac laptop, connect to my local headless Windows 7 machine, and click the iTunes play button to play some music there. It just worked (amazingly), and I found it to be a pretty good use case for a tool like this. A task like that cannot be easily automated. At least, that has not been the case with any tool you have only been trying for 5 minutes.
I can imagine that if the image pattern matching can be extended to do recognition, such as face recognition / text OCR, and pass the recognized info back, or if it adds a webcam as an input device (instead of keyboard / mouse IO), it'll be overwhelming.
Besides, it really is an inspiring way of coding.
I have to say I am impressed. I have had a play with some of the demos and I like what I see. Whilst I agree that there are limitations this project seems fantastic.
Having tried and failed to use "win runner" in the past due to the complexity of the GUI application I was testing, this scripting would get past the problems we were having.
I can envisage sending canned scripts to my folks for doing maintenance on their own machine, even just some diagnostics that I find hard to do over the phone.
I have a couple of itches of my own that I reckon I could scratch with this. For example, I have a MacBook that I sometimes attach to an external display. Sometimes the external is on the left of my laptop, sometimes the right, sometimes directly above. It would be cool to have a script that allowed me to just click an icon to arrange the displays appropriately. Sikuli is close. I am about to go off and see if that will work.
I mean they have associative arrays indexed by a freaking picture. That is simply, well, paradigm shifting. I am less concerned about the actual efficacy of Sikuli than I am about the ability to hook applications together through their GUI. I am thinking about something like "GUI pipes" which is something I have been thinking about for some time. Mark III of this stuff could be amazing.
I honestly think this project is potentially awesome. In the olden days, before the net was quite so pervasive, we used to talk about using the RussTerm, which was basically getting our guy on the ground in a foreign country (Russell) to type stuff on the machine he was looking at whilst we talked him through it over the phone, mostly because we could not automate the stuff we wanted to do. This would address many of the use cases for which similar requirements might exist today. That's just one idea that occurs off the top of my head.
Many posters have noted that much of this functionality exists already in tools like AutoIt and AutoHotKey; some numpty even mentioned SendKeys in VB. But these people have missed the point: until now it's all been very "Goto X,Y -> Click", not "find(Thing).click()". Even things like WinRunner or RationalTest seem, in my experience, to be far too rigid to be useful. I can see how I would have used this tool to do much good work for our software back when I was demoing, developing and testing stuff.
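The difference between the two styles shows up as soon as the layout moves. A toy sketch, with the screen reduced to a grid of named widgets; `find`, `click_at` and `click_thing` are made-up helpers for illustration, not Sikuli calls:

```python
def find(screen, thing):
    """Return (row, col) of the first cell equal to thing, or None."""
    for r, row in enumerate(screen):
        for c, cell in enumerate(row):
            if cell == thing:
                return (r, c)
    return None

def click_at(screen, xy):
    """'Goto X,Y -> Click': whatever happens to be at xy gets clicked."""
    r, c = xy
    return screen[r][c]

def click_thing(screen, thing):
    """'find(Thing).click()': look the target up first, then click where it is now."""
    xy = find(screen, thing)
    if xy is None:
        raise LookupError(thing + " not on screen")
    return click_at(screen, xy)

old_layout = [["menu", "play", "stop"]]
new_layout = [["menu", "stop", "play"]]    # the UI was rearranged

RECORDED_XY = find(old_layout, "play")     # coordinates recorded on the old layout

coord_result = click_at(new_layout, RECORDED_XY)   # clicks the wrong widget
find_result = click_thing(new_layout, "play")      # still clicks "play"
```

The recorded coordinates land on "stop" after the rearrangement, while the lookup-based click still hits "play". It also fails loudly when the target is gone instead of clicking blindly.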
That it is wrapped in a nice scripting language as well just makes it even better.
I'm off to see how good it actually is....
"The first thing to do when you find yourself in a hole is stop digging."
It's an interesting idea, but if you're serious about automating Windows, I heartily recommend AutoIT3. http://www.autoitscript.com/autoit3/
Increase my killing power, eh?
Okay, I have done a fair amount of programming and yet with a new Mac I have not yet dived into the SDKs, etc. I once wanted to do some batch resizing of photos and yet couldn't get it done in Automator easily without being scared of losing the original photos, on my first dive into it. Yes, I actually wrote a great auto-compositing and resizing program once, driving the Gimp on Linux. It was awesome. But that was years ago and now I have a nice new computer. And where did that code go? Yes, I'm sure Automator, Quartz Composer, my shiny new Xcode system and whatever else works on a Mac will be great. But I haven't had time to learn it.
Enter Sikuli. I wrote a hello world and it worked fast. I don't know if I could do it to do batch photo processing still but it just seems cool. I'd rather it was decoupled from a language and the editor was open sourced (maybe it is?) though, so others could build on that. For example if there was a binding to Perl and you could just use the IDE, then maybe someone could add Perl bindings and someone else might add use of CPAN modules for downloading web pages, etc.
Also the vision algorithm looks a bit slow.
There was once an experimental system created that allowed you to program graphic drawings drawn as if on a napkin which would animate in 2D, which is how the program would run. A true graphic language. Maybe someone can find it probably in the ACM SIGGRAPH proceedings several years ago. Maybe "graphic shell" and "napkin drawings" would find it.
Also see VizDraw (pdf) where recognition is done on drawing with a pen tablet.
http://www.eng.uwaterloo.ca/~akmishra/VizDraw2.pdf
Anyway, Sikuli is spectacular for using computer vision techniques to allow for slight changes, and for being immediately useful. I'd like to see it linked to Xcode for RAD of Objective-C apps; Apple should definitely license it or hire the developers for research on it. There is a vast field opened up by this, finally an a-hah experience and not just Apple but many developers should now consider how to get the computer to be smart and find out what you want to do.
This ought to make it possible to do easy mechanized data extraction from the web, analysis of webcam feeds, acting on audio and other types of sensor cues, accessing data and devices over networks, and taking action based on feeds from other devices that are minimally enhanced like my cellphone telling my mac and maybe my mail server when its battery is about to die. It could forward mail to another device, etc. This kind of thing even could work in video cameras and household devices. Even if you just consider it a way to turn people on to programming it is invaluable and fun. I'd like to see Sikuli's functional pieces broken off into standalone services that can be used by other things. As for the comment about window manager themes or operating system versions changing and breaking the script due to icon changes, I think the vision detection of a gui button actually is finding the button and window ids in Sikuli and ought to be able to hand those back.
The editor should also be broken off, of course it needs to be able to launch a screenshot capturing action but that does not mean it must be the sole application allowed to do this. And you could write (snap?) a Sikuli script to run a screenshot capture. Finally I think the Sikuli scripts ought to allow being compiled or otherwise optimized since obviously once it is run, Sikuli knows what the ID of the graphic element it finds is and thereafter need not do vision recognition, it seems.
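Whether Sikuli really holds on to an element ID is speculation on my part, but the "don't redo the vision step every time" idea can also be had with a plain location cache: remember where the image was found last time, re-check only that spot, and fall back to the full search if it no longer matches. A minimal sketch, with the screen as rows of text and every name hypothetical:

```python
def full_search(screen, template):
    """Stand-in for the expensive whole-screen search: scan every row."""
    for r, row in enumerate(screen):
        c = row.find(template)
        if c != -1:
            return (r, c)
    return None

class CachedFinder:
    def __init__(self, search):
        self.search = search       # the expensive search function
        self.cache = {}            # name -> last known (row, col)
        self.full_searches = 0     # how often the slow path ran

    @staticmethod
    def _matches_at(screen, template, pos):
        r, c = pos
        return r < len(screen) and screen[r][c:c + len(template)] == template

    def locate(self, screen, name, template):
        pos = self.cache.get(name)
        if pos is not None and self._matches_at(screen, template, pos):
            return pos                         # cheap check at the remembered spot
        self.full_searches += 1
        pos = self.search(screen, template)    # slow path: search everything
        if pos is not None:
            self.cache[name] = pos
        else:
            self.cache.pop(name, None)         # forget a stale location
        return pos

finder = CachedFinder(full_search)
first = finder.locate(["....", ".OK."], "ok", "OK")    # slow path
second = finder.locate(["....", ".OK."], "ok", "OK")   # served from the cache
moved = finder.locate(["OK..", "...."], "ok", "OK")    # button moved: slow path again
```

The cached spot is always re-verified against the current screen, so a moved or restyled button costs one extra full search rather than a wrong click.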