Programming Languages Will Become OSes

Posted by michael on Friday January 17, 2003 @05:48AM from the i-run-perl-OS-5 dept.

Anonymous Coward writes "A couple of months ago, at the Lightweight Languages Workshop 2002, Matthew Flatt stated a premise in his talk: operating systems and programming languages are the same thing (at least 'mathematically speaking'). I find this interesting, and there is a lot of truth in it. Both OS and PL are platforms on which other programs run. Both are virtualizing machines. Both make it easier for people to write applications (by providing APIs, abstractions, frameworks, etc.)"


Slashdotted by Anonymous Coward · 2003-01-17 05:50 · Score: 0, Redundant

A couple of months ago, at the Lightweight Languages Workshop 2002 [http://ll2.ai.mit.edu/], Matthew Flatt stated a premise in his talk: operating systems and programming languages are the same thing (at least "mathematically speaking"). I find this interesting, and there is a lot of truth in it. Both OS and PL are platforms on which other programs run. Both are virtualizing machines. Both make it easier for people to write applications (by providing APIs, abstractions, frameworks, etc.)

Intro, Isolation, Perl

The difference between the two, Matthew continued, is that an OS focuses more on non-interference, that is, isolation between OS processes. The main task of a multiuser OS is to let several users use the computer simultaneously. Thus, it is important that no user can take over the machine or use up its resources permanently. Also, no process should be able to terminate other processes, peek into their resources, or do anything else that violates privacy unless the OS security policy permits it.

On the other hand, a PL focuses on expressiveness and cooperation. A PL provides high-level constructs and facilities so that one can write programs in less time and with less effort. Ten lines of higher-level PL code might be equivalent to 100 to 1000 lines of machine or lower-level language code. Additionally, a PL provides means for people to share reusable code through the concepts of modules, shared libraries, components, etc.

As time progresses, OSes are becoming more like PLs, and vice versa. OSes now provide more and more ways for cooperation/sharing: IPC, threads, COM, etc. PLs now provide ways to do isolation: sandboxing, processes, etc.

However, none of the programming languages that I currently use (Perl, Python, Ruby) was designed from the ground up to do isolation. Thus, none of their isolation mechanisms really works well.

This article will focus on the above three languages. It would certainly be interesting to also discuss Scheme, Smalltalk, Java, and Erlang. However, since I'm not adequately familiar with any of them, I'll leave it to the readers to give feedback on these.

Why Isolation In PL?
As people construct more and more complex systems, the need for isolation becomes apparent. Complex systems usually run untrusted user-level code that needs to be restricted. Several examples follow.

Database systems usually provide some sort of stored procedure. A remote client can connect to the database and trigger a stored procedure. It is important that if the stored procedure crashes or loops, other clients can continue to use the database.
Business applications usually allow users to specify business rules or constraints. Both are basically some simplified high-level code. Users might specify these rules incorrectly, and the application must ensure that those errors have no unwanted impact.
Web application servers usually allow pages/templates to contain code. Since generally the interpreter itself (e.g. Perl or PHP) is exposed to execute that code, the application must somehow ensure that no template can crash the application.
Other applications might allow users to specify regular expressions. Regular expressions are actually a language, though a mini one. Overly complex regexes, whether specified accidentally or on purpose, can cause the regex engine to backtrack practically endlessly.
So, in essence, complex applications are usually platforms in themselves, running subprocesses/subprograms (in a single OS process). This requires the PL to have isolation mechanisms beyond those provided by the OS: restricting a piece of code from accessing a certain part of the filesystem, from using more than a specified amount of memory/CPU time, or from accessing certain functions/modules/variables. Unfortunately, most PLs don't have enough of them.
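
To make the regular-expression case concrete, here is a minimal Python 3 sketch. The language choice, the helper name regex_matches_within, the pattern and the one-second budget are all assumptions of this example, not something from the talk. It illustrates the fallback described above: since the language gives no way to cap the CPU time of a single regex match, the match is pushed into a separate OS process that the host can kill when it runs over budget.

```python
import subprocess
import sys

# A classic pattern with nested quantifiers; on a non-matching subject the
# backtracking engine explores an exponential number of paths.
PATTERN = r"(a+)+$"
SUBJECT = "a" * 64 + "!"

# Code run by the child interpreter: exit status 0 means "matched".
CHILD_CODE = (
    "import re, sys\n"
    "sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 1)\n"
)

def regex_matches_within(pattern, subject, seconds):
    """Return True/False for a match, or None if the time budget ran out."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, pattern, subject],
            timeout=seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.returncode == 0

quick = regex_matches_within(r"a+$", "aaa", 5.0)
runaway = regex_matches_within(PATTERN, SUBJECT, 1.0)
print("quick match:", quick)
print("runaway match:", runaway)
```

The host survives the runaway pattern, but only by paying for a whole OS process per match, which is the kind of cost a PL-level isolation mechanism would avoid.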

Perl
The two main security models in Perl are tainting and safe compartments. Tainting is mainly for tracking data, so I will not discuss it here.

In Perl 5.6/5.8 there are about 400 bytecode-level instructions, called opcodes. All Perl code will eventually be compiled to these opcodes. print is actually a single opcode. So are open, sysopen, mkdir, rmdir, fork, gethostbyname, etc. To see the complete list of Perl opcodes, see the Opcode documentation [http://search.cpan.org/author/JHI/perl-5.8.0/ext/Opcode/Opcode.pm].

Two things are apparent. One, Perl opcodes are higher level than machine-level instructions or even Java bytecode instructions. Two, Perl is a monolithic beast. Many facilities (like directory manipulation and even DNS-related functions) are built into the language. Perl5 is monolithic for historical reasons. Perl6 will also be monolithic, so I heard, for speed reasons.

Every single opcode can be enabled or disabled. This is done in the compilation step. If the compiler encounters a forbidden opcode, it refuses it and compilation fails. This has the advantage of speed: the cleansed code incurs no run-time speed penalty at all. The disadvantage: the check happens only at compile time, so one must make sure that every piece of untrusted code, including code compiled at run time, is compiled under the mask. Otherwise untrusted code can be compiled with dangerous opcodes in it.
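
As a loose analogy only, the compile-time opcode check can be sketched in Python 3 with the standard dis module. This is not how Perl's Opcode mask is implemented. The function names compile_with_mask and is_accepted and the one-entry mask are hypothetical, and Python's bytecode is much lower level than Perl's opcodes, so a real policy would need far more than this.

```python
import dis
import types

# Hypothetical mask: refuse any code that contains an import instruction.
FORBIDDEN_OPNAMES = {"IMPORT_NAME"}

def compile_with_mask(source, forbidden=FORBIDDEN_OPNAMES):
    """Compile source, failing if any forbidden opcode appears in it."""
    code = compile(source, "<untrusted>", "exec")
    pending = [code]
    while pending:
        current = pending.pop()
        for instruction in dis.get_instructions(current):
            if instruction.opname in forbidden:
                raise ValueError(f"forbidden opcode: {instruction.opname}")
        # Function and class bodies are nested code objects; check them too.
        pending.extend(
            const for const in current.co_consts
            if isinstance(const, types.CodeType)
        )
    return code

def is_accepted(source):
    """Report whether source passes the compile-time check."""
    try:
        compile_with_mask(source)
    except ValueError:
        return False
    return True

namespace = {}
exec(compile_with_mask("x = 1 + 2"), namespace)
print(namespace["x"])
print(is_accepted("import os"))
print(is_accepted("def f():\n    import os\n"))
```

The accepted code object runs at full speed because nothing is checked during execution. The same property is the weakness: any source that the accepted code compiles later by itself is never seen by compile_with_mask.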

Safe.pm is a standard Perl module that allows a piece of code to be compiled with a specified opcode mask (a list of opcodes that are to be forbidden). In addition to that, Safe.pm does a "namespace chroot". It makes Safe::Root0 (or Safe::Root1 for the second compartment, and so on) the code's main:: namespace. This means that the code in the compartment cannot access variables in the original main:: namespace, so global variables like $/ are not shared with code outside the compartment (some variables like $_ or the _ filehandle are shared, though).

That's basically what Perl offers us for security. In practice, Safe.pm is not practical. Choosing a reasonable set of "safe" opcodes is not always straightforward. An opcode like open can range from "rather safe" to "extremely dangerous", because Perl's open does so many things: it can open a file for reading or for writing, execute programs, open a pipe, duplicate a filehandle, etc. You can't, for instance, make Perl allow only reads in open. Overriding open() doesn't make it safe, because the code in the compartment can always refer to the builtin version using CORE::open(). Moreover, Perl can be told to read/write files without using any opcode at all (for example, using $^I). Thus it is not possible to restrict untrusted Perl code from accessing the filesystem. To do this, one must resort to OS facilities (like Unix's chroot or BSD's jail).

The show-stopper for Safe.pm: most modules don't work under it. DBI, for example. Embperl 1.x used Safe.pm but dropped it in the 2.x versions. Virtually no other web application server uses Safe.pm these days. Even Perl experts say that Safe.pm is too broken.

Conclusion: Perl has some sort of sandbox, but it works at the compilation step only. It's not very flexible and it's not very useful. Perl is also monolithic and many functions are built into the interpreter. Thus, it is harder to isolate functionalities.

Python, Ruby, Conclusion

Python
The Python language design is very simple and clean. Amongst the security models of the three languages, Python's is the one I like the most. Python's security model is capability-based, meaning that if you don't want a piece of code to be able to do something, you don't give it a reference to the module/function that provides it. Python is also much more modular: its core functionality is much smaller than Perl's. For example, OS-specific services like unlink or rmdir are located in the os module rather than in the language core. This means we can more easily restrict access to those services by preventing the code from importing the appropriate modules.

Here's Python's execution model: each piece of code runs in a frame ("a context"). In a frame, there are two namespaces: the local and the global namespace. A namespace is a mapping between names and objects. You get a reference (=capability) to an object from a namespace. Every time a variable/function/object/module name is mentioned, Python looks for it in the namespaces. The local namespace is searched first, then the global. If the name is not found in either, Python raises a NameError exception.

We can manipulate a namespace easily, since it is available as a dictionary. We can even execute a piece of code and give it our custom dictionaries to be used as its local and global namespaces. This way, we can limit what objects are available to the code. That's basically how the security model works in Python.
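
A minimal sketch of the idea, written in Python 3 syntax (an assumption of this example; the names double, helper and secret_value are made up for illustration). The code under test only sees what the two dictionaries hand to it.

```python
def double(n):
    return n * 2

# The host decides which names exist for the code it runs.
global_ns = {"helper": double}
local_ns = {}

# The executed code can call helper because the host supplied it;
# the name it assigns lands in the local namespace dictionary.
exec("result = helper(21)", global_ns, local_ns)
print(local_ns["result"])

# A name the host never supplied is simply not there.
secret_value = "host-only data"
try:
    exec("leak = secret_value", global_ns, local_ns)
    lookup_failed = False
except NameError:
    lookup_failed = True
print("lookup failed:", lookup_failed)
```

Note that secret_value lives in the host's own global namespace, yet the executed code cannot name it, because its lookups go through global_ns and local_ns instead.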

Actually, there's a third namespace that is searched when a name is not found in the local or global namespace: the builtin namespace. The builtin namespace contains basic functions like open, exit, and execfile. Most of Python's builtin capabilities are provided through this builtin namespace. The rest are creatures like print or exec, which are statements, not functions/objects.
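
The builtin namespace can be swapped out in the same way. The sketch below assumes Python 3, where the builtin namespace is reached through the "__builtins__" key of the global dictionary and where print and exec are ordinary builtin functions rather than statements. SAFE_NAMES and run_restricted are hypothetical names, and the whitelist is arbitrary.

```python
import builtins

# Hypothetical whitelist of harmless builtins.
SAFE_NAMES = ("len", "range", "sum", "min", "max")
safe_builtins = {name: getattr(builtins, name) for name in SAFE_NAMES}

def run_restricted(source):
    """Execute source with only the whitelisted builtins visible."""
    global_ns = {"__builtins__": safe_builtins}
    exec(source, global_ns)
    return global_ns

ns = run_restricted("total = sum(range(5))")
print(ns["total"])

# open is not in the whitelist, so the name lookup itself fails.
try:
    run_restricted("open('some-file')")
    open_blocked = False
except NameError:
    open_blocked = True
print("open blocked:", open_blocked)
```

This shows the lookup mechanism only. It is known that code can often climb back to the real builtins through object introspection, so a dictionary swap like this should not be mistaken for a complete sandbox.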

rexec is the standard Python module to do sandboxing. It basically does what is explained above: run the sandboxed code with custom local and global namespaces. Additionally, rexec creates a custom builtin namespace and provides safer substitutes for functions like open or __import__. This way, we can tell rexec to forbid the untrusted code from opening a file in write mode, or from importing dangerous modules.
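
The substitution idea can be sketched as follows. This is not rexec's API (rexec was later removed from Python). It is a hand-rolled Python 3 illustration, and guarded_open, guarded_import, ALLOWED_MODULES, run_sandboxed and outcome are all hypothetical names.

```python
import builtins

ALLOWED_MODULES = {"math"}  # hypothetical import policy

def guarded_open(path, mode="r", *args, **kwargs):
    """Stand-in for open() that refuses every mode that can write."""
    if set(mode) & set("wax+"):
        raise PermissionError(f"mode {mode!r} is not allowed here")
    return builtins.open(path, mode, *args, **kwargs)

def guarded_import(name, *args, **kwargs):
    """Stand-in for __import__ that only admits whitelisted modules."""
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"import of {name!r} is not allowed here")
    return builtins.__import__(name, *args, **kwargs)

def run_sandboxed(source):
    """Execute source with the guarded substitutes as its builtins."""
    sandbox_builtins = {"open": guarded_open, "__import__": guarded_import}
    global_ns = {"__builtins__": sandbox_builtins}
    exec(source, global_ns)
    return global_ns

def outcome(source):
    """Return 'ok' or the name of the exception that stopped the code."""
    try:
        run_sandboxed(source)
    except (ImportError, PermissionError) as exc:
        return type(exc).__name__
    return "ok"

print(outcome("import math\nroot = math.sqrt(16)"))
print(outcome("import os"))
print(outcome("open('scratch.txt', 'w')"))
```

The import statement looks up __import__ in the builtin namespace of the running code, which is why replacing that one entry is enough to intercept every import the sandboxed source attempts.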

rexec is pretty flexible and indeed has been used successfully in several applications. Guido's web browser Grail, for instance, allows running Python applets. However, rexec seems not to be flexible or fine-grained enough for everyone, because Zope chose not to use it. Instead, Zope uses its own home-grown module to do restricted execution.

There are several things that rexec can't do. Resource limiting, for example. To do that you need to resort to the OS (like using Unix's setrlimit). Also, since Python does not have private attributes, you can't give an object to untrusted code without the fear that the code will use Python's reflection mechanism to "peek into the guts" of your object (and from there gain references to other objects). There are two separate solutions to the last problem: the standard Bastion module and the mxProxy C extension, which essentially provide private attributes.
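
The "peeking" problem is easy to demonstrate. In this Python 3 sketch, the Account class and the untrusted_probe function are invented for illustration. Even a double-underscore attribute is only renamed, not hidden, and a bound method leads straight back to the global namespace of the code that defined it.

```python
class Account:
    def __init__(self, owner, pin):
        self.owner = owner
        self.__pin = pin  # name-mangled to _Account__pin, not truly private

    def check(self, pin):
        return pin == self.__pin

def untrusted_probe(obj):
    """Pretend this is code we did not write, handed one object."""
    # Reflection exposes the instance dictionary, mangled names included.
    stolen_pin = vars(obj).get("_Account__pin")
    # A bound method carries a reference to its defining global namespace.
    reachable = obj.check.__func__.__globals__
    return stolen_pin, reachable

account = Account("alice", "1234")
stolen_pin, reachable = untrusted_probe(account)
print("stolen pin:", stolen_pin)
print("Account class reachable:", "Account" in reachable)
```

Handing over a single object therefore hands over much more than that object, which is the gap that proxy-style wrappers try to close.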

Conclusion: Python has a nice and simple security model. However, rexec cannot do all the kinds of isolation that one might need, such as resource limiting. Guido has also said that rexec is not tested enough and might contain security holes.
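
For completeness, here is what the OS fallback for resource limiting can look like. This is a Unix-only Python 3 sketch using the standard resource and subprocess modules. The function name run_with_cpu_limit and the one-second limit are assumptions of the example.

```python
import resource
import subprocess
import sys

def run_with_cpu_limit(source, cpu_seconds):
    """Run source in a child interpreter capped at cpu_seconds of CPU time."""
    def apply_limit():
        # Runs in the child just before exec; the OS terminates the
        # process once it has consumed this much CPU time.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    done = subprocess.run(
        [sys.executable, "-c", source],
        preexec_fn=apply_limit,
        capture_output=True,
    )
    return done.returncode

well_behaved = run_with_cpu_limit("print(sum(range(10)))", 1)
runaway = run_with_cpu_limit("while True: pass", 1)
print("well-behaved exit status:", well_behaved)
print("runaway killed by signal:", runaway < 0)
```

A negative return code means the child was terminated by a signal. The limit works, but it applies to a whole OS process, so it cannot single out one piece of code running inside a larger interpreter.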

Ruby
One of the main goals of Ruby seems to be "to replace Perl". In that respect, it has copied many Perl features. Tainting is one of them. In Perl there are two running modes: tainting mode on (-T, setuid) and off (no -T). Ruby extends this concept a bit by providing several "safe levels", numbered 0 through 4 and indicated by the global variable $SAFE. The safe levels are as follows.

Safe level 0 (default mode): no tainting is performed.

Safe level 1: tainted data cannot be used to do potentially dangerous operations.

Safe level 2: in addition to level 1 restrictions, program files cannot be loaded from globally writable locations (e.g. from /tmp).

Safe level 3: in addition to level 2 restriction, all newly created objects are considered tainted.

Safe level 4: in addition to level 3 restriction, the running program is effectively partitioned in two. Nontainted objects may not be modified. Typically, this will be used to create a sandbox: the program sets up an environment using a lower $SAFE level, then resets $SAFE to 4 to prevent subsequent changes to that environment.

It's evident that, as with tainting, the safe levels are primarily concerned with data security and are not very sandbox-like (in the sense of a sandbox that isolates subprocesses from one another). Matz confirmed this on the ruby-talk mailing list by saying that Ruby currently does not have a sandbox. Running code in safe level 4 is usually too restrictive to be practical, and it still does not provide enough isolation.

The problem with isolation in Ruby is that all objects are accessible from any code through the ObjectSpace facility (including code running in safe level 4). This is of course in direct conflict with the capability concept, in which you don't give out a reference/capability unless necessary. However, Ruby does protect an object's attributes and has a #freeze method to make an object read-only.

Conclusion: Ruby doesn't have a sandbox (yet).

Other PL's
Java has a sandbox security model and a bytecode verifier. Tcl basically has the same. Erlang is more advanced in providing isolation, in that it has a notion of "PL-level processes" (each process is isolated in all ways from the others).

Conclusion
As people construct more and more complex applications in PL, PL's are required to have adequate security/isolation mechanisms. Current PL's in mainstream usage do not have adequate security mechanisms, so programmers are often forced to fall back to using facilities provided by the OS. This has drawbacks such as lack of portability and reduced efficiency. There will perhaps be new PL's designed with isolation as one of their main goals--or current PL's might be improved/redesigned--so hopefully this requirement of having a "multiuser PL" will be fulfilled in the future.

About the Author:
Steven is a software developer residing in Bandung, Indonesia.