Tesuji Code: June 2009

Monday, June 15, 2009

Premature optimization in language choice

Donald Knuth famously said, "Premature optimization is the root of all evil." This is one of the most important things for a developer to learn. Many hours of programmer time are wasted on optimizations that save milliseconds. Even if your code is run enough to get back that time, the time is likely to be split up between so many users that none of them care about the tiny difference. Worse, time taken to optimize is time not spent on adding features or fixing bugs, each of which can save the user a lot more time than minor optimizations. Optimization before profiling is a choice of a bad trade-off: optimized code takes longer to write, and is harder to understand and maintain.

Choosing a programming language is often the most important optimization choice your team will make. Almost universally, compiled languages run faster than JIT-ed languages, which in turn run faster than traditionally interpreted languages. This comes with the same trade-off as most other optimization choices: the closer the code is to the machine, the longer it takes to write and the harder it is to understand and maintain.

This leads me to my main assertion of this post: language performance should be a very minor consideration in language choice. Since the language is one of the first choices made in a project, choosing a language based on its performance is inherently a premature optimization.

Of course, I'm not saying that optimization isn't important. Optimization is merely an area in which it is very important to choose your battles wisely; only optimize when and where you have to. Performance testing and code profiling can help you to know when and where that is.

The Pareto Principle is often applied to optimization with the ratio that 80% of your performance bottlenecks can be removed by optimizing 20% of your code. However, my experience that the ratio is much more disparate; 99% of your performance bottlenecks can be removed by optimizing 1% of your code.

Just as choosing a language for your project based on performance is a premature optimization, rewriting performance bottleneck code in lower-level language can be a positive optimization. Python, Perl, and Tcl support calls to C or C++ code using SWIG. Java supports call to lower-level languages through the JNI. Nearly every high-level language has similar tools that allow calls to lower-level interfaces. While choosing a language based on its performance is a premature optimization, choosing a language based on its ability to be optimized in a lower-level language is not premature, it's wise.

Some notes:

Performance scalability is a different issue from performance. One need only look to Twitter's issues with Ruby to see why this is an important distinction. Languages with high performance may not have lower performance scalability (see DARCS' issues with Haskell) and languages with lower performance may have high scalability (see Google's use of Python).

Lastly, I said that language performance should be a very minor consideration in language choice. However, sometimes high-performance languages are still good choices, not because of their performance, but because they are most expressive for the problem domain. Operating systems are a good example of this; memory management and filesystems require a lot of low-level addressing that is best represented in a systems language like C or C++.

Sunday, June 14, 2009

Problems with Python

As mentioned in my previous post, Python is my favorite language. I believe that Python is the appropriate language in which to solve more problems than any other language. It is portable, fast (compared to similar languages), easy-to-use, flexible, and powerful. It also resists bad coding practices because of its whitespace-delimited syntax.

However, there are serious problems with Python as well. Some of the benefits of Python are also curses.

Python is fast, but not as fast as compiled languages such as C, OCaml or Haskell. In addition, it is not JIT compiled like some other interpreted languages such as Java and C#, so it is a bit slower than these languages also. However, it is worth noting that both the compiled and JIT interpreted/compiled languages tend to be statically typed, losing the benefits of Python's flexible dynamic typing.

Python's dynamic typing can also be a double-edged sword. While it does allow rapid programming, it also doesn't catch as many bugs as a static type system. Proponents of static typing claim that most errors are type errors. It's very notable that in languages like Haskell, most errors are caught at compile time. If a Haskell program compiles, it most likely works as intended. Proponents of dynamic typing counter that if you use contracts and unit tests, type errors will be caught along with other errors that aren't type errors, and that dynamic languages simply allow you to choose the level of test coverage you want instead of having to use types all the time.

I don't know the solution to this debate, but I will note that whenever you use the word "if" before a counterargument about language features, it's generally not a good argument. Arguments that begin with phrases like "if you use contracts and unit tests" rely on good developer practice. Every programmer who has worked on a team knows that such "if"s get thrown out the window for a deadline, or simply not done because the programmer is inexperienced. It's extremely difficult for even great developers to keep a good level of test coverage. If the language doesn't force you to do it, it won't get done consistently.

However, I'd like to add that most of the proponents of static typing use languages that don't have strong enough typing for their arguments to be legitimate. Take the following examples:

In C:

printf("Hello, my name is %s.",10);

In C++:

if(i = 1) cout << "i == 1";

In Java:

int i = 1.4;

These simple examples would be caught in more strictly-typed languages like OCaml or Haskell. If you write code that relies on the type system to catch errors in C, C++, or Java, you shouldn't have problems, but again that's an argument based on an if.

The next issue I point out with Python is its whitespace delimitation. While in general, whitespace delimitation keeps your code and syntax clean, it also presents problems when embedding code within code of another type (i.e. using Django or another web framework) because one might one to put multiple commands on the same line or maintain consistent indentation. It also causes problems when switching editors (as evidenced by the long-running tabs vs. spaces debate).

Lastly, I'll mention lambdas. Python lambda functions can only be one-liners, as if included in a return function, such as:

lambda x : x + 1

This returns a function which takes one argument, increments it by one, and returns the incremented value. The argument for keeping Python lambdas as they are is that there is nothing that can't be done without them. This is true with one exception; you can't create a function without naming it. This is true, but irrelevant. The point of lambdas is to create anonymous functions. Using named functions defeats the purpose; now a name exists that shouldn't and might be used inappropriately.

People often mention the Global Interpreter Lock and lack of tail-recursion optimization as problems with Python. These aren't problems; performance tests show that thread-switching without the GIL hurts performance more than having it. This will change eventually as processors go multi-core, and when it does I am sure the GIL will be removed, but right now the GIL is as it should be. Lack of tail-recursion optimization is irrelevant. Python is an interpreted language and many optimizations don't happen. The one case where it might be relevant is in the case of infinite recursion, where the unoptimized tail recursion causes a stack overflow. However, in my opinion this should be a criticism of languages that do tail recursion optimization; this causes nearly identical recursive functions to behave differently after optimization, breaking optimization's most basic rule.

Saturday, June 13, 2009

Language Advocacy

This article is nearly ten years old, but it still is relevant today. In a recent discussion, I mentioned to a beginning programmer that C wasn't the best language to learn the thought process behind programming. I justified this assertion by saying that the static typing, lack of direct object orientation, and need to manage your own memory all distract from the process of turning the algorithms in your head into code. As an alternative with a less steep learning curve, I suggested Python.

At this point, another programmer, who codes in C for a living, interjected. "C was my first language and I'm a better programmer for it. There's nothing wrong with learning C first."

I started to argue my point, to clarify that there are benefits to learning C first too but that for the particular purpose of learning the process, I felt Python was better. But as I started to make my point I realized that the whole argument was silly.

C was my first language too, but I didn't have a problem criticizing it. Why should he? Neither one of us designed the language and neither of us wrote the compilers. Aspersions on C aren't aspersions on us. If anything, we can only gain from pointing out the problems in C; if enough people notice, the problems might get fixed in the next version of the standard.

Now, I'll be the first to admit that I do have a bias in favor of Python. It is by far my favorite language and nearly all my personal projects are written in it. I probably know Python better than any language.

But I can recognize that there are legitimate issues with Python. In fact, that will probably be the topic of my next post. And I certainly realize that there are many cases where C is a better choice. It might even be a better choice as a first language, depending on what the beginner wants to get out of his or her learning. I certainly understand types, memory management, and how the computer works better because of my learning in C.

So let's all lighten up a bit, okay? If you can't criticize your own language, you aren't looking at it critically. No language is perfect. Not even yours.

Friday, June 12, 2009

Side Effects and Secondary Effects

This is a first post, but I'm not going to spend time on introductions. Let's get down to some content, alright? I'll start off with something simple (most coders will probably find this obvious).

One of the big issues in functional programming is the issue of side effects. Particularly, functional programmers tend to distrust them. A pure functional language such as Haskell or Clean disallows side-effects entirely. And not without good reason. Side effects complicate code, especially when concurrency is involved, an issue that is becoming more and more important as processor manufacturers move toward multi-core processors.

In programming terms, "side effect" doesn't mean exactly the same thing as it does in normal speech or writing. When a doctor speaks of the "side effects" of a medicine, he is speaking merely of any unintended consequences of the medicine. But when a programmer speaks of side effects, he/she is talking about any effects other than the return of a value. Here's why that difference in definition is important:

int a, b;

a = b = 1;

The second line is valid C because of a trick with the operators. The assignment operator in C not only assigns its value to the target variable, but it also returns the value which it assigns. The intended effect is the assignment. The return is a side effect! Or at least it is in the medical sense of the term, "Side effect." To avoid confusion between the more colloquial usage and the programming usage, I'll refer to these unintended consequences as "secondary effects".

Secondary effects, like side effects, are bad language and/or API design. Let's take a look at why.

int factorial(int i)

{

if(i = 1) return 1;

return i * factorial(i - 1);

}

If you've ever programmed in an academic environment, you probably think you've written this C code before. Let's hope for the sake of your grades that you haven't; this code has a serious error. The condition for the if-statement contains an assignment operator rather than an equality test operator.

This code compiles and then fails silently at runtime. If the assignment operator in C didn't allow this secondary effect, this error could be caught at compile time. One of C's biggest benefits is its static type system, so one would expect that catching errors at compilation is something C programmers care about. Many later languages remove the return from the assignment operator precisely because of how common this error is.

Here is another example, a famous piece of code:

while(*dest++=*src++);

This piece of code famously copies the contents of string src into string dest (provided that enough space is allocated for dest). In addition to the confusion of the asterisk operator (which in different cases declares a pointer, indicates the value stored at a pointer, or multiplies its operators, a blatant violation of the Principle of Least Astonishment) there is the added confusion that the secondary effect (return) of the increment operator are being passed to the secondary effect of the assignment operator, which is being returned as the test value to the while test. Add to this the confusion that the increment operator can also be a prefix unary operator, like so:

while(*++dest=*++src);

I'll leave it to the reader to figure out the error caused by this change.

C programmers of course would not present the above line as an example of good code. However, situations as confusing do often turn up in production code, wreaking havoc on the junior developers that attempt to debug them.

Tesuji Code