Tesuji Code

Sunday, September 20, 2009

Quotes and punctuation: What do programming languages have to teach us about natural languages?

It annoys me when programmers don't use proper grammar. Often when I correct a programmer's grammar, they respond with something along the lines of, "I'm a math person, not a verbal person." To that my response is, "If that's true, you're not a very good programmer." For the record, most of these people are good programmers, because they are verbal people. Programming itself requires good math and verbal skills; there's a reason that so many programming gurus recommend Strunk and White's Elements of Style. Programming is a form of writing. If a good programmer doesn't use good grammar, it's not because they aren't capable, it's because they aren't trying.

As the preceding rant shows, I take grammar very seriously. Admittedly, I have a bit more of a burden in this area than the average programmer; I'm developing my own programming language and would love to work professionally in programming language design, so grammars are obviously an area of interest for me.

Part of my research involves finding ways to make programming languages imitate natural language. The logical, mathematical aspects of programming languages are still important, because the ideas that a programmer is recording for compilation are inherently mathematical. But humans speak and communicate most easily in natural languages, so it only makes sense that we would want programming languages to imitate natural languages to some extent.

Sometimes, however, my brain reverses the process a bit, and I start thinking about how natural languages could learn something from programming languages.

An example of this is my thoughts on quotes. Take a look at this bit of dialogue:

"Are you feeling better?" she asked. "I saw what happened to you. I was with you on the beach all that long night but could not reach you."
"My fugue," I said. "You took it."

This demonstrates a few issues with quotation marks. To highlight the issues, let's extract the dialogue into the form of an IM chat log:

Anna: Are you feeling better? I saw what happened to you. I was with you on the beach all that long night but could not reach you.
William: My fugue. You took it.

This quote is taken from The Empire of Ice Cream, an excellent short story by Jeffrey Ford. Ford is an American author, so his use of quotations follows the American convention of placing sentence-ending punctuation inside the quotation marks. If he were a British author, the sentence-ending punctuation would be outside the quotation marks, according to British convention. There's some debate about which convention is better, but most people will say that you can use either as long as you're consistent or that you should use the convention of whatever organization is publishing your writing. While the latter recommendation is pragmatic, the former could use a little work; neither convention makes sense.

As a programmer, I think of open and close quotes as changes in scope. Punctuation inside the quotation marks is in a scope belonging to the quoted person. The text in the IM version of our conversation contains only the text in the quoted scope (Anna and William's scope). Anything outside the quotes is in the author's scope (Ford's scope).

This should also apply to punctuation. Punctuation inside the quotes should only make sense in Anna and William's scope, and punctuation outside the quotes should only make sense in Ford's scope. From this we can see that the American convention has two problems:

"Are you feeling better?" she asked. "I saw what happened to you. I was with you on the beach all that long night but could not reach you."[1]
"My fugue,[2]" I said. "You took it."

[1] There is no indication that Ford's sentence has ended.

[2] William's sentence ended here, but a comma doesn't indicate an end of sentence (look at the IM version of the conversation).

The British convention is worse (in this case):

"Are you feeling better?"[1] she asked. "I saw what happened to you. I was with you on the beach all that long night but could not reach you".[2]
"My fugue",[3] I said. "You took it".

[1] Putting the question mark inside the quotation is inconsistent with the way periods are handled.

[2] There is no indication that Anna's sentence ended.

[3] In addition to having no indication that William's sentence ended, the comma here indicates that Ford's sentence has continued. This is inconsistent with the lack of such indication after the question mark.

If, instead, we follow the idea of punctuating both the quoted sentence and the author's sentence, we get:

"Are you feeling better?" she asked. "I saw what happened to you. I was with you on the beach all that long night but could not reach you.".
"My fugue." I said. "You took it.".

The period-quote-period looks ugly, but it leaves no ambiguity about whose sentence has ended. One might make an argument for placing commas after '"Are you feeling better?"' and '"My fugue."', but I tend toward using less punctuation if (and only if) the punctuation doesn't reduce ambiguity.
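The scope idea can be checked mechanically. The following is a minimal Python sketch, not a real parser: split_scopes and the <quote> placeholder are names invented for illustration, and it assumes straight double quotes that are balanced and never nested.

```python
import re

# Matches one quoted span; assumes quotes are balanced and not nested.
QUOTE = re.compile(r'"([^"]*)"')


def split_scopes(text, placeholder="<quote>"):
    """Return (author_scope, quoted_scopes) for a line of dialogue.

    author_scope is the text with every quotation replaced by the
    placeholder; quoted_scopes is the list of texts inside the quotes.
    """
    quoted_scopes = QUOTE.findall(text)
    author_scope = QUOTE.sub(placeholder, text)
    return author_scope, quoted_scopes


american = '"My fugue," I said. "You took it."'
proposed = '"My fugue." I said. "You took it.".'

print(split_scopes(american))
print(split_scopes(proposed))
```

Run on the American version of William's line, the quoted scope comes back as 'My fugue,' and the author's scope ends without a period. Run on the doubly punctuated version, every sentence in both scopes ends in a period, and joining the quoted scopes reproduces the IM log.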

This is just the kind of nonsense that keeps me up at night. Are there any other places where natural language could learn from the less ambiguous grammars of programming languages? Let me know in the comments.

Friday, August 21, 2009

On Standards Organizations

In light of Guido van Rossum's recent Twitter post, the discussion about using divs for layout versus using tables has been revived. Both sides of the discussion are presenting the same arguments as are always presented in this discussion, and I'm not going to belabor those arguments further. Suffice it to say that I disagree with van Rossum; following the standards *is* worth it.

However, van Rossum's post does bring our attention to a real problem: even in browsers with good standards-compliance, simple tasks are more complicated than they could and should be.

Try, for example, to create a horizontal navigation bar. This simple task is apparently difficult enough to warrant hundreds of tutorials online. Now try to make that same CSS space the links evenly across the width of the page and work intelligently when you have an unknown number of links. Apparently this cannot be done without a brittle JavaScript hack. The reason is that CSS only allows you to hardcode a width percentage for each list item.

The problem here is not the browsers, it's the standard. The standards that W3C puts out aren't good enough to represent solutions to many real problems that web developers are trying to solve.

The most common argument for excluding new features in the standard is that browsers won't be able to conform to the standard. But this is a moot point; browsers already can't comply with standards. The point of the standards isn't to get 100% compliance, it's to ensure that all the browsers work similarly. After all, it's not like there are many websites that have the Acid3 Test on their front page. Having more features in the standard merely gives browsers a higher target for which to aim, and aiming higher can only help the quality of the internet experience for all involved.

An example of this in action comes from the Perl community: Perl has consistently driven innovation in scripting language design by setting ambitious goals. In fact, some of their goals have been so ambitious as to have been formally proven impossible. But setting the bar high has kept Perl a relevant language.

The only ones who benefit from having the bar set low are corporations with large codebases to support: new features sometimes break backward compatibility, and updating code takes a lot of work. So it should come as no surprise that such corporations have stocked the W3C with their own people to protect their corporate interests. The end result is that average web developers are not represented in the standards with which they are forced to work.

This problem is not unique to the internet. Standards organizations are almost inherently controlled by corporate interest. It's the reason that C++ can't elide semicolons after class definitions and Java can't remove deprecated APIs.

There isn't an obvious solution to this problem; the standards organizations we have are the best we are likely to get. But there is one example worth noting. Python, which doesn't have a standards organization, recently reached version 3.0, which breaks backward compatibility and adds a whole slew of features. Yes, it's painful to many developers, but they will recover, and when they do, Python will be a better language for the struggle. Obviously this isn't a tenable solution for web standards, but for programming languages in general, the solution may be to *let the language designers design the language*. After all, if there isn't a standard, it can't be polluted by corporate interest.

Saturday, July 18, 2009

Bit manipulation tricks

These tricks let you operate on the bits of an unsigned integer individually. This can be useful for storing booleans in C when low memory usage is more important than speed.

To avoid problems with two's complement, let's assume that flags is declared with an unsigned type. On most systems unsigned int is 32 bits, and in such cases n should only hold values 0 to 31. n and test can be any integer type large enough to hold this range.

// Set the nth bit. (1u keeps the shift unsigned, so n == 31 is safe.)
flags |= 1u << n;

// Clear the nth bit.
flags &= ~(1u << n);

// Flip/toggle the nth bit.
flags ^= 1u << n;

// Extract the nth bit as an integer (0 or 1) and store it in test.
test = (flags >> n) & 1u;

A quick note on the extraction: the following is more intuitive if you compare it to the set, clear, and flip/toggle operations:

test = flags & (1u << n);

However, this advantage comes with some problems:

test must be stored in an integer of a type the same size as or larger than flags.

The constant being shifted must have a type at least as wide as flags (for example 1u for an unsigned int), or be cast to such a type before the bitshift. In C the shift is carried out in the promoted type of its left operand, so a wide n does not widen the result.

This method extracts 1 << n rather than 1 if the nth bit is set. 99% of the time this is fine because you'll just be using the bit as a test in an if, while, or for statement, but that other 1% can be important, especially if you generalize these statements with a macro or a function.
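The snippets above are C, but the difference between the two extraction styles is easy to check in any language. Here is a rough Python sketch of the same operations. The function names, WIDTH, and MASK are all made up for this example, and because Python integers are unbounded, the mask only emulates the assumed 32-bit unsigned int.

```python
WIDTH = 32                 # assumed width of the C unsigned int
MASK = (1 << WIDTH) - 1    # keeps results inside WIDTH bits


def set_bit(flags, n):
    """Return flags with bit n set."""
    return (flags | (1 << n)) & MASK


def clear_bit(flags, n):
    """Return flags with bit n cleared."""
    return flags & ~(1 << n) & MASK


def toggle_bit(flags, n):
    """Return flags with bit n flipped."""
    return (flags ^ (1 << n)) & MASK


def get_bit(flags, n):
    """Shift-then-mask extraction: always 0 or 1."""
    return (flags >> n) & 1


def masked_bit(flags, n):
    """Mask-only extraction: 0 or 1 << n, not 0 or 1."""
    return flags & (1 << n)


flags = set_bit(0, 5)
print(get_bit(flags, 5), masked_bit(flags, 5))
```

Both extraction results are truthy in an if statement, but only get_bit can be compared against 1 or summed to count set flags.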

Monday, June 15, 2009

Premature optimization in language choice

Donald Knuth famously said, "Premature optimization is the root of all evil." This is one of the most important things for a developer to learn. Many hours of programmer time are wasted on optimizations that save milliseconds. Even if your code is run enough to get back that time, the time is likely to be split up between so many users that none of them care about the tiny difference. Worse, time taken to optimize is time not spent on adding features or fixing bugs, each of which can save the user a lot more time than minor optimizations. Optimization before profiling is a choice of a bad trade-off: optimized code takes longer to write, and is harder to understand and maintain.

Choosing a programming language is often the most important optimization choice your team will make. Almost universally, compiled languages run faster than JIT-ed languages, which in turn run faster than traditionally interpreted languages. This comes with the same trade-off as most other optimization choices: the closer the code is to the machine, the longer it takes to write and the harder it is to understand and maintain.

This leads me to my main assertion of this post: language performance should be a very minor consideration in language choice. Since the language is one of the first choices made in a project, choosing a language based on its performance is inherently a premature optimization.

Of course, I'm not saying that optimization isn't important. Optimization is merely an area in which it is very important to choose your battles wisely; only optimize when and where you have to. Performance testing and code profiling can help you to know when and where that is.

The Pareto Principle is often applied to optimization with the ratio that 80% of your performance bottlenecks can be removed by optimizing 20% of your code. However, my experience is that the ratio is much more disparate; 99% of your performance bottlenecks can be removed by optimizing 1% of your code.
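Finding that small hot spot is a job for a profiler rather than for intuition. As a minimal sketch, here is how it might look with cProfile and pstats from Python's standard library. The functions slow_sum, fast_part, work, and profile_report are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately slow: a pure-Python loop over n items."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_part(n):
    """Cheap work that is not worth optimizing."""
    return n + 1


def work():
    """Stand-in for a real program: one cheap call, one expensive call."""
    fast_part(10)
    return slow_sum(200_000)


def profile_report(func, limit=5):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


print(profile_report(work))
```

The exact timings vary by machine, but slow_sum should account for nearly all of the time spent in work, which is the only place in this toy program where optimization could pay off.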

Although choosing a language for your project based on performance is a premature optimization, rewriting measured performance bottlenecks in a lower-level language can be a positive optimization. Python, Perl, and Tcl support calls to C or C++ code using SWIG. Java supports calls to lower-level languages through the JNI. Nearly every high-level language has similar tools that allow calls to lower-level interfaces. While choosing a language based on its performance is a premature optimization, choosing a language based on its ability to be optimized in a lower-level language is not premature, it's wise.

Some notes:

Performance scalability is a different issue from performance. One need only look to Twitter's issues with Ruby to see why this is an important distinction. Languages with high performance may have low performance scalability (see DARCS' issues with Haskell) and languages with lower performance may have high scalability (see Google's use of Python).

Lastly, I said that language performance should be a very minor consideration in language choice. However, sometimes high-performance languages are still good choices, not because of their performance, but because they are most expressive for the problem domain. Operating systems are a good example of this; memory management and filesystems require a lot of low-level addressing that is best represented in a systems language like C or C++.

Sunday, June 14, 2009

Problems with Python

As mentioned in my previous post, Python is my favorite language. I believe that Python is the appropriate language in which to solve more problems than any other language. It is portable, fast (compared to similar languages), easy-to-use, flexible, and powerful. It also resists bad coding practices because of its whitespace-delimited syntax.

However, there are serious problems with Python as well. Some of the benefits of Python are also curses.

Python is fast, but not as fast as compiled languages such as C, OCaml or Haskell. In addition, it is not JIT compiled like some other interpreted languages such as Java and C#, so it is a bit slower than these languages also. However, it is worth noting that both the compiled and JIT interpreted/compiled languages tend to be statically typed, losing the benefits of Python's flexible dynamic typing.

Python's dynamic typing can also be a double-edged sword. While it does allow rapid programming, it also doesn't catch as many bugs as a static type system. Proponents of static typing claim that most errors are type errors. It's very notable that in languages like Haskell, most errors are caught at compile time. If a Haskell program compiles, it most likely works as intended. Proponents of dynamic typing counter that if you use contracts and unit tests, type errors will be caught along with other errors that aren't type errors, and that dynamic languages simply allow you to choose the level of test coverage you want instead of having to use types all the time.

I don't know the solution to this debate, but I will note that whenever you use the word "if" before a counterargument about language features, it's generally not a good argument. Arguments that begin with phrases like "if you use contracts and unit tests" rely on good developer practice. Every programmer who has worked on a team knows that such "if"s get thrown out the window for a deadline, or simply not done because the programmer is inexperienced. It's extremely difficult for even great developers to keep a good level of test coverage. If the language doesn't force you to do it, it won't get done consistently.
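To make the "if" concrete, here is a small Python sketch of a type error that hides in a branch the tests never reach. The functions total_price and report are invented for this example.

```python
def total_price(prices):
    """Sum a list of numeric prices."""
    total = 0
    for price in prices:
        total += price
    return total


def report(prices, verbose=False):
    """Return a status string; the verbose branch contains a type bug."""
    if verbose:
        # Bug: str + int. Python only notices when this line actually runs.
        return "Total: " + total_price(prices)
    return "ok"


print(report([1, 2]))  # the common path works, so the bug goes unnoticed

try:
    report([1, 2], verbose=True)
except TypeError as error:
    print("only found at run time:", error)
```

A test suite that never passes verbose=True ships this bug, whereas a statically typed language with a strict enough type system would reject the program before it ran.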

However, I'd like to add that most of the proponents of static typing use languages that don't have strong enough typing for their arguments to be legitimate. Take the following examples:

In C:

printf("Hello, my name is %s.",10);

In C++:

if(i = 1) cout << "i == 1";

In Java:

Object[] items = new String[1]; items[0] = 1; // compiles; throws ArrayStoreException at run time

These simple examples would be caught in more strictly typed languages like OCaml or Haskell. If you write code that relies on the type system to catch errors in C, C++, or Java, you shouldn't have problems, but again that's an argument based on an if.

The next issue I'll point out with Python is its whitespace delimitation. While in general, whitespace delimitation keeps your code and syntax clean, it also presents problems when embedding code within code of another type (e.g. using Django or another web framework) because one might want to put multiple commands on the same line or maintain consistent indentation. It also causes problems when switching editors (as evidenced by the long-running tabs vs. spaces debate).

Lastly, I'll mention lambdas. Python lambda functions can only contain a single expression, as if it were wrapped in a return statement, such as:

lambda x : x + 1

This creates a function which takes one argument, adds one to it, and returns the result. The argument for keeping Python lambdas as they are is that there is nothing a multi-line lambda could do that a named function can't. This is true, but irrelevant. The point of lambdas is to create anonymous functions. Using named functions defeats the purpose; now a name exists that shouldn't and might be used inappropriately.
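A short sketch of where the single-expression limit bites, with names (increment, is_valid_source, make_counter, bump) chosen only for illustration:

```python
# A lambda works when the body is one expression.
increment = lambda x: x + 1
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=lambda word: len(word))


def is_valid_source(source):
    """Return True if source compiles as Python code."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# An assignment is a statement, so it cannot appear in a lambda body.
print(is_valid_source("f = lambda x: x + 1"))      # expression body
print(is_valid_source("f = lambda x: y = x + 1"))  # statement body


def make_counter():
    """Return a function that counts its own calls."""
    count = 0

    def bump():  # needs statements, so it has to be a named def
        nonlocal count
        count += 1
        return count

    return bump
```

The inner function has to be called something (bump here) even though the name is never meant to be used anywhere else.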

People often mention the Global Interpreter Lock and lack of tail-recursion optimization as problems with Python. I don't think they are. Performance tests show that thread-switching without the GIL hurts performance more than having it. This will change eventually as processors go multi-core, and when it does I am sure the GIL will be removed, but right now the GIL is as it should be. Lack of tail-recursion optimization is irrelevant. Python is an interpreted language and many optimizations don't happen. The one case where it might be relevant is in the case of very deep or infinite recursion, where the unoptimized tail recursion overflows the stack. However, in my opinion this should be a criticism of languages that do tail-recursion optimization; it causes nearly identical recursive functions to behave differently after optimization, breaking optimization's most basic rule: an optimization must not change what the program does.
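For reference, this is what the unoptimized case looks like in CPython, which reports the overflow as a RecursionError once the interpreter's recursion limit is exceeded. The functions below are a hypothetical sketch, not taken from any real codebase.

```python
import sys


def count_down(n):
    """Tail-recursive countdown; each call still uses a stack frame."""
    if n == 0:
        return "done"
    return count_down(n - 1)  # tail call, but Python does not optimize it


def count_down_loop(n):
    """The same computation as a loop; stack depth stays constant."""
    while n != 0:
        n -= 1
    return "done"


def overflows(depth):
    """Return True if count_down(depth) exceeds the recursion limit."""
    try:
        count_down(depth)
    except RecursionError:
        return True
    return False


too_deep = sys.getrecursionlimit() + 100
print(overflows(too_deep))        # the recursive version fails
print(count_down_loop(too_deep))  # the loop handles the same depth
```

A language that optimizes tail calls would run count_down at any depth, and an accidental infinite tail recursion would spin forever instead of failing fast.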