Wednesday, June 13, 2012

Unit testing isn't enough. You need static typing too.

When I was working on my research for my Master's degree I promised myself that I would publish my paper online under a free license as soon as I had graduated. Unfortunately there seems to be an unwritten rule of graduate school research: you spend so much time focusing on a single topic of study that by the time you graduate you are sick of it. So more than a year later I'm finally putting my paper online. For those who don't want to read the full paper (it's not terribly long for a research paper at 60 pages, but it's no tweet either) I'll include a shorter summary below. The summary omits some important information, so if you would like to provide constructive or destructive feedback I ask that it be directed towards the full paper and not the quick summary.

For my research I wanted to test the frequently cited claim by proponents of dynamically typed programming languages that static typing is not needed for detecting bugs in programs. The core of this claim is as follows:
  1. Static typing is insufficient for detecting bugs, and so unit testing is required.
  2. Once you have unit testing, static type checking is redundant.
  3. Because static typing rejects some valid programs, static typing is harmful.

Despite the fact that I had heard and read this claim many times I couldn't find any research to back this claim up. So I decided to conduct an experiment to see if in practice unit tests really did obviate static typing for error detection. I also wanted to see if developers frequently use dynamic constructs that can't be expressed in a statically typed programming language.

My experiment would consist of finding examples of open source, unit tested programs written in a dynamically typed programming language and manually translating them into a statically typed programming language. I would then quantify how many (if any) defects were detected by the type checker, and how many dynamic constructs couldn't be directly expressed due to being rejected by the static type checker. I should emphasize that for this experiment I would *not* be simply rewriting the program, but doing a direct line by line translation from one programming language to another. I would not count defects that were not detected by the type checker, nor any defects that could not be reproduced in the original program.

Before starting the experiment I needed to choose a dynamically typed programming language that I would translate programs from. I also needed to choose a statically typed programming language that I would translate those programs to. The criteria for the dynamically typed programming language were as follows:
  • The language should be dynamically typed
  • The language should have support for and a culture of unit testing
  • The language should have a large corpus of open source software for studying
  • The language should be well known and considered a good language among dynamic typing proponents
With these criteria in mind I selected Python. The next step was to choose the statically typed programming language. For this selection I used the following criteria:
  • The language should be statically typed
  • The language should execute on the same platform as Python
  • The language should be strongly typed
  • The language should be considered a good language among static typing proponents
I selected Haskell for the statically typed programming language.

The next step was to choose some unit tested programs to translate from Python into Haskell. I randomly picked four projects, The Python NMEA Toolkit, MIDIUtil, GrapeFruit and PyFontInfo, from the https://code.google.com/ and https://bitbucket.org source code hosting sites.

The Python NMEA Toolkit

The translation of the Python NMEA Toolkit from Python to Haskell led to the discovery of nine type errors. Three of them could be triggered by malformed input and the other six by incorrect usage of the API. Only one of the type errors would have been guaranteed to be discovered had full unit test coverage been employed. Additionally there was one run time error that could be eliminated once static typing was applied. Two unit tests could have been eliminated as their only function was to perform type checking. No dynamic constructs were used that could not be directly translated into Haskell.
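To make "a type error triggered by malformed input" concrete, here is a minimal sketch in Python. It is not code from the Python NMEA Toolkit; the functions parse_latitude and offset_north are hypothetical and exist only for illustration. A unit test that feeds in a well-formed field passes and even covers every line of offset_north, yet an empty field still produces a TypeError at run time.

```python
from typing import Optional


def parse_latitude(field: str) -> Optional[float]:
    """Hypothetical parser for a 'ddmm.mmm' field; returns None if the field is empty."""
    if not field:
        return None
    degrees = float(field[:2])
    minutes = float(field[2:])
    return degrees + minutes / 60.0


def offset_north(field: str, delta: float) -> float:
    # Bug: the None case is never handled. A well-formed field works,
    # but an empty field makes this line evaluate None + float.
    return parse_latitude(field) + delta
```

In a language like Haskell the parser's result would typically be a Maybe value, and the compiler would refuse to add it to a number until the missing case was handled. An optional static checker for Python annotations would be expected to report the same line.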

MIDIUtil

The translation of MIDIUtil led to the discovery of two type errors. Only one of the type errors would certainly have been caught had full unit test coverage been employed. An additional run time error could also be eliminated by static typing. None of the unit tests only tested for type safety and so none of them could be eliminated. The MIDIUtil code did use struct.pack and struct.unpack, which could not be directly translated as they both rely on format strings that determine the types of the arguments and return values. However, in all cases the format strings were hard-coded, so the Haskell version could use hard-coded functions in place of the hard-coded format strings with no loss in expressiveness. Had the MIDIUtil code stored these format strings in external configuration files, the program would likely have required a redesign to express it in a statically typed language.
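The following sketch shows why format strings resist direct translation and what a "hard-coded function" replacement can look like. It is an illustration written for this summary, not an excerpt from MIDIUtil or from the Haskell translation; the helper names pack_u32_be and unpack_u32_be are hypothetical, and the header values are only sample data.

```python
import struct

# The format string alone decides how many values are packed and what
# their types are: here a 4-byte string, one unsigned 32-bit integer and
# three unsigned 16-bit integers, all big-endian.
HEADER_FORMAT = ">4sIHHH"

header = struct.pack(HEADER_FORMAT, b"MThd", 6, 1, 2, 960)
fields = struct.unpack(HEADER_FORMAT, header)  # a tuple whose shape depends on the string


# When the format string is a constant, each use can be replaced by a
# function with one fixed signature, which a static type checker can verify.
def pack_u32_be(value: int) -> bytes:
    """Hypothetical helper: encode an unsigned 32-bit big-endian integer."""
    return value.to_bytes(4, "big")


def unpack_u32_be(data: bytes) -> int:
    """Hypothetical helper: decode an unsigned 32-bit big-endian integer."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    return int.from_bytes(data, "big")
```

If HEADER_FORMAT were read from a configuration file instead, the types of fields could not be known until run time, which is the case that would force a redesign.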

GrapeFruit

The translation of GrapeFruit to Haskell did not result in the discovery of any type errors. A single run time error could be eliminated by static typing. Additionally a single unit test could have been eliminated that only tested for type safety. No dynamic constructs were used that could not be directly translated into Haskell.
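For readers wondering what a unit test that "only tested for type safety" looks like, here is a small hypothetical example. It is not taken from GrapeFruit; the scale function and the test class are invented for illustration. The test checks nothing about the function's behaviour on valid input, only that a wrongly typed argument is rejected, which is exactly what a static type checker does before the program runs.

```python
import unittest


def scale(value, factor):
    """Hypothetical helper that validates its argument types by hand."""
    if not isinstance(value, (int, float)) or not isinstance(factor, (int, float)):
        raise TypeError("scale() expects numbers")
    return value * factor


class ScaleTypeTest(unittest.TestCase):
    def test_rejects_string(self):
        # The whole purpose of this test is type checking: passing a
        # string where a number is expected must raise TypeError.
        with self.assertRaises(TypeError):
            scale("0.5", 2)
```

In a statically typed translation both the isinstance checks and the test become unnecessary, because a call such as scale("0.5", 2) would not compile.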

PyFontInfo

The translation of PyFontInfo resulted in the discovery of six type errors. Two run time errors could be eliminated by static typing. A single unit test could have been eliminated. The PyFontInfo code also used struct.pack and struct.unpack, which cannot be directly translated, but a simple workaround exists.

Results

The translation of these projects revealed that all of these projects could have been written in a statically typed programming language with only minor code changes. Furthermore, unit testing did not seem to be an adequate replacement for static type checking. A total of seventeen type errors were discovered. All of the type errors that were discovered were the result of bugs in the original Python code that were not discovered by the unit tests. Many of the bugs existed in code that did have unit test coverage.

Conclusion

The results of this experiment indicate that unit testing is not an adequate replacement for static typing for defect detection. While unit testing does catch many errors, it is difficult to construct unit tests that will detect the kinds of defects that would be programmatically detected by static typing. The application of static type checking to many programs written in dynamically typed programming languages would catch many defects that were not detected with unit testing, and would not require significant redesign of the programs.

Future Work

The translation of these four projects does provide an interesting data point on the effectiveness of unit testing for defect detection. I hope that others will try to conduct similar experiments on more samples of dynamically typed programs.


The full length paper is located here.
The original Python code and the Haskell translation are here.