Wednesday, June 13, 2012

Unit testing isn't enough. You need static typing too.

When I was working on my research for my Master's degree I promised myself that I would publish my paper online under a free license as soon as I had graduated. Unfortunately there seems to be an unwritten rule of graduate school research: you spend so much time focusing on a single topic of study that by the time you graduate you are sick of it. So more than a year later I'm finally putting my paper online. For those that don't want to read the full paper (it's not terribly long for a research paper at 60 pages, but it's no tweet either) I'll include a shorter summary below. The summary will omit some important information, so if you would like to provide constructive or destructive feedback I ask that the feedback be directed towards the full paper and not the quick summary.

For my research I wanted to test the frequently cited claim by proponents of dynamically typed programming languages that static typing is not needed for detecting bugs in programs. The core of this claim is as follows:
  1. Static typing is insufficient for detecting bugs, and so unit testing is required.
  2. Once you have unit testing static type checking is redundant.
  3. Because static typing rejects some valid programs static typing is harmful.

Despite the fact that I had heard and read this claim many times I couldn't find any research to back this claim up. So I decided to conduct an experiment to see if in practice unit tests really did obviate static typing for error detection. I also wanted to see if developers frequently use dynamic constructs that can't be expressed in a statically typed programming language.

My experiment would consist of finding examples of open source, unit tested programs written in a dynamically typed programming language and manually translating them into a statically typed programming language. I would then quantify how many (if any) defects were detected by the type checker, and how many dynamic constructs couldn't be directly expressed due to being rejected by the static type checker. I should emphasize that for this experiment I would *not* be simply rewriting the program, but doing a direct line by line translation from one programming language to another. I would not count defects that were not detected by the type checker, nor any defects that could not be reproduced in the original program.

Before starting the experiment I needed to choose a dynamically typed programming language that I would translate programs from. I also needed to choose a statically typed programming language that I would translate those programs to. The criteria for the dynamically typed programming language were as follows:
  • The language should be dynamically typed
  • The language should have support for and a culture of unit testing
  • The language should have a large corpus of open source software for studying
  • The language should be well known and considered a good language among dynamic typing proponents
With these criteria in mind I selected Python. The next step was to choose the statically typed programming language. For this selection I used the following criteria:
  • The language should be statically typed
  • The language should execute on the same platform as Python
  • The language should be strongly typed
  • The language should be considered a good language among static typing proponents
I selected Haskell for the statically typed programming language.

The next step was to choose some unit tested programs to translate from Python into Haskell. I randomly picked four projects, The Python NMEA Toolkit, MIDIUtil, GrapeFruit and PyFontInfo, from the https://code.google.com/ and https://bitbucket.org source code hosting sites.

The Python NMEA Toolkit

The translation of the Python NMEA Toolkit from Python to Haskell led to the discovery of nine type errors. Three of them could be triggered by malformed input and the other six by an incorrect usage of the API. Only one of the type errors would have been guaranteed to be discovered had full unit test coverage been employed. Additionally there was one run time error that could be eliminated once static typing was applied. Two unit tests could have been eliminated as their only function was to perform type checking. No dynamic constructs were used that could not be directly translated into Haskell.
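To make "a type error triggered by malformed input" concrete, here is a minimal sketch. It is not code from the NMEA Toolkit: the sentence layout, `STATUS_NAMES`, `parse_status`, `describe_unchecked` and `describe_checked` are all hypothetical names invented for illustration.

```python
from typing import Optional

# Hypothetical lookup table, indexed by a numeric field of the input.
STATUS_NAMES = ["invalid", "ok", "degraded"]


def parse_status(field: str) -> Optional[int]:
    """Return the status code as an int, or None when the field is empty."""
    return int(field) if field else None


def describe_unchecked(sentence: str) -> str:
    # Works for well-formed input, but an empty field makes `status` None,
    # and indexing a list with None raises an unintended TypeError.
    status = parse_status(sentence.split(",")[1])
    return STATUS_NAMES[status]


def describe_checked(sentence: str) -> str:
    # Handling the None case explicitly turns the accidental TypeError
    # into a deliberate "malformed input" error.
    status = parse_status(sentence.split(",")[1])
    if status is None or not 0 <= status < len(STATUS_NAMES):
        raise ValueError(f"malformed sentence: {sentence!r}")
    return STATUS_NAMES[status]
```

A test that only feeds `describe_unchecked` well-formed sentences executes every line and still never sees the failure. Given the `Optional[int]` annotation, a static checker would be expected to flag the unchecked index, which is roughly the kind of report the Haskell type checker produced during translation.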

MIDIUtil

The translation of MIDIUtil led to the discovery of two type errors. Only one of the type errors would certainly have been caught had full unit test coverage been employed. An additional run time error could also be eliminated by static typing. None of the unit tests only tested for type safety and so none of them could be eliminated. The MIDIUtil code did use struct.pack and struct.unpack, which could not be directly translated as they both rely on format strings that determine the type of arguments and return values. However, in all cases the format strings were hard-coded, so the Haskell version could use hard-coded functions instead of the hard-coded format strings with no loss in expressiveness. Had the MIDIUtil code stored these format strings in external configuration files then the program would likely have required a redesign to express it in a statically typed language.
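The idea of swapping a hard-coded format string for a hard-coded function can be sketched in Python itself. The format string `">HI"` and the names `pack_u16_u32` and `unpack_u16_u32` are assumptions chosen for illustration; they are not taken from MIDIUtil.

```python
import struct
from typing import Tuple


def pack_u16_u32(track: int, length: int) -> bytes:
    """Fixed-signature stand-in for struct.pack(">HI", track, length).

    Because the format string is a constant, the number and types of the
    arguments are fixed too, so an ordinary function can replace it.
    """
    return struct.pack(">HI", track, length)


def unpack_u16_u32(data: bytes) -> Tuple[int, int]:
    """Fixed-signature stand-in for struct.unpack(">HI", data)."""
    track, length = struct.unpack(">HI", data)
    return track, length
```

For example, `pack_u16_u32(1, 6)` yields the six bytes `b"\x00\x01\x00\x00\x00\x06"`, and `unpack_u16_u32` reverses it. If the format string came from a configuration file instead, no single signature could describe every call, which is the redesign case mentioned above.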

GrapeFruit

The translation of GrapeFruit to Haskell did not result in the discovery of any type errors. A single run time error could be eliminated by static typing. Additionally a single unit test could have been eliminated that only tested for type safety. No dynamic constructs were used that could not be directly translated into Haskell.
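As an illustration of a unit test whose only function is type checking, consider the sketch below. The helper `rgb_to_grey` and both tests are hypothetical and are not taken from GrapeFruit.

```python
import unittest


def rgb_to_grey(r: float, g: float, b: float) -> float:
    """Hypothetical colour helper, used only to illustrate the point."""
    return 0.299 * r + 0.587 * g + 0.114 * b


class GreyTests(unittest.TestCase):
    def test_returns_float(self):
        # Checks nothing but the type of the result. With a static type
        # checker enforcing the `-> float` annotation this test is redundant.
        self.assertIsInstance(rgb_to_grey(1.0, 1.0, 1.0), float)

    def test_white_is_one(self):
        # Checks behaviour, so it stays useful with or without static typing.
        self.assertAlmostEqual(rgb_to_grey(1.0, 1.0, 1.0), 1.0)
```

Only the first kind of test was counted as one that could be eliminated; tests of behaviour are untouched by the translation.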

PyFontInfo

The translation of PyFontInfo resulted in the discovery of six type errors. Two run time errors could be eliminated by static typing. A single unit test could have been eliminated. The PyFontInfo code also used struct.pack and struct.unpack, which cannot be directly translated, but a simple workaround exists.

Results

The translation of these projects revealed that all of these projects could have been written in a statically typed programming language with only minor code changes. Furthermore, unit testing did not seem to be an adequate replacement for static type checking. A total of seventeen type errors were discovered. All of the type errors that were discovered were the result of bugs in the original Python code that were not discovered by the unit tests. Many of the bugs existed in code that did have unit test coverage.

Conclusion

The results of this experiment indicate that unit testing is not an adequate replacement for static typing for defect detection. While unit testing does catch many errors, it is difficult to construct unit tests that will detect the kinds of defects that would be programmatically detected by static typing. The application of static type checking to many programs written in dynamically typed programming languages would catch many defects that were not detected with unit testing, and would not require significant redesign of the programs.

Future Work

The translation of these four projects does provide an interesting data point on the effectiveness of unit testing for defect detection. I hope that others will try to conduct similar experiments on more samples of dynamically typed programs.


The full length paper is located here.
The original Python code and the Haskell translation are here.

124 comments:

  1. Sadly, if the code was trivial to rewrite in another language, then the code is trivial, and not likely to have significant logic bugs.

    In all the years I've spent as a maintenance coder 99% of all the bugs I've found and fixed were logic bugs that had nothing to do with type errors. And there is no compiler on earth that could have found this type of logic error.

    My problem with most discussions about static typing isn't in the utility of having static typing, it's that focusing on a subset of all bugs that is so small leads people to believe that the problem is larger than it really is. Let me say that again: thinking about static type bugs makes static type bugs seem more important than they really are.

    What's worse, is that programmers in static typed languages tend to get the most feedback about these types of bugs, making them seem even more prevalent than the bugs that their compiler can't detect.

    It's like the beginning programmer who struggles just to get their syntax right, and they think that the only bugs that exist are stray semicolons, or mismatched quotes.

    Replies
    1. Thanks for the feedback; you bring up several good points. I'll try to address them here. If I don't, please re-comment.

      I don't think I ever claimed that it was trivial to rewrite the code in another language. I spent a couple of months, several hours a day, doing the translation. I've been coding for 10+ years and I wouldn't consider it a trivial exercise. During the translation I did encounter many logic errors. I didn't talk about them in my paper because they were outside the scope of the experiment.

      With your years of work as a maintenance coder were you working with languages that had type systems as powerful as Haskell's? If you were working with C++/C/Java then I assert it would be very difficult to know if the compiler really could catch the logic error. For example Haskell's type system prevents errors such as Java's NullPointerException or C/C++ dereferencing of NULLs. I'm not trying to attack you but when you first start working with a type system as powerful as Haskell's it just seems magic, it caught errors in the Python program that I didn't think were possible to catch with static typing.

      I discuss in my paper that one of the downsides of static typing is that it can catch "bugs" that would never cause adverse runtime behavior. For my experiment I don't believe I ran into any of these. All of the bugs that were detected by static typing could be exploited in the Python code to make the program output incorrect data or throw an uncaught exception. Because of this I assert that these bugs were in fact *very* important. A program that returns invalid results or crashes due to a type error is just as broken as one that returns invalid results or crashes due to a logic error.

      I think that a solution to only getting feedback about errors that compilers can't detect would be unit and other forms of testing. I'm certainly not arguing that static typing would replace unit testing. I am arguing that unit testing doesn't replace static typing.

      Please let me know if I didn't properly address any of your concerns.

    2. "All of the bugs that were detected by static typing could be exploited in the Python code to make the program output incorrect data or throw an uncaught exception. Because of this I assert that these bugs were in fact *very* important."

      Perhaps this is not true for the software you had a look at, but for many classes of software, bugs that never manifest themselves when used by a user don't really matter. They would certainly not be *very* important. With Python you can have half-finished code up and running and address the bugs as they occur.

      There are of course also downsides to this approach. :)

      I think it's interesting but not very surprising that dynamic features are only little used. Code is after all relatively static. The big gain is the ability to leave out the types and still have everything work as long as it is quacking in the right way (compared to something like C++ or Java or C#).

    3. In a sufficiently powerful language (Haskell, C++, Agda, Scala, ...) you can express way more invariants in the type system, and prove that your code satisfies those invariants using the type system.

      Thus the class of information that types verify becomes the class of all deterministic provable properties of your code, not just "an int was used where an int was expected." You could for example prove that your code satisfies the red black tree invariant.

      Making your types this expressive is equivalent to writing formalized unit tests for every property, such that if the unit test passes, you have a guarantee that the invariant is satisfied in a large set of cases. Unit tests can then be written for nondeterministic properties like speed or stress.

    4. Bugs that a typical user might not encounter but that can be triggered by an attacker are important. These can easily lead to serious exploits.

    5. Could you elaborate on why you consider throwing an uncaught type error on malformed input a bug?
      What else should the program do?

    6. The program should tell me that the input was malformed. Otherwise it is hard to see whether the bug is triggered by a user error (i.e. malformed input) or an error in the program logic.

    7. Ole, you are correct that static typing can report errors that would never manifest at runtime. I don't doubt that programs exist that could not be directly translated. My experiment suggests that these types of programs may not be common, but my sample size is too small to know for sure.

    8. Andreas, great question. My description wasn't clear. I definitely would *not* consider an exception that was explicitly raised by a program a bug. I do however consider it a bug if input to a program causes an unintended exception to be raised. For example if the input caused the program to index into a Python list with a non integer value (which caused Python to raise a TypeError).

    9. OP Anonymous: Before asserting such things as ‘there is no compiler on Earth that could have found this type of logic error’, you should really familiarize yourself with recent developments in statically-typed languages. Languages such as ATS or Agda have Turing-complete (modulo general recursion) type-systems, and can express almost any provable constraint as a type.

  2. My dissertation on Typed Racket [1] contains somewhat more data, with similar conclusions. I found, like you, that failure to handle error cases is a big source of type mistakes.

    [1] http://www.ccs.neu.edu/racket/pubs/dissertation-tobin-hochstadt.pdf

    Replies
    1. That's great to hear. I took a couple Programming Language classes from Matthew Flatt who instilled in me an interest in programming languages.

  3. How did you gauge the quality of the unit test suites of the projects you translated?

    Replies
    1. I echo this question. I took a quick look at the NMEA toolkit, and found its unit tests to be extremely simplistic, and tiny compared to the amount of code being tested. On the other hand, the MIDI toolkit had much more test code. You found 9 errors in the former, and only two in the latter, even though the latter is considerably more code.

      I strongly suspect a correlation between the quality of the tests and the number of errors you found. In fact this correlation may well be stronger than the correlation you claim in this article.

    2. I did not do anything to gauge the quality of the unit tests. My interest was in whether unit testing obviated static typing *in practice*. Obviously in theory unit testing can obviate static typing (you could encode a static type checker as a unit test), but if no one does this then in practice it really doesn't matter. The standard way of gauging the quality of unit tests seems to be code coverage. I should point out that static typing was able to find bugs in code that was exercised by unit tests, so code coverage doesn't guarantee a lack of type errors. I do agree that more and better unit tests would help in finding type errors, but in order to find all of the type errors that static typing would find you have to test every possible execution path. For one of the projects I counted the number of possible execution paths for creating a simple object and calling one of its methods, and it would require more than 40 unit tests. It would certainly be interesting to come up with a metric for unit test quality and see how well those tests do at detecting type errors.

    3. >> For one of the projects I counted the number of possible execution paths for creating a simple object and calling one of its methods, and it would require more than 40 unit tests.

      This tells us that that "simple" object is very badly designed. The main purpose of tests is to give huge design feedback, not to catch errors.

      And I agree with the point that static typing catches primitive errors only, not logic errors.

    4. Vasily, you may be right that the simple object was poorly designed, but the point stands that writing tests that have full code coverage is relatively easy. Writing tests that execute every possible execution path is intractable for any non-trivial program no matter how well it was designed.
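The gap between coverage and execution paths can be sketched with a small, hypothetical function (`adjust` is invented for this illustration and does not come from any of the translated projects). Two independent `if` statements give four paths, yet two tests are enough to execute every line and take every branch in both directions.

```python
def adjust(raw, parse, increment):
    """Hypothetical helper with two independent branches (four paths)."""
    value = raw
    if parse:
        value = int(raw)      # value becomes an int
    if increment:
        value = value + 1     # fails if value is still a str
    return value


# These two tests execute every line and both outcomes of each branch.
assert adjust("4", True, True) == 5
assert adjust("4", False, False) == "4"

# The untested path (parse=False, increment=True) evaluates "4" + 1,
# which raises a TypeError despite the full coverage above.
```

With n independent branches the number of paths grows as 2 to the power n, which is why covering every path quickly becomes impractical while covering every line stays easy.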

    5. I don't get this. Could you explain using code? If the code gets full coverage, all execution paths are covered.

  4. Thanks for an excellent summary and for attacking such an interesting open question! I'll have to read the source at length.

  5. I don't want to dismiss your work or your sharing of your results, but I'd wonder if you've actually tested your hypothesis.

    Haven't you instead shown that bugs not detected in one way can sometimes be caught in other ways?

    The programs you chose were already developed: after the event it's hard to know, but I'd suggest the unit tests have previously caught a large number of bugs during development - how many of these would have been caught by static typing, and how many would have slipped past the type checker?

    All code, in practice, has bugs... the more *ways* you try to find them (unit tests, inspection, static analysis), the more you find, so testing any project in "a new way" is likely to expose some bugs....

    Replies
    1. The theory he was attempting to disprove was that unit tests mooted static typing. No one argues that static typing (even Haskell grade typing) moots testing.

      Perhaps it's a trivial theory that he's going after, as you say— many people already hold that you find more bugs the more ways you test. But there really are people who have argued that unit tests completely moot static typing, and I think he did demonstrate that this is probably not so.

    2. Tim, the reply by Anonymous is correct. I wasn't trying to show that unit tests were not valuable, but that they don't obviate static typing.

    3. If he took the theory as absolute, then sure, but surely there was an "in practice" implication (or it's pretty much a straw man argument).

      If the unit tests had previously found and removed 10 bugs, of which the static type checks would have found 8, and they also uncovered another 1 or 2 that unit tests didn't find, then sure, "in practice" unit testing does not moot static typing.

      But if the unit tests found & fixed 1,000 bugs, of which the static type checks found only 500, and again "another 1 or 2", then I'd say that "unit tests moot static typing" is, in practice, proven.

      Hence reporting "static type checks found 1 or 2 extra bugs" is not sufficient to draw a conclusion as you don't have the other numbers to give context.

      And of course, I'd lay bets that there are still bugs in the code that neither the existing unit tests nor the static type checks found.

      It's still an interesting study & exercise, but I'd dispute the conclusion drawn (altho I am a fan of type checking and particularly the type inference model in Haskell and other FP languages).

    4. Tim, it sounds like we're trying to find out how to measure whether unit testing obviates static typing. Because the argument that I was testing seemed to be an absolute argument (that static typing provided no benefits in the face of unit testing) I felt that a single missed bug (that had unit test coverage) would disprove the assertion. This doesn't mean that no one should ever use a dynamically typed programming language, it just means that there is a tradeoff. If you are writing software for the Mars rovers perhaps finding one or two more bugs out of 1000 is worth it; for your family blog perhaps not.

  6. Types and tests are one and the same. Types provide universal quantification: a property is satisfied across a whole program. A test provides existential quantification: a certain property holds in a certain situation.

    Tests are weaker than types, but can be used to test a much broader class of problems.

    Replies
    1. Types and tests are qualitatively different. A strong static type system can identify the absence of certain kinds of unwanted behaviors regardless of whether or not the developer is thinking about them, whereas tests tend to make assertions about those things a developer remembers to test.

      In my years of testing and teaching people about testing, I've seen that people who rely on testing tend to make positive assertions about their code (if X then Y) and often overlook error conditions, boundary conditions, unexpected input, and so on.

      See also What to know before debating type systems.

    2. Type systems can't check for error conditions, boundary conditions, unexpected input, etc. though, can they?

  7. This is an interesting paper, but I cannot really understand how you state that full test coverage (by which I assume C4) would not have detected these errors, especially from a cursory look at the first two examples, e.g. the _parse_GSV and get_velocity examples.

    Replies
    1. Because 100% test coverage can miss plenty of bugs. Testing and static typing are qualitatively different.

    2. Here is some *very* simplistic Python code that shows why full code coverage may not catch all type errors.

      def foo(val):
          if val > 5:
              return 10
          else:
              return None

      # Full unit test code coverage of foo
      assert foo(6) == 10
      assert foo(1) is None


      def bar(val):
          return foo(val) * 100

      # Full unit test code coverage of bar
      assert bar(6) == 1000


      Now this is obviously a contrived example: we do have full code coverage, but if anyone ever calls bar(1) the program will raise a TypeError. A Haskell translation of this code would not pass the type checker.
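For readers who want to see the same point without leaving Python, here is a sketch of the foo/bar example with type annotations added. It is an illustration under assumptions, not the Haskell translation: the `ValueError` handling in `bar` is a hypothetical choice, and the claim about the checker assumes a tool such as mypy.

```python
from typing import Optional


def foo(val: int) -> Optional[int]:
    """Same behaviour as foo above; the annotation records that None is possible."""
    return 10 if val > 5 else None


def bar(val: int) -> int:
    result = foo(val)
    # Writing `return foo(val) * 100` here would be expected to fail a
    # static check, because Optional[int] may be None. Handling the None
    # case explicitly is what satisfies the checker.
    if result is None:
        raise ValueError("foo returned no value")  # hypothetical handling
    return result * 100
```

The annotation plays the role that a Maybe return type plays in Haskell: the caller cannot use the result as a number until the missing-value case has been dealt with.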

    3. You seem to assume that testing is used to achieve code coverage only. Isn't this too simplistic? You would have more test cases if you were testing the expected behavior.

    4. Evan -- Your example is interesting, but all you're demonstrating is that "full coverage" is not the same thing as well-tested. If all you want from a test suite is for every line of code to be executed at least once, then you're writing some pretty weak unit tests.

      A test suite with genuine full coverage should look not just at the percent of code executed during the tests, but also at the possible combinations of inputs into the program.

      The example you give is a great illustration of this, but I think you're drawing the wrong conclusions. Foo can return either an int or None; therefore, any test suite for bar needs to include, at minimum, a test where foo returns an int, and a test where foo returns None. As it is now, the program may be "fully" tested, but it is not tested well.

      Finally, a good practice in dynamic languages is to write functions in such a way that they do not change their return type unexpectedly. If we do this conventionally, rather than having it be enforced by the language, we get a bit more flexibility in situations where, well, you need to return None.

    5. Lionel, I agree that full code coverage is not the same as well-tested, but I believe that full code coverage is a common metric for the quality of tests. I agree that it's insufficient. I also agree that the test for bar needs to include both cases. However, I assert that for large real world programs it's not as obvious what needs to be tested. As I mentioned in another comment, a simple class with a single method would require 40+ unit tests to fully test it. At some point having unit tests that fully compensate for static typing is simply not tractable. I also agree that it's better to write functions such that they don't change the return type unexpectedly, but if you are doing that you may as well use static typing to enforce the convention. My experiment also shows that it's tractable to use static typing for those cases where you need a bit more flexibility and want the function to return different types (by using either Maybe or Either types in Haskell).

    6. If by "full coverage", you mean all-paths coverage, I'd say that only a trivially small set of people use it as a metric for the quality of tests. Even in academic testing research, the action (when I was involved) was in finding subsets of all-paths coverage that were "good enough".

      In practice, all talk of paths is a huge red herring, given that numerous studies have shown that a big proportion (sometimes over 50%) of the bugs in fielded systems are what Robert Glass called "code not complicated enough for the problem" and others call "faults of omission". You can call them requirements bugs or design bugs or whatever, but they manifest (often) as paths that aren't present in the code but should be. Coverage of existing code has nothing to say about those faults (except in the negative sense that tests which poorly cover code paths probably also poorly cover bugs).

      More interesting than path-based testing was error-based testing, notably mutation testing, but that too got bogged down in formalism, where the point was to do tractable program transformations rather than look at actual bugs.

      I daresay nobody with any credibility says unit testing can catch *all* bugs type-checking can. It's a matter of where you think it best to spend your time.

    7. Oh, I neglected to give cites for the "numerous studies" because most of them are in a tech report that I wrote so long ago there's no softcopy version. But there are a few at the bottom of this page:

      http://www.exampler.com/testing-com/writings/omissions.html

    8. Grr. See also the body of the text for results from Microsoft (30% faults of omission) and a survey I did of open source software (47-70%, depending how you count).

  8. Replies
    1. Hm...He makes a lot of interesting observations and had some evidence to back up his claims. But you do make a convincing argument as well.

    2. Wow, you're right. I didn't think about that. Please contact my alma mater and ask them to revoke my Master's degree. I'm obviously unqualified.

    3. You lose credibility with responses like this.

    4. Jellystone, who are you talking to, Anonymous or me?

  9. I read your post, and your conclusion has, I think, one important point, at least for me:

    "The translation of these projects revealed that all of these projects could have been written in a statically typed programming language with only minor code changes."

    That's a bold statement. I would assume that one uses dynamic typing to avoid managing all the types and to concentrate instead on getting the job done.

    So I expect to see a significantly smaller code size when using dynamic typing than static typing, thus making the code easier to understand, maintain...

    That's why I'm very interested in the Lisp family of languages and why I think [Lisp] macros are so important.

    Your finding intrigues me. I would hope that the dynamic version would come with at least 2 times fewer lines of code than the statically typed version. Isn't that the case?

    And if not, do you have an idea why?

    Replies
    1. Code size isn't the only metric when it comes to maintaining software.

      Also, in dynamically typed languages, every property access can result in a crash if the user of the function didn't pass the right type.

      Then you need to do type checking at runtime (adding lines of code) or give that responsibility to the caller of the function.

      Static typing solves that by delegating this task to the compiler, which is much better at doing this kind of stuff.

      Also, static typing isn't directly correlated to bigger programs.

    2. Static typing does not correlate to verbosity.

      Type inference and powerful types mean that you don't have to explicitly maintain all those types.

      In fact, type inference can sometimes help you write less code than dynamically typed code, for example when return-type polymorphism is used.

      For example:

      drawCircle . parse =<< readFile filename

      The "parse" here knows that it needs to parse the content of the file using the "Circle" type's parser, because that is the return type expected as input by drawCircle. In a dynamically-typed language, you would need to explicitly specify which parse function you want to use.

      There are more interesting examples than that, but they may require going into more background to explain.

    3. In Haskell static typing does not correlate to verbosity. Elsewhere it often does. Hence the quote from Alan Kay, who created Smalltalk, "I'm not against types, but I don't know of any type systems that aren't a complete pain, so I still like dynamic typing." (Stated in 2003, not back when he created the language.)

    4. Great question. I too love Lisp and Lisp's macros. My experiment does not provide any data on whether static typing bloats code size. Because my focus was on a direct translation I wanted to make sure that my translation was auditable by other researchers. For that reason every line of Python code has a matching line or lines of Haskell code. If there was a way to write a handful of Python lines in fewer lines of Haskell I didn't use it because it would hinder future audits. If a single line of Python required more than one line of Haskell then you would end up with more Haskell lines. So the rules I followed when translating guaranteed that the Haskell code would be at least as long as the Python code and it would almost certainly be longer (which it was). In order to test code verbosity one would not want to just translate the code but rewrite it in more idiomatic Haskell.

  10. Superb research, thank you for your excellent hard work. The static type / unit test distinction is going to get very interesting with languages like Dart that have optional type safety.

    Replies
    1. Dart has optional type annotations for tools such as linters. I don't think Dart authors intend for type safety to ever be provided.

  11. I started reading the original document, and an important question jumped out.

    Have you fixed the sprinklers yet?

    Replies
    1. Yes I have! I'm also building a fence. I'm a much better husband now that I'm not a full time employee and a part time student. Great question ;)

  12. You miss another strong point, and that is how programmers get paid. Unfortunately for the industry we don't get paid to make things bug free, we get paid to make things work. It's a constant battle of features vs. bugs. To say the least, shipping code wins.

    Not only does this approach skip out on logic bugs, but it misses something larger, it misses design bugs, or UX bugs.

    A lot of what code is, is a way to sketch out ideas and see if they work. Once something has passed that very bar then it starts moving up the pipe and collects patterns that encourage consistency.

    As team size grows so does the number of programming styles. I see unit testing and static typing as another component of keeping code consistent between programmers.

    Static typing also works best when you have deep object graphs, which in the web world pushes you towards patterns that are very bad for web scale applications. Ex. you probably don't want a table for both apples and oranges, in fact you probably don't want a table for either, you probably want a table called entities.

    I'd much rather the programming world focus on putting a weight on function cost and showing in real time to a developer what their code is costing. In JavaScript, I am always finding people doing some sort of blocking DOM operation on a user event, when they could have just as easily performed that event on a separate pseudo thread. It's stuff like that that eats at your time. Fixing a typed error takes like 2 seconds in debugger with a unit test.

    Replies
    1. "Fixing a typed error takes like 2 seconds in debugger with a unit test."

      This completely ignores all the mental relief from not having to worry about these types of concerns. When you can be confident your code integrates with the rest of the system correctly, you are free to only be concerned with its correctness. When not, you must spend an unknown amount of worry, concern, testing, and so forth, on this problem.

  13. Rewriting a program from a dynamically typed language into a statically typed language causes the type checker to report type-related bugs.
    Well - sure.
    Keep the type model in mind when writing the code.

    Garbage in - garbage out

    Replies
    1. All of the bugs that were found by the Haskell static type checker existed in the original Python code. If I couldn't exploit the bug (cause a crash or malformed output) in the Python I wouldn't have counted it. So yes, garbage in - garbage out, but the garbage existed in the Python code ;)

  14. If rewriting from Java to C causes memory management bugs, will that prove the uselessness of garbage collectors?

    Replies
    1. A type error is not the introduction of a new bug, it is the early detection of a bug. Bug being: int where list was expected, which would fail on both systems. In your case, there would only be bugs on the C side, which wouldn't be caught without unit tests outside the scope of the C language.

      The assertion that dynamic languages can do something that can not be done statically so frequently is similar to implying that static languages can use an int where a list is expected. (say for concatenation) They can not.
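
      As a minimal Python sketch of that point (the function names are hypothetical and not taken from the projects in the study): passing an int where a list is expected is an error in Python too, but it only surfaces when the faulty line actually runs.

```python
def total_length(items, extra):
    """Concatenate two lists and return the combined length."""
    return len(items + extra)


def report(items, verbose=False):
    if verbose:
        # Latent bug: an int is passed where a list is expected.
        return total_length(items, 1)
    return total_length(items, [])
```

      Calling report([1, 2]) works, while report([1, 2], verbose=True) raises a TypeError at run time. A static type checker would typically reject the second call before the program ever runs, which is the "early detection" described above.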

    2. Invalid analogy. Adding memory management operations to a GC'd program can CAUSE bugs. Type information doesn't CAUSE bugs, it REVEALS latent bugs.

  15. I think that this is a fantastic research paper. Thank you for sharing it with us on Hacker News. I am greatly enjoying the discussion that has ensued around your conclusion.

    I'm currently in the process of learning Haskell, and reading your findings has increased my desire to become fluent in this language.

    I only have a few, short, comments about your research.

    I read a few comments on Hacker News that stated that your research paper has a small sample size. I don't think that is the case. I think that the sample size should be counted as the lines of code and not the number of projects. From that perspective, the sample size is quite significant.

    I've also read a few comments that question the importance of the bugs found. I agree with your own comment on this issue - any bug with the potential to crash an application is very important.

    The only thing in your paper that could've possibly been expanded on was the nature of static typing. In other words, not all statically typed programming languages are created equal. I strongly doubt that Java has as good of a static typing system as Haskell. You could have made a stronger case in favor of using Haskell or another language with a comparably admirable implementation of static typing.

    All in all, I find it interesting that we are still having a discussion about this. Unit testing is definitely not a comparable replacement for a static type system. To me, tribalism is the only possible explanation for why there are any doubts about this at this stage.

    I'm a strong proponent of dynamic languages. I've used Ruby, Perl, Python, etc. However, to me it makes sense that static typing would catch bugs that unit testing could possibly miss.

    Unit tests are written by human beings, and, like the code they test, they can be incomplete and they can contain bugs. A static typing system helps to avoid a certain class of bug that tends to be insidious and can result in runtime errors. Even the best unit tests will not be guaranteed to catch all possible type errors.

    Then there is something else to consider as well. Unit tests actually take time to write. Comprehensive unit tests can take a remarkable amount of time to write - almost as much as the code that they test. Static typing is a boon to programmers because it saves the trouble of having to write those unit tests. This is not to say that unit tests are not useful within a statically typed language. They can still be used to test for adherence to a specification and logic flaws. However, not needing to write unit tests to verify types is still an undeniable bonus.

    Replies
    1. "I strongly doubt that Java has as good of a static typing system as Haskell"

      Can you prove your point, or is it just guessing?

  16. What I found most interesting is that the projects you picked don't really highlight the design uses of duck typing. The code is fairly trivial and so it's easier to implement in a statically typed language.

    I agree with the thought that unit tests are not enough, but I don't think statically typed languages are the answer. I like a blend of unit tests and static analysis (like Pylint) to help find these sorts of bugs.

    Replies
    1. That's not accurate. Duck typing was used in the Python projects. The Haskell translation of the projects used Haskell's type classes to simulate Python's duck typing. Type classes are fairly lightweight (not like Java's interfaces) and they provide type safety. Also note that static type checking is a form of static analysis. Redoing the experiment with Pylint would be an interesting test.
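
      For readers unfamiliar with the term, here is a minimal sketch of duck typing in Python. The class names are made up for illustration and do not come from the translated projects. The `Speaker` protocol is only a loose analogue of what a Haskell type class expresses: it names the operations a value must support without tying it to a class hierarchy.

```python
from typing import Protocol


class Speaker(Protocol):
    """Anything with a speak() method; loosely comparable to a type class constraint."""

    def speak(self) -> str: ...


class Duck:
    def speak(self) -> str:
        return "Quack"


class Robot:
    def speak(self) -> str:
        return "Beep"


class Rock:
    pass  # no speak() method


def greet(thing: Speaker) -> str:
    # Duck typing: any object with speak() works, whatever its class.
    return thing.speak()
```

      greet(Duck()) and greet(Robot()) both work, while greet(Rock()) raises an AttributeError only when it is executed. The `Speaker` annotation is ignored at run time; a static checker that understands `typing.Protocol` could flag that call earlier.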

  17. I have scanned your paper. I think we have some disagreement on what is considered a bug. For example, regarding the PyFontInfo project you say

    While the parse method does call parseChildren after the self.data member has been created, if the parseChildren method were to be called directly without calling parse first, Python would raise an AttributeError when self.data was referenced. It is likely that the original developer intended for the parseChildren methods to only be called from the parse method, but neglected to enforce the restriction.


    This looks like a completely legitimate program to me. It is used according to the implicit assumption. Unit testing is sufficient to verify that parse is called before parseChildren. I don't believe static type checking will do anything here. With Java you simply get a NullPointerException.

    If the Haskell compiler finds anything, it must be that it is doing some flow analysis, not static type checking. In any case it will be a false alarm because parseChildren is indeed invoked as the last step of parse.

    Replies
    1. I appreciate the feedback. I agree that determining what is a bug and what is a misuse of the library may be subjective. I ultimately decided to classify this as a bug because it would be possible for the user of the library to use the parseChildren method incorrectly. It's possible that the author intended this method to only be used internally but it wasn't marked as private (the Python convention is prepending two underscores) and there was no documentation indicating that it shouldn't be used. Because it could cause a program that uses this library to crash and static typing did highlight the danger I considered it a bug.
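
      A minimal sketch of the situation, using a hypothetical `TableParser` class rather than PyFontInfo's actual code (only the method names parse and parseChildren come from the discussion above):

```python
class TableParser:
    """Hypothetical stand-in for the parser discussed above; not PyFontInfo's real code."""

    def parse(self, raw):
        # self.data only comes into existence when parse() runs.
        self.data = raw.split(",")
        return self.parseChildren()

    def parseChildren(self):
        # Relies on self.data having been created by parse().
        return [field.strip() for field in self.data]
```

      Going through parse works as intended, but a caller who invokes TableParser().parseChildren() directly gets an AttributeError because self.data does not exist yet. Nothing in the class tells the caller that this order is required.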

      In this case Haskell is not doing any flow analysis. I know this seems impossible if you're coming from a C/C++/Java background but Haskell's type checker can ensure you don't ever get a NullPointerException.

      In Java every object reference can refer to a valid object or have a null value. In Haskell by default every object reference will reference a valid object. If you want to have a null reference you use the Maybe type which can hold either a valid reference or the Nothing value. Now here is the important part. The Haskell language forces the developer to handle the case where there is a valid reference *and* handle the case where the value is Nothing every time it is used. So you can't ever use Nothing as a valid object. No NullPointerException ever! Haskell's type system is quite magic.

      This can at first seem like a lot of work, but in practice it's not cumbersome.
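
      To give a feel for the idea without requiring Haskell, here is a rough Python analogue. The names `Just`, `Nothing` and `maybe` mirror Haskell's, while `find_user` is a hypothetical helper. The important caveat is that Python does not enforce any of this before the program runs; in Haskell the compiler is what refuses code that treats a possibly missing value as if it were present.

```python
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass(frozen=True)
class Just:
    """Wraps a value that is definitely present."""
    value: Any


@dataclass(frozen=True)
class Nothing:
    """Marks the absence of a value (instead of using None)."""


Maybe = Union[Just, Nothing]


def maybe(default: Any, on_value: Callable[[Any], Any], m: Maybe) -> Any:
    """Handle both cases in one place, loosely modeled on Haskell's `maybe`."""
    if isinstance(m, Just):
        return on_value(m.value)
    return default


def find_user(users: dict, name: str) -> Maybe:
    """Hypothetical lookup that returns Nothing() rather than None."""
    return Just(users[name]) if name in users else Nothing()
```

      A caller cannot reach the wrapped value without going through something like maybe(0, lambda age: age + 1, find_user(users, "ann")), which forces a decision about the missing case at every use.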

    2. Sounds like this is about some special feature in Haskell rather than a static type checking issue. I suggest you do the same exercise on some Java projects. See how many bugs you can identify that elude both unit testing and Java's static type checking.

      And if a library expects you to call the "a" method first before calling the "b" method and you do not follow it, I bet most libraries will balk irrespective of what language they are implemented in.

    3. It's a feature of the type-system employed by Haskell (Hindley–Milner type inference). Java's type system is deficient in comparison, especially with the billion dollar mistake (Null pointers).

    4. Java's type system is not sound, so it is not even correctly statically typed. It may be a good representative of other features, but not of static typing.

    5. Wai, there are several programming languages with static type systems that will prevent the NullPointerException case, so it's not really Haskell specific. The reason why I chose Haskell instead of Java is because the argument I was testing was against static typing in general and not a specific static type checker. Since I was going to test the abilities of static type checking I felt it was only fair to use a world class type checker. If I had used Java and its type checker found no errors it wouldn't really prove anything except perhaps that Java's type system is weak. I believe that some (but not all) of the type errors that were found would have been caught had Java been used. I also believe that it would be likely that some of the Python constructs could not have been represented in Java without major modification. If I'm correct, this would indicate that the cost of static typing for an advanced (Haskell) type system is less than that of a primitive (Java) one, and the benefit is greater. Obviously another experiment would have to be carried out, but I suspect this is the case.

    6. Evan, I know this is off-topic and not the correct place to ask, but can you please explain (at a very high level) why Haskell's type system is superior to Java's? I'm just wondering and not going to argue. Or maybe you at least know a good resource to read about this.

  18. I don't yet understand the assertion that half-decent testing wouldn't have caught those bugs.

    Most projects do a poor job of testing, and these projects are reflective of that. This is still an interesting result, but does it really apply to "testing done right"? (e.g. proper TDD with unit vs system testing)

    Replies
    1. In order to catch all of the errors with unit testing that are automatically caught with static type checking, you have to have tests that cover every possible execution path (not just full code coverage). For this reason I suspect that even proper TDD would be insufficient. Clearly more research would need to be done to know for sure.
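
      A small sketch of the difference between full code coverage and covering every execution path, using a hypothetical function that is not taken from the studied projects:

```python
def summarize(count, as_text, add_bonus):
    if as_text:
        count = str(count)   # count becomes a str
    if add_bonus:
        count = count + 10   # fails if count is already a str
    return count


def run_tests():
    # These two tests execute every line and take every branch both ways.
    assert summarize(1, True, False) == "1"
    assert summarize(1, False, True) == 11
```

      Both tests pass and together they reach every line and both outcomes of each `if`, yet the path summarize(1, True, True) was never exercised and raises a TypeError. With two independent conditions there are four paths; with n of them there can be up to 2^n, which is why path coverage is so much harder to reach than line coverage.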

    2. That's only half the story though. Yes, unit testing only catches most, not all, of the type-related bugs that type systems do. But these bugs seem to me to be only part (and possibly even a small part) of all the bugs that occur over the lifetime of a project. The methodology of reproducing the end-state of mature code has eliminated the biggest source of the most significant bugs, which is the misunderstanding of changing requirements. Although unit tests lose out on typing problems, they gain so much on fronts such as this, that it considerably outweighs the type-related losses.

    3. Tartley, you claim that unit testing catches most type-related bugs. Do you have research to back up that claim? Also I'm not promoting using static typing instead of unit testing. I'd argue that if you are using static typing you should still use unit testing.

    4. Hey. No, I don't have research. That doesn't automatically invalidate anything I might have to say though. I'm saying "most" because, by personal observation, tests tend to catch hundreds or thousands of bugs during development of even small projects, and by your own research, the number of detectable bugs left over at the end is generally less than ten. Hence 'most' have been caught. Let me know if you think that's flawed in some way.

    5. Tartley, I have no idea which would catch more bugs: unit testing or static typing. My experience seems to be that unit testing catches more, but my experience with static typing is mainly with C++/Java. I can imagine a vastly different ratio with Haskell. C++'s static typing is so weak that I frequently joke with coworkers about C++ being an inflexible dynamically typed language. The more I work in this industry the less I trust my intuition about things like this.

  19. This is a nice start and I think it does a fair job of addressing its goals. It raises the question: is the volume of type errors "significant" in terms of the code volume? Other than "<2KLoC" I don't think you detail the size of the codebases. It seems to me that several handfuls of errors in such small apps *do* represent a significant quality shortfall, but perhaps others would disagree. It's also a nice jumping-off point for the dual experiment of translating a type-checked codebase into a dynamic language and seeing if a typical practitioner would write sufficient unit tests to capture the lost semantics.

  20. An interesting read.

    I think your claim that it's relatively easy to rewrite programs from dynamically typed languages to statically typed is true. The biggest challenges are finding a statically typed language that is flexible and expressive enough; and all the extra overhead managing these explicit types. Your choice of Python makes sense, but Haskell doesn't represent a mainstream statically typed language. Mainstream statically typed languages fall well short of Haskell's expressiveness.

    I've often said the only language I'd consider a replacement for Python, is Haskell! I think others would fall in this camp too. However Haskell comes with significant challenges of its own. Had this study been done using C++, C#, or Java, the story may have been very different (and much longer :P ).

    Another commenter has mentioned the other big issue here: That typing bugs aren't necessarily the biggest problem. Sure you may have discovered some typing issues, but runtime and logical errors may still abound. From my own experience with Python, probably over half of errors are typing related, but they only occur during development. Logical and runtime errors definitely dominate in production code.

    Replies
    1. My choice of Haskell was based on the idea that I wanted to pick a good example of state of the art type checking that was relatively mainstream (it is used outside of academia to solve real problems). Having said that my research really isn't about Python nor Haskell, it's about dynamically typed programming languages, unit testing and statically typed programming languages. Python and Haskell are just the tools I used to study these concepts. Much like cancer researchers use mice for their experiments but the experiments aren't really about curing cancer in rodents.

  21. I think the big value of unit testing your code is the better design that the tests guide you towards. For keeping your code free of bugs, I think integration and end-to-end tests are more efficient. Even though they are slower and harder to debug, I cannot live without them.

    Replies
    1. I agree. I argue for unit tests because they lead to a better design and a better API. I would still promote unit tests even if they never caught any bugs.

  22. To the guys saying that type errors aren't the dominant problem in production code, that all depends on how you use types. A simple example, sqrt(-1) is a runtime error in most languages including typed languages, but with a type like PositiveNumber that could be a type error.

    That's overly simplistic, but if you delve further you'll find that everything short of requirement misinterpretations can be found statically given sophisticated enough type systems.
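
    A minimal sketch of the sqrt idea, assuming a hypothetical wrapper class called `NonNegative` that plays the role of the PositiveNumber type mentioned above (zero is allowed because its square root is well defined):

```python
import math


class NonNegative:
    """A number known to be >= 0; the check happens once, at construction."""

    def __init__(self, value: float):
        if value < 0:
            raise ValueError(f"expected a non-negative number, got {value}")
        self.value = float(value)


def safe_sqrt(x: NonNegative) -> float:
    # No range check needed here: the type carries the guarantee.
    return math.sqrt(x.value)
```

    In Python the check still runs at run time, when the NonNegative is built, but it moves the failure to the boundary where the value enters the program. In a statically typed language, passing a plain number to safe_sqrt would be rejected by the type checker, so the only place a negative value can be caught is the single constructor.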

    Replies
    1. We want to be careful about overstating the power of type systems. I don't believe that there exists a static type system that can prevent division by zero (unless the zero is a literal "0" in the denominator). I do think there is a lot more we can do with static type systems, but they will never solve all of the problems.

    2. Actually, it does exist -- type systems without non-termination. Eric Torreborre has a recent interesting post about it, touching briefly on division by zero - http://etorreborre.blogspot.com.au/2012/06/strong-functional-programming.html.

    3. The problem is not the type systems but type inference. You can specify any properties in dependent type systems but then type inference is undecidable. In Coq for example you could write a function like

      sqrt {x:int | x <> 0} : real := ...

      whose type would only allow integers that are not equal (<>) to zero to be passed to it. But the type annotations that you have to provide when calling sqrt can get extremely complicated....

    4. Evan, Agda and Epigram have type systems that do that and much more. Coq does too, it's just harder to recognize it as a programming language.

      It is in fact very easy to create a type system that prevents division by 0 and other divergent behavior like infinite non-productive loops.

      What's hard, very hard, is making a type system that prevents divergent behavior but still allows you to write useful programs reasonably easily.

      In Agda, for instance, some kinds of recursion are automatically proven to terminate. But other kinds of recursion require some help from the programmer to prove they don't run off the rails. Figuring that proof out and translating it to Agda will often not qualify as "reasonably easy."

    6. I stand corrected. Thanks for pointing out my error on type systems not being able to catch divide by zero. I'm happy to know I was wrong.

  23. I find it interesting how eager people are to point out that bugs caught by the type system aren't the *biggest* errors.

    Is that really how you write your programs? "I'm not going to even consider this bug, because I already fixed another, even bigger one"?

    The results here are extremely interesting.

    In the end, it's always down to the individual team to decide if they need static typing: do they need to catch these last couple of errors that snuck past the unit tests?

    The really interesting part here is that those bugs *do exist*. That if you choose to rely only on unit testing, you should be aware that you will likely fail to discover a handful of type errors. That doesn't mean no project can be successful if it doesn't rely on static type analysis. It is just something to keep in mind when you decide that you don't need static type analysis.

    Replies
    1. According to studies, unit testing finds only about 20-40% of all bugs in a program.

      Integration testing and formal code review each find a little more, around 30-40%.

      But a number of bugs tend to be found only in production by real users (roughly 10% of all bugs).

      So whatever your choice, neither static typing nor unit testing alone is sufficient.

      This is more of a tradeoff.

      Finding and correcting bugs costs money. Finding bugs later in the dev cycle costs even more money.

      Working on a bigger codebase increases all costs (for new devs and maintenance). This cost is exponential.

      Static typing tends to make for a bigger codebase (maybe not with Haskell?). But it allows many errors to be corrected right when the code is written/compiled.

      Static typing helps when dealing with a bigger codebase due to better tooling and easier reasoning.

      There is no single true answer. You can perform studies and try to see what works better but even then you still need to take into account your project requirements and the associated costs.

    2. It's not that you ignore some bugs because you have fixed others. Instead it's that you have a finite amount of effort to expend, so what do you spend it on to maximise the delivery of useful software? Writing unit tests takes effort, but people who do it think that effort is wisely invested because it catches a larger proportion of bugs that would otherwise be important to users, hard to find or hard to fix. This particularly includes the sort of catastrophic bugs which stem from early misunderstanding of requirements. These bugs are not detectable by type systems, and generally take a long time to diagnose, understand and fix. The kind of errors which are prevented by static typing, on the other hand, are, in practice, often relatively trivial to detect and fix. So the question is, is the effort required to overcome the costs of static typing more worthwhile than the effort required to overcome the costs of writing tests? Sometimes, one should do both. But sometimes you don't have the resources to do everything possible - so which is least valuable and should be dropped first?

    3. ...Because a lot of people's personal experience is that, when switching from static languages to dynamic ones, their productivity goes up, and the number and severity of bugs in their programs goes dramatically down. While unscientific subjective impressions like this are suspect, any useful models of whether unit tests or static typing are more effective need to be able to explain this, and delineate when it is and isn't true.

    4. Most of the stories that I've heard of people talking about productivity gains are people switching from C/C++/Java to Python/Ruby. It's important to remember that Python is not C/C++/Java without static typing; it is a completely different language. I suspect if someone switched from awk to Haskell they'd also experience an increase in productivity. More study needs to be done to measure the costs (if any) of static typing.

    5. Hey. It's true in principle that the alleged productivity and quality advantages of common dynamic languages over traditional static typed languages may be predominantly due to something other than the difference in their type systems. But in the absence of demonstrations otherwise, I don't think it's too foolish to assume for the moment that the type systems are at least a substantial contributor to the differences.

      I think this would all be a lot less controversial if, instead of saying "you need static typing", your conclusions said "static typing will catch some errors that real-world standards of unit testing fail to catch." You have indisputably demonstrated the latter, but without the research you posit to measure the costs of static typing, you haven't even established whether the benefits of static typing outweigh its costs, never mind whether it is more desirable than unit testing (which is what your headline strongly implies.)

    6. Tartley, The "need" in my title was meant in the context of the subject of the post which is detecting defects and in context of the theory that I was testing which is that you don't need static typing. I believe that my experiment showed that static typing has some benefits in that it caught bugs that were not caught by the unit tests. I also think my experiment showed that there was very little to no cost as none of the projects required any major modifications to pass the type checker. If there was a magic command line argument to turn on Haskell style type checking to the Python code several bugs would have been discovered and almost no rewriting would be required. I therefore assert that for these projects the benefits of static typing did outweigh the costs. Now obviously more studies need to be done before we treat my conclusion as an established scientific theory, but other than the possible ambiguity of the word "need" I believe my title is accurate. Although I much prefer to be judged on the actual paper and not the summary post.

    7. Good research Evan. Obviously, static typing has many advantages.

      That said, I think Tartley's objection is pretty reasonable. I suspect it takes extra programming time to deal with the type system (though I haven't looked into Haskell specifically) and, to me at least, catching 17 bugs and obviating 4 unit tests seems like a very small gain over 4 libraries worth of code.

  24. After coding in C/C++ for almost 20 years, and Perl for a year somewhere in between, I switched to PHP last year and coded in it extensively. Due to the nature of the project, I had to check into SVN every time I made a program change so I could debug the code (slightly tedious). I was diligent about commenting the source of each error. About 70% of the mistakes were variable name typos, case sensitivity typos, and SQL query errors that could not be caught by any language as they were embedded in strings. The time taken in detecting the errors is non-trivial, although the fixes themselves ARE trivial. I much preferred a static type checking system, mainly for basic fat-finger errors that a compiler detects and immediately provides feedback for. Not saying that all dynamically typed languages are inferior or anything of the sort, just that a hybrid approach (linting, or using Haxe or a similar statically type checked language to generate other languages) isn't a bad consideration.

    Replies
    1. "SQL query errors that could not be caught by any language as they were embedded in strings"

      There are in fact tools to check embedded SQL statements. For Java, you can check out https://wiki.stacc.ee/trac/edsl/raw-attachment/wiki/WikiStart/edsl_tool_demo.pdf for an example. And, if I remember correctly, C# has LINQ, which provides statically checked queries inside C# programs.

    2. You can also use a programmatic API instead of just sending a SQL string.

      Many older languages provided direct embedding of SQL with associated type checks.

    3. Funny you mention SQL embedded in Strings, because catching errors in them at compile time is precisely what Scala 2.10's string interpolation allows.

    4. With respect, I can see people in these comments saying that more primitive type systems such as C++ or C# don't present the 'static typing' argument at its best - the discussion is about more capable languages such as Haskell. I guess for balance someone should mention that Perl and PHP aren't the best examples of dynamic / unit testing languages either. PHP is not just dynamic, but is actually weakly typed, and as I understand it, Perl is partially so (is this true?), which even dynamic language evangelists will agree is awful.

      Possibly not unrelatedly, I disagree with your assertion that the time taken to diagnose 'fat-finger' misspellings is significant. Half-decent unit tests catch these straight away, in seconds.

  25. Evan, thanks for this effort. To follow up on James Iry's reply briefly, I recommend you also have a look at http://idris-lang.org for a dependently-typed language that attempts to do totality checking, but is self-consciously a programming language, rather than a theorem prover such as Coq. Idris is also interesting to a Haskell programmer by being syntactically inspired by Haskell, so it may be easier to pick up than some of the alternatives.

  26. So you tested to see if statically typed languages or dynamic languages are better at checking types and found out that the ones that enforce types are better at it.

    In other news, it has recently been discovered that 1 = 1.

    Have you considered using your time for anything useful?

    Replies
    1. You just strawmanned my experiment. I wasn't seeing if dynamically typed languages were better at checking types than statically typed languages; I was checking to see if unit testing would catch the same errors as static typing. Now I have heard some people say that of course unit testing won't obviate static typing, but I've also heard people claim that unit testing will obviously obviate static typing. I'm of the opinion that we should stop trying to intuit computer science and instead use the scientific method and conduct experiments.

  27. Evan, thanks for writing this. I like the idea of your research, but I have a few concerns.

    First, I wonder how great the unit testing culture of these projects is. For example in the first you mention a malformed input error that is caught by static typing, I would have assumed that you would have test cases to check for conditions like that (same goes for a statically typed language). Further, are the bugs you found customer facing, or do they have no impact on the project? Not that this is good practice, but it does minimize the impact. Also -- What was the ratio of bugs you found to the project size?

    In my experience, one of the problems is that popular statically typed languages (e.g. Java) aren't that flexible. And not exclusively because of their type system. As opposed to, say, JavaScript, which is very expressive and happens to have weak, dynamic typing. Maybe the best answer is in more flexible statically typed languages.

    But I concede that with a dynamically typed language you may need more well-defined code standards and a better test suite, but I don't think that's a reason to ditch the language.

    Replies
    1. As far as whether these bugs would be customer facing, I claim that they could be. All of the bugs could be exploited to cause a crash and many of them could be misused by another program to cause a crash or bad output.

      I agree that Java is not a great example of a statically typed programming language; that's why I didn't use it in my research. I suggest you look at Haskell to see how a statically typed language can be both flexible and catch a lot more bugs than Java could.

      I'm not arguing for ditching dynamically typed languages. I still use Python, bash, JavaScript etc. I do think developers need to understand that by choosing a dynamically typed language they are sacrificing some safety even if they use unit tests. In many cases that may be a good trade-off; in other cases it may not be. I do hope that when choosing a language people don't think that unit testing can replace static typing.

  28. Interesting work Evan, just wanted to mention a little typo in your paper, Chapter 3 Fig 3.:

    class car():
        # The __init__ method
        def __init__(self, color):
            self.color = color

        # The color method
        def color(self):
            return self.color

    When you create the object, the method "color" will be shadowed by the new instance variable "color" of type str. While this is legal Python, it's generally not the desired behavior, and the instance variable would usually be called something like _color.
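
    A minimal sketch of the shadowing described above, next to the "_color" rename. The class names "Car" and "FixedCar" and the variable "shadowed_call_error" are hypothetical and are not taken from the paper:

```python
class Car:
    def __init__(self, color):
        # The instance attribute "color" hides the method of the same name,
        # because instance attributes are looked up before plain methods.
        self.color = color

    def color(self):
        return self.color


class FixedCar:
    def __init__(self, color):
        # A leading underscore keeps the attribute and the method distinct.
        self._color = color

    def color(self):
        return self._color


car = Car("red")
try:
    car.color()  # car.color is now the str "red", so calling it fails
    shadowed_call_error = None
except TypeError as exc:
    shadowed_call_error = exc

fixed_color = FixedCar("red").color()
```

    With the shadowed version, the call fails only at run time with a TypeError, so a test that merely constructs the object would not notice it.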

    Replies
    1. Arg! You're right! I even had a script that would go through and feed all of my figures to the Python or Haskell interpreters/compilers to make sure I wouldn't have any such typos. I didn't write unit tests for the code in my figures. Now if only Python had static typing I wouldn't have had this typo ;) Thanks for the sharp eye!

  29. Evan, thanks for the good stimulation of thought. I wonder if there's a problem with your methodology.

    Your hypothesis is that "Unit testing isn't good enough, therefore static typing is necessary." And that's a summary of your blog post title.

    I don't think you proved that by a long shot. I think what you proved is that unit testing (as defined by 100% code coverage) is inadequate. I think you also proved that static type checking is useful for finding bugs. But you haven't proven the linkage that static typing is the only way to find intractable bugs.

    In other words, you proved "A" to be a true statement and you proved "B" to be true as well. But you didn't prove "Because A we must use B". "Because of the failure of unit tests we must use static typing" is still an unproven statement.

    You can prove that by providing an example of a common intractable bug that is solved efficiently by using static typing, and not as efficiently with extra testing, static code analysis, peer review, etc.

    In other words, you have to prove that static typing really is the only way to solve certain kinds of intractable problems. Because if I can solve the problem through extra testing, then I don't have to use static typing, and your hypothesis is false.

    I'm not saying you're wrong. Just that you haven't proven your claim.

    Replies
    1. I agree that I haven't proven that static typing is the only way to find intractable bugs. In fact it would be impossible to prove that static typing is the only way to find intractable bugs as there would always be the possibility that some new crazy magic technology could be invented that could replace static typing.

      I think that the misunderstanding is that you have confused my conclusion with my hypothesis. The hypothesis that I was testing is the claim made by proponents of dynamically typed programming languages that "Static typing is insufficient for detecting bugs, and so unit testing is required." (OK, there are two other parts to the claim, but this is the relevant part for this discussion.) As such, my experiment is limited to only two methods of finding bugs: 1. unit testing and 2. static typing. During the experiment I did find other bugs due to the translation process, but I did not include them in my paper as they fell outside the scope of the experiment. Given that the hypothesis and experiment focused only on these two mechanisms of detecting bugs, I believe that, taken within this context, my conclusion and blog post title are accurate.

      So yes I agree that I have only shown that unit testing is inadequate (in practice), but that is all that my experiment was designed to test and all that my hypothesis claims.

    2. Unfortunately, you tested whether "A therefore B". And when you found that statement to be false you assumed that "B therefore A" must be true, which is something you admittedly did not test.

      I found your reply illuminating in that your logic is as confused as your testing methodology and conclusion.

      In short, you showed that unit testing in practice doesn't find all bugs, a conclusion that was already apparent to those writing software.

      You also showed that static typing finds bugs (again, apparent already), but you didn't show the connection between the two.

      We can argue logic, hypotheses, and conclusions until our faces are blue from lack of oxygen. But academic minutiae are worthless unless they can shed light upon the big questions.

      In this case that question is: Which is more valuable, static typing, or unit testing?

      And I'm afraid that question remains unanswered with your thesis.

    3. "In short, you showed that unit testing in practice doesn't find all bugs, a conclusion that was already apparent to those writing software."

      That's not an accurate statement of what I found. There was no argument (as you stated) that unit testing didn't find all bugs. There was an argument that unit testing would find all bugs that static typing would find. I showed that unit testing in practice does not find all of the bugs that static typing finds. Therefore static typing is still needed in order to find the bugs that are likely to be missed by unit testing in practice.
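
      As a hypothetical illustration (not one of the bugs reported in the paper), the sketch below shows how a test suite can execute every line of a small module and still miss an error that a static checker would report. All names here ("USERS", "find_full_name", "greeting", and the tests) are invented for the example, and it assumes an external checker such as mypy is run over the annotations:

```python
from typing import Optional

USERS = {"evan": "Evan Farrer"}


def find_full_name(username: str) -> Optional[str]:
    """Return the full name, or None when the user is unknown."""
    return USERS.get(username)


def greeting(username: str) -> str:
    # Bug: find_full_name may return None, and None has no .upper() method.
    # A static checker such as mypy reports this line without running it.
    return "HELLO " + find_full_name(username).upper()


# These tests execute every line above and all pass.
def test_find_known():
    assert find_full_name("evan") == "Evan Farrer"


def test_find_unknown():
    assert find_full_name("nobody") is None


def test_greeting():
    assert greeting("evan") == "HELLO EVAN FARRER"
```

      Each function is covered, yet no test combines "greeting" with an unknown user, so the AttributeError only appears when that input arrives at run time.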

      As for the big question of which is more valuable, static typing or unit testing: my research does not address and did not attempt to answer this question. I'm not sure that is the right question to ask. I think a more important question is when the benefits of unit testing are worth the costs and when the benefits of static typing are worth the costs. Because they are not mutually exclusive technologies, knowing that one is more valuable does not mean that you shouldn't use the other.

      If you really do care about which is more valuable, that's fine, but those who argued that unit testing obviated static typing thought they knew the answer; my experiment seems to indicate that they didn't.

  30. Greetings!

    I've translated ( :) ) your blog notes into Russian: https://docs.google.com/document/d/1eMc5CbCy0ihCEbbIFUoM6_2eP55vuFyZdr1yAo-lzn8/edit

  31. Interesting read. And it's nice to see *any* research in this area. Like you, I've been disappointed by the lack of same.

    However, I think you misstated the claims on the dynamic side. The claim isn't so much that unit testing will replace static typing, but that the time gained by not having to provide type information and otherwise manage types more than makes up for the time lost by not detecting errors earlier.

    This argument is typically aimed at the Java/C++ language family. As such, Haskell has two advantages over those: less work managing types, and stronger type checking.

    Replies
    1. Mike, I've frequently heard both arguments made by proponents of dynamically typed programming languages. An experiment on the claim you mentioned would be very interesting.

  32. Very late to the party, but I feel like I have something to contribute anyway.

    I'm on a project using Haskell to implement yet another programming language, and coming up with the theory (i.e. semantics, type rules, etc.) behind it more or less concurrently (the implementation lags by about a month sometimes).

    Over the past year, Haskell's typechecker has found at least 5 (probably closer to 10, but I don't want to overestimate) _logic_ bugs in the theory for us.

  33. First, I'd like to congratulate you on a well-done study. I hope the research community will attempt to replicate it, to measure the effect of variables such as source and destination language, size of a project, and the application of metaprogramming techniques (which are more commonly used in some languages than in others).

    I blogged a few more observations here: http://blog.rafaelferreira.net/2012/07/types-and-bugs.html.

  34. If you made your program translations available on GitHub or similar, it would facilitate easy browsing and comparison of the source and translated programs for the interested passer-by. As it is, they are bundled up in a tar.gz file download, something of a barrier for the casual reader.

    You could then hyperlink to the source where some of the errors were found, rather than simply assert you found them, which would make the post more persuasive.

    Perhaps that's why so few of the comments refer to your actual translated code.

  35. Unknown. All of the bugs are described in the paper. I think for the casual reader the paper would provide more context for discussing the individual bugs. I published the code for the benefit of any non-casual readers who want to verify that the translation is correct.

  36. Yesterday, I gave a talk to the Sydney Python User's Group (SYPY), about this paper. The session was not recorded, but my slides (including speaker's notes) are available on-line:

    http://somethinkodd.com/sypy/farrer.pdf

    Big thanks to Evan Farrer for giving me something interesting (and a little bit different) to talk about.

  37. A lot of these comments are based on the fact that this post (and possibly the paper) only talk about static typing.

    a) Haskell's typing is both static and strong, unlike C++ types, which are mostly weak.
    b) Haskell's typing is *very* static, unlike Java's, which is largely dynamic as well (supporting both up and down casts).
    c) A lot of people seem unaware that static typing *does not* mean type annotation. Type inference needs to be better advocated in general.

  38. Evan,

    Your experiment is very interesting and informative. Thanks for sharing it publicly.

    It seems to me your results illustrate another aspect of the way code tends to be written in practice, aside from the relative merits of static typing and unit testing. You touch on it in one of your comments on the blog:

    "My interest was on whether unit testing obviated static typing *in practice*. Obviously in theory unit testing can obviate static typing (you could encode a static type checker as a unit test), but if no one does this then in practice it really doesn't matter."

    My experience supports your finding that unit test suites aren't necessarily well-crafted. I do not conclude from this that unit testing as such is a flawed technique, however. In my work I like to make full use of whatever capabilities the tools offer. When the language has static typing, I want to use it to advantage; the stronger the typing system, the more effort it saves me toward the goal of delivering code people can understand, modify, and feel confident in. When the language doesn't have strong typing, or if it is dynamically typed, then the burden falls to me to ensure the unit test suite covers everything necessary. That is just part of the job. Different tools have different capabilities; when a tool lacks a capability, we have to fill in the blanks. Some other commenters on your post have said similar things.

    I don't see this as an either-or question of strong typing versus unit testing. A strongly typed language leaves us free to devote proportionally more time to crafting meaningful and useful test cases to cover whatever the type system doesn't cover.

    Unfortunately, many people who get paid to write software just pump out code any way they can. You say this "really doesn't matter," and for the purposes of your study that's quite right.

    In the larger scheme of things, I think it does matter. I think it's a deep problem in our line of work. Maybe the quality of unit test suites could be the focus of a future study based on real code, following your example of how to set up such a study.

    Anyway, great work. I'll be pointing others to it.

  39. "No dynamic constructs were used that could not be directly translated into Haskell."

    So your proof is that translating Python into a statically typed loose analogue fails static type checks.

    Replies
    1. Paddy3113. My conclusion is not just that it fails type checks, but that those type check failures represent real bugs found in the dynamically typed program, and that those bugs weren't caught by the unit tests.

  40. I have been writing code since I was 12 years old. I've spent a lot of time working in both statically and dynamically typed languages. Which is better? In my opinion, which is better depends on the problem domain that you're attempting to solve.

    In my experience, most problem domains fall into one of two general categories:

    1) The application must behave as expected under known conditions.
    2) The application must behave as expected under unknown conditions.

    Dynamically typed languages will deliver a higher ROI in case 1, while statically typed languages will deliver a higher ROI in case 2.

    Replies
    1. John, that's definitely a popular hypothesis, but I haven't been able to find any scientific studies that show a higher ROI for the different cases. I think we should be hesitant to accept untested hypotheses as facts.
