What Gets Documented Gets Improved

I used to edit my .emacs [1] directly. One day I decided to convert it into a literate program [2]. So over the course of a week I went about extracting chunks of code, documenting them, and rearranging them.

The process of converting the file to literate form alone improved it. There were blocks of old unused code I deleted. When I came across lines I’d written when I was less experienced with Emacs, I was able to rewrite them in a more compact and understandable form. And I fixed some annoyances that had always bugged me but only became clear to me as I was reading through the code.

You could argue that just the act of reading the file was what improved the code. But you’d be missing the point. Why would I have read the file if I weren’t documenting it?

Are You Really Reading Your Code?

The only reasons for which I’ve seen people read production code [3] are:

  1. to fix an error
  2. to understand the code in order to extend it

Both of these are shallow dips into the code. When performing them, the objective is to get in and get out. We tend to read the code only long enough to understand it for the purpose(s) we currently have in mind. But we’re missing a broader understanding of the code. And without that understanding, we can only make focused improvements.

Incremental Change Is Due to Lack of Gestalt Knowledge

There is nothing wrong with incremental change [4]. There is tremendous power in Kaizen. And maybe that’s why it’s such a prevalent development model.

But in order to achieve revolutionary improvements in code, we need insight. We need to be able to make connections in the code. The more deeply we understand the code, the more we can hold of it in our head [5].

The 3 Layers of Program Documentation

Figure: Documentation Pyramid

            narrative               top-level          external
location    source code             source/external    external file
level       intra-function          inter-function     super-function
style       novel                   reference book     map
time        during                  after              before/after
targets     statements/expressions  functions/modules  relationships

Narrative

The narrative is made up of the comments that exist between (or in) lines of code. In the diagram they’re in the center and have the largest area because I think they’re the most important and should be the most numerous. They’re important because they are the most difficult to reverse engineer later. They give you a glimpse into the mind of the author of the code—ideally at the time that they were writing it (even if that author is the you from the past).

Narrative comments stand in contrast to top-level documentation which can be written later and even by other developers.

Here’s an example of some narrative documentation.

depUpToDate :: String -> IO Bool
depUpToDate oldMD5 = catch
      -- To avoid having to escape directory path separators (such as
      -- converting / to !), each dependency is stored in a file named
      -- by its MD5. e.g. a dependency named `redo.hs` with an MD5 of
      -- 29e57f39b7ea2795ab2e452ada562778 and for the target `redo`
      -- would be stored in the file
      -- `.redo/redo/29e57f39b7ea2795ab2e452ada562778`. The contents of
      -- the file would be the relative path to the dependency (in this
      -- case, simply "redo.hs").
  (do dep <- withFile (metaDir </> target </> oldMD5) ReadMode hGetLine
      newMD5 <- fileMD5 dep
      doScript <- doPath dep
      case doScript of
        -- Simple dependencies are up-to-date if their MD5s match that
        -- of the last time that they were checked.
        Nothing -> return $ oldMD5 == newMD5
        -- But a dependency can itself depend on other files, so aside
        -- from just having their MD5s match between runs, all of their
        -- dependencies must also be up-to-date.
        Just _ -> do upToDate' <- upToDate dep
                     return $ (oldMD5 == newMD5) && upToDate')
  -- If the target metadata directory (e.g. `.redo/redo`) doesn't exist
  -- (for example, on the first run of redo), we trap the IO exception
  -- and just indicate that the target is out-of-date.
  -- TODO: The way this is written, it looks as if it would say that the
  -- target /is/ up-to-date if the metadata directory is missing, which
  -- would be wrong.
  (\e -> return (ioeGetErrorType e == InappropriateType))

Narrative comments share similarities with literate programming: both are best written as the code is being written and both encourage you to put more emphasis on the text than the code. Where narrative comments differ from literate programming is that they are not designed to be nicely formatted and they are inline in the code.

Top-Level

Top-level documentation describes how to use all of the “black boxes” (functions, modules, interfaces, etc.) in the code. While top-level documentation can be written as you’re writing the code, I find that it’s better to wait until afterwards. This avoids wasting time if you decide to scrap or change a function, and it gives you a little distance from the code so that you can better put yourself in another developer’s (or your future self’s) shoes.

Here’s an example of a top-level block of documentation. Note that in this case the documentation could be quite easily derived by reading the source for the function.

-- | Calculate the MD5 checksum of a file.
--
-- For example:
--
-- >>> fileMD5 "redo.hs"
-- "29e57f39b7ea2795ab2e452ada562778"
fileMD5 :: FilePath   -- ^ the file to calculate the checksum of
        -> IO String  -- ^ a 32-character MD5 checksum
fileMD5 path = (show . MD5.md5) `liftM` BL.readFile path

Examples, such as the one in the code above, provide valuable context and make it easier to understand how to use each component at a glance.

External

External documentation should explain the code at a high level. It should act as a map of the layout of the code and explain the relationships between the “black boxes” from the top-level documentation. Instead of explaining the inner workings of the code, it should explain how different pieces fit together. It should explain the overall philosophy of the program and the purpose for each component.

External documentation is unique in that it can be written before any code is written (although the majority is probably written after). It can also be written by someone else, and it should probably at least be reviewed by someone other than the author of the code.

While narrative and top-level comments both serve to document the thought process or mental model of the code, the external documentation should step back and explain how to get things done. Think of the external documentation for your program as a guide for new developers—or yourself once you’ve been away from the code for too long. It can contain HOWTOs, FAQs, and tutorials.

External documentation can take the form of a README file, a wiki, or a handbook. There is usually some overlap between external documentation and end-user documentation. A README for a program can provide valuable insight into the intended behavior of a program. But for anything beyond a small program, end-user documentation is no substitute for developer-focused external documentation.

But What About Self-Documenting Code?

Shouldn’t we make the code itself more understandable, instead of documenting it? Shouldn’t variable names, types, and function signatures illuminate the code well enough?

I don’t think there’s any such thing as self-documenting code. Ambiguities in human language prevent variable names from conveying their exact use and purpose. Types are helpful, but they’re usually a small part of the source code. And function signatures, while possibly the most useful self-documenting piece of source code that I know of, still don’t explain everything about a function, especially for functions with side effects.
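As a small sketch of that limit, consider a pure function whose signature is three Ints in and one Int out. The function `clamp` and its behavior for reversed bounds are hypothetical, invented only for illustration. Nothing in the type says which argument is which or what happens in the odd case, so the comments have to carry it.

```haskell
-- | Restrict a value to an inclusive range.
--
-- The type alone cannot say which 'Int' is which, nor what happens
-- when the bounds are given in the wrong order.
--
-- >>> clamp 0 10 42
-- 10
clamp :: Int  -- ^ lower bound (inclusive)
      -> Int  -- ^ upper bound (inclusive)
      -> Int  -- ^ the value to restrict
      -> Int  -- ^ the value, moved to the nearest bound if it was outside
clamp lo hi x
  -- Assumed design choice for this sketch: reversed bounds are swapped
  -- rather than treated as an error. No signature could tell a caller that.
  | lo > hi   = clamp hi lo x
  | otherwise = max lo (min hi x)
```

A function with side effects would leave even more unsaid, since `IO ()` reveals nothing about which files or state it touches.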

The best we can hope for is to make our code more understandable. But that shouldn’t prevent us from writing documentation for it anyway. Additionally, I find it easier to make my code understandable after I’ve written documentation for it. I notice that I use words in the documentation that I use to rename identifiers in my code.
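Here is a minimal sketch of that renaming effect. Both functions are hypothetical and exist only to show a first draft next to its second pass; they are not taken from redo or any other real program.

```haskell
-- First draft: the names say nothing, so the comment does all the work.

-- | Keep only the names of the dependencies whose recorded checksum no
-- longer matches their current checksum, i.e. the stale ones.
f :: [(String, String, String)] -> [String]
f xs = [n | (n, a, b) <- xs, a /= b]

-- Second pass: the words "stale", "recorded" and "current" from the
-- comment above have become the identifiers. The behavior is unchanged.

-- | Names of the dependencies whose checksum has changed since it was
-- last recorded. Each triple is (name, recorded checksum, current checksum).
staleDependencies :: [(String, String, String)] -> [String]
staleDependencies deps =
  [name | (name, recorded, current) <- deps, recorded /= current]
```

The second version still benefits from its comment, because the tuple layout is something the names alone can’t fully pin down.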

Writing Code Should Be a 2-Pass Process, At Least

And fixing errors doesn’t count as a 2nd pass.

Comments are for documenting code as you write it. And documenting code as you write it can be a distraction.

As with video encoding, we can get a better result with a 2-pass approach. In the first pass you can focus on writing the code. Sprinkle comments as you go for any details that might be unclear.

But now your code is literature. It’s a document. And you wouldn’t write a document in 1 pass. That’s called the first draft! You would proofread it at least once. As you proofread a document, you catch mistakes that are so obvious you’re amazed that you made them in the first place. And as you subvocalize what you read you find awkward parts of the document that you need to rewrite. You may change how a sentence starts or move a paragraph up or down.

The same should apply to programming. Just as a sentence or paragraph may not make sense after being written on the fly, a function may not make sense. Or perhaps one paragraph doesn’t lead cleanly into another. Maybe a paragraph is trying to say too much and needs to be broken into smaller pieces.

The proofreading becomes even more valuable once you’ve been away from the document for a while, especially after a period of sleep. Much of what you had stored in short-term memory has been flushed out and you see the document from the perspective of someone who has never seen it before.

So there’s no need to force yourself to document as you code. Take your time. Back off from the code for a while. Give your mind room to breathe. But come back later and take a look; otherwise, someone will be reading your first draft in the future.

You Benefit from Cognitive Dissonance and Cumulative Effects

When you believe that your program will be read (because you’ve provided it in a convenient format which shows value) and you have pride in either the code or the documentation, you will work to improve the other.

If one chapter of your book or section of your document is full, impressive, and a pleasure to read, and you then come across another section that falls flat, you’ll naturally want to fix it.

It’s like having a clean bedroom. When you’ve cleaned your room you’re more alert and intent on keeping it clean. You know the effort it took you to clean it, and you don’t want to have to repeat it. Besides, you get enjoyment from the feeling of having a clean room and want it to last.

Footnotes


  1. A .emacs file is a configuration file for the Emacs editor. It’s actually a program written in the Elisp programming language.

  2. I used the Noweb tool, if you’re curious.

  3. I’m excluding code that was written for educational purposes, such as code in books, technical papers, programming pearls, etc.

  4. There’s nothing wrong with incremental change as long as it’s positive overall. Unfortunately, making changes to code without a gestalt knowledge of the code will often produce a positive change in one direction (e.g. fixing a bug) while producing a negative change in others (e.g. introducing more bugs, making the code more complex, or introducing brittle assumptions or coupling).

  5. See the studies of chess masters conducted by Alfred Binet.