Program Synthesis in 2017-18


This post marks the beginning of a new blog. I’ve been putting off a re-design and re-hosting of this website for the past two years or so, and finally got my hands on it last week.


In early 2017, Sumit Gulwani, Rishabh Singh, and I wrote a survey on Program Synthesis. Its goal was to provide a comprehensive overview of the state of the field at the time and to serve as an introduction for newcomers. While complete at the time, the survey became surprisingly outdated in a mere two years. Nowadays, every time someone asks me for an introduction to program synthesis, I point them to this survey but also add 10-20 other links that have recently pushed the state of the art. On one hand, this is great: program synthesis is expanding rapidly (ICLR’18 alone had 14 papers on the subject!). On the other hand, I’m getting tired of compiling and annotating these lists of recent links every time 🙂 Ergo, this post — a high-level overview of the recent ideas and representative papers in program synthesis as of mid-2018.

Before we dive in, a few disclaimers:

  • I assume you are familiar with program synthesis. If not, you’re welcome to check out the survey in question or this excellent blogpost by James Bornholt.
  • Naturally, it’s impossible to cover everything, although I tried my best – we never set that goal even for the original survey; there’s an enormous body of great work out there. That said, if you notice some glaring omission, please let me know in the comments, and I’ll be happy to include it!
  • The list is skewed toward ML/AI rather than PL. This isn’t because of any personal biases of mine, but simply because the volume of synthesis-related papers in NIPS/ICLR/ACL/ICML over the last couple of years started exceeding that of PLDI/POPL/OOPSLA. Many prominent researchers now publish in both communities.
  • As with the original survey, I tried to organize the post to highlight high-level ideas rather than individual papers; each idea usually appears in more than one research paper. Notable individual papers are listed in their own section.
  • Unlike the original survey, this post focuses only on the ideas and techniques, and not on, say, new applications. There’s also a small section on datasets at the end.
  • There’s a fair number of Microsoft Research papers in this list, although I tried my best to maintain diversity.

Many program synthesis applications are built on a top-down enumerative search as a means to construct the desired program. The search can be purely enumerative (with some optimizations or heuristics), deductively guided, or constraint-guided. A common theme of many research papers of 2018 has been to augment this search with a learned guiding function. Its role is to predict, for each branch at each step of the search process, how likely it is to yield a desired program, after which the search can focus only on the likely branches. The guiding function is usually a neural network or some other probabilistic model. The overall setup can be framed slightly differently depending on the interaction of components in the search process, but the key idea of combining a top-down search procedure with a trained model guiding it remains the same.
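As a heavily simplified illustration of the idea, here is a best-first enumerative search over a toy DSL, where a hand-written `guiding_model` stands in for the trained scorer (the DSL, the operators, and all names here are invented for illustration):

```python
import heapq

# Toy DSL: a program is a list of unary operators applied to the input x,
# outermost first. A real synthesizer enumerates a much richer grammar.
SEMANTICS = {"double": lambda v: v * 2, "inc": lambda v: v + 1, "square": lambda v: v * v}

def run(program, x):
    for op in reversed(program):  # innermost operator is applied first
        x = SEMANTICS[op](x)
    return x

def guiding_model(partial_program):
    # Stand-in for a learned guiding function: here, just prefer shorter programs.
    return -len(partial_program)

def guided_search(examples, max_size=4):
    # Best-first search: always expand the partial program the model likes most.
    frontier = [(-guiding_model([]), [])]
    while frontier:
        _, program = heapq.heappop(frontier)
        if all(run(program, x) == y for x, y in examples):
            return program
        if len(program) < max_size:
            for op in SEMANTICS:
                child = program + [op]
                heapq.heappush(frontier, (-guiding_model(child), child))
    return None

print(guided_search([(1, 4), (2, 6)]))  # ['double', 'inc'], i.e. x -> (x + 1) * 2
```

Swapping `guiding_model` for a neural network conditioned on the spec (as DeepCoder-style systems do) changes only the scoring line; the search skeleton stays the same.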

DeepCoder pioneered such a combination of symbolic and statistical techniques in early 2017. The system learns a model that, given a task specification (input-output examples in the original work), predicts a distribution over the operators that may appear in the desired program. This distribution can then be used as a guiding function in any enumerative search.1

A fragment of the deductive search process looking for the most generalizable program that satisfies the given input-output example. At each branching point in the search tree, the current state is fed into a neural model, which estimates the quality of the best program that is likely to be produced from each branch (shown as a green distribution curve; higher shapes correspond to more promising branches). Source: NGDS.


The subsequent research in 2017-18 pushed this idea further, allowing the guidance to happen at every step of the search process instead of just its root. The guidance can now take into account a partial search state instead of just a global task specification. As expected, this greatly improves both synthesis speed (thanks to trimming down the combinatorial explosion of branches) and accuracy (thanks to the exploration of more promising branches). The key take-away: if you can obtain enough data to train a reliable guiding function, there’s no reason not to do it – it will almost certainly help the search. The main caveat here is that invoking a neural network has a non-zero cost, and sometimes (especially near the leaves of the search tree) this can be more expensive than ignoring its prediction and exploring the entire subtree anyway. Also, if the prediction is wrong, there must be a way to detect this, backtrack, and pick a better branch. Thus, careful engineering and something like A*, branch-and-bound, or beam search must be a part of the solution.
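Step-level guidance can be pictured as a beam search over partial programs, where the score looks at the current partial state rather than the global spec alone. In this sketch, a hand-written `score_state` heuristic stands in for the learned model (the operators and names are invented for illustration):

```python
# Beam search with state-dependent guidance: a partial program is scored by how
# close its intermediate outputs are to the target outputs on the examples.
SEMANTICS = {"double": lambda v: v * 2, "inc": lambda v: v + 1, "dec": lambda v: v - 1}

def score_state(values, targets):
    # Stand-in for a learned model: negative distance to the targets.
    return -sum(abs(v - t) for v, t in zip(values, targets))

def beam_synthesize(examples, beam_width=2, max_steps=6):
    inputs = [x for x, _ in examples]
    targets = [y for _, y in examples]
    beam = [([], inputs)]  # (program so far, current intermediate values)
    for _ in range(max_steps):
        candidates = []
        for program, values in beam:
            if values == targets:
                return program
            for op, fn in SEMANTICS.items():
                candidates.append((program + [op], [fn(v) for v in values]))
        # Keep only the most promising partial states -- the "beam".
        candidates.sort(key=lambda pv: score_state(pv[1], targets), reverse=True)
        beam = candidates[:beam_width]
    for program, values in beam:
        if values == targets:
            return program
    return None

print(beam_synthesize([(1, 4), (3, 8)]))  # ['inc', 'double'], i.e. x -> (x + 1) * 2
```

Note how a wrong early scoring merely demotes a branch within the beam instead of discarding it outright, which is exactly the backtracking safety valve discussed above.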

Sketch generation

Human programmers often write a program top-down, first producing a high-level description of the desired program structure, and then concretizing the missing details. In other words, we often first produce a sketch of a program instead of writing the entire program end-to-end. In program synthesis, manually sketching the desired program has long been a technique of choice for many challenging domains. However, what if we allow a model to learn the right sketch?

The idea of sketch generation is to split the program synthesis process (Spec $\to$ Program) into two phases (Spec $\to$ Sketch $\to$ Program). The first phase (sketch generation) is usually automatically learned; the second one (program generation) can either be a different learned model or some kind of search. In either case, in order to train the corresponding model, one defines (a) a language of sketches, (b) an abstraction function that produces a sketch of a given code snippet, and (c) a dataset of sketches (usually automatically rewritten from a dataset of concrete programs) associated with the corresponding task specs.
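To make the three ingredients concrete, here is a toy sketch of the two-phase pipeline. Everything here is invented for illustration: the DSL is a single `(operator, argument)` pair, and a trivial heuristic plays the role of the learned sketch generator:

```python
# Tiny DSL: a program is ("op", arg); a sketch is the same shape with None as a hole.

def run(program, x):
    op, arg = program
    if op == "add":
        return x + arg
    if op == "mul":
        return x * arg
    raise ValueError(op)

def abstraction(program):
    # (b) The abstraction function: keep the operator, replace its argument with a hole.
    op, _ = program
    return (op, None)

def generate_sketch(examples):
    # (a) Stand-in for the learned sketch generator: guess the operator from
    # whether the outputs grow additively or multiplicatively.
    (x1, y1), (x2, y2) = examples[:2]
    return ("add", None) if y1 - x1 == y2 - x2 else ("mul", None)

def fill_sketch(sketch, examples, candidates=range(-10, 11)):
    # Phase 2: a small search over hole fillings, validated on the examples.
    op, _ = sketch
    for arg in candidates:
        if all(run((op, arg), x) == y for x, y in examples):
            return (op, arg)
    return None

examples = [(2, 6), (5, 15)]
sketch = generate_sketch(examples)    # ('mul', None)
print(fill_sketch(sketch, examples))  # ('mul', 3)
```

The payoff is visible even here: phase 2 searches over one hole instead of the whole program space, because the sketch has already committed to the program's shape.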

A two-step program generation process, with separate decoder models for (1) a sketch conditioned on the spec and (2) a program conditioned on the sketch and the spec. Source: Coarse2Fine.


Sketch generation outperforms end-to-end program generation. Intuitively, it is easier for a model to learn high-level patterns of the desired program structure than its fine implementation details. Moreover, filling in implementation details in a given sketch is much easier than coming up with them on the fly as part of whole-program generation. Thus, unsurprisingly, the technique works pretty well on two very different application domains and specification kinds.

Sketch generation was developed independently in Bayou and Coarse2Fine. Bayou takes as input syntactic specs about elements of the desired program (e.g. “must include a call to readLines and an Iterator type”), and produces the most typical Java program that includes these elements. Coarse2Fine takes as input a natural language spec, and produces a snippet that implements it (evaluated on Python and SQL snippets from the Geo, ATIS, Django, and WikiSQL semantic parsing datasets).

Graph neural networks

Graph Neural Networks, or GNNs, have recently emerged as a particularly useful architecture for tasks that involve reasoning over source code, including program analysis and program synthesis. The motivation here is the need to treat a program as an executable object. Most of the ML applications to code before 2017 treated a program either as a sequence of tokens or, at best, as a syntax tree. This captures the natural language-like properties of code, i.e. the patterns that arise in programs thanks to their intent to communicate meaning to other programmers. However, code also has a second meaning – it is a formal executable structure, which communicates a particular behavior to the computer. In other words, NLP-inspired methods, when applied to code, have to infer the semantics of for, if, API chaining patterns, and so on. One would hope that this is learnable by a deep neural network from the ground up, but in practice, it has been exceedingly difficult.

A portion of the data flow graph for a given code snippet, represented with three kinds of auxiliary data flow edges that describe how the variables in the snippet are read and written when the program gets executed. Syntactic AST edges are omitted for clarity. Source: IntelliCode.


A simple way to communicate such information about program semantics to a network is to encode the program’s data flow graph. It enriches the AST with additional edges that describe how the information might flow in a typical program execution. For instance, if a variable x is read somewhere in the program, we can connect it with all possible locations where x could have been last written. The beauty of the method is that it doesn’t care about the semantics of these additional edges, as long as they can be deterministically computed and added to the graph. For practical tasks (like detecting when a variable may be used incorrectly) the program is augmented with about 10 kinds of such semantic edges. This augmented graph is then fed into a GNN encoder, which learns the representations for program variables and locations that might be useful for a downstream task, such as variable misuse detection. This is the approach that worked best for a number of software engineering tasks, and is currently shipping as part of Visual Studio IntelliCode.
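For intuition, here is how the "last write" edges alone could be computed for straight-line code. This is a simplified sketch (the statement encoding is invented for illustration); real implementations work over ASTs and handle branching control flow, where a read can connect to several possible last writes:

```python
# Compute "last written at" edges for a straight-line program given as
# (target, sources) statements. Each read of a variable gets an edge back to
# the statement that last wrote it -- one of the edge kinds a GNN consumes.

def last_write_edges(statements):
    last_write = {}  # variable -> index of the statement that last wrote it
    edges = []       # (reading statement index, variable, writing statement index)
    for i, (target, sources) in enumerate(statements):
        for var in sources:
            if var in last_write:
                edges.append((i, var, last_write[var]))
        last_write[target] = i
    return edges

# a = input(); b = a + 1; a = b * 2; c = a + b
program = [("a", []), ("b", ["a"]), ("a", ["b"]), ("c", ["a", "b"])]
print(last_write_edges(program))
# [(1, 'a', 0), (2, 'b', 1), (3, 'a', 2), (3, 'b', 1)]
```

Note how the second write to `a` redirects the final read: the edge structure encodes execution order, which token sequences and plain ASTs do not.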

What does this have to do with synthesis? Well, once we’ve established GNNs as a good architecture for processing programs,2 it’s only natural to apply them to program synthesis as well. Recently, we’ve tried to do just that. The idea is to frame program synthesis as a top-down generative model where at each step you (a) pick a grammar production to expand the next “hole” in your partial program, (b) use GNN propagation to update the representations of the relevant nodes of the partial program, and (c) refocus on the next hole and repeat until the entire program is complete. The whole process can be conditioned on any specification, but in our initial experiments we’ve focused on the hardest possible setup – infer a missing program subexpression just from its context, i.e. from the surrounding source code alone. That’s hard, especially on “real-life” source code (we applied it to various C# subexpressions from popular GitHub projects). We’ve reached 50% top-1 accuracy, which for this kind of setup, I think, is already pretty remarkable.


Datasets

The recent transition of program synthesis into an AI discipline has had another nice side effect — a flurry of new datasets. Neural networks are data-hungry, so researchers are forced to find problems with tens of thousands of data samples, or to invent new problems where such a dataset can be creatively generated or collected, posing a new challenge for the field. Sometimes even collecting such a dataset can be a genuine ML problem in itself.

A typical problem from the Karel dataset. The system is given two input-output examples $I_1 \to O_1$, $I_2 \to O_2$, and it has to synthesize a program that also passes a held-out test example $\check{I} \to \check{O}$. Every example is a grid environment for the robot, with walls and markers (potentially of different sizes). Source: Neural Program Meta-Induction.


Here are some datasets that the community has lately focused on:

  • Two new semantic parsing (English $\to$ Python) datasets, collected from StackOverflow: StaQC and CoNaLa.
  • Karel, a toy robot programming language that used to be used for introductory programming classes at Stanford. The dataset contains synthetically generated programs, each with 5 input-output examples.
  • NAPS, a really young and quite challenging dataset containing preprocessed problems from algorithmic competitions along with imperative descriptions and examples.
  • WikiSQL, a large semantic parsing dataset (English $\to$ limited SQL). It’s a bit controversial due to its lack of program variety and semi-artificial language generation, but its size and variety of tables/questions have generated a lot of attention since Salesforce released it a year ago.
  • A standardized collection of 10 semantic parsing datasets from NLP and DB communities, preprocessed into SQL instead of various logical forms.

Notable mentions

An end-to-end differentiable version of FlashFill that’s trained on a large volume of synthetically generated tasks and can adapt to small noise in the input-output examples. The core architecture is surprisingly simple, but works well for its DSL. Demonstrates an unreasonable effectiveness of attention 🙂

In addition to a variant of statistical branch prioritization, this paper introduces the notion of conflict-driven search from SAT/SMT solving to the program synthesis world. Loosely speaking, in modern SAT/SMT search algorithms, when a search branch results in a failure, we can learn a reason for this failure – a new conflict clause (i.e., a particular contradiction) that can be added to the working set of constraints to avoid a similar mistake in the future. In contrast, search algorithms in program synthesis couldn’t do that — until now. Now, for instance, an input-output example $[1, 2, 3] \to [2, 4]$ not only immediately eliminates any program of form x => map(x, ...) (because map can’t change the length of its input list), but also generates a constraint that eliminates x => reverse(x, ...) and x => sort(x, ...) for the same reason.
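In sketch form, the learned conflict is a predicate over operator *properties* rather than a fact about `map` specifically, so one failure prunes a whole class of operators. This is an illustration of the idea with invented names, not the paper's actual algorithm:

```python
# Conflict-driven pruning, heavily simplified: when a candidate root operator
# fails, derive the property responsible for the failure and use it to prune
# every other operator sharing that property, without re-testing them.

PROPERTIES = {
    "map":     {"length_preserving": True},
    "reverse": {"length_preserving": True},
    "sort":    {"length_preserving": True},
    "filter":  {"length_preserving": False},
    "take":    {"length_preserving": False},
}

def learn_conflict(op, example):
    inp, out = example
    if PROPERTIES[op]["length_preserving"] and len(inp) != len(out):
        # The conflict clause: no length-preserving root can satisfy this example.
        return lambda other: PROPERTIES[other]["length_preserving"]
    return None

def candidate_roots(example):
    conflicts, survivors = [], []
    for op in PROPERTIES:
        if any(clause(op) for clause in conflicts):
            continue                  # pruned by a previously learned conflict
        clause = learn_conflict(op, example)
        if clause is not None:
            conflicts.append(clause)  # one failure eliminates a whole class
        else:
            survivors.append(op)
    return survivors

print(candidate_roots(([1, 2, 3], [2, 4])))  # ['filter', 'take']
```

Here the failure of `map` on $[1, 2, 3] \to [2, 4]$ eliminates `reverse` and `sort` without ever testing them, mirroring how conflict clauses accumulate in a SAT/SMT solver's working set.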

A neat paper that introduces a novel program synthesis problem (recovering $\mathrm{\LaTeX}$ programs from hand-drawn mock-ups) that nicely combines perceptual and symbolic techniques. There are many technical challenges: dealing with noise in the input, recovering the right primitives, optimizing the resulting constraint problem with a learned policy, and so on.

Have you ever wanted to understand the behavior of your learned RL policy? Or maybe not understand it 100%, but at least get some insight into its logic, or even replace it with something more interpretable and generalizable? Turns out, program synthesis can help. The method is straightforward and beautiful in its simplicity: train a policy to, say, drive a race car, and then search for a program in an appropriate DSL that approximates the behavior of this policy. (The devil, of course, is in the details: DSL design, good sketches, and a method to intelligently sample examples from the policy are all key to making this work.) The program is not as perfectly optimized for the fastest track completion as a neural policy, but it drives more smoothly, provably generalizes to other tracks, and, most importantly, you can read it. All these benefits become apparent in a video.

  1. With sincere apologies to Yoshua Bengio, who wasn’t asked for consent to the use of his name in the illustrations. We all know how hard it is to make new figures, so the choices you make in a preprint often live on in posters, talks, and blog posts. 🙂
  2. I’ve heard people say that GNNs are to programs what CNNs are to images and LSTMs are to text — if not necessarily a silver bullet, then at least a baseline architecture of choice. I personally think it’s not as clear-cut yet, but explicit reasoning over graphs and relations should definitely be involved in most AI tasks out there.

Say no to source-controlled credentials: .NET + RSA encryption


Let’s start with an axiom: you should never store any credentials in your source code in plain text, especially if the codebase is source-controlled. If you are doing it, there’s something seriously wrong with your architecture.1

Here’s one possible solution to this problem. Presumably you already have an OpenSSH pair of public/private keys. It will serve us nicely in this case:

  1. Generate a PEM version of your OpenSSH public key:

    $ openssl rsa -in ~/.ssh/id_rsa -pubout > ~/.ssh/
    Enter pass phrase for ~/.ssh/id_rsa:
    writing RSA key
  2. Write your password in a plaintext file (say, admin_pass), and encrypt it (say, into admin_pass.enc):2

    cat admin_pass | openssl rsautl -encrypt -pubin -inkey ~/.ssh/ > admin_pass.enc
  3. Delete admin_pass from your hard drive.3

The resulting admin_pass.enc can be safely checked into the repository: one needs access to your private key in order to decrypt it:

cat admin_pass.enc | openssl rsautl -decrypt -inkey ~/.ssh/id_rsa

However, you probably don’t want to do even that. The best policy is to decrypt the credentials directly in your app, securely and in-memory. Presumably your favorite programming language/platform has some API for securely-stored memory objects (usually backed by the OS/hardware cryptography system). I am going to show an implementation for .NET below.

First, in order to use the OpenSSH keys in .NET infrastructure, we need to convert them both to PEM, and also generate the corresponding certificate and the PFX file. Here’s the process using my website and email as an example for the generated certificate (the exact values are not really important for our intended purpose):

$ openssl rsa -in ~/.ssh/id_rsa > ~/.ssh/id_rsa.pem
Enter pass phrase for ~/.ssh/id_rsa:
writing RSA key

$ openssl req -nodes -x509 -days 3650 -subj '/' -new -key ~/.ssh/id_rsa.pem -out ~/.ssh/certificate.crt

$ openssl pkcs12 -export -out ~/.ssh/certificate.pfx -inkey ~/.ssh/id_rsa.pem -in ~/.ssh/certificate.crt

Now let’s use this certificate to decrypt the password in C#:

public static SecureString DecryptCredentials(string certificateFile,
                                              SecureString pkPassphrase,
                                              string encryptedFile) {
    string home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
    string sshDir = Path.Combine(home, ".ssh");
    var cert = new X509Certificate2(Path.Combine(sshDir, certificateFile), pkPassphrase,
                                    X509KeyStorageFlags.DefaultKeySet);
    using (var rsa = cert.PrivateKey as RSACryptoServiceProvider)
    using (var credentialStream = new FileStream(encryptedFile, FileMode.Open))
    using (var encrypted = new MemoryStream()) {
        credentialStream.CopyTo(encrypted);
        // Unfortunately, RSACryptoServiceProvider does not have a Decrypt() overload that
        // returns a SecureString (no idea why).
        // Because of that, we have to store the decrypted data briefly
        // in garbage-collected memory, and manually convert it into a secure
        // representation as soon as possible.
        byte[] decrypted = rsa.Decrypt(encrypted.ToArray(), false);
        var result = new SecureString();
        foreach (char c in Encoding.ASCII.GetChars(decrypted))
            result.AppendChar(c);
        Array.Clear(decrypted, 0, decrypted.Length);
        GC.Collect();  // get rid of the decrypted bytes array
        return result;
    }
}

DecryptCredentials("certificate.pfx", Utils.SecurePrompt(), "admin_pass.enc");

The code uses a simple SecurePrompt function, which prompts the user for the SSH passphrase, encrypting one character at a time in a SecureString:

public static SecureString SecurePrompt(string prompt = "> ") {
    Console.Write(prompt);
    var password = new SecureString();
    while (true) {
        var input = Console.ReadKey(true);
        switch (input.Key) {
            case ConsoleKey.Enter:
                return password;
            case ConsoleKey.Backspace:
                if (password.Length > 0) {
                    password.RemoveAt(password.Length - 1);
                    Console.Write("\b \b");
                }
                break;
            default:
                password.AppendChar(input.KeyChar);
                break;
        }
    }
}

  1. And with your undergraduate CS Security class, but I digress.
  2. rsautl has a maximum size limit on the data being encrypted, since it applies the RSA algorithm directly (as opposed to, say, AES). The limit is roughly equal to “[your key size] - [padding]” (the default PKCS#1 v1.5 padding takes 11 bytes), which for common key sizes gives 200-400 bytes. Hopefully your password is pretty long, but it probably is not that long 🙂
  3. Alternatively, you could avoid creating it in the first place, and simply cat the password from stdin.

Substitution mnemonic


(I overheard this at UW, from a graduate student who told me he doesn’t know the original author of this mnemonic either. Please do not attribute it to either of us 😛).

In PL and formal logic, there exists a common notation, which I personally consider the worst possible notation choice of all time. Let’s say you have an expression $E$, in which you want to replace every occurrence of a variable $x$ (or any other fixed subexpression, for that matter) with another variable $y$ (again, or any other expression). This operation is called substitution.1 In formal logic, $\lambda$-calculus, etc. there exists a bewildering soup of notations representing this simple transformation:

  • $E[x \to y]$ (“$x$ is transformed into $y$”)
  • $E[x \gets y]$ (“$x$ is replaced with $y$”)
  • $E[x := y]$ (“assign $y$ to every occurrence of $x$”)
  • $E[y/x]$ (WTF?)

The first two (and their mirrored alternatives) are confusing on their own – you never know which substitution direction the arrow represents (is it “replace $x$ with $y$” or “replace $y$ with $x$”?). But the last one is notoriously ridiculous in that regard. It is too symmetric. How am I supposed to remember which variable substitutes the other if the only thing denoting this relationship is a meaningless slash character in between? Some say this notation is the easiest to typeset – which is not true, because the third notation also doesn’t use any characters beyond standard ASCII, and it is unambiguous. Anyway, I’ve always tried to avoid the slash notation whenever I could.
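If code helps pin the direction down: $E[y/x]$, like $E[x := y]$, means “replace $x$ with $y$”, as in this naive substitution over expression trees (which deliberately ignores the free/bound-variable subtleties):

```python
# Naive capture-unaware substitution: subst(E, x, y) computes E[y/x],
# i.e. every occurrence of x in E is replaced with y.

def subst(expr, x, y):
    if expr == x:
        return y
    if isinstance(expr, tuple):
        return tuple(subst(e, x, y) for e in expr)
    return expr

# E = (x + x) * z, encoded as a tree: ("*", ("+", "x", "x"), "z")
E = ("*", ("+", "x", "x"), "z")
print(subst(E, "x", "y"))  # ('*', ('+', 'y', 'y'), 'z')
```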

However, now I know a beautiful mnemonic, which breaks the “almost-symmetrical” relationship between $x$ and $y$:

“Think of it as $y$ falling over and squishing the $x$.”

  1. The actual formal definition of substitution is trickier, because you have to take into account which variables are free in $E$, and which are bound. The details are outside the scope of this post.

On performance in .NET


Formerly, when someone asked me “How to make my C# program faster?”, my go-to advice was:

  1. Improve your computational complexity.
  2. Switch to a native language.

Of these suggestions, step #1 should take care of performance in most situations, if done properly. However, sometimes it is not enough. For example, you are writing a user-facing real-time performance-critical application, such as the C# compiler, and you need to look not only at the average performance, but at the 99th percentile. This is where step #2 used to take over… but not anymore. Currently, my go-to advice looks like this:

  1. Improve your computational complexity.
  2. Read “Writing High-Performance .NET Code”, Roslyn team blog, and the rest of the links provided below.
  3. Switch to a native language.

Measurement tools

“Writing High-Performance .NET Code” should be your desk reference for any performance-critical .NET application. The most important knowledge it provides:

  • What profiling tools exist in the .NET world (PerfView, ETW, CLR Profiler, dotTrace, dotMemory, Windbg…), how to use each of them, how to answer specific questions with a combination of these tools, which metrics you should look at and why.
  • How exactly .NET GC works and why you should care. How to help .NET GC by managing your allocations and object lifecycle carefully.
  • JIT mechanics, NGEN, when to use each of them, startup rates.

In other words, when you have a specific problem with your .NET application, this book will show you how to use the tools to locate what exactly is responsible for the bottleneck, how often it occurs, and so on. Often you will be surprised, because (assuming you’ve already gone through step #1) your greatest bottleneck is going to be either GC, allocations, JIT, boxing, or incorrect usage of the .NET API. All of these subtle engineering details are quite unobvious if you don’t train yourself to keep them in mind.

Performance-minded coding

When you know the consequence of the problem (e.g. “my app spends 20% of the time in GC”, which arises because of “there’s this pinned handle that points to a long retention path that is being walked on every gen-2 collection”), the next step is to eliminate it. The series of links below (along with the accompanying chapters in “Writing High-Performance .NET Code”) explains the techniques for doing it. If you know that this particular path in your code is going to be performance-critical up to the 99th percentile (never optimize prematurely!), you should really think of these techniques as guiding principles to keep in mind every second during your coding process.

Here are some must-read articles for any .NET programmer that works with performance-critical code:

I’m going to list some of the mentioned techniques below, briefly. Again, the book and the sources above have already done it much better than me.

Avoid allocations

For every allocation you make, the .NET GC will spend a little more time on every future collection. If this instance is relatively long-lived, it’s going to eventually move to gen-1 and gen-2, and greatly affect the future GC times. But even if it is short-lived, the performance of gen-0 collection is going to matter in your higher percentiles anyway.

Some of the common sources of hidden allocations are:

  • LINQ queries: they allocate enumerators and lambda-representing objects that capture references in a closure. Also, eventually you need to convert a query to a list/array, and it’s going to expand itself several times in the process. Instead you should allocate a list/array of a known capacity beforehand, and do all the required work directly, without allocating hidden LINQ enumerators.
  • Lambdas: if they capture any references from the outer scope, they will be rewritten to compiler-generated objects. If your lambda operates in a generic context this object will not even be cached in place, and will be re-created every time. Consider refactoring your logic in such a way that lambda becomes non-capturing (such lambdas compile down to static methods), or get rid of the lambda altogether.

    UPD: This behavior will change in the upcoming C# 6. In Roslyn, lambda functions now always compile down to instance calls, which yields up to a 35% performance gain in common scenarios.

  • Strings: because they are immutable, every string modification allocates a new string object. This is most subtle in calls like string.Trim(' '), which actually returns a new string. Instead use index-based arithmetic on hot paths. For case-insensitive comparison, use the corresponding string.Compare overload instead of converting all your strings to lowercase with string.ToLower. Also, string concatenation for simple cases is much faster than string formatting.

  • Invoking a method with params always allocates an array, possibly of zero length. Consider creating specialized overloads for most common numbers of parameters.

Use object pooling

If you use some objects often but temporarily, consider allocating a pool of objects of this type once, and reuse them where necessary. For example, the interface that Roslyn uses for StringBuilders looks somewhat like this:

internal static class StringBuilderPool {
    public static StringBuilder Allocate() {
        return SharedPools.Default<StringBuilder>().AllocateAndClear();
    }
    public static void Free(StringBuilder builder) {
        SharedPools.Default<StringBuilder>().ClearAndFree(builder);
    }
    public static string ReturnAndFree(StringBuilder builder) {
        string result = builder.ToString();
        Free(builder);
        return result;
    }
}

Which is then later used in the following manner:

public override string GetTextBetween(SyntaxToken token1, SyntaxToken token2) {
    var builder = StringBuilderPool.Allocate();
    CommonFormattingHelpers.AppendTextBetween(token1, token2, builder);
    return StringBuilderPool.ReturnAndFree(builder);
}

Compile your reflection calls into delegates

Method calling through Reflection is horrendously slow. If you know the signature of your method beforehand (and for some reason cannot agree on a common interface with the third party who implements the external method), you can use LINQ Expression Trees to compile your MethodInfo to a strongly-typed delegate. This needs to be done only once, after that you can keep a reference to the delegate, and call it whenever necessary with standard C# syntax.

The technique is described at Jon Skeet’s blog. Here’s how I used it in one of my recent projects:

public static TDelegate ToDelegate<TDelegate>(this MethodInfo method) {
    MethodInfo invokeMethod = typeof(TDelegate).GetMethod("Invoke");
    ParameterExpression[] @params = invokeMethod.GetParameters()
        .Select((p, i) => Expression.Parameter(p.ParameterType, "arg" + i))
        .ToArray();
    MethodCallExpression methodCall = Expression.Call(method, @params);
    return Expression.Lambda<TDelegate>(
        Expression.Convert(methodCall, invokeMethod.ReturnType), @params).Compile();
}

Avoid boxing

Apart from obvious cases where people unnecessarily convert value types into objects (such as using now deprecated System.Collections namespace), there are some subtler ones:

  • String.Format boxes its arguments
  • Structs are boxed when used through their implemented interface. In particular, this happens when you use LINQ: List<T>.Enumerator is a struct, but LINQ methods treat it as IEnumerator<T>
  • Object.Equals(object other) is the source of all evil
  • Do not ever concatenate a string with anything other than a string
  • Calling GetHashCode on an enum causes boxing. Convert the enum instance to int first.

Use specialized collections for smaller sizes

An array outperforms a dictionary up to a certain number of elements, even though its lookup is $\mathcal{O}(n)$. The reason is all the boilerplate overhead that accompanies a dictionary (hashing, bucket exploration, etc.). The exact tipping point depends on your particular code and has to be measured experimentally.
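The effect is easy to reproduce outside .NET too. Here is a Python sketch of the experiment; the crossover point you measure there says nothing about .NET's, which is exactly the point — measure in your own environment:

```python
import timeit

# Linear scan over a small list of (key, value) pairs vs. a dict lookup.
# For tiny collections the scan's lack of hashing overhead can win; the exact
# crossover depends on the runtime (and differs between CPython and .NET).

def scan_lookup(pairs, key):
    for k, v in pairs:
        if k == key:
            return v
    raise KeyError(key)

pairs = [("k%d" % i, i) for i in range(4)]
table = dict(pairs)

assert scan_lookup(pairs, "k2") == table["k2"] == 2

for name, f in [("scan", lambda: scan_lookup(pairs, "k2")),
                ("dict", lambda: table["k2"])]:
    print(name, timeit.timeit(f, number=100_000))
```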

Know when to use structs and when not to

Structs give better performance than classes when they are small, consist of primitive types, and are not copied around often. The overhead of a class (object header, vtable) is noticeable during garbage collection. However, class instances are copied faster, because copying one only copies a pointer. On the other hand, an array of class instances will cause you a nightmare of cache misses (see reference locality). An array of structs lies neatly in memory as a single block, but would be a nightmare to copy as a function result. Bottom line: know your use case and measure everything.

Note: you should always override Equals and GetHashCode and implement IEquatable<T> for your structs, because the default implementations use reflection to enumerate over the fields.

Compile your regular expressions

If you are going to reuse the same regular expression often, build it with the RegexOptions.Compiled flag. Do not ever use the static methods of the Regex class; they are horrendously slow.
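In practice this means caching the compiled instance in a static field (the pattern and class name below are my own example):

```csharp
using System.Text.RegularExpressions;

static class Validators
{
    // Built once and reused; RegexOptions.Compiled emits IL for the
    // matcher instead of interpreting the pattern on every call.
    private static readonly Regex Identifier =
        new Regex(@"^[A-Za-z_][A-Za-z0-9_]*$", RegexOptions.Compiled);

    public static bool IsIdentifier(string s)
    {
        return Identifier.IsMatch(s);
    }
}
```

Compilation has a one-time cost at construction, so it only pays off for patterns you match against many times.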

If your regular expressions do not require non-regular features such as lookaround or backreferences, consider using re2 instead of the default regex engine. It might yield up to a 10x speedup. There is a .NET wrapper available, called Re2.Net.

I will stop here because I cannot possibly outline all of the techniques. Again, for the complete picture read the book, watch the talk, read the articles referenced above, know your tools, and measure everything carefully. Also, feel free to comment here if you want to contribute more advice!

Record types and pattern matching in C#


For those of you who do not follow the Roslyn project: an active discussion is currently underway about record classes (read: algebraic data types) and pattern matching in a future version of C#. Neal Gafter wrote a draft spec and a prototype implementation, both of which are rapidly evolving at this very moment. The prototype is not yet publicly available, but Neal promises that it will be shortly. UPD: The prototype is now published!

What I like most about this undertaking is that they didn’t just graft pieces of F#/Scala/Haskell onto C#. That would feel foreign, since those languages have different styles and design principles. Instead, they designed the feature in such a way that it feels like C# rather than a foreign addition. Kudos to the language design team.

If you have any ideas and/or suggestions, feel free to join the discussion!

Roslyn and my favorite persistent data structure


This post is not propaganda about the perks and benefits of immutable persistent data structures. People familiar with the concept already know when they can tremendously help your code and when they cannot. And for those who aren’t familiar, I cannot give a better introduction than these three links:

In this post I want to talk about my favorite persistent data structure, or, if you wish, the one I consider the most beautiful. For a long time my answer was the Zipper1. However, two years ago this question popped up on StackOverflow:

Are Roslyn SyntaxNodes reused?

The team who built Roslyn, the new compiler infrastructure for the C# and VB.NET languages, had to come up with an AST data structure satisfying a set of mutually incompatible requirements. The tree had to (a) provide cheap access to each node’s parent; (b) maintain an up-to-date character span in the document for each node; and (c) persistently reuse most of the nodes on every update. As Eric Lippert describes, they solved this problem by maintaining two trees: one is immutable and persistent, while the other holds parent references and is re-built on demand, and only partially, for each query. The latter tree forms a facade around the former, and together they satisfy all of the requirements at once.
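The core of the idea can be sketched in a few dozen lines. This is a toy model, not Roslyn’s actual implementation (the class and member names here are mine): green nodes are immutable, parent-free, and know only their width, so they can be shared across tree versions; red nodes are thin facades created on demand that add a parent pointer and an absolute position.

```csharp
using System;

// "Green" node: immutable, no parent, no absolute position.
// Stores only its width, so it can be reused at any position.
class GreenNode
{
    public readonly string Kind;
    public readonly int Width;
    public readonly GreenNode[] Children;

    public GreenNode(string kind, int width, params GreenNode[] children)
    {
        Kind = kind;
        Width = width;
        Children = children;
    }

    // Persistent update: rebuild only this node, share all other children.
    public GreenNode WithChild(int index, GreenNode newChild)
    {
        var copy = (GreenNode[])Children.Clone();
        copy[index] = newChild;
        int width = 0;
        foreach (var c in copy) width += c.Width;
        return new GreenNode(Kind, width, copy);
    }
}

// "Red" node: a facade over a green node that adds the parent reference
// and the absolute start position. Built lazily, per query.
class RedNode
{
    public readonly GreenNode Green;
    public readonly RedNode Parent;
    public readonly int Position;

    public RedNode(GreenNode green, RedNode parent, int position)
    {
        Green = green;
        Parent = parent;
        Position = position;
    }

    public RedNode GetChild(int index)
    {
        // Absolute positions are derived by summing preceding widths.
        int pos = Position;
        for (int i = 0; i < index; i++) pos += Green.Children[i].Width;
        return new RedNode(Green.Children[index], this, pos);
    }
}
```

Editing a subtree rebuilds only the green spine above the change, while every untouched green node is shared between the old and new versions; the red layer is simply thrown away and rebuilt lazily where a query needs it.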

Eric’s post is a great read, and I encourage everyone to look at his description. Because Roslyn is now open-source, the entire implementation is available on GitHub (hint: search for SyntaxNode/RedNode/GreenNode). For a long time I was also surprised that the team never published anything that formally describes their idea. However, today I found out that they at least filed a public patent. Its text contains many more interesting explanations and details on how this combination of two parse trees works as one unified persistent parse tree.

What is your favorite persistent data structure?

  1. By the way, did you know that a zipper is a derivative of the surrounding container in the semiring of algebraic data types? This is a beautiful connection that also relates to generating functions, Taylor series, linear logic, and information theory. See, for example, [1], [2], [3], [4], [5].
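To make the footnote concrete: viewing types as formal power series, the list type satisfies $L(X) = 1 + X \cdot L(X)$, that is, $L(X) = \frac{1}{1-X}$. Differentiating gives

$$\frac{\partial L}{\partial X} = \frac{1}{(1-X)^2} = L(X) \cdot L(X),$$

so a one-hole context in a list is a pair of lists (the elements before and after the hole), which is exactly the classic list zipper.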

Random thoughts on Swift


Based on my quick skimming session through the language guide.

Kind of expected from a modern language; I would be surprised if these weren’t included:

  • Closures
  • Type inference
  • First-class collections and tuples
  • Generics with all kinds of boundary restrictions
  • Interoperability with existing libraries (in this case with C and Obj-C).

Surprisingly included, and it’s nice:

  • ADTs and pattern matching
  • First-class Maybe monad
  • Bret Victor-style interactive IDE (let me make a sincere bow to Apple folks at this point)
  • Opt-in mutability
  • Lots of small inspirations from D, C#, Scala, etc.

Disappointing:
  • Reference counting instead of garbage collection. Cannot detect reference cycles, ergo manual weakref management.
  • I’m not quite sure how to combine “protocols” (read: interfaces) and “extensions” (read: mixins) into full-fledged traits and typeclasses. For example, how do I define a default method implementation for a protocol?

    (Also, why the heck do you need so many new names for existing concepts?!)

  • Optional types are not quite Maybe: you can force the “unwrapping” with the ! operator (naturally, with a possible runtime exception). I foresee every single iOS developer just blindly doing it with every external optional they get.

  • Extremely closed infrastructure: iOS/MacOS only, proprietary compiler, etc. Hell, you need an iTunes account to even read language documentation.

  • Absolutely no first-class concurrency treatment. Moreover, there isn’t even a library-level treatment of concurrency (and even that would not be enough).

  • No stream processing sugar (read: comprehensions/LINQ). Well, at least they have map and reduce in the standard library, but that’s about it.

  • No private members.

Not sure how I feel about:

  • No exceptions. If you gave us some other monadic alternative, I might’ve put it in the “nice” section.
  • Some magic in interoperability. Apparently, they wrote “external parameter names” for many standard library functions and forgot to show them to legacy Objective-C programmers.
  • The fact that their compiler cannot resolve recursive ADT definitions, even though it supposedly should.

Interesting. Let’s see how it flies.