## 2021

### Programming Puzzles NeurIPS 2021: Datasets & Benchmarks

We introduce a new type of programming challenge called programming puzzles, as an objective and comprehensive evaluation of program synthesis, and release an open-source dataset of Python Programming Puzzles (P3). Each puzzle is defined by a short Python program $f$, and the goal is to find an input $x$ which makes $f$ output “True”. The puzzles are objective in that each one is specified entirely by the source code of its verifier $f$, so evaluating $f(x)$ is all that is needed to test a candidate solution $x$. They do not require an answer key or input/output examples, nor do they depend on natural language understanding. The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping. We develop baseline enumerative program synthesis and GPT-3 solvers that are capable of solving easy puzzles – even without access to any reference solutions – by learning from their own past solutions. Based on a small user study, we find puzzle difficulty to correlate between human programmers and the baseline AI solvers.

### KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers ACL-IJCNLP 2021

The goal of database question answering is to enable natural language querying of real-life relational databases in diverse application domains. Recently, large-scale datasets such as Spider and WikiSQL facilitated novel modeling techniques for text-to-SQL parsing, improving zero-shot generalization to unseen databases. In this work, we examine the challenges that still prevent these techniques from practical deployment. First, we present KaggleDBQA, a new cross-domain evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions. Second, we re-examine the choice of evaluation tasks for text-to-SQL parsers as applied in real-life settings. Finally, we augment our in-domain evaluation task with database documentation, a naturally occurring source of implicit domain knowledge. We show that KaggleDBQA presents a challenge to state-of-the-art zero-shot parsers but a more realistic evaluation setting and creative use of associated database documentation boosts their accuracy by over 13.2%, doubling their performance.

### Structure-Grounded Pretraining for Text-to-SQL NAACL 2021

Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel pretraining tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set Spider-Realistic based on Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. StruG brings significant improvement over BERT-Large in all settings. Compared with existing pretraining methods such as GRAPPA, StruG achieves similar performance on Spider, and outperforms all baselines on more realistic sets. All the code and data used in this work will be open-sourced to facilitate future research.

### SCoRe: Pre-Training for Context Representation in Conversational Semantic Parsing ICLR 2021

Also appears as a poster at the Computer-Assisted Programming (CAP) workshop at NeurIPS 2020.

Conversational Semantic Parsing (CSP) is the task of converting a sequence of natural language queries to formal language (e.g., SQL, SPARQL) that can be executed against a structured ontology (e.g. databases, knowledge bases). To accomplish this task, a CSP system needs to model the relation between the unstructured language utterance and the structured ontology while representing the multi-turn dynamics of the dialog. Pre-trained language models (LMs) are the state-of-the-art for various natural language processing tasks. However, existing pre-trained LMs that use language modeling training objectives over free-form text have limited ability to represent natural language references to contextual structural data. In this work, we present SCORE, a new pre-training approach for CSP tasks designed to induce representations that capture the alignment between the dialogue flow and the structural context. We demonstrate the broad applicability of SCORE to CSP tasks by combining SCORE with strong base systems on four different tasks (SPARC, COSQL, MWOZ, and SQA). We show that SCORE can improve the performance over all these base systems by a significant margin and achieves state-of-the-art results on three of them.

## 2020

### Learning to Infer Run-Time Invariants from Source Code CAP @ NeurIPS 2020

Source code is meant to be executed, as well as read. Developers reason about its run-time properties by inferring invariants, which constrain program behavior; but they rarely encode these explicitly, so machine-learning methods don’t have much aligned data to learn from. We propose an approach that adapts cues within existing if-statements regarding explicit run-time expectations to generate aligned datasets of code and implicit invariants. We also propose a contrastive loss to inhibit generation of illogical invariants. Our model learns to infer a wide vocabulary of invariants for arbitrary code, which can be used to detect and repair real bugs. This is complementary to trace-based methods, such as Daikon. Our results confirm that neural models can learn run-time expectations directly from code.

### ALFRED Speaks: Automatic Instruction Generation for Egocentric Skill Learning EVAL @ ECCV 2020

Embodied agents need to not only be able to follow instructions, but also give them. Previous work has focused on simple navigation instructions and on language generation targeted at embodied agents, not humans. We take a different approach and target the creation of EXPLAINER modules that eschew low-level instruction in favor of more natural goal-directed language. We achieve this in part with novel fine-grained state-tracking to minimize extraneous details without forgetting core properties. In addition to this new model formulation, we propose a new Object Capture Test to evaluate the accuracy and goal-directedness of these EXPLAINER-generated instructions. We find that the EXPLAINER generates more fluent, accurate, and goal-directed instructions compared to a naïve sequence-to-sequence generative model.

### Neuro-Symbolic Visual Reasoning: Disentangling “Visual” from “Reasoning” ICML 2020

Visual reasoning tasks such as visual question answering (VQA) require an interplay of visual perception with reasoning about the question semantics grounded in perception. Various benchmarks for reasoning across language and vision like VQA, VCR and more recently GQA for compositional question answering facilitate scientific progress from perception models to visual reasoning. However, recent advances are still primarily driven by perception improvements (e.g., scene graph generation) rather than reasoning. Neuro-symbolic models such as Neural Module Networks bring the benefits of compositional reasoning to VQA, but they are still entangled with visual representation learning, and thus neural reasoning is hard to improve and assess on its own.

To address this, we propose (1) a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception, and (2) a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. To this end, we introduce a differentiable first-order logic formalism for VQA that explicitly decouples question answering from visual perception. On the challenging GQA dataset, this approach is competitive with non-symbolic neural models while also interpretable by construction and composable with arbitrary pre-trained visual representation learning.

### Learning Web-based Procedures by Reasoning over Explanations and Demonstrations in Context ACL 2020

We explore learning web-based tasks (such as sharing a post on social media or forwarding an email) from a human teacher through natural language explanations and a single demonstration. Our approach investigates a new direction for semantic parsing that models explaining a demonstration in a context, rather than mapping explanations to demonstrations. By leveraging the idea of inverse semantics from program synthesis to reason backwards from observed demonstrations, we ensure that parsed logical forms are consistent with executable actions in the context. We present a dataset of explanations paired with demonstrations for web-based tasks. Our methods show better task completion rates than a supervised semantic parsing baseline (40% relative improvement on average), and are competitive with exploration-and-demonstration based methods, while requiring substantially less supervision. In learning to align explanations with demonstrations, basic properties of natural language syntax emerge as learned behavior. This provides an interesting example of language acquisition from grounded contexts without any linguistic annotation.

### RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers ACL 2020

When translating natural language questions into SQL queries to answer questions from a database, contemporary semantic parsing models struggle to generalize to unseen database schemas. The generalization challenge lies in (a) encoding the database relations in an accessible way for the semantic parser, and (b) modeling alignment between database columns and their mentions in a given query. We present a unified framework, based on the relation-aware self-attention mechanism, to address schema encoding, schema linking, and feature representation within a text-to-SQL encoder. On the challenging Spider dataset this framework boosts the exact match accuracy to 57.2%, surpassing its best counterparts by 8.7% absolute improvement. Further augmented with BERT, it achieves the new state of-the-art performance of 61.9% on the Spider leaderboard. In addition, we observe qualitative improvements in the model’s understanding of schema linking and alignment.

## 2019

### Program Synthesis and Semantic Parsing with Learned Code Idioms NeurIPS 2019

Program synthesis of general-purpose source code from natural language specifications is challenging due to the need to reason about high-level patterns in the target program and low-level implementation details at the same time. In this work, we present Patois, a system that allows a neural program synthesizer to explicitly interleave high-level and low-level reasoning at every generation step. It accomplishes this by automatically mining common code idioms from a given corpus, incorporating them into the underlying language for neural synthesis, and training a tree-based neural synthesizer to use these idioms during code generation. We evaluate Patois on two complex semantic parsing datasets and show that using learned code idioms improves the synthesizer’s accuracy.

### Are My Invariants Valid? A Learning Approach arXiv preprint

Ensuring that a program operates correctly is a difficult task in large, complex systems. Enshrining invariants – desired properties of correct execution – in code or comments can support maintainability and help sustain correctness. Tools that can automatically infer and recommend invariants can thus be very beneficial. However, current invariant-suggesting tools, such as Daikon, suffer from high rates of false positives, in part because they only leverage traced program values from available test cases, rather than directly exploiting knowledge of the source code per se. We propose a machine-learning approach to judging the validity of invariants, specifically of method pre- and post-conditions, based directly on a method’s source code. We introduce a new, scalable approach to creating labeled invariants: using programs with large test-suites, we generate Daikon invariants using traces from subsets of these test-suites, and then label these as valid/invalid by cross-validating them with held-out tests. This process induces a large set of labels that provide a form of noisy supervision, which is then used to train a deep neural model, based on gated graph neural networks. Our model learns to map the lexical, syntactic, and semantic structure of a given method’s body into a probability that a candidate pre- or post-condition on that method’s body is correct and is able to accurately label invariants based on the noisy signal, even in cross-project settings. Most importantly, it performs well on a hand-curated dataset of invariants.

### Generative Code Modeling with Graphs ICLR 2019

Generative models for source code are an interesting structured prediction problem, requiring to reason about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. The generative procedure interleaves grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.

## 2018

### IncSQL: Training Incremental Text-to-SQL Parsers with Non-Deterministic Oracles arXiv preprint

We present a sequence-to-action parsing approach for the natural language to SQL task that incrementally fills the slots of a SQL query with feasible actions from a pre-defined inventory. To account for the fact that typically there are multiple correct SQL queries with the same or very similar semantics, we draw inspiration from syntactic parsing techniques and propose to train our sequence-to-action models with non-deterministic oracles. We evaluate our models on the WikiSQL dataset and achieve an execution accuracy of 83.7% on the test set, a 2.1% absolute improvement over the model trained with traditional static oracles assuming a single correct target SQL query. When further combined with the execution-guided decoding strategy, our model sets a new state-of-the-art performance at an execution accuracy of 87.1%.

### Robust Text-to-SQL Generation with Execution-Guided Decoding arXiv preprint

We consider the problem of neural semantic parsing, which translates natural language questions into executable SQL queries. We introduce a new mechanism, execution guidance, to leverage the semantics of SQL. It detects and excludes faulty programs during the decoding procedure by conditioning on the execution of partially generated program. The mechanism can be used with any autoregressive generative model, which we demonstrate on four state-of-the-art recurrent or template-based semantic parsing models. We demonstrate that execution guidance universally improves model performance on various text-to-SQL datasets with different scales and query complexity: WikiSQL, ATIS, and GeoQuery. As a result, we achieve new state-of-the-art execution accuracy of 83.8% on WikiSQL.

### Execution-Guided Neural Program Decoding NAMPI 2018

We present a neural semantic parser that translates natural language questions into executable SQL queries with two key ideas. First, we develop an encoder-decoder model, where the decoder uses a simple type system of SQL to constraint the output prediction, and propose a value-based loss when copying from input tokens. Second, we explore using the execution semantics of SQL to repair decoded programs that result in runtime error or return empty result. We propose two model-agnostics repair approaches, an ensemble model and a local program repair, and demonstrate their effectiveness over the original model. We evaluate our model on the WikiSQL dataset and show that our model achieves close to state-of-the-art results with lesser model complexity.

### FlashProfile: A Framework for Synthesizing Data Profiles OOPSLA 2018

We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios.

To the best of our knowledge, prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters.

Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼0.7s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as Flash Fill.

### Neural-Guided Deductive Search for Real-Time Program Synthesis from Examples ICLR 2018

Synthesizing user-intended programs from a small number of input-output examples is a challenging problem with several important applications like spreadsheet manipulation, data wrangling and code refactoring. Existing synthesis systems either completely rely on deductive logic techniques that are extensively hand-engineered or on purely statistical models that need massive amounts of data, and in general fail to provide real-time synthesis on challenging benchmarks. In this work, we propose Neural-Guided Deductive Search (NGDS), a hybrid synthesis technique that combines the best of both symbolic logic techniques and statistical models. Thus, it produces programs that satisfy the provided specifications by construction and generalize well on unseen examples, similar to data-driven systems. Our technique effectively utilizes the deductive search framework to reduce the learning problem of the neural component to a simple supervised learning setup. Further, this allows us to both train on sparingly available real-world data and still leverage powerful recurrent neural network encoders. We demonstrate the effectiveness of our method by evaluating on real-world customer scenarios by synthesizing accurate programs with up to 12× speed-up compared to state-of-the-art systems.

## 2017

### Program Synthesis NOW 2017

In Foundations and Trends® in Programming Languages, Volume 4, Issue 1-2.

Program synthesis is the task of automatically finding a program in the underlying programming language that satisfies the user intent expressed in the form of some specification. Since the inception of AI in the 1950s, this problem has been considered the holy grail of Computer Science. Despite inherent challenges in the problem such as ambiguity of user intent and a typically enormous search space of programs, the field of program synthesis has developed many different techniques that enable program synthesis in different real-life application domains. It is now used successfully in software engineering, biological discovery, computer-aided education, end-user programming, and data cleaning. In the last decade, several applications of synthesis in the field of programming by examples have been deployed in mass-market industrial products.

This survey is a general overview of the state-of-the-art approaches to program synthesis, its applications, and subfields. We discuss the general principles common to all modern synthesis approaches such as syntactic bias, oracle-guided inductive search, and optimization techniques. We then present a literature review covering the four most common state-of-the-art techniques in program synthesis: enumerative search, constraint solving, stochastic search, and deduction-based programming by examples. We conclude with a brief list of future horizons for the field.

### A Framework for Mass-Market Inductive Program Synthesis PhD Thesis

Programming by examples (PBE), or inductive program synthesis, is a problem of finding a program in the underlying domain-specific language (DSL) that is consistent with the given input-output examples or constraints. In the last decade, it has gained a lot of prominence thanks to the mass-market deployments of several PBE-based technologies for data wrangling – the widespread problem of transforming raw datasets into a structured form, more amenable to analysis. However, deployment of a mass-market application of program synthesis is challenging. First, an efficient implementation requires non-trivial engineering insight, often overlooked in a research prototype. This insight takes the form of domain-specific knowledge and heuristics, which complicate the implementation, extensibility, and maintenance of the underlying synthesis algorithm. Second, application development should be fast and agile, tailoring to versatile market requirements. Third, the underlying synthesis algorithm should be accessible to the engineers responsible for product maintenance.

In this work, I show how to generalize the ideas of 10+ previous specialized inductive synthesizers into a single framework, which facilitates automatic generation of a domain-specific synthesizer from the mere definition of the corresponding DSL and its properties. PROSE (PROgram Synthesis using Examples) is the first program synthesis framework that explicitly separates domain-agnostic search algorithms from domain-specific expert insight,making the resulting synthesizer both fast and accessible. The underlying synthesis algorithm pioneers the use of deductive reasoning for designer-defined domain-specific operators and languages, which enables mean synthesis times of 1-3 sec on real-life datasets.

A dedicated team at Microsoft has built and deployed 10+ technologies on top of the PROSE framework. Using them as case studies, I examine the user interaction challenges that arise after a mass-market deployment of a PBE-powered application. I show how expressing program synthesis as an interactive problem facilitates user intent disambiguation, incremental learning from additional examples, and increases the users’ confidence in the system.

### Interactive Program Synthesis arXiv preprint

Program synthesis from incomplete specifications (e.g. input-output examples) has gained popularity and found real-world applications, primarily due to its ease-of-use. Since this technology is often used in an interactive setting, efficiency and correctness are often the key user expectations from a system based on such technologies. Ensuring efficiency is challenging since the highly combinatorial nature of program synthesis algorithms does not fit in a 1-2 second response expectation of a user-facing system. Meeting correctness expectations is also difficult, given that the specifications provided are incomplete, and that the users of such systems are typically non-programmers.

In this paper, we describe how interactivity can be leveraged to develop efficient synthesis algorithms, as well as to decrease the cognitive burden that a user endures trying to ensure that the system produces the desired program. We build a formal model of user interaction along three dimensions: incremental algorithm, step-based problem formulation, and feedback-based intent refinement. We then illustrate the effectiveness of each of these forms of interactivity with respect to synthesis performance and correctness on a set of real-world case studies.

### Learning Syntactic Program Transformations from Examples ICSE 2017

Automatic program transformation tools can be valuable for programmers to help them with refactoring tasks, and for Computer Science students in the form of tutoring systems that suggest repairs to programming assignments. However, manually creating catalogs of transformations is complex and time-consuming. In this paper, we present Refazer, a technique for automatically learning program transformations. Refazer builds on the observation that code edits performed by developers can be used as input-output examples for learning program transformations. Example edits may share the same structure but involve different variables and subexpressions, which must be generalized in a transformation at the right level of abstraction. To learn transformations, Refazer leverages state-of-the-art programming-by-example methodology using the following key components: (a) a novel domain-specific language (DSL) for describing program transformations, (b) domain-specific deductive algorithms for efficiently synthesizing transformations in the DSL, and (c) functions for ranking the synthesized transformations.

We instantiate and evaluate Refazer in two domains. First, given examples of code edits used by students to fix incorrect programming assignment submissions, we learn program transformations that can fix other students’ submissions with similar faults. In our evaluation conducted on 4 programming tasks performed by 720 students, our technique helped to fix incorrect submissions for 87% of the students. In the second domain, we use repetitive code edits applied by developers to the same project to synthesize a program transformation that applies these edits to other locations in the code. In our evaluation conducted on 56 scenarios of repetitive edits taken from three large C# open-source projects, Refazer learns the intended program transformation in 84% of the cases using only 2.9 examples on average.

## 2015

### User Interaction Models for Disambiguation in Programming by Example UIST 2015

Programming by Examples (PBE) has the potential to revolutionize end-user programming by enabling end users, most of whom are non-programmers, to create small scripts for automating repetitive tasks. However, examples, though easy to provide, are an ambiguous specification of the user’s intent. Because of that, a key impedance in adoption of PBE systems is the lack of user confidence in the correctness of the program that was synthesized by the system.

We present two novel user interaction models that communicate actionable information to the user to help resolve ambiguity in the examples. One of these models allows the user to effectively navigate between the huge set of programs that are consistent with the examples provided by the user. The other model uses active learning to ask directed example-based questions to the user on the test input data over which the user intends to run the synthesized program.

Our user studies show that each of these models significantly reduces the number of errors in the performed task without any difference in completion time. Moreover, both models are perceived as useful, and the proactive active-learning based model has a slightly higher preference regarding the users’ confidence in the result.

### FlashMeta: A Framework for Inductive Program Synthesis OOPSLA 2015

Inductive synthesis, or programming-by-examples (PBE) is gaining prominence with disruptive applications for automating repetitive tasks in end-user programming. However, designing, developing, and maintaining an effective industrial-quality inductive synthesizer is an intellectual and engineering challenge, requiring 1-2 man-years of effort.

Our novel observation is that many PBE algorithms are a natural fall-out of one generic meta-algorithm and the domain-specific properties of the operators in the underlying domain-specific language (DSL). The meta-algorithm propagates example-based constraints on an expression to its subexpressions by leveraging associated witness functions, which essentially capture the inverse semantics of the underlying operator. This observation enables a novel program synthesis methodology called data-driven domain-specific deduction (D4), where domain-specific insight, provided by the DSL designer, is separated from the synthesis algorithm.

Our FlashMeta framework implements this methodology, allowing synthesizer developers to generate an efficient synthesizer from the mere DSL definition (if properties of the DSL operators have been modeled). In our case studies, we found that 10+ existing industrial-quality mass-market applications based on PBE can be cast as instances of D4. Our evaluation includes reimplementation of some prior works, which in FlashMeta become more efficient, maintainable, and extensible. As a result, FlashMeta-based PBE tools are deployed in several industrial products, including Microsoft PowerShell 3.0 for Windows 10, Azure Operational Management Suite, and Microsoft Cortana digital assistant.

### Personalized Mathematical Word Problem Generation IJCAI 2015

Word problems are an established technique for teaching mathematical modeling skills in K-12 education. However, many students find word problems unconnected to their lives, artificial, and uninteresting. Most students find them much more difficult than the corresponding symbolic representations. To account for this phenomenon, an ideal pedagogy might involve an individually crafted progression of unique word problems that form a personalized plot.

We propose a novel technique for automatic generation of personalized word problems. In our system, word problems are generated from general specifications using answer-set programming (ASP). The specifications include tutor requirements (properties of a mathematical model), and student requirements (personalization, characters, setting). Our system takes a logical encoding of the specification, synthesizes a word problem narrative and its mathematical model as a labeled logical plot graph, and realizes the problem in natural language. Human judges found our problems as solvable as the textbook problems, with a slightly more artificial language.

## 2014

### LaSEWeb: Automating Search Strategies over Semi-structured Web Data KDD 2014

We show how to programmatically model processes that humans use when extracting answers to queries (e.g., “Who invented typewriter?”, “List of Washington national parks”) from semi-structured Web pages returned by a search engine. This modeling enables various applications including automating repetitive search tasks, and helping search engine developers design micro-segments of factoid questions.

We describe the design and implementation of a domain-specific language that enables extracting data from a webpage based on its structure, visual layout, and linguistic patterns. We also describe an algorithm to rank multiple answers extracted from multiple webpages.

On 100,000+ queries (across 7 micro-segments) obtained from Bing logs, our system LaSEWeb answered queries with an average recall of 71%. Also, the desired answer(s) were present in top-3 suggestions for 95%+ cases.