Gen: A General-Purpose Probabilistic Programming
System with Programmable Inference
Marco F. Cusumano-Towner
Massachusetts Institute of Technology
marcoct@mit.edu
Feras A. Saad
Massachusetts Institute of Technology
fsaad@mit.edu
Alexander Lew
Massachusetts Institute of Technology
alexlew@mit.edu
Vikash K. Mansinghka
Massachusetts Institute of Technology
vkm@mit.edu
Abstract
Probabilistic modeling and inference are central to many
fields. A key challenge for wider adoption of probabilistic
programming languages is designing systems that are both
flexible and performant. This paper introduces Gen, a new
probabilistic programming system with novel language constructs for modeling and for end-user customization and
optimization of inference. Gen makes it practical to write
probabilistic programs that solve problems from multiple
fields. Gen programs can combine generative models written
in Julia, neural networks written in TensorFlow, and custom
inference algorithms based on an extensible library of Monte
Carlo and numerical optimization techniques. This paper
also presents techniques that enable Gen’s combination of
flexibility and performance: (i) the generative function interface, an abstraction for encapsulating probabilistic and/or
differentiable computations; (ii) domain-specific languages
with custom compilers that strike different flexibility/performance tradeoffs; (iii) combinators that encode common
patterns of conditional independence and repeated computation, enabling speedups from caching; and (iv) a standard
inference library that supports custom proposal distributions
also written as programs in Gen. This paper shows that
Gen outperforms state-of-the-art probabilistic programming
systems, sometimes by multiple orders of magnitude, on
problems such as nonlinear state-space modeling, structure
learning for real-world time series data, robust regression,
and 3D body pose estimation from depth images.
1 Introduction
Probabilistic modeling and inference are central to diverse
fields, such as computer vision, robotics, statistics, and artificial intelligence. Probabilistic programming languages aim
to make it easier to apply probabilistic modeling and inference, by providing language constructs for specifying models
and inference algorithms. Most languages provide automatic
“black box” inference mechanisms based on Monte Carlo,
gradient-based optimization, or neural networks. However,
applied inference practitioners routinely customize an algorithm to the problem at hand to obtain acceptable performance. Recognizing this fact, some recently introduced
languages offer “programmable inference” [29], which permits the user to tailor the inference algorithm based on the
characteristics of the problem.
However, existing languages have limitations in flexibility and/or performance that inhibit their adoption across
multiple application domains. Some languages are designed
to be well-suited for a specific domain, such as hierarchical
Bayesian statistics (Stan [6]), deep generative modeling (Pyro
[5]), or uncertainty quantification for scientific simulations
(LibBi [33]). Each of these languages can solve problems in
some domains, but cannot express models and inference algorithms needed to solve problems in other domains. For
example, Stan cannot be applied to open-universe models,
structure learning problems, or inverting software simulators for computer vision applications. Pyro and Turing [13]
can represent these kinds of models, but exclude many important classes of inference algorithms. Venture [29] expresses a
broader class of models and inference algorithms, but incurs
high runtime overhead due to dynamic dependency tracking.
Key Challenges Two key challenges in designing a practical general-purpose probabilistic programming system are:
(i) achieving good performance for heterogeneous probabilistic models that combine black box simulators, deep neural networks, and recursion; and (ii) providing users with
abstractions that simplify the implementation of inference
algorithms while being minimally restrictive.
This Work We introduce Gen, a probabilistic programming system that uses a novel approach in which (i) users
define probabilistic models in one or more embedded probabilistic DSLs and (ii) users implement custom inference algorithms in the host language by writing inference programs
(a) Gen's Architecture. Users write probabilistic DSL code that defines probabilistic models and custom proposal distributions. This code is compiled by probabilistic DSL compilers (Dynamic DSL, Static DSL, TensorFlow DSL, ...) into generative functions, which can be combined by generative function combinators (Map, Unfold, Recurse, ...). Generative functions implement the generative function interface (Initialize, Propose, Update, Backprop, ...), which is invoked by the standard inference library (HMC update, particle filter, MH update, MAP update, ...) and by the user inference program, written in the host language, which implements the inference algorithm and produces results.
(b) Standard Architecture. Users write probabilistic DSL code that defines probabilistic models; a probabilistic DSL compiler for fixed inference algorithms (particle filtering, Metropolis-Hastings, Hamiltonian Monte Carlo, ...) produces a black box sampler that takes user parameters and produces results.
Figure 1. Comparison of Gen’s architecture to a standard probabilistic programming architecture.
that manipulate the execution traces of models. This architecture affords users modeling and inference flexibility
that is important in practice. The Gen system, embedded
in Julia [4], makes it practical to build models that combine
structured probabilistic code with neural networks written in
platforms such as TensorFlow. Gen also makes it practical to
write inference programs that combine built-in operators for
Monte Carlo inference and gradient-based optimization with
custom algorithmic proposals and deep inference networks.
A Flexible Architecture for Modeling and Inference In
existing probabilistic programming systems [13, 15, 48], inference algorithm implementations are intertwined with
the implementation of compilers or interpreters for specific
probabilistic DSLs that are used for modeling (Figure 1b).
These inference algorithm implementations lack dimensions
of flexibility that are routinely used by practitioners of probabilistic inference. In contrast, Gen’s architecture (Figure 1a)
abstracts away probabilistic DSLs and their implementation from inference algorithm implementation using the
generative function interface (GFI). The GFI is a black box
abstraction for probabilistic and/or differentiable computations that exposes several low-level operations on execution
traces. Generative functions, which are produced by compiling probabilistic DSL code or by applying generative function
combinators to other generative functions, implement the
GFI. The GFI enables several types of domain-specific and
problem-specific optimizations that are key for performance:
• Users can choose appropriate probabilistic DSLs for their
domain and can combine different DSLs within the same
model. This paper outlines two examples: a TensorFlow DSL
that supports differentiable array computations resident
on the GPU and interoperates with automatic differentiation in Julia, and a Static DSL that enables fast operations
on execution traces via static analysis.
• Users can implement custom inference algorithms in the
host language using the GFI and optionally draw upon a
higher-level standard inference library that is also built on
top of the GFI. User inference programs are not restricted
by rigid algorithm implementations as in other systems.
• Users can express problem-specific knowledge by defining
custom proposal distributions, which are essential for good
performance, in probabilistic DSLs. Proposals can also be
trained or tuned automatically.
• Users can easily extend the language by adding new probabilistic DSLs, combinators, and inference abstractions
implemented in the host language.
Evaluation This paper includes an evaluation that shows
that Gen can solve inference problems including 3D body
pose estimation from a single depth image; robust regression; inferring the probable destination of a person or robot traversing its environment; and structure learning for
real-world time series data. In each case, Gen outperforms
existing probabilistic programming languages that support
customizable inference (Venture and Turing), typically by
one or more orders of magnitude. These performance gains
are enabled by Gen’s more flexible inference programming
capabilities and high-performance probabilistic DSLs.
Main Contributions The contributions of this paper are:
• A probabilistic programming approach where users (i)
define models in embedded probabilistic DSLs; and (ii) implement probabilistic inference algorithms using inference
programs (written in the host language) that manipulate
execution traces of models (Section 2).
• The generative function interface, a novel black box abstraction for probabilistic and/or differentiable computations
that separates probabilistic DSLs from the implementation
of inference algorithms (Section 3).
• Generative function combinators, which produce generative functions that exploit common patterns of conditional
independence to enable efficient operations on execution
traces. We give three examples of combinators: map, unfold, and recurse (Section 4).
• Gen, a system implementing this approach with three
interoperable probabilistic DSLs (Section 5).
• An empirical evaluation of Gen’s expressiveness and efficiency relative to other probabilistic programming languages on five challenging inference problems (Section 6).
2 Overview
In this section, we walk through an example probabilistic
model, demonstrate three probabilistic inference programs
for the model, and highlight the main features of Gen’s architecture along the way. The example in Figure 2 is based
on modeling a response variable y as a linear function of an independent variable x. Figure 2a presents the inference problem: given a data set of observed pairs (xᵢ, yᵢ), the goal of probabilistic inference is to infer the relationship between y and x, as well as to identify any data points i that do not conform to
the inferred linear relationship (i.e. detect outliers). Figure 2b
presents code in a probabilistic domain-specific language
that defines the probabilistic model. The three inference programs, implemented in Julia, are shown in Figure 2f. Each
of these inference programs implements a different algorithm for solving the inference task and, as a result, exhibits
different performance characteristics, shown in Figure 2g.
2.1 Example Probabilistic Model
We now describe the ideas in Figure 2b in greater detail.
Models are Generative Functions Users define probabilistic models by writing generative functions in probabilistic
DSLs. Generative functions generate realizations of (typically
stochastic) processes. The probabilistic DSL used in this example is called the Dynamic DSL (syntax in Figure 3). The
model in this example is defined by the generative function
‘model’ (Lines 14-29). Generative functions in the Dynamic
DSL are identified by a @gen annotation in front of a regular
Julia function definition. Dynamic DSL functions may use
arbitrary Julia code, as long as they do not mutate any externally observable state. Dynamic DSL functions make random
choices with functions like normal and bernoulli that sample from an extensible set of probability distributions (for
instance, normal(0,1) draws a random value from the standard normal distribution). Each random choice represents
either a latent (unknown) or observable quantity.
The generative function model takes as input a vector of independent variables xs (containing the xᵢ) and returns a vector of response variables ys (containing the yᵢ). First, in Lines 15-18, the model makes random choices (shown in green) for the parameters (prob_outlier, noise, slope, intercept) that govern the linear relationship from their prior distributions. The parameters slope and intercept parameterize the assumed linear dependence of yᵢ on xᵢ. The parameter noise
determines, for data points that are not outliers, the variance
of the response variable around the straight regression line.
Finally, the parameter prob_outlier is the probability that
any given data point is an outlier.
Addresses of Random Choices Each random choice made
when executing a generative function defined in the Dynamic DSL is given a unique address (shown in purple) using
the @addr keyword. For example, the prob_outlier choice
(Line 18) has address :prob_outlier. In general, the address
of a choice is independent of any name to which its value may
be assigned in the function body (note that random choices
need not be bound to any identifier in the function body).
Addresses are arbitrary Julia values that may be dynamically
computed during function execution. The random choices
seen so far have addresses that are Julia symbol values.
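For instance, the following sketch (ours, not part of Figure 2) illustrates addresses computed at run time; the tuple address (:y, i) is an arbitrary Julia value built during execution.

@gen function varying_addresses(n::Int)
    total = 0.0
    for i in 1:n
        # the address (:y, i) is constructed dynamically for each choice
        total += @addr(normal(0, 1), (:y, i))
    end
    return total
end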
Generative Function Combinators Having generated the
parameters, the generative function model next generates
each yᵢ from the corresponding xᵢ, for each data point i (Lines 19-27). The @diff block (Lines 19-23, brown) is an optional
19-27). The @diff block (Lines 19-23, brown) is an optional
performance optimization which we will discuss later. Lines
25-27 call the generative function ‘generate_datum’ (Lines 1-10) once for each data point, passing in the xᵢ and each of the
four parameters. To do this, we use MapCombinator, which is
a generative function combinator (Section 4) that behaves similarly to the standard higher-order function ‘map’. On Line
12, the MapCombinator Julia function takes the generative
function generate_datum and returns a second generative
function (generate_data) that applies the first function repeatedly to a vector of argument values, using independent
randomness for each application. On Line 25, the model function calls generate_data, which in turns calls generate_datum
once for each data point i to generate the yi from the corresponding xi and the parameters. Within the ith application
of generate_datum, a Bernoulli random choice (Line 4) determines whether data point i will be an outlier or not. If
the data point is not an outlier (Line 7), then the response
variable yᵢ has an expected value (mu) that is a linear function of xᵢ and standard deviation (std) equal to noise. If the data point is an outlier (Line 5), then yᵢ has an expected value of 0 and a large standard deviation (10). Finally, we sample the response variable yᵢ from a normal distribution with the
given expected value and standard deviation (Line 9).
Hierarchical Address Spaces When calling other generative functions, @addr is used to indicate the namespace
(a) Probabilistic inference task
[Three panels plotting y against x: Dataset, Linear Relationship, Outliers.]
(b) Probabilistic model
1 @gen function generate_datum(x::Float64, prob_outlier::Float64,
2 noise::Float64, @ad(slope::Float64),
3 @ad(intercept::Float64))
4 if @addr(bernoulli(prob_outlier), :is_outlier)
5 (mu, std) = (0., 10.)
6 else
7 (mu, std) = (x * slope + intercept, noise)
8 end
9 return @addr(normal(mu, std), :y)
10 end
11
12 generate_data = MapCombinator(generate_datum)
13
14 @gen function model(xs::Vector{Float64})
15 slope = @addr(normal(0, 2), :slope)
16 intercept = @addr(normal(0, 2), :intercept)
17 noise = @addr(gamma(1, 1), :noise)
18 prob_outlier = @addr(uniform(0, 1), :prob_outlier)
19 @diff begin
20 addrs = [:prob_outlier, :slope, :intercept, :noise]
21 diffs = [@choicediff(addr) for addr in addrs]
22 argdiff = all(map(isnodiff,diffs)) ? noargdiff : unknownargdiff
23 end
24 n = length(xs)
25 ys = @addr(generate_data(xs, fill(prob_outlier, n),
26 fill(noise, n), fill(slope, n), fill(intercept,n)),
27 :data, argdiff)
28 return ys
29 end
(c) The hierarchical address space of model in (b).
Top-level addresses: :prob_outlier, :slope, :intercept, :noise, :data. Under :data, integer namespaces 1, 2, ..., 100 (one per data point), each containing :is_outlier and :y.
(d) Custom proposals used by inference programs
30 @gen function is_outlier_proposal(previous_trace, i::Int)
31 is_outlier = previous_trace[:data => i => :is_outlier]
32 @addr(bernoulli(is_outlier ? 0.0 : 1.0),
33 :data => i => :is_outlier)
34 end
35
36 @gen function ransac_proposal(previous_trace, xs, ys)
37 (slope, intercept) = ransac(xs, ys, 10, 3, 1.)
38 @addr(normal(slope, 0.1), :slope)
39 @addr(normal(intercept, 1.0), :intercept)
40 end
(e) Julia functions used in inference programs
41 function make_constraints(ys::Vector{Float64})
42 constraints = Assignment()
43 for i=1:length(ys)
44 constraints[:data => i => :y] = ys[i]
45 end
46 return constraints
47 end
48
49 function parameter_update(trace, num_sweeps::Int)
50 for j=1:num_sweeps
51 trace = default_mh(model, select(:prob_outlier), trace)
52 trace = default_mh(model, select(:noise), trace)
53 trace = mala(model, select(:slope, :intercept), trace, 0.001)
54 end
55 end
56
57 function is_outlier_update(trace, num_data::Int)
58 for i=1:num_data
59 trace = custom_mh(model, is_outlier_proposal, (i,), trace)
60 end
61 return trace
62 end
(f) Three user inference programs
63 function inference_program_1(xs::Vector{Float64}, ys::Vector{Float64})
64 constraints = make_constraints(ys)
65 (trace, _) = initialize(model, (xs,), constraints)
66 for iter=1:100
67 selection = select(:prob_outlier, :noise, :slope, :intercept)
68 trace = default_mh(model, selection, (), trace)
69 trace = is_outlier_update(trace, length(xs))
70 end
71 return trace
72 end
73
74 function inference_program_2(xs::Vector{Float64}, ys::Vector{Float64})
75 constraints = make_constraints(ys)
76 (slope, intercept) = least_squares(xs, ys)
77 constraints[:slope] = slope
78 constraints[:intercept] = intercept
79 (trace, _) = initialize(model, (xs,), constraints)
80 for iter=1:100
81 trace = parameter_update(trace, 5)
82 trace = is_outlier_update(trace, length(xs))
83 end
84 return trace
85 end
86
87 function inference_program_3(xs::Vector{Float64}, ys::Vector{Float64})
88 constraints = make_constraints(ys)
89 (trace, _) = initialize(model, (xs,), constraints)
90 for iter=1:100
91 trace = custom_mh(model, ransac_proposal, (xs, ys), trace)
92 trace = parameter_update(trace, 5)
93 trace = is_outlier_update(trace, length(xs))
94 end
95 return trace
96 end
[Plot: log probability versus runtime (seconds), one curve per inference program.]
(g) Runtime versus accuracy of three inference programs in (f).
Figure 2. Example of the proposed programming model. Probabilistic DSLs are used to define (b) probabilistic models and (d)
custom proposal distributions. (e)–(f) Inference algorithms are implemented by users in the host language, using methods provided by
the standard inference library (shown in bold). (g) shows the time-accuracy profiles of the three inference programs in (f).
D := normal | gamma | categorical |...
E := J | R | F
R := @addr(D(E2, E3, ..), E1)
F := @addr(G(E2, E3, ..), E1)
Figure 3. Syntax of the Dynamic DSL, which is embedded
in Julia. J are Julia expressions, G are names of generative
functions, D are probability distributions, E are expressions,
R are random choices, and F are calls to generative functions.
relative to which all random choices made during the call
are addressed. For example, model calls generate_data with
namespace :data (Lines 25-27). Generative functions that
are produced by generative function combinators also give
a namespace to each call they make—generate_data calls
generate_datum once for each data point with a unique integer namespace ranging from 1 to 100 (for 100 data points).
Figure 2c presents the hierarchical address space of random choices made by model. Hierarchical addresses are constructed by chaining together address components using the
=> operator. For example, :data => 2 => :is_outlier refers
to the is_outlier random choice for the second data point.
Argdiffs Enable Efficient Trace Updates The @diff code
block in model (Lines 19-23, brown) is optional, but important for good performance of Markov Chain Monte Carlo
(MCMC) inference. MCMC inference often involves making
small changes to an execution of a generative function. The
@diff block is a feature of the Dynamic DSL that allows users
to insert code that computes the change to the arguments to
a call (the argdiff ) from one execution to another execution,
as well as the change to the return value of a function (the
retdiff , not used in this example). The argdiff can indicate
that the arguments to a call have not changed (noargdiff) or
that they may have changed (unknownargdiff), or can provide detailed information about the change. In this case, the
argdiff passed to the call to generate_data indicates whether
or not the parameters of the model have changed from the
previous execution. If the parameters have changed, then the
call to generate_datum must be visited for each data point to
construct the new execution trace. If the parameters have not
changed, then most calls to generate_datum can be avoided.
This feature enables efficient asymptotic scaling of MCMC.
Note that code in the @diff block and the argdiff variable
(Line 27) are not part of the generative function body and
are only used when updating an execution.
2.2 Example User Inference Programs
Figure 2f shows three user inference programs, which are
Julia functions that implement different custom algorithms
for inference in the example probabilistic model. These inference programs are implemented using high-level abstractions provided by our system (methods such as select and initialize are shown in boldface in the code figures). Each inference program takes as input a data set containing many (xᵢ, yᵢ) values; it returns inferred values for the parameters that govern the linear relationship and the outlier classifications.
Execution Traces as Data Structures The results of inference take the form of an execution trace (trace) of the generative function model. An execution trace of a generative
function contains the values of random choices made during
an execution of the function, and these values are accessed
using indexing syntax (e.g. trace[:slope], trace[:data =>
5 => :is_outlier]). The inference programs return traces
whose random choices represent the inferred values of latent
variables in the model. In addition to the random choices, an
execution trace (or just ‘trace’) also contains the arguments
on which the function was executed, and the function’s return value. Traces are persistent data structures.
Generative Function Interface Each inference program
in Figure 2f first obtains an initial trace by calling the initialize
method on the generative function (model) and supplying (i)
arguments (xs) to the generative function; and (ii) a mapping
(constraints) from addresses of certain random choices to
constrained values (Lines 65, 79, 89). The initialize method
is part of the generative function interface (GFI, Section 3),
which exposes low-level operations on execution traces. The
constraints are an assignment (produced by Figure 2e, Line
41), which is a prefix tree that maps addresses to values.
Users construct assignments that map addresses of observed
random choices to their observed values. Each inference program constrains the y-coordinates to their observed values.
Inference program 2 also constrains the slope and intercept
(Lines 77-78) to values obtained from a heuristic procedure
(least_squares) implemented in Julia. The other two inference programs initialize the slope and intercept to values
randomly sampled from the prior distribution.
Standard Inference Library After obtaining an initial trace,
each program in Figure 2f then uses a Julia for loop (Lines 66,
80, 90) to repeatedly apply a cycle of different Markov Chain
Monte Carlo (MCMC) updates to various random choices
in the trace. These updates improve the random choices in
the initial trace in accordance with the constrained observations. The MCMC updates use the methods default_mh and
custom_mh from the standard inference library, which provides higher-level building blocks for inference algorithms
that are built on top of the GFI. Inference program 1 uses
default_mh, which proposes new values for certain selected
random choices according to a default proposal distribution
and accepts or rejects according to the Metropolis-Hastings
(MH, [19]) rule.
Custom Proposals are Generative Functions Each inference program in Figure 2f updates the outlier indicator variables using is_outlier_update (Figure 2e, Lines 57-62). This
procedure loops over the data points and performs an MH update to each :is_outlier choice using the custom_mh method
from the standard inference library. The update uses a custom proposal distribution defined by is_outlier_proposal
(Figure 2d, Lines 30-34) which reads the previous value of
the variable and then proposes its negation by making a random choice at the same address :data => i => :is_outlier.
Custom proposal distributions are defined as generative functions using the same probabilistic DSLs used to define probabilistic models. Inference program 3 uses a different MH
update with a custom proposal (ransac_proposal, Figure 2d
Lines 36-40) to propose new values for the slope and intercept by adding noise to values estimated by the RANSAC
[12] robust estimation algorithm.
Using Host Language Abstraction Inference programs 2
and 3 in Figure 2f both invoke parameter_update (Figure 2e, Lines 49-55), which applies MH updates with default proposals to :prob_outlier and :noise, but uses a Metropolis-Adjusted Langevin Algorithm (MALA, [39]) update for the
slope and intercept (mala, Line 53). The standard inference
library method mala relies on automatic differentiation in
the model function (@ad annotation, Lines 2-3) to compute
gradients of the log probability density with respect to the
selected random choices. Inference programs make use of
familiar host language constructs, including procedural abstraction (e.g. reusable user-defined inference sub-routines
like parameter_update) and loop constructs (to repeatedly
invoke nested MCMC and optimization updates). These host
language features obviate the need for new and potentially
more restrictive ‘inference DSL’ constructs [13, 28]. Using
the host language for inference programming also allows
for seamless integration of external code into inference algorithms, as illustrated by the custom least squares initialization in inference program 2 (Line 76) and the custom external
RANSAC estimator used in inference program 3 (Line 91).
Performance Depends on the Inference Algorithm As
seen in Figure 2g, the three inference programs perform
differently. The default proposal for the parameters used in
inference program 1 causes slow convergence as compared to
the combination of MALA updates and custom initialization
used in inference program 2. The custom RANSAC proposal
used in inference program 3 also improves convergence—it
does not employ a local random-walk search strategy like the
MALA update, but instead proposes approximate estimates
from scratch with each application. The constructs in our
system make it practical for the user to experiment with and
compare various inference strategies for a fixed model.
3 Generative Function Interface
The generative function interface (GFI) is a black box abstraction for probabilistic and/or differentiable computations that
provides an abstraction barrier between the implementation
of probabilistic DSLs and implementations of inference algorithms. Generative functions are objects that implement the
methods in the GFI and are typically produced from DSL code
by a probabilistic DSL compiler. For example, the Dynamic
DSL compiler takes a @gen function definition and produces
a generative function that implements the GFI. Each method
in the GFI performs an operation involving execution traces
of the generative function on which it is called.
3.1 Formalizing Generative Functions
We now formally describe generative functions.
Assignment An assignment s is a map from a set of addresses A to a set of values V, where s(a) denotes the value assigned to address a ∈ A. The special value s(a) = ⊥ indicates that s does not assign a meaningful value to a. We define the domain dom[s] ≔ {a ∈ A : s(a) ≠ ⊥} of s as the set of non-vacuous addresses in s. Let s|B be the assignment obtained by restricting s to a subset of addresses B ⊆ dom[s], i.e. s|B(a) = s(a) if a ∈ B and s|B(a) = ⊥ otherwise. We say that two assignments s and t agree if s(a) = t(a) for all a ∈ dom[s] ∩ dom[t]. Finally, we let S ≔ V^A be the set of all assignments from A to V.
Generative function A generative function G is a tuple G = (X, Y, f, p, q). Here, X and Y are sets that denote the domain of the arguments and the domain of the return value of the function, respectively. Moreover, p : X × S → [0, 1] is a family of probability distributions on assignments s ∈ S indexed by argument x ∈ X. In words, x is a fixed argument to the generative function and s is a realization of all the random choices made during the execution. Since p is a distribution, for each x we have ∑_{s ∈ S} p(x, s) = 1. We use the notation p(s; x) ≔ p(x, s) to separate the arguments x from the assignment s. The output function f : X × {s : p(s; x) > 0} → Y maps an argument x and an assignment s of all random choices (whose probability under p is non-zero) to a return value f(x, s) ∈ Y. Finally, q : X × S × S → [0, 1] is an internal proposal distribution that assigns a probability q(s; x, u) to each assignment s for all x ∈ X and all u ∈ S. The proposal distribution q satisfies (i) ∑_s q(s; x, u) = 1 for all x ∈ X and u ∈ S; and (ii) q(s; x, u) > 0 if and only if s agrees with u and p(s; x) > 0.
3.2 Interface Methods
We now describe several methods in the GFI. Each method
takes a generative function G in addition to other method-specific arguments. We will denote an assignment by t if
p(t; x) > 0 (for some x ∈ X) when the assignment is ‘complete’
and by u when the assignment is ‘partial’. An execution trace
(t, x) is a pair of a complete assignment t and argument x.
initialize (sampling an initial trace) This method takes
an assignment u, arguments x, and returns (i) a trace (t, x)
such that t agrees with u, sampled using the internal proposal distribution (denoted t ∼ q(·; x, u)); as well as (ii) a weight w ≔ p(t; x)/q(t; x, u). It is an error if no such assignment t exists.
propose (proposing an assignment) This method samples an assignment from a generative function G that is
used as a proposal distribution and computes (or estimates)
the proposal probability. In particular, propose takes arguments x and a set of addresses B, and returns an assignment
u and a weight w by (i) sampling t ∼ p(·; x); then (ii) setting u ≔ t|B and w ≔ p(t; x)/q(t; x, u). If B = ∅ (i.e. no addresses are provided) then the method sets u ≔ t.
assess (assessing the probability of an assignment) This
method computes (or estimates) the probability that a generative function G produces a given assignment u. It is equivalent to initialize except that it only returns the weight w,
and not the trace. Note that w is either the exact probability
that the assignment u is produced by G (on arguments x) or
an unbiased estimate of this probability.
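To illustrate how these methods compose, the following is a minimal sketch (ours, not part of the standard inference library) of sampling-importance-resampling built directly on initialize; it assumes, as noted in Section 3.3, that the weights returned by the GFI are log weights.

function sketch_importance_resampling(model, args, observations, num_particles::Int)
    traces = Vector{Any}(undef, num_particles)
    log_weights = Vector{Float64}(undef, num_particles)
    for i in 1:num_particles
        # each call samples latent choices consistent with the observations and
        # returns the log weight log(p(t; x)/q(t; x, u))
        (traces[i], log_weights[i]) = initialize(model, args, observations)
    end
    # normalize the weights in log space and resample one trace
    m = maximum(log_weights)
    probs = exp.(log_weights .- m)
    probs ./= sum(probs)
    r, c, idx = rand(), 0.0, num_particles
    for i in 1:num_particles
        c += probs[i]
        if r <= c
            idx = i
            break
        end
    end
    return traces[idx]
end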
1 @gen function foo()
2 val = @addr(bernoulli(0.3), :a)
3 if @addr(bernoulli(0.4), :b)
4 val = @addr(bernoulli(0.6), :c) && val
5 else
6 val = @addr(bernoulli(0.1), :d) && val
7 end
8 val = @addr(bernoulli(0.7), :e) && val
9 return val
10 end
11
12 # Sample a random initial trace (tr1).
13 x = () # empty tuple, foo takes no args
14 u = Assignment() # empty assignment
15 (tr1, w1) = initialize(foo, x, u) # 'tr1' is the trace
16
17 # Use 'update' to obtain a trace (tr2) with :b=false, :d=true.
18 u = Assignment([(:b, false), (:d, true)])
19 (tr2, w2, v, retdiff) = update(foo, x, noargdiff, tr1, u)
dom[t] = {:a, :b, :c, :e}, dom[u] = {:b, :d}, dom[t′] = {:a, :b, :d, :e}, dom[v] = {:b, :c}.
Initial trace tr1 ≡ (t, ()): :a = false, :b = true, :c = false, :e = true.
Assignment u: :b = false, :d = true.
New trace tr2 ≡ (t′, ()): :a = false, :b = false, :d = true, :e = true.
Discard assignment v: :b = true, :c = false.
p(t; x) = 0.7 × 0.4 × 0.4 × 0.7 = 0.0784
p(t′; x′) = 0.7 × 0.6 × 0.1 × 0.7 = 0.0294
w = p(t′; x′)/p(t; x) = 0.0294/0.0784 = 0.375
Figure 4. Illustration of update on a simple generative function foo.
update (updating a trace) The update method makes adjustments to random choices in an execution trace (t, x) of G, which is a common pattern in iterative inference algorithms such as MCMC and MAP optimization. In particular, the update method takes a trace (t, x) of G; new arguments x′; and a partial assignment u. It returns a new trace (t′, x′) such that t′ agrees with u and t′(a) = t(a) for a ∈ (dom[t] ∩ dom[t′]) \ dom[u]. That is, values are copied from t for addresses that remain and that are not overwritten by u. It is an error if no such t′ exists. The method also returns a discard assignment v where v(a) ≔ t(a) for a ∈ (dom[t] \ dom[t′]) ∪ (dom[t] ∩ dom[u]). In words, v contains the values from the previous trace for addresses that were either overwritten by values in u or removed altogether. The method additionally returns a weight w ≔ p(t′; x′)/p(t; x). Figure 4 illustrates the behavior of update in a simple setting.
Oftentimes, in an invocation of update the assignment u will contain values for only a small portion of the addresses in t, and there will either be no change from x to x′ or the change will be ‘small.’ In these cases, information about the change from x to x′ can be used to make the implementation of update more efficient. To exploit this invocation pattern for performance gains, update accepts an argdiff value ∆(x, x′), which contains information about the change in arguments (when available). To enable compositional implementations of update, it also returns a retdiff value ∆(y, y′) that contains information about the change in return value (when available). The types of argdiff and retdiff values depend on the generative function G. Section 2 shows an example usage of argdiff.
backprop (automatic differentiation) This method computes gradients of the log probability of the execution trace
with respect to the values of function arguments x and/or the
values of random choices made by G. In particular, backprop
takes a trace (t, x) and a selection of addresses B that identify
the random choices whose gradients are desired. It returns
the gradient ∇B log p(t; x) of log p(t; x) with respect to the selected addresses as well as the gradient ∇x log p(t; x) with respect to the arguments. backprop also accepts an optional return value gradient ∇y J (for some J), in which case it returns ∇B(log p(t; x) + J) and ∇x(log p(t; x) + J). Note that for ∇B log p(t; x) to be defined, random choices in A must
have differentiable density functions with respect to a base
measure, in which case logp(t; x) is a joint density with respect to the induced base measure on assignments which is
derived from the base measures for each random choice.
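As an illustration of how backprop composes with update, here is a hedged sketch (ours; the exact return signature of backprop is assumed) of a single gradient-ascent step on selected continuous choices:

function sketch_gradient_step(model, selection, trace, step_size::Float64)
    args = get_args(trace)
    # assumed signature: returns (a map from selected addresses to gradients of
    # log p(t; x), gradients with respect to the arguments); only the former is used here
    (choice_grads, _) = backprop(model, trace, selection)
    constraints = Assignment()
    for (addr, grad) in choice_grads
        constraints[addr] = trace[addr] + step_size * grad   # ascend log p(t; x)
    end
    # install the new values; unchanged choices are copied from the previous trace
    (new_trace, _, _, _) = update(model, args, noargdiff, trace, constraints)
    return new_trace
end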
3.3 Implementing a Metropolis-Hastings Update
A key contribution of the GFI is its ability to act as the building block for implementing many probabilistic inference
algorithms. We next present an example showing how to
use the GFI to implement the Metropolis-Hastings update
(custom_mh) used in Figure 2e. We call propose on the proposal function (proposal) to generate the proposed assignment (u) and the forward proposal probability (fwd_score).
The update method takes the previous model trace (trace)
and the proposed assignment, and returns (i) a new trace that
is compatible with the proposed assignment; (ii) the model’s
contribution (w) to the acceptance ratio; and (iii) the discard assignment (v), which would have to be proposed to reverse the move. Then, assess is used to compute the reverse
proposal probability (rev_score). Finally, mh_accept_reject
stochastically accepts the move with probability alpha; if the
move is accepted, the new trace is returned, otherwise the
previous trace is returned. (Note that the weight w returned
by the GFI is in log-space).
1 function custom_mh(model, proposal, proposal_args, trace)
2 proposal_args_fwd = (trace, proposal_args...,)
3 (u, fwd_score) = propose(proposal, proposal_args_fwd)
4 args = get_args(trace) # model arguments
5 (new_trace, w, v, _) = update(model, args, noargdiff, trace, u)
6 proposal_args_rev = (new_trace, proposal_args...,)
7 rev_score = assess(proposal, proposal_args_rev, v)
8 alpha = w + rev_score - fwd_score
9 return mh_accept_reject(alpha, trace, new_trace)
10 end
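The helper mh_accept_reject is not shown above; a minimal sketch (ours), assuming alpha is a log acceptance ratio, is:

function mh_accept_reject(alpha, trace, new_trace)
    # accept with probability min(1, exp(alpha)); otherwise keep the previous trace
    return log(rand()) < alpha ? new_trace : trace
end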
3.4 Compositional Implementation Strategy
Each GFI method is designed to be implemented compositionally by invoking the same GFI method for each generative
function call within a generative function. In this implementation strategy, execution trace data structures are hierarchical and have a ‘sub-trace’ for each generative function
call. Therefore, when a probabilistic DSL compiler produces
a generative function G from probabilistic DSL code, it can
treat any generative function that is called by G as a black
box that implements the GFI. This compositional design
enables (i) generative functions defined using independent
probabilistic DSLs to invoke one another and (ii) powerful
extensions with new types of generative functions, as we
demonstrate in the next two sections.
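For concreteness, a hedged sketch (ours, not Gen's actual implementation) of what a hierarchical trace data structure might look like:

struct SketchTrace
    args::Tuple                 # arguments x the function was executed on
    retval::Any                 # return value f(x, t)
    choices::Dict{Any,Any}      # address => value, for choices made directly by this function
    subtraces::Dict{Any,Any}    # address namespace => sub-trace of a callee generative function
    score::Float64              # log p(t; x), accumulated from choices and sub-traces
end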
4 Generative Function Combinators
Iterative inference algorithms like MCMC or MAP optimization often make small adjustments to the execution trace
of a generative function. For example, consider a call to
custom_mh with is_outlier_proposal in Figure 2e, Line 59.
This call proposes a change to the :is_outlier choice for a
single data point (i) and then invokes update on the model
function. Because data points are conditionally independent
given the parameters, an efficient implementation that exploits conditional independence can scale as O(1), whereas
a naive implementation that always visits the entire trace
would scale as O(n) where n is the number of data points.
To address this issue, we introduce generative function combinators, which provide efficient implementations of update
(and other GFI methods) by exploiting common patterns of
repeated computation and conditional independence that
arise in probabilistic models. A generative function combinator takes as input a generative function Gₖ (called a “kernel”) and returns a new generative function G′ that repeatedly applies the kernel according to a particular pattern of computation. The GFI implementation for G′ automatically exploits the static pattern of conditional independence in the resulting computation, leveraging argdiff and retdiff to provide significant gains in scalability. We next describe three examples of generative function combinators.
Map The map combinator (Figure 5a) constructs a generative function G′ that independently applies a kernel Gₖ from X to Y to each element of a vector of inputs (x₁, . . . , xₙ) and returns a vector of outputs (y₁, . . . , yₙ). Each application of Gₖ is assigned an address namespace i ∈ {1 . . . n}. The implementation of update for map only makes recursive GFI calls for the kernel on those applications i for which the assignment u contains addresses under namespace i, or for which xᵢ ≠ xᵢ′. The update method accepts an argdiff value ∆(x, x′) that indicates which applications i have xᵢ ≠ xᵢ′, and a nested argdiff value ∆(xᵢ, xᵢ′) for each such application.
Unfold The unfold combinator (Figure 5b) constructs a generative function G′ that applies a kernel Gₖ repeatedly in sequence. Unfold is used to represent a Markov chain, a common building block of probabilistic models, with state space Y and transition kernel Gₖ from Y × Z to Y. The kernel Gₖ maps (y, z) ∈ Y × Z to a new state y′ ∈ Y. The generative function produced by this combinator takes as input the initial state y₀ ∈ Y, the parameters z ∈ Z of Gₖ, and the number of Markov chain steps n to apply. Each kernel application is given an address namespace i ∈ {1 . . . n}. Starting with initial state y₀, unfold repeatedly applies Gₖ to each state and returns the vector of states (y₁, . . . , yₙ). The custom GFI implementation in unfold exploits the conditional independence of applications i and j given the return value of an intermediate application l, where i < l < j.
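A hedged usage sketch (ours; the constructor name UnfoldCombinator and the argument order of the produced function are assumptions, mirroring the MapCombinator pattern from Figure 2b):

@gen function transition(prev_y::Float64, z::Float64)
    # one Markov chain step: the next state depends only on the previous state and parameters z
    return @addr(normal(z * prev_y, 1.0), :y)
end

markov_chain = UnfoldCombinator(transition)   # hypothetical constructor, analogous to MapCombinator

@gen function chain_model(n::Int)
    z = @addr(uniform(0, 1), :z)
    # kernel applications receive integer namespaces 1...n, e.g. :states => 3 => :y
    ys = @addr(markov_chain(n, 0.0, z), :states)
    return ys
end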
As Table 2 shows, Gen achieves a >10x speedup over Venture, even without caching, at a cost of 1.7x greater code size. The Gen implementation without caching
gives comparable performance to the Julia implementation,
with a >4x reduction in code size. Finally, recurse gives a
1.7x speedup at the cost of somewhat more code. These results indicate that Gen substantially reduces code size while
remaining competitive with a raw Julia implementation.
(a) Dataset and Predictions: passenger volume by year (1948-1960), showing observed data, held-out data, and predictions. (b) Accuracy vs. Runtime: mean prediction error versus runtime (seconds, log scale) for Venture, Gen (Dynamic), Gen (Dynamic + recurse), and Julia.
Figure 7. Predictions and accuracy for GP structure learning.
Table 2. Evaluation results for GP structure learning.
Runtime (ms/step) LOC (approx.)
Gen (Dynamic + recurse) 4.96ms (± 0.15) 230
Julia (handcoded, no caching) 6.17ms (± 0.45) 440
Gen (Dynamic) 8.37ms (± 0.53) 100
Venture (no caching) 107.01ms (± 6.57) 60
Runtime (sec.)
Gen (Static) 0.77s (± 0.12)
Turing 2.57s (± 0.04)
Gen (Dynamic) 3.14s (± 0.22)
Figure 8. Left: Observed locations of agent (orange), inferred
path (gray), and destination (red). Black polygons are obstacles. Right: Evaluation results for 1000 iterations of MCMC.
6.3 Algorithmic Model of an Autonomous Agent
We next consider inference in a model of an autonomous agent
that uses a rapidly exploring random tree [25] path planning
algorithm to model the agent’s goal-directed motion [8]. The
task is to infer the destination of the agent from observations
of its location over time. We evaluated two Gen implementations and one Turing implementation of the same MCMC
inference algorithm. Gen and Turing, unlike Venture, use
probabilistic DSLs that are embedded in a high-performance
host language, which permits seamless application of MCMC
to models that use fast simulations like path planners. The results in Figure 8 show that the Gen Static DSL outperformed
the Dynamic DSL by roughly 4x, and Turing by 3.3x.
6.4 Nonlinear State-Space Model
We next consider marginal likelihood estimation in a nonlinear state-space model. We tested two particle filtering
inference algorithms [9] implemented in Gen, one using a
default ‘prior’ proposal distribution and one using a custom
‘optimal’ proposal derived by manual analysis of the model.
Turing does not support using custom proposals with particle filtering. We implemented only the default proposal
variant in Turing. The results in Figure 9 show that the custom proposal gives accurate results in an order of magnitude
less time than the default proposal. Also, Table 3 shows that
the Gen Static DSL implementation using the default proposal outperforms the Turing implementation of the same
algorithm by a factor of 23x. Finally, we see that the unfold
combinator provides a speedup of 12x for the Dynamic DSL.
[Plot: log marginal likelihood estimate versus runtime (seconds, log scale) for the custom and default proposals.]
Figure 9. Comparing default and custom proposals for particle filtering. Accuracy is the log marginal likelihood estimate.
Table 3. Evaluation results for state-space model.
Runtime (ms) LOC
Gen (Static + unfold + default proposal) 13ms (± 2) 33
Gen (Dynamic + unfold + default proposal) 78ms (± 7) 33
Turing (default proposal) 306ms (± 153) 20
Gen (Dynamic + default proposal) 926ms (± 66) 26
6.5 3D Body Pose Estimation from Depth Images
We next consider an inference task from computer vision
[24, 52]. The model shown in Figure 6 posits a 3D articulated
mesh model of a human body, parametrized by pose variables including joint rotations, and a rendering process by
which the mesh projects onto a two-dimensional depth image. The task is to infer the underlying 3D pose parameters
from a noisy depth image. We implemented two importance
sampling algorithms, one with a default proposal and one
with a custom proposal that employs a deep neural network
trained on data generated from the model. Neither Turing
nor Venture supports custom importance sampling proposals.
We compared a Turing implementation using the default
proposal, with a Gen implementation using the custom proposal. The Gen proposal is a generative function written in
the Static DSL that invokes a generative function written in
the TensorFlow DSL. The results (Figure 6) show that the
Gen implementation gives more accurate results in orders
of magnitude less time than Turing. This shows that custom
proposals trained using amortized inference can be essential
for good performance in computer vision applications.
7 Related Work
We discuss work in probabilistic programming, differentiable
programming/deep learning, and probabilistic inference.
Probabilistic Programming Researchers have introduced
many probabilistic programming languages over the last 20
years [6, 13–15, 28, 30, 31, 35, 36, 41, 44, 48]. With the exception of Venture and Turing, these languages do not support
user-programmable inference [29]. Users of Venture and
Turing specify inference algorithms in restrictive inference
DSLs. In contrast, Gen users define inference algorithms
in ordinary Julia code that manipulates execution traces of
probabilistic models. The examples in this paper show how
Gen’s approach to programmable inference is more flexible
and supports modular and reusable custom inference code.
Compilers have been developed for probabilistic programming languages to improve the performance of inference
[6, 21, 50, 51]. Gen’s JIT compiler is currently less aggressive
than these compilers, but Gen’s architecture is significantly
more extensible, allowing end users to integrate custom proposal distributions, new inference algorithms, custom combinators that influence the control flow of modeling and
inference, and custom compilers for new DSLs. No existing
system with compiled inference supports these features.
The TensorFlow DSL in Gen provides high-performance
support for deep generative models, including those written
in the style favored by Pyro [5] and Edward [44]. However,
unlike Pyro, Gen can also express efficient Markov chain
and sequential Monte Carlo algorithms that rely on efficient
mutation of large numbers of traces. Also, these systems
lack Gen’s programmable inference support and hierarchical
address spaces.
Deep Learning and Differentiable Programming Unlike
deep learning platforms such as TensorFlow [1], PyTorch [34],
Theano [3], and MXNet [7], Gen programs explicitly factorize modeling and inference. Gen is also more expressive. For
example, Gen provides support for fast Monte Carlo and numerical optimization updates to model state. Gen allows for
models and inference algorithms to be specified in flexible
combinations of Julia, TensorFlow, and other DSLs added by
Gen users. Gen also automates the process of calculating
the proposal densities needed for a broad range of advanced
Monte Carlo techniques, given user-specified custom proposals that can combine Julia and TensorFlow code.
Probabilistic Inference Gen provides language constructs,
DSLs, and inference library routines for state-of-the-art techniques from probabilistic modeling and inference, including generative modeling [32], deep neural networks [1],
Monte Carlo inference [2, 16, 38], and numerical optimization [17]. Recently, artificial intelligence and machine learning researchers have explored combinations of these techniques. For example, recent work in computer vision uses
generative models based on graphics pipelines and performs
inference by combining Monte Carlo, optimization, and machine learning [24, 49]. Gen’s language constructs support
these kinds of sophisticated hybrid architectures. This is illustrated via the case studies in this paper, which show how
Gen can implement state-of-the-art algorithms for (i) inferring 3D body pose by inverting a generative model based
on a graphics engine; (ii) inferring the probable goals of an
agent by inverting an algorithmic planner [8]; and (iii) learning the structure of Gaussian process covariance functions
for modeling time series data [11, 29, 40, 42].
8 Conclusion
This paper has shown how to build a probabilistic programming system that solves challenging problems from multiple
fields, with performance that is superior to other state-of-the-art probabilistic programming systems. Key challenges that
were overcome include (i) achieving good performance for
heterogeneous probabilistic models that combine different
types of computations such as simulators, deep neural networks, and recursion; and (ii) providing users with abstractions that simplify the implementation of custom inference
algorithms while being minimally restrictive.
Acknowledgments
This research was supported in part by the US Department
of the Air Force contract FA8750-17-C-0239, grants from the
MIT Media Lab/Harvard Berkman Center Ethics & Governance of AI Fund and the MIT CSAIL Systems that Learn
Consortium, a gift from the Aphorism Foundation, and the
US Department of Defense through the National Defense Science & Engineering Graduate Fellowship (NDSEG)
Program.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,
Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving,
Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry
Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016.
TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 2016). USENIX Association, 265–283.
[2] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. 2010.
Particle Markov chain Monte Carlo Methods. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 72, 3 (2010), 269–
342.
[3] James Bergstra, Frédéric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian J. Goodfellow, Arnaud Bergeron, and Yoshua Bengio. 2011.
Theano: Deep learning on GPUs with Python. In Big Learn Workshop,
NIPS 2011.
[4] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. 2017.
Julia: A fresh approach to numerical computing. SIAM review 59, 1
(2017), 65–98.
[5] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer,
Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul
Horsfall, and Noah D. Goodman. 2018. Pyro: Deep universal probabilistic programming. (2018). arXiv:1810.09538
[6] Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben
Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li,
and Allen Riddell. 2017. Stan: A probabilistic programming language.
Journal of Statistical Software 76, 1 (2017), 1–32.
[7] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang,
Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015.
MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. (2015). arXiv:1512.01274
[8] Marco F. Cusumano-Towner, Alexey Radul, David Wingate, and
Vikash K. Mansinghka. 2017. Probabilistic programs for inferring
the goals of autonomous agents. (2017). arXiv:1704.04977
[9] Arnaud Doucet, Nando De Freitas, and Neil Gordon. 2001. An Introduction to Sequential Monte Carlo Methods. In Sequential Monte
Carlo Methods in Practice, Arnaud Doucet, Nando de Freitas, and Neil
Gordon (Eds.). Springer, 3–14.
[10] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan
Roweth. 1987. Hybrid Monte Carlo. Physics Letters B 195, 2 (1987),
216–222.
[11] David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. 2013. Structure discovery in nonparametric regression through compositional kernel search. In Proceedings
of the 30th International Conference on Machine Learning (ICML 2013).
PMLR, 1166–1174.
[12] Martin A. Fischler and Robert C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis
and automated cartography. Commun. ACM 24, 6 (June 1981), 381–395.
[13] Hong Ge, Kai Xu, and Zoubin Ghahramani. 2018. Turing: A language
for flexible probabilistic inference. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018).
PMLR, 1682–1690.
[14] Noah Goodman, Vikash Mansinghka, Daniel M. Roy, Keith Bonawitz,
and Joshua B. Tenenbaum. 2008. Church: a language for generative
models. In Proceedings of the 24th Annual Conference on Uncertainty in
Artificial Intelligence (UAI 2008). AUAI Press, 220–229.
[15] Noah D Goodman and Andreas Stuhlmüller. 2014. The Design and
Implementation of Probabilistic Programming Languages. http://dippl.
org. Accessed: 2018-11-8.
[16] Peter J. Green. 1995. Reversible jump Markov chain Monte Carlo
computation and Bayesian model determination. Biometrika 82, 4
(1995), 711–732.
[17] Andreas Griewank. 1989. On automatic differentiation. Mathematical
Programming: Recent Developments and Applications 6, 6 (1989), 83–
107.
[18] Roger B. Grosse, Ruslan Salakhutdinov, William T. Freeman, and
Joshua B. Tenenbaum. 2012. Exploiting compositionality to explore a
large space of model structures. In Proceedings of the 28th Conference on
Uncertainty in Artificial Intelligence (UAI 2012). AUAI Press, 306–315.
[19] Wilfred K. Hastings. 1970. Monte Carlo sampling methods using
Markov chains and their applications. Biometrika 57, 1 (1970), 97–109.
[20] Geoffrey E Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal.
1995. The" wake-sleep" algorithm for unsupervised neural networks.
Science 268, 5214 (1995), 1158–1161.
[21] Daniel Huang, Jean-Baptiste Tristan, and Greg Morrisett. 2017. Compiling Markov chain Monte Carlo algorithms for probabilistic modeling.
In Proceedings of the 38th ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI 2017). ACM, 111–125.
[22] Bjarne Knudsen and Jotun Hein. 2003. Pfold: RNA secondary structure
prediction using stochastic context-free grammars. Nucleic Acids
Research 31, 13 (2003), 3423–3428.
[23] Daphne Koller, Nir Friedman, and Francis Bach. 2009. Probabilistic
Graphical Models: Principles and Techniques. MIT Press.
[24] Tejas D. Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum, and Vikash
Mansinghka. 2015. Picture: A probabilistic programming language for
scene perception. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR 2015). IEEE, 4390–4399.
[25] Steven M. LaValle. 1998. Rapidly-exploring random trees: A new tool
for path planning. Technical Report TR 98-11. Computer Science
Department, Iowa State University.
[26] Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. 2017. Inference
compilation and universal probabilistic programming. In Proceedings of
the 20th International Conference on Artificial Intelligence and Statistics
(AISTATS 2017). PMLR, 1682–1690.
[27] Jonathan Malmaud and Lyndon White. 2018. TensorFlow.jl: An idiomatic Julia front end for TensorFlow. Journal of Open Source Software
3, 31 (2018), 1002.
[28] Vikash Mansinghka, Daniel Selsam, and Yura Perov. 2014. Venture: A
higher-order probabilistic programming platform with programmable
inference. (2014). arXiv:1404.0099
[29] Vikash K. Mansinghka, Ulrich Schaechtle, Shivam Handa, Alexey
Radul, Yutian Chen, and Martin Rinard. 2018. Probabilistic programming with programmable inference. In Proceedings of the 39th ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, 603–616.
[30] Andrew McCallum, Karl Schultz, and Sameer Singh. 2009. FACTORIE:
Probabilistic programming via imperatively defined factor graphs. In
Advances in Neural Information Processing Systems 22 (NIPS), Yoshua
Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. Williams,
and Aron Culotta (Eds.). Curran Associates, 1249–1257.
[31] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L.
Ong, and Andrey Kolobov. 2005. BLOG: Probabilistic models with
unknown objects. In Proceedings of the Nineteenth International Joint
Conference on Artificial Intelligence (IJCAI). Morgan Kaufmann Publishers Inc., 1352–1359.
[32] Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
[33] Lawrence M. Murray. 2013. Bayesian state-space modelling on high-performance hardware using LibBi. (2013). arXiv:1306.3277
[34] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward
Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga,
and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[35] Avi Pfeffer. 2001. IBAL: A probabilistic rational programming language.
In Proceedings of the Seventeenth International Joint Conference on
Artificial Intelligence (IJCAI), Vol. 1. Morgan Kaufmann Publishers Inc.,
733–740.
[36] Avi Pfeffer. 2016. Practical Probabilistic Programming (1 ed.). Manning
Publications Co.
[37] Christian Robert and George Casella. 2013. Monte Carlo Statistical
Methods. Springer.
[38] Gareth O. Roberts and Jeffrey S. Rosenthal. 1998. Optimal scaling of
discrete approximations to Langevin diffusions. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 60, 1 (1998), 255–
268.
[39] Gareth O. Roberts and Richard L. Tweedie. 1996. Exponential convergence of Langevin distributions and their discrete approximations.
Bernoulli 2, 4 (December 1996), 341–363.
[40] Feras A. Saad, Marco F. Cusumano-Towner, Ulrich Schaechtle, Martin C. Rinard, and Vikash K. Mansinghka. 2019. Bayesian synthesis
of probabilistic programs for automatic data modeling. Proc. ACM
Program. Lang. 3, POPL, Article 37 (2019), 29 pages. (Forthcoming, 46th
ACM SIGPLAN Symposium on Principles of Programming Languages,
POPL’19).
[41] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016.
Probabilistic programming in python using PyMC3. PeerJ Computer
Science 2 (2016), e55.