Information programming

My AGI-21 talk on this topic is available on YouTube, with the audio quality improving slightly 5 minutes in. This material - along with the slide deck and its overview - can be complementary to the article below, which has a different focus.

Information programming is a paradigm and experimental framework for computing missing data. This scope is broad, with instances including planning (efficient paths from cheap heuristics), recognition (fully labeled images from simple filters), and maximization (better solutions from existing ones). IP tackles these problems in the same principled manner.

By limiting the constructions and demanding specific properties, IP comes with more guarantees than you may have come to expect. Information in the system will always increase, and some policies are guaranteed to find fixed points or solutions optimal under some metric. I will go over the paradigm's different building blocks first and then detail how execution proceeds. The final part dives into examples like parsing and answering questions by combining multiple strategies.
Note this is a new, multi-faceted project: the article omits many details, the framework is not optimized, and most areas that could benefit from an IP approach remain untouched.


Similar to a computational graph in TensorFlow, your code isn't executed directly but builds a structure underpinning the dynamics. The structure here - referred to as the "IP network" - is a heterogeneous (supporting different kinds of nodes) directed (edges having sources and destinations) hypergraph (edges connecting multiple nodes). Specifically, each of the paradigm's components plays a specific role in the IP network.

On top of this high-level structure, there are low-level laws the components need to follow. The following sections describe the components in order, detailing the paradigm's value level.


A type in IP adheres to a partial order, superseding your typical variable type. One element is less than another if it contains less information, though two elements need not be comparable. What does "less information" mean in this context? While in some cases the notions here form an Information algebra, I'll stick to the intuitive version. Depending on the problem, less information can mean fewer features present, more options left, or a wider distribution to consider.
How is your typical variable type superseded? There are extra partial results (less than your typical value) and ambiguous outcomes (greater than your typical value). Say your typical type for a variable in a set of equations would be Float. In IP, it could have an underdetermined Incomplete cousin (when missing variables leave the equations incomplete) and an overdetermined Integer cousin (counting the number of conflicting equations).

Information programming requires two attributes to be specified on these types. There has to be a bottom element among the ordered elements, corresponding to the "no information" case. Secondly, a join operation (commonly denoted as ∨) needs to combine the information from multiple elements. Together with the partial order, this forms a join semilattice, which provides a backbone for the additional components. The Incomplete value is the bottom element in the example above. For the join, Incomplete ∨ Float would map to the number, Float ∨ Float would map to the integer two if the numbers differ and to the agreed-upon solution otherwise, and Integer ∨ Integer would add the conflict counts together.
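The equation example's semilattice can be made concrete with a small sketch (in Python for brevity; the framework itself is in Scala 3, and the encoding below is an illustrative assumption):

```python
# Information type for the equation example: "Incomplete" is the bottom
# element, floats are candidate solutions, and ints count conflicting equations.
Incomplete = None  # bottom: no information about the variable yet

def join(a, b):
    """Combine the information of two elements (the ∨ from the text)."""
    if a is Incomplete:
        return b                  # Incomplete ∨ x = x
    if b is Incomplete:
        return a
    if isinstance(a, float) and isinstance(b, float):
        # Agreeing solutions stay; differing ones become two conflicting equations.
        return a if a == b else 2
    if isinstance(a, int) and isinstance(b, int):
        return a + b              # add the conflict counts together
    raise NotImplementedError("mixed float/int joins are not specified above")
```

Note that join is associative, commutative, and idempotent on the cases covered, which is what the later execution schemes rely on.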

Another example would be making observations on a 2D map or grid. Here you want to narrow down a search area by combining observations, modeled as binary masks for this example. The more area the masks cover, the more information about the problem. The bottom element is the unexplored map, and the join unifies two masks. Unlike the set of equations from the previous example, there is also a top element with which no join has an effect: the fully covered map. Top elements are not integral to information programming, but they do enable backtracking policies and some optimizations.
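A minimal encoding of this map example (a Python sketch; the grid size and the coordinate-set representation of masks are assumptions):

```python
# Observation masks on a small grid. Bottom = the unexplored map, join = the
# union of two masks, top = the fully covered map (joining with it is a no-op).
WIDTH, HEIGHT = 4, 4
BOTTOM = frozenset()
TOP = frozenset((x, y) for x in range(WIDTH) for y in range(HEIGHT))

def join(a, b):
    return a | b  # unify two binary masks

obs1 = frozenset({(0, 0), (1, 0)})
obs2 = frozenset({(1, 0), (2, 3)})
combined = join(obs1, obs2)
```

The partial order here is set inclusion: a mask covering more area carries more information, and BOTTOM ≤ combined ≤ TOP holds by construction.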


A cell accommodates a value of an information type, starting at the bottom value and increasing in information with updates. Every cell stores a single value of a predetermined type, mutable only through joins (thus respecting the order).
Currently, there are no dedicated input or output components, so inputs are cells set before calculation, and outputs are cells read after calculation. This uniformity may be traded for a more performant, classical input/output specialization when applications demand it.
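A minimal cell could look like the following Python sketch (the real framework's API will differ; the join-only update is the essential part):

```python
class Cell:
    """Holds one value of an information type; the value only grows via joins."""
    def __init__(self, bottom, join):
        self.value = bottom               # every cell starts at the bottom element
        self._join = join

    def update(self, info):
        """The only mutation allowed: join new information into the value."""
        self.value = self._join(self.value, info)

# Inputs are cells set before calculation, outputs are cells read after it.
area = Cell(bottom=frozenset(), join=lambda a, b: a | b)
area.update(frozenset({(0, 0)}))          # "input": seed an observation
area.update(frozenset({(1, 0)}))          # "output": read area.value afterwards
```

Because update goes through the join, out-of-order or duplicated updates cannot decrease information or corrupt the value.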

As mentioned, cells are like nodes: they have a specific identity or location and no added structure. As a side effect, the organization of cells needs to be external. Luckily, only the values change and not the cells themselves, so sets or multimaps can point to them without touching the overarching IP network. The organizational methods often used in the framework are group construction and deduction binding. For the former, think of a cell for every word in a sentence, spawned together. As for the latter, deductions register the cells in their (co)domain, which is handy for lookup too.


Deductions are functions between cells of the same information type. Some aim to address inconsistencies, semantic conflicts, or violated laws and effectively discard options. Other deductions propagate facts, perform observations, or improve a result.

Say you're generating a city parametrically and need to decide how to position buildings. For simplicity, not encoding features but enumerating all possibilities, the information type becomes A = Set[Building]. The bottom element is the set of all possible buildings, with set intersection as the logical choice for the join operation. The empty set can serve as a top element, indicating conflicting requirements, while the singleton set makes it easy to actually draw the solution. Finally, if a particular city block a: A is encoded as an information cell, F: (A, A, A, A) → A can be the deduction that takes neighbors into account.
As for notation, the power notation A^4 → A is used henceforth, abbreviated to X → X where the math generalizes.

Intuitively, F eliminates options from a block using the information of neighboring blocks: two neighboring buildings can't face each other, some building types need to span multiple blocks, and whatever other properties you want your layout to have. Note that with a more sophisticated information type, you could lower probabilities instead of eliminating options. With this extension, a block with two warehouse neighbors could have a decreased chance of being a restaurant.
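A toy version of F could look as follows (a Python sketch; the building names and the single "no warehouse next to a restaurant" constraint are invented for illustration):

```python
# A = set of building options per block; bottom = all options,
# join = intersection, so deductions make progress by removing options.
ALL = frozenset({"restaurant", "warehouse", "park", "tower"})

# Assumed constraint: a warehouse may not stand next to a restaurant.
FORBIDDEN = {("warehouse", "restaurant"), ("restaurant", "warehouse")}

def F(block, north, east, south, west):
    """Keep only options compatible with at least one option of each neighbor."""
    keep = set()
    for option in block:
        if all(any((option, n) not in FORBIDDEN for n in nb)
               for nb in (north, east, south, west)):
            keep.add(option)
    return frozenset(keep)

# A block whose four neighbors are all forced to be restaurants:
restaurant = frozenset({"restaurant"})
block = ALL
block = block & F(block, restaurant, restaurant, restaurant, restaurant)  # join = ∩
```

After the update, "warehouse" has been eliminated from the block while the other options survive; F is monotone because it can only remove options.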

Large deductions are common when working on image segmentation or neural-network analogs. Deductions like dilate: RGB^1024 → RGB^1024 are justified by "dense operations" such as matrix exponentiation for transitive closure or vector instructions joining bitsets.
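The transitive-closure case can be sketched in the spirit of "vector instructions joining bitsets" (a Python illustration; the graph and bitmask encoding are assumptions):

```python
# Row i is an int whose bits mark the nodes reachable from i. One dense step
# ORs in every two-hop path; the join is bitwise OR, so the step is monotone.
def step(rows):
    out = []
    for row in rows:
        acc, j, bits = row, 0, row
        while bits:
            if bits & 1:
                acc |= rows[j]        # join: OR in the reachable set of node j
            bits >>= 1
            j += 1
        out.append(acc)
    return out

# 0 -> 1 -> 2 (plus self-loops), encoded as bitmasks
rows = [0b011, 0b110, 0b100]
while (nxt := step(rows)) != rows:    # iterate the deduction to a fixed point
    rows = nxt
```

At the fixed point, node 0 reaches node 2 through node 1, even though the original adjacency lacked that edge.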

Whether mapping to a different type of cells (e.g. X → Y) should be allowed is an open question. Technically, such functions can be extended to form an information exchange, though this requires the definition of a not-always-feasible backward function. If you see a restaurant, you're less likely to encounter a warehouse, but this is not always obvious.
Performance-wise, allowing these mappings can go both ways, with the naive implementation translating exchanges down to deductions.


Picking up on the earlier example: your hypothetical city also needs roads. With the roads needing to connect properly too, there are two different levels on which you want your city to be correct. So let's analogously define B = Set[Streets] to cover all possible street layouts for that block.
Roundabouts (a kind of street for this purpose) shouldn't connect just two roads, and a 4-lane road shouldn't have an abrupt end. Let's put these constraints in G: B → B (dropping the arity here and in F for convenience). For instance, when b: B has a street option that can't connect with any option in a neighboring block d: B, G removes it.

Running the city generator at this point won't lead to good results, regardless of the execution policy. We don't get a unique city on one or both levels (many different configurations remain possible), while impossible combinations (like streets passing through buildings) are still present. For example, F reaches some fixed point on A, calculation stops, and we are left with a sub-optimal result.

To resolve this, define L: Set[Building] → Set[Streets], which filters the street options based on where the buildings are. Doing the converse for R yields an antitone Galois connection L: A → B, R: B → A that shares relevant information monotonically. Updates caused by G on B now influence A and vice versa, hence the name "exchange". So where the update normally looks like a2 = a1 ∨ F(a1); b2 = b1 ∨ G(b1), it now looks like a2 = a1 ∨ F(a1) ∨ R(G(b1)); b2 = b1 ∨ G(b1) ∨ L(F(a1)). With every update, both sets get reduced further, increasing the convergence speed and potentially lifting the solution past a sub-optimal result.
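One exchange-augmented update step can be sketched directly from these formulas (a Python illustration; the building/street compatibility pairs and the specific deductions are invented for the example):

```python
# A = building options, B = street options; join = intersection for both.
# Assumed compatibilities: a park needs a footpath, a tower needs an avenue.
PAIRS_OK = {("park", "footpath"), ("tower", "avenue")}

def L(a):  # buildings -> compatible streets
    return frozenset(s for (bld, s) in PAIRS_OK if bld in a)

def R(b):  # streets -> compatible buildings
    return frozenset(bld for (bld, s) in PAIRS_OK if s in b)

def F(a):  # deduction on A: some building-level rule ruled towers out
    return a - {"tower"}

def G(b):  # deduction on B: identity here, for simplicity
    return b

a1 = frozenset({"park", "tower"})
b1 = frozenset({"footpath", "avenue"})
a2 = a1 & F(a1) & R(G(b1))   # a2 = a1 ∨ F(a1) ∨ R(G(b1)), with ∨ = ∩
b2 = b1 & G(b1) & L(F(a1))   # b2 = b1 ∨ G(b1) ∨ L(F(a1))
```

Eliminating the tower on the building level also eliminates the avenue on the street level in the same step, which G alone would never have achieved.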

Setting up multiple exchanges between all related types - syncing buildings, streets, water networks, and walkways - further increases their effectiveness.

If the deductions don't need the same domain and codomain (e.g. F: Q → A and G: B → R), chains like G ∘ L ∘ F: Q → R are also possible.


We studied the components for building the IP network independently; now let's investigate the rich execution schemes they support. I'll first review how everything fits together and shed light on some practical aspects, but feel free to skip ahead to the concrete examples.

An execution scheme for IP manages what can be calculated when. For deductions, this means determining the execution order, the stop criteria, and the grouping of sub-results (to be joined into cells). While - respectively - the definition of join makes deductions confluent, there is only one end state in theory, and different groupings are equivalent, these remain practical considerations.

The receivers of all these calculations are the information cells, and it helps to understand the data flow from their perspective. With this kind of view, it is straightforward to draw parallels to standard techniques like message passing.

The framework's intrinsic properties make it ideally suited for parallelization. Because join is associative and commutative, deductions can happen out of order. Moreover, deductions are often executed many times, allowing for well-informed concurrent execution scheduling. The following sections will not discuss this topic, as it would divert attention from the core mechanisms, but it is a significant asset of IP nonetheless.

Overview of the IP network

The IP network is flat (thanks to the hypergraph structure), which can be daunting. There are no deductions calling other deductions the way functions do, no (type) classes or similar structures, and no modules. As mentioned, this structure has to come from the host language (currently Scala 3, with C++ and Rust being logical choices too) or from framework extensions (or ad hoc, library-style). With labeling, cells, deductions, and exchanges can be conveniently retrieved.

As another mitigation, deductions between cells are rarely defined top-level but rather in a tall stack of functions composing the sometimes intricate connectivity. Another way to manage complexity is to refrain from factoring a deduction like pool: A^1024 → A^64 into its 960 max: A^2 → A components. While this factoring can - in rare cases - improve the efficiency of some policies (those that do not need to recalculate everything), it is unnecessarily laborious.
Many-to-many deductions like pool are common in IP. One technical reason is the lack of currying or encapsulation, which results in cells (and thus inputs and outputs) being used as state for side effects. Yet this is natural for deductions in the ordinary sense of the word: evidence comes together, and the conclusion influences many different beliefs.

The heterogeneity in the IP network comes from the different information types (and thus cell types) used. Assuming deductions can only connect cells of the same type, we end up with blobs internally connected by deductions and externally connected by exchanges. Exchanges can be involved, like those between a parsed sentence and a knowledge base, or straightforward, like projections from subset-like types to a total order counting specified elements.

Cell updates

Cells are central to the execution; they hold the data. That is, every cell starts out with its value set to the bottom element and is then updated to contain some information. The framework can bundle the creation and update for known values, but this doesn't need new theoretical grounding. Similarly, events periodically entering the system can be set up as cell update triggers in the host language.
For substantial or less specialized tasks, it may be helpful for different parts of the system to have different lifetimes. Imagine a neural-network-inspired architecture with a per-sample subnetwork making deductions to update the predictive model's weights. Having a subnetwork for every sample in a large dataset likely results in memory problems, thus reusing the structure makes sense. Practically, the subnetwork can be reset for each sample with a reset-to-bottom escape hatch and the labels discussed earlier.

One way to view information propagating through the network is that every cell sends info to its neighbors. As instructions: for each cell that received information in the previous timestep, fire the deductions taking it as an argument, and merge the results into the result cells. A heavily simplified pseudocode example would be for c in cells {for (deduction, neighbor) in c.influencing {neighbor.value = neighbor.value ∨ deduction(c.value)}}. With the slight alteration of buffering the new values instead of merging them directly, this looks like message passing.
An opposing view that's mathematically elegant - and shown in the section about exchanges - expresses a cell's new value in terms of its neighbors' old values. For completeness: for c in cells {c.value = ∨(for (deduction, neighbor) in c.influenced yield deduction(neighbor.value))}.
In many cases, the most efficient approach is to focus on the deductions SIMD-style. This approach allows for more predictable branching and better memory locality. A very rough approximation is for vectorized_deduction, influencing, influenced in network {vectorized_update(vectorized_deduction(influencing), influenced)}, where every iteration of this for loop could be transferred to an accelerator.
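The first, cell-centric style can be made runnable with little code (a Python sketch; the identity deductions and the toy mask network are assumptions, and a real implementation would track which cells changed):

```python
class Cell:
    def __init__(self, bottom, join):
        self.value, self.join = bottom, join
        self.influencing = []             # list of (deduction, target_cell)

def propagate(cells, steps):
    """Fire every cell's outgoing deductions, joining results into the targets."""
    for _ in range(steps):
        for c in cells:
            for deduction, neighbor in c.influencing:
                neighbor.value = neighbor.join(neighbor.value, deduction(c.value))

# Toy network: two observation-mask cells feeding a combined cell.
union = lambda a, b: a | b
a, b, out = (Cell(frozenset(), union) for _ in range(3))
a.value = frozenset({(0, 0)})             # inputs: cells set before calculation
b.value = frozenset({(1, 1)})
a.influencing.append((lambda v: v, out))  # identity deductions into "out"
b.influencing.append((lambda v: v, out))
propagate([a, b, out], steps=1)           # output: read out.value afterwards
```

Buffering deduction results per target instead of joining them in place would turn this into the message-passing variant; the end state is the same because join is associative, commutative, and idempotent.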

One of the details missing here is the handling of exchanges. Note that although left out for simplicity's sake, they drive the chunking of similarly typed computation. Also note that the idempotence of the join is what makes some of these styles work, favoring easy batching over a guarantee that no duplicate work is performed.


At last, let's talk about how computation happens in IP. The availability of different execution strategies (and their modular composition, more on that later) contrasts with mainstream frameworks, where the execution strategy - be it event-driven, feedforward, or even search-like - is typically ingrained. Here, the policy depends on the problem, the properties provided, and what you can tell about the solution space. By the basic principles, there's always an (underdetermined) answer, though different policies may get further towards a solution or provide a richer answer. They are all listed below, with their features, drawbacks, and manifestations.