The Data Factory

At a Glance

There are many paradigms for data processing: lakes, pipelines, and flows; sources, sinks, and interactions; sensors, emitters, and transforms. However, none of these offers a mental model rich enough to talk meaningfully about complex constructs involving many types of actors. We propose the “data factory” as a new paradigm for data processing.

Background

A quick survey of existing paradigms helps explain the need.

The most common terminology for data processing comes from analogy to fluids, e.g., “data lakes” and “data pipelines”. The data “flows” and “pools” in our system, with useful calculation performed along the way. But there are three problems with this as a mental model:

1. While it’s understood that a “pipeline” doesn’t simply move data, the model itself lacks terminology for the transformations as such.
2. There isn’t recognition of human (or AI) actors who may make active choices about the data and how to process it.
3. There isn’t a way to model the data processing system as part of a larger process.

The next paradigm for data processing comes from the physics and high-performance computing communities. It discusses data processing in analogy to particle physics experiments: sources (which emit data), sinks (which receive transformed data), and interactions (the places where the data is transformed). This paradigm solves the first issue above, in that it has a way to discuss transformations of the data explicitly. However, it still suffers from issues two and three. That is, it has no native way to discuss the role of active agents interacting with or controlling the system, and it has no native way to discuss embedding such a system within a broader context, such as a business process.

The third paradigm is a mixed-analogy terminology used within a major corporation. It’s a case of terminology that, while widely used, lacks a coherent overarching analogy. In addition to still suffering from issues two and three, this lack of coherent terminology introduces a fourth issue:

This terminology fails to serve as an intuition pump for developers.

We can see that a data processing model needs a few key features, which current models struggle to provide:

The model needs a way to address each component, e.g., transforms and actors.
The model needs a way to discuss embedding data processing within a broader context.
The model needs to serve as an intuition pump for the developers (and others) using it.

What to Do

The core problem is that each of these mental models fails to deliver a rich enough analogy to prime the intuition of developers working on real-world systems. The solution is to adopt an analogy that is both familiar and broad enough to include things like decision makers and integrations with broader systems: the data factory.

Why and How

The notion of a data factory is a powerful analogy for two reasons: most people, even the non-technical, have some conception of what a factory does; and the analogy is rich enough to solve the issues found in the other paradigms.

The analogy has several components.

At a high level, we can discuss the entire factory as an abstract entity integrated into broader processes:

A factory consumes either raw inputs or the outputs of other factories.
A factory delivers its outputs either directly to other factories or to warehouses for storage and distribution.
A factory has particular units of delivery, which may be more than a single item at a time: pallets, trucks, etc.
The broader system can be discussed as a supply chain.
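
As a rough illustration of the factory-level view, the sketch below treats each factory as a black box that consumes items and ships them in pallets. The names Factory, Warehouse, pallet_size, and flatten are hypothetical and chosen only for this example; they are not part of any existing library, and real systems would have richer delivery units.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Warehouse:
    """Stores delivered pallets for later distribution (hypothetical)."""
    pallets: List[List[int]] = field(default_factory=list)

    def receive(self, pallet: List[int]) -> None:
        self.pallets.append(pallet)


@dataclass
class Factory:
    """A factory seen from outside: items in, pallets out (hypothetical)."""
    process: Callable[[int], int]  # stand-in for everything inside the factory
    pallet_size: int               # unit of delivery: items per pallet

    def run(self, inputs: Iterable[int]) -> Iterator[List[int]]:
        pallet: List[int] = []
        for item in inputs:
            pallet.append(self.process(item))
            if len(pallet) == self.pallet_size:
                yield pallet
                pallet = []
        if pallet:  # ship a final, partially filled pallet
            yield pallet


def flatten(pallets: Iterable[List[int]]) -> Iterator[int]:
    """Unpack pallets so a downstream factory can consume single items."""
    for pallet in pallets:
        yield from pallet


# A two-factory supply chain: raw inputs -> doubler -> incrementer -> warehouse.
doubler = Factory(process=lambda x: x * 2, pallet_size=3)
incrementer = Factory(process=lambda x: x + 1, pallet_size=2)
warehouse = Warehouse()

for pallet in incrementer.run(flatten(doubler.run(range(5)))):
    warehouse.receive(pallet)

print(warehouse.pallets)  # [[1, 3], [5, 7], [9]]
```

Note that the two factories use different units of delivery, so the supply chain needs an explicit unpacking step between them. That is one of the design questions the analogy brings to the surface.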

Looking into the factory, we find that the analogy provides a rich framework for discussing data processing:

A data machine takes some set of inputs and produces an output.
These outputs are either consumed by another machine or temporarily stored in a staging area.
A data machine has sensors (and other instruments) to detect its state and report to operators.
Sets of machines can be chained into assembly lines, which may branch or converge.
These assembly lines package their outputs into units of delivery.
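
The points above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a reference design: Machine, AssemblyLine, readings, staging, and unit_size are hypothetical names, the "sensor" is just a pair of counters, and the line is strictly linear (branching and converging lines are left out for brevity).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Machine:
    """A data machine: one input in, one output out (hypothetical)."""
    name: str
    transform: Callable[[int], int]
    # "Sensor" readings an operator could inspect.
    readings: Dict[str, int] = field(
        default_factory=lambda: {"processed": 0, "failed": 0}
    )

    def run(self, item: int) -> int:
        try:
            result = self.transform(item)
        except Exception:
            self.readings["failed"] += 1
            raise
        self.readings["processed"] += 1
        return result


@dataclass
class AssemblyLine:
    """Chains machines and packages outputs into units of delivery."""
    machines: List[Machine]
    unit_size: int
    staging: List[int] = field(default_factory=list)  # staging area

    def feed(self, item: int) -> List[List[int]]:
        """Process one item; return any units of delivery completed by it."""
        for machine in self.machines:
            item = machine.run(item)
        self.staging.append(item)
        units: List[List[int]] = []
        while len(self.staging) >= self.unit_size:
            units.append(self.staging[: self.unit_size])
            del self.staging[: self.unit_size]
        return units


line = AssemblyLine(
    machines=[
        Machine("scale", lambda x: x * 10),
        Machine("offset", lambda x: x + 1),
    ],
    unit_size=2,
)

shipped: List[List[int]] = []
for raw in [1, 2, 3]:
    shipped.extend(line.feed(raw))

print(shipped)                    # [[11, 21]]
print(line.staging)               # [31]
print(line.machines[0].readings)  # {'processed': 3, 'failed': 0}
```

The third item stays in the staging area because it does not yet fill a unit of delivery.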

There’s even a powerful way to discuss the interaction of actors with assembly lines:

A worker can inspect a machine’s performance and output quality.
A worker can repair or replace a broken machine.
A worker can perform manual processing steps or manually operate a machine.
A customer can place a (special) order with the factory.
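
The worker interactions can be sketched the same way. In this hedged example a machine is simply a function from string to string, and Worker, inspect, replace, and special_order are hypothetical names introduced for illustration only.

```python
from typing import Callable, Dict

Machine = Callable[[str], str]  # assumption: a machine is a plain function


class AssemblyLine:
    """Named machines applied in insertion order (hypothetical)."""

    def __init__(self, machines: Dict[str, Machine]) -> None:
        self.machines = dict(machines)

    def run(self, item: str) -> str:
        for machine in self.machines.values():
            item = machine(item)
        return item


class Worker:
    """An actor who inspects, repairs, and manually processes."""

    def inspect(self, line: AssemblyLine, sample: str, expected: str) -> bool:
        # Quality check: run a known sample and compare with the expected output.
        return line.run(sample) == expected

    def replace(self, line: AssemblyLine, name: str, new: Machine) -> None:
        # Assigning to an existing key keeps the machine's position in the line.
        line.machines[name] = new

    def special_order(
        self, line: AssemblyLine, item: str, manual_step: Machine
    ) -> str:
        # A customer's special order: normal processing plus a manual step.
        return manual_step(line.run(item))


# The "case" machine is broken: it lower-cases instead of upper-casing.
line = AssemblyLine({"trim": str.strip, "case": str.lower})
worker = Worker()

ok_before = worker.inspect(line, "  abc ", "ABC")  # False
worker.replace(line, "case", str.upper)
ok_after = worker.inspect(line, "  abc ", "ABC")   # True

special = worker.special_order(line, " rush ", lambda s: s + "!")
print(ok_before, ok_after, special)  # False True RUSH!
```

Keeping the worker separate from the line makes the actor's decisions explicit in the code, which is the gap the earlier paradigms leave open.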

The strength of this analogy lies in its ability to discuss several aspects and abstraction levels at once. Indeed, one can discuss an even finer level of detail by focusing on the mechanics of assembly lines and machine construction: discussions of efficient transfer (i.e., how assembly lines are built from machines) or discussions of component robustness (i.e., how machines are built from blueprints and components). This is important in a mental model, which serves as a skeleton to align the components of our technical systems.

This paradigm solves the problems discussed above:

By discussing how individual worker tasks and data machines fit within the system;
By being able to discuss workers within the context of assembly lines and factories;
By being able to discuss a factory as an abstract entity within a supply chain;
By being able to serve as an intuition pump for developers implementing such systems.

The language we choose impacts our design: let’s do our best work by choosing the strongest mental model – the data factory.