Information assembly lines

This post is part of my collaborative research with Shinsei Bank on highly-evolvable enterprise architectures.  It is licensed under the Creative Commons Attribution-ShareAlike 3.0 license.  I am indebted to Jay Dvivedi and his team at Shinsei Bank for sharing with me the ideas developed here.  All errors are my own.

In my previous post, I explained my (admittedly somewhat arbitrary) transition from version zero to version one of my architectural theory for enterprise software.  The design metaphor for version one of the theory is the high-volume manufacturing facility where assembly lines churn out large quantities of physical products.  Design metaphors from version zero of the theory (the zoo, the house, the city, and the railway) will probably appear at some point, but I’m not yet exactly sure how they fit.

Jay often describes business processes at Shinsei as computer-orchestrated information assembly lines.  These lines are composed of a series of virtual workstations (locations along the line where work is performed), and transactions move along the line from one workstation to the next on virtual pallets.  At each workstation, humans or robots (software agents) perform simple, repetitive tasks.  This description suggests that the salient features of the information factory[1] include linear organization, workstations, pallets, and finely-grained division of labor.

How does this architecture differ from traditional approaches?  Here are a few tentative observations.

  • No central database. All information associated with a transaction is carried along the line on a pallet.  Information on a pallet is the only input and the only output for each workstation, and the workstation has no state information except for log records that capture the work performed.  In essence, there is a small database for each transaction that is carried along the line on a pallet.  In keeping with the house metaphor, information on the pallet is stored hierarchically.  (More thoughts about databases here.)
  • Separation of work-in-progress and completed work. Just like an assembly line in a factory, work-in-progress exists in temporary storage along the line and then leaves the line when completed.
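To make these two observations concrete, here is a minimal sketch in Python.  It is not Shinsei's implementation; the names (`Pallet`, `validate_amount`, `compute_fee`, `run_line`) and the fee rule are hypothetical and chosen only for illustration.  The pallet is modeled as a nested dictionary (a small hierarchical database per transaction), each workstation is a function whose only input and output is a pallet, and completed pallets leave the line for a separate archive.

```python
import copy
from typing import Any, Callable, Dict, List

# Hypothetical model: a pallet is a small hierarchical record for ONE transaction.
Pallet = Dict[str, Any]
Workstation = Callable[[Pallet], Pallet]


def validate_amount(pallet: Pallet) -> Pallet:
    """Workstation: reads only the pallet, writes only a new pallet."""
    out = copy.deepcopy(pallet)
    out["checks"] = {"amount_ok": out["request"]["amount_cents"] > 0}
    return out


def compute_fee(pallet: Pallet) -> Pallet:
    """Workstation: adds a fee branch (1% of the amount, an invented rule)."""
    out = copy.deepcopy(pallet)
    out["fee"] = {"value_cents": out["request"]["amount_cents"] // 100}
    return out


def run_line(stations: List[Workstation], pallet: Pallet,
             archive: List[Pallet]) -> Pallet:
    """Move one pallet down the line, then archive it as completed work."""
    for station in stations:
        pallet = station(pallet)   # work-in-progress lives only on the pallet
    archive.append(pallet)         # completed work leaves the line
    return pallet


archive: List[Pallet] = []
incoming = {"id": "txn-1", "request": {"amount_cents": 20_000}}
finished = run_line([validate_amount, compute_fee], incoming, archive)
```

Because no workstation touches anything except the pallet it is handed, the incoming pallet is left unchanged and the archive is the only place where completed transactions accumulate.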

In order to make the system robust, Jay adheres to the following design rules.

  • Information travels in its context. Since workstations have no state, the only ways to ensure that appropriate actions are taken at each workstation are to either (a) have separate lines for transactions requiring different handling or (b) have each pallet carry all context required to determine the appropriate actions to take at each workstation.  The first approach is not robust, because errors will occur if pallets are misrouted or lines are reconfigured incorrectly, and these errors may be difficult to detect.  Thus, all pallets carry information embedded in sufficient context to figure out what actions should be taken (and not taken).
  • All workstations are reversible. In order to repair problems easily, pallets can be backed up along the line and re-processed when problems are detected.  This requires that all workstations log enough information to undo any actions that they perform; that is, they must be able to reproduce their input given their output.  These logs are the only state information maintained by the workstations.
  • Physical separation. In order to constrain interdependencies between workstations and facilitate verification, monitoring, isolation, and interposition of other workstations, workstations are physically separated from each other.  More on this idea here.
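The first two rules can be sketched as follows.  This is an illustration under assumptions, not a description of Jay's system: the class `ReversibleWorkstation`, the field names, and the fee rates are all hypothetical.  The workstation decides what to do purely from context carried on the pallet, and it logs the previous value of the one field it writes, so that its input can be reproduced from its output.

```python
import copy
from typing import Any, Callable, Dict

Pallet = Dict[str, Any]
_MISSING = object()  # marks "field was absent before this workstation ran"


class ReversibleWorkstation:
    """Writes one top-level pallet field and logs enough to undo the write."""

    def __init__(self, field: str, compute: Callable[[Pallet], Any]) -> None:
        self.field = field
        self.compute = compute
        # The log is the workstation's only state: pallet id -> previous value.
        self.log: Dict[str, Any] = {}

    def process(self, pallet: Pallet) -> Pallet:
        out = copy.deepcopy(pallet)
        previous = out.get(self.field, _MISSING)
        out[self.field] = self.compute(pallet)
        self.log[pallet["id"]] = previous
        return out

    def undo(self, pallet: Pallet) -> Pallet:
        """Reproduce the input pallet from the output pallet."""
        if pallet["id"] not in self.log:
            raise ValueError("no log record for this pallet")
        previous = self.log.pop(pallet["id"])
        out = copy.deepcopy(pallet)
        if previous is _MISSING:
            del out[self.field]
        else:
            out[self.field] = previous
        return out


# Invented rates in basis points; the pallet's own context selects the rate.
RATES_BPS = {"domestic": 100, "international": 300}


def fee_from_context(pallet: Pallet) -> int:
    bps = RATES_BPS[pallet["context"]["transfer_type"]]
    return pallet["amount_cents"] * bps // 10_000


fee_station = ReversibleWorkstation("fee_cents", fee_from_context)
```

The sketch assumes that a pallet passes through a given workstation at most once before being undone; a real log would presumably also record timestamps and the work performed.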

The following diagram depicts the structure of an information assembly line.  The line performs six tasks, labeled a through f.  The red arrows indicate logical interdependencies.  The output of a workstation is fully determined by the output of the preceding workstation, so the dependency structure resembles that of a Markov chain.  Information about a transaction in progress travels along the line, and completed transactions are archived for audit or analysis in a database at the end of the line.  Line behavior can be monitored by testing the output of one or more workstations.


Information assembly line
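As a rough sketch of the structure just described, the following Python models a line of six tasks labeled a through f, where each station's output depends only on the pallet handed over by the preceding station, and monitors test the output of selected workstations.  The helper names (`make_task`, `run_with_monitors`) and the "trail" field are hypothetical stand-ins for real work.

```python
from typing import Any, Callable, Dict, List, Tuple

Pallet = Dict[str, Any]
Workstation = Callable[[Pallet], Pallet]


def make_task(label: str) -> Workstation:
    """Build a placeholder task that records its label on the pallet."""
    def task(pallet: Pallet) -> Pallet:
        # Output is determined entirely by the incoming pallet.
        return {**pallet, "trail": pallet["trail"] + [label]}
    return task


LINE: List[Tuple[str, Workstation]] = [(label, make_task(label)) for label in "abcdef"]


def run_with_monitors(
    pallet: Pallet,
    line: List[Tuple[str, Workstation]],
    monitors: Dict[str, Callable[[Pallet], bool]],
) -> Tuple[Pallet, List[str]]:
    """Run the line; return the final pallet and labels whose monitor failed."""
    alerts: List[str] = []
    for label, station in line:
        pallet = station(pallet)
        check = monitors.get(label)
        if check is not None and not check(pallet):
            alerts.append(label)
    return pallet, alerts


# Monitor only the output of workstation c.
monitors = {"c": lambda p: p["trail"] == ["a", "b", "c"]}
final, alerts = run_with_monitors({"id": "txn-1", "trail": []}, LINE, monitors)
```

Monitors here only observe; they never modify the pallet, so adding or removing one does not change the line's behavior.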

By contrast, here is a representation of a system designed according to the traditional centralized database architecture.  The system has modules that operate on the database to perform the same six tasks.  Although the logical interdependency structure is the same in theory, the shared database means that every module depends on every other module: if one module accidentally overwrites the database, the behavior of every other module will be affected.  Moreover, all transactions are interdependent through the database as well.  It’s difficult to verify that the system is functioning properly, since database operations by all six modules are interleaved.

Traditional system architecture with centralized database


Clearly, the information assembly line architecture requires more infrastructure than the traditional database approach: at a minimum, we need tools for constructing pallets and moving them between workstations, as well as a framework for building and provisioning workstations.  We also need to engineer the flow of information so that the output can be computed using a linear sequence of stateless workstations.  There are at least two reasons why this extra effort may be justified.  At this stage, these are just vague hypotheses; in future posts, I’ll try to sharpen them and provide theoretical support in the form of more careful and precise analysis.

First, the linear structure facilitates error detection and recovery.  Since each workstation performs a simple task on a single transaction and has no internal state, detecting an error is much simpler than in the traditional architecture.  The sparse interdependency matrix limits the propagation of errors, and reversibility facilitates recovery.  For critical operations, it is relatively easy to prevent errors by using parallel tracks and checking that the output matches (more on reliable systems from unreliable components here).
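The parallel-track idea can be sketched as a wrapper that sends a copy of the pallet down two independently written tracks and refuses to pass anything on unless the outputs match.  The names (`redundant`, `TrackMismatch`) and the interest calculation are hypothetical, chosen only to give the two tracks something to agree on.

```python
import copy
from typing import Any, Callable, Dict

Pallet = Dict[str, Any]
Workstation = Callable[[Pallet], Pallet]


class TrackMismatch(Exception):
    """Raised when the two parallel tracks produce different pallets."""


def redundant(track_a: Workstation, track_b: Workstation) -> Workstation:
    """Combine two tracks into one workstation that checks their agreement."""
    def station(pallet: Pallet) -> Pallet:
        out_a = track_a(copy.deepcopy(pallet))
        out_b = track_b(copy.deepcopy(pallet))
        if out_a != out_b:
            raise TrackMismatch(f"tracks disagree on pallet {pallet.get('id')!r}")
        return out_a
    return station


def interest_direct(pallet: Pallet) -> Pallet:
    """Track A: one multiplication, then floor division by 10,000."""
    cents = pallet["principal_cents"] * pallet["rate_bps"] // 10_000
    return {**pallet, "interest_cents": cents}


def interest_split(pallet: Pallet) -> Pallet:
    """Track B: same result, computed from quotient and remainder separately."""
    whole, rest = divmod(pallet["principal_cents"], 10_000)
    cents = whole * pallet["rate_bps"] + rest * pallet["rate_bps"] // 10_000
    return {**pallet, "interest_cents": cents}


checked_interest = redundant(interest_direct, interest_split)
```

A mismatch stops the pallet at this workstation rather than letting a wrong value travel further down the line, which is the sense in which the check prevents rather than merely detects errors.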

Second, the architecture facilitates modification and reconfiguration.  In the traditional architecture, modifying a component requires determining which other components depend on it and how, analyzing the likely effects of the proposed modification, and integrating the new component into the system.  If the number of components is large, this may be extremely difficult.  By contrast, in the information assembly line, the interdependency matrix is relatively sparse, even if we include all downstream dependencies.  Perhaps more importantly, the modified component can easily be tested in parallel with the original component (see the figure below).  Thus, the change cost for the system should be much lower.


Parallel operation in an information assembly line
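One way to picture testing a modified component in parallel with the original is the following sketch.  It assumes a hypothetical wrapper, `in_parallel`, that feeds each pallet to both versions, lets only the original's output continue down the line, and records any disagreement for later review.  The two fee rules are invented examples.

```python
import copy
from typing import Any, Callable, Dict, List

Pallet = Dict[str, Any]
Workstation = Callable[[Pallet], Pallet]


def in_parallel(original: Workstation, candidate: Workstation,
                discrepancies: List[Dict[str, Any]]) -> Workstation:
    """Run candidate alongside original; only original's output moves on."""
    def station(pallet: Pallet) -> Pallet:
        live = original(copy.deepcopy(pallet))
        try:
            trial = candidate(copy.deepcopy(pallet))
        except Exception as exc:  # a failing candidate must not stop the line
            discrepancies.append({"input": copy.deepcopy(pallet), "error": repr(exc)})
        else:
            if trial != live:
                discrepancies.append({"input": copy.deepcopy(pallet),
                                      "live": copy.deepcopy(live),
                                      "trial": trial})
        return live
    return station


def flat_fee(pallet: Pallet) -> Pallet:
    """Original component: a flat fee for every transaction."""
    return {**pallet, "fee_cents": 500}


def tiered_fee(pallet: Pallet) -> Pallet:
    """Modified component under test: a discount for large transactions."""
    fee = 500 if pallet["amount_cents"] < 1_000_000 else 400
    return {**pallet, "fee_cents": fee}


discrepancies: List[Dict[str, Any]] = []
fee_station = in_parallel(flat_fee, tiered_fee, discrepancies)
```

Since the candidate sees real pallets but cannot affect what travels down the line, the recorded discrepancies show exactly where the modified behavior would differ before any switch is made.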

[1] A search for the term “information factories” reveals that others have been thinking along similar lines.  In their paper “Enterprise Computing Systems as Information Factories” (2006), Chandy, Tian and Zimmerman propose a similar perspective.  Although they focus on decision-making about IT investments, their concept of “stream applications” has some commonalities with the assembly-line-style organization proposed here.


3 thoughts on “Information assembly lines”

  1. MA

    Hi David, these architectural principles are widely used in SOA, and the pallet, which is used in manufacturing to move goods between workstations, is a good way of depicting the movement of information between stages. Maybe you could go through some reference material on BPM and SOA, which will give you some idea about architectural principles.

    One observation from some architects: these principles sound good, but without details they would find it extremely hard to apply any of them.

  2. PA

    Hello David,

    I am really amazed by your entries on this blog and I’m constantly reading your blog. However, I am a bit confused regarding the architecture of such a line.

    If we have a work-in-progress database, and a log database for each workstation, how do we enumerate the number of databases? How can we produce a mutually-verifying dualism environment? I am really getting confused about the data structure for each of those databases.

    If, for example, we have two separate databases; one for WIP, and one for finished goods, where WIP database carries all the information required to process a specific item on the next workstation, then what does the database of each station do? Does it only log the failures? If yes, how can we use mutually-verifying dualism in this context?

    I am really looking forward to your explanation on this


    1. David James Brunner

      PA, thanks for your kind message. I’m not sure I fully understand your question, but workstations should not store information. They just perform work, and should be empty at the beginning and end of each cycle. Information is stored in datastores designed for that purpose.

