Data warehousing vs. information assembly lines

This post is part of my collaborative research with Shinsei Bank on highly-evolvable enterprise architectures.  It is licensed under the Creative Commons Attribution-ShareAlike 3.0 license.  I am indebted to Jay Dvivedi and his team at Shinsei Bank for sharing with me the ideas developed here.  All errors are my own.

Reporting and analysis are important functions for enterprise software systems.  Information assembly lines handle these functions very differently from data warehousing, and I think the contrast may help clarify the differences between Jay’s approach and traditional design philosophies.  In brief, data warehousing attempts to build a massive library where all possibly useful information about a company is readily available: the ideal environment for a business analyst.  By contrast, information assembly lines manufacture reports and analyses “just-in-time” to meet specific business needs.  One might say that data warehousing integrates first and asks questions later, while information assembly lines do just the opposite.

The Wikipedia article on data warehouses indicates the emphasis on comprehensive data integration.  Data warehouses seek a “common data model for all data of interest regardless of the data’s source”.  “Prior to loading data into the data warehouse, inconsistencies are identified and resolved”.  Indeed, “much of the work in implementing a data warehouse is devoted to making similar meaning data consistent when they are stored in the data warehouse”.

There are at least two big problems with the data warehousing approach.

First, since data warehousing integrates first and asks questions later, much of the painstakingly integrated data may not be used.  Or they may be used, but not in ways that generate sufficient value to justify the cost of providing the data.  The “build a massive library” approach actually rules out granular investment decisions based on the return from generating specific reports and analyses.  To make matters worse, since inconsistencies may exist between any pair of data sources, the work required to identify and resolve inconsistencies will likely increase with the square of the number of data sources.  That sets off some alarm bells in a computer scientist’s brain: in a large enterprise, data warehousing projects may never terminate (successfully, that is).
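
To make the quadratic growth concrete: if any pair of sources may disagree, then n sources imply up to n(n-1)/2 pairs to check.  Here is a minimal sketch of that arithmetic.  The source names are hypothetical, and the assumption that every pair needs checking is the worst case, not a measurement from any real project.

```python
from itertools import combinations


def pairwise_reconciliations(num_sources: int) -> int:
    """Number of distinct source pairs that might need reconciling."""
    return num_sources * (num_sources - 1) // 2


# Hypothetical source names, used only to cross-check the formula
# against an explicit enumeration of pairs.
sources = ["loans", "deposits", "cards", "crm", "web_logs"]
assert len(list(combinations(sources, 2))) == pairwise_reconciliations(len(sources))

for n in (5, 10, 50, 100):
    print(n, pairwise_reconciliations(n))
```

Going from 10 sources to 100 multiplies the number of sources by ten but the number of pairs by more than a hundred (45 versus 4,950).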

Second, data warehousing ignores the relationship between the way data are represented and the way they are used.  I was introduced to this problem in my course on knowledge-based systems at MIT, where Professor Randall Davis emphasized the importance of choosing knowledge representations appropriate to the task at hand.  Predicate logic may be a great representation for reasoning about mathematical conjectures, but it may prove horribly cumbersome or even practically unusable for tasks such as finding shortest paths or detecting boundaries in images.  According to Davis and his colleagues, awareness of the work to be done (or, as Jay would say, the context) can help address the problem: “While the representation we select will have inevitable consequences for how we see and reason about the world, we can at least select it consciously and carefully, trying to find a pair of glasses appropriate for the task at hand.”1

The problem, then, is that data warehousing does not recognize the importance of tuning data representations to the task at hand, and thus attempts to squeeze everything into a single “common data model”.  Representations appropriate for analyzing user behavior on web sites may be poorly suited to searching for evidence of fraud or evaluating possible approaches to customer segmentation.  Consequently, data warehousing initiatives risk expending considerable resources to create a virtual jack-of-all-trades that truly satisfies no one.

The information assembly lines approach focuses on manufacturing products (reports or analyses) to satisfy the needs of specific customers.  In response to a need, lines are constructed to pull the data from where they live, machine the data as necessary, and assemble the components.  Lines are engineered, configured, and provisioned to manufacture specific products or product families, so every task can be designed in with an awareness of how it serves the product’s intended purpose.
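
The pull, machine, and assemble steps might be sketched as follows.  This is my own illustration, not Shinsei's implementation: the function names, the record layout, and the sample figures are all hypothetical, and a real line would pull from the live source system instead of an in-memory list.

```python
def build_line(source, *stations):
    """Chain a source and stations so each output feeds the next station."""
    def run():
        records = source()
        for station in stations:
            records = station(records)
        return records
    return run


def pull_loans():
    # Stand-in for reading from the system where the data live (made-up rows).
    return [
        {"branch": "Tokyo", "amount_yen": 1_200_000},
        {"branch": "Osaka", "amount_yen": 800_000},
        {"branch": "Tokyo", "amount_yen": 500_000},
    ]


def to_thousands(records):
    # Machine the data: convert yen to thousands of yen for this report only.
    return [
        {"branch": r["branch"], "amount_k": r["amount_yen"] // 1000}
        for r in records
    ]


def total_by_branch(records):
    # Assemble the components into the finished product.
    totals = {}
    for r in records:
        totals[r["branch"]] = totals.get(r["branch"], 0) + r["amount_k"]
    return [{"branch": b, "total_k": t} for b, t in sorted(totals.items())]


# One line, built for one product: a loan total per branch.
branch_report_line = build_line(pull_loans, to_thousands, total_by_branch)
report = branch_report_line()
print(report)
```

Each station is shaped by the one report it serves, so the unit conversion and the grouping key are chosen for this product and need not suit any other.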

If the data required for business decision-making change very slowly over time and the conceivable uses for the data are relatively stable and homogeneous, then developing a unified data model and building a data warehouse may make sense.  Needless to say, however, these are not the conditions faced by most enterprises: the data environment evolves rapidly, and different parts of the business require varied and ever-changing reporting and analysis capabilities.

It’s actually kind of hard to see why anyone (other than system vendors) would choose the data warehousing approach.  Information assembly lines are modular, so they can be constructed one at a time, with each line solving a specific problem.  Performance criteria are well-defined: do the products rolling off the line match the design?  Since information assembly lines decompose work into many simple, routine tasks, tools developed for use on one line (a machine that translates data from one format to another, for example) will likely be reusable on other lines.  Thus the time and cost to get a line up and running will decrease over time.
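
As a rough illustration of tool reuse, here is one small "machine" (a field renamer) serving two unrelated lines.  Everything in it is hypothetical: the field names, the two reports, and the helpers `make_renamer` and `run_line` are invented for the sketch.

```python
def make_renamer(mapping):
    """Build a reusable machine that renames record fields per `mapping`."""
    def rename(records):
        return [{mapping.get(k, k): v for k, v in r.items()} for r in records]
    return rename


def run_line(records, *machines):
    """Pass records through each machine in order."""
    for machine in machines:
        records = machine(records)
    return records


# Line 1 (hypothetical): loan report fed by a system with terse field names.
loan_rows = run_line(
    [{"br": "Tokyo", "amt": 5}],
    make_renamer({"br": "branch", "amt": "amount"}),
)

# Line 2 (hypothetical): card report; same machine, different configuration.
card_rows = run_line(
    [{"cust_id": 7, "amt": 3}],
    make_renamer({"cust_id": "customer", "amt": "amount"}),
)

print(loan_rows)
print(card_rows)
```

The second line costs only a new mapping, which is the sense in which each additional line should get cheaper to build.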

1 Davis, Shrobe & Szolovits, “What is a Knowledge Representation?”.  This paper provides an insightful introduction to the problem.