Data warehousing vs. information assembly lines

This post is part of my collaborative research with Shinsei Bank on highly-evolvable enterprise architectures.  It is licensed under the Creative Commons Attribution-ShareAlike 3.0 license.  I am indebted to Jay Dvivedi and his team at Shinsei Bank for sharing with me the ideas developed here.  All errors are my own.

Reporting and analysis are important functions for enterprise software systems.  Information assembly lines handle these functions very differently from data warehousing, and I think the contrast may help clarify the differences between Jay’s approach and traditional design philosophies.  In brief, data warehousing attempts to build a massive library where all possibly useful information about a company is readily available–the ideal environment for a business analyst.  By contrast, information assembly lines manufacture reports and analyses “just-in-time” to meet specific business needs.  One might say that data warehousing integrates first and asks questions later, while information assembly lines do just the opposite.

The Wikipedia article on data warehouses indicates the emphasis on comprehensive data integration.  Data warehouses seek a “common data model for all data of interest regardless of the data’s source”.  “Prior to loading data into the data warehouse, inconsistencies are identified and resolved”.  Indeed, “much of the work in implementing a data warehouse is devoted to making similar meaning data consistent when they are stored in the data warehouse”.

There are at least two big problems with the data warehousing approach.

First, since data warehousing integrates first and asks questions later, much of the painstakingly integrated data may not be used.  Or they may be used, but not in ways that generate sufficient value to justify the cost of providing the data.  The “build a massive library” approach actually rules out granular investment decisions based on the return from generating specific reports and analyses.  To make matters worse, since inconsistencies may exist between any pair of data sources, the work required to identify and resolve inconsistencies will likely increase with the square of the number of data sources.  That sets off some alarm bells in a computer scientist’s brain: in a large enterprise, data warehousing projects may never terminate (successfully, that is).
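To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It assumes the worst case, in which every pair of data sources has to be checked against each other once; the function name is purely illustrative.

```python
from itertools import combinations


def pairwise_checks(num_sources: int) -> int:
    """Count the distinct pairs of sources: n * (n - 1) / 2."""
    return num_sources * (num_sources - 1) // 2


# Cross-check the closed form against an explicit enumeration of pairs.
assert pairwise_checks(10) == len(list(combinations(range(10), 2)))

for n in (5, 10, 50, 100):
    print(f"{n} sources -> {pairwise_checks(n)} pairs to reconcile")
```

Going from 10 sources to 100 multiplies the number of sources by ten but the number of pairs by more than a hundred.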

Second, data warehousing ignores the relationship between the way data are represented and the way they are used.  I was introduced to this problem in my course on knowledge-based systems at MIT, where Professor Randall Davis emphasized the importance of choosing knowledge representations appropriate to the task at hand.  Predicate logic may be a great representation for reasoning about mathematical conjectures, but it may prove horribly cumbersome or even practically unusable for tasks such as finding shortest paths or detecting boundaries in images.  According to Davis and his colleagues, awareness of the work to be done (or, as Jay would say, the context) can help address the problem: “While the representation we select will have inevitable consequences for how we see and reason about the world, we can at least select it consciously and carefully, trying to find a pair of glasses appropriate for the task at hand.”1

The problem, then, is that data warehousing does not recognize the importance of tuning data representations to the task at hand, and thus attempts to squeeze everything into a single “common data model”.  Representations appropriate for analyzing user behavior on web sites may be poorly suited to searching for evidence of fraud or evaluating possible approaches to customer segmentation.  Consequently, data warehousing initiatives risk expending considerable resources to create a virtual jack-of-all-trades that truly satisfies no one.

The information assembly lines approach focuses on manufacturing products (reports or analyses) to satisfy the needs of specific customers.  In response to a need, lines are constructed to pull the data from where they live, machine the data as necessary, and assemble the components.  Lines are engineered, configured, and provisioned to manufacture specific products or product families, so every task can be designed in with an awareness of how it serves the product’s intended purpose.
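As a rough illustration of the pull, machine, and assemble steps, here is a minimal sketch of a line built from small workstations. It is not drawn from Shinsei's systems: the workstation names, the record fields, and the sample data are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Workstation = Callable[[List[Record]], List[Record]]


def build_line(*stations: Workstation) -> Workstation:
    """Chain workstations so each one's output feeds the next."""
    def run(records: List[Record]) -> List[Record]:
        for station in stations:
            records = station(records)
        return records
    return run


def pull_loans(_: List[Record]) -> List[Record]:
    # Hypothetical source: pull raw records from where they live.
    return [
        {"branch": "Tokyo", "amount_jpy": "1200"},
        {"branch": "Osaka", "amount_jpy": "800"},
        {"branch": "Tokyo", "amount_jpy": "500"},
    ]


def parse_amounts(records: List[Record]) -> List[Record]:
    # Machine the data: convert text amounts to integers.
    return [{**r, "amount_jpy": int(r["amount_jpy"])} for r in records]


def total_by_branch(records: List[Record]) -> List[Record]:
    # Assemble the product: one total per branch, sorted by branch name.
    totals: Dict[str, int] = {}
    for r in records:
        totals[r["branch"]] = totals.get(r["branch"], 0) + r["amount_jpy"]
    return [{"branch": b, "total_jpy": t} for b, t in sorted(totals.items())]


# One line, built to manufacture one specific report.
branch_report_line = build_line(pull_loans, parse_amounts, total_by_branch)
report = branch_report_line([])
print(report)
```

Each workstation does one simple job, so a station such as `parse_amounts` could be dropped into a different line that needs the same conversion.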

If the data required for business decision-making change very slowly over time and the conceivable uses for the data are relatively stable and homogeneous, then perhaps developing a unified data model and building a data warehouse may make sense.  Needless to say, however, these are not the conditions faced by most enterprises: the data environment evolves rapidly, and different parts of the business require varied and ever-changing reporting and analysis capabilities.

It’s actually kind of hard to see why anyone (other than system vendors) would choose the data warehousing approach.  Information assembly lines are modular, so they can be constructed one at a time, with each line solving a specific problem.  Performance criteria are well-defined: do the products rolling off the line match the design?  Since information assembly lines decompose work into many simple, routine tasks, tools developed for use on one line (a machine that translates data from one format to another, for example) will likely be reusable on other lines.  Thus the time and cost to get a line up and running will decrease over time.

1 Davis, Shrobe & Szolovits, “What is a Knowledge Representation?”  This paper provides an insightful introduction to the problem.


9 thoughts on “Data warehousing vs. information assembly lines”

  1. MA

    Hi David, I think you should visit a few more banks to compare what is bookish with what actually happens.

    One issue I can see is that you do not have practical experience, so describing or comparing with what is written in books may not be the right approach.

    I do not think what you are describing and generalising is implemented anywhere; it exists only in books, or maybe with the one or two people you are interacting with at Shinsei. You talk about frequency: could you give me an example where financial data in a bank does not change for more than a month? You will find there is almost none.

    “data required for business decision-making change very slowly over time”

    Secondly, the same data could be used in multiple ways in an organisation depending on context. Referring to your comment:
    “data warehousing ignores the relationship between the way data are represented and the way they are used”
    If you keep building different data sets based on end use, the organisation will be dead (just a metaphor), meaning that it will become a herculean task just to build and maintain them.

  2. MA

    Hi David, I am just curious. You mention “I am indebted to Jay Dvivedi and his team at Shinsei Bank for sharing with me the ideas developed here”.

    I am sharing a link which states that Dhananjaya Dvivedi already retired on 23rd June 2010, referred to below.

    So it seems a bit contradictory, and you may want to check and rectify your opening statement / disclaimer.

  3. Zahoor ul Islam

    Hi David,

    To the best of my understanding of Jay's systems, the data reside at every workstation of the operational assembly line, and they form the basic data units of the information/knowledge base in Jay's architecture. Jay never pursues a centralized data model and always picks and consolidates the required report or information on demand.
    So I just wonder: isn't this simply a different approach to building a data warehouse, where you make yourself demand-driven and end up with efficient, skinny data marts, rather than taking a big-bang approach that consolidates everything first and compromises on information loss?
    Moreover, simple reports or data can be quickly assembled using data from these workstations without building any information assembly line. Information assembly lines, if I understood correctly, are synonymous with the routines of a data warehouse system that daily collect the data, translate and transform them, and update the data cubes, which can then be used in analysis. An information assembly line comprises components that do the same but on a smaller scale, are loosely coupled, and can be managed, killed, or changed easily. However, attaching the words “assembly line” gives the impression of something sequential being done to generate the product, whereas in essence you are preparing the ingredients for the analyst, who applies his recipe to cook the meal. So wouldn't it suffice to speak of information assembly components, and the assembly of those into meaningful reports, without attaching the word “line”?
    So I really wonder: is this just the same distinction as the big-bang versus path-based approach (David Upton), or does it need to be termed separately, as data warehouse versus Jay's approach?


  4. David James Brunner

    Zahoor, the choice of the label “information assembly line” is intentional, because the assembly process is decomposed into a sequence of simple operations that are performed at workstations, much like an assembly line in a factory.

  5. Zahoor ul Islam

    I think “assembly line” is more appropriate for operational processes. An “information assembly line” is, to me, more like a mechanism where you collect and consolidate the data available at the workstations, while no changes take place at those workstations during the process. In such a case the process or its sequence may be irrelevant or less important, as long as it is composed of the desired elements and fits the required output.

    Usually “assembly line” is used for operations of coarse granularity. A process where fine elements are involved we usually call a mechanism or, collectively, a single process.

  6. Zahoor

    David, I do understand the architecture; I was just confused about how you use it in the context of data warehousing. In the abstract, everything is composed of, or decomposed into, fine elements which are then re-conceptualized as a sequence of steps; simple or complex is rather a relative term.

    Anyway, I will keep following your blog and see how these concepts evolve.

  7. MA

    Hi David, I think Zahoor is right: in the abstract, everything is composed of and decomposed into fine elements. What is also important is the ground reality. If the latter is irrelevant, then why is this not followed anywhere else? Or, if it is followed everywhere, then what is unique about it, and why do we want to talk about it and listen to it?
