| The Data Vault got its start through a study of "wet technology", that is: the place where natural-world models (like a neural model of the brain) are used in the electronic or information world. Throughout this study, the Data Vault maintained its focus on scalability, dynamic change, and repeatability/redundancy. This is, of course, from a technical standpoint. From a business standpoint, the model / architecture brings together the notions of business modeling with the actual physical data or information modeling. In other words, it converges form and function in an effort to meet business needs in a more timely fashion. For business descriptions, applications, and the whys and wherefores, please read the White Paper section and look through the PowerPoint presentations. If you are interested in the real nature of the Data Vault, are a scientist, or are simply curious, then read on.

Items in this page:
The Neural Network - Data Vault Beginnings
Data Mining and the Data Vault
The human Brain, a re-take.
Flexibility, Scalability, and Repeatability
Dynamic Data Warehousing - Dynamic Model Changes
Nanotechnology and the Data Vault

The Neural Network - Data Vault Beginnings

The Data Vault is an architecture that leads to integrated enterprise design. It is a formalized approach to data modeling that converges data and business models back together again. The model, or architecture, chosen as the basis for the design was the human brain. Since our understanding of the human brain is extremely limited, the definitions and functionality of the model were simplified for ease of use and application. Below is a picture of a neuron with its dendrites and synapses. I beg the pardon of any professional who has studied the brain and all of its complexity - below is an oversimplification of how we can utilize this modeling paradigm in our electronic data world. I welcome any corrections or thoughts that you would wish to share with me.
Neurons exist all over the human body and are specialized according to their DNA for specific purposes. Some end up as nerve endings, others as gray matter. In this particular case, the mind model of a neuron was chosen as a highly flexible, yet extensible, model from which to build. It has also been said that humans use no more than 10% of the total capacity of the mind, which leads to beliefs that the model can stretch quite a bit further than any imaginative mechanism which we can create to house information. The picture on the left is a simplistic drawing of a pyramidal neuron, from which we will build our assumptions and physical data housing.

Let's assume for a moment that the brain might work as follows: We have a center for information keys (the middle of the neuron cell). These keys are surrounded by contextual information that describes the key, and by alternative data - other ways of "seeing" the key and understanding its importance. However, all the data contained within the neuron is about the key, just the key and nothing but the key. Let's also assume that the context for the key is time-sensitive (the secondary key includes event sequencing). Further, let's assume that the context also has an importance rating (of course, importance is in the eye of the beholder, so don't jump on to the "truth" wagon). Finally, the neuron is connected to other neurons through dendrites and synapses, basically message senders and receivers. Again, let's assume that these neuron connections are relative to a) distance, b) size (importance/relevance), and c) likelihood of occurrence, and that these connections represent an interaction between two or more keys (other neurons).

Gathering this all up from a business context, it may end up looking like this: Lists of keys that represent the same type of thing (holding the same semantic meaning and the same grain of information) are housed as chief access points - Hubs (if you will). These lists of keys inhabit the center of the neuron.
They provide mechanisms for accessing the information housed (i.e., memories) within the actual node or context. Context has a tendency to change over time. Our memories might get "fuzzy", or we interpret what happened differently than someone else who saw the same thing at the same time. In this manner, the context is all about the one specific key, and is only one "level" deep away from the key. In the Data Vault model we call those Satellites (time-based context containers that describe the key).

Think about it: if I give you a date, 1 January 1976, what memories does it trigger? A wedding anniversary, a birthday party, or something else? The date by itself is a key, but with a much broader context. When you read the words "wedding anniversary", what did you think about? Did you ask the questions: Which one? What year? You were searching for context to a key memory (if you have been or are married). Now, if I describe the day: sun shining, blue sky, park, romantic Italian catered lunch, not only are you picturing your wedding anniversary, but you've created or recalled context for what the day looked like.

Finally, I begin to describe who was there: Steve and Jane, Bob and Jill, Joe and Jennifer. Or I say: you drove your white El Camino instead of the red Porsche. I've invoked an interaction between two "key" elements. The interaction, or intersection of the information, is what is known as the Link (within the Data Vault) and is as granular as the number of connections to key elements permits. Think dendrite and synapse - not from the function point of view, just from what they represent - all the connections to other neurons. Now, some connections are stronger than others, and some mean more than others - they have a relevancy rating, or a confidence rating, or an importance rating. In this light, we begin to introduce function on top of the form (structure).
Data Mining and the Data Vault

Just what do we mean by Data Mining and the Neural Network model? Well, going back to the neural network analogy, we have to think about form vs. function. Should it really be form versus function? No! Form works with function in the natural world in order to achieve specific results. Different forms have different purposes, just like different neurons in the human body serve different purposes. The function of the neuron is always up for debate; however, let's oversimplify again and fly on a wing and a prayer.

Fact: the brain has the capacity to form and break neuron connections all the time. Some speculation has led to statements that this is what we call "learning, adapting, and storing." It is quite possible that this might even be considered reasoning. The assumptions start here. Let's assume that this is indeed the way the brain functions. How can we "model" the thinking aspect within our Data Vault (neural form)?

One mechanism is to take the processing and embed it in each node or each neuron. That requires advancements in nanotechnology and a better understanding of DNA computing. Imagine independent data nodes roaming around, forming connections (affiliations/links) with other nodes (hubs and satellites) depending on the context. Of course, the DNA computing algorithms are short, sweet, and simple - but they understand how to reach a consensus of "importance" or confidence levels to mark the links with.

The other mechanism is to adapt well-known and well-adopted techniques, using a data mining engine with its heuristics and neural-net-based algorithms to score the model. Not only is the content itself mined, but the structure can be mined in context for unknown associations. Known associations or "designed" associations can be scored and weighted according to the confidence levels that the data mining engine outputs. Now we have dynamic structure and useful information (at least that's the hope).
It's a very tiny step forward in evolutionary thinking.

The human brain, a re-take.

There is a hypothesis that the brain operates in two basic states. The first state is "awake", the second state is "asleep". In other words, the awake state deals with all the sensory input during the day, and allows forethought and conditioned responses to immediate stimuli. The second state, asleep, takes all the sensory input and works at attaching it to the memories, long-term memory, and known thought processes. In other words, it works at night to shape our knowledge, feelings, and reactions over time.
Sound familiar? It should. It's very similar to Active Data Warehousing, or an ODS with an integrated historical data store. Let's take the leap of faith. We design systems that model our own knowledge, behavior, and flaws. Being human, we also build bugs into these systems. However, if we stop for a minute and look at active data warehousing (or near-real-time data feeds), it could be considered much like reacting to immediate "awake" sensory input. Only we receive multiple sensory channels at the same time, sometimes with a lot of stimuli, and sometimes with not very much.
In this state our data warehouse needs to interrogate (quickly) the arriving information and figure out where to put it within the massive historical data store. Or we could put it in the operational layer or current day's mix (sometimes referred to as the ODS). If we put the data within the ODS, that would give us an opportunity at night to process it into the historical data, or data warehouse. It sounds about right - only there are a couple of problems in this model that give rise to new models.
If we go back to the brain, only one type of brain structure is known today: neurons, synapses, and dendrites - to house memories, thought processes, and other things. There are certainly different parts of the brain with different functions, but it appears to be all the same architecture. It doesn't matter if it's immediate stimuli or historical memories, it's all the same architecture.
That said, living with an ODS in 3rd Normal Form, and a data warehouse as a band-aided or adapted solution, just won't work. We need a better solution to put these inflows of data back together again. That's where the Data Vault picks up.
Hubs

Let's take a look at a poor man's definition of a neuron: within the neuron, let's assume that a trigger or key piece of information is required to find a neuron - in other words: your birthday. As soon as I mention birthday, you immediately begin to find your most recent one, and then scan your history of birthdays - only to follow with the questions: What about my birthday? Which birthday do you want to talk about? Provide me some context. The Hub entity is much like the key for, let's say, a single neuron. It provides the key to unlocking additional information.
Links

Now let's say I provide context: 2 years ago - now I've provided a time for your birthday, and I make a mention of 2 of the people who were at your birthday. In this case, as soon as I mention their names, you immediately find two things: the people, and your association to those people AT THAT TIME. This is said to be activity across dendrites and synapses. I will crudely equate this to the Link Entity, making associations between Hub Keys.
Satellites

Now you say, OK - but I was wearing certain clothing, I felt great, the sun was out, and we all had a good time. And your friend says: "I didn't have a good time, I wasn't feeling well that day." Now I have descriptors about the event that occurred at that time. That is equivalent to the Satellites off the Hubs and the Link entities that describe the activity, otherwise known (very loosely) as context.
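To make the birthday analogy concrete, here is a minimal sketch of the three structures in Python. It is an illustration under stated assumptions, not a formal Data Vault definition: the class names, the `source` field (who reported the context), the `context_for` helper, and the sample date are all hypothetical and chosen only to mirror the story above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hub:
    """A key, just the key, and nothing but the key."""
    key: str


@dataclass(frozen=True)
class Link:
    """An association between two or more Hub keys."""
    hubs: tuple[Hub, ...]


@dataclass
class Satellite:
    """Time-based context describing exactly one Hub or one Link."""
    parent: Hub | Link
    as_of: date
    source: str              # assumption: who reported this context
    context: dict[str, str]


def context_for(parent: Hub | Link, satellites: list[Satellite]) -> list[Satellite]:
    """Return every Satellite attached to the given parent, oldest first."""
    matches = [s for s in satellites if s.parent == parent]
    return sorted(matches, key=lambda s: s.as_of)


# Hubs: the keys that trigger a memory.
you = Hub("you")
friend = Hub("friend")
birthday = Hub("birthday")

# Link: you and your friend were both at that birthday.
party = Link((you, friend, birthday))

# Satellites: two differing descriptions of the same event.
party_day = date(2004, 6, 12)  # hypothetical date standing in for "2 years ago"
satellites = [
    Satellite(party, party_day, "you", {"mood": "felt great", "weather": "sun was out"}),
    Satellite(party, party_day, "friend", {"mood": "wasn't feeling well"}),
]
```

Calling `context_for(party, satellites)` returns both accounts of the day, while `context_for(you, satellites)` returns an empty list, because in this sketch the descriptions hang off the association rather than off either person.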
Can we create a machine that is modeled after the brain? I think it may be possible; albeit crude, it may just be a first step into this area. Right, wrong, or indifferent, we can potentially begin this quest in earnest. Particularly if we apply data mining algorithms and results to the immediate incoming stimulus and balance that stimulus against our "known history" at the time of arrival, so we know where to put it, or whether it's white noise. And if we can run deeper mining algorithms off-line to ask new questions and learn new things during non-critical cycles.
Finally, what is necessary is as follows:

1. A scale-free ontology - which different levels of the Data Vault can provide.
2. Putting form back with function: data mining and the model underneath (such as the Data Vault) working together to come up with unique answers.
3. The ability to alter structures on the fly, then grade them by attaching a degree of competency or relevancy within the Links themselves - making the entire "structural storage" dynamic.

An example model of the Data Vault is below. This particular model is discussed on the Public Sector page.
Flexibility, Scalability, and Repeatability

Part of any "good" modeling architecture is its ability to scale. It is proposed that, because the basis for this model is a working neural model, we don't yet truly know its scalability limits. One thing MPP (massively parallel processor) environments have proved for certain over the past several years: near-linear scalability in hardware can quickly be crippled by the wrong data architecture under the covers. Information layout, cross-references, and distribution are absolutely vital to the success of multi-hundred-terabyte and petabyte systems. We assume (for these reasons) that the brain operates in an MPP-type environment. It receives and sends signals with eyes, ears, nose, throat, and nervous system - all at once, all in parallel - while it processes thoughts. This particular model (the Data Vault) excels in the MPP environment, and has room to grow as hardware continues its performance increases. However, when placed in a single-processor or limited SMP environment, volumes of data can quickly overpower the machine resources. This model can scale into the petabyte ranges (so far as we've seen).

There's also something to be said about the flexibility of this model. In keeping with the basis for formation of this model (the human brain), the model itself must be flexible enough to construct new linkages between information nodes (hubs with satellites) without losing historical context. Again, the brain can create and destroy synapses and dendrites as memories come and go, or become important, or as we learn new things. In this model, the link structures act as the flexible component for just this purpose.

About repeatability and redundancy: this is not your typical data model. It's not just another model off the old block; it's a change in thinking, a paradigm shift.
Yes, it utilizes existing modeling nomenclature and design for "core structures", but the manner in which it operates is arguably different - especially when combined with mining tools. That said, repeatability and redundancy are extremely important for this architecture. The brain could potentially have many copies of the same information, so that if part of the brain is damaged, the individual can retain their memories and individuality, and all their known worldly skills. Repeatability is embedded in the standard structures - apparently all neurons in the brain "look" the same: the core, the dendrites, and the synapses. Without a repeatable structure, scalability is compromised, along with the ability to succeed in utilizing the model to its full potential. If we build one part of our Enterprise Integration System, and it's a success, then why can't we build the rest in exactly the same way?

Dynamic Data Warehousing, Dynamic Model Changes

We believe that this IS the future of integrated historical data sets. There are many signs in the industry that point in this direction. Our definition of Dynamic Data Warehousing is: the ability not only to detect structural change, but to learn where to apply it, and whether it's relevant to apply. It starts with the assumption that an Active Data Warehouse has already been built, and goes from there using Neural Nets (Artificial Intelligence) and Data Mining.
The application of a dynamic learning process to structure is a unique twist in the future of Data Warehousing, or integrated historical data stores. Learning and adapting the structures will require a few things to be put in place:
1. A structure capable of being scale-free (like the Data Vault).
2. A structure capable of building and destroying relationships without losing historical data sets.
3. A process that understands the structure and its contents on the most basic of semantic levels.
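The second requirement above, building and destroying relationships without losing history, can be sketched in a few lines. The approach shown here (closing a relationship with an end date instead of deleting it) is an assumption chosen for illustration, not something this page prescribes; `LinkRecord`, `LinkStore`, and the sample keys are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LinkRecord:
    hub_keys: tuple[str, ...]
    valid_from: date
    valid_to: Optional[date] = None  # None means the relationship is still active


class LinkStore:
    """Keeps every relationship ever built; 'destroying' one only end-dates it."""

    def __init__(self) -> None:
        self.records: list[LinkRecord] = []

    def build(self, hub_keys: tuple[str, ...], on: date) -> LinkRecord:
        record = LinkRecord(tuple(hub_keys), on)
        self.records.append(record)
        return record

    def destroy(self, hub_keys: tuple[str, ...], on: date) -> None:
        # Close the open record rather than removing it, so history survives.
        for record in self.records:
            if record.hub_keys == tuple(hub_keys) and record.valid_to is None:
                record.valid_to = on

    def active(self, on: date) -> list[LinkRecord]:
        """Relationships that were in force on the given date."""
        return [
            r for r in self.records
            if r.valid_from <= on and (r.valid_to is None or on < r.valid_to)
        ]


store = LinkStore()
store.build(("customer-1", "account-9"), date(2020, 1, 1))
store.destroy(("customer-1", "account-9"), date(2021, 1, 1))
```

After the `destroy` call the store still holds one record, so a query for a date in 2020 finds the relationship while a query for a later date finds nothing.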
The learning and adaptation algorithms in their rudimentary form will probably produce the following levels of changes (each with a confidence score, as is standard behavior with the application of neural nets). In responding to a structural change, I expect four possible results:
1. Errors/Alerts - Stop all processes, and send alerts and emails to get an individual involved. This type of stop-process/stop-production will probably be because of low confidence levels (no way to guess where to put the new element).
2. Warning/Stop - Certain thresholds, when met, cause a warning and a stop. There is less urgency than with the error - the company probably has more time to respond.
3. Warning/Continue - Certain thresholds generate warning messages and email: at 60% to 80% confidence levels, the new element is attached to the model, or the new relationship formulated, and the data is loaded.
4. Notification/Continue - At 80% and above confidence levels, the new element or relationship is added to the structure/design, an email is sent, and the process continues. The neural net has figured out where to put the new data, built the structure, and loaded it.
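The four outcomes above amount to a threshold check on the confidence score, which can be sketched as follows. The 60% and 80% boundaries come from the list; the boundary between Errors/Alerts and Warning/Stop is not given, so the `stop_floor` parameter and its default of 0.40 are purely an assumption for illustration, as are the function and enum names.

```python
from enum import Enum


class Outcome(Enum):
    ERROR_ALERT = "Errors/Alerts"
    WARNING_STOP = "Warning/Stop"
    WARNING_CONTINUE = "Warning/Continue"
    NOTIFICATION_CONTINUE = "Notification/Continue"


def classify(confidence: float, stop_floor: float = 0.40) -> Outcome:
    """Map a confidence score in [0, 1] to one of the four outcomes.

    stop_floor is a hypothetical cut-off between Errors/Alerts and
    Warning/Stop; the text only fixes the 0.60 and 0.80 boundaries.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.80:
        return Outcome.NOTIFICATION_CONTINUE
    if confidence >= 0.60:
        return Outcome.WARNING_CONTINUE
    if confidence >= stop_floor:
        return Outcome.WARNING_STOP
    return Outcome.ERROR_ALERT


def should_load(outcome: Outcome) -> bool:
    """Only the two 'Continue' outcomes attach the change and load the data."""
    return outcome in (Outcome.WARNING_CONTINUE, Outcome.NOTIFICATION_CONTINUE)
```

For example, `classify(0.70)` yields Warning/Continue, so the new element would be attached and loaded with a warning, while `classify(0.10)` yields Errors/Alerts and the load would stop.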
Obviously we are a couple years off from this type of self-changing design, but it's coming. As Active Data Warehouses blur the lines between real-time, ODS, OLTP and EDW - the Dynamic Data Warehouse will be pushed as a back-room all-uptime operation, to absorb both structure and business rule changes dynamically. See: www.TDAN.com article titled: "Convergence: The Freight Train is Coming"
This is what I call Dynamic Data Warehousing. Remember that in order to reach this level, it is absolutely necessary that the historical nature of the existing data not be sacrificed, but that linkages can be created and destroyed, and elements can be added and deleted on the fly. This requires an architecture like the Data Vault.

Nanotechnology and the Data Vault

In this world we are rapidly approaching Nanotech. Whether we like it or not, it's on our radar, and there is no turning back. So what's happening in the technology arena that might make our machines smart? Is this even a possibility in the future? Is there something we can do to emulate the human brain?
There are many arguments for and against doing these things, and that's not what I'm here to discuss. I'm here to suggest that only baby steps can be, and have been, taken in the direction of making machines "smarter" than they already are. Let's take a look at some interesting facts...
Welcome to a discussion on Nano Technology, Nanohousing™, Nano-Warehousing and the Data Vault. I've written a couple of Nanohousing articles on the B-Eye Network that reference these subjects.
I am currently researching Nano-Technology's impact on Data Warehousing, and let me just say that it's going to be bigger than I am predicting. Not only will data volumes grow and devices shrink, but information and the function of the information will have to be wrapped within an architecture that is designed to act as Nano-Sized computational devices.
Today's nano-tech consists of the ability to produce products which repel stains (non-stain Dockers), avert denting, are stronger than steel and aircraft aluminum but much lighter and more flexible.
There's talk about self-assembly of the nano devices, which is already happening in certain labs around the world. Nano-tech is also driving "wet-technology", where the lines between natural (nature represented) models and man-made models and functions are blurring. We will no longer be able to tell if the fibers come from a plant or were man-made (for instance).
I tend to stand on the edge and state that the Data Vault has a place in the architecture of Nano-Machines. The ability to "store data" in a relative manner, where data or information attribution is a major factor, is very important in the clustering of these nano-machines. Getting these nano-machines programmed is another story: they will have to consist of short, light-weight but flexible routines that can be coded right into the architecture of the nano-structures. Much the way DNA works with ribosomes.
In the future we will all have to have a certain level of knowledge about biology, neurology, and technology. This is the world of the Nano-Machines. The Data Vault has very high data attribution levels. With the attraction and storage of massive sets of "like" data based on a key, it can make searching and storing of the information relatively easy and independent - something that other (current) data modeling techniques don't necessarily do.
The Data Vault's Link Entities exist to attach multiple Nano-Machines together and define the rules for the connection, much like the engines which cut, copy, and paste DNA when multiplying cellular structures. The Data Vault relies on a high degree of parallelism, and NanoTechnology (within the technology sciences areas) is proving to provide that, and more. Without the ability to "uniquely identify" each nano-machine, it will quickly become difficult to control their inputs/outputs and communication channels.
They will also need a new form of neural nets. The real challenge will be: how do we reduce the Neural Net algorithms so they are small enough to be encoded on a Nano-Machine? Once that can be accomplished, and we attach a Data Vault, we may begin to see some very interesting combinations of Nano-Machines. Information we didn't know about before may begin to be readily available.
I would suggest that this type of Nano-Machine is ground-breaking, and will change the face of computation forever. While it doesn't provide "thinking" ability, it will provide a baby-step towards combining form (Data Vault, NanoTech) and function (Neural Net) to make somewhat self-sufficient machines.
Today they are working on encoding and decoding atoms for reading/setting/writing bit-levels, treating each atom as a bit with 3 possible spins (1 - positive, 0 - negative, and 0/1 - both states at the same time). Interesting possibilities abound. The Data Vault, with its unique architecture, will lead to very interesting results in the Nanohousing field. |