Today’s business environment is in a constant state of flux. Increased globalization and rapid technological change have upended the conditions that previously defined competition. Data-driven businesses, however, have weathered the storm and emerged stronger than their counterparts, because their enterprise data platforms and data analytics tools help them recognize change and adapt to the unpredictability of markets and supply chains. The main limitation of dimensional and normalized data modeling methods is that they are not designed to adjust swiftly to unpredictable conditions. Data vault addresses this challenge by giving organizations greater flexibility and speed for their business intelligence needs.
In this article, we define what a data vault is, explore its evolution and structure, and weigh the pros and cons of the data vault methodology. Read on for more insight.
Data vault denotes two things: a database system (a set of tables that together form an enterprise data warehouse) and the modeling methodology used to design and build it.
For a detailed description, let’s delve into the information sources by Dan Linstedt[1] and Hans Hultgren[2]. A data vault comprises a set of tables created according to specific guidelines and matched to an organization’s reporting and analytical requirements. Combined, the tables form an enterprise data warehouse (EDW). Each table in a data vault is normalized[3]; taken together, however, the tables can be described as hyper-normalized or over-normalized.
Normalized tables separate Hubs (key business concepts) from Satellites (contextual data), so data that would usually be captured in a single table is distributed over several distinct tables within the same data vault. For instance, a business entity and its related properties are not captured in one table; instead, the data is spread across a number of tables.
These tables have specific metadata columns that specify, for instance, when a record was loaded and which source system it came from.
Note that metadata columns don’t capture user information requirements; they are, however, integral to processing data correctly.
As earlier stated, data vault also denotes a specific modeling approach, referred to as the data vault methodology. It explains how information can be modeled to accommodate change through extensible tables. It also illustrates and recommends the following actions:
Thus, the database system and the methodology go hand in hand. Ideally, you should apply the data vault methodology wherever a data vault is needed; when the methodology is applied, the resulting database system exhibits the properties of a data vault.
Read more about Data Automation Strategy
Dan Linstedt[4] is the pioneer of the data vault model and still champions it globally. Hans Hultgren has also made notable contributions in promoting data vault, especially through his book.
Initially, no existing architecture met the requirements of an EDW; the data vault architecture filled that gap. It is a hybrid model: it merges the best of both worlds (3NF and Star Schema) and addresses the requirements of an EDW, including:
Data vault is a best-of-breed model, not a reworked copy. Research on the data vault model began in 1990 and lasted a decade. In the early 2000s it was released as a public modeling method, and it gained broad acceptance between 2005 and 2010. Today, many organizations have used data vaults to build their data warehouses, and adoption continues to spread rapidly. Here’s a website[5] with a list of organizations that use, design, or support data vault architecture.
Consider the analogy of a semi-trailer truck and a Ferrari. The semi-trailer truck represents the third normal form (3NF), the design technique behind many relational databases. It uses normalization to reduce data duplication and guarantee referential integrity.
On the other hand, the Ferrari represents the Star Schema. It employs a single large fact table to store measured or transactional data and a few smaller dimension tables to store attributes about that data. It is optimized for data warehousing.
Unfortunately, neither the semi-trailer truck nor the Ferrari can match an SUV for combined on-road and off-road capability. The SUV offers the best of both worlds by combining the attributes of a car and a truck. In this analogy, the data vault architecture is the SUV, because it was specifically designed to satisfy the demands of an EDW.
The data vault architecture works similarly to the data warehouse definition attributed to Bill Inmon in 1990[6], with one exception: it is not subject-oriented but functionally oriented. This means the hubs are parallel, which allows for a quicker data loading process.
The tables used in this data modeling approach fall into three categories: Hubs, Links, and Satellites.
Each table is designed to deliver maximum flexibility and scalability while preserving much of the conventional data modeling skill set.
The Hub table represents a core business concept or entity, such as a vendor, invoice, customer number, employee number, sale, or product. Each of these concepts is defined around a business key, which describes how the ‘thing’ in question is tracked, located, and identified by your business.
Other details in the hub may include metadata columns, such as the load date and the record source.
Consider an organization that wants to record customer numbers across different departments. The accounting department may capture a customer number numerically, for example as 34567. Once the business entities are properly identified through their keys (for example, customer and account number), the next step is constructing the link entities.
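A minimal Python sketch of the hub described above may help. The table and column names (`HUB_CUSTOMER`, `hub_customer_hk`, and so on) are hypothetical, and an MD5 hash of the business key is assumed as the surrogate key:

```python
import hashlib
from datetime import datetime, timezone

def hub_customer_row(customer_number: str, record_source: str) -> dict:
    """One row of a hypothetical HUB_CUSTOMER table: only the business
    key plus metadata columns -- no descriptive attributes live here."""
    return {
        # surrogate hash key derived from the business key
        "hub_customer_hk": hashlib.md5(customer_number.encode()).hexdigest(),
        "customer_number": customer_number,                   # the business key
        "load_date": datetime.now(timezone.utc).isoformat(),  # when it was loaded
        "record_source": record_source,                       # which system supplied it
    }

row = hub_customer_row("34567", "accounting")
```

Note that the hub stores nothing about the customer except the key itself; all descriptive context belongs in satellites.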
It might be interesting for you: Data Engineering in Startups
A link is a meeting point or connection between business keys, and a data vault generally contains several links. A link table defines the interactions and relationships between business entities. The objective of the link entity is to record the past, present, and future interactions of business keys.
Each relationship to a hub is captured using a unique identifier added as a foreign key. Because many link tables describe an event, such as a cash withdrawal from an automated teller machine (ATM), a payment, or a hospitalization, they typically feature one or more date-and-time columns recording when the event occurred.
Links also have metadata columns containing information about where the data was extracted from. This affords the data vault model flexibility and scalability.
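A link row can be sketched in the same style as the hub, again with hypothetical table and column names and an assumed MD5 hash key. The link row carries only foreign keys to the two hubs plus metadata:

```python
import hashlib
from datetime import datetime, timezone

def md5(value: str) -> str:
    return hashlib.md5(value.encode()).hexdigest()

def link_customer_account_row(customer_number: str, account_number: str,
                              record_source: str) -> dict:
    """One row of a hypothetical LINK_CUSTOMER_ACCOUNT table: its own hash
    key, foreign keys to the two hubs, and the usual metadata columns."""
    return {
        # hash of the combined business keys identifies the relationship
        "link_customer_account_hk": md5(f"{customer_number}|{account_number}"),
        "hub_customer_hk": md5(customer_number),  # FK to HUB_CUSTOMER
        "hub_account_hk": md5(account_number),    # FK to HUB_ACCOUNT
        "load_date": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,           # where the data came from
    }

link = link_customer_account_row("34567", "ACC-1", "core_banking")
```

Because the link only stores keys, a new pairing of customer and account is simply a new row; neither hub is touched.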
The following relationships among business keys usually lead to the creation of links:
Designing links as many-to-many connections, in particular, has a key advantage: the physical data platform can accommodate data alterations and business rule adjustments with minimal or no effect on existing data sets and processes. Consequently, the data vault avoids redesigning or restructuring the enterprise data warehouse model when a relationship changes.
Satellites describe the properties of hubs and links. In other words, they give context to a business entity or concept, or to the relationship between two or more business concepts, such as customer and product. While hubs and links form the skeleton of the data vault architecture, satellites are the ‘meat’ of the data platform. Several satellites may provide descriptive data about a single business key, but a single satellite can describe only one key.
Contextual data typically changes over time, which gives the enterprise data warehouse model its time-variant element. When the context relates to a business key, it is descriptive data about that key (hub or link) and how it alters over time; when it relates to a relationship, it describes the interaction between keys and how that varies over time.
Here are some examples of descriptive data (context):
Satellites employ load dates and load-end dates to specify record lifespans, because most database systems don’t natively support time-series-based table structures.
In terms of entity structure, satellites feature the essential elements, including the parent hub or link key, the load date (and load-end date), the record source, and the descriptive attributes themselves.
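A minimal sketch of a satellite row, assuming hypothetical names and a far-future sentinel date for rows that are still current:

```python
from datetime import datetime, timezone

# sentinel load-end date marking a row as the current version
OPEN_END_DATE = "9999-12-31T00:00:00+00:00"

def sat_customer_row(hub_customer_hk: str, attributes: dict,
                     record_source: str) -> dict:
    """One row of a hypothetical SAT_CUSTOMER table. The parent hub key
    plus the load date together form the primary key; the load-end date
    stays open until a newer version of the attributes arrives."""
    return {
        "hub_customer_hk": hub_customer_hk,                   # parent key (PK part 1)
        "load_date": datetime.now(timezone.utc).isoformat(),  # PK part 2
        "load_end_date": OPEN_END_DATE,                       # open = current row
        "record_source": record_source,
        **attributes,  # the descriptive context, e.g. name, address
    }

sat = sat_customer_row("abc123", {"name": "Ada"}, "crm")
```

Including the load date in the primary key is what lets a satellite hold many historical versions of the same business key side by side.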
Here are some of the benefits associated with the data vault methodology:
This attribute signifies the ability to seamlessly accommodate new information requirements within a data platform. Typically, most data warehouses require changes to existing data structures whenever new information needs to be implemented, and those alterations may force unrelated revisions to parts of the existing architecture.
New information requirements in a data vault, by contrast, trigger the creation of new hubs, links, and satellites; no changes are made to existing data structures. This makes changes and adjustments easier to apply, and this extensibility makes data vault an extremely flexible solution.
Reproducibility goes hand in hand with compliance. One element of compliance entails reproducing reports or findings from the past. In the data vault model, both old and new data are stored in such a manner that they remain easily accessible.
Also, the data vault methodology enables previously created reports to be reproduced even after a change of data platform. Metadata columns, incorporated in all the tables, support this reproducibility.
Data should be easy to track and audit, and data vault supports auditing natively because load times and record sources are required in every row. It also tracks all previous alterations: satellites include the load time as part of the primary key, so a new record is inserted whenever an attribute changes. This inherent auditing allows you to satisfy accountability standards for regulatory and data security purposes.
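The change-tracking described above can be sketched as follows. This is a simplified in-memory illustration, not the standard implementation; column names are hypothetical, and `None` stands in for an open load-end date:

```python
def apply_attribute_change(satellite: list, hub_hk: str,
                           new_attributes: dict, load_date: str,
                           record_source: str) -> None:
    """Insert-only change tracking: end-date the current row for this
    key, then insert a new row. Nothing is deleted and no attribute is
    overwritten, so every historical state remains queryable."""
    for row in satellite:
        if row["hub_hk"] == hub_hk and row["load_end_date"] is None:
            row["load_end_date"] = load_date  # close the old version
    satellite.append({
        "hub_hk": hub_hk,
        "load_date": load_date,    # part of the primary key
        "load_end_date": None,     # open: this is now the current row
        "record_source": record_source,
        **new_attributes,
    })

sat_rows = []
apply_attribute_change(sat_rows, "k1", {"city": "Oslo"}, "2024-01-01", "crm")
apply_attribute_change(sat_rows, "k1", {"city": "Bergen"}, "2024-06-01", "crm")
# sat_rows now holds two versions: the Oslo row end-dated 2024-06-01,
# and the Bergen row still open (current).
```

Every row keeps its load date and record source, which is what makes the full audit trail reconstructible.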
Data vault supports the formation of different modeling relationships between business objects. This means no updates are needed when, for example, a 1-to-many relationship turns into a many-to-many relationship; no extra work is required within the data vault architecture when the cardinality of a relationship changes.
Data vault is associated with shorter implementation cycles, which translates to time and cost savings. Shorter cycles also help the business requirements for the data warehouse, and its ongoing enhancements, remain applicable through to completion, so organizations avoid shifting goalposts, which can increase expenditure.
These are key features of the data vault methodology. The agile approach to project management, increasingly popular among organizations, aligns closely with the concepts that characterize the data vault model. Combined, the two can give any business’s data strategy the necessary agility while avoiding the budget implications of expanding data storage whenever necessary.
Data vault also offers a faster data loading process for your data warehouse. The methodology enables data from several source systems to load quickly because most of the tables can be loaded simultaneously, in parallel. This allows the seamless management of huge data volumes.
There’s less dependency between tables during loading. Also, the data vault methodology uses inserts only, which simplifies the loading process; inserts generally load faster than upserts or merges.
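The parallel, insert-only loading pattern can be illustrated with a small sketch. The `load_table` function and the staged batches are stand-ins for a real warehouse loader and staging area:

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(name: str, staged_rows: list) -> tuple:
    """Stand-in for an insert-only bulk load of one data vault table."""
    target = []                   # stand-in for the warehouse table
    target.extend(staged_rows)    # inserts only: no upserts or merges
    return name, len(target)

# Because hubs, links, and satellites have few inter-dependencies at
# load time, staged batches for independent tables can load in parallel.
staged = {
    "hub_customer": [{"customer_number": "34567"}],
    "hub_account": [{"account_number": "ACC-1"}],
    "sat_customer": [{"name": "Ada"}],
}
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(lambda kv: load_table(*kv), staged.items()))
```

In a real pipeline each worker would issue bulk inserts against its own target table; the point is that no table has to wait for another.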
The above-mentioned benefits are a major attraction. However, data vault has a few drawbacks that organizations need to take into account. These include:
From a querying perspective, the data vault platform is complex because data is spread across many hubs, links, and satellites. As a result, generating reports within the data vault requires intricate SQL statements that join many tables.
Organizations that use the data vault methodology often develop several derived data stores, such as spreadsheet files, SQL-based data marts, Microsoft Access databases, and Microsoft SQL Server cubes, to address this shortcoming. Typically, these derived data stores have structures well suited to reporting requirements and deliver improved performance.
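To make the join overhead concrete, here is a sketch of deriving a flat reporting table from vault tables held as lists of dicts. The table and column names are hypothetical; even this tiny mart needs three joins (hub to satellite, hub to link, link to the second hub):

```python
def derive_customer_mart(hub_customer: list, sat_customer: list,
                         link_cust_acct: list, hub_account: list) -> list:
    """Flatten hub, satellite, and link rows into one denormalized
    reporting table, mimicking the joins a data mart query would run."""
    sat_by_key = {s["hub_customer_hk"]: s for s in sat_customer}    # join 1
    acct_by_key = {a["hub_account_hk"]: a for a in hub_account}     # join 3
    mart = []
    for hub in hub_customer:
        sat = sat_by_key.get(hub["hub_customer_hk"], {})
        for link in link_cust_acct:                                 # join 2
            if link["hub_customer_hk"] == hub["hub_customer_hk"]:
                acct = acct_by_key.get(link["hub_account_hk"], {})
                mart.append({
                    "customer_number": hub["customer_number"],
                    "name": sat.get("name"),
                    "account_number": acct.get("account_number"),
                })
    return mart

mart = derive_customer_mart(
    [{"hub_customer_hk": "c1", "customer_number": "34567"}],
    [{"hub_customer_hk": "c1", "name": "Ada"}],
    [{"hub_customer_hk": "c1", "hub_account_hk": "a1"}],
    [{"hub_account_hk": "a1", "account_number": "ACC-1"}],
)
```

A real report spanning more hubs and satellites multiplies these joins, which is exactly why pre-built data marts are the usual answer.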
One of the key reasons for adopting a data vault model is its auditing and historical tracking functionality. If neither of these is integral to your business, it may be hard to justify the operating cost of adding another layer to your data platform.
Thus, if the project is small or short-term, pursuing a data vault methodology may not be prudent; from a long-term perspective, however, it can be a sensible upfront investment.
Data vault takes a decomposed approach to business keys, relationships, and attributes, so it generates a high number of tables compared with star schema and other denormalized structures. As a result, many joins are needed to derive data marts from the data vault.
Data vaults are perfect for storing new data as well as any alterations to the original data, all within a highly extensible structure. The data vault approach lets you adopt a more nimble methodology and grow the model bit by bit.
You can begin with only a few business concepts, engaging some business users to track and locate the organization’s most important business keys. Next, add satellites to those business keys to capture historical descriptive data. Lastly, incorporate links to record all the crucial relationships between business keys.
If you want to know more, see our data engineering services.