Summary:
Hot models…data models that is. A survey of many of the most popular data modeling approaches in the news today. Third Normal Form, Anchor Modeling, Data Vault, Data Lakes, Data Swamps. What do they do well, what do they do badly, and which is the one true data model to rule them all? (Hint: it depends, as usual.)
Third Normal Form (3NF) (a.k.a. Naomi Sims)
History: E.F. Codd defined 3NF in 1971 while working at IBM.
Basic Concept:
“The Key, the Whole Key, and Nothing but the Key” -Bill Kent
The gold standard for purist relational database design. A table is in 3NF if it meets the following criteria:
- 1NF – a) values in a particular field must be atomic, and b) a single row cannot have repeating groups of attributes
- 2NF – in addition to being in 1NF, all non-key attributes of the table depend on the primary key
- 3NF – in addition to being in 2NF, there is no transitive functional dependency (no non-key attribute depends on another non-key attribute)
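To make the transitive-dependency rule concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer/orders schema and all column names are illustrative assumptions, not taken from any particular system: storing a customer's city on every order row would make customer_city depend on customer_id rather than on the order key, so the customer attributes get a table of their own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Customer attributes live with the customer key they depend on.
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL,
    customer_city TEXT NOT NULL      -- depends only on customer_id
);

-- Every non-key column here depends on the whole order key and nothing else.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    order_date  TEXT NOT NULL,
    amount      REAL NOT NULL
);
""")

# A customer's city change is now a single-row update instead of touching every order.
conn.execute("INSERT INTO customer VALUES (1, 'Ada', 'Stockholm')")
conn.execute("UPDATE customer SET customer_city = 'Uppsala' WHERE customer_id = 1")
```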
Pros:
- A battle-tested, well-understood modeling approach that is extremely useful for transactional (OLTP) applications
- Easy to insert, update, and delete data because of referential integrity
- Avoids redundancy, requiring less space and fewer points of contact for data changes
- Many software tools exist to automatically create, reverse engineer, and analyze databases according to 3NF
- Writing to a 3NF database is very efficient
Cons:
- Reading from a 3NF database is not as efficient
- Not as easily accessed by end users because of the increased number of joins (see the query sketch below)
- More difficult to produce analytics (trends, period-to-date aggregations, etc.)
- Even transactional systems are often slightly de-normalized from 3NF for performance or auditability
- Some people feel that 3NF is no longer as appropriate in an era of cheap storage, incredibly fast computing, and APIs
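To illustrate the read-side cost called out above, here is a hedged continuation of the same illustrative schema: even a simple per-city sales report requires a join back to the customer table plus an aggregation on top of it.

```python
import sqlite3

# Recreate the illustrative 3NF customer/orders schema from the sketch above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, customer_city TEXT);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer (customer_id),
                       order_date TEXT, amount REAL);
""")

# Reads routinely require joins: a per-city, year-to-date total touches both tables.
rows = conn.execute("""
    SELECT c.customer_city, SUM(o.amount) AS total_sales
    FROM   orders o
    JOIN   customer c ON c.customer_id = o.customer_id
    WHERE  o.order_date >= '2015-01-01'
    GROUP  BY c.customer_city
    ORDER  BY total_sales DESC
""").fetchall()
print(rows)   # empty here, but shows the join shape analysts must write
```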
(Diagram source: ewebarchitecture.com)
(Sources: https://en.wikipedia.org/wiki/Third_normal_form; http://www.1keydata.com/database-normalization/third-normal-form-3nf.php; http://www.island-data.com/downloads/papers/normalization.html; http://www.johndcook.com/blog/2012/11/21/rise-and-fall-of-the-third-normal-form/)
Anchor Modeling (incorporates Sixth Normal Form [6NF]) (a.k.a. Gisele Bundchen)
History: Created in 2004 in Sweden
Basic Concepts: Mimics a temporal database
- Anchors – entities or events. Example: a person
- Attributes – properties of anchors; can be historized, such as a favorite color that changes over time. Example: a person's name
- Ties – relationships between anchors. Example: siblings
- Knots – shared properties, such as states or reference values; effectively a combination of an anchor and a single attribute, with no history. Example: gender – only male/female
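A minimal sketch of what these constructs might look like as tables, again using Python's sqlite3. The table and column names only loosely echo anchor modeling's mnemonic naming style and are assumptions for illustration, not the official notation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Anchor: the Person entity, identity only.
CREATE TABLE PE_Person (PE_ID INTEGER PRIMARY KEY);

-- Static attribute: a person's name.
CREATE TABLE PE_NAM_Person_Name (
    PE_ID INTEGER PRIMARY KEY REFERENCES PE_Person (PE_ID),
    name  TEXT NOT NULL
);

-- Historized attribute: favorite color, one row per change.
CREATE TABLE PE_COL_Person_FavoriteColor (
    PE_ID      INTEGER NOT NULL REFERENCES PE_Person (PE_ID),
    color      TEXT NOT NULL,
    changed_at TEXT NOT NULL,
    PRIMARY KEY (PE_ID, changed_at)
);

-- Knot: a small, shared, unchanging set of values.
CREATE TABLE GEN_Gender (
    GEN_ID   INTEGER PRIMARY KEY,
    GEN_Code TEXT NOT NULL UNIQUE
);

-- Knotted static attribute: connects a person to a gender value, no history.
CREATE TABLE PE_GEN_Person_Gender (
    PE_ID  INTEGER PRIMARY KEY REFERENCES PE_Person (PE_ID),
    GEN_ID INTEGER NOT NULL REFERENCES GEN_Gender (GEN_ID)
);

-- Tie: the sibling relationship between two Person anchors.
CREATE TABLE PE_PE_Sibling (
    PE_ID_one INTEGER NOT NULL REFERENCES PE_Person (PE_ID),
    PE_ID_two INTEGER NOT NULL REFERENCES PE_Person (PE_ID),
    PRIMARY KEY (PE_ID_one, PE_ID_two)
);
""")
```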
Pros:
- Incremental change approach – previous versions of a schema are always encompassed in new changes, so backwards compatibility is always preserved
- Reduced storage requirements by using knots
Cons:
- Many entities are created in the database
- Joins become very complex; hard for the end user to understand the model
- Daunting for new technical resources to come up to speed initially
(Diagram source: bifuture.blogspot.com)
(Sources: http://en.wikipedia.org/w/index.php?title=Anchor_Modeling&oldid=570899764; http://www.anchormodeling.com/)
Data Vault (DV) (a.k.a. Heidi Klum)
History: Dan Linstedt started developing and implementing data vaults in 1990 and published the first version of the methodology (DV1.0) in 2000. He published an updated version (DV2.0) in 2013. The methodology is proprietary: Dan maintains a copyright on it and requires anyone who trains others to be Data Vault certified. You can still implement data vaults; you just cannot train others on the methodology without being certified.
Basic Concept:
“A single version of the facts (not a single version of the truth)”
“All the data, all the time” – Dan Linstedt
The data vault consists of three primary structures plus supporting structures such as reference tables and point-in-time bridge tables. The three main structures are:
- Hubs – a list of unique business keys that change infrequently, with no other descriptive attributes (except for meta data about load times and data sources). A good example is a car or a driver.
- Links – relationships or transactions between hubs. These only define the link between entities and can easily support many-to-many relationships; again, no descriptive attributes other than a few meta-attributes. An example would be a link between cars and their drivers.
- Satellites – descriptive attributes about the hub or link to which they attach. A satellite for a car hub could describe the year, make, model, current value, etc. These often have some sort of effective dating.

(Diagram source: http://www.slideshare.net/kgraziano/introduction-to-data-vault-modeling)
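A hedged sketch of the car/driver example as hub, link, and satellite tables in Python/sqlite3. The column names, key style, and meta data fields are illustrative assumptions rather than a prescribed Data Vault layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: just the business key plus load metadata.
CREATE TABLE hub_car (
    car_hk        TEXT PRIMARY KEY,       -- surrogate/hash key
    car_vin       TEXT NOT NULL UNIQUE,   -- business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

CREATE TABLE hub_driver (
    driver_hk      TEXT PRIMARY KEY,
    driver_license TEXT NOT NULL UNIQUE,
    load_date      TEXT NOT NULL,
    record_source  TEXT NOT NULL
);

-- Link: the relationship between the two hubs, no descriptive attributes.
CREATE TABLE link_car_driver (
    car_driver_hk TEXT PRIMARY KEY,
    car_hk        TEXT NOT NULL REFERENCES hub_car (car_hk),
    driver_hk     TEXT NOT NULL REFERENCES hub_driver (driver_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Satellite: effective-dated descriptive attributes hanging off the car hub.
CREATE TABLE sat_car_details (
    car_hk        TEXT NOT NULL REFERENCES hub_car (car_hk),
    load_date     TEXT NOT NULL,           -- acts as the effective date
    model_year    INTEGER,
    make          TEXT,
    model         TEXT,
    current_value REAL,
    record_source TEXT NOT NULL,
    PRIMARY KEY (car_hk, load_date)
);
""")
```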
General best practices:
- Separate attributes from different source systems into their own satellites, at least in a raw data vault. With this approach it is common to have a raw data vault that contains source-system-specific information, with all history and attributes maintained, and a second, downstream business data vault. The business data vault contains only the relevant attributes, history, or merged data sets that have meaning to the users of that vault.
- Having a raw data vault allows you to preserve all historical data and rebuild the business vault if needs change, without having to go back to source systems and without losing data that is no longer available in the source system.
- Track all changes to all elements so that your data vault contains a complete history.
- Start small with a few sources and grow over time. You don't have to adopt a big-bang approach, and you can derive value quickly.
- It is acceptable to add new satellites when changes occur in the source system. This allows you to iteratively develop your ETL without breaking routines already created and tested.

DV2.0 – DV1.0 was merely the model. DV2.0 is:
- An updated modeling approach. Key changes include:
  - Numeric IDs are replaced with hash values, created in the staging area, which support better integration with NoSQL repositories
  - Because hashes are used, you can parallelize data loads even further: you do not have to look up a surrogate ID if you have the business key to hash from when you're bringing in data. This means you can load hubs, links, and satellites at the same time in some cases (see the sketch below)
  - Referential integrity is disabled during loading
- Recommended architectures around staging areas, marts, virtualization, and NoSQL
- Additional methodology recommendations around Agile, Six Sigma, CMMI, TQM, etc.
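A minimal sketch of the hash-key idea, assuming MD5 and a simple normalization rule (both are illustrative choices, not mandated by the methodology): because the key is derived from the business key alone, a staging process can compute it for hubs, links, and satellites without looking anything up.

```python
import hashlib

def hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic key from the business key alone (no surrogate-ID lookup)."""
    normalized = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always yields the same key in every load process,
# so hub, link, and satellite loads can run in parallel (keys below are made up).
car_hk        = hash_key("1FTFW1ET1EKE12345")                     # hub_car
driver_hk     = hash_key("D1234-56789")                           # hub_driver
car_driver_hk = hash_key("1FTFW1ET1EKE12345", "D1234-56789")      # link_car_driver
print(car_hk, driver_hk, car_driver_hk)
```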
Pros:
- Preserves all data, all the time – this provides tremendous capability for analysis and for responding to changing business needs. The approach allows you to obtain data from multiple sources iteratively and rapidly, preserving backwards compatibility
- Works extremely well with massively parallel processing (MPP) databases and hardware
- Can be loaded extremely rapidly, particularly using the DV2.0 modeling approach
- Lends itself very well to ETL and DW automation/virtualization
- DV2.0 covers a wide spectrum of needs, from staging and marts to methodology
Cons:
- The data model can spawn a lot of tables and make queries very complicated very quickly
- The raw data vault is really not meant for end users to query/explore directly
- Iterative additions make the data model more complicated
- Although storage may be cheap, keeping all changes for all data in all sources can lead to data sprawl. This also makes a pared-down information mart almost a necessity
- Raw DV data is not cleansed, and data from multiple sources is not blended when being stored
(Sources: https://en.wikipedia.org/wiki/Data_Vault_Modeling; http://roelantvos.com/blog/?p=1063; https://kentgraziano.files.wordpress.com/2012/02/introduction-to-data-vault-modeling.pdf)
Data Lake (DL) (a.k.a. Brooklyn Decker)
History: The term was coined by Pentaho CTO James Dixon in a 2010 blog post, referring to Pentaho's data architecture approach of storing data in Hadoop.
Basic Concept: A massive big data repository, typically built on Hadoop (at minimum HDFS). Key characteristics:
- Schema-less – data is written to the lake in its raw form without cleansing
- Ingests different types of data (relational, event-based, documents, etc.) in batch and/or real-time streaming
- Automated meta data management – a best practice is to use tools to automatically catalog meta data to track available attributes, last access times, data lineage, and data quality
- Typically multiple products are used to load data into and read data from the lake
- Rapid ability to ingest new data sources
- Typically only a destination; it is usually not a source from which operational systems will source data
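A hedged illustration of the schema-on-read idea: raw, uncleansed events stay in the lake exactly as they arrived (the sample records below are made up), and each consumer applies only the structure their analysis needs at read time.

```python
import json

# Raw events as they might land in the lake, including a record that won't parse.
raw_events = [
    '{"ts": "2015-06-01T10:00:00", "user": "alice", "action": "login"}',
    '{"ts": "2015-06-01T10:05:00", "user": "bob", "action": "purchase", "amount": 19.99}',
    'not even json',
]

def read_events(lines):
    """Schema on read: project only the fields this analysis cares about and skip bad records."""
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue                 # the raw record stays in the lake untouched
        yield {"ts": event.get("ts"), "user": event.get("user"), "action": event.get("action")}

for row in read_events(raw_events):
    print(row)
```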
Pros:
- Useful when you do not know what attributes will be needed or used
- Schema on read – can ingest any type of data and allow different users to assess value during analysis
- Extremely large scale at low to moderate cost
- Can and will use a variety of tools/technologies to analyze/visualize/massage data into a useful form
Cons:
- Can be seen as a vast wasteland of disorganized data, particularly without good meta data
- Consumers must understand the raw data in various systems to know how to integrate and cleanse it in order to derive meaningful information
- High likelihood that different consumers will perform very similar operations to retrieve data (i.e., overlap and duplication of effort). Slight differences between groups can lead to time spent reconciling results
- Uncleansed data and multiple versions of the same data may lead to duplication if not handled/filtered carefully
- It isn't SQL – some users will have to use more than just SQL to derive useful information from the data
- Offloading ETL can require significant rework of existing processes to move to something like Hive
- Using multiple tool sets can lead to training and supportability challenges if not governed properly
- Data curation can be very challenging
(Diagram source: http://www.slideshare.net/ImpetusInfo/enterprise-big-data-lake-challenges-strategies-maximizing-benefits-impetus-webinar-44720024)
(Sources: http://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/; http://knowledgent.com/whitepaper/design-successful-data-lake/; http://martinfowler.com/bliki/DataLake.html; https://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf; http://cacm.acm.org/blogs/blog-cacm/181547-why-the-data-lake-is-really-a-data-swamp/fulltext)
Data Swamp (DS) (a.k.a. Tyra Banks)
History: I’m not including a lot of history here, because this is really an extension of a Data Lake (gone bad).
Basic Concept: A data swamp is a data lake that has been poorly maintained or documented, lacks meta data, or has so much raw data that you don’t know where to start for insights. Or, it could be a combination of several of those points. When you start tracking tons of data from all different sources, but you don’t know who is using what, how to merge data sets, or how to use most of the data in your “data lake”, you’ve really got a data swamp.
Pros:
- Hey, you must've done something right to get all that data into the repository…?
- At least you haven't lost data that you can't go back and get
- If it were easy, everyone would be doing it
Cons:
- You've likely spent a lot of time and effort putting in a data lake/HDFS/Hadoop/Hive/etc., and you're struggling to operate it at scale or to answer the questions you set out to answer
- You need meta data to clue users into what is most useful, relevant, or recent
- You probably need to identify key use cases (low-hanging fruit) and start from that point as a place to begin using/resuscitating your repository

*** The assignment of model names to each data model was an incredibly (un)scientific process of googling various terms like "most famous supermodel", "top supermodel", etc. and teasing out the most likely #1. Feel free to disagree and let me know your vote and how you obtained it.