A beginner’s guide to the best-of-breed tools and capabilities for your data platform initiative
If you Google “modern data platform”, you’ll immediately be bombarded with advertisements and lots of companies professing that they are the one true data platform. Not so helpful, right?
So what the heck is a modern data platform? What does that even mean, and what does it look like in 2021?
The short answer: a modern data platform is a collection of tools and capabilities that, when brought together, allow organizations to achieve the gold standard — a fundamentally data-driven organization.
In this article, I’ll break down what a modern data platform means in practice today. This includes the three core characteristics, the six fundamental building blocks, and the latest data tools you should know.
The 3 characteristics of a modern data platform
Given the sheer scale and complexity of data today, it’s no longer enough for a modern data platform to process and store data. It has to move and adapt faster than ever to keep up with the diversity of its data and data users. There are 3 fundamental characteristics that make a data platform truly “modern”.
Enable self-service for a diverse range of users
In a world where everyone — from business users and marketers to engineers and product managers — is an analyst, people shouldn’t need an analyst to help them understand their company’s data.
One key aspect of a modern data platform is that it can be used intuitively by a wide range of users. If someone wants to bring data into their work, they should be able to easily find the data they need.
This means that the platform should make it possible for all users to…
- Easily discover and analyze data within the platform
- Understand the context associated with data, such as column descriptions, history and lineage
- Derive insights from data with minimal dependencies on the data or IT team
Enable “agile” data management
One of the major challenges in legacy data platforms is their complexity. Just getting access to data usually required setting up time-consuming ETL jobs. Need to modify your data or query even a bit? The lengthy process starts all over again.
Modern data platforms aim to change that. With a well-built modern platform, data-driven decision-making should be able to move at the speed of business.
The two fundamental principles that govern modern data platforms are availability and elasticity:
- Availability: Data is already available in a data lake or warehouse. Modern data lakes and warehouses separate storage and compute, which makes it possible to store large amounts of data relatively cheaply.
- Elasticity: Compute is based on a cloud platform, which allows for elasticity and auto-scalability. For example, if end users consume the most data and analytics on Friday afternoons, it should be possible to auto-scale processing power on Friday afternoon to give users a great experience and then scale down within a few hours.
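To make the elasticity idea concrete, here is a minimal sketch of a scaling rule. The function name, the worker limits, and the load figures are all hypothetical; real platforms expose auto-scaling through their own settings rather than code like this.

```python
import math

def desired_workers(queued_queries, queries_per_worker=20,
                    min_workers=1, max_workers=16):
    """Return how many compute workers to run for the current query backlog."""
    needed = math.ceil(queued_queries / queries_per_worker)
    # Clamp to a floor and a ceiling so that cost stays bounded.
    return max(min_workers, min(max_workers, needed))

# Hypothetical load: a quiet morning, the Friday afternoon peak, and the wind-down.
for label, load in [("Tuesday 09:00", 12), ("Friday 15:00", 240), ("Friday 19:00", 30)]:
    print(label, "->", desired_workers(load), "workers")
```

The point of the sketch is that compute follows demand in both directions: it grows for the Friday peak and shrinks again once the backlog clears.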
Flexible, fast set-up, and pay-as-you-go
Modern data platforms are a far cry from the complex, on-premises implementations of the Hadoop era. They are built in a cloud-first, cloud-native world, which means that they can be set up in hours, not years.
A modern platform should be…
- Easy to set up: no lengthy sales process, demo calls, or implementation cycles. Just log in, pay by credit card, and go!
- Pay as you go: no up-front payments or million-dollar licensing fees. The “modern” stack is all about putting power in the hands of the consumer, who pays only for what they use.
- Plug and play: the modern data stack will continue to evolve and innovate, so best-of-breed tools do not enforce “lock-in” like the tools of the legacy era. Instead, they are built on open standards and APIs that allow easy integration with the rest of the stack.
The key building blocks of a modern data platform
Modern data ingestion
Data ingestion is likely where your efforts to build a modern data platform will start: how do you bring data from a variety of sources into your core data storage layer?
Here are some key modern data ingestion tools:
- SaaS tools: Fivetran, Hevo Data, Stitch
- Open-source tools: Singer, StreamSets
- Custom data ingestion pipelines built on orchestration engines like Airflow
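As a rough illustration of what a custom ingestion step does, the sketch below extracts rows from a source export and lands them in a raw table. It uses only the Python standard library, with an in-memory SQLite database standing in for the storage layer. The table name, columns, and sample data are assumptions; in a real pipeline an orchestrator such as Airflow would schedule these steps.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from an API or a database.
SOURCE_CSV = """id,email,signed_up
1,ana@example.com,2021-01-04
2,raj@example.com,2021-01-05
"""

def extract(raw):
    """Parse the source export into one dict per row, with a numeric id."""
    return [{**row, "id": int(row["id"])} for row in csv.DictReader(io.StringIO(raw))]

def load(conn, rows):
    """Land rows in a raw table, replacing any existing row with the same id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_users "
        "(id INTEGER PRIMARY KEY, email TEXT, signed_up TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_users (id, email, signed_up) "
        "VALUES (:id, :email, :signed_up)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM raw_users").fetchone()[0]

conn = sqlite3.connect(":memory:")
rows = extract(SOURCE_CSV)
load(conn, rows)
print(load(conn, rows))  # running the job a second time does not duplicate rows
```

Keying the load on the primary key makes the job safe to re-run, which matters once a scheduler starts retrying failed runs.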
Modern data storage and processing
The data storage and processing layer is fundamental to the modern data platform. While this architecture is evolving, we typically see 3 kinds of tools or frameworks:
Data warehouses: The cornerstone of this architecture is a modern data warehouse. These are generally the system of choice for analysts since they optimize compute and processing speed.
Data lakes: A data lake architecture refers to data being stored in an object store such as Amazon S3, coupled with tools such as Spark to process the data. These cheap storage systems are often used to store vast amounts of raw or even unstructured data.
Here are some key tools for data lakes:
- Data storage: Amazon S3, Azure Data Lake Storage Gen2, Google Cloud Storage
- Data processing engines: Databricks or Spark; Athena, Presto, or Starburst; Dremio
New trend alert: Data lakehouses! One of the trends we’re seeing this year is the long-awaited convergence of data warehouses and lakes. This will help unify the siloed systems that most companies have created over the last decade.
For example, one concept that’s emerging is the “data lakehouse” — a system design that combines the data management features such as ACID transactions and change data capture from a data warehouse with the low-cost storage of a data lake.
Modern data transformation
There are two core implementations we’re seeing today for the data transformation layer. For companies with data warehouse–first architectures, tools like dbt that leverage native SQL for transformation have emerged as the top choice for data transformation. The other common implementation is using Airflow as an orchestration engine coupled with custom transformation in a programming language like Python.
Here are some key tools for data transformation:
- With data warehouses: dbt, Matillion
- With an orchestration engine: Apache Airflow + Python, R, or SQL
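The warehouse-first approach boils down to expressing each transformation as a SQL SELECT that the warehouse itself runs. The sketch below illustrates that idea using Python's built-in sqlite3 module as a stand-in for a warehouse. It is not dbt's interface, and the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw data, as it might land from the ingestion layer.
conn.executescript("""
CREATE TABLE raw_orders (order_id INTEGER, order_date TEXT, amount_cents INTEGER, status TEXT);
INSERT INTO raw_orders VALUES
  (1, '2021-03-01', 1200, 'paid'),
  (2, '2021-03-01',  800, 'refunded'),
  (3, '2021-03-02', 2500, 'paid'),
  (4, '2021-03-02',  500, 'paid');
""")

# The "model": a SELECT that defines an analysis-ready table.
DAILY_REVENUE_SQL = """
CREATE TABLE daily_revenue AS
SELECT order_date, SUM(amount_cents) / 100.0 AS revenue
FROM raw_orders
WHERE status = 'paid'
GROUP BY order_date
"""
conn.execute(DAILY_REVENUE_SQL)

daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

Because the logic lives in SQL, analysts can read and change it without touching pipeline code, which is a large part of the appeal of this approach.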
Modern business intelligence and analytics
BI dashboards have been around for ages, but the latest breed of BI and analytics tools is built to fit within a larger modern data platform. These are generally more self-service platforms that allow users to explore data, rather than just consuming graphs and charts.
Modern data catalogs and governance
While the modern data platform is great in some areas (super fast, easy to scale up, little overhead), it struggles with bringing discovery, trust and context to data.
As metadata itself becomes big data, we’re in the midst of a leap forward in metadata management, and the space is about to see significant innovation in the next 18–24 months. I recently wrote about the idea of Data Catalog 3.0: a new era of software built on the premise of embedded collaboration, which is key in today’s workplace. It borrows principles from GitHub, Figma, Slack, Notion, Superhuman, and other modern tools that are commonplace today.
Here are some key tools for modern data cataloguing and governance:
- SaaS tools: Atlan
- Open-source tools: Apache Atlas, LinkedIn’s DataHub, Lyft’s Amundsen
- In-house tools: Airbnb’s Dataportal, Facebook’s Nemo, Uber’s Databook
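To show the kind of context a catalog holds, here is a toy sketch with descriptions, column documentation, and lineage for three hypothetical tables. The class and function names are invented for illustration and do not reflect how any of the tools above are implemented.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TableEntry:
    description: str
    columns: dict = field(default_factory=dict)   # column name -> description
    upstream: list = field(default_factory=list)  # tables this one is built from

# Hypothetical catalog contents.
CATALOG = {
    "raw_orders": TableEntry("Orders as exported from the shop database",
                             {"amount_cents": "Order total in cents"}),
    "stg_orders": TableEntry("Cleaned orders, refunds removed",
                             {"amount": "Order total in dollars"}, ["raw_orders"]),
    "daily_revenue": TableEntry("Revenue per day for finance dashboards",
                                {"revenue": "Sum of paid order totals"}, ["stg_orders"]),
}

def search(term):
    """Discovery: find tables whose description or column docs mention a term."""
    term = term.lower()
    return sorted(
        name for name, entry in CATALOG.items()
        if term in entry.description.lower()
        or any(term in text.lower() for text in entry.columns.values())
    )

def lineage(table):
    """Context: list every upstream table, nearest first (breadth-first walk)."""
    found, queue = [], deque(CATALOG[table].upstream)
    while queue:
        name = queue.popleft()
        if name not in found:
            found.append(name)
            queue.extend(CATALOG[name].upstream)
    return found

print(search("refund"))
print(lineage("daily_revenue"))
```

Even at this scale, the two questions a business user asks most often ("where is the data I need?" and "where did it come from?") become simple lookups.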
Modern data privacy and access governance
Finally, as the tooling in a modern data platform grows, a major challenge is managing privacy controls and access governance across the entire stack. While it’s still early, players are emerging with tools that can act as an entitlement engine to apply privacy and security policies across the data stack.
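As a rough sketch of what an entitlement engine does, the example below masks any column that a role has not been explicitly granted. The policy format, role names, and column names are assumptions made for illustration, not the design of any particular product.

```python
# Hypothetical policies: which roles may read each column in the clear.
POLICIES = {
    "users.email": {"support", "admin"},
    "users.country": {"analyst", "support", "admin"},
}

def apply_policy(row, table, role):
    """Return a copy of the row with every column the role may not read masked."""
    result = {}
    for column, value in row.items():
        # Default deny: a column without a policy is masked for everyone.
        allowed = POLICIES.get(f"{table}.{column}", set())
        result[column] = value if role in allowed else "***"
    return result

row = {"email": "ana@example.com", "country": "BR", "notes": "VIP"}
print(apply_policy(row, "users", "analyst"))
print(apply_policy(row, "users", "support"))
```

Keeping the policies in one place, separate from the tools that query the data, is what lets the same rules apply across the whole stack.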
Here are some key tools for data privacy and access governance:
Other modern data tools you should know about
The capabilities and tools listed above are the basic layers of a modern data platform. However, each company uses data differently, so many have added additional layers and tools for specific use cases.
Real-time data processing tools
Companies that need real-time data processing generally add two additional types of tools to their data platform:
Data science tools
Companies that have moved beyond BI and analytics to predictive analytics and data science often add a specific data science tool (or tools) to their data stack:
Data quality tools
This space is still relatively nascent, but it’s seeing a lot of activity nowadays. Broadly, we’re seeing a few aspects of data quality incorporated throughout the data stack — data quality checks during profiling, business-driven quality rules, and unit testing frameworks within the pipeline.
- Data profiling: This is increasingly being absorbed by data catalog and governance tools like Atlan, or by open-source profiling frameworks like Amazon Deequ.
- Unit testing: Frameworks like the open-source Great Expectations are emerging and evolving, allowing unit tests to be written as part of the data pipeline itself.
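To illustrate what unit tests inside a pipeline look like, here is a minimal sketch in plain Python. It does not use the Great Expectations or Deequ APIs; the check functions, the check names, and the sample rows are all hypothetical.

```python
def check_not_null(rows, column):
    """Every row has a value for the column."""
    return all(row.get(column) is not None for row in rows)

def check_unique(rows, column):
    """No two rows share the same value for the column."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def check_between(rows, column, low, high):
    """Every value for the column falls within [low, high]."""
    return all(low <= row[column] <= high for row in rows)

# Hypothetical batch arriving in the pipeline; the last row has a bad amount.
orders = [
    {"order_id": 1, "amount": 12.0},
    {"order_id": 2, "amount": 30.0},
    {"order_id": 3, "amount": -5.0},
]

results = {
    "order_id_not_null": check_not_null(orders, "order_id"),
    "order_id_unique": check_unique(orders, "order_id"),
    "amount_between_0_and_10000": check_between(orders, "amount", 0, 10000),
}
failed = [name for name, passed in results.items() if not passed]
print(failed)
```

A pipeline would typically stop, or quarantine the batch, when the list of failed checks is not empty, so that bad data never reaches dashboards.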
This article was originally published in Towards Data Science.