Atlan | Humans of Data

The Top 5 Data Trends for CDOs to Watch Out for in 2021

By Prukalpa Sankar January 26, 2021
Modern metadata solutions, data quality frameworks, infrastructure, job roles, and other big changes are on their way.

Like every other realm, the data world was upended in 2020. When COVID shut down businesses and sent employees to work from home, companies had to quickly adapt to the “new normal”.

Cloud became an absolute necessity, as organizations moved to working remotely. Data governance and security became a big priority, with everyone accessing data from different locations and systems. Smarter AI became appealing, now that historical models were meaningless. In short, organizations realized they needed to make changes fast. Data investments went up and organizations sought to upgrade their systems and create the perfect data stack.

With 2020 in the rearview mirror, we’re now looking ahead to a new and hopefully better year. What will 2021 bring to the data world? How will data infrastructure evolve to keep up with all the latest innovations and changes?

This year, we’ll see several new data trends: the emergence of new data roles and data quality frameworks, the rise of the modern data stack and modern metadata solutions, and the convergence of data lakes and warehouses.

1. Data lakes and warehouses are converging

For the past decade, data architects have designed data operations around two key units:

  • Data lakes: Cheap storage to store vast amounts of raw or even unstructured data. The data lake architecture is typically great for ad-hoc exploration and data science use cases.
  • Data warehouses: Traditionally, data warehouses have optimized compute and processing speed. This is helpful for reporting and business intelligence, making warehouses the system of choice for analytics teams.

Today, many companies still use both systems — a data lake for all their data, plus specialized data warehouses for analytics and reporting use cases.

While we’re not there yet, we’re starting to see the two ecosystems converge as data lakes and warehouses both add more capabilities.

Data warehouses like Snowflake already separate costs for storage and compute, drastically reducing the costs associated with storing all your data on data warehouses. Taking this one step further, some data warehouse players have started adding support for semi-structured data.

On the other hand, data lake players like Databricks have started moving towards the concept of a “data lakehouse”, and they recently announced support for SQL analytics and ACID transactions.

Learn more:

  • Data Lakehouses: An emerging system design that combines the data structures and management features from a data warehouse with the low-cost storage of a data lake.
  • The Great Data Debate: A cool episode of the a16z podcast with thought-provoking notes about different technologies and architectures emerging in the data stack.

2. The “modern data stack” goes mainstream

Starting in 2020, the term “modern data stack” was everywhere you looked in the data world. It refers to the new, best-of-breed modern data architecture for dealing with massive amounts of data.

One of the key pillars of the modern data stack is a powerful cloud platform. Originally centered around cloud data warehouses, it’s also beginning to include cloud data lakes and associated data lake engines.

Today, the modern data stack refers to a suite of tools for every part of the data workflow:

  • Data ingestion: e.g. Fivetran, Stitch, Hevodata
  • Data warehousing: e.g. Snowflake, BigQuery
  • Data lakes: e.g. Amazon S3
  • Data lake processing: e.g. Presto, Dremio, Databricks, Starburst
  • Data transformation: e.g. dbt, Matillion
  • Metadata management: e.g. Atlan
  • BI tools: e.g. Looker

Learn more:

  • Emerging Architectures for Modern Data Infrastructure: A great, in-depth read about which technologies are winning in the modern data stack, based on interviews with 20+ practitioners.
  • Modern Data Stack Conference 2020: Resources from Fivetran’s first modern data stack conference on the latest innovations, tools, and best practices.
  • The Modern Data Stack Newsletter: A biweekly newsletter with blogs, guides, and podcasts on the modern data stack.

3. Metadata 3.0: metadata management is reborn

As the modern data stack matures, companies have undertaken ambitious projects to upgrade their data infrastructure and sort out basic data needs (i.e. ingesting data, wrapping up cloud migration projects, and setting up new BI tools). While these have unlocked a lot of potential, they’ve also created chaos.

Context questions like “What does this column name actually mean?” and “Why are the sales numbers on the dashboard wrong again?” kill the agility of teams that are otherwise moving at breakneck speed.

While these aren’t new questions, we’re on the cusp of new disruptive solutions. As modern data platforms are converging around five main players (AWS, Azure, Google Cloud Platform, Snowflake, and Databricks) and metadata itself is becoming big data, there’s significant potential for bringing intelligence and automation to the metadata space.

In the next 24 to 36 months, we’ll see the rise of one or more modern metadata management platforms built for the modern data stack, which solve for data discovery, data cataloging, data lineage, and observability.

Learn more:

  • Data Catalog 3.0: My article on the past and future of metadata solutions, and why we’re about to make a huge leap forward in creating modern metadata for the modern data stack.

4. New roles emerge: Analytics Engineer and Data Platform Leader

2020 saw two roles become more mainstream than ever before.

1. Data Platform Leader

Organizations are increasingly realizing that there needs to be a central team responsible for developing data platforms that help the rest of the organization do their work better. And naturally, this team needs a leader.

In the past, this was handled by more traditional roles like Data Warehousing Specialist or Data Architect. Now it’s common to have a dedicated data leader who drives data initiatives across the organization. These people go by a range of titles, such as “Head of Data Platform” or “Director of Data Platforms”.

Data platform leaders typically oversee the modernization (or set-up from scratch, for startups) of a company’s data stack. This includes setting up a cloud data lake and warehouse, implementing a data governance framework, choosing a BI tool, and more.

This new role comes with an important new KPI: end user adoption. This refers to the leader’s ability to get people and teams within the organization to adopt data (and data platforms) in their daily workflows. This is a welcome change, as it aligns the incentives of those deciding what data products to invest in with those who ultimately use the products.

2. Analytics Engineer

Every analyst I’ve spoken to in the past decade has had one major frustration: depending on data engineers to productionize their work and set up data pipelines.

The rise of powerful SQL-based pipeline-building tools like dbt and Dataform has changed this for the better. These tools give analysts superpowers, putting the entire data transformation process directly in their hands.

The result is the rise of the term “Analytics Engineer”, which describes former analysts who now own the data workflow end to end, from ingestion and transformation to delivering usable data sets to the rest of the business.

Learn more:

  • What Is an Analytics Engineer? An article from Claire Carroll at dbt about the why and how behind the new analytics engineering role.

5. Data quality frameworks are on the rise

Data quality is a space that hasn’t seen much innovation in the last two decades. However, it’s recently made significant strides, and different aspects of data quality are being incorporated throughout the data stack.

Data quality profiling

Data profiling is the process of reviewing data to understand its content and structure, check its quality, and identify how it can be used in the future.

Profiling can happen several times through a data asset’s lifecycle, ranging from shallow to in-depth assessments. It includes calculating missing values, minimums and maximums, medians and modes, frequency distributions, and other key statistical indicators that help users understand the underlying data quality.
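As a rough illustration, a shallow profiling pass over a single column can be sketched in plain Python (the column values and the function name here are invented for the example):

```python
import statistics
from collections import Counter

def profile_column(values):
    """Compute a shallow data-quality profile for one column."""
    present = [v for v in values if v is not None]
    freq = Counter(present)
    return {
        "missing": len(values) - len(present),   # count of null values
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "mode": freq.most_common(1)[0][0],       # most frequent value
        "frequency": dict(freq),                 # full frequency distribution
    }

profile = profile_column([10, 20, 20, None, 35])
print(profile["missing"], profile["mode"])  # 1 20
```

A real profiler would run this kind of pass per column, at scale, and surface the results alongside the asset in a catalog.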

While data quality profiling was typically a stand-alone product in the data stack, companies are increasingly incorporating it as a capability in modern data catalogs, enabling end users to understand and trust their data.

Business-driven data quality rules

Data quality isn’t just about the statistical understanding of data. It’s also about whether the data is trustworthy, based on business context.

For example, your sales numbers typically shouldn’t increase by more than 10% per week. A 100% spike in sales should alert the right team member and stop the data pipeline run, rather than making its way to a dashboard the CEO uses!
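A business rule like this is easy to express in code. The sketch below is hypothetical (the threshold, exception, and function names are just for illustration), but it shows the shape of such a check: fail loudly and halt the pipeline run instead of letting bad numbers reach a dashboard:

```python
class DataQualityError(Exception):
    """Raised to stop a pipeline run when a business rule fails."""

def check_weekly_sales_growth(last_week, this_week, max_growth=0.10):
    """Business rule: weekly sales shouldn't grow more than max_growth."""
    growth = (this_week - last_week) / last_week
    if growth > max_growth:
        raise DataQualityError(
            f"Sales grew {growth:.0%} week-over-week; "
            f"expected at most {max_growth:.0%}. Halting pipeline."
        )
    return growth

check_weekly_sales_growth(100_000, 105_000)    # 5% growth: passes
# check_weekly_sales_growth(100_000, 200_000)  # 100% spike: raises DataQualityError
```

In practice, the orchestrator would catch the failure, stop downstream tasks, and alert the right team member.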

This need for intelligent alerts has led organizations to bring business teams into the process of writing data quality checks.

There still isn’t a great way for data teams to collaborate with business counterparts on data quality checks, but I expect this space will see a lot of innovation in the years to come. In the future, we’ll see smarter solutions that auto-generate business-driven data quality rules based on trends in the data.

Data quality tests in data pipelines

The third way data quality is becoming common is by writing quality checks into the data pipeline itself. This borrows principles from “unit tests” in the software engineering world.

Software engineering has included unit testing frameworks for years. These automatically test each individual unit of code to make sure it’s ready to use. Data quality tests within the pipeline mimic unit testing frameworks to bring the same confidence and speed to data engineering.

This helps teams catch data quality issues caused by upstream data changes before they affect the organization’s workflows and reports.

Learn more:

  • Amazon Deequ: Built internally at Amazon, Deequ is a promising open-source framework for data quality profiling.
  • Great Expectations: This is emerging as a popular open-source community for data quality testing within data pipelines.
  • Netflix’s Presentation on Scaling Data Quality: This is an interesting read for any data leader getting started on their data quality journey.

Do you agree or disagree with these trends? Spotted something that we’ve missed? Drop a comment with your insights!

This article was originally published in Towards Data Science.
