Building a Data Platform

The Story so far

18 months ago we hired our first Data Engineer, without knowing much about how to build a Data Engineering team. We built the team quickly and had a lot to learn, but we knew one thing: we were starting almost from scratch, which meant we could build our architecture according to best practices and with modern technologies.

Now, 18 months later, we’ve built a great data architecture around three tools: DBT, Fivetran and Census. They allow us to use common development practices such as CI/CD, staging environments and PRs, while offering Reverse ETL, Data Quality monitoring and much more.

Looking at what we have and how the team works in our organisation, we decided it was time to take the next step.

Why?

Until now the team has been a dependency for other teams, with everything that entails. Team members worked ad hoc across multiple departments, often individually, which made knowledge sharing harder. Prioritising work was hard, with many stakeholders all wanting a piece of the team and no dedicated PM. We had been working towards a single KPI, Data Quality (DQ), and that shaped much of our work. We wanted most of the data in our Data Warehouse to have DQ measures associated with it, such as number of rows, field integrity and distribution of data. But getting all of this right meant working very closely with our BI team, which wasn’t always ideal. We did learn more about business dependencies, but roles and responsibilities became clouded. Ideally our BI team would add most of those DQ measures themselves, as they had the required insight, but they were not equipped to do it.
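To make the three kinds of DQ measures concrete, here is a minimal Python sketch. It is an illustration only and not how our pipelines implement them: the `profile` function, the `DQReport` type and the sample rows are all hypothetical names invented for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DQReport:
    """Hypothetical container for the three measures discussed above."""
    row_count: int                  # number of rows
    null_counts: dict[str, int]     # field integrity: missing values per field
    distribution: dict[str, float]  # share of rows per category value


def profile(rows: list[dict], required_fields: list[str], category_field: str) -> DQReport:
    """Compute simple data quality measures for a list of row dictionaries."""
    row_count = len(rows)
    # Field integrity: count rows where a required field is missing or None.
    null_counts = {
        name: sum(1 for row in rows if row.get(name) is None)
        for name in required_fields
    }
    # Distribution: fraction of rows for each value of the category field.
    counts = Counter(row.get(category_field) for row in rows)
    distribution = (
        {value: n / row_count for value, n in counts.items()} if row_count else {}
    )
    return DQReport(row_count, null_counts, distribution)


# Made-up sample data, purely for illustration.
rows = [
    {"user_id": 1, "country": "DK"},
    {"user_id": 2, "country": "DK"},
    {"user_id": None, "country": "US"},
    {"user_id": 4, "country": "DE"},
]
report = profile(rows, ["user_id", "country"], "country")
```

Measures like these only become useful once someone who knows the business decides what counts as an anomaly, which is why the BI team's insight matters here.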

Another important reason was one of our general guidelines: all teams should be able to perform at least 80% of their work without dependencies.

Inspired by the State of Platform Engineering paper, we decided to think of what we had as a product, a Data Platform, which we’d offer to the rest of our organisation. Instead of being a dependency, we’d have teams use our product themselves rather than rely on us to do their work. The Data Platform was born.

The process we went through

We went through a series of meetings with the purpose of turning a request-driven team into one that thinks like a product team. Instead of delivering a data model, a pipeline or DQ monitoring, we’d be delivering the platform on which those are built.

The Vision Meeting

The first meeting was about getting into the mindset. As Engineering Manager and, to some extent, PM, I had the advantage of receiving lots of input and was able to establish the company-wide vision. This meeting was about getting the entire team to understand that perspective at that level of abstraction.

The Features Meeting

In the next meeting we put our product and sales hats on and really went outside of our usual roles. How could we look at our work in a completely different light, abstract away the details and envision what the feature list of our product would look like? In other words, what does our product do, and what characteristics does it have?

The Details Meeting

Next, we wanted to add more detail to each feature, since labels such as Data Documentation, Data Compliance or Empower Product Development are hard to quantify. Let’s take Data Compliance as an example. In detail, it is about:

Data Flow Documentation, which is required to be GDPR compliant. This is simply the documentation that we maintain for each pipeline about what data we’re moving, where it comes from, where it goes and why.

Curated data taking GDPR requirements and consents into consideration. This is a piece of business logic we maintain which ensures that data is only used in the context for which the user has given consent.

Automated propagation of consent to all destinations. This is that business logic applied in the pipelines, ensuring that all destinations are updated accordingly.
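The last two points can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our actual pipeline code: the `User` type, the consent purpose names and the `DESTINATION_PURPOSE` mapping are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: int
    email: str
    # Purposes the user has consented to (purpose names are hypothetical).
    consents: set[str] = field(default_factory=set)


# Hypothetical mapping: which consent each destination requires.
DESTINATION_PURPOSE = {
    "email_tool": "marketing_email",
    "ad_audience": "advertising",
}


def curate(users: list[User], purpose: str) -> list[User]:
    """Curated data: keep only users who consented to this purpose."""
    return [user for user in users if purpose in user.consents]


def propagate(users: list[User]) -> dict[str, list[int]]:
    """Recompute the audience for every destination from current consents."""
    return {
        destination: sorted(user.user_id for user in curate(users, purpose))
        for destination, purpose in DESTINATION_PURPOSE.items()
    }


# Made-up users, purely for illustration.
users = [
    User(1, "a@example.com", {"marketing_email"}),
    User(2, "b@example.com", {"marketing_email", "advertising"}),
    User(3, "c@example.com"),
]
before = propagate(users)

# A user withdraws consent; rerunning propagate updates every destination.
users[1].consents.discard("advertising")
after = propagate(users)
```

The point of the design is that consent is evaluated in one place and every destination is derived from it, so a withdrawn consent cannot linger in a single downstream tool.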

The Maturity Meeting

In the final meeting we went through all the features and scored each one on how mature and ready for consumption by an “outsider” it is. We did a simple vote between low, medium and high, and discussed the cases where we didn’t agree. We agreed on most, but on a few our views on maturity differed considerably. One example was the feature Consolidated Development Process, which revolves around the advantages of working with pull requests. While it was easy to do PRs and gain those advantages, the main repository, which holds all the DBT code, was not structured in a way that is easy to understand if you are not familiar with DBT or the codebase. So on one hand the feature was very mature (you could do PRs), yet on the other hand it was not, since using it wasn’t as straightforward as we wanted.

Our Data Platform

The list of features, with the details and current maturity level of each, is:

  • Data Warehouse (medium)
    • Curated data
    • One source of truth
    • Tables that are optimized for analytics/reporting
    • Maintenance and operations for the data models
  • Ensured Data Quality + Monitoring (low)
    • Visibility over data quality metrics
    • Alerting on anomalies (proactive)
    • Evolution of the quality of data
    • Evaluating the business impact of data changes
  • Data Lineage - understand where data comes from and where it goes (high)
    • Speeds up troubleshooting (reactive)
    • Data change impact analysis (proactive)
    • Data Observability
  • Data Documentation (medium)
    • Easier to onboard new employees
    • Automatically updated
    • Empower employees to self-serve
    • Independent exploration of data
    • dbt.issuu.com
  • Consolidated development process (medium)
    • Improved knowledge sharing
    • Better development by using best practices (PR)
    • Traceability
    • Scheduled queries
  • Empower product development (high)
    • Capitalize on our data to provide customer value
    • Provide aggregate data for product
    • Take advantage of a modern data stack to easily build product features
    • Sitemap
    • Publisher Directory
    • Publisher Statistics
    • Reader statistics
    • Recommendations
    • Trending
  • Data Compliance (not an enabler, but something that is offered automagically) (medium)
    • Data Flow documentation (required to offer a DPA)
    • Curated data taking GDPR requirements and consent into consideration
    • Automated propagation of consent to all destinations
  • Data Automation - manual work can be automated (high)
    • Reverse ETL to a large variety of destinations and services
    • Data consistency across destinations (SFDC, Iterable, FB audiences, etc)
    • Remove human involvement, freeing up time
  • Data Operation Analysis (medium)
    • Cost analysis
    • Data usage analysis

You’re doing it all wrong

Something we heard a few times as we talked about what we were doing was that we were doing it wrong: if we want to build a product, we need to ask our users, understand the problems they face and build a solution for those. If we were starting from scratch, this would be correct, but we’ve already helped our users for a long time and had a pretty good idea of the problems they were facing. We also knew what we’d like them to do. We took the Apple approach and told our users what they needed - initially.

However, one very important learning is that we can’t just say, “Now we have a Data Platform, use it.” We need to educate our users, teach them what we already know and have done for them for a long time, and get them to understand why they should do it themselves. At the same time, we need to collect feedback and work on eliminating user pain points and reducing friction in the adoption of our product.

What’s next

So far we’ve presented this new product to our BI team, which we consider our primary user. We’ve also briefly announced it at a larger meeting with our product teams, and we plan to present an example we’ve already implemented in greater detail at a tech talk in January.

We’ve also had a meeting with our monetisation team about monitoring, where we showed how we can provide a very different experience that much better captures what the team needs to keep an eye on. It will be implemented using our Data Platform, ideally by the team themselves together with our BI team.

And finally, we’re compiling a list of use-cases that we will use to provide a better understanding of how this platform fits into the development process of Issuu and the value it can create.

If this sounds interesting, maybe you will be our new teammate?