Data Engineering at Issuu
The story so far
About a year ago we decided to build our first Data Engineering team at Issuu. This is the story of what came before that decision, what we did then, where we are now and where we want to be.
How it all started
Around 8 years ago we had grown to a size where we needed a dedicated Business Intelligence (BI) team. We made our first hire and got started. As often happens when we start a new team, it worked fairly ad hoc at first, until we had dipped our toes in the pond long enough to add structure to how it worked. For our BI-team, however, it took a long time to get to that structure. We went through several iterations of the team, and we were challenged by having a team that mostly worked directly for our board and management team. This meant there was a disconnect between the people reporting on our data and the people creating it. The team more or less dug into our data as it was, writing complicated and convoluted SQL queries to derive the information they needed. It was all the bad practices: querying production databases, doing everything manually, not knowing exactly what the data represented, lots of spreadsheets and very long instructions on which queries to run to build the reports.
We had to do something.
Fast forward to 18 months ago: we hired a consulting company to help us build a proper data warehouse with proper pipelines and to get everything automated. At this time everything data-related sat with our Devops team and our BI-team, which by now consisted of 4 members. That was a problem, since neither team had data management as its primary area of work. So the consultant was asked to run the project for us, and we would rely on their experience to set us up for the future. After 4 months of consultant work we had something. They built new connectors for us, exporting pre-defined reports and loading them into our Data Lake. The syncs were robustly built and addressed our concerns regarding main KPIs and metrics. However, they were so closely tailored to very narrow problems that even a small deviation would require heavy engineering work. We literally skipped building a good, long-lasting data architecture and jumped directly to views and dashboards.
We had to do something, again.
Our first Data Engineer
So a year ago we decided to build a Data Engineering team. As I was the engineering manager for our Devops and Platform team, and we knew the new team was going to be anchored within our engineering organisation, I was put in charge of building it. I didn’t have a whole lot of knowledge about Data Engineers, but we knew roughly what we needed someone to own and who they’d work with. So I worked with the BI-team, and together we came up with a description of the role and its responsibilities. We also devised a test, based on what our consultant had spent most of their time doing. The test, which we still use today, is a simple question based on a real scenario, and it allows us to figure out how broad our candidates are and where they’re strong. Essentially we can gauge how far into engineering and how far into BI they stretch. To this day it has proved very valuable: we have ended up offering a position to more than 70% of the candidates who passed it.
After a couple of weeks of screenings, tests and half-day interviews, we landed our first hire. The first couple of months involved a lot of figuring out. We had a bunch of homemade Python-based pipelines, some auto-generated reports and a lot of data sources that we needed to understand and hand over to our new Data Engineer. I learned the terminology and found out that I actually knew a whole lot about data engineering; I had just never used the right terms. As an example, we talked about data abstraction and how we needed to separate the data created by our production systems from the data used for reporting, which is essentially the concept of a Data Lake and a Data Warehouse.
Where we are now
Today we are in a good place. Early on I pushed for Data Infrastructure as Code, inspired by my background in Devops, where it’s all about automation and Infrastructure as Code these days. That led us to look at our ETL tools and at which tools existed in this space. With my limited knowledge of data architecture, the only tool I had heard of with a close GitHub integration was Looker, and that was the direction I thought we needed to go. We then checked Fivetran, which we were already using, and found that it has a full-fledged API that allows us to get all the pipeline configuration into code. For transformation we quickly decided that dbt was the right tool for us. And for the data warehouse we were already using Google’s BigQuery and didn’t see any reason to try something else.
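To make the "pipeline configuration as code" idea a bit more concrete, here is a minimal sketch in Python. It is not how our setup actually looks, and it does not call the Fivetran API. The `Connector` fields, the connector names and the `plan_changes` helper are all hypothetical. The only point is to show the principle: the desired configuration lives in version control, and a small program works out what has to change in the tool to match it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical, simplified description of one ETL connector."""
    name: str
    source: str
    sync_frequency_minutes: int


def plan_changes(desired, actual):
    """Compare the configuration kept in Git (desired) with what the ETL
    tool reports (actual) and list the actions needed to reconcile them."""
    desired_by_name = {c.name: c for c in desired}
    actual_by_name = {c.name: c for c in actual}

    plan = []
    for name, connector in sorted(desired_by_name.items()):
        if name not in actual_by_name:
            plan.append(("create", name))
        elif actual_by_name[name] != connector:
            plan.append(("update", name))
    # Anything the tool has that is no longer in Git should be removed.
    for name in sorted(actual_by_name.keys() - desired_by_name.keys()):
        plan.append(("delete", name))
    return plan


# Assumed example data: in practice "actual" would come from the tool's API.
desired = [
    Connector("billing", "postgres", 60),
    Connector("web_events", "s3", 15),
]
actual = [
    Connector("billing", "postgres", 360),
    Connector("legacy_export", "sftp", 1440),
]

for action, name in plan_changes(desired, actual):
    print(action, name)
```

The nice property of this shape is that the plan can be reviewed in a pull request before anything is applied, just like any other infrastructure change.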
A 1-person team is not really a team, and we had planned to scale the team to at least 3 members by the end of the year. Due to organisational changes we didn’t quite make it, but we finished strong and landed the 2 hires we needed, both joining us in February 2022.
What’s to come
For 2022 we have a bunch of work ahead of us.
We have only just started building out our Data Warehouse, and this work will continue throughout 2022. We have the framework ready, with dbt as its centre point, and we need to automate and clean up all our old processes. We need to help our BI-team use the Data Warehouse and build trust in our data throughout the company. We need them to collaborate closely with the Data Engineering team on long-term solutions, rather than trying to answer their inquiries quickly.
We want to have a modern data architecture where data reliability and quality are ensured and measured. We want to act on anomalies the moment they happen and not when the data is used in a report one month later. We want to be able to explain how every single piece of data is generated and how it should be interpreted.
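As an illustration of what "acting on anomalies the moment they happen" could mean in practice, here is a small Python sketch. It is only one possible approach, not something we have built: it assumes we track a daily row count per table and flag a new value that lies more than a chosen number of standard deviations from the recent history. The function name, the threshold and the example numbers are all made up.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` deviates from `history` by more than
    `threshold` sample standard deviations (a simple z-score check)."""
    if len(history) < 2:
        # Not enough data to estimate a spread, so do not raise an alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly constant history: any change at all is suspicious.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Assumed example: row counts loaded per day for one table.
recent_row_counts = [1000, 1020, 980, 1010, 990]

print(is_anomalous(recent_row_counts, 1005))  # a normal day
print(is_anomalous(recent_row_counts, 400))   # a load that lost most rows
```

A check like this would run right after each load, so a broken sync shows up the same day instead of in next month's report.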
We want to employ machine learning as a means of turning unstructured data into structured data. We have a few classifiers we need to build, and we will see how far we can take this. There’s a lot of interest in this area and I’m sure we’ll have a lot of fun building this out.
And for me the most important part is to ensure we only do meaningful work where we enable other teams to be data-driven and build the solutions that our users are looking for.