PDF invoices → Real-time financial insights: How I stopped relying on an engineer to automate my workflow and learnt to do it myself

October 9, 2023 · 7 min read

COO

info

TL;DR I set up a data pipeline that automatically extracts data from PDF invoices in my Gmail and puts it all in a single Google Sheet where I can easily keep of track of it. I did this using the python library dlt that uses langchain and LLMs to read PDF data and converts it into structured tables.

I am Anna, co-founder & COO of dltHub. As an ops lead with many years of running SMB-size startups, I find myself juggling a myriad of tasks, from administration, finance, and people to customer support or sales. These tasks come with their own data, all of which are crucial for making decisions. This creates a huge scope for automation, but unfortunately getting engineering support is not always easy. Whether it's integrating tools with APIs or managing data efficiently, the waiting game can be frustrating.

So, I often end up doing manual tasks such as budgeting, cost estimation, updating CRM, or preparing audits. I have been dreaming about automating these processes.

For example, I need to analyze expenses in order to prepare a budget estimation. I get numerous PDFs daily in a dedicated Gmail group inbox. I was wondering to which extent dlt can help fulfill my automation dream. I decided to work with Alena from our data team on an internal project.

invoice flow chart

Use Case

Imagine this scenario: your team receives numerous invoices as email attachments daily. You need to extract and analyze the data within these invoices to gain insights crucial to your operations. This is where the data load tool (dlt) steps in.

Alena created a pipeline using dlt that automates the process of translating invoices received as email attachments in a specific Google email group and stores them in a database (for example, BigQuery or DuckDB).

As a non-coder working in tech startups for a long time, I finally got a chance to learn how to use the terminal and run a simple pipeline.

Here's a summary of how it works.

Let’s get started

In this article, I will show you an example of loading structured data from invoices received by email into BigQuery. For more details, check the README.md in the GitHub repository.

Step 1. Preparation

Make sure that you have all you need:

Make sure you have Python 3.x installed on your system.
Use a virtual environment (more details on how to set up the environment).
Install the dlt library by using pip install "dlt[bigquery]".
Create a project folder on your laptop. I called mine “unstructured_data_pipeline”.
We will need access to LLM, Langchain will use OpenAI models by default, so we also used an OpenAI API token.
Using a tool like Visual Studio makes it easier.

Step 2. Initiate the pipeline

To create the pipeline, we will use the dlt verified source unstructured_data, which includes the verified source inbox.

Init the pipeline by using dlt init unstructured_data bigquery.
Install necessary requirements pip install -r requirements.txt.

Step 3. Set up your credentials

The dlt init command creates folder .dlt in your project directory, and clones the source code from the verified-sources repository.

Open .dlt/secrets.toml file on your laptop.

Enter the OpenAI secrets:

[sources.unstructured_data]
openai_api_key = "openai_api_key"

Enter your email account secrets in the same section [sources.unstructured_data]:
```
host = 'imap.example.com'
email_account = "example@example.com"
password = 'set me up!'
```
Check here how to configure the inbox source.

Enter the BigQuery secrets:

[destination.bigquery]
location = "US"
[destination.bigquery.credentials]
project_id = "set me up!"
private_key = "set me up!"
client_email = "set me up!"

Read more about dlt credentials and BigQuery credentials.

Step 5: Define your queries

This is the part where you can define what you’d like to see as an outcome.

Queries example:

INVOICE_QUERIES = {
    "recipient_company_name": "Who is the recipient of the invoice? Just return the name. If you don't know, then return None",
    "invoice_amount": "What is the total amount of the invoice? Just return the amount as decimal number, no currency or text. If you don't know, then return None",
    "invoice_date": "What is the date of the invoice? Just return the date. If you don't know, then return None",
    "invoice_number": "What is the invoice number? Just return the number. If you don't know, then return None",
    "service_description": "What is the description of the service that this invoice is for? Just return the description. If you don't know, then return None",
}

Customize the INVOICE_QUERIES dictionary in the unstructured_data/settings.py file if you want to extract other information, or if your invoices have a different structure.

Step 6: Run the pipeline!

And now the magic happens. Use the following command to run the pipeline:

python unstructured_data_pipeline.py

In the next step, dlt will save all processed structured data to the database (in my case, BigQuery).

Step 7: Check the outcome in BigQuery

If you load it to BigQuery like I did in my example, then you can look at your data using BigQuery UI or export it directly to a Google sheet.

Step 8: Deploy

Now you can deploy this script with GitHub Actions as we did, so that it checks your incoming email every day and processes invoices automatically.

Outcome:

Here’s how the result looks like in BigQuery:

screenshot 1

…and as a Google Sheet. You can easily export this table from BigQuery to Google Sheets using the Export button in the top right corner.

screenshot 2

Bonus: In order to have a Google Sheet with live updates, you can go to the Data tab in your Spreadsheet → Data Connectors → BigQuery → choose your database and voila, your data will be updated automatically.

screenshot 3

Conclusion:

This worked well:

dlt was good at extracting the data I needed, and it indeed worked in real-time. I needed some support from Alena when running the pipeline for the first time, but that’s because I had never coded before. 😊
I was able to see the details that are relevant to my workaround budgeting.

This did not work well:

Some PDFs don’t get transformed correctly. Some details were missing or misspelled. That depends on the LLM, which extracts structured data from a raw text. And also on the invoice structure.
And it only worked well with digital PDFs, though not with JPG/scanned versions. Luckily, 99% of all the invoices are of the former kind. However, you can even set up this process for the other kinds of documents by making modifications to unstructured.io.

Where to go next?

It was definitely a great start, and we will test it further. And I already have many other use cases where dlt pipelines could help with ops automation processes. E.g.:

In creating a list of all contracts based on PDFs in a Google Drive folder (super useful for audits).
In moving specific data to CRM (e.g. invoice related information about the customers).

This specific example illustrates just one way in which Operations Leads can harness the power of dlt to analyze data efficiently without relying on engineers for extensive support. By automating data processes and enabling real-time insights, dlt empowers small startups to make informed decisions and stay competitive in their respective markets.

In the startup world where time is of the essence, dlt has a chance to be the key to unlock data's full potential and accelerate operational efficiency. I’m looking forward to saying goodbye to endless waiting and hello to a world where Operations Leads can take control of their data needs, all thanks to dlt.

Modeling Unstructured Data for Self-Service Analytics with dlt and Holistics

October 6, 2023 · 19 min read

Zaeem Athar

Junior Data Engineer

info

TL;DR: A modern analytics stack with dlt and Holistics to transform and ingest unstructured production data from MongoDB to flat tables in BigQuery for self-service analytics.

If you’re a CTO, then you probably love MongoDB: it’s scalable, production-ready, and a great dump for unstructured, and semi-structured data. If you’re however a data scientist or data analyst and you need to run analytics on top of MongoDB data dumps, then you’re probably not a fan. The data in MongoDB needs to be transformed and stored in a data warehouse before it is ready for analytics. The process of transforming and storing the data can become quite tedious due to the unstructured nature of the data.

In this blog, we will show you how you can combine dlt and Holistics and create a modern data stack that makes the process of extracting unstructured data from MongoDB, and running self-service analytics on the data simple and straightforward. We will use dlt to ingest the Movie Flix Dataset into BigQuery from MongoDB and use Holistics to transform the data and run self-service analytics.

An Overview of the MongoDB Modern Analytics Stack

Diagram illustrating the inner workings of our Modern Analytics Stack

Tool	Layer	Why it’s awesome
MongoDB	Production	Sometimes used as a data dump by CTOs. Often stores unstructured, semi-structured production data that stakeholders want to access.
dlt	Data Ingestion	Mongo is great, but then others struggle to analyze the data. dlt extracts data from MongoDB, creates schema in BigQuery, and loads normalized MongoDB data into BigQuery.
BigQuery	Data Warehouse	Because of its pricing model, it’s a good data warehouse choice to store structured MongoDB data so it can be used by BI tools like Holistics for self-service analytics.
Holistics	Data Modeling for Self-Service Analytics	Holistics makes it easy for data teams to setup and govern an end-user self-service analytics platform using DevOps best practices

In our stack, dlt resides in the data ingestion layer. It takes in unstructured data from MongoDB normalizes the data and populates it into BigQuery.

In the data modeling layer, Holistics accesses the data from BigQuery builds relationships, transforms the data, and creates datasets to access the transformations. In the reporting layer, Holistics allows stakeholders to self-service their data by utilizing the created datasets to build reports and create visualizations.

MongoDB is loved by CTOs, but its usage creates issues for stakeholders.

NoSQL databases such as MongoDB have gained widespread popularity due to their capacity to store data in formats that align more seamlessly with application usage, necessitating fewer data transformations during storage and retrieval.

MongoDB is optimized for performance and uses BSON (Binary Javascript Object Notation) under the hood as compared to JSON. This allows MongoDB to support custom and more complex data types, such as geospatial data, dates, and regex. Additionally, BSON supports character encodings.

All these benefits enable MongoDB to be a faster and better database, but the advantages of the flexibility offered by MongoDB are sometimes abused by developers and CTOs who use it as a dump for all types of unstructured and semi-structured data. This makes this data inaccessible to stakeholders and unfit for analytics purposes.

Moreover, the unique nature of MongoDB with its BSON types and its usage as a data dump in current times mean that additional hurdles must be crossed before data from MongoDB can be moved elsewhere.

How does our Modern data stack solve the MongoDB problem?

In the data ingestion layer, dlt utilizes the MongoDB verified source to ingest data into BigQuery. Initializing the MongoDB verified source setups default code needed to run the pipeline. We just have to setup the credentials and specify the collections in MongoDB to ingest into BigQuery. Once the pipeline is run dlt takes care of all the steps from extracting unstructured data from MongoDB, normalizing the data, creating schema, and populating the data into BigQuery.

Getting your data cleaned and ingested into a data warehouse is just one part of the analytics pipeline puzzle. Before the data is ready to be used by the entire organization the data team must model the data and document the context of data. This means defining the relationships between tables, adding column descriptions, and implementing the necessary transformations. This is where Holistics shines. With analytics-as-code as first-class citizens, Holistics allows data teams to adopt software engineering best practices in their analytics development workflows. This helps data teams govern a centralized curated set of semantic datasets that any business users can use to extract data from the data warehouse.

Why is dlt useful when you want to ingest data from a production database such as MongoDB?

Writing a Python-based data ingestion pipeline for sources such as MongoDB is quite a tedious task as it involves a lot of overhead to set up. The data needs to be cleaned before it is ready for ingestion. Moreover, MongoDB is a NoSQL database meaning it stores data in a JSON-like data structure. So if you want to query it with SQL natively, you will need to transform this JSON-like data structure into flat tables. Let's look at how this transformation and cleaning can be done:

Create a Data Model based on the MongoDB data we intend to ingest.
Create tables in the data warehouse based on the defined Data Model.
Extract the data from MongoDB and perform necessary transformations such as Data Type conversion (BSON to JSON), and flattening of nested data.
Insert the transformed data into the corresponding SQL tables.
Define relationships between tables by setting up primary and foreign keys.

Using the dlt MongoDB verified source we can forgo the above-mentioned steps. dlt takes care of all the steps from transforming the JSON data into relational data, to creating the schema in the SQL database.

To get started with dlt we would need to set some basic configurations, while everything else would be automated. dlt takes care of all the steps from creating schema to transforming the JSON data into relational data. The workflow for creating such a data pipeline in dlt would look something like this:

Initialize a MongoDB source to copy the default code.
Set up the credentials for the source and destination.
Define the MongoDB collection to ingest (or default to all).
Optionally configure incremental loading based on source logic.

What is useful about Holistics in this project?

Holistics is a Business Intelligence platform with the goal of enabling self-service analytics for entire organizations. Holistics works by connecting to an SQL data warehouse. This allows it to build SQL queries and execute them against the data warehouse. In essence, Holistics utilizes the storage and processing capabilities of the data warehouse and the data never leaves the data warehouse.

To enable self-service Holistics introduces a modeling layer. The data teams use this layer to define table relationships, data transformations, metrics, and data logic. The entire organization can utilize these metrics and data logic defined in this layer to self-service their data needs.

In addition to the transformation layer, Holistics provides advanced features such as defining models using code through Holistics’ analytics-as-code languages (AMQL) and utilizing Git version control systems to manage code changes. Moreover, data teams can integrate with dbt to streamline the data transformations.

The overall Holistics workflow looks something like this:

Holistics Overview

Connect Holistics to an existing SQL data warehouse.
Data teams use Holistics Data Modeling to model and transform analytics data. This model layer is reusable across reports & datasets.
Non-technical users can self-service explore data based on predefined datasets prepared by data teams. They can save their explorations into dashboards for future use.
Dashboards can be shared with others, or pushed to other platforms (email, Slack, webhooks, etc.).

Code Walkthrough

In this section, we walk through how to set up a MongoDB data pipeline using dlt. We will be using the MongoDB verified source you can find here.

1. Setting up the dlt pipeline

Use the command below to install dlt.

pip3 install -U dlt

Consider setting up a virtual environment for your projects and installing the project-related libraries and dependencies inside the environment. For best installation practices visit the dlt installation guide.

Once we have dlt installed, we can go ahead and initialize a verified MongoDB pipeline with the destination set to Google BigQuery. First, create a project directory and then execute the command below:

dlt init mongodb bigquery

The above command will create a local ready-made pipeline that we can customize to our needs. After executing the command your project directory will look as follows:

.
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── mongodb
│   ├── README.md
│   ├── __init__.py
│   └── helpers.py
├── mongodb_pipeline.py
└── requirements.txt

The __init__.py file in the mongodb directory contains dlt functions we call resources that yield the data from MongoDB. The yielded data is passed to a dlt.pipeline that normalizes the data and forms the connection to move the data to your destination. To get a better intuition of the different dlt concepts have a look at the docs.

As the next step, we set up the credentials for MongoDB. You can find detailed information on setting up the credentials in the MongoDB verified sources documentation.

We also need to set up the GCP service account credentials to get permissions to BigQuery. You can find detailed explanations on setting up the service account in the dlt docs under Destination Google BigQuery.

Once all the credentials are set add them to the secrets.toml file. Your file should look something like this:

# put your secret values and credentials here. do not share this file and do not push it to github
[sources.mongodb]
connection_url = "mongodb+srv://<user>:<password>@<cluster_name>.cvanypn.mongodb.net" # please set me up!
database = "sample_mflix"

[destination.bigquery]
location = "US"
[destination.bigquery.credentials]
project_id = "analytics" # please set me up!
private_key = "very secret can't show"
client_email = "<org_name>@analytics.iam.gserviceaccount.com" # please set me up!

The mongodb_pipeline.py at the root of your project directory is the script that runs the pipeline. It contains many functions that provide different ways of loading the data. The selection of the function depends on your specific use case, but for this demo, we try to keep it simple and use the load_entire_database function.

def load_entire_database(pipeline: Pipeline = None) -> LoadInfo:
    """Use the mongo source to completely load all collection in a database"""
    if pipeline is None:
        # Create a pipeline
        pipeline = dlt.pipeline(
            pipeline_name="local_mongo",
            destination='bigquery',
            dataset_name="mongo_database",
        )

    # By default the mongo source reflects all collections in the database
    source = mongodb()

    # Run the pipeline. For a large db this may take a while
    info = pipeline.run(source, write_disposition="replace")

    return info

Before we execute the pipeline script let's install the dependencies for the pipeline by executing the requirements.txt file.

pip install -r requirements.txt

Finally, we are ready to execute the script. In the main function uncomment the load_entire_database function call and run the script.

python mongodb_pipeline.py

If you followed the instructions correctly the pipeline should run successfully and the data should be loaded in Google BigQuery.

2. The result: Comparing MongoDB data with the data loaded in BigQuery.

To get a sense of what we accomplished let's examine what the unstructured data looked like in MongoDB against what is loaded in BigQuery. Below you can see the sample document in MongoDB.

{
    "_id": {
        "$oid": "573a1390f29313caabcd42e8"
    },
    "plot": "A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.",
    "genres": [
        "Short",
        "Western"
    ],
    "runtime": {
        "$numberInt": "11"
    },
    "cast": [
        "A.C. Abadie",
        "Gilbert M. 'Broncho Billy' Anderson",
        "George Barnes",
        "Justus D. Barnes"
    ],
    "poster": "https://m.media-amazon.com/images/M/MV5BMTU3NjE5NzYtYTYyNS00MDVmLWIwYjgtMmYwYWIxZDYyNzU2XkEyXkFqcGdeQXVyNzQzNzQxNzI@._V1_SY1000_SX677_AL_.jpg",
    "title": "The Great Train Robbery",
    "fullplot": "Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.",
    "languages": [
        "English"
    ],
    "released": {
        "$date": {
            "$numberLong": "-2085523200000"
        }
    },
    "directors": [
        "Edwin S. Porter"
    ],
    "rated": "TV-G",
    "awards": {
        "wins": {
            "$numberInt": "1"
        },
        "nominations": {
            "$numberInt": "0"
        },
        "text": "1 win."
    },
    "lastupdated": "2015-08-13 00:27:59.177000000",
    "year": {
        "$numberInt": "1903"
    },
    "imdb": {
        "rating": {
            "$numberDouble": "7.4"
        },
        "votes": {
            "$numberInt": "9847"
        },
        "id": {
            "$numberInt": "439"
        }
    },
    "countries": [
        "USA"
    ],
    "type": "movie",
    "tomatoes": {
        "viewer": {
            "rating": {
                "$numberDouble": "3.7"
            },
            "numReviews": {
                "$numberInt": "2559"
            },
            "meter": {
                "$numberInt": "75"
            }
        },
        "fresh": {
            "$numberInt": "6"
        },
        "critic": {
            "rating": {
                "$numberDouble": "7.6"
            },
            "numReviews": {
                "$numberInt": "6"
            },
            "meter": {
                "$numberInt": "100"
            }
        },
        "rotten": {
            "$numberInt": "0"
        },
        "lastUpdated": {
            "$date": {
                "$numberLong": "1439061370000"
            }
        }
    },
    "num_mflix_comments": {
        "$numberInt": "0"
    }
}

This is a typical way data is structured in a NoSQL database. The data is in a JSON-like format and contains nested data. Now, let's look at what is loaded in BigQuery. Below you can see the same data in BigQuery.

BigQuery Data Overview

The ddl (data definition language) for the movies table in BigQuery can be seen below:

CREATE TABLE `dlthub-analytics.mongo_database.movies`
(
  _id STRING NOT NULL,
  plot STRING,
  runtime INT64,
  poster STRING,
  title STRING,
  fullplot STRING,
  released TIMESTAMP,
  rated STRING,
  awards__wins INT64,
  awards__nominations INT64,
  awards__text STRING,
  lastupdated TIMESTAMP,
  year INT64,
  imdb__rating FLOAT64,
  imdb__votes INT64,
  imdb__id INT64,
  type STRING,
  tomatoes__viewer__rating FLOAT64,
  tomatoes__viewer__num_reviews INT64,
  tomatoes__viewer__meter INT64,
  tomatoes__fresh INT64,
  tomatoes__critic__rating FLOAT64,
  tomatoes__critic__num_reviews INT64,
  tomatoes__critic__meter INT64,
  tomatoes__rotten INT64,
  tomatoes__last_updated TIMESTAMP,
  num_mflix_comments INT64,
  _dlt_load_id STRING NOT NULL,
  _dlt_id STRING NOT NULL,
  tomatoes__dvd TIMESTAMP,
  tomatoes__website STRING,
  tomatoes__production STRING,
  tomatoes__consensus STRING,
  metacritic INT64,
  tomatoes__box_office STRING,
  imdb__rating__v_text STRING,
  imdb__votes__v_text STRING,
  year__v_text STRING
);

If you compare the ddl against the sample document in MongoDB you will notice that the nested arrays such as CAST are missing from the ddl in BigQuery. This is because of how dlt handles nested arrays. If we look at our database in BigQuery you can see the CAST is loaded as a separate table.

BigQuery Table Overview

dlt normalises nested data by populating them in separate tables and creates relationships between the tables, so they can be combined together using normal SQL joins. All this is taken care of by dlt and we need not worry about how transformations are handled. In short, the transformation steps we discussed in Why is dlt useful when you want to ingest data from a production database such as MongoDB? are taken care of by dlt, making the data analyst's life easier.

To better understand how dlt does this transformation, refer to the docs.

3. Self-service analytics for MongoDB with Holistics.

After dlt ingests the data into your data warehouse, you can connect Holistics to the data warehouse and model, govern, and set up your self-service analytics platform for end-user consumption.

By combining dlt with Holistics we get the best of both worlds. The flexibility of an open source library for data ingestion that we can customize based on changing data needs, and a self-service BI tool in Holistics that can not only be used for analytics but also introduces a data modeling layer where metrics and data logic can be defined. Holistics also has support for Git version control to track code changes and can integrate with dbt for streamlining data transformations.

We took care of the data ingestion step in the previous section. We can now connect to our SQL data warehouse, and start transforming the data using the modeling layer in Holistics. We will be using the newest version of Holistics, Holistics 4.0 for this purpose.

In Holistics, add a new data source click on the plus sign (+) on the top menu, and then select Connect Data Sources. Select New Data Sources and in the database type select Google BigQuery. We need to provide the service account credentials that were generated above when we connected dlt to BigQuery. For more detailed instructions on connecting BigQuery to Hollistics refer to this guide.

Once the BigQuery source is added we are ready to import the schemas from BigQuery into Holistics. The schema(dataset_name) name under which dlt loaded the MongoDB data is defined in the load_entire_database function when we create the MongoDB pipeline.

# Create a pipeline
pipeline = dlt.pipeline(
    pipeline_name="local_mongo",
    destination='bigquery',
    dataset_name="mongo_database", # Schema Name
)

4. Modeling the Data and Relationships with Holistics.

To use the data, we will define a data model and the join paths that Holistics can use to build the semantic datasets.

A data model is an abstract view on top of a physical database table that you may manipulate without directly affecting the underlying data. It allows you to store additional metadata that may enrich the underlying data in the data table.

In Holistics, go to the Modelling 4.0 section from the top bar. We will be greeted with the Start page as we have created no models or datasets. We will turn on the development mode from the top left corner. The development model will allow you to experiment with the data without affecting the production datasets and reporting. To keep things organized let’s create two folders called Models and Datasets.

Adding Holistics Data Model(s):

Under the Models folder, let's add the MongoDB data from BigQuery as Table Models. Hover over the Models folder and click on the (+) sign then select Add Table Model. In the Data Sources select the BigQuery Source we created before and then select the relevant table models to import into Holistics. In this case, we are importing the movies, movies_cast and movies_directors tables.

Holistics Add Model

Adding Holistics Dataset(s) and Relationships:

After the Data Models have been added, we can create a Dataset with these models and use them for reporting.

info

A Dataset is a "container" holding several Data Models together so they can be explored together, and dictating which join path to be used in a particular analytics use case.

Datasets works like a data marts, except that it exists only on the semantic layer. You publish these datasets to your business users to let them build dashboards, or explore existing data.

Hover over the Datasets folder, click on the (+) sign, and then select Add Datasets. Select the previously created Table Models under this dataset, and Create Dataset.

Holistics Create Dataset

We will then be asked to create relationships between the models. We create a Many-to-one (n - 1) relationship between the cast and the movies models.

Add Relationship between Models

The resulting relationship can seen As Code using the Holistics 4.0 Analytics as Code feature. To activate this feature click on the newly created dataset and select the View as Code option from the top right. For more detailed instructions on setting up relationships between models refer to the model relationship guide.

Previously, we created the relationship between the cast and the movies tables using GUI, now let’s add the relationship between the directors and movies tables using the Analytics as Code feature. In the dataset.aml file append the relationships block with the following line of code:

relationship(model__mongo_database_movies_directors.dlt_parent_id > model__mongo_database_movies.dlt_id, true)

After the change, the dataset.aml file should look like this:

import '../Models/mongo_database_movies.model.aml' {
  mongo_database_movies as model__mongo_database_movies
}
import '../Models/mongo_database_movies_cast.model.aml' {
  mongo_database_movies_cast as model__mongo_database_movies_cast
}
import '../Models/mongo_database_movies_directors.model.aml' {
  mongo_database_movies_directors as model__mongo_database_movies_directors
}

Dataset movies {
  label: 'Movies'
  description: ''
  data_source_name: 'bigquery_mongo'
  models: [
    model__mongo_database_movies,
    model__mongo_database_movies_cast,
    model__mongo_database_movies_directors
  ]
  relationships: [
    relationship(model__mongo_database_movies_cast.dlt_parent_id > model__mongo_database_movies.dlt_id, true),
    relationship(model__mongo_database_movies_directors.dlt_parent_id > model__mongo_database_movies.dlt_id, true)
    ]
  owner: 'zaeem@dlthub.com'
}

The corresponding view for the dataset.aml file in the GUI looks like this:

Add Relationship GUI

Once the relationships between the tables have been defined we are all set to create some visualizations. We can select the Preview option from next to the View as Code toggle to create some visualization in the development mode. This comes in handy if we have connected an external git repository to track our changes, this way we could test out the dataset in preview mode before committing and pushing changes, and deploying the dataset to production.

In the current scenario, we will just directly deploy the dataset to production as we have not integrated a Git Repository. For more information on connecting a Git Repository refer to the Holistics docs.

The Movies dataset should now be available in the Reporting section. We will create a simple visualization that shows the workload of the cast and directors. In simple words, How many movies did an actor or director work on in a single year?

Visualization and Self-Service Analytics with Holistics:

The visualization part is pretty self-explanatory and is mostly drag and drop as we took the time to define the relationships between the tables. Below we create a simple table in Holistics that shows the actors that have appeared in most movies since the year 2000.

Holistics Create Visualization

Similarly, we can add other reports and combine them into a dashboard. The resulting dashboard can be seen below:

Holistics Dashboard

Conclusion

In this blog, we have introduced a modern data stack that uses dlt and Holistics to address the MongoDB data accessibility issue.

We leverage dlt, to extract, normalize, create schemas, and load data into BigQuery, making it more structured and accessible. Additionally, Holistics provides the means to transform and model this data, adding relationships between various datasets, and ultimately enabling self-service analytics for the broader range of stakeholders in the organization.

This modern data stack offers an efficient and effective way to bridge the gap between MongoDB's unstructured data storage capabilities and the diverse needs of business, operations, and data science professionals, thereby unlocking the full potential of the data within MongoDB for the entire Company.

Additional Resources:

Want to discuss dlt? Join the dlt Slack Community
Check out our friends over at Holistics.
dlt MongoDB Source.
Holistics 4.0: Analytics as Code.
Holistics: Data Modelling.
Holistics: Model Relationship.
Holistics 4.0: Production vs. Development Mode.
Holistics 4.0: Git Version Control.
Holistics 4.0: dbt Integration.

Talk to your Zendesk tickets with Weaviate’s Verba and dlt: A Step by Step Guide

September 26, 2023 · 10 min read

Anton Burnashev

Software Engineer

tl;dr: In this blog post, we'll build a RAG chatbot for Zendesk Support data using Verba and dlt.

As businesses scale and the volume of internal knowledge grows, it becomes increasingly difficult for everyone in the company to find the right information at the right time.

With the latest advancements in large language models (LLMs) and vector databases, it's now possible to build a new class of tools that can help get insights from this data. One approach to do so is Retrieval-Augmented Generation (RAG). The idea behind RAGs is to retrieve relevant information from your database and use LLMs to generate a customised response to a question. Leveraging RAG enables the LLM to tailor its responses based on your proprietary data.

Diagram illustrating the process of internal business knowledge retrieval and augmented generation (RAG), involving components like Salesforce, Zendesk, Asana, Jira, Notion, Slack and HubSpot, to answer user queries and generate responses.

One such source of internal knowledge is help desk software. It contains a wealth of information about the company's customers and their interactions with the support team.

In this blog post, we'll guide you through the process of building a RAG application for Zendesk Support data, a popular help desk software. We’re going to use dlt, Weaviate, Verba and OpenAI.

dlt is an open-source Python library that simplifies the process of loading data from various sources. It does not requires extensive setup or maintenance and particularly useful for CRM data: highly tailored to the needs of the business and changes frequently.

Weaviate is an open-source, AI-native vector database that is redefining the foundation of AI-powered applications. With capabilities for vector, hybrid, and generative search over billions of data objects, Weaviate serves as the critical infrastructure for organizations building sophisticated AI solutions and exceptional user experiences.

Verba is an open-source chatbot powered by Weaviate. It's built on top of Weaviate's state-of-the-art Generative Search technology. Verba includes a web interface and a query engine that uses Weaviate database.

Prerequisites

A URL and an API key of a Weaviate instance. We're using the hosted version of Weaviate to store our data. Head over to the Weaviate Cloud Services and create a new cluster. You can create a free sandbox, but keep in mind your cluster will expire and your data will be deleted after 14 days. In the "Details" of your cluster you'll find the Cluster URL and the API key.
An OpenAI account and API key. Verba utilizes OpenAI's models to generate answers to user's questions and Weaviate uses them to vectorize text before storing it in the database. You can sign up for an account on OpenAI's website.
A Zendesk account and API credentials.

Let’s get started

Step 1. Set up Verba

Create a new folder for your project and install Verba:

mkdir verba-dlt-zendesk
cd verba-dlt-zendesk
python -m venv venv
source venv/bin/activate
pip install goldenverba

To configure Verba, we need to set the following environment variables:

VERBA_URL=https://your-cluster.weaviate.network # your Weaviate instance URL
VERBA_API_KEY=F8...i4WK # the API key of your Weaviate instance
OPENAI_API_KEY=sk-...R   # your OpenAI API key

You can put them in a .env file in the root of your project or export them in your shell.

Let's test that Verba is installed correctly:

verba start

You should see the following output:

INFO:     Uvicorn running on <http://0.0.0.0:8000> (Press CTRL+C to quit)
ℹ Setting up client
✔ Client connected to Weaviate Cluster
INFO:     Started server process [50252]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Now, open your browser and navigate to http://localhost:8000.

A user interface screenshot showing Verba, retrieval and augmented generation chatbot, powered by Weaviate

Great! Verba is up and running.

If you try to ask a question now, you'll get an error in return. That's because we haven't imported any data yet. We'll do that in the next steps.

Step 2. Install dlt with Zendesk source

We get our data from Zendesk using dlt. Let's install it along with the Weaviate extra:

pip install "dlt[weaviate]"

This also installs a handy CLI tool called dlt. It will help us initialize the Zendesk verified data source—a connector to Zendesk Support API.

Let's initialize the verified source:

dlt init zendesk weaviate

dlt init pulls the latest version of the connector from the verified source repository and creates a credentials file for it. The credentials file is called secrets.toml and it's located in the .dlt directory.

To make things easier, we'll use the email address and password authentication method for Zendesk API. Let's add our credentials to secrets.toml:

[sources.zendesk.credentials]
password = "your-password"
subdomain = "your-subdomain"
email = "your-email@example.com"

We also need to specify the URL and the API key of our Weaviate instance. Copy the credentials for the Weaviate instance you created earlier and add them to secrets.toml:

[destination.weaviate.credentials]
url = "https://your-cluster.weaviate.network"
api_key = "F8.....i4WK"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "sk-....."

All the components are now in place and configured. Let's set up a pipeline to import data from Zendesk.

Step 3. Set up a dlt pipeline

Open your favorite text editor and create a file called zendesk_verba.py. Add the following code to it:

import itertools

import dlt
from weaviate.util import generate_uuid5
from dlt.destinations.adapters import weaviate_adapter

from zendesk import zendesk_support

def to_verba_document(ticket):
    # The document id is the ticket id.
    # dlt will use this to generate a UUID for the document in Weaviate.
    return {
        "doc_id": ticket["id"],
        "doc_name": ticket["subject"],
        "doc_type": "Zendesk ticket",
        "doc_link": ticket["url"],
        "text": ticket["description"],
    }

def to_verba_chunk(ticket):
    # We link the chunk to the document by using the document (ticket) id.
    return {
        "chunk_id": 0,
                "doc_id": ticket["id"],
        "doc_name": ticket["subject"],
        "doc_type": "Zendesk ticket",
        "doc_uuid": generate_uuid5(ticket["id"], "Document"),
        "text": ticket["description"],
    }

def main():
    pipeline = dlt.pipeline(
        pipeline_name="zendesk_verba",
        destination="weaviate",
    )

    # Zendesk Support has data tickets, users, groups, etc.
    zendesk_source = zendesk_support(load_all=False)

    # Here we get a dlt resource containing only the tickets
    tickets = zendesk_source.tickets

    # Split the tickets into two streams
    tickets1, tickets2 = itertools.tee(tickets, 2)

    @dlt.resource(primary_key="doc_id", write_disposition="merge")
    def document():
        # Map over the tickets and convert them to Verba documents
        # primary_key is the field that will be used to generate
        # a UUID for the document in Weaviate.
        yield from weaviate_adapter(
            map(to_verba_document, tickets1),
            vectorize="text",
        )

    @dlt.resource(primary_key="doc_id", write_disposition="merge")
    def chunk():
        yield from weaviate_adapter(
            map(to_verba_chunk, tickets2),
            vectorize="text",
        )

    info = pipeline.run([document, chunk])

    return info

if __name__ == "__main__":
    load_info = main()
    print(load_info)

There's a lot going on here, so let's break it down.

First, in main() we create a dlt pipeline and add a Weaviate destination to it. We'll use it to store our data.

Next, we create a Zendesk Support source. It will fetch data from Zendesk Support API.

To match the data model of Zendesk Support to the internal data model of Verba, we need to convert Zendesk tickets to Verba documents and chunks. We do that by defining two functions: to_verba_document and to_verba_chunk. We also create two streams of tickets. We'll use them to create two dlt resources: document and chunk. These will populate the Document and Chunk classes in Verba. In both resources we instruct dlt which fields to vectorize using the weaviate_adapter() function.

We specify primary_key and write_disposition for both resources. primary_key is the field that will be used to generate a UUID for the document in Weaviate. write_disposition tells dlt how to handle duplicate documents. In our case, we want to merge them: if a document already exists in Weaviate, we want to update it with the new data.

Finally, we run the pipeline and print the load info.

Step 4. Load data into Verba

Let's run the pipeline:

python zendesk_verba.py

You should see the following output:

Pipeline zendesk_verba completed in 8.27 seconds
1 load package(s) were loaded to destination weaviate and into dataset None
The weaviate destination used <https://your-cluster.weaviate.network> location to store data
Load package 1695726495.383148 is LOADED and contains no failed jobs

Verba is now populated with data from Zendesk Support. However there are a couple of classes that need to be created in Verba: Cache and Suggestion. We can do that using the Verba CLI init command. When it runs it will ask us if we want to create Verba classes. Make sure to answer "n" to the question about the Document class — we don't want to overwrite it.

Run the following command:

verba init

You should see the following output:

===================== Creating Document and Chunk class =====================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
Document class already exists, do you want to overwrite it? (y/n): n
⚠ Skipped deleting Document and Chunk schema, nothing changed
ℹ Done

============================ Creating Cache class ============================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
✔ 'Cache' schema created
ℹ Done

========================= Creating Suggestion class =========================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
✔ 'Suggestion' schema created
ℹ Done

We're almost there! Let's start Verba:

verba start

Step 4. Ask Verba a question

Head back to http://localhost:8000 and ask Verba a question. For example, "What are common issues our users report?".

A user interface screenshot of Verba showing Zendesk tickets with different issues like API problems and update failures, with responses powered by Weaviate

As you can see, Verba is able to retrieve relevant information from Zendesk Support and generate an answer to our question. It also displays the list of relevant documents for the question. You can click on them to see the full text.

Conclusion

In this blog post, we've built a RAG application for Zendesk Support using Verba and dlt. We've learned:

How easy it is to get started with Verba.
How to build dlt pipeline with a ready-to-use data source.
How to customize the pipeline so it matches the data model of Verba.

Where to go next?

Ensure your data is up-to-date. With dlt deploy you can deploy your pipeline to Google's Cloud Composer or GitHub Actions and run it on a schedule.
Build a Verba RAG for other data sources. Interested in building a RAG that queries other internal business knowledge than Zendesk? With dlt you can easily switch your data source. Other dlt verified sources of internal business knowledge include Asana, Hubspot, Jira, Notion, Slack and Salesforce. However, dlt isn’t just about ready-to-use data sources; many of our users choose to implement their own custom data sources.
Learn more about how Weaviate works. Check out Zero to MVP course to learn more about Weaviate database and how to use it to build your own applications.
Request more features. A careful reader might have noticed that we used both Document and Chunk classes in Verba for the same type of data. For simplicity's sake, we assumed that the ticket data is small enough to fit into a single chunk. However, if you if you're dealing with larger documents, you might consider splitting them into chunks. Should we add chunking support to dlt? Or perhaps you have other feature suggestions? If so, please consider opening a feature request in the dlt repo to discuss your ideas!

Let's stay in touch

If you have any questions or feedback, please reach out to us on the dltHub Slack.

How to write a data engineering CV for Europe and America - A hiring manager’s perspective

September 20, 2023 · 6 min read

Adrian Brudaru

Open source data engineer

I recently reviewed someone’s anonymized CV on Reddit as they complained they applied to dozens of jobs but never heard back. Shortly after, my inbox blew up with review requests.

My experience comes from 10+ years in the industry, and having reviewed thousands of CVs myself, having hired for multiple companies. I also asked the opinion of other hiring managers who have hired data engineers.

Why European and US companies? Because, in my experience, other parts of the world work more in enterprises than startups or SMEs, and as a consequence they work with different tool sets. They often do “outsourcing” work, often on technologies used 5-20 years ago and less modern techs. Even when the tech used is new, the role is usually very limited in an enterprise and generally lacks the end-to-end capabilities requirement. For example, it might mean doing drag and drop tools + SQL stored procedures, or it might mean doing only data loading without architecture and generally without an orchestrator, etc.

So, let’s cut to the chase - what is a hiring manager looking for in an application?

Cover letter

Let’s start with the shortest bit - cover letter. Definitely write one. If you apply by email, put it in the email body. If not, attach it to the application.

What should it contain? A couple of paragraphs about

why this company you are applying for is interesting to you
your motivation for applying and what you hope to achieve here
why you would be a good fit for the role.

Keep it honest. Write about the role and your intent, do not make it about you.

A definite don’t is a generic letter that talks about you, but doesn’t mention the role or company. This shows a lack of investment in the application and gives wrong signals. So if you cannot review the company and say why you like it, then the reviewer might feel the same way about you.

The CV

Rules of thumb:

Space is your friend. It drives focus to what matters. Don’t fill it with junk.
“Soft skills” are important. Show the human behind the role, and have a bio section that should contain info about you.
Focus shows consideration to the reviewer. Tell me what’s relevant.
If you are junior, demonstrate potential. List your initiatives, learning, etc.
If you are senior, demonstrate results. Focus less on techs and more on outcomes.
Use a two-pager if you can. First page - everything else. Second page: your experience. If you do not have experience, stick to a one-pager. If you are in the US, you might be required to use a one-pager, in which case make your job descriptions shorter and more focused, dropping the non-relevant.

The parts

Bio: name, where you come from, your values.
- Image. You are a human talking to a human, don’t make it less, it doesn’t help. Be confident, friendly, smile.
- “Values”. Your values are a significant predictor of your “direction”, behavior and interactions. List the top 3, not more. No BS.
- “Motivation”. It indicates what is the fuel to your fire so the company knows if they can do that for you. One sentence.
- “Mission” or what you are looking to do, indicates if your goals may align with role goals. One or two sentences max.
Skills: The easiest thing to filter on, so list the relevant skills for the role you are applying. Do not list the entire list of something you have ever used or tried. This only creates noise and the impression that you don’t have any practiced skills. Mention the skills that are relevant first, and consider skipping the rest. For example: “Python, SQL, orchestrator, GitHub, others.” instead of “Java, Javascript, Ruby, Go, R, C++, SQL, Python, Tableau, Excel, Looker, BigQuery, Bash, Basic, Assembly, MsWord, Google Sheets, PowerBI, ” Make sure those skills are clearly displayed and optionally also mentioned in the listed jobs so people can judge how much you used them.
You can personalize the list based on the job you apply for. If skills are listed outside the job description, look at who else works or worked there, they might list the tools or skills on their LinkedIn.
If you are starting and need to gain the skills, make them happen. Take a course, learn them, and list them.
Job experience. Format:
💡 Time-Time Company, Role. At the company, I had a role that focused on delivering XYZ in a context. I worked with A, B, and C technologies. - I achieved X - I also built Y - I am pleased about having done Z
What are we presenting? The context (company, team), the achievements (X, Y, Z), the takeaways (A, B, C, X, Y, Z experience)
What are we not presenting? Things are irrelevant to the role we are applying for or our values. Less is more as it shifts focus to the important. Also, no vanity achievements or internal company awards, which are often offered in consultancies. These are just tools to motivate you, and nobody cares externally.
Education, studies, self-studies: Some groups care about the school. It is less critical, as the school does not teach real-life data engineering, but it will help you understand where you come from.
Projects, courses, volunteering: I care more about the demonstrated learning experience. I want to see that you keep learning. If not on the job, then I want to see courses or projects here. Volunteering also paints a picture of a “doer”.

The skills you need

This depends on the role, but most DE roles building standard platforms need the following:

Python
orchestrator (Airflow usually)
dimensional modeling and other data architectures
SQL
software best practices - versioning, cicd, testing, etc.

And the magic sauce…

… goes into the food, keep it out of the CV :) There’s no magic here, just clear communication.

Your three levers are

better communication
better skills
more applications

The rest is a matter of opportunities, depending on your location and the job market. If the opportunities are not available, consider entering the field via an adjacent role like an analytics engineer.

Screening CVs is a rapid process, so you need to ensure you check off the requirements while removing (not adding) any doubts. Any hire is a risk, so the screener wants to minimize that - help them do it by showing a good understanding of your role and what is expected of you. Don’t add info that isn’t needed, as that might add doubts.

I wish you good luck on your application process and encourage you to ask for feedback from the community if you get stuck.

Dumpster diving for data: The MongoDB experience

September 5, 2023 · 7 min read

Adrian Brudaru

Open source data engineer

info

💡 TIL: BSON stands for binary JSON, not for BS Object Notation /s

What will you learn in this article?

The scope of this article is to look at and discuss the process of extracting data from MongoDB and making it available in a SQL store. We will focus on the difficulties around ingesting the data into a SQL database, while we will look at MongoDB only from a source perspective.

The focus is the data in its different states, not the underlying technologies.

Why the harsh title?

The title may sound harsh, but it accurately reflects the challenges often encountered when dealing with MongoDB.

Also referred to as /dev/null, Mongo offers a flexible schema and document-based storage can lead to data inconsistencies and complexities, resembling a metaphorical "dumpster" where data might be scattered and difficult to organise.

Analytical consumers of MongoDB will be forced to invest extra effort in data modelling, schema design, and querying optimisation to ensure data quality and retrieval efficiency.

Is this a common problem?

It's inevitable. An analytical system has multiple requirements which force us to move data from places such as MongoDB to a SQL store.

Let’s look at those requirements:

Business User access: Most data is used by business users to improve operations. Business users access data via dashboards, or pivot-table like interfaces that enable them to do custom aggregations. The tooling that exists to do this is created for SQL, as a place where relational queries can be executed.
Ecosystem of integrations and tools: In analytics, having a diverse ecosystem of integrations and tools is crucial. SQL databases seamlessly fit into this ecosystem, offering compatibility with a wide array of data warehousing, data integration, and data governance tools. This comprehensive ecosystem enhances the analytics infrastructure, ensuring that data can be efficiently managed, transformed, and accessed by various stakeholders.
Standardization for consistency: Maintaining data consistency is paramount in analytics. SQL's widespread adoption and standardized query language enable analysts and data professionals to work with data consistently across different systems and platforms. This standardization ensures that data is interpreted and manipulated uniformly, reducing the risk of errors and discrepancies in analytical processes.
Data transformation & modelling capabilities: Effective data transformation and modelling are prerequisites for meaningful analytics. SQL provides a robust toolkit for these tasks, enabling data professionals to perform complex operations such as joins, filtering, aggregation, and intricate calculations. These capabilities are essential for preparing raw data into structured formats that can be readily used for in-depth analysis, reporting, and decision-making in the analytics domain.

So, after looking at what is needed for analytics, it becomes clear that going off the beaten path will lead to some pretty gnarly limitations and outcomes.

Mongo in particular: BSON vs JSON

How is Mongo different from semi-structure data like JSON, and is MongoDB particularly hard to ingest from?

Bson is for performance, json is for transmission.

The differences stem from the fact that MongoDB uses BSON under the hood, as opposed to JSON. BSON is a binary object notation optimised for performance, while JSON is a standard interchange format.

Similarly, Mongo also supports custom and more complex data types, such as geospatial, dates, regex, etc, that json does not. Additionally, BSON supports character encodings. All these benefits enable MongoDB to be a faster and better database, but the cost is additional hurdles that must be crossed before we can use this data elsewhere.

So how do you solve these issues? Well, hopefully your development team didn't go overboard, and you can just simply convert the BSON to JSON. If you are unlucky, you will need to create your own mappers that follow whatever your team did.

From JSON to DB

Once you have converted your mongo BSON into JSON, you are able to use its wide support to have it ingested.

JSON enjoys widespread support across various data processing tools and systems, making it a versatile choice for data ingestion. With your data in JSON, you can seamlessly integrate it into your database, leveraging its compatibility to efficiently manage and analyze your information.

Cleaning and typing

Data typing is essential in ensuring data integrity. It involves assigning appropriate data types to JSON fields, like converting numerical values into integers or floats, representing dates as datetime types, and labeling text data as string data types. This step guarantees that the database accurately stores and processes information.

Do we unpack?

The choice between unpacking nested JSON into tables or keeping it as JSON depends on your specific needs. Unpacking enhances query performance, indexing, and data manipulation within relational databases. However, native JSON support in some databases can suffice for simpler scenarios, preserving the original hierarchical structure. Your decision should align with data analysis, retrieval requirements, and your chosen database's capabilities.

Simply put, if you plan to use the data, you should probably unpack it to benefit from what relational dbs have to offer. But if you simply need to store and retrieve the json, do not convert it.

Querying unpacked data is cheaper and more robust than maintaining wet json_extract() code

Unpacking nested JSON into separate tables within a relational database is essential for robustness and query efficiency. Relational databases are optimized for tabular data and typed columns, making it challenging and error prone to handle complex nested structures directly.

By breaking down nested JSON into separate tables and establishing relationships through foreign keys, the data becomes more structured, ensuring robust data management and enhancing query efficiency. This simplification streamlines data retrieval and manipulation, aligning it with standard SQL operations for efficient and effective use.

Start using `dlt` to load Mongo to SQL today

To help with the challenges of loading Mongo data, we created a dlt source that reads your mongo collections and throws flat sql tables on the other side.

The benefit of using dlt is that you get flat tables in your sql database that adapt to match the Mongo schema.

Here's a code explanation of how it works under the hood:

It grabs data from Mongo and turns it into JSON.

From json, dlt leverages schema inference and evolution to make sense of the data. Here is an example of how this nested data could look:

 {
    "id": 1,
    "name": "Alice",
    "job": {
        "company": "ScaleVector",
        "title": "Data Scientist"
    },
    "children": [
        {
            "id": 1,
            "name": "Eve"
        },
        {
            "id": 2,
            "name": "Wendy"
        }
    ]
}

We can load the data to a supported destination declaratively:

import dlt

pipeline = dlt.pipeline(
    pipeline_name='from_json',
    destination='duckdb',
    dataset_name='mydata',
    dev_mode=True,
)
# dlt works with lists of dicts, so wrap data to the list
load_info = pipeline.run([data], table_name="json_data")
print(load_info)

Now we can use the data, these are two tables:
json_data
index id name job__company job__title _dlt_load_id _dlt_id
0 1 Alice ScaleVector Data Scientist 1693922245.602667 0ZbCzK7Ra2tWMQ
json_data__children
index id name _dlt_parent_id _dlt_list_idx _dlt_id
0 1 Eve 0ZbCzK7Ra2tWMQ 0 TjzpGZ+dwrrQhg
1 2 Wendy 0ZbCzK7Ra2tWMQ 1 RdqpN1luoKxQTA
Note that the original json got unpacked into tables that are now joinable via generated keys child._dlt_parent_id = parent._dlt_id.

index	id	name	job__company	job__title	_dlt_load_id	_dlt_id
0	1	Alice	ScaleVector	Data Scientist	1693922245.602667	0ZbCzK7Ra2tWMQ

index	id	name	_dlt_parent_id	_dlt_list_idx	_dlt_id
0	1	Eve	0ZbCzK7Ra2tWMQ	0	TjzpGZ+dwrrQhg
1	2	Wendy	0ZbCzK7Ra2tWMQ	1	RdqpN1luoKxQTA

Read more about it here: Mongo verified source.

What are you waiting for?

Dive into our Getting Started.
Join the ⭐Slack Community⭐ for discussion and help!

The return of ETL in the Python age

August 24, 2023 · 6 min read

Adrian Brudaru

Open source data engineer

info

PSSSST! You do ELT, right? not ETL? asking for a friend...

ETL vs ELT? A vendor driven story.

One of the earliest tooling for "ETL" data was Pentaho Kettle. Kettle stands for "Kettle Extraction Transformation Transport Load Environment" and signifies that it transforms the data before loading it. It was usually used to load data which was later transformed in SQL via "SQL scripts", while still in the tool, or via database triggers or views outside of the tool.

Indeed, the tool creators imagined some folks would write java to transform before loading, but the vast majority of data users just wanted to use SQL.

Sounds familiar? This is not so different to today's "ELT", is it?

Why did we call it ELT?

The people

Well, first of all SQL is much more accessible and very powerful for transforming tables, columns and rows - where programming handles single values. So before purpose built tooling existed, data people were already doing the transform in SQL - it just made sense.

The "EL" vendors

In the decade following Pentaho, Saas solutions started offering pipelines that load data into your database, removing the option for you to tinker with it before loading. For this reason, they would call it "ELT".

The db vendors

The concept also resonated with MPP DBs (massive parallel processing), such as Snowflake, Redshift, Bigquery, which were more than happy to encourage doing all the compute on their side.

The "T in ELT" vendors

Another puzzle piece was dbt, a tool purpose built for SQL transform. So if there's a question of ETL or ELT, dbt can only answer ELT. In dbt's word view, data starts dirty in your warehouse, where you "rename, cast, join, enrich" - a true ELT. To make the drudgery of data cleaning in SQL easier, dbt offers some python support to enable generating some of the typing and renaming SQL. They also offer a litte bit of python support for scalar operations in some db vendor systems.

What do we really do?

Most of us do a little bit of both - we extract with python, and the next steps are loading, cleaning and curation. In some cases, cleaning and curation are optional. For example, when we load a report from another platform we will probably not need to clean or curate anything.

Where do we clean data?

Data cleaning usually refers to normalising the data into correct types, usable names, etc. Doing this in SQL results in writing a lot of manual code that needs to be maintained. On the other hand, sturcturing data in python isn't easy either, it's just less technically difficult, but when metadata is missing, it becomes guesswork.

So, technically the easier place to clean data is in python, but likely the majority will do it in SQL as they are more practiced in SQL.

Where do we transform data?

When it comes to working with tables, SQL is still the better place to be. Joins and aggregations are the core operations that will happen here and they would be much harder to handle scalably in python.

`dlt` puts the small t back in EtlT, let's see how.

So, python is still superior at a few operations

Typing, renaming, normalising, unpacking
complex scalar operations

While we will leave the aggregations and joins to the big T, SQL.

Normalisation, typing, unpacking

dlt does this well out of the box. Automatic typing, renaming, flattening, and ddl deployment are all handled by the schema inference and evolution engine. This engine is configurable in both how it works and what it does, you can read more here: Normaliser, schema settings

Here is a usage example (it's built into the pipeline):

import dlt

# Json, dataframes, iterables, all good
# the data will be auto typed and normalised
data = [{'id': 1, 'name': 'John'}]

# open connection
pipe = dlt.pipeline(destination='bigquery',
                    dataset_name='raw_data')

# self-explanatory declarative interface
job_status = pipe.run(data,
                      write_disposition="merge",
                      primary_key="id",
                      table_name="users")

# optionally load schema and metadata
pipe.run([job_status],
         write_disposition="append",
         table_name="loading_status")

Scalar operations

Sometimes we need to edit a column's value in some very specific way for which SQL doesn't quite cut it. Sometimes, we have data we need to pseudonymise before loading for regulatory reasons.

Because dlt is a library, it means you can easily change how the data stream is produced or ingested. Besides your own customisations, dlt also supports injecting your transform code inside the event stream, see an example here

Here is a code example of pseudonymisation, a common case where data needs to be transformed before loading:

import dlt
import hashlib

@dlt.source
def dummy_source(prefix: str = None):
    @dlt.resource
    def dummy_data():
        for _ in range(3):
            yield {'id':_, 'name': f'Jane Washington {_}'}
    return dummy_data(),

def pseudonymize_name(doc):
    '''
    Pseudonmyisation is a deterministic type of PII-obscuring
    Its role is to allow identifying users by their hash,
    without revealing the underlying info.
    '''
    # add a constant salt to generate
    salt = 'WI@N57%zZrmk#88c'
    salted_string = doc['name'] + salt
    sh = hashlib.sha256()
    sh.update(salted_string.encode())
    hashed_string = sh.digest().hex()
    doc['name'] = hashed_string
    return doc


# 1. Create an instance of the source so you can edit it.
data_source = dummy_source()
# 2. Modify this source instance's resource
data_resource = data_source.dummy_data().add_map(pseudonymize_name)
# 3. Inspect your result
for row in data_resource:
    print(row)
#{'id': 0, 'name': '96259edb2b28b48bebce8278c550e99fbdc4a3fac8189e6b90f183ecff01c442'}
#{'id': 1, 'name': '92d3972b625cbd21f28782fb5c89552ce1aa09281892a2ab32aee8feeb3544a1'}
#{'id': 2, 'name': '443679926a7cff506a3b5d5d094dc7734861352b9e0791af5d39db5a7356d11a'}

pipeline = dlt.pipeline(pipeline_name='example', destination='bigquery', dataset_name='normalized_data')
load_info = pipeline.run(data_resource)

The big T

Finally, once you have clean data loaded, you will probably prefer to use SQL and one of the standard tools. dlt offers a dbt runner to get you started easily with your transformation package.

pipeline = dlt.pipeline(
    pipeline_name='pipedrive',
    destination='bigquery',
    dataset_name='pipedrive_dbt'
)

# make or restore venv for dbt, using latest dbt version
venv = dlt.dbt.get_venv(pipeline)

# get runner, optionally pass the venv
dbt = dlt.dbt.package(
    pipeline,
    "pipedrive/dbt_pipedrive/pipedrive", # or use public git "https://github.com/dbt-labs/jaffle_shop.git"
    venv=venv
)

# run the models and collect any info
# If running fails, the error will be raised with full stack trace
models = dbt.run_all()

#optionally log dbt status
pipeline.run([models],
         write_disposition="append",
         table_name="_models_log")

In conclusion

ETL vs ELT was never really a debate. With some exceptions almost everyone transforms the data in SQL - but what they call this process depends on who's telling the story.

While it's easier to do most of the transformation in SQL, the tedious is completely automatable in python, and the dirty data doesn't need manual normalisation. With dlt, you can do ETL or ELT, or even better, both, as EtLT

Or, if you're feeling funny, you can add duckdb in the middle and go full EtLTLT where you have an additional T step in the middle for the kinds of operations that could be done locally. And afterwards you could load to operational systems to add one more L to the name :)

Fundamentally, we all agree it's all ETL, with the flavors simply designating specific sub-types.

Start using `dlt` today

What are you waiting for?

Dive into our getting started docs
Join the ⭐Slack Community⭐ for discussion and help!

Trust your data! Column and row level lineages, an explainer and a recipe.

August 21, 2023 · 9 min read

Adrian Brudaru

Open source data engineer

info

TL;DR: By linking each load's metadata to the schema evolution event or schema version, we are able to assign the origin of a column to a specific load package, identified by source and time.

Row and Column Level Lineage with `dlt`

Load IDs

Load IDs are crucial in dlt and are present in all the top tables (_dlt_loads, load_id, etc.). Each pipeline run creates one or more load packages, which can be identified by their load_id. A load package typically contains data from all resources of a particular source. The load_id of a particular package is added to the top data tables and to the _dlt_loads table with a status 0 (when the load process is fully completed).

For more details, refer to the Load IDs section of the documentation.

Schema Versioning https://dlthub.com/

Each schema file in dlt contains a content-based hash version_hash that is used to detect manual changes to the schema (i.e., user edits content) and to detect if the destination database schema is synchronized with the file schema. Each time the schema is saved, the version hash is updated.

For more details, refer to the Schema content hash and version section of the documentation.

Data Lineage

Data lineage can be super relevant for architectures like the data vault architecture or when troubleshooting. Using the pipeline name and load_id provided out of the box by dlt, you are able to identify the source and time of data.

You can save complete lineage info for a particular load_id including a list of loaded files, error messages (if any), elapsed times, schema changes. This can be helpful, for example, when troubleshooting problems.

For more details, refer to the Data lineage section of the documentation.

By combining the use of load_id and schema versioning, you can achieve a robust system for row and column level lineage in your data pipelines with dlt.

Row and Column Level Lineage

Row Level Lineage

Row level lineage refers to the ability to track data from its source to its destination on a row-by-row basis. This means being able to identify exactly where each row of data in your destination came from, which can be crucial for data validation, debugging, and compliance purposes.

In dlt, each row in all (top level and child) data tables created by dlt contains a unique column named _dlt_id. Each child table contains a foreign key column _dlt_parent_id linking to a particular row (_dlt_id) of a parent table. This allows you to trace the lineage of each row back to its source.

For more details, refer to the Child and parent tables section of the documentation.

Column Level Lineage

Column level lineage refers to the ability to track how each column in your data has been transformed or manipulated from source to destination. This can be important for understanding how your data has been processed, ensuring data integrity, and validating data transformations.

In dlt, a column schema contains properties such as name, description, data_type, and is_variant, which provide information about the column and its transformations. The is_variant property, for example, tells you if a column was generated as a variant of another column.

For more details, refer to the Tables and columns section of the documentation.

By combining row and column level lineage, you can have an easy overview of where your data is coming from and when changes in its structure occur.

Identifying the lineage with dlt

After a pipeline run, the schema evolution info gets stored in the load info. Load it back to the database to persist the column lineage:

load_info = pipeline.run(data,
                          write_disposition="append",
                          table_name="users")

pipeline.run([load_info], write_disposition="append", table_name="loading_status")

Loading it back to the database will produce a few status tables.

Note the load id, which is a unix timestamp, identifying the origin of every new column. You can link it back to the load packages via the _load_id column.

Below, you can find some examples of what this info looks like - Note the _load_id column that identifies each load, and the metadata that comes with it:

Here is an example what load info contains in the column info of the metadata we just loaded (table load_info__load_packages__tables__columns):

nullable	partition	cluster	unique	sort	primary_key	foreign_key	root_key	merge_key	name	data_type	table_name	schema_name	load_id	_dlt_parent_id	_dlt_list_idx	_dlt_id
false	false	false	false	false	false	false	false	false	version	bigint	_dlt_pipeline_state	dlt_test_pipe	1692188651.466199	WBS2MJRkxEn2xw	0	4rQWa44uF2CKyg
false	false	false	false	false	false	false	false	false	engine_version	bigint	_dlt_pipeline_state	dlt_test_pipe	1692188651.466199	WBS2MJRkxEn2xw	1	zn5zR+PKyNqJLA
false	false	false	false	false	false	false	false	false	pipeline_name	text	_dlt_pipeline_state	dlt_test_pipe	1692188651.466199	WBS2MJRkxEn2xw	2	WV6DNovz7V1xBg
false	false	false	false	false	false	false	false	false	state	text	_dlt_pipeline_state	dlt_test_pipe	1692188651.466199	WBS2MJRkxEn2xw	3	77zsRk9Z5yhAwQ
false	false	false	false	false	false	false	false	false	created_at	timestamp	_dlt_pipeline_state	dlt_test_pipe	1692188651.466199	WBS2MJRkxEn2xw	4	Sj5/mL9tZGlHRQ
false	false	false	false	false	false	false	false	false	_dlt_load_id	text	_dlt_pipeline_state	dlt_test_pipe	1692188651.466199	WBS2MJRkxEn2xw	5	lvbvQFPbk9g0og
false	false	false	false	false	false	false	false	false	load_id	text	_dlt_loads	dlt_test_pipe	1692188651.466199	G0HvoQ6BMNzYsw	0	+IeGJE0Ln0wj+w
true	false	false	false	false	false	false	false	false	schema_name	text	_dlt_loads	dlt_test_pipe	1692188651.466199	G0HvoQ6BMNzYsw	1	oZ7hho/aLYJobg
false	false	false	false	false	false	false	false	false	status	bigint	_dlt_loads	dlt_test_pipe	1692188651.466199	G0HvoQ6BMNzYsw	2	QrZ3e79agHFNgg
false	false	false	false	false	false	false	false	false	inserted_at	timestamp	_dlt_loads	dlt_test_pipe	1692188651.466199	G0HvoQ6BMNzYsw	3	gm9kEFQuPXGwiA
true	false	false	false	false	false	false	false	false	schema_version_hash	text	_dlt_loads	dlt_test_pipe	1692188651.466199	G0HvoQ6BMNzYsw	4	4eX9BoFV5oegAg
true	false	false	false	false	false	false	false	false	name	text	people	dlt_test_pipe	1692188651.466199	q9DzfCYuMwDjkg	0	ISj8XUllnHB1gA
true	false	false	false	false	false	false	false	false	age	bigint	people	dlt_test_pipe	1692188651.466199	q9DzfCYuMwDjkg	1	4YDwm8PtjtEPwA
true	false	false	false	false	false	false	false	false	nationality	text	people	dlt_test_pipe	1692188651.466199	q9DzfCYuMwDjkg	2	LJTMxFWgqqyH/w
true	false	false	false	false	false	false	false	false	street	text	people	dlt_test_pipe	1692188651.466199	q9DzfCYuMwDjkg	3	AmzkMpDFikafIw
true	false	false	false	false	false	false	false	false	building	bigint	people	dlt_test_pipe	1692188651.466199	q9DzfCYuMwDjkg	4	GNw+E3FAuC9o5A
false	false	false	false	false	false	false	false	false	_dlt_load_id	text	people	dlt_test_pipe	1692188651.466199	q9DzfCYuMwDjkg	5	7hhoAuL9tZGlHR

Here is the information contained in the load_info table:

pipeline__pipeline_name	destination_name	destination_displayable_credentials	destination_fingerprint	dataset_name	started_at	first_run	_dlt_load_id	_dlt_id
dlt_test_pipe	bigquery	dlthub-loader@dlthub-analytics.iam.gserviceaccount.com@dlthub-analytics	kgecbRsVn7pCkgx5EVBi	people	2023-08-16 12:24:09.511922 UTC	true	1692188672.110346	PP1cT3rrwur2pw
dlt_test_pipe	bigquery	dlthub-loader@dlthub-analytics.iam.gserviceaccount.com@dlthub-analytics	kgecbRsVn7pCkgx5EVBi	people	2023-08-16 12:25:12.789753 UTC	false	1692188728.938733	WcBNyAKI3NdVzg

Conclusion

In conclusion, implementing row and column level lineage within data processing is crucial for maintaining data integrity, validation, and troubleshooting. The dlt framework offers a robust solution for achieving both forms of lineage, providing a comprehensive understanding of data transformations and origins.

Row level lineage: Utilizing unique identifiers like _dlt_id and _dlt_parent_id, dlt enables precise tracing of data from source to destination. This level of detail is essential for tasks like data validation, debugging, and compliance.
Column level lineage: By leveraging column schema properties such as name, data type, and is_variant, dlt reveals column transformations, offering insights into data manipulation throughout the pipeline.
Extend lineage into transformation: To maintain dlt lineage into transformations, log metadata at each transformation step, including transformation type, logic, and timestamps, while extending lineage columns to represent transformed data's lineage and relationships.

Combining row and column level lineage provides data professionals with a holistic view of data's journey, enhancing comprehension of its source, transformations, and changes. The lineage information stored in dlt facilitates effective troubleshooting, validation, and compliance checks, bolstering governance of data pipelines.

In summary, the integration of lineage through dlt empowers organizations to construct transparent and reliable data pipelines. This practice ensures data quality, cultivating a foundation for accurate and trustworthy data-driven decisions.

Start using dlt today

What are you waiting for?

Dive into our getting started docs
Join the slack community for discussion and help!

dlt-dbt-DuckDB-MotherDuck: My super simple and highly customizable approach to the Modern Data Stack in a box

August 14, 2023 · 10 min read

Rahul Joshi

Developer Relations at dltHub

info

TL;DR: I combined dlt, dbt, DuckDB, MotherDuck, and Metabase to create a Modern Data Stack in a box that makes it very easy to create a data pipeline from scratch and then deploy it to production.

I started working in dltHub in March 2023, right around the time when we released DuckDB as a destination for dlt. As a Python user, being able to create a data pipeline, load the data in my laptop, and explore and query the data all in python was awesome.

At the time I also came across this very cool article by Jacob Matson in which he talks about creating a Modern Data Stack(MDS) in a box with DuckDB. I was already fascinated with dlt and all the other new tools that I was discovering, so reading about this approach of combining different tools to execute an end-to-end proof of concept in your laptop was especially interesting.

Fast forward to a few weeks ago when dlt released MotherDuck as a destination. The first thing that I thought of was an approach to MDS in a box where you develop locally with DuckDB and deploy in the cloud with MotherDuck. I wanted to try it out.

What makes this awesome

In my example, I wanted to customize reports on top of Google Analytics 4 (GA4) and combine it with data from GitHub. This is usually challenging because, while exporting data from GA4 to BigQuery is simple, combining it with other sources and creating custom analytics on top of it can get pretty complicated.

By first pulling all the data from different sources into DuckDB files in my laptop, I was able to do my development and customization locally.

local-workflow

And then when I was ready to move to production, I was able to seamlessly switch from DuckDB to MotherDuck with almost no code re-writing!

production-workflow

Thus I got a super simple and highly customizable MDS in a box that is also close to company production setting.

What does this MDS in a box version look like?

Tool	Layer	Why it’s awesome
`dlt`	data ingestion	ridiculously easy to write a customized pipeline in Python to load from any source
DuckDB	data warehouse in your laptop	free, fast OLAP database on your local laptop that you can explore using SQL or python
MotherDuck	data warehouse in the cloud	DuckDB, but in cloud: fast, OLAP database that you can connect to your local duckdb file and share it with the team in company production settings
dbt	data transformation	an amazing open source tool to package your data transformations, and it also combines well with dlt, DuckDB, and Motherduck
Metabase	reporting	open source, has support for DuckDB, and looks prettier than my Python notebook

How this all works

The example that I chose was inspired by one of my existing workflows: that of understanding dlt-related metrics every month. Previously, I was using only Google BigQuery and Metabase to understand dlt’s product usage, but now I decided to test how a migration to DuckDB and MotherDuck would look like.

The idea is to build a dashboard to track metrics around how people are using and engaging with dlt on different platforms like GitHub (contributions, activity, stars etc.), dlt website and docs (number of website/docs visits etc.).

This is a perfect problem to test out my new super simple and highly customizable MDS in a box because it involves combining data from different sources (GitHub API, Google Analytics 4) and tracking them in a live analytics dashboard.

Loading the data using dlt
The advantage of using dlt for data ingestion is that dlt makes it very easy to create and customize data pipelines using just Python.
In this example, I created two data pipelines:
- BigQuery → DuckDB: Since all the Google Analytics 4 data is stored in BigQuery, I needed a pipeline that could load all events data from BigQuery into a local DuckDB instance. BigQuery does not exist as a verified source for dlt, which means that I had to write this pipeline from scratch.
- GitHub API → DuckDB:
  dlt has an existing GitHub source that loads data around reactions, PRs, comments, and issues. To also load data on stargazers, I had to modify the existing source.
dlt is simple and highly customizable:
- Even though Bigquery does not exist as a dlt source, dlt makes it simple to write a pipeline that uses Bigquery as a source. How this looks like:
  1. Create a dlt project:
    dlt init bigquery duckdb
    This creates a folder with the directory structure
    ├── .dlt │ ├── config.toml │ └── secrets.toml ├── bigquery.py └── requirements.txt
  2. Add BigQuery credentials inside .dlt/secrets.toml.
  3. Add a Python function inside bigquery.py that requests the data.
  4. Load the data by simply running python bigquery.py.
    See the accompanying repo for a detailed step-by-step on how this was done.
- The data in BigQuery is nested, which dlt automatically normalizes on loading.
  BigQuery might store data in nested structures which would need to be flattened before being loaded into the target database. This typically increases the challenge in writing data pipelines.
  dlt simplifies this process by automatically normalizing such nested data on load.
  
  Example of what the nested data in BigQuery looks like.
  
  dlt loads the main data into table ga_events, and creates another table ga_events__event_params for the nested data.
- The existing Github source does not load information on stargazers. dlt makes it easy to customize the Github source for this.
  The way the existing GitHub verified source is written, it only loads data on GitHub issues, reactions, comments, and pull requests. To configure it to also load data on stargazers, all I had to do was to add a python function for it in the pipeline script.
  See the accompanying repo for a detailed step-by-step on how this was done.
Using DuckDB as the data warehouse
DuckDB is open source, fast, and easy to use. It simplifies the process of validating the data after loading it with the data pipeline.
In this example, after running the BigQuery pipeline, the data was loaded into a locally created DuckDB file called ‘bigquery.duckdb’, and this allowed me to use python to the explore the loaded data:
The best thing about using DuckDB is that it provides a local testing and development environment. This means that you can quickly and without any additional costs test and validate your workflow before deploying it to production.
Also, being open source, it benefits from community contributions, particularly dbt-duckdb adapter and the DuckDB Metabase driver, which make it very useful in workflows like these.
dbt for data transformations
Because of dlt’s dbt runner and DuckDB’s dbt adapter, it was very easy to insert dbt into the existing workflow. What this looked like:
1. I first installed dbt along with the duckdb adapter using pip install dbt-duckdb .
2. I then created a dbt project inside the dlt project using dbt init and added any transforms as usual.
3. Finally, I added the dlt’s dbt runner to my python script, and this configured my pipeline to automatically transform the data after loading it. See the documentation for more information on the dbt runner.
Metabase for the dashboard
Metabase OSS has a DuckDB driver, which meant that I could simply point it to the DuckDB files in my system and build a dashboard on top of this data.
Going to production: Using MotherDuck as the destination
So far the process had been simple. The integrations among dlt, dbt, DuckDB, and Metabase made the loading, transformation, and visualization of data fairly straight-forward. But the data loaded into DuckDB existed only in my machine, and if I wanted share this data with my team, then I needed to move it to a different storage location accessible by them.
The best and the easiest way to do this was to use MotherDuck: a serverless cloud analytics platform built on top of DuckDB, where you can host your local DuckDB databases.
Why choose MotherDuck
1. Go from development to production with almost no code re-writing:
  This was my main reason for choosing MotherDuck. MotherDuck integrates with dlt, dbt, and Metabase just as well as DuckDB. And I was able to replace DuckDB with MotherDuck in my pipeline with almost no code re-writing!
  What this process looked like:
  1. First, I modified the dlt pipelines to load to MotherDuck instead of DuckDB as follows:
    1. I added credentials for MotherDuck inside .dlt/secrets.toml
    2. I made a minor update to the code: i.e. just by changing destination = "duckdb" to destination = "motherduck" the pipelines were already configured to load the data into MotherDuck instead of DuckDB
  2. With this change, I was already able to deploy my pipelines with GitHub actions.
  3. After deploying, I simply changed the DuckDB path to the MotherDuck path in Metabase, and then I deployed Metabase on GCP.
  The reason this is great is because it greatly simplifies the development lifecycle. Using DuckDB + MotherDuck, you can develop and test your pipeline locally and then move seamlessly to production.
2. Very easy to copy local DuckDB databases to MotherDuck
  This was especially useful in this demo. Google Analytics 4 events data is typically large and when fetching this data from BigQuery, you are billed for the requests.
  In this example, after I ran the BigQuery -> DuckDB pipeline during development, I wanted to avoid loading the same data again when deploying the pipeline. I was able to do this by copying the complete local DuckDB database to MotherDuck, and configuring the pipeline to only load new data from BigQuery.
3. Easy to share and collaborate
  Being able to share data with the team was the main goal behind moving from DuckDB to a cloud data warehouse. MotherDuck provides a centralized storage system for the DuckDB databases which you can share with your team, allowing them to access this data from their own local DuckDB databases.
  In my example, after I load the data to MotherDuck, I can provide access to my team just by clicking on ‘Share’ in the menu of their web UI.

Conclusion:

This was a fun and interesting exercise of creating a simple, yet powerful Modern Data Stack in a box. For me the biggest positives about this approach are:

Everything was happening on my laptop during the development giving me full control. Still going to production was seamless and I didn't need to change my code and data transformations at all.
I really liked that I could come with my ideas on what data I need and just write the pipelines in Python using dlt. I was not forced to pick from a small pull of existing data extractors. Both, customizing code contributed by others and writing my bigquery source from scratch, were fun and not going beyond Python and data engineering knowledge that I had.
I'm impressed by how simple and customizable my version of MDS is. dlt, DuckDB, and MotherDuck share similar philosophy of giving full power to the local user and and making it easy to interact with them in Python.

I repeat this entire process for the BigQuery pipeline in this video:

dlt AI Assistant provides answers you need!

June 26, 2023 · 9 min read

Tong Chen

Data Engineer Intern at dltHub

Hi there! 👋

Maintenance of data pipelines can often pose significant challenges for organizations. Ensuring the smooth operation of these pipelines, managing updates and changes, and addressing potential bottlenecks are just a few examples of the complexities involved. This is where dlt comes into play as a valuable solution, as dlt can efficiently alleviate these maintenance burdens and optimize data pipeline management.

In this article, we delve into the capabilities of our dlt AI Assistant, the conversational companion for the dltHub docs. Our aim was to explore how dlt can effectively address data pipeline maintenance challenges and streamline advanced dlt project processes. To gain insights, we posed several questions to our AI Assistant. Let's delve into the details and discover how it responded!

💡 How does `dlt` reduce maintenance ?

dlt reduces maintenance in several ways:
Automating structuring: dlt automates the transition from unstructured (nested, untyped) to structured (tabular, typed) data, which eliminates the need for manual data structuring and reduces the chances of errors.
Schema evolution: dlt can automatically adapt to changes in the source data schema, such as new fields or tables. This means that you don't have to manually update your pipelines when the source data changes. You can monitor schema evolution and get alerts on schema changes. Read more about monitoring schema evolution here.
Standardization: By providing a standardized way of loading data, dlt enables better collaboration and communication among team members. This reduces the need for individual maintenance efforts and ensures that everyone is working with the same understanding of the data structure.
Pre-built sources: dlt offers a collection of ready-built sources that can be easily integrated into your pipelines, reducing the need for custom development and maintenance. You can find existing sources here.
Customization: dlt allows for easy customization of pipelines, such as renaming columns, anonymizing data, and filtering. This means that you can adapt your pipelines to your specific needs without having to maintain complex custom code. See examples of customization here.

By automating structuring, handling schema evolution, standardizing data loading, providing pre-built sources, and allowing easy customization, dlt significantly reduces the maintenance effort required for your data pipelines.

💡 How does schema evolution help companies' bottom line?

Schema evolution helps companies' bottom line in several ways:
Reduced maintenance costs: By automatically adapting to changes in the source data schema, dlt reduces the need for manual updates to data pipelines. This saves time and resources, leading to lower maintenance costs.
Faster time to market: Schema evolution allows companies to quickly adapt to changes in their data, enabling them to make data-driven decisions faster. This can lead to quicker product iterations, improved customer experiences, and ultimately, increased revenue.
Improved data quality: By automating schema evolution, dlt helps ensure that data is consistently structured and typed, reducing the risk of errors and improving overall data quality. Better data quality leads to more accurate insights and better decision-making, which can positively impact the bottom line.
Increased collaboration and efficiency: Schema evolution promotes better collaboration and communication among team members, as everyone operates on a shared understanding of the data structure. This can lead to increased efficiency and productivity, ultimately benefiting the company's bottom line.
Enhanced data governance: By tracking schema changes and alerting stakeholders, dlt helps improve data governance. Better governance can lead to more responsible data usage, reduced risk of data breaches or compliance issues, and ultimately, a more stable and secure business environment.

Overall, schema evolution contributes to a company's bottom line by reducing maintenance costs, speeding up time to market, improving data quality, increasing collaboration and efficiency, and enhancing data governance.

💡 How can we use `dlt` to make maintenance more efficient?

Using dlt to make maintenance more efficient involves several steps:
Automate structuring: By automating the process of structuring data, dlt eliminates the need for manual data structuring, reducing the chances of errors and saving time. This makes maintenance more efficient as you don't have to spend time fixing structuring issues.
Handle schema evolution: dlt can automatically adapt to changes in the source data schema, such as new fields or tables. This means that you don't have to manually update your pipelines when the source data changes, making maintenance more efficient. You can monitor schema evolution and get alerts on schema changes. Read more about monitoring schema evolution here.
Standardize data loading: dlt provides a standardized way of loading data, which enables better collaboration and communication among team members. This reduces the need for individual maintenance efforts and ensures that everyone is working with the same understanding of the data structure.
Use pre-built sources: dlt offers a collection of ready-built sources that can be easily integrated into your pipelines, reducing the need for custom development and maintenance. You can find existing sources here.
Customize pipelines: dlt allows for easy customization of pipelines, such as renaming columns, anonymizing data, and filtering. This means that you can adapt your pipelines to your specific needs without having to maintain complex custom code. See examples of customization here.

By following these steps, you can leverage dlt to make your data pipeline maintenance more efficient, saving time and resources.

Now, let’s start with some advanced questions:

💡 Does `dlt` help deliver more data outcomes?

Yes, dlt helps deliver more data outcomes in several ways:
Faster time to build pipelines: By automating the structuring and loading of data, dlt significantly reduces the time required to build data pipelines. This allows you to focus more on using the data for insights and decision-making.
Improved data quality: dlt structures data and alerts schema changes, which leads to better data quality and governance. High-quality data enables more accurate insights and better decision-making, ultimately leading to more data-driven outcomes.
Better collaboration and communication: Structured data promotes better collaboration and communication among team members. Since everyone operates on a shared understanding of the data structure, it becomes easier to discuss and align on data-related topics, leading to more effective data outcomes.
Simplified access: dlt makes it easy for anyone to query structured data using their tools of choice. This enables a wider range of team members to access and analyze the data, leading to more data-driven insights and outcomes.
Integration with existing tools: dlt can be run in various environments, such as notebooks, cloud functions, or workflow managers like Airflow. This flexibility allows you to integrate dlt into your existing data stack, making it easier to deliver data outcomes without adding complexity.

By automating structuring, improving data quality, promoting collaboration, simplifying access, and integrating with existing tools, dlt helps deliver more data outcomes and drive better decision-making.

💡 Does `dlt` help for data governance?

Yes, dlt helps with data governance in several ways:

Structured data: By structuring data automatically, dlt ensures that data is consistently formatted and typed, which is essential for maintaining data quality and governance.
Schema evolution: dlt can automatically adapt to changes in the source data schema and alert stakeholders of these changes. This helps maintain data consistency and enables better governance. Read more about monitoring schema evolution here.
Metadata tracking: dlt tracks metadata like data lineage and schema descriptions, which helps you stay on top of your data and maintain better governance.
Customization: dlt allows for easy customization of pipelines, such as anonymizing data or renaming columns, which can help you comply with data privacy regulations and maintain data governance. See examples of customization here.
Integration with data governance tools: dlt can be integrated with other data governance tools, such as data validation libraries like Great Expectations, or data testing frameworks like dbt. This enables you to enforce data quality checks and maintain better governance across your data pipelines.

By structuring data, handling schema evolution, tracking metadata, allowing customization, and integrating with data governance tools, dlt helps improve data governance and ensure that your data is accurate, reliable, and secure.

💡 How can data engineers of different seniorities use `dlt`? do they build pipelines or platforms? what about complex pipelines?

Data engineers of different seniorities can use dlt in various ways, depending on their experience and the complexity of the pipelines or platforms they are building:

Analysts and junior data engineers: They can use ready-built sources or pass their unstructured data to dlt, which will create a sturdy pipeline. They can also adjust schemas and configure dlt to change loading modes or add performance hints. Examples include adding an existing source and adjusting a schema.
Python-first users: They can heavily customize how dlt sources produce data, as dlt supports selecting, filtering, renaming, anonymizing, and just about any custom operation. For example, see the rename columns example.
Senior data engineers: They can dig even deeper into customization options and change schemas, normalizers, the way pipelines run, such as parallelism, and build complex pipelines. They can also build data platforms by integrating dlt with other tools and services, such as orchestration tools like Apache Airflow, data validation libraries like Great Expectations, or data testing frameworks like dbt.

In summary, data engineers of different seniorities can use dlt to build pipelines or platforms, depending on their experience and the complexity of the project. dlt provides flexibility and customization options that cater to various skill levels and use cases, making it a valuable tool for data engineers across the board.

Let’s try trolling :)

💡 Can I use `dlt` for my beach house maintenance?

I'm sorry, but I am an AI assistant specifically designed to answer questions about the dlt (data load tool) library. I cannot provide information or assistance for beach house maintenance. If you have any questions about the dlt library, feel free to ask!

💡 Conclusion

During this Q&A converstation, we can understand how dlt helps reduce maintenance in data pipelines by automating structuring, handling schema evolution, standardizing data loading, providing pre-built sources, and allowing easy customization. It offers powerful capabilities to simplify and improve data pipeline management, reducing maintenance efforts, and driving better outcomes.

Now we understand how dlt significantly improves our work efficiency!

Want to ask your own questions to the dlt AI Assistant? Just click on the "Get Help" button located at the bottom right.

dlthelp

[ What's more? ]

Learn more about dlt 👉 here
Want to discuss more? Join our Slack community !
Try out with dlt demo 👉 Colab Demo !

dlt & openAPI code generation: A step beyond APIs and towards 10,000s of live datasets

June 21, 2023 · 3 min read

Matthaus Krzykowski

Co-Founder & CEO at dltHub

Today we are releasing a proof of concept of the dlt init extension that can generate dlt pipelines from an OpenAPI specification.

If you build APIs, for example with FastAPI, you can, thanks to the OpenAPI spec, automatically generate a python client and give it to your users. Our demo takes this a step further and enables you to generate advanced dlt pipelines that, in essence, convert your API into a live dataset.

You can see how Marcin generates such a pipeline from the OpenAPI spec using the Pokemon API in the Loom below.

Part of our vision is that each API will come with a dlt pipeline - similar to how these days often it comes with a python client. We believe that very often API users do not really want to deal with endpoints, http requests, and JSON responses. They need live, evolving datasets that they can place anywhere they want so that it's accessible to any workflow.

We believe that API builders will bundle dlt pipelines with their APIs only if such a process is hassle free. One answer to that is code generation and the reuse of information from the OpenAPI spec.

This release is a part of a bigger vision for dlt of a world centered around accessible data for modern data teams. In these new times code is becoming more disposable, but the data stays valuable. We eventually want to create an ecosystem where hundreds of thousands of pipelines will be created, shared, and deployed. Where datasets, reports, and analytics can be written and shared publicly and privately. Code generation is automation on steroids and we are going to be releasing many more features based on this principle.

Generating a pipeline for PokeAPI using OpenAPI spec

In the embedded loom you saw Marcin pull data from the dlt pipeline created from the OpenAPI spec. The proof of concept already uses a few tricks and heuristics to generate useful code. Contrary to what you may think, PokeAPI is a complex one with a lot of linked data types and endpoints!

It created a resource for all endpoints that return lists of objects.
It used heuristics to discover and extract lists wrapped in responses.
It generated dlt transformers from all endpoints that have a matching list resource (and return the same object type).
It used heuristics to find the right object id to pass to the transformer.
It allowed Marcin to select endpoints using the questionary lib in CLI.
It listed at the top the endpoints with the most central data types (think of tables that refer to several other tables).

As mentioned, the PoC was well tested with PokeAPI. We know it also works with many other - we just can’t guarantee that our tricks work in all cases as they were not extensively tested.

Anyway: Try it out yourself!

We plan to take this even further!

We will move this feature into dlt init and integrate with LLM code generation!
Restructuring of the python client: We will fully restructure the underlying python client. We'll compress all the files in the pokemon/api folder into a single, nice, and extendable client.
GPT-4 friendly: We'll allow easy addition of pagination and other injections into the client.
More heuristics: Many more heuristics to extract resources, their dependencies, infer the incremental and merge loading.
Tight integration with FastAPI on the code level to get even more heuristics!

Your feedback and help is greatly appreciated. Join our community, and let’s build together.

Use Case​

Let’s get started​

Step 1. Preparation​

Step 2. Initiate the pipeline​

Step 3. Set up your credentials​

Step 5: Define your queries​

Step 6: Run the pipeline!​

Step 7: Check the outcome in BigQuery​

Step 8: Deploy​

Outcome:

Conclusion:

This worked well:​

This did not work well:​

Where to go next?

An Overview of the MongoDB Modern Analytics Stack​

MongoDB is loved by CTOs, but its usage creates issues for stakeholders.​

How does our Modern data stack solve the MongoDB problem?​

Why is dlt useful when you want to ingest data from a production database such as MongoDB?​

What is useful about Holistics in this project?​

Code Walkthrough​

1. Setting up the dlt pipeline​

2. The result: Comparing MongoDB data with the data loaded in BigQuery.​

3. Self-service analytics for MongoDB with Holistics.​

4. Modeling the Data and Relationships with Holistics.​

Adding Holistics Data Model(s):​

Adding Holistics Dataset(s) and Relationships:​

Visualization and Self-Service Analytics with Holistics:​

Conclusion​

Additional Resources:​

Prerequisites​

Let’s get started​

Step 1. Set up Verba​

Step 2. Install dlt with Zendesk source​

Step 3. Set up a dlt pipeline​

Step 4. Load data into Verba​

Step 4. Ask Verba a question​

Conclusion​

Where to go next?​

Let's stay in touch​

Cover letter​

The CV​

The parts​

The skills you need​

And the magic sauce…​

What will you learn in this article?​

Why the harsh title?​

Is this a common problem?​

Mongo in particular: BSON vs JSON​

Bson is for performance, json is for transmission.​

From JSON to DB​

Cleaning and typing​

Do we unpack?​

Querying unpacked data is cheaper and more robust than maintaining wet json_extract() code​

Start using dlt to load Mongo to SQL today​

ETL vs ELT? A vendor driven story.​

Why did we call it ELT?​

The people​

The "EL" vendors​

The db vendors​

The "T in ELT" vendors​

What do we really do?​

Where do we clean data?​

Where do we transform data?​

dlt puts the small t back in EtlT, let's see how.

Normalisation, typing, unpacking​

Scalar operations​

The big T​

In conclusion​

Start using dlt today​

Row and Column Level Lineage with dlt​

Load IDs​

Schema Versioning https://dlthub.com/​

Data Lineage​

Row and Column Level Lineage​

Row Level Lineage​

Column Level Lineage​

Identifying the lineage with dlt​

Conclusion​

Start using dlt today​

What makes this awesome​

Use Case

Let’s get started

Step 1. Preparation

Step 2. Initiate the pipeline

Step 3. Set up your credentials

Step 5: Define your queries

Step 6: Run the pipeline!

Step 7: Check the outcome in BigQuery

Step 8: Deploy

This worked well:

This did not work well:

An Overview of the MongoDB Modern Analytics Stack

MongoDB is loved by CTOs, but its usage creates issues for stakeholders.

How does our Modern data stack solve the MongoDB problem?

Why is dlt useful when you want to ingest data from a production database such as MongoDB?

What is useful about Holistics in this project?

Code Walkthrough

1. Setting up the dlt pipeline

2. The result: Comparing MongoDB data with the data loaded in BigQuery.

3. Self-service analytics for MongoDB with Holistics.

4. Modeling the Data and Relationships with Holistics.

Adding Holistics Data Model(s):

Adding Holistics Dataset(s) and Relationships:

Visualization and Self-Service Analytics with Holistics:

Conclusion

Additional Resources:

Prerequisites

Let’s get started

Step 1. Set up Verba

Step 2. Install dlt with Zendesk source

Step 3. Set up a dlt pipeline

Step 4. Load data into Verba

Step 4. Ask Verba a question

Conclusion

Where to go next?

Let's stay in touch

Cover letter

The CV

The parts

The skills you need

And the magic sauce…

What will you learn in this article?

Why the harsh title?

Is this a common problem?

Mongo in particular: BSON vs JSON

Bson is for performance, json is for transmission.

From JSON to DB

Cleaning and typing

Do we unpack?

Querying unpacked data is cheaper and more robust than maintaining wet json_extract() code

Start using `dlt` to load Mongo to SQL today

ETL vs ELT? A vendor driven story.

Why did we call it ELT?

The people

The "EL" vendors

The db vendors

The "T in ELT" vendors

What do we really do?

Where do we clean data?

Where do we transform data?

`dlt` puts the small t back in EtlT, let's see how.

Normalisation, typing, unpacking

Scalar operations

The big T

In conclusion

Start using `dlt` today

Row and Column Level Lineage with `dlt`

Load IDs

Schema Versioning https://dlthub.com/

Data Lineage

Row and Column Level Lineage

Row Level Lineage

Column Level Lineage

Identifying the lineage with dlt

Conclusion

Start using dlt today

What makes this awesome

What does this MDS in a box version look like?

How this all works

💡 How does `dlt` reduce maintenance ?