
· 7 min read
Tong Chen

Hi there! 👋

In this article, we asked ChatGPT questions about dlt, the open source library. Questions included some basic introduction, random fun questions, and advanced dlt project processes. Let’s see how it answered our questions!

💡 Who can use dlt?

dlt is designed to be accessible and useful for various roles within a data team, including:

  1. Data professionals and aspiring data professionals who want to showcase their understanding and value to data teams. They can use dlt to build end-to-end projects, load data to databases, transform data, and prepare reporting. Read more in the Data Beginner guide.

  2. Data engineers who want to remove tedious work, reduce ETL maintenance, empower their team members, and enhance their productivity. dlt can be used to automate data loading, enable collaboration, and support natural workflows. Read more in the Data Engineer guide.

  3. Data analysts, data scientists, and analytics engineers who want to build and use pipelines for single use cases, core business pipelines, or heavily customized pipelines. dlt offers different levels of abstraction to cater to various needs and use cases. Read more in the Data Beginner guide.

Overall, dlt is suitable for anyone working with data pipelines, from beginners to experienced professionals.

💡 I am just starting in the data field, how can I use dlt?

If you are just starting in the data field, dlt can be a great tool to help you showcase your understanding and value to data teams. The first step is to get familiar with dlt by playing around with the Google Colab demo or installing it on your computer. Once you have dlt installed, you can initialize a project and pipeline to load data from an API, transform it, and load it into a database in minutes.
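
For a sense of what that first pipeline could look like, here is a minimal sketch; the GitHub endpoint and the table name are placeholders chosen for illustration, not part of any official tutorial:

import dlt
import requests

# Grab some data from a public API (placeholder endpoint).
issues = requests.get("https://api.github.com/repos/dlt-hub/dlt/issues").json()

# Create a pipeline that loads into a local DuckDB file.
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="github_data",
)

# dlt infers the schema, creates the tables, and loads the data.
load_info = pipeline.run(issues, table_name="issues")
print(load_info)

From there, the loaded tables can be queried with any SQL client or connected to a reporting tool.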

💡 How can data scientists use dlt in data science projects?

Data scientists can use dlt in data science projects to easily load and transform data from various sources. dlt offers different levels of abstraction that enable and empower data scientists to build and use pipelines.

For example, data scientists can use dlt to build pipelines for their specific use cases, such as grabbing data from an API that they need for their analysis. They can heavily customize how dlt sources produce data, such as selecting, filtering, renaming, anonymizing, and just about any custom operation.

In addition, data scientists can use dlt to create highly scalable, easy to maintain, straightforward to deploy data pipelines without having to wait for help from a data engineer. This empowers them to turn messy, unstructured data into regularly updated datasets and focus on their analysis and modeling work. Overall, dlt can help data scientists be more efficient and productive in their work by simplifying the data loading process and allowing them to focus on the analysis and modeling.
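
As an illustration of that kind of customization, here is a hedged sketch of filtering and anonymizing records before they are loaded; the resource, field names, and chained add_filter / add_map helpers follow dlt's documented patterns, but treat the details as an approximation:

import dlt

@dlt.resource(table_name="users")
def users():
    # Placeholder records standing in for an API response.
    yield {"id": 1, "name": "Alice", "email": "alice@example.com", "country": "DE"}
    yield {"id": 2, "name": "Bob", "email": "bob@example.com", "country": "US"}

def anonymize(record):
    # Mask PII and rename a column before loading.
    record["email"] = "***"
    record["full_name"] = record.pop("name")
    return record

pipeline = dlt.pipeline(pipeline_name="custom_users", destination="duckdb", dataset_name="crm")
pipeline.run(users().add_filter(lambda r: r["country"] == "DE").add_map(anonymize))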

Now, let’s start with some advanced questions:

💡 Is dlt like dbt? How to use dlt with dbt? Can you show me the process?

Yes, dlt and dbt have similar functionalities, but they differ in their approach and scope. dlt is a Python package that helps you load data from various sources and transform it to be ready for analysis. dbt, on the other hand, is a SQL transformation tool that allows you to structure and transform your data using SQL. However, you can use dlt with dbt to create a complete data pipeline from data extraction to transformation and loading. Here is an example process to use dlt with dbt:

To use dbt with dlt, you can follow these steps:

  1. Install dbt and dlt using pip.
  2. Create a dlt pipeline to extract, transform, and load data into your data warehouse. You can use the dlt SQL client to create tables, views, and other structures that can be used for analysis.
  3. Create a dbt project and configure it to use the same data warehouse as your dlt pipeline.
  4. Write SQL queries to transform the data in your dbt project.
  5. Run the dbt project to create the necessary structures and transformations in your data warehouse.
  6. Use a BI tool to visualize and analyze the transformed data.

Overall, using dlt with dbt can help you create a complete data pipeline that covers data extraction, transformation, and loading. This can help you streamline your data processes and make it easier to analyze your data.
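
To make the steps above more concrete, here is a rough sketch using dlt's dbt runner helper; the sample data and the project path are placeholders, and the exact helper API may differ between versions, so check the dlt docs before relying on it:

import dlt

# Steps 1-2: load raw data with dlt (records and names are placeholders).
pipeline = dlt.pipeline(pipeline_name="shop", destination="duckdb", dataset_name="raw")
pipeline.run([{"order_id": 1, "amount": 20}], table_name="orders")

# Steps 3-5: run a dbt project against the same destination.
dbt = dlt.dbt.package(pipeline, "dbt_project/")
models = dbt.run_all()
for model in models:
    print(model.model_name, model.status)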

💡 Can I use dlt with Streamlit? Or Sheets? Or Power BI? Metabase, Looker? What do you recommend?

Yes, dlt can be used with Streamlit, Google Sheets, Power BI, Metabase, and Looker. The choice of tool will depend on your specific use case and requirements. For example, if you want to enable business users to explore data and create their own dashboards, then Metabase or Looker might be a good choice.

If you want to build a custom dashboard with Python, then Streamlit might be a good choice. If you want to use a cloud-based BI tool, then Power BI might be a good choice. In terms of using dlt with these tools, dlt can be used to load data into a database or data warehouse that can then be connected to these tools. For example, you can use dlt to load data into a PostgreSQL database and then use that database as a data source for Power BI.

Overall, I would recommend using the tool that best fits your specific use case and requirements, and then using dlt to load and transform your data as needed.

💡 How can I use dlt to create a machine learning model?

dlt can be used to create machine learning models in the following way:

  1. Choose an API that produces data.
  2. Choose a use case for this data.
  3. Build a dlt pipeline that loads data from the API for your use case.
  4. Extract data from your pipeline.
  5. Create a machine learning model, depending on your use case.
  6. Finally, you can deploy your model with a Metabase dashboard.
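
As a rough sketch of steps 3-6 (the toy data, names, and the assumption that the local DuckDB file is named after the pipeline are all illustrative):

import dlt
import duckdb
from sklearn.linear_model import LogisticRegression

# Step 3: load data from your API into DuckDB (placeholder records).
pipeline = dlt.pipeline(pipeline_name="ml_demo", destination="duckdb", dataset_name="raw")
pipeline.run(
    [{"clicks": 10, "pages": 3, "converted": 1}, {"clicks": 1, "pages": 1, "converted": 0}],
    table_name="sessions",
)

# Step 4: extract the loaded table into a dataframe.
# Assumes the default local DuckDB file is named after the pipeline.
con = duckdb.connect("ml_demo.duckdb")
df = con.execute("SELECT clicks, pages, converted FROM raw.sessions").df()

# Step 5: train a simple model; step 6 would serve its output to a dashboard.
model = LogisticRegression().fit(df[["clicks", "pages"]], df["converted"])
print(model.predict(df[["clicks", "pages"]]))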

Let’s try trolling :)

💡 I am tired of lakehouses, can I build a beach house with dlt? or a forest house??

I am sorry, but dlt is a data loading tool, and it is not related to building beach houses. It is meant to help you automate the process of loading and managing data in your applications. If you have any questions related to data loading or management, I will be happy to help!

💡 If data is the new oil, then what is dlt? newer oil?

dlt is not a type of oil. It is a data loading tool that automates time-consuming tasks, empowers stakeholders to solve pipeline maintenance issues, and offers different levels of abstraction that enable and empower data professionals to build and use pipelines. Its purpose is to make data loading a commodity and to simplify the process of loading data for organizations.


💡 Conclusion

We set out to ask ChatGPT different questions about dlt and see if there were any misunderstandings. For example: how dlt can be integrated in various use cases, or how data teams can use dlt in different projects. It worked really well and answered our questions precisely, based on our documentation and blog! Moreover, when we tried to ask some random questions, ChatGPT also gave us proper answers! GPT really seems to understand what we were trying to communicate with it!

What questions would you love to ask? Share them with us in our Slack community! See you there 😊



· 6 min read
Adrian Brudaru

automated pipeline automaton

Why is there a data engineer shortage?

  1. High Demand and Rapid Growth: The increasing reliance on data-driven decision-making and the rise of big data technologies have created a surge in demand for skilled data engineers.
  2. Skill Gap and Specialization: Data engineering requires a unique blend of technical skills, and finding individuals with the right combination of programming, database management, and cloud computing expertise can be challenging.
  3. Competition from Other Data Roles: The allure of data science and other data-related roles has attracted professionals, leading to a talent shortage in the data engineering field.

How big is the data engineer shortage?

💡 "In Europe there are 32K data engineers and 48K open positions to hire one. In the US the ratio is 41K to 79K" Source: Linkedin data analysis blog post

Well that doesn’t look too bad - if only we could all be about 2x as efficient :)

Bridging the gap: How to make your data engineers 2x more efficient?

There are 2 ways to make the data engineers more efficient:

Option 1: Give them more to do, tell them how to do their jobs better!

For some reason, this doesn’t work out great. All the great minds of our generation told us we should be more like them:

  • do more architecture;
  • learn more tech;
  • use this new toy!
  • learn this paradigm.
  • take a step back and consider your career choices.
  • write more tests;
  • test the tests!
  • analyse the tests :[
  • write a paper about the tests...
  • do all that while alerts go off 24/7 and you are the bottleneck for everyone downstream, analysts and business people screaming. (┛ಠ_ಠ)┛彡┻━┻

“I can't do what ten people tell me to do. So I guess I'll remain the same”

  • Otis Redding, Sittin' On The Dock Of The Bay

Option 2: Take away unproductive work

A data engineer has a pretty limited task repertoire - so could we give some of their work to roles we can hire?

Let’s see what a data engineer does, according to GPT:

  • Data curation: Ensuring data quality, integrity, and consistency by performing data profiling, cleaning, transformation, and validation tasks.
  • Collaboration with analysts: Working closely with data analysts to understand their requirements, provide them with clean and structured data, and assist in data exploration and analysis.
  • Collaboration with DWH architects: Collaborating with data warehouse architects to design and optimize data models, schemas, and data pipelines for efficient data storage and retrieval.
  • Collaboration with governance managers: Partnering with governance managers to ensure compliance with data governance policies, standards, and regulations, including data privacy, security, and data lifecycle management.
  • Structuring and loading: Designing and developing data pipelines, ETL processes, and workflows to extract, transform, and load data from various sources into the target data structures.
  • Performance optimization: Identifying and implementing optimizations to enhance data processing and query performance, such as indexing, partitioning, and data caching.
  • Data documentation: Documenting data structures, data lineage, and metadata to facilitate understanding, collaboration, and data governance efforts.
  • Data troubleshooting: Investigating and resolving data-related issues, troubleshooting data anomalies, and providing support to resolve data-related incidents or problems.
  • Data collaboration and sharing: Facilitating data collaboration and sharing across teams, ensuring data accessibility, and promoting data-driven decision-making within the organization.
  • Continuous improvement: Staying updated with emerging technologies, industry trends, and best practices in data engineering, and actively seeking opportunities to improve data processes, quality, and efficiency.

Let’s get a back-of-the-napkin estimate of how much time they spend on those areas.

Here’s an approximation as offered by GPT. Of course, actual numbers depend on the maturity of your team and their unique challenges.

  • Collaboration with others (including data curation): Approximately 40-60% of their working hours. This includes tasks such as collaborating with team members, understanding requirements, data curation activities, participating in meetings, and coordinating data-related activities.
  • Data analysis: Around 10-30% of their working hours. This involves supporting data exploration, providing insights, and assisting analysts in understanding and extracting value from the data.
  • Technical problem-solving (structuring, maintenance, optimization): Roughly 30-50% of their working hours. This includes solving data structuring problems, maintaining existing data structures, optimizing data pipelines, troubleshooting technical issues, and continuously improving processes.

By looking at it this way, solutions become clear:

  • Let someone else do curation. Analysts could talk directly to producers. By removing the middleman, you also improve the speed and quality of the process.
  • Automate data structuring: While this is not as time consuming as the collaboration, it’s the second most time consuming process.
  • Let analysts explore structured data at the curation step, not before load. This is a minor optimisation, but 10-30% is still very significant towards our goal of reducing workload by 50%.

How much of their time could be saved?

ChatGPT thinks:

it is reasonable to expect significant time savings with the following estimates:

  1. Automation of Structuring and Maintenance: By automating the structuring and maintenance of data, data engineers can save 30-50% or more of their time previously spent on these tasks. This includes activities like schema evolution, data transformation, and pipeline optimization, which can be streamlined through automation.
  2. Analysts and Producers Handling Curation: Shifting the responsibility of data curation to analysts and producers can save an additional 10-30% of the data engineer's time. This includes tasks such as data cleaning, data validation, and data quality assurance, which can be effectively performed by individuals closer to the data and its context.

It's important to note that these estimates are approximate and can vary based on the specific circumstances and skill sets within the team.


💡 40-80% of a data engineer’s time could be spared

To achieve that,

  • Automate data structuring.
  • Govern the data without the data engineer.
  • Let analysts explore data as part of curation, instead of asking data engineers to do it.

This looks good enough for solving the talent shortage. Not only that, but doing things this way lets your team focus on what they do best.

A recipe to do it

  1. Use something with schema inference and evolution to load your data.
  2. Notify stakeholders and producers of data changes, so they can curate it.
  3. Don’t explore json with data engineers - let analysts explore structured data.

Ready to stop the pain? Read this explainer on how to do schema evolution with dlt. Want to discuss? Join our Slack.

· 5 min read
Tong Chen
info

💡Check out the accompanying colab demo: Google Colaboratory demo


Hi there! 👋 In this article, I will show you a demo on how to train ChatGPT with the open-source dlt repository. Here is the article structure, and you can jump directly to the part that interests you. Let's get started!

I. Introduction

II. Walkthrough

III. Result

IV. Summary

I. Introduction

Navigating an open-source repository can be overwhelming because comprehending the intricate labyrinths of code is always a significant problem. As a person who just entered the IT industry, I found an easy way to address this problem with an ELT tool called dlt (data load tool) - the Python library for loading data.

In this article, I would love to share a use case - training GPT with an Open-Source dlt Repository by using the dlt library. In this way, I can write prompts about dlt and get my personalized answers.

II. Walkthrough

The code provided below demonstrates training a chat-oriented GPT model using the dlt-hub repositories (dlt and pipelines). To train the GPT model, we utilized the assistance of two services: Langchain and Deeplake. In order to use these services for our project, you will need to create an account on both platforms and obtain the access tokens. The good news is that both services offer cost-effective options: OpenAI provides a $5 credit to test their API, while Deeplake offers a free tier.

The credit for the code goes to Langchain, which has been duly acknowledged at the end.

1. Run the following commands to install the necessary modules on your system.

python -m pip install --upgrade langchain deeplake openai tiktoken

# Create accounts on platform.openai.com and app.activeloop.ai. After registering,
# retrieve the access tokens for both platforms and enter them when prompted in the next step.

import os
import getpass

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')
embeddings = OpenAIEmbeddings(disallowed_special=())

2. Create a directory to store the code for training the model. Clone the desired repositories into that.

# making a new directory named dlt-repo
!mkdir dlt-repo
# changing the directory to dlt-repo
%cd dlt-repo
# cloning git repos into the dlt-repo directory
# dlt code base
!git clone https://github.com/dlt-hub/dlt.git
# example pipelines to help you get started
!git clone https://github.com/dlt-hub/pipelines.git
# going back to previous directory
%cd ..

3. Load the files from the directory

import os
from langchain.document_loaders import TextLoader

root_dir = './dlt-repo'  # directory to load data from
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e:
            pass


4. Splitting files into chunks

# This code uses CharacterTextSplitter to split documents into smaller chunks based on character count and stores the resulting chunks in the texts variable.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

5. Create a Deeplake dataset

# Set up your Deeplake dataset by replacing the username with your Deeplake account name and setting the dataset name.
# For example, if the username is "your_name" and the dataset is "dlt-hub-dataset", the path would be "hub://your_name/dlt-hub-dataset".

username = "your_deeplake_username" # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/dlt_gpt", embedding_function=embeddings, public=True) #dataset would be publicly available
db.add_documents(texts)

# Re-open the same Deeplake dataset in read-only mode, with the same embeddings.
db = DeepLake(dataset_path=f"hub://{username}/dlt_gpt", read_only=True, embedding_function=embeddings)

# Create a retriever
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10

6. Initialize the GPT model

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name='gpt-3.5-turbo')
qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever)
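
With the chain in place, you can ask it questions. A minimal usage sketch, following the standard ConversationalRetrievalChain call pattern:

chat_history = []
question = "Why should data teams use dlt?"
result = qa({"question": question, "chat_history": chat_history})
print(result["answer"])
chat_history.append((question, result["answer"]))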

III. Result

After the walkthrough, we can start to experiment with different questions, and it will output answers based on what it learned from the dlt-hub repositories.

Here, I asked: "Why should data teams use dlt?"

It outputted:

  1. It works seamlessly with Airflow and other workflow managers, making it easy to modify and maintain your code.
  2. You have complete control over your data. You can rename, filter, and modify it however you want before it reaches its destination.

Next, I asked: "Who is dlt for?"

It outputted:

  1. dlt is meant to be accessible to every person on the data team, including data engineers, analysts, data scientists, and other stakeholders involved in data loading. It is designed to reduce knowledge requirements and enable collaborative working between engineers and analysts.

IV. Summary

It worked! We can see how GPT can learn about an open source library by using dlt and utilizing the assistance of Langchain and Deeplake. Moreover, by simply following the steps above, you can customize the GPT model training to your own needs.

Curious? Give the Colab demo💡 a try or share your questions with us, and we'll have ChatGPT address them in our upcoming article.


[ What's more? ]

  • Learn more about dlt 👉 here
  • Need help or want to discuss? Join our Slack community ! See you there 😊

· 6 min read
Adrian Brudaru

Schema evolution combines a technical process with a curation process, so let's understand the process, and where the technical automation needs to be combined with human curation.

Whether you are aware or not, you are always getting structured data for usage

Data used is always structured, but usually produced unstructured.

Structuring it implicitly during reading is called "schema on read", while structuring it upfront is called "schema on write".

To fit unstructured data into a structured database, developers have to perform this transition before loading. For data lake users who read unstructured data, their pipelines apply a schema during read - if this schema is violated, the downstream software will produce bad outcomes.

We tried running away from our problems, but it didn't work.

Because structuring data is difficult to deal with, people have tried to not do it. But this created its own issues.

  • Loading json into db without typing or structuring - This anti-pattern was created to shift the structuring of data to the analyst. While this is a good move for curation, the db support for structuring data is minimal and unsafe. In practice, this translates to the analyst spending their time writing lots of untested parsing code and pushing silent bugs to production.
  • Loading unstructured data to lakes - This pattern pushes the curation of data to the analyst. The problem here is similar to the one above. Unstructured data is hard to analyse and curate, and the farther it is from the producer, the harder it is to understand.

So no, one way or another we are using schemas.

If curation is hard, how can we make it easier?

  • Make data easier to discover, analyze, explore. Structuring upfront would do that.
  • Simplify the human process by decentralizing data ownership and curation - the analyst can work directly with the producer to define the dataset produced.

Structuring & curating data are two separate problems. Together they are more than the sum of the parts.

The problem is that curating data is hard.

  • Typing and normalising data are technical processes.
  • Curating data is a business process.

Here's what a pipeline building process looks like:

  1. Speak with the producer to understand what the data is. Chances are the producer does not document it and there will be many cases that need to be validated analytically.
  2. Speak with the analyst or stakeholder to get their requirements. Guess which fields fulfill their requirements.
  3. Combine the 2 pieces of info to filter and structure the data so it can be loaded.
  4. Type the data (for example, convert strings to datetime).
  5. Load the data to warehouse. Analyst can now validate if this was the desired data with the correct assumptions.
  6. Analyst validates with stakeholder that this is the data they wanted. Stakeholder usually wants more.
  7. Possibly adjust the data filtering, normalization.
  8. Repeat entire process for each adjustment.

And when something changes,

  1. The data engineer sees something break.
  2. They ask the producer about it.
  3. They notify the analyst about it.
  4. The analyst notifies the business that data will stop flowing until adjustments.
  5. The analyst discusses with the stakeholder to get any updated requirements.
  6. The analyst offers the requirements to the data engineer.
  7. The data engineer checks with the producer/data how the new data should be loaded.
  8. Data engineer loads the new data.
  9. The analyst can now adjust their scripts, re-run them, and offer data to stakeholder.

Divide et impera! The two problems are technical and communicational, so let's let computers solve tech and let humans solve communication.

Before we start solving, let's understand the problem:

  1. For usage, data needs to be structured.
  2. Because structuring is hard, we try to reduce the amount we do by curating first or deferring to the analyst by loading unstructured data.
  3. Now we are trying to solve two problems at once: structuring and curation, with each role functioning as a bottleneck for the other.

So let's de-couple these two problems and solve them appropriately:

  • The technical issue is that unstructured data needs to be structured.
  • The curation issue relates to communication - so taking the engineer out of the loop would make this easier.

Automate the tech: Structuring, typing, normalizing

The only reason to keep data unstructured was the difficulty of applying structure.

By automating schema inference, evolution, normalization, and typing, we can just load our jsons into structured data stores and curate them in a separate step.
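
For instance, here is a hedged sketch of what "just load the json" looks like with dlt; the nested records below are invented for illustration:

import dlt

events = [
    {"id": 1, "user": {"name": "Ana", "geo": {"city": "Berlin"}}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Ben"}, "new_field": "appears later"},
]

pipeline = dlt.pipeline(pipeline_name="structured_lake", destination="duckdb", dataset_name="staging")
# Nested dicts become columns, lists become child tables, and new fields
# are added to the schema automatically on subsequent runs.
load_info = pipeline.run(events, table_name="events")
print(load_info)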

Alert the communicators: When there is new data, alert the producer and the curator.

To govern how data is produced and used, we need to have a definition of the data that the producer and consumer can both refer to. This has typically been tackled with data contracts - a type of technical test that would notify the producer and consumer of violations.

So how would a data contract work?

  1. Human process:
    1. Humans define a data schema.
    2. Humans write a test to check if data conforms to the schema.
    3. Humans implement notifications for test fails.
  2. Technical process:
    1. Data is extracted.
    2. Data is staged to somewhere where it can be tested.
    3. Data is tested:
      1. If the test fails, we notify the producer and the curator.
      2. If the test succeeds, it gets transformed to the curated form.

So how would we do schema evolution with dlt?

  1. Data is extracted, dlt infers schema and can compare it to the previous schema.
  2. Data is loaded to a structured data lake (staging area).
  3. Destination schema is compared to the new incoming schema.
    1. If there are changes, we notify the producer and curator.
    2. If there are no changes, we carry on with transforming it to the curated form.

So, schema evolution is essentially a simpler way to do a contract on schemas. If you had additional business-logic tests, you would still need to implement them in a custom way.

The implementation recipe

  1. Use dlt. It will automatically infer and version schemas, so you can simply check if there are changes. You can just use the normaliser + loader or build extraction with dlt. If you want to define additional constraints, you can do so in the schema.
  2. Define your Slack hook or create your own notification function. Make sure the Slack channel contains the data producer and any stakeholders.
  3. Capture the load job info and send it to the hook.
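
A minimal sketch of steps 2-3, assuming a standard Slack incoming webhook; the webhook URL and the sample data are placeholders:

import dlt
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # your incoming webhook

data = [{"id": 1, "status": "new"}]  # placeholder for your extracted data
pipeline = dlt.pipeline(pipeline_name="governed_load", destination="duckdb", dataset_name="raw")
load_info = pipeline.run(data, table_name="events")

# Step 3: send a human-readable summary of the load job to the channel
# where the producer and stakeholders can see it.
requests.post(SLACK_WEBHOOK_URL, json={"text": str(load_info)})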

· 4 min read
Rahul Joshi

Why we need a simple Google Sheets -> data warehouse pipeline

Spreadsheets are great. They are really simple to use and offer a lot of functionality to query, explore, manipulate, import/export data. Their wide availability and ease of sharing also make them great tools for collaboration. But they have limitations and cannot be used for storage and processing of large-scale complex data. Most organizational data is actually stored in data warehouses and not spreadsheets.

However, because of the easy set up and intuitive workflow, Google Sheets are still used by many people to track and analyze smaller datasets. But even this data often needs to be combined with the rest of the organizational data in the data warehouse for reasons like analytics, reporting etc. This is not a problem when the dataset is small and static and just needs to be exported once to the data warehouse. In most cases, however, the Google Sheets data is not static and is updated regularly, thus creating a need for an ETL pipeline, and thereby complicating an otherwise simple and intuitive workflow.

Since dlt has a Google Sheets pipeline that is very easy to set up and deploy, we decided to write a blog post to demonstrate how some very common use cases of Google Sheets can be enhanced by inserting this dlt pipeline into the process.

Use-case #1: Google sheets pipeline for measuring marketing campaign ROI

As an example of such a use-case, consider this very common scenario: You're the marketing team of a company that regularly launches social media campaigns. You track some of the information such as campaign costs in Google Sheets, whereas all of the other related data such as views, sign-ups, clicks, conversions, revenue etc. is stored in the marketing data warehouse. To optimize your marketing strategy, you decide to build a dashboard to measure the ROI for the campaigns across different channels. Hence, you would like to have all your data in one place to easily be able to connect your reporting tool to it.

To demonstrate this process, we created some sample data where we stored costs related to some campaigns in a Google Sheet and the rest of the related data in BigQuery.


We then used the dlt google sheets pipeline by following these simple steps to load the Google Sheets data into BigQuery.
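
If you want to reproduce this, the setup is roughly the following; the generated script name may differ, and the Google API and BigQuery credentials go into .dlt/secrets.toml:

# add the verified google_sheets source to your project, with BigQuery as the destination
dlt init google_sheets bigquery
# after filling in credentials in .dlt/secrets.toml, run the generated pipeline script
python google_sheets_pipeline.py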

With the data loaded, we finally connected Metabase to the data warehouse and created a dashboard to understand the ROIs across each platform.

Use-case #2: Evaluating the performance of your ML product using google sheets pipeline

Another use-case for Google Sheets that we've come across frequently is to store annotated training data for building machine learning (ML) products. This process usually involves a human first manually doing the annotation and creating the training set in Google Sheets. Once there is sufficient data, the next step is to train and deploy the ML model. After the ML model is ready and deployed, the final step would be to create a workflow to measure its performance. Depending on the data and product, this might involve combining the manually annotated Google Sheets data with the product usage data that is typically stored in some data warehouse.

A very common example of such a workflow is with customer support platforms that use text classification models to categorize incoming customer support tickets into different issue categories for efficient routing and resolution of the tickets. To illustrate this example, we created a Google Sheet with issues manually annotated with a category. We also included other manually annotated features that might help measure the effectiveness of the platform, such as priority level for the tickets and customer feedback.


We then populated a BigQuery dataset with potential product usage data, such as: the status of the ticket (open or closed), response and resolution times, whether the ticket was escalated, etc.

Then, as before, we loaded the google sheets data to the data warehouse using the dlt google sheets pipeline and following these steps.

Finally we connected Metabase to it and built a dashboard measuring the performance of the model over the period of a month:

customer-support-platform-dashboard

· 7 min read
Adrian Brudaru
info

Google Colaboratory demo

This Colab demo was built and shown by our working student Rahul Joshi for the Berlin Data meetup, where he talked about the state of schema evolution in open source.

What is schema evolution?

In the fast-paced world of data, the only constant is change, and it usually comes unannounced.

Schema on read

Schema on read means your data does not have a schema, but your consumer expects one. So when they read, they define the schema, and if the unstructured data does not have the same schema, issues happen.

Schema on write

So, to avoid things breaking at runtime, you would want to define a schema upfront - hence you would structure the data. The problem with structuring data is that it’s a labor-intensive process that makes people take the pragmatic shortcut of structuring only some data, which later leads to lots of maintenance.

Schema evolution means that a schema is automatically generated on write for the data, and automatically adjusted for any changes in the data, enabling a robust and clean environment downstream. It’s an automatic data structuring process that is aimed at saving time during creation, maintenance, and recovery.

Why do schema evolution?

One way or another, produced raw unstructured data becomes structured during usage. So, which paradigm should we use around structuring?

Let’s look at the 3 existing paradigms, their complexities, and what a better solution could look like.

The old ways

The data warehouse paradigm: Curating unstructured data upfront

Traditionally, many organizations have adopted a 'curate first' approach to data management, particularly when dealing with unstructured data.

The desired outcome is that by curating the data upfront, we can directly extract value from it later. However, this approach has several pitfalls.

Why curating unstructured data first is a bad idea

  1. It's labor-intensive: Unstructured data is inherently messy and complex. Curating it requires significant manual effort, which is time-consuming and error-prone.
  2. It's difficult to scale: As the volume of unstructured data grows, the task of curating it becomes increasingly overwhelming. It's simply not feasible to keep up with the onslaught of new data. For example, Data Mesh paradigm tries to address this.
  3. It delays value extraction: By focusing on upfront curation, organizations often delay the point at which they can start extracting value from their data. Valuable insights are often time-sensitive, and any delay could mean missed opportunities.
  4. It assumes we know what the stakeholders will need: Curating data requires us to make assumptions about what data will be useful and how it should be structured. These assumptions might be wrong, leading to wasted effort or even loss of valuable information.

The data lake paradigm: Schema-on-read with unstructured data

In an attempt to bypass upfront data structuring and curation, some organizations adopt a schema-on-read approach, especially when dealing with data lakes. While this offers flexibility, it comes with its share of issues:

  1. Inconsistency and quality issues: As there is no enforced structure or standard when data is ingested into the data lake, the data can be inconsistent and of varying quality. This could lead to inaccurate analysis and unreliable insights.
  2. Complexity and performance costs: Schema-on-read pushes the cost of data processing to the read stage. Every time someone queries the data, they must parse through the unstructured data and apply the schema. This adds complexity and may impact performance, especially with large datasets.
  3. Data literacy and skill gap: With schema-on-read, each user is responsible for understanding the data structure and using it correctly, which is unreasonable to expect with undocumented unstructured data.
  4. Lack of governance: Without a defined structure, data governance can be a challenge. It's difficult to apply data quality, data privacy, or data lifecycle policies consistently.

The hybrid approach: The lakehouse

  • The data lakehouse uses the data lake as a staging area for creating a warehouse-like structured data store.
  • This does not solve any of the previous issues with the two paradigms, but rather allows users to choose which one they apply on a case-by-case basis.

The new way

The current solution : Structured data lakes

Instead of trying to curate unstructured data upfront, a more effective approach is to structure the data first with some kind of automation. By applying a structured schema to the data, we can more easily manage, query, and analyze the data.

Here's why structuring data before curation is a good idea:

  1. It reduces maintenance: By automating the schema creation and maintenance, you remove 80% of maintenance events of pipelines.
  2. It simplifies the data: By imposing a structure on the data, we can reduce its complexity, making it easier to understand, manage, and use.
  3. It enables automation: Structured data is more amenable to automated testing and processing, including cleaning, transformation, and analysis. This can significantly reduce the manual effort required to manage the data.
  4. It facilitates value extraction: With structured data, we can more quickly and easily extract valuable insights. We don't need to wait for the entire dataset to be curated before we start using it.
  5. It's more scalable: Reading structured data enables us to only read the parts we care about, making it faster, cheaper, and more scalable.

Therefore, adopting a 'structure first' approach to data management can help organizations more effectively leverage their unstructured data, minimizing the effort, time, and complexity involved in data curation and maximizing the value they can extract from their data.

An example of such a structured lake would be Parquet file data lakes, which are both structured and inclusive of all data. However, the challenge here is creating the structured Parquet files and maintaining the schemas, for which the Delta Lake framework provides some decent solutions, but is still far from complete.

The better way

So, what if writing and merging parquet files is not for you? After all, file-based data lakes capture a minority of the data market.

dlt is the first open source Python library to offer schema evolution

dlt enables organizations to impose structure on data as it's loaded into the data lake. This approach, often termed as schema-on-load or schema-on-write, provides the best of both worlds:

  1. Easier maintenance: By notifying the data producer and consumer of loaded data schema changes, they can quickly decide together how to adjust downstream usage, enabling immediate recovery.
  2. Consistency and quality: By applying structure and data typing rules during ingestion, dlt ensures data consistency and quality. This leads to more reliable analysis and insights.
  3. Improved performance: With schema-on-write, the computational cost is handled during ingestion, not when querying the data. This simplifies queries and improves performance.
  4. Ease of use: Structured data is easier to understand and use, lowering the skill barrier for users. They no longer need to understand the intricate details of the data structure.
  5. Data governance: Having a defined schema allows for more effective data governance. Policies for data quality, data privacy, and data lifecycle can be applied consistently and automatically.

By adopting a 'structure first' approach with dlt, organizations can effectively manage unstructured data in common destinations, optimizing for both flexibility and control. It helps them overcome the challenges of schema-on-read while reaping the benefits of a structured, scalable, and governance-friendly data environment.

To try out schema evolution with dlt, check out our colab demo.


Want more?

  • Join our Slack
  • Read our schema evolution blog post
  • Stay tuned for the next article in the series: How to do schema evolution with dlt in the most effective way

· 5 min read
Rahul Joshi
info

TL;DR: Trying to become more user-centric and make data driven decisions? Get started with the SQL source pipeline + BigQuery + Metabase

When you have a web and / or mobile app but no data yet

If you're a startup without a dedicated data team but a sizeable number of users on your website or mobile app, then chances are that you are collecting and storing all your product data in OLTP databases like MySQL, Postgres, etc. As you have grown, you have likely been aiming to become more user-centric, yet you find that no one at your company has information on what your users do or what their experience is like. Stakeholders should be making data-driven decisions, but they are not yet because they are unable to use the existing data to understand user behavior and experience. This is usually the point when folks realize they need a data warehouse.

Why a data warehouse is necessary

OLTP databases are great because they are optimized to handle high-volume real-time transactions and maintain data integrity and consistency. However, they are not very well-suited for advanced analytics and data modelling. If you want to create reports, dashboards, and more that help you understand your users, you are going to want to extract, load, and transform (ELT) the data into an OLAP database like Google BigQuery, Snowflake, etc. To do this, you will need to create a data pipeline, which can be quite challenging if your company does not have a dedicated data engineering team.

Why a data pipeline is necessary

Production dashboards rely on the availability of consistent, structured data, which necessitates deploying a data pipeline that is idempotent, can manage the schema and handle schema changes, can be deployed to load data incrementally, etc. For most startups, it's not obvious how to create such pipelines. This is why we decided to demonstrate how one can set up such a data pipeline and build analytics dashboards on top of it.

Why a reporting tool is necessary

We chose to build our dashboard in Metabase because it also offers an open source edition. The advantage of reporting tools like Metabase is that they are easy and intuitive to use even for people who can't write SQL, but at the same time they are powerful enough for those who would like to use SQL.

How we set this up

1. Creating a PostgreSQL -> BigQuery pipeline

Our aim was to create a Metabase dashboard to explore data in a transactional database. The data set that we chose was a sample of The Dell DVD Store 2 database, which we put into a Postgres database deployed on a Google Cloud SQL instance. To make this data available to Metabase, we needed to first load all of the data into a BigQuery instance, and for this we needed a data pipeline. We created this pipeline by doing very simple customizations on the existing dlt sql_database pipeline. See the accompanying repo for the steps we followed.

2. Building a Metabase reporting dashboard

With the database uploaded to BigQuery, we were now ready to build a dashboard. We created a Metabase cloud account and connected it to our BigQuery instance. This made the whole database accessible to Metabase and we were able to analyze the data.

The DVD store database contains data on the products (film DVDs), product categories, existing inventory, customers, orders, order histories etc. For the purpose of the dashboard, we decided to explore the question: How many orders are being placed each month and which films and film categories are the highest selling?

In addition to this, we were also able to set up email alerts to get notified whenever the stock of a DVD was either empty or close to emptying.


3. Deploying the pipeline

With our dashboard ready, all we had to do was deploy our pipeline so that the dashboard could get updated with new data daily. Since the dashboard only uses some of the tables, we modified the pipeline, which was configured to load the entire database, so that it only updates the necessary tables. We also wanted to make it possible for the pipeline to load tables incrementally whenever possible.

We first started by selecting the tables that we wanted to update, namely: orders, orderlines, products, categories, inventory. We then decided whether we wanted to update the tables incrementally or with full replace:

  • Tables orders and orderlines contain data on the orders placed. This means that they also contain a date column and hence are loaded incrementally every day.
  • Tables products, categories, and inventory contain information on the existing products. These tables don't contain a date column and are updated whenever there is any change in inventory. Since the values of the existing data in the tables can change, these tables are not updated incrementally, but are instead fully loaded each time the pipeline is run.
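
As a rough sketch of how those two write dispositions look in dlt code (the fetch_* helpers and the orderdate cursor column are hypothetical stand-ins for the actual sql_database source configuration we used):

import dlt

@dlt.resource(table_name="orders", write_disposition="append")
def orders(order_date=dlt.sources.incremental("orderdate")):
    # Hypothetical helper: yield only rows newer than the last loaded orderdate.
    yield from fetch_orders(since=order_date.last_value)

@dlt.resource(table_name="products", write_disposition="replace")
def products():
    # No date column: reload the full table on every run.
    yield from fetch_products()

pipeline = dlt.pipeline(pipeline_name="dvd_store", destination="bigquery", dataset_name="dvd_store")
pipeline.run([orders(), products()])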

In order to specify these conditions and deploy our pipeline in production, we followed these steps.

· 4 min read
Rahul Joshi
info

TL;DR: We created a Hacker News -> BigQuery dlt pipeline to load all comments related to popular ELT keywords and then used GPT-4 to summarize the comments. We now have a live dashboard that tracks these keywords and an accompanying GitHub repo detailing our process.

Motivation

To figure out how to improve dlt, we are constantly learning about how people approach extracting, loading, and transforming data (i.e. ELT). This means we are often reading posts on Hacker News (HN), a forum where many developers like ourselves hang out and share their perspectives. But finding and reading the latest comments about ELT from their website has proved to be time consuming and difficult, even when using Algolia Hacker News Search to search.

So we decided to set up a dlt pipeline to extract and load comments using keywords (e.g. Airbyte, Fivetran, Matillion, Meltano, Singer, Stitch) from the HN API. This empowered us to then set up a custom dashboard and create one sentence summaries of the comments using GPT-4, which made it much easier and faster to learn about the strengths and weaknesses of these tools. In the rest of this post, we share how we did this for ELT. A GitHub repo accompanies this blog post, so you can clone and deploy it yourself to learn about the perspective of HN users on anything by replacing the keywords.

Creating a dlt pipeline for Hacker News

For the dashboard to have access to the comments, we needed a data pipeline. So we built a dlt pipeline that could load the comments from the Algolia Hacker News Search API into BigQuery. We did this by first writing the logic in Python to request the data from the API and then following this walkthrough to turn it into a dlt pipeline.

With our dlt pipeline ready, we loaded all of the HN comments corresponding to the keywords from January 1st, 2022 onward.
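
For reference, the core of such a pipeline can be sketched like this; keyword handling, pagination, and date filtering are simplified, and the Algolia endpoint is public but double-check its parameters:

import dlt
import requests

@dlt.resource(table_name="hn_comments", write_disposition="append")
def hn_comments(keyword="fivetran"):
    # Algolia Hacker News Search API; pagination and date filters omitted for brevity.
    resp = requests.get(
        "https://hn.algolia.com/api/v1/search_by_date",
        params={"query": keyword, "tags": "comment"},
    )
    yield from resp.json()["hits"]

pipeline = dlt.pipeline(pipeline_name="hacker_news", destination="bigquery", dataset_name="hn")
pipeline.run(hn_comments())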

Using GPT-4 to summarize the comments

Now that the comments were loaded, we were ready to use GPT-4 to create a one-sentence summary for each of them. We first filtered out any irrelevant comments that may have been loaded, using simple heuristics in Python. Once we were left with only relevant comments, we called the gpt-4 API and prompted it to summarize in one line what the comment was saying about the chosen keywords. If you don't have access to GPT-4 yet, you could also use the gpt-3.5-turbo API.
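
The per-comment call looked roughly like this; this is a simplified sketch using the pre-1.0 openai client that was current at the time, and the prompt wording and field names are illustrative:

import openai  # openai < 1.0 style client

def summarize(comment: str, story_title: str, parent_comment: str = "") -> str:
    prompt = (
        f"Story: {story_title}\nParent comment: {parent_comment}\nComment: {comment}\n"
        "Summarize in one sentence what this comment says about the tool."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",  # or "gpt-3.5-turbo" if you don't have GPT-4 access
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]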

Since these comments were posted in response to stories or other comments, we fed in the story title and any parent comments as context in the prompt. To avoid hitting rate-limit errors and losing all progress, we ran this for 100 comments at a time, saving the results to a CSV file each time. We then built a Streamlit app to load and display them in a dashboard. Here is what the dashboard looks like:

dashboard.png

Deploying the pipeline, Google Bigquery, and Streamlit app

With all the comments loaded and the summaries generated in bulk, we were ready to deploy this process and have the dashboard update daily with new comments.

We decided to deploy our streamlit app on a GCP VM. To have our app update daily with new data we did the following:

  1. We first deployed our dlt pipeline using GitHub Actions to allow new comments to be loaded to BigQuery daily
  2. We then wrote a Python script that could pull new comments from BigQuery into the VM and we scheduled to run it daily using crontab
  3. This Python script also calls the gpt-4 API to generate summaries only for the new comments
  4. Finally, this Python script updates the CSV file that is being read by the streamlit app to create the dashboard. Check it out here!

Follow the accompanying GitHub repo to create your own Hacker News/GPT-4 dashboard.

· 3 min read
Rahul Joshi
info

TL;DR: As of last week, there is a dlt pipeline that loads data from Google Analytics 4 (GA4). We’ve been excited about GA4 for a while now, so we decided to build some internal dashboards and show you how we did it.

Why GA4?

We set out to build an internal dashboard demo based on data from Google Analytics (GA4). Google announced that they will stop processing hits for Universal Analytics (UA) on July 1st, 2023, so many people are now having to figure out how to set up analytics on top of GA4 instead of UA and struggling to do so. For example, in UA, a session represents the period of time that a user is actively engaged on your site, while in GA4, a session_start event generates a session ID that is associated with all future events during the session. Our hope is that this demo helps you begin this transition!

Initial explorations

We decided to make a dashboard that helps us better understand data attribution for our blog posts (e.g. As DuckDB crosses 1M downloads / month, what do its users do?). Once we got our credentials working, we used the GA4 dlt pipeline to load data into a DuckDB instance on our laptop. This allowed us to figure out what requests we needed to make to get the necessary data to show the impact of each blog post (e.g. across different channels, what was the subsequent engagement with our docs, etc.). We found it helpful to use the GA4 Query Explorer for this.

Internal dashboard


With the data loaded locally, we were able to build the dashboard on our system using Streamlit. You can also do this on your system by simply cloning this repo and following the steps listed here.

After having the pipeline and the dashboard set up just how we liked it, we were now ready to deploy it.

Deploying the data warehouse

We decided to deploy our Streamlit app on a Google Cloud VM instance. This means that instead of storing the data locally, it would need to be in a location that could be accessed by the Streamlit app. Hence we decided to load the data onto a PostgreSQL database in the VM. See here for more details on our process.

Deploying the dlt pipeline with GitHub Actions

Once we had our data warehouse set up, we were ready to deploy the pipeline. We then followed the deploy a pipeline walkthrough to configure and deploy a pipeline that will load the data daily onto our data warehouse.

Deploying the dashboard

We finally deployed our Streamlit app on our Google Cloud VM instance by following these steps.

Enjoy this blog post? Give dlt a ⭐ on GitHub 🤜🤛

· 3 min read
Matthaus Krzykowski

Using DuckDB, dlt, & GitHub to explore DuckDB

tip

TL;DR: We created a Colab notebook for you to learn more about DuckDB (or any open source repository of interest) using DuckDB, dlt, and the GitHub API 🙂

So is DuckDB full of data about ducks?

Nope, you can put whatever data you want into DuckDB ✨

Many data analysts, data scientists, and developers prefer to work with data on their laptops. DuckDB allows them to start quickly and easily. When working only locally becomes infeasible, they can then turn this local “data pond” into a data lake, storing their data on object storage like Amazon S3, and continue to use DuckDB as a query engine on top of the files stored there.

If you want to better understand why folks are excited about DuckDB, check out this blog post.

Perhaps ducks use DuckDB?

Once again, the answer is 'nein'. As far as we can tell, it is usually people who use DuckDB 🦆

To determine this, we loaded emoji reaction data for the DuckDB repo from the GitHub API into a DuckDB instance using data load tool (dlt), and explored who has been reacting to issues / PRs in the open source community. This is what we learned…

The three issues / PRs with the most reactions all-time are

  1. SQLAlchemy dialect #305
  2. Add basic support for GeoSpatial type #2836
  3. Support AWS default credential provider chain #4021

The three issues / PRs with the most reactions in 2023 are

  1. Add support for Pivot/Unpivot statements #6387
  2. Add support for a pluggable storage and catalog back-end, and add support for a SQLite back-end storage #6066
  3. Add support for UPSERT (INSERT .. ON CONFLICT DO ..) syntax #5866

Some of the most engaged users (other than the folks who work at DuckDB Labs) include

All of these users seem to be people. Admittedly, we didn’t look at everyone though, so there could be ducks within the flock. You can check yourself by playing with the Colab notebook.

Maybe it’s called DuckDB because you can use it to create a "data pond" that can grow into a data lake + ducks like water?

Although this is a cool idea, it is still not the reason that it is called DuckDB 🌊

Using functionality offered by DuckDB to export the data loaded to it as Parquet files, you can create a small “data pond” on your local computer. To make it a data lake, you can then add these files to Google Cloud Storage, Amazon S3, etc. And if you want this data lake to always fill with the latest data from the GitHub API, you can deploy the dlt pipeline.
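
A tiny sketch of that export step; the file and table names are placeholders:

import duckdb

con = duckdb.connect("github_reactions.duckdb")
# Export a loaded table to Parquet to start the local "data pond".
con.execute("COPY (SELECT * FROM issue_reactions) TO 'issue_reactions.parquet' (FORMAT PARQUET)")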

Check this out in the Colab notebook and let us know if you want some help setting this up.

Just tell me why it is called DuckDB!!!

Okay. It’s called DuckDB because ducks are amazing and @hannes once had a pet duck 🤣

Why "Duck" DB? Source: DuckDB: an Embeddable Analytical RDBMS

Enjoy this blog post? Give data load tool (dlt) a ⭐ on GitHub here 🤜🤛
