Getting Started with Dataform on Google Cloud Platform (GCP)

Dataform is a powerful tool for managing and automating data workflows, specifically designed for teams working with Google BigQuery. It allows you to create efficient data pipelines, organize SQL code, and implement practices like dependency management and automatic documentation.

In this blog, we'll walk you through how to set up a workspace inside a Dataform repository, explain the default folder structure, explore the main tabs, and provide some JavaScript functions to make your workflow more dynamic and reusable.

Setting Up Your Dataform Workspace

Assuming you already have a Google Cloud Platform (GCP) project and BigQuery set up, the next step is to navigate to Dataform within the BigQuery Studio.

Once inside Dataform, you will work within a repository. Within this repository, you can create multiple workspaces for different tasks or environments, such as development, staging, or production.

To get started:

  1. Navigate to your Dataform repository.
  2. Create a new workspace, or open an existing one, to begin working on your data pipelines.

Default Folder Structure in Dataform

When you create a new workspace, Dataform provides a predefined folder structure:

  • definitions/: Contains the SQLX and JavaScript files that define your tables, views, incremental tables, declarations, assertions, and other actions.
  • includes/: Holds reusable JavaScript constants and helper functions that can be referenced from any file in the project.

This structure helps to keep your codebase neat and manageable, especially as your project scales.
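In addition to these folders, the repository root contains a compilation settings file, workflow_settings.yaml (older repositories use dataform.json instead). A minimal sketch, with placeholder values:

```yaml
# workflow_settings.yaml -- all values below are placeholders
defaultProject: my-gcp-project   # GCP project that tables are created in
defaultDataset: my_dataset       # default BigQuery dataset for outputs
defaultLocation: US              # BigQuery location for the datasets
dataformCoreVersion: 3.0.0       # Dataform core version to compile with
```

These defaults are applied to every action in the repository unless an individual file overrides them in its config block.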

Managing Workflow in Dataform

Workflows in Dataform are built from the SQL actions and transformations defined in the definitions/ directory. Dataform automatically generates a dependency graph, which ensures that each transformation runs in the correct order based on table dependencies. This feature is especially useful for tracking data lineage and understanding how data flows through your pipeline.
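For example, a minimal SQLX model placed under definitions/ can reference an upstream table with ref(), which is exactly how Dataform learns the dependencies it graphs. The table and column names here are illustrative:

```sqlx
config { type: "table" }

SELECT
  customer_id,
  SUM(amount) AS total_amount
FROM ${ref("table1")}
GROUP BY customer_id
```

Because the upstream table is referenced through ref() rather than hardcoded, Dataform knows this model must run after table1 is available, and the relationship appears in the compiled graph.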

Tabs in Dataform

When working in Dataform, you'll come across several key tabs that make it easy to manage your transformations and workflows:

  • Code: This is where you write, edit, and organize your SQLX files, JavaScript helpers, and project configuration.
  • Compiled graph: This tab shows the dependency graph between all tables and views, allowing you to visualize and verify dependencies.
  • Executions: This tab lets you view previous runs of your workflow, showing details of each execution's duration, status, and errors.
  • Start execution: This lets you manually run the entire workflow or a selection of actions. It's particularly useful when you need to trigger an on-demand refresh of your tables.

Making Workflows Dynamic with JavaScript

Dataform allows you to incorporate JavaScript within your SQL transformations, making it easier to handle repetitive tasks, manage variables, and make your queries more flexible. Below are a few useful functions that you can implement.

Dynamically Declare Tables

Instead of hardcoding table names throughout your scripts, you can store them in an array and declare them dynamically. Here's an example of a JavaScript file (placed in the definitions/ directory) that does this:


const tables = [
  { name: "table1", database: "my-database", schema: "my-schema" },
  { name: "table2", database: "my-database", schema: "my-schema" },
  { name: "table3", database: "my-database", schema: "my-schema" },
];

// Function to generate and apply table declarations
function generateTableDeclarations() {
  tables.forEach(table => {
    declare({
      database: table.database,
      schema: table.schema,
      name: table.name
    });
  });
}

// Call the function to generate the declarations
generateTableDeclarations();

With this function, you can easily add, remove, or update tables without modifying multiple files.
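If you also need fully qualified table names for raw SQL (for example in pre-operations), a small helper can build them from the same array. This helper is illustrative and not part of Dataform's API; it is plain JavaScript you can run anywhere:

```javascript
// Hypothetical helper: builds a fully qualified BigQuery reference
// like `project.dataset.table` from one entry of the tables array.
function fullyQualifiedName(table) {
  return "`" + [table.database, table.schema, table.name].join(".") + "`";
}

// Example with a placeholder entry:
const example = { name: "table1", database: "my-database", schema: "my-schema" };
console.log(fullyQualifiedName(example)); // prints `my-database.my-schema.table1`
```

Keeping such helpers alongside the table array means a rename only ever touches one file.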

Using Variables for Dynamic Queries

You can also use JavaScript to define variables once and reuse them across different queries. Here's an example:


// Store shared values in a file under includes/, e.g. includes/constants.js
const selling_date = "2023-01-01";

module.exports = { selling_date };


You can now reference selling_date in your SQL transformations without hardcoding the value in multiple places.
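For instance, assuming the file above is saved as includes/constants.js (files in includes/ are exposed under their filename), a SQLX model could use it like this; the table and column names are illustrative:

```sqlx
config { type: "view" }

SELECT *
FROM ${ref("table1")}
WHERE sale_date = '${constants.selling_date}'
```

If the date ever changes, you update it in one place and every query that references it picks up the new value on the next compilation.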

Conclusion

Dataform is a powerful tool for managing and automating data pipelines on GCP, especially when working with BigQuery. By leveraging its workspace structure, the flexibility of JavaScript, and the built-in dependency graph, you can build efficient, dynamic workflows that are easy to manage and scale.

If you liked this post, don’t forget to share and comment below!



