How to build your first data pipeline

Create a simple data pipeline in a few clicks

Data Pipelines allows people of all skill levels to build pipelines quickly and easily.

To keep things simple, in this tutorial we will create a pipeline using inline data (sample data you enter in the browser). The process is exactly the same as using a real data source, for example an SQL table or CSV files. Inline data can be combined with 'real' external data, useful for debugging while developing pipelines. You can read more about how to connect your data here.

Before you create your first pipeline, you may want to create a Project but it's optional. You can use projects to group pipelines into logical collections.

There are multiple ways to create pipelines (for example by importing a JSON definition), but in this tutorial we will focus on the Pipeline Builder.

Click Add Pipeline, enter a name for your pipeline then click Create new.

A new pipeline will be created and opened in the Pipeline Builder view. Depending on what data connections you have defined, you may not see some of the connection types under Load datasets.

An empty pipeline

Click Manual input, add some sample data and click Load. Every time a dataset is added, regardless of the connection type, it needs to be registered in your pipeline by clicking Load.

Click Load to add the table you have just created to your pipeline.

Once you have some data in your pipeline, you can use the widgets in the control panel on the left to add transformation steps. There is no limit on the number of steps per pipeline. Each time you add a step, click Preview to apply the transformation and refresh the pipeline output preview.

Group cars by make and calculate average year for each

Add another step to filter cars that were made after 2010.

Filter grouped dataset by average year greater than 2010

Your pipeline may comprise many datasets and hundreds of transformation steps and can be difficult to understand at a glance. You may find the Pipeline DAG (directed acyclic graph) view useful to quickly visualize the relationships and transformation steps in your pipeline.

DAG for the sample pipeline containing a single input dataset and two transformation steps

Pipelines and their corresponding graphs may get complex.

DAG for a more complex pipeline containing joins

Some transformations can be used to create a new dataset. For example, as shown in the below screenshot, the Join operation can be applied in two ways. One will overwrite the left dataset with the results of the join operation, the other will create a new dataset which can then be used just like any other dataset. It can be combined with other datasets and be applied transformations.

The first join will overwrite the left table in the join, the other will yield a new table.

If you are familiar writing Spark SQL you may add a SQL step where you can enter custom Spark SQL manually. You may use any Apache Spark function from the official Spark documentation.

Applying custom Spark SQL

Any dataset displayed in the builder can be written to a data connection of your choice via either a schedule or the Quick Write widget.

The Quick Write widget is intended for debugging purposes only. It is very basic and does not have the write options (eg. whether to GZip the output) that the scheduler does. It uses some sensible defaults.