Data Pipelines makes it easy to create pipelines from large datasets without in-depth knowledge of Apache Spark. It removes the need for the programming and SQL knowledge usually associated with data analytics. Our platform provides a user-friendly interface for building reporting pipelines step by step.
Once you have an account with us, you can build pipelines from data stored locally (if self-hosted), in Amazon S3, or in SQL databases you connect. Each pipeline reads data and writes its output as CSV or Parquet files or as SQL tables. Each step in a pipeline corresponds to an SQL operation (select, join, etc.), but because these operations are built through the UI, only basic tabular data processing skills are needed.
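To make the step-to-SQL correspondence concrete, here is a rough sketch of what a two-step pipeline (filter, then join with aggregation) boils down to. The table and column names are hypothetical examples, and Python's built-in sqlite3 stands in for the actual engine:

```python
import sqlite3

# In-memory database standing in for a pipeline's data source
# (table and column names are hypothetical examples).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE impressions (ad_id INTEGER, country TEXT, clicks INTEGER);
    CREATE TABLE ads (ad_id INTEGER, campaign TEXT);
    INSERT INTO impressions VALUES (1, 'US', 10), (2, 'DE', 5), (1, 'DE', 7);
    INSERT INTO ads VALUES (1, 'spring_sale'), (2, 'brand');
""")

# Each clause below corresponds to one UI-built pipeline step:
# a filter (WHERE), a join (JOIN), and an aggregation (GROUP BY).
rows = conn.execute("""
    SELECT a.campaign, SUM(i.clicks) AS total_clicks
    FROM impressions AS i
    JOIN ads AS a ON a.ad_id = i.ad_id
    WHERE i.country = 'DE'
    GROUP BY a.campaign
""").fetchall()
print(rows)
```

In the product, each of these clauses is configured through the interface rather than written by hand; the sketch only shows the equivalent SQL.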
Once a pipeline is built, it can be scheduled to run at desired intervals or at a specific time of day.
Data Pipelines uses Apache Spark to process data, so it scales to workloads of practically any size.
A market research company collects ad impressions in CSV files stored in Amazon S3.
These files are too large to inspect in Excel or similar tools.
Solution: With Data Pipelines, it takes researchers only a few clicks to connect the data and start inspecting it through the graphical interface.
A mobile game's microtransaction metadata is stored in Parquet files in Amazon S3.
The developers want to deliver several aggregated daily performance reports, written to a MySQL database.
Solution: Admins connect their data to our platform and start creating pipelines within minutes, previewing at each step what the final output will look like. The pipelines are then scheduled to run once a day.
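A daily report pipeline like this reduces to a scheduled aggregation. The sketch below shows the kind of query such a pipeline would run once per day; the table and column names are hypothetical, and sqlite3 stands in for the Parquet source and MySQL target:

```python
import sqlite3

# Hypothetical microtransaction metadata; in the real setup this would be
# read from Parquet files in S3 and the result written to a MySQL table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (day TEXT, item TEXT, amount_usd REAL);
    INSERT INTO purchases VALUES
        ('2024-05-01', 'gems', 4.99),
        ('2024-05-01', 'gems', 0.99),
        ('2024-05-01', 'skin', 2.49);
""")

# The daily aggregation step: purchase count and revenue per item per day.
report = conn.execute("""
    SELECT day, item, COUNT(*) AS purchases,
           ROUND(SUM(amount_usd), 2) AS revenue
    FROM purchases
    GROUP BY day, item
""").fetchall()
print(report)
```

Scheduling the pipeline to run once a day simply re-executes this aggregation over the latest data and writes the rows to the destination table.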
A company has archives of historical medical data stored on hard drives that need to be processed and migrated to Amazon S3.
They have their own servers and prefer to host everything in-house.
Solution: Data Pipelines can be self-hosted, which enables seamless processing and combining of data stored locally and in Amazon S3. The locally processed data is then written to their S3 bucket.