Apache Airflow Snowflake



Hi everyone, this lockdown period has given me the opportunity to explore many new things. Today I am sharing how to ingest data into Snowflake through Airflow. In this blog I will give you an idea about the following points:

  • What is Airflow?
  • How to create a custom operator for the Snowflake connector?
  • How does Airflow communicate with a Snowflake stage (with a runnable example)?

If you are running Airflow 2.0, you may need to install separate provider packages (e.g. apache-airflow-providers-snowflake) to use the hooks, operators, and connections described here. In an Astronomer project this can be accomplished by adding the packages to your requirements.txt file. To learn more, read the Airflow docs on provider packages.

Let's start with Airflow step by step,

What is Airflow?

Apache Airflow is an orchestration tool: it lets us automate workflows and schedule tasks to manage data pipelines programmatically. Airflow uses DAGs (directed acyclic graphs) to manage tasks.

To talk to Snowflake you need the provider package. For Airflow 1.10.x install the backport provider: pip install apache-airflow-backport-providers-snowflake==2020.11.23. For Airflow 2.0+ install: pip install apache-airflow-providers-snowflake. The Snowflake connection itself needs your Snowflake user name and password, which we will configure in the Airflow UI below.

Task: A task is a unit of work to be scheduled, such as uploading data to an S3/GCS bucket, downloading a file from an S3/GCS bucket, or applying some custom transformation on top of it.

Airflow is not a data streaming system, and tasks do not move data from one task to another. Each task has its own execution flow, but tasks can exchange metadata.

Note: Sometimes people think the DAG definition is the place to do actual data processing, but that is not correct at all! The DAG definition file is parsed and evaluated periodically by the scheduler, so it should only describe the structure of the workflow.
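To make the last two points concrete, here is a minimal sketch (Airflow 1.10.x-style imports; all names are illustrative) of two tasks exchanging a small piece of metadata via XCom, while the DAG file itself does no data processing:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x path


def push_file_name(**context):
    # Push a small piece of metadata (a file name), not the data itself.
    context["ti"].xcom_push(key="file_name", value="sample_records.csv")


def pull_file_name(**context):
    file_name = context["ti"].xcom_pull(task_ids="push_metadata", key="file_name")
    print("Downstream task received metadata: %s" % file_name)


# The DAG file only declares structure; no heavy work happens at parse time.
with DAG(
    dag_id="xcom_metadata_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    push_metadata = PythonOperator(
        task_id="push_metadata",
        python_callable=push_file_name,
        provide_context=True,  # required on 1.10.x; deprecated/ignored on 2.x
    )
    pull_metadata = PythonOperator(
        task_id="pull_metadata",
        python_callable=pull_file_name,
        provide_context=True,
    )
    push_metadata >> pull_metadata
```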

How to create a custom operator for the Snowflake connector?

In this section I will give you an overall idea of the settings and pieces we need for the connector code and the custom operator. Let's follow the steps below to create the custom operator.

Step 1: First, we need authentication information so that Airflow can talk to the Snowflake stage from code, using the credentials and other details internally. So, we need information like:

  • Snowflake username
  • Snowflake password
  • Snowflake account
  • Warehouse
  • Database
  • Schema
  • Staging query
  • Table name

Keep this information handy so that we can configure it in the Airflow connection.

Step 2: Airflow has a feature called connections.

Go to Admin --> Connections

In the Connections section there is an option to create a new connection. Click the Create option.

You will see the new-connection form.

Let me explain how to fill in this form, because this part is very important: we are going to use this information in code when we create the custom operator.

1. Conn Id: this is a user-defined field, so we can write 'snowflake' here.

2. Conn Type: we can select 'Snowflake' here (with the Snowflake provider or backport package installed, it appears in the dropdown).

3. Host: in this field we write our Snowflake host name, e.g.

'XXXXXXX.us-central1.gcp.snowflakecomputing.com'. The host name must follow the naming rules provided by Snowflake, and every cloud platform has a different host name format. I am using Google Cloud (GCP), so my host name looks like the one above; for AWS the rules for building the host name are different. You can find these rules in the Snowflake documentation.

4. Schema: by default, the value is 'PUBLIC'.


5. Login: this is the Snowflake username.

6. Password: this is the Snowflake password.

7. Port: by default, 443 is the port for communication with Snowflake.

8. Extra: this is a very important field. It takes JSON, and we can put any extra fields in this JSON and read them from code, for example the account, warehouse, and database values (see the example after this list).

9. Save this connection information.
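For reference, the Extra field could hold JSON like the following; the key names are not fixed, they are simply whatever our custom operator chooses to read (these values are illustrative):

```json
{
    "account": "XXXXXXX.us-central1.gcp",
    "warehouse": "COMPUTE_WH",
    "database": "DEMO_DB",
    "role": "SYSADMIN"
}
```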

Step 3: Now we are ready with the configuration. Create a 'custome_operator' folder in the dags directory, then create a Python file and name it 'SnowflakeCustomOperator.py'.

a. Write the import statements

Here I have used the BaseOperator class to create the custom operator and AirflowPlugin to create a plugin for future use.
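A minimal sketch of those imports (Airflow 1.10.x-style paths; on Airflow 2.x some of these modules are deprecated aliases):

```python
# custome_operator/SnowflakeCustomOperator.py
from airflow.models import BaseOperator            # base class for the custom operator
from airflow.plugins_manager import AirflowPlugin  # only needed if we export a plugin (step d)
from airflow.hooks.base_hook import BaseHook       # to read the 'snowflake' connection
from airflow.utils.decorators import apply_defaults

import snowflake.connector                         # the Snowflake Python connector
```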

b. Class name and constructor

I used BaseOperator here to create the custom SnowflakeOperator, and it provides the default methods to implement. I pass the input parameters in the constructor: table name, stage name, and file format. If you want, you can take any number of arguments as parameters. (The class skeleton is shown together with the execute logic in the sketch after step c.)

c. Now we write the core logic to communicate with the Snowflake database.

Using a BaseHook class object we can read the connection information that we already configured in the Connections tab above. We get a key-value pair for each field, so we simply avoid hard-coded values in the code.
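Putting steps b and c together, here is a minimal sketch of what the operator might look like (the parameter names, Extra keys, and the COPY INTO statement are assumptions based on the description above, not the author's exact code):

```python
# continuing custome_operator/SnowflakeCustomOperator.py (imports from step a)

class SnowflakeCustomOperator(BaseOperator):
    """Loads files from a Snowflake stage into a table."""

    @apply_defaults
    def __init__(self, table_name, stage_name, file_format,
                 snowflake_conn_id="snowflake", *args, **kwargs):
        super(SnowflakeCustomOperator, self).__init__(*args, **kwargs)
        self.table_name = table_name
        self.stage_name = stage_name
        self.file_format = file_format
        self.snowflake_conn_id = snowflake_conn_id

    def execute(self, context):
        # Read the connection we configured under Admin --> Connections.
        conn = BaseHook.get_connection(self.snowflake_conn_id)
        extra = conn.extra_dejson  # the JSON from the 'Extra' field

        sf_conn = snowflake.connector.connect(
            user=conn.login,
            password=conn.password,
            account=extra.get("account"),
            warehouse=extra.get("warehouse"),
            database=extra.get("database"),
            schema=conn.schema or "PUBLIC",
        )
        try:
            copy_sql = (
                "COPY INTO {table} FROM @{stage} "
                "FILE_FORMAT = (FORMAT_NAME = '{fmt}')"
            ).format(table=self.table_name, stage=self.stage_name, fmt=self.file_format)
            self.log.info("Running: %s", copy_sql)
            sf_conn.cursor().execute(copy_sql)
        finally:
            sf_conn.close()
```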

d. If you want to export this custom operator as a plugin, you can write the following code; otherwise it is not mandatory.
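A sketch of that optional plugin export, assuming the operator class from the previous sketch (the plugin name is illustrative):

```python
# continuing custome_operator/SnowflakeCustomOperator.py
class SnowflakeCustomPlugin(AirflowPlugin):
    name = "snowflake_custom_plugin"
    operators = [SnowflakeCustomOperator]
```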


Now we have finished the custom operator creation part. Let's get started using this operator in our DAG code.

Step 4: Create a DAG file in the dags directory and name it 'SnowflakeCustomOperatorDag.py'.

a. Import statements; here we also import our custom operator file. (The imports are shown together with the default arguments and the DAG definition in the sketch after step c.)

b. Now we write the default arguments for the DAG as per our needs; in my case I used the following default arguments (included in the sketch after step c).

c. Create the DAG definition and give it a dag_id and other information, like in the sketch below.
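A minimal sketch covering steps a-c; the dag_id, schedule, and default argument values are illustrative, and the custom operator is imported from the 'custome_operator' folder inside the dags directory (which Airflow puts on the Python path):

```python
# dags/SnowflakeCustomOperatorDag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x path

# Our custom operator from dags/custome_operator/SnowflakeCustomOperator.py
from custome_operator.SnowflakeCustomOperator import SnowflakeCustomOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2021, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="snowflake_custom_operator_dag",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
)
```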

d. Create a task using our custom operator and pass all the input parameters for execution. Here I have taken a sample public dataset with 5 million records, and I am going to insert these records into a Snowflake table using an external stage. I have already created a stage in Snowflake, there is a table to store the 5 million rows, and the input data sits in a GCS bucket.
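Continuing the same file, a sketch of that task; the table, stage, and file format names are made up, and the external stage is assumed to point at the GCS bucket holding the file:

```python
# continuing dags/SnowflakeCustomOperatorDag.py
snowflake_ingest = SnowflakeCustomOperator(
    task_id="snowflake_ingest",
    table_name="SAMPLE_5M_RECORDS",
    stage_name="GCS_DEMO_STAGE",
    file_format="CSV_FORMAT",
    dag=dag,
)
```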

e. If you want a cleanup task, you can create one using the PythonOperator. In my case I only created a method for a job-completion notification; you can write any cleanup logic in the same method, or keep it as it is. I used the PythonOperator here to call this method after the first task finishes.
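A sketch of that notification/cleanup task, continuing the same file (the callable only prints a message; any cleanup logic would go in its place):

```python
# continuing dags/SnowflakeCustomOperatorDag.py
def job_complete_notification(**context):
    # Put any cleanup logic here; for now we just print a completion message.
    print("Snowflake ingestion finished for run %s" % context["ds"])


job_complete = PythonOperator(
    task_id="job_complete_notification",
    python_callable=job_complete_notification,
    provide_context=True,  # required on Airflow 1.10.x; deprecated/ignored on 2.x
    dag=dag,
)
```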

f. Finally, we define the task ordering so that the first task, the Snowflake data insertion, finishes before the cleanup code runs.
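And the ordering itself, continuing the same file:

```python
# continuing dags/SnowflakeCustomOperatorDag.py
snowflake_ingest >> job_complete  # insert into Snowflake first, then notify/clean up
```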

Conclusion: Now you can see your data inserted into the Snowflake table. Here we are reading the file from a GCS bucket; you can also replace the path with a local file.

If you get any error related to accessing the GCS or S3 bucket from code, you simply need to configure a connection for that cloud service as well. I am assuming here that the Airflow setup is on GCP.

Final output

Thank you friends...Good Day...



Do you know which of the many ETL tools available in the market suits your business best? If you are already a Snowflake customer, are you having trouble setting up and maintaining a reliable Snowflake ETL tool for data loading? This article lists some popular ETL platforms that seamlessly load data into the Snowflake data warehouse.

Table of contents

  • 5 Best Snowflake ETL Tools
    • Daton
    • Stitch Data
    • Blendo
    • Hevo Data
    • Apache Airflow

Introduction to Snowflake

Snowflake is a cloud-based data warehouse created by three data warehousing experts from Oracle Corporation in 2012. Snowflake Computing, the vendor behind the Snowflake Cloud Data Warehouse product, has raised over $400 million over the past eight years and acquired thousands of customers. One might wonder whether another data warehouse vendor is needed in an already crowded field of traditional data warehousing technologies like Oracle, Teradata, and SQL Server, and cloud data warehouses like Amazon Redshift and Google BigQuery. The answer lies in the disruption caused by cloud technologies and the opportunities the cloud created for new technology companies. Public clouds enabled start-ups to shed past baggage, learn from the past, challenge the status quo, and take a fresh look at building a new data warehouse product. Read about Snowflake's pros and cons to understand this modern, cloud-built data warehouse.

You can register for a $400 free trial of Snowflake within minutes. This credit is sufficient to store a terabyte of data and run a small data warehouse environment for a few days.

What is Snowflake ETL?

ETL (Extract, Transform, Load) is the process of extracting data from various sources, transforming it into compatible formats, and loading it into a destination. Snowflake ETL, similarly, refers to extracting relevant data from data sources, transforming it, and then loading it into Snowflake.

Modern businesses prefer ELT tools over the traditional ETL approach. With ELT, the necessary transformations are made after the data is loaded into the data warehouse.

Why ETL your data into Snowflake?

If you are already storing your valuable data in some other database, here are some of the unique features of Snowflake:

  • Decoupled architecture: Decoupled layers of storage, compute, and cloud services in the Snowflake architecture allow independent scaling.
  • JSON using SQL: Snowflake supports JSON data through the VARIANT data type and functions such as PARSE_JSON (see the sketch after this list).
  • Fast Clone: Fast Clone is a feature that lets you clone a table or an entire database instantly, without duplicating storage.
  • Encryption: Snowflake supports various encryption mechanisms, such as end-to-end encryption and client-side encryption, guaranteeing a high level of data security.
  • Query optimization: The query optimization engine runs automatically in the background to improve query performance, taking care of concerns such as indexing or partitioning.
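For example, with the snowflake-connector-python package you could query a JSON column like this (the table and column names are made up):

```python
import snowflake.connector

# Connection details are placeholders; reuse the values from your Snowflake account.
conn = snowflake.connector.connect(
    user="USER", password="PASSWORD", account="XXXXXXX.us-central1.gcp"
)
cur = conn.cursor()
cur.execute("""
    SELECT PARSE_JSON(raw_payload):customer.name::string AS customer_name
    FROM RAW_EVENTS
    LIMIT 10
""")
print(cur.fetchall())
conn.close()
```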

How to Evaluate Snowflake ETL Tools?

Every organization needs to invest in the right ETL tool for its business operations. We have listed some major factors to be considered before choosing an ETL service:

  • Ease of use: Check whether you can use the tool effortlessly. It may offer simple drag-and-drop GUIs, or require writing SQL or Python scripts to enable complex transformations in the ETL process.
  • Supports multiple data sources: The ETL service provider should support various data sources for analytics.
  • Extensibility: Most ETL tools support a fixed set of data sources, but there should also be an option for custom additions of new data sources.
  • Data Transformation: Check the level of data transformation supported by the ETL tool.
  • Pricing: Cost is always a concern and the price depends on a range of factors and use cases.
  • Product documentation: A detailed documentation for in-house engineers is always helpful.
  • Customer support: Timely, efficient, and multi-channel customer support is very essential for troubleshooting an ETL tool.

5 Best Snowflake ETL Tools

The following ETL tools are popular for catering to the data needs of modern businesses, especially the ones using the Snowflake data warehouse.

Daton

Daton is an effective data integration tool that seamlessly extracts all the relevant data from popular data sources, then consolidates and stores it in the data warehouse of your choice for more effective analysis. The best part is that you can use Daton without any coding experience, and it is one of the cheapest data pipelines available in the market.

Daton features:

  • IP Address Extraction
  • Load Scheduling
  • CSV and Excel export with data manipulation
  • Data Warehousing
  • Heterogeneous platform support
  • Knowledge modules for developers
  • High Volume Processing

Daton Limitations


  • Daton focuses only on the extraction and loading process, as ELT systems are becoming more popular.
  • Data transformation has to be taken care of by the analysts after the data is loaded into the warehouse, where they can customize it according to their use cases.

Daton Pricing:


There is a wide range of plans: STARTER at $150, GROWTH at $500, and BUSINESS at $1000. The ENTERPRISE plan covers custom requirements, and the LITE plan is pay-per-use.

Stitch Data

Stitch provides a powerful ETL tool designed for developers. Stitch connects to all data sources and SaaS tools to replicate all relevant data to a data warehouse.

Stitch Data Features:

  • Advanced scheduling
  • Post-load webhooks
  • Notification extensibility
  • API key management.

Stitch Data Limitations

  • It is difficult to replicate data from a single source to multiple destinations.
  • Data source library is limited.

Stitch Data Pricing:

There is a 14-day trial period. The three plans available are Free plan, Standard plan at $100 per month and Enterprise plan for custom requirements.

Blendo

Blendo is a good option for businesses that want to access and model company data. It provides a simple ETL service that runs data loads from different data sources easily.

Blendo features

  • Blendo is easy to set up, and no coding is required.
  • It supports over 45 data sources, including SaaS platforms, cloud storages and databases.
  • Data monitoring and notification features are available for data pipeline breakdowns.
  • Customer support includes Intercom online chat and email.
  • Product documentation is available on the website.

Blendo Limitations

  • The product focuses mainly on the extraction and loading process, and not much on data transformations.
  • Users cannot add a new data source or edit an existing one without help from experts.

Blendo Pricing

Blendo offers a 14-day fully-featured free trial. The basic plan starts at $150 per month.

Hevo Data

Hevo Data offers a robust and comprehensive data integration solution. It is also quick to set up and has an intuitive interface to customize your ETL process.

Hevo Data features

  • Hevo claims to be a zero-maintenance data pipeline.
  • It offers a powerful plug-and-play data transformation.
  • Hevo supports real-time and reliable data replication with no data loss.
  • The fault-tolerant architecture resolves data disruptions and reports the operational status of real-time loads.
  • It supports 100+ data sources like SaaS platforms, cloud storage, databases, BI tools.
  • It offers unparalleled customer support through online Intercom chat, email, and over the phone.
  • Extensive documentation, including video resources on the product, is given on the company website.

Hevo Data Limitations

  • Alerting needs to be improved.
  • The data transformation interface could be more user-friendly.
  • Scheduling a pipeline job is difficult.

Hevo Data Pricing

It comes with a 14-day free trial. The basic plan starts at $149 per month.

Apache Airflow

Apache Airflow is the most popular open-source product for businesses looking to develop a custom Snowflake ETL tool in-house.

Apache Airflow features

  • Apache Airflow is an open-source product free to download and use.
  • It enables building data workflows as Directed Acyclic Graphs (DAGs).
  • Detailed online documentation is available for setup and troubleshooting.
  • Customer Support is provided through an Airflow Slack community.
  • Users can add functionality to Apache workflows using Python.
  • ETL your data into any system through custom code, or a pre-built module/plugin.

Apache Airflow Limitations

This tool involves a lot of scripting and Python code for setup and operations.

Apache Airflow Pricing

Open-source, licensed under Apache License Version 2.0.

Conclusion

There are several Snowflake ETL tools available in the market. Modern businesses are more comfortable outsourcing their data needs to these ETL service providers so that they can dedicate more time and resources to effective analytics and business intelligence. Daton is a no-code data pipeline that takes care of all the business data for organizations using the Snowflake data warehouse.

Sign up for a free trial of Daton today!
