Discovering Apache Airflow
If you’ve been keeping an eye on the data engineering market, you might have noticed that Apache Airflow is becoming more and more common in job postings.
Although I haven’t had the opportunity to use it professionally, I decided to explore it and learn more to see what it’s all about.
Apache Airflow is an open source workflow orchestration tool that allows you to build and run batch data pipelines by creating tasks and managing the dependencies between them.
Mandatory history lesson: Airflow was developed at Airbnb in 2014, open-sourced in 2015, and joined the Apache Incubator in 2016.
Advantages:
Before we dive into Airflow, it’s important to mention that Airflow is a code-based tool that represents a workflow as a DAG (Directed Acyclic Graph). Here are the key advantages of expressing data pipelines and their configuration as code:
- Maintainable: Developers can explicitly follow what has been specified simply by reading the code.
- Versionable: Code revisions can easily be tracked by a version control system such as Git.
- Collaborative: Teams of developers can easily collaborate on both development and maintenance of the code for the entire workflow.
- Testable: Any revisions can be passed through unit tests to ensure the code still works as intended (see the DAG-integrity test sketch below).
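For example, here is a minimal sketch of such a unit test (my own illustration, assuming pytest and a local dags/ folder; the file and folder names are hypothetical):

```python
# test_dag_integrity.py -- a minimal DAG-integrity test (hypothetical file name).
# Assumes your DAG files live in a local "dags/" folder; adjust the path as needed.
from airflow.models import DagBag


def test_dags_load_without_import_errors():
    # DagBag parses every Python file in the folder and records any import
    # errors, including cycles detected in the graph.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_every_dag_has_at_least_one_task():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} defines no tasks"
```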
Principles:
Now that we’ve seen the advantages of pipelines as code, let’s take a deeper look into Airflow. The principles of Airflow, as described by Airbnb, are the following:
- Dynamic: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation. This makes it possible to write code that instantiates pipelines dynamically (see the sketch after this list).
- Extensible: Easily define your own operators and executors, and extend the library so that it fits the level of abstraction that suits your environment.
- Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
- Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
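To make the “dynamic” and “elegant” points more concrete, here is a small sketch of dynamic pipeline generation (my own illustration, not taken from the Airflow docs; the table names and the command are placeholders, and import paths or parameter names may vary slightly between Airflow versions):

```python
# A sketch of dynamic pipeline generation: one task per table, created in a loop.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator  # import path valid for Airflow 2.x

with DAG(
    dag_id="dynamic_export_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # "schedule_interval" on Airflow versions before 2.4
    catchup=False,
):
    for table in ["users", "orders", "payments"]:  # placeholder table names
        BashOperator(
            task_id=f"export_{table}",
            # "{{ ds }}" is a built-in Jinja template variable: the run's logical date.
            bash_command=f"echo exporting {table} for {{{{ ds }}}}",
        )
```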
All of this sounds promising. Let’s see how it translates into reality by taking a look at Airflow’s architecture. An Airflow installation generally consists of the following components:
- A scheduler, which handles both triggering scheduled workflows, and submitting tasks to the executor to run.
- An executor, which handles running tasks.
- In most production-suitable environments, executors will actually push task execution out to _workers_.
- A webserver, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
- A folder of DAG files, read by the scheduler and executor.
- A metadata database, used by the scheduler, executor and webserver to store the state of each DAG.
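If you want to check how these components are wired together on a given installation, Airflow exposes its configuration from Python. A quick sketch (assumes an installed Airflow environment; on versions before 2.3 the database connection lives in the [core] section instead of [database]):

```python
# Inspect where the components described above are configured.
from airflow.configuration import conf

print(conf.get("core", "dags_folder"))           # the folder of DAG files the scheduler parses
print(conf.get("core", "executor"))              # which executor runs the tasks
print(conf.get("database", "sql_alchemy_conn"))  # the metadata database connection
```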
Hands-on:
Let’s get our hands dirty and create our first flow.
An Apache Airflow DAG is a Python script that consists of the following logical blocks:
- Python library imports
- DAG argument specification
- The DAG definition, or instantiation
- Individual task definitions, which are the nodes of the DAG
- Finally, the task pipeline, which specifies the dependencies between tasks
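Putting those blocks together, a minimal DAG file might look like the sketch below (the DAG id, schedule and task logic are placeholders of my own; import paths can differ slightly between Airflow versions):

```python
# first_dag.py -- a minimal example illustrating the five logical blocks above.

# 1. Python library imports
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

# 2. DAG argument specification
default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# 3. The DAG definition, or instantiation
with DAG(
    dag_id="my_first_dag",
    default_args=default_args,
    description="A simple example pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # "schedule_interval" on Airflow versions before 2.4
    catchup=False,
):
    # 4. Individual task definitions, the nodes of the DAG
    extract = BashOperator(task_id="extract", bash_command="echo extracting data")
    transform = BashOperator(task_id="transform", bash_command="echo transforming data")
    load = BashOperator(task_id="load", bash_command="echo loading data")

    # 5. The task pipeline: dependencies between tasks
    extract >> transform >> load
```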
Once our Python file is ready, we can copy it to the DAG folder and launch it from the UI.
Setting up a first flow is pretty easy and straightforward.
Use cases:
Apache Airflow has supported many companies in reaching their goals, for example:
- Sift used Airflow for defining and organizing Machine Learning pipeline dependencies.
- SeniorLink increased the visibility of their batch processes and decoupled them.
- Experity deployed Airflow as an enterprise scheduling tool.
- Onefootball used Airflow to orchestrate SQL transformations in their data warehouses, and to send daily analytics emails.
The Good and the Bad:
After this brief overview of Apache Airflow, let’s explore the advantages and disadvantages of incorporating it into your project.
Starting with the Pros:
- Open source: Airflow is supported by a large and active tech community, making it easier to find answers to your issues online.
- Easy to use: Anyone with Python knowledge, the fourth most popular programming language worldwide [1], can deploy a workflow, which makes Airflow accessible to a wide range of developers.
Some of the Cons are:
- Code-based: While some may consider Airflow’s code-based approach an advantage, it can also be a disadvantage: compared to no-code alternatives, Airflow’s learning curve can be challenging and confusing for novice data engineers.
- Batch-only: Unlike Big Data tools such as Kafka or Spark, Airflow works exclusively with batches and is not designed for data streaming.
Conclusion:
With its extensive list of features and functions, I can see why Apache Airflow is rapidly becoming one of the most popular workflow management tools. While there is no one-size-fits-all solution, Airflow has proven to meet the data processing needs of numerous use cases.