Apache Airflow is an open-source platform designed for scheduling and managing complex workflows and data pipelines. Renowned for its flexibility and extensibility, it has become a popular choice among data engineers and developers across various industries. In this article, we delve into the setup and utilization of Apache Airflow on CentOS, a Linux distribution known for its stability and security.

Prerequisites

Before installing Apache Airflow, ensure that your CentOS system meets the following requirements:

  • CentOS 7 or newer
  • Python 3.6 or higher (recent Airflow releases require a newer Python, so check the requirements of the version you plan to install)
  • Pip, the Python package manager

Installation of Apache Airflow

  1. System Update: Begin by updating your system using the sudo yum update command to ensure you have the latest package versions.
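
    A quick sketch of the command (the -y flag simply auto-confirms the prompts):

    sudo yum -y update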

  2. Python and Pip Installation: If Python 3 is not yet installed, you can install it using sudo yum install python3. Pip is usually installed alongside Python 3; if it is missing, install it with sudo yum install python3-pip.
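
    A sketch of the commands, assuming the stock CentOS repositories (python3-pip is the package name to use if pip did not come along automatically):

    sudo yum install -y python3 python3-pip
    python3 --version
    pip3 --version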

  3. Creation of Virtual Environment: To isolate Airflow's dependencies from other Python applications, it's recommended to create a virtual environment. This can be done using the following commands:

    python3 -m venv airflow_venv
    source airflow_venv/bin/activate
    
  4. Apache Airflow Installation: With the virtual environment activated, install Airflow using pip. For a basic installation, use:

    pip install apache-airflow
    


Additional packages for features such as database support or integrations can be added as needed.
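
For example, PostgreSQL support can be added through the postgres extra. The constraints file shown below follows the pattern documented by the Airflow project; substitute the Airflow and Python versions you are actually installing:

    pip install "apache-airflow[postgres]" \
        --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.8.txt"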


Airflow Configuration

  1. Database Initialization: Airflow utilizes a database to track the state of tasks and workflows. Initialize the database using:

    airflow db init
    
  2. User Creation: To access the Airflow web interface, create a user account (you will be prompted for a password if one is not supplied on the command line):

    airflow users create \
        --username admin \
        --firstname FIRST_NAME \
        --lastname LAST_NAME \
        --role Admin \
        --email EMAIL_ADDRESS
    
  3. Configuration File: Airflow allows extensive configuration through the airflow.cfg file. This file is generated in the Airflow home directory (~/airflow by default) the first time you run an Airflow command, and it can be edited to suit your requirements.
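
    A few commonly adjusted settings, shown as an illustrative sketch (section and key names can vary slightly between Airflow versions, so cross-check against the comments in your generated file):

    [core]
    # Skip loading the bundled example DAGs
    load_examples = False
    # Executor to use; LocalExecutor allows parallel task runs but requires
    # a database backend other than the default SQLite
    executor = LocalExecutor

    [webserver]
    # Port the webserver listens on (8080 by default)
    web_server_port = 8080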


Running Airflow

  1. Web Interface: Launch the Airflow web interface using:

    airflow webserver -p 8080
    

    After launching, you can access the Airflow web interface at http://localhost:8080.
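
    If you are browsing from another machine and firewalld is active on the CentOS host (an assumption about your setup), the port also needs to be opened:

    sudo firewall-cmd --permanent --add-port=8080/tcp
    sudo firewall-cmd --reload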

  2. Scheduler: Start the scheduler, which regularly checks workflow definitions and executes tasks according to the schedule, using:

    airflow scheduler
    

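Both processes have to stay running, so in practice you would keep them in separate terminals, run them under a terminal multiplexer, or daemonize them with the -D flag; a minimal sketch:

    airflow webserver -p 8080 -D
    airflow scheduler -D

For a long-lived production deployment, managing both services under systemd is the more common approach.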

Creation and Management of Workflows

To create a workflow in Apache Airflow, you need to define a Directed Acyclic Graph (DAG): a graph that represents all tasks and their dependencies. Below are the basic steps to create a simple DAG:

  1. DAG File Creation: Create a Python file in the dags directory of your Airflow home (by default ~/airflow/dags; create the directory if it does not exist yet). For example, my_first_dag.py.

  2. DAG Definition: Define the DAG in the file using constructs like the ones below. Replace 'my_first_dag' and the other parameters according to your requirements:

    from datetime import timedelta
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.utils.dates import days_ago
    
    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }
    
    dag = DAG(
        'my_first_dag',
        default_args=default_args,
        description='A simple tutorial DAG',
        schedule_interval=timedelta(days=1),
        start_date=days_ago(2),
        tags=['example'],
    )
    
    start = DummyOperator(
        task_id='start',
        dag=dag,
    )
    
    end = DummyOperator(
        task_id='end',
        dag=dag,
    )
    
    start >> end
    

This example creates a DAG named my_first_dag with two tasks: start and end, where end follows start. Tasks are defined using DummyOperator, which is a basic operator that performs no action and is used for demonstration purposes.
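
Once the file is saved in the dags directory, you can check from the command line that Airflow picks it up. The commands below assume the Airflow 2.x CLI; the date passed to dags test is just an arbitrary logical date:

    airflow dags list
    airflow tasks list my_first_dag
    airflow dags test my_first_dag 2024-01-01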

Administration and Monitoring

Administration and monitoring are crucial aspects when operating Apache Airflow. The web interface provides detailed insights into the status of DAGs, tasks, their executions, logs, and more. You can manually trigger DAGs or individual tasks, stop running processes, or test specific parts of workflows.
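
Much of this is also available from the CLI, which is handy for scripting routine operations. A few illustrative Airflow 2.x commands, reusing the example DAG from above:

    airflow dags unpause my_first_dag
    airflow dags trigger my_first_dag
    airflow tasks test my_first_dag start 2024-01-01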

Security

Security should be considered when deploying Airflow in a production environment. Airflow offers several mechanisms for security, including user authentication, encryption of sensitive information, and setting permissions at the DAG level. It's recommended to explore the Airflow documentation and implement suitable security measures for your needs.
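
As one concrete example, Airflow uses a Fernet key to encrypt connection passwords and other secrets stored in its metadata database. A sketch of generating a key (the cryptography package comes in as an Airflow dependency) and where it belongs in airflow.cfg:

    python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

    # airflow.cfg
    [core]
    fernet_key = <paste the generated key here>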


Apache Airflow is a powerful tool for orchestrating complex workflows and data pipelines. Its flexibility and wide range of integrations enable efficient automation and monitoring of processes across various development and production environments. With proper configuration and usage on CentOS, Airflow can provide a robust and reliable solution for managing your data tasks.