Apache Airflow is an open-source workflow management platform that is widely used by organizations across the globe. It uses an internal database to store all the metadata about running tasks and workflows. In today's evolving technological landscape, every solution needs to be scalable to support increasingly distributed enterprise architectures.
Airflow can be made scalable by deploying it on a Kubernetes cluster. In this article, we will learn how to deploy the Airflow application on an AWS EKS cluster and persist its data using external storage such as Amazon FSx for Lustre.
Prerequisites
The prerequisites for deploying the application on an EKS cluster include the following.
- IAM roles for the cluster and the worker nodes.
- VPC network setup.
- An EKS cluster and an optional EC2 instance (controller node).
- An FSx for Lustre file system.
IAM roles
The node IAM role gives permissions to the kubelet running on the node to make calls to other APIs on your behalf. This includes permissions to access container registries where your application containers are stored.
Before you can launch a node group, you must create an IAM role for those worker nodes to use when they are launched.
Create IAM roles for the EKS cluster and worker nodes, as sketched in the CLI example after this list.
a. EKS_cluster_role with the AWS managed policy AmazonEKSClusterPolicy
b. EKSWorkerNode with the AWS managed policies AmazonEC2ContainerRegistryReadOnly, AmazonEKS_CNI_Policy, and AmazonEKSWorkerNodePolicy
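For reference, both roles can also be created from the AWS CLI. The sketch below is only illustrative: the trust policy file names are placeholders, assumed to contain trust relationships for eks.amazonaws.com (cluster role) and ec2.amazonaws.com (node role) respectively.

# Cluster role, trusted by eks.amazonaws.com
aws iam create-role --role-name EKS_cluster_role --assume-role-policy-document file://eks-cluster-trust-policy.json
aws iam attach-role-policy --role-name EKS_cluster_role --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

# Worker node role, trusted by ec2.amazonaws.com
aws iam create-role --role-name EKSWorkerNode --assume-role-policy-document file://ec2-trust-policy.json
aws iam attach-role-policy --role-name EKSWorkerNode --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name EKSWorkerNode --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name EKSWorkerNode --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly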
VPC network setup
Create a VPC in your desired region. Make sure that the public subnets are enabled for auto-assigning public IP addresses.
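As a small sketch, auto-assignment can also be switched on per public subnet from the AWS CLI (the subnet ID is a placeholder):

aws ec2 modify-subnet-attribute --subnet-id subnet-1111aaa22c3ddd44e --map-public-ip-on-launch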
Create EKS Cluster
An Amazon EKS cluster consists of two primary components:
- The Amazon EKS control plane, which consists of control plane nodes that run Kubernetes software such as etcd and the Kubernetes API server. These components run in AWS-owned accounts.
- A data plane made up of Amazon EKS worker nodes or Fargate compute registered to the control plane. Worker nodes run in customer accounts; Fargate compute runs in AWS-owned accounts.
Cluster configuration
Select the IAM role to allow the Kubernetes control plane to manage AWS resources on your behalf. This property cannot be changed after the cluster is created.
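For reference, a cluster with these settings can also be created from the AWS CLI. This is only a sketch: the account ID, role ARN, subnet IDs, and security group ID are placeholders.

aws eks create-cluster --name my-cluster \
  --kubernetes-version 1.29 \
  --role-arn arn:aws:iam::111122223333:role/EKS_cluster_role \
  --resources-vpc-config subnetIds=subnet-1111aaa22c3ddd44e,subnet-2222bbb33c4ddd55f,securityGroupIds=sg-1a22222b333c4d555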
Cluster access
By default, Amazon EKS creates an access entry that associates the AmazonEKSClusterAdminPolicy access policy to the IAM principal creating the cluster.
Any IAM principal assigned the IAM permission to create access entries can create an access entry that provides cluster access to any IAM principal after cluster creation.
The cluster will source authenticated IAM principals from both EKS access entry APIs and the aws-auth ConfigMap.
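As a hedged sketch of the access-entry route (the principal ARN is a placeholder), an additional IAM principal can be granted cluster-admin access after the cluster is created:

aws eks create-access-entry --cluster-name my-cluster \
  --principal-arn arn:aws:iam::111122223333:role/my-admin-role
aws eks associate-access-policy --cluster-name my-cluster \
  --principal-arn arn:aws:iam::111122223333:role/my-admin-role \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster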
Networking
Each managed node group requires you to specify one or more subnets that are defined within the VPC used by the Amazon EKS cluster. Nodes are launched into the subnets that you provide. The size of your subnets determines the number of nodes and pods that you can run within them. You can run nodes across multiple AWS Availability Zones by providing multiple subnets that are each associated with different Availability Zones. Nodes are distributed evenly across all of the designated Availability Zones.
Cluster endpoint access
You can limit, or completely disable, public access from the internet to your Kubernetes cluster endpoint.
Amazon EKS creates an endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using Kubernetes management tools such as kubectl). By default, this API server endpoint is public to the internet, and access to the API server is secured using a combination of AWS Identity and Access Management (IAM) and native Kubernetes Role Based Access Control (RBAC).
You can, optionally, limit the CIDR blocks that can access the public endpoint. If you limit access to specific CIDR blocks, then it is recommended that you also enable the private endpoint, or ensure that the CIDR blocks that you specify include the addresses that worker nodes and Fargate pods (if you use them) access the public endpoint from.
You can enable private access to the Kubernetes API server so that all communication between your worker nodes and the API server stays within your VPC. You can limit the IP addresses that can access your API server from the internet, or completely disable internet access to the API server.
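A hedged sketch of tightening endpoint access from the CLI (the CIDR block is a documentation placeholder):

aws eks update-cluster-config --name my-cluster \
  --resources-vpc-config endpointPublicAccess=true,publicAccessCidrs="203.0.113.0/24",endpointPrivateAccess=true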
Controller node setup
The controller node may be an EC2 instance or a bare-metal machine on which we install and configure the AWS CLI, the IAM authenticator, and kubectl for accessing and making changes to the EKS cluster.
1. AWS CLI

snap install aws-cli --classic
aws configure
2. IAM Authenticator

curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.15.10/2020-02-22/bin/linux/amd64/aws-iam-authenticator

Apply execute permissions to the binary.

chmod +x ./aws-iam-authenticator

Copy the binary to a folder in your PATH.

sudo mv ./aws-iam-authenticator /usr/local/bin

Verify the installation.

aws-iam-authenticator help
3. kubectl

Download the kubectl binary for your cluster's Kubernetes version from Amazon S3.
Kubernetes version on the EKS cluster: 1.29
Instance architecture: Linux (amd64)

curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.29.0/2024-01-04/bin/linux/amd64/kubectl

Apply execute permissions to the binary.

chmod +x ./kubectl

Copy the binary to a folder in your PATH.

mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH

After you install kubectl, you can verify its version.

kubectl version --client
To configure kubectl to communicate with the EKS cluster, run the following command. Replace region-code with the AWS Region that your cluster is in and my-cluster with the name of your cluster.

aws eks update-kubeconfig --region region-code --name my-cluster

Check the complete kubectl configuration by running the following command.

more /root/.kube/config

Now, upon running the following command, we should be able to see the services on our cluster.

kubectl get svc
Create Worker Nodes
Create a node group on the EKS cluster’s compute tab.
A node group is a group of EC2 instances that supply compute capacity to your Amazon EKS cluster. You can add multiple node groups to your cluster.
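For reference, the same node group can also be created from the AWS CLI instead of the console; a minimal sketch, where the node role ARN, subnet IDs, instance type, and sizes are placeholders to adjust:

aws eks create-nodegroup --cluster-name my-cluster \
  --nodegroup-name airflow-nodes \
  --node-role arn:aws:iam::111122223333:role/EKSWorkerNode \
  --subnets subnet-1111aaa22c3ddd44e subnet-2222bbb33c4ddd55f \
  --instance-types t3.large \
  --scaling-config minSize=1,maxSize=3,desiredSize=2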
Node group configuration
Amazon EKS managed node groups make it easy to provision compute capacity for your cluster. Node groups consist of one or more EC2 instances running the latest EKS-optimized AMIs. All nodes are provisioned as part of an EC2 autoscaling group that is managed for you by Amazon EKS and all resources including EC2 instances and autoscaling groups run within your AWS account.
- You can apply Kubernetes labels during node group creation and update them at any time.
- Nodes are automatically tagged for auto-discovery by the Kubernetes cluster autoscaler.
Node group compute configuration
Provision nodes for your cluster with the latest EKS-optimized AMIs. Easily update nodes to the latest AMI or Kubernetes versions when they are available.
Use launch templates to customize the configuration of the EC2 instances created as part of your node group.
Node group scaling configuration
You can change the size of your node group at any time.
Node group update configuration
Amazon EKS managed node groups support updating nodes to newer AMI versions in parallel. By default, nodes are updated one at a time. However, if your applications can tolerate a higher level of disruption, you can decrease the overall time to complete a node group version update by increasing the parallelization level. You can increase this parallelization level by setting the node group's maximum number of unavailable nodes, either as an absolute number or as a percentage of the node group size.
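As an example sketch (cluster and node group names are placeholders), the parallelization level can be set from the CLI:

aws eks update-nodegroup-config --cluster-name my-cluster \
  --nodegroup-name airflow-nodes \
  --update-config maxUnavailable=2
# or, as a percentage of the node group size:
#   --update-config maxUnavailablePercentage=25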
Node group network configuration
Select multiple subnets for a node group to provision nodes across multiple AWS availability zones.
Now we can see the list of nodes created under this node group by running the following command.

kubectl get nodes
Also, we can see the node in the EC2 console with the same private IP address as shown in the kubectl output.
Create FSx
The type of file system we choose is Amazon FSx for Lustre, as it is cost-effective, scalable, and high-performance file storage for compute workloads such as Airflow.
File system details
Persistent file systems are ideal for longer-term storage and workloads. Data is replicated and file servers are replaced if they fail.
Scratch file systems are ideal for temporary storage and shorter-term processing of data. Data is not replicated and does not persist if a file server fails.
Choose SSD storage for latency-sensitive workloads or workloads requiring the highest levels of IOPS/throughput.
Choose HDD storage for throughput-focused workloads that aren’t latency-sensitive. For HDD-based file systems, the optional SSD cache improves performance by automatically placing your most frequently read data on SSD (the cache size is 20% of your file system size).
Throughput per unit of storage represents the throughput that will be available per TiB of provisioned storage. Total throughput for your file system is this value multiplied by storage capacity.
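For reference, a file system with these settings can also be created from the AWS CLI; a hedged sketch, where the storage capacity, throughput tier, subnet, and security group are placeholders to adjust:

aws fsx create-file-system --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --storage-type SSD \
  --subnet-ids subnet-1111aaa22c3ddd44e \
  --security-group-ids sg-1a22222b333c4d555 \
  --lustre-configuration DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=125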
Network & security
The VPC Security Groups associated with your file system’s network interfaces determine which compute instances can access your file system. If you don’t select a VPC Security Group, Amazon FSx will automatically associate your VPC’s default Security Group with your file system’s network interfaces.
Note: if the selected security groups do not permit Lustre LNET network traffic on port 988, the file system cannot be created.
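A sketch of adding the required inbound rules to the security group used above (the security group ID is the same placeholder used elsewhere in this article; the 1018-1023 range follows the FSx for Lustre documentation):

# Allow Lustre LNET traffic from clients in the same security group
aws ec2 authorize-security-group-ingress --group-id sg-1a22222b333c4d555 \
  --protocol tcp --port 988 --source-group sg-1a22222b333c4d555
aws ec2 authorize-security-group-ingress --group-id sg-1a22222b333c4d555 \
  --protocol tcp --port 1018-1023 --source-group sg-1a22222b333c4d555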
Deploying Apache Airflow on an EKS Cluster
Install the FSx CSI Driver
To use FSx as a persistent volume on an EKS cluster, the IAM role attached to the worker nodes must have permission to access FSx with read and write access. This can be achieved by adding the following IAM policy to the role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateServiceLinkedRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy"
      ],
      "Resource": "arn:aws:iam::*:role/aws-service-role/fsx.amazonaws.com/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "fsx:*"
      ],
      "Resource": ["*"]
    }
  ]
}
Once you have added the policy to your instance IAM role, you can deploy the FSx CSI driver. This deploys a StatefulSet, a DaemonSet, and all the RBAC rules needed to allow the FSx CSI driver to manage your storage:

kubectl create -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
Namespace for Airflow deployment
There will be a few namespaces created by default in the EKS cluster. They can be listed using the following command.

kubectl get ns
We can create a custom namespace for Airflow.

kubectl create namespace airflow
kubectl get ns
Going forward, we are going to use this namespace for installing Airflow. Hence, we can set it as the default namespace using the following command.

kubectl config set-context --current --namespace=airflow
We can also skip this step and append -n airflow to each command that needs to operate on this namespace.
Note: once the application installation is complete, the default namespace setting must be reset; otherwise, all future operations will run in the airflow namespace.
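The reset itself is a single command, for example:

kubectl config set-context --current --namespace=default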
Preparing resources for Airflow deployment
Create a directory for managing the Airflow resources.

mkdir ~/.kube/airflow-resources
cd ~/.kube/airflow-resources
Storage Class
Create the storage class configuration file named airflow-storageclass.yaml as shown below, specifying the subnet and the security group that were used to create the FSx file system.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aws-fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-1111aaa22c3ddd44e
  securityGroupIds: sg-1a22222b333c4d555
Apply the storage class using the following command.

kubectl apply -f airflow-storageclass.yaml

Check the storage class using the following command.

kubectl describe sc aws-fsx-sc
Persistent Volume Claim
Create the persistent volume claim configuration file for the FSx file system, named persistent-volume-claim.yaml, as shown below.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aws-fsx-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: aws-fsx-sc
  resources:
    requests:
      storage: 100Gi
Apply the volume claim using the following command.

kubectl apply -f persistent-volume-claim.yaml

Check the volume claim configuration using the following command.

kubectl describe pvc aws-fsx-pvc
Config Map
Create the ConfigMap configuration file named airflow-configmap.yaml as shown below.

apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-config
data:
  executor: "KubernetesExecutor"
Apply the config map using the following command.

kubectl apply -f airflow-configmap.yaml

Check the config using the following command.

kubectl describe configmap airflow-config
Scheduler
Create the scheduler service account configuration for Airflow, named scheduler-serviceaccount.yaml, as shown below.

kind: ServiceAccount
apiVersion: v1
metadata:
  name: airflow-scheduler
  labels:
    tier: airflow
    component: scheduler
    release: release-name
    chart: "airflow-1.5.0"
    heritage: Helm
Apply the scheduler configuration using the following command.

kubectl apply -f scheduler-serviceaccount.yaml

Check the scheduler configuration using the following command.

kubectl describe sa airflow-scheduler
Pod Launcher Role and Role Binding
Create the pod launcher role, named pod-launcher-role.yaml, as shown below.

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-launcher-role
  namespace: "airflow"
  labels:
    tier: airflow
    release: release-name
    chart: "airflow-1.5.0"
    heritage: Helm
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "create", "delete", "watch", "patch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "create", "delete", "watch", "patch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "list", "create", "delete", "watch", "patch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list"]
Create the pod launcher role binding, named pod-launcher-role-binding.yaml, as shown below.

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: "airflow"
  name: pod-launcher-rolebinding
  labels:
    tier: airflow
    release: release-name
    chart: "airflow-1.5.0"
    heritage: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-launcher-role
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler
    namespace: "airflow"
Apply the role and role binding configurations using the following commands.

kubectl apply -f pod-launcher-role.yaml
kubectl apply -f pod-launcher-role-binding.yaml

Check the configuration of the role and role binding using the following commands.

kubectl describe role pod-launcher-role
kubectl describe rolebinding pod-launcher-rolebinding
Deployment
Now that all our prerequisites and cluster resources are ready, we can create the Airflow deployment file, named airflow-deployment.yaml.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: airflow
  namespace: "airflow"
spec:
  replicas: 1
  selector:
    matchLabels:
      deploy: airflow
      name: airflow
      component: webserver
  template:
    metadata:
      labels:
        deploy: airflow
        name: airflow
        component: webserver
    spec:
      serviceAccountName: airflow-scheduler
      containers:
        - name: airflow-scheduler
          image: 'apache/airflow:2.2.4'
          imagePullPolicy: IfNotPresent
          env:
            - name: AIRFLOW__CORE__EXECUTOR
              valueFrom:
                configMapKeyRef:
                  name: airflow-config
                  key: executor
          volumeMounts:
            - name: aws-fsx-pv
              mountPath: /opt/airflow-kubernetes/fsx
          command:
            - airflow
          args:
            - scheduler
        - name: airflow-webserver
          env:
            - name: AIRFLOW__CORE__EXECUTOR
              valueFrom:
                configMapKeyRef:
                  name: airflow-config
                  key: executor
          image: 'apache/airflow:2.2.4'
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          command:
            - airflow
          args:
            - webserver
      restartPolicy: Always
      volumes:
        - name: aws-fsx-pv
          persistentVolumeClaim:
            claimName: aws-fsx-pvc
Deploy the Airflow webserver by applying the deployment configuration using the following command.

kubectl apply -f airflow-deployment.yaml

The entire configuration of the Airflow deployment, including everything we applied earlier, can now be checked using the following command.

kubectl describe deployment airflow

To check the status of the pods running in the deployment, use the following command.

kubectl get pods
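If a pod stays in a Pending or CrashLoopBackOff state, the usual next step is to inspect it; a small sketch, where the pod name is a placeholder taken from the kubectl get pods output:

kubectl describe pod <airflow-pod-name>
kubectl logs <airflow-pod-name> -c airflow-webserver
kubectl logs <airflow-pod-name> -c airflow-scheduler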
				
			
Service
We need to create a Kubernetes service for the Airflow webserver. The YAML file that creates the webserver service with the required network port is called airflow-service.yaml and is shown below.

kind: Service
apiVersion: v1
metadata:
  name: webserver-svc
  namespace: airflow
spec:
  type: LoadBalancer
  selector:
    deploy: airflow
    name: airflow
    component: webserver
  ports:
    - name: airflow-ui
      protocol: TCP
      port: 8080
      targetPort: 8080
Now we can create the webserver service using the following command.

kubectl apply -f airflow-service.yaml

And check the details of the webserver service using the following command.

kubectl describe service webserver-svc
Airflow is finally deployed on the Kubernetes cluster and is running successfully. We can open the Airflow webserver UI using the external endpoint of the webserver service.
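A quick way to find that endpoint, assuming the LoadBalancer service defined above, is to read the external hostname from the service and open it on port 8080:

kubectl get service webserver-svc -n airflow
# Open http://<EXTERNAL-IP or hostname>:8080 in a browser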
Conclusion
Now, we have learnt how to install Airflow on EKS and use FSx as the persistent volume to preserve the data.
In this walkthrough, we used the default database engine that ships with Airflow. However, this is not advised for production scenarios.
For production deployments, using MySQL or PostgreSQL is the better choice; managed database services such as Amazon RDS or Aurora can host the Airflow metadata database.
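As a hedged sketch of that production setup (the secret name, endpoint, and credentials below are placeholders, and the environment variable shown applies to the Airflow 2.2.x image used above), the connection string can be stored in a Kubernetes Secret and injected into both Airflow containers:

apiVersion: v1
kind: Secret
metadata:
  name: airflow-db-secret        # hypothetical name
  namespace: airflow
type: Opaque
stringData:
  # Placeholder connection string for an external PostgreSQL database (for example, Amazon RDS)
  sql_alchemy_conn: postgresql+psycopg2://airflow:CHANGE_ME@my-airflow-db.example.us-east-1.rds.amazonaws.com:5432/airflow

Then, in airflow-deployment.yaml, each container would reference the secret:

            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-db-secret
                  key: sql_alchemy_conn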
We highly appreciate your patience and the time spent reading this article.
Stay tuned for more content.
Happy reading! Let us learn together!