Apache Airflow is an open-source workflow management platform that is widely used by organizations across the globe. It uses an internal database to store metadata about running tasks and workflows. In today's evolving technological landscape, every solution needs to scale to support distributed enterprise architectures.
Airflow can be made scalable by deploying it on a Kubernetes cluster. In this article, we will learn how to deploy Airflow on an AWS EKS cluster and persist its data on external storage, Amazon FSx for Lustre.
Prerequisites
The prerequisites for deploying the application on an EKS cluster include the following:
- IAM Roles for the Cluster and the worker nodes.
- VPC network setup.
- EKS cluster and an optional EC2 instance (controller node).
- FSx for Lustre file system.
IAM roles
The node IAM role gives permissions to the kubelet running on the node to make calls to other APIs on your behalf. This includes permissions to access container registries where your application containers are stored.
Before you can launch a node group, you must create an IAM role for those worker nodes to use when they are launched.
Create IAM roles for the EKS cluster and worker nodes.
a. EKS_cluster_role with AWS managed policy AmazonEKSClusterPolicy
b. EKSWorkerNode with AWS managed policies AmazonEC2ContainerRegistryReadOnly, AmazonEKS_CNI_Policy, and AmazonEKSWorkerNodePolicy
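If you prefer the CLI, the same roles can be created along these lines. This is only a sketch; the trust policy files eks-cluster-trust.json (trusting eks.amazonaws.com) and ec2-trust.json (trusting ec2.amazonaws.com) are hypothetical files that you create yourself.
# Cluster role (sketch; trust policy must allow eks.amazonaws.com to assume the role)
aws iam create-role --role-name EKS_cluster_role --assume-role-policy-document file://eks-cluster-trust.json
aws iam attach-role-policy --role-name EKS_cluster_role --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
# Worker node role (sketch; trust policy must allow ec2.amazonaws.com to assume the role)
aws iam create-role --role-name EKSWorkerNode --assume-role-policy-document file://ec2-trust.json
aws iam attach-role-policy --role-name EKSWorkerNode --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name EKSWorkerNode --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name EKSWorkerNode --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly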
VPC network setup
Create a VPC in your desired region. Make sure that the public subnets are configured to auto-assign public IP addresses.
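If a public subnet does not already auto-assign public IPs, the setting can be enabled per subnet with the AWS CLI; the subnet ID below is a placeholder.
aws ec2 modify-subnet-attribute --subnet-id <public-subnet-id> --map-public-ip-on-launch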
Create EKS Cluster
An Amazon EKS cluster consists of two primary components:
- The Amazon EKS control plane which consists of control plane nodes that run the Kubernetes software, such as etcd and the Kubernetes API server. These components run in AWS owned accounts.
- A data plane made up of Amazon EKS worker nodes or Fargate compute registered to the control plane. Worker nodes run in customer accounts; Fargate compute runs in AWS owned accounts.
Cluster configuration
Select the IAM role to allow the Kubernetes control plane to manage AWS resources on your behalf. This property cannot be changed after the cluster is created.
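For reference, a cluster with an equivalent configuration can also be created from the command line; a minimal sketch, where the account ID, subnet IDs, and security group ID are placeholders and my-cluster matches the name used later in this article.
aws eks create-cluster --name my-cluster --kubernetes-version 1.29 \
  --role-arn arn:aws:iam::<account-id>:role/EKS_cluster_role \
  --resources-vpc-config subnetIds=<subnet-1>,<subnet-2>,securityGroupIds=<sg-id>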
Cluster access
By default, Amazon EKS creates an access entry that associates the AmazonEKSClusterAdminPolicy access policy to the IAM principal creating the cluster.
Any IAM principal assigned the IAM permission to create access entries can create an access entry that provides cluster access to any IAM principal after cluster creation.
The cluster will source authenticated IAM principals from both EKS access entry APIs and the aws-auth ConfigMap.
Networking
Each managed node group requires you to specify one or more subnets that are defined within the VPC used by the Amazon EKS cluster. Nodes are launched into the subnets that you provide. The size of your subnets determines the number of nodes and pods that you can run within them. You can run nodes across multiple AWS Availability Zones by providing multiple subnets that are each associated with a different Availability Zone. Nodes are distributed evenly across all of the designated Availability Zones.
Cluster endpoint access
You can limit, or completely disable, public access from the internet to your Kubernetes cluster endpoint.
Amazon EKS creates an endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using Kubernetes management tools such as kubectl). By default, this API server endpoint is public to the internet, and access to the API server is secured using a combination of AWS Identity and Access Management (IAM) and native Kubernetes Role Based Access Control (RBAC).
You can, optionally, limit the CIDR blocks that can access the public endpoint. If you limit access to specific CIDR blocks, then it is recommended that you also enable the private endpoint, or ensure that the CIDR blocks that you specify include the addresses that worker nodes and Fargate pods (if you use them) access the public endpoint from.
You can enable private access to the Kubernetes API server so that all communication between your worker nodes and the API server stays within your VPC. You can limit the IP addresses that can access your API server from the internet, or completely disable internet access to the API server.
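These endpoint settings can also be changed after the cluster is created; a hedged example that keeps public access but restricts it to a single CIDR block (a placeholder here) and enables the private endpoint.
aws eks update-cluster-config --region region-code --name my-cluster \
  --resources-vpc-config endpointPublicAccess=true,publicAccessCidrs="203.0.113.0/24",endpointPrivateAccess=true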
Controller node setup
The controller node may be an EC2 instance or a bare-metal machine on which we install and configure the AWS CLI, aws-iam-authenticator, and kubectl to access and manage the EKS cluster.
1. AWS CLI
snap install aws-cli --classic
aws configure
2. IAM Authenticator
curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.15.10/2020-02-22/bin/linux/amd64/aws-iam-authenticator
Apply execute permissions to the binary.
chmod +x ./aws-iam-authenticator
Copy the binary to a folder in your PATH.
sudo mv ./aws-iam-authenticator /usr/local/bin
Verify the installation
aws-iam-authenticator help
3. Kubectl
Download the kubectl binary for your cluster’s Kubernetes version from Amazon S3.
Kubernetes version on the EKS cluster: 1.29
Instance architecture: Linux (amd64)
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.29.0/2024-01-04/bin/linux/amd64/kubectl
Apply execute permissions to the binary.
chmod +x ./kubectl
Copy the binary to a folder in your PATH.
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
After you install kubectl, you can verify its version.
kubectl version --client
To configure kubectl to communicate with the EKS cluster, run the following command. Replace region-code with the AWS Region that your cluster is in and my-cluster with the name of your cluster.
aws eks update-kubeconfig --region region-code --name my-cluster
Check the complete kubectl configuration by running the following command.
more /root/.kube/config
Now, upon running the following command, we should be able to see the services in our cluster.
kubectl get svc
Create Worker Nodes
Create a node group on the EKS cluster’s compute tab.
A node group is a group of EC2 instances that supply compute capacity to your Amazon EKS cluster. You can add multiple node groups to your cluster.
Node group configuration
Amazon EKS managed node groups make it easy to provision compute capacity for your cluster. Node groups consist of one or more EC2 instances running the latest EKS-optimized AMIs. All nodes are provisioned as part of an EC2 autoscaling group that is managed for you by Amazon EKS and all resources including EC2 instances and autoscaling groups run within your AWS account.
- You can apply Kubernetes labels during node group creation and update them at any time.
- Nodes are automatically tagged for auto-discovery by the Kubernetes cluster auto scaler.
Node group compute configuration
Provision nodes for your cluster with the latest EKS-optimized AMIs. Easily update nodes to the latest AMI or Kubernetes versions when they are available.
Use launch templates to customize the configuration of the EC2 instances created as part of your node group.
Node group scaling configuration
You can change the size of your node group at any time.
Node group update configuration
Amazon EKS managed node groups support updating nodes to newer AMI versions in parallel. By default, nodes are updated one at a time. However, if your applications can tolerate a higher level of disruption, you can decrease the overall time to complete a node group version update by increasing the parallelization level. You can increase this parallelization level by setting the node group's maximum number of unavailable nodes, either as an absolute number or as a percentage of the node group size.
Node group network configuration
Select multiple subnets for a node group to provision nodes across multiple AWS availability zones.
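As with the cluster itself, a node group with an equivalent configuration can also be created from the CLI; a sketch, where the account ID, subnet IDs, instance type, and scaling values are placeholders chosen for illustration.
aws eks create-nodegroup --cluster-name my-cluster --nodegroup-name airflow-nodes \
  --node-role arn:aws:iam::<account-id>:role/EKSWorkerNode \
  --subnets <subnet-1> <subnet-2> --instance-types t3.large \
  --scaling-config minSize=1,maxSize=3,desiredSize=2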
Now we can see the list of nodes created under this node group by running the following command.
kubectl get nodes
We can also see the nodes in the EC2 console, with the same private IP addresses reported by kubectl.
Create FSx
The file system type we choose is Amazon FSx for Lustre, as it provides cost-effective, scalable, high-performance file storage for compute workloads like Airflow.
File system details
Persistent file systems are ideal for longer-term storage and workloads. Data is replicated and file servers are replaced if they fail.
Scratch file systems are ideal for temporary storage and shorter-term processing of data. Data is not replicated and does not persist if a file server fails.
Choose SSD storage for latency-sensitive workloads or workloads requiring the highest levels of IOPS/throughput.
Choose HDD storage for throughput-focused workloads that aren’t latency-sensitive. For HDD-based file systems, the optional SSD cache improves performance by automatically placing your most frequently read data on SSD (the cache size is 20% of your file system size).
Throughput per unit of storage is the throughput that will be available per TiB of provisioned storage. The total throughput for your file system is this value multiplied by the storage capacity; for example, a 2.4 TiB file system provisioned at 250 MB/s per TiB delivers about 600 MB/s of total throughput.
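For reference, a persistent SSD file system with these characteristics can also be created from the CLI; this is only a sketch, with the subnet and security group IDs as placeholders and the deployment type and throughput chosen for illustration.
aws fsx create-file-system --file-system-type LUSTRE --storage-capacity 1200 \
  --subnet-ids <subnet-id> --security-group-ids <sg-id> \
  --lustre-configuration DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=250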
Network & security
The VPC Security Groups associated with your file system’s network interfaces determine which compute instances can access your file system. If you don’t select a VPC Security Group, Amazon FSx will automatically associate your VPC’s default Security Group with your file system’s network interfaces.
Note that the selected security groups must permit Lustre LNET network traffic on port 988; otherwise, creation fails with the error "The provided security groups do not permit Lustre LNET network traffic on port 988 for the file system to be created."
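If you manage the security group yourself, the required rule can be added with the AWS CLI; a hedged example that allows port 988 from the security group itself (the group ID is a placeholder). Depending on the Lustre client version, ports 1018-1023 may also need to be allowed.
aws ec2 authorize-security-group-ingress --group-id <sg-id> --protocol tcp --port 988 --source-group <sg-id>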
Deploying Apache Airflow on an EKS Cluster
Install the FSx CSI Driver
To use FSx as a persistent volume on an EKS cluster, the IAM role attached to the worker nodes must have read and write access to FSx. This can be achieved by adding the following IAM policy to the node (instance) role.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateServiceLinkedRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy"
      ],
      "Resource": "arn:aws:iam::*:role/aws-service-role/fsx.amazonaws.com/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "fsx:*"
      ],
      "Resource": ["*"]
    }
  ]
}
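One way to attach this policy, assuming it is saved locally as fsx-csi-policy.json (a hypothetical file name) and that the worker node role is named EKSWorkerNode as above, is to add it as an inline policy with the AWS CLI.
aws iam put-role-policy --role-name EKSWorkerNode --policy-name fsx-csi-policy --policy-document file://fsx-csi-policy.json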
Once the policy is attached to your instance IAM role, you can deploy the FSx CSI driver. This deploys a StatefulSet, a DaemonSet, and all the RBAC rules needed for the FSx CSI driver to manage your storage:
kubectl create -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"
Namespace for Airflow deployment
A few namespaces are created by default in the EKS cluster. They can be listed using the following command.
kubectl get ns
We can create a custom namespace for Airflow.
kubectl create namespace airflow
kubectl get ns
Going forward, we will use this namespace for installing Airflow, so we can set it as the default for the current context using the following command.
kubectl config set-context --current --namespace=airflow
Alternatively, we can skip this step and append -n airflow to every command that operates on our namespace.
Note: Once the application installation is complete, this default namespace setting must be reset; otherwise, all future operations will continue to run in the airflow namespace.
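For example, the context can later be pointed back at the default namespace as follows.
kubectl config set-context --current --namespace=default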
Preparing resources for Airflow deployment
Create a directory for managing the Airflow resources.
mkdir ~/.kube/airflow-resources
cd ~/.kube/airflow-resources
Storage Class
Create the storage class configuration file named airflow-storageclass.yaml as shown below. The subnet and security group used for the FSx file system must be specified in its parameters.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aws-fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-1111aaa22c3ddd44e
  securityGroupIds: sg-1a22222b333c4d555
Apply the storage class using the following command.
kubectl apply -f airflow-storageclass.yaml
Check the storage class using the following command.
kubectl describe sc aws-fsx-sc
Persistent Volume Claim
Create the persistent volume claim configuration file for the FSx file system, named persistent-volume-claim.yaml, as shown below.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aws-fsx-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: aws-fsx-sc
  resources:
    requests:
      storage: 100Gi
Apply the volume claim using the following command.
kubectl apply -f persistent-volume-claim.yaml
Check the volume claim configuration using the following command.
kubectl describe pvc aws-fsx-pvc
Config Map
Create the config map configuration file named airflow-configmap.yaml as shown below.
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-config
data:
  executor: "KubernetesExecutor"
Apply the config map using the following command.
kubectl apply -f airflow-configmap.yaml
Check the config using the following command.
kubectl describe configmap airflow-config
Scheduler
Create the service account for the Airflow scheduler in a file named scheduler-serviceaccount.yaml as shown below.
kind: ServiceAccount
apiVersion: v1
metadata:
  name: airflow-scheduler
  labels:
    tier: airflow
    component: scheduler
    release: release-name
    chart: "airflow-1.5.0"
    heritage: Helm
Apply the scheduler configuration using the following command.
kubectl apply -f scheduler-serviceaccount.yaml
Check the scheduler configuration using the following command.
kubectl describe sa airflow-scheduler
Pod Launcher Role and Role Binding
Create the pod launcher role in a file named pod-launcher-role.yaml as shown below.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-launcher-role
  namespace: "airflow"
  labels:
    tier: airflow
    release: release-name
    chart: "airflow-1.5.0"
    heritage: Helm
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "create", "delete", "watch", "patch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "create", "delete", "watch", "patch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "list", "create", "delete", "watch", "patch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list"]
Create the pod launcher role binding in a file named pod-launcher-role-binding.yaml as shown below.
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: "airflow"
  name: pod-launcher-rolebinding
  labels:
    tier: airflow
    release: release-name
    chart: "airflow-1.5.0"
    heritage: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-launcher-role
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler
    namespace: "airflow"
Apply the role and role binding configurations using the following commands.
kubectl apply -f pod-launcher-role.yaml
kubectl apply -f pod-launcher-role-binding.yaml
Check the configuration of the role and role binding using the following commands.
kubectl describe role pod-launcher-role
kubectl describe rolebinding pod-launcher-rolebinding
Deployment
Now that all the prerequisites and cluster resources are ready, we can create the deployment file for Airflow, named airflow-deployment.yaml.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: airflow
  namespace: "airflow"
spec:
  replicas: 1
  selector:
    matchLabels:
      deploy: airflow
      name: airflow
      component: webserver
  template:
    metadata:
      labels:
        deploy: airflow
        name: airflow
        component: webserver
    spec:
      serviceAccountName: airflow-scheduler
      containers:
        - name: airflow-scheduler
          image: 'apache/airflow:2.2.4'
          imagePullPolicy: IfNotPresent
          env:
            - name: AIRFLOW__CORE__EXECUTOR
              valueFrom:
                configMapKeyRef:
                  name: airflow-config
                  key: executor
          volumeMounts:
            - name: aws-fsx-pv
              mountPath: /opt/airflow-kubernetes/fsx
          command:
            - airflow
          args:
            - scheduler
        - name: airflow-webserver
          env:
            - name: AIRFLOW__CORE__EXECUTOR
              valueFrom:
                configMapKeyRef:
                  name: airflow-config
                  key: executor
          image: 'apache/airflow:2.2.4'
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          command:
            - airflow
          args:
            - webserver
      restartPolicy: Always
      volumes:
        - name: aws-fsx-pv
          persistentVolumeClaim:
            claimName: aws-fsx-pvc
Deploy the Airflow scheduler and webserver by applying the deployment configuration using the following command.
kubectl apply -f airflow-deployment.yaml
The entire configuration of the Airflow deployment, including everything we applied earlier, can now be checked using the following command.
kubectl describe deployment airflow
To check the status of the pods running in the deployment, use the following command.
kubectl get pods
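If the web UI later reports that the metadata database is not initialized or that no login user exists, you may need to run the initialization commands inside the webserver container. This is a hedged example: airflow-xxxxxxxxxx-yyyyy stands for the pod name returned by kubectl get pods, and the user details are placeholders.
kubectl exec -it airflow-xxxxxxxxxx-yyyyy -c airflow-webserver -- airflow db init
kubectl exec -it airflow-xxxxxxxxxx-yyyyy -c airflow-webserver -- airflow users create --username admin --password <password> --firstname Admin --lastname User --role Admin --email admin@example.com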
Service
We need to create a Kubernetes service for the Airflow webserver. The YAML file that creates the webserver service with the required network port is named airflow-service.yaml and is shown below.
kind: Service
apiVersion: v1
metadata:
  name: webserver-svc
  namespace: airflow
spec:
  type: LoadBalancer
  selector:
    deploy: airflow
    name: airflow
    component: webserver
  ports:
    - name: airflow-ui
      protocol: TCP
      port: 8080
      targetPort: 8080
Now we can create the webserver service using the following command.
kubectl apply -f airflow-service.yaml
And check the details of the webserver using the following command.
kubectl describe service webserver-svc
Airflow is now deployed on the Kubernetes cluster and running successfully. We can open the Airflow webserver UI using the external endpoint of the webserver-svc load balancer.
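For example, the load balancer hostname can be read from the service, and the UI is then reachable at http://<hostname>:8080.
kubectl get service webserver-svc -n airflow -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'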
Conclusion
In this article, we have learnt how to install Airflow on EKS and use FSx as the persistent volume to preserve the data.
In the process of learning, I have used the default database engine that ships with Airflow (SQLite). However, this is not advised for a production scenario.
When deploying for production, using MySQL or PostgreSQL is better; managed services such as Amazon RDS or Aurora can also host the metadata database.
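As a hedged sketch of that production setup (the RDS endpoint, credentials, and database name below are placeholders), Airflow can be pointed at an external PostgreSQL database by adding an environment variable to both containers in airflow-deployment.yaml; on Airflow 2.2.x the key is AIRFLOW__CORE__SQL_ALCHEMY_CONN, while newer releases use AIRFLOW__DATABASE__SQL_ALCHEMY_CONN.
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              value: "postgresql+psycopg2://airflow:<password>@<rds-endpoint>:5432/airflow"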
We highly appreciate your patience and the time you spent reading this article.
Stay tuned for more content.
Happy reading!!! Let us learn together!!!