APrime’s LLM Kit, Part 3: In-Depth Walkthrough of Self-Hosted AI Setup
This guide provides a detailed breakdown of the various steps required to launch a self-hosted AI Model in AWS via our free quickstart script and Terraform modules. If you have additional questions or want to learn more about anything in this guide, please reach out to us via llm@aprime.io.
This is Part 3 of our three-part series on hosting your own LLM in AWS. It is a direct follow-up to Part 2, our quickstart guide. For most users, the quickstart guide will suffice to get you up and running with default settings. To learn more about why it makes sense to host your own model, please visit the introduction in Part 1.
Prerequisites
This detailed walkthrough assumes that you have cloned our demo repo and inspected the quickstart.sh script and our open-source Terraform module. If you just want to get up and running with the default settings, refer back to the previous post in this series, the quickstart guide.
Deploying the Text Generation Inference (TGI) Service
Resource Requirements
For optimal performance, we recommend allocating a single ECS task per EC2 instance. This setup ensures that each task has exclusive access to the instance's GPU resources. Given this, we allocate slightly less CPU and memory to the task than the EC2 instance provides, leaving a small buffer for other processes running on the instance.
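For illustration, here is roughly how that sizing might look as task-level inputs on the ECS service module shown later, assuming a g4dn.xlarge host (4 vCPUs, 16 GiB of memory); treat the exact numbers as a starting point to tune for your instance type:
# Task-level sizing on the ECS service module (values are illustrative).
# A g4dn.xlarge registers 4,096 CPU units and a bit under 16 GiB of memory with ECS,
# so we leave a small buffer for the ECS agent and other host processes.
cpu    = 3584  # ~3.5 vCPUs
memory = 14336 # 14 GiB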
Model and Quantization Settings
What is Quantization? Quantization reduces the precision of the model’s weights, with the goal of significantly speeding up inference while reducing memory usage.
Choosing a Quantization Setting: Select a quantization setting that balances performance and accuracy based on your specific use case and hardware capabilities.
In our own experimentation, we have had the best luck with the bitsandbytes strategy on NVIDIA T4 GPUs. This strategy is compatible with all model formats, but newer GPUs may have access to better quantization methods. It’s also important to check whether the model you’re using was created with a specific quantization method in mind; if so, you’ll likely need to use that quantization setting.
You can read more about quantization here.
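If you expect to experiment with different strategies, one option is to expose the setting as a Terraform variable and feed it into the QUANTIZE environment variable shown in the next section. This is just a sketch: the variable name is our own, and the allowed values listed are common TGI options at the time of writing, so check the documentation for your TGI version:
variable "quantize" {
  description = "Quantization strategy passed to TGI via the QUANTIZE environment variable"
  type        = string
  default     = "bitsandbytes"

  validation {
    # Common TGI --quantize options; trim or extend for your TGI version and GPU.
    condition     = contains(["bitsandbytes", "bitsandbytes-nf4", "gptq", "awq", "eetq"], var.quantize)
    error_message = "Unsupported quantization strategy."
  }
}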
Container Environment Variables
Specify your environment variables with the environment option on the container definition for TGI. We set NUMBA_CACHE_DIR to work around a Docker startup issue with the official TGI image, in which Numba (an open-source JIT compiler that speeds up some of the Python code used in the image) complains about a cache directory not being set (see GitHub for more details):
environment = [
{
name = "NUMBA_CACHE_DIR"
value = "/tmp/numba_cache"
},
{
name = "MODEL_ID"
value = "teknium/OpenHermes-2.5-Mistral-7B”
},
{
# Assumption: TGI's launcher reads the PORT environment variable; set it so TGI
# listens on the port used by the port mappings and NGINX proxy rather than its default.
name = "PORT"
value = "11434"
},
{
name = "QUANTIZE"
value = "bitsandbytes"
}
]
EFS Persistence
To speed up future launches, we recommend using Amazon Elastic File System (EFS) for persistent storage of the /data directory. We do not dive into the exact details of setting up an EFS volume in this example, but our open-source Terraform module (see repo), described below, sets this up automatically.
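For reference, here is a minimal sketch of what this could look like on the ECS service module. It assumes an EFS file system resource named aws_efs_file_system.tgi_data (plus the usual mount targets and security group rules, which are omitted here); our module wires all of this up for you:
# On the ECS service module: declare the EFS-backed volume.
volume = {
  tgi_data = {
    efs_volume_configuration = {
      file_system_id     = aws_efs_file_system.tgi_data.id
      transit_encryption = "ENABLED"
    }
  }
}

# On the TGI container definition: mount the volume at /data so downloaded
# model weights persist across task launches.
mount_points = [{
  containerPath = "/data"
  sourceVolume  = "tgi_data"
  readOnly      = false
}]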
In order for EFS persistence to work properly, you’ll need to build a custom image that runs as a non-root user, which avoids privileged container execution on the host. For ease of getting started, we have built one from a fork of the TGI repo. However, we recommend building your own when fully productionizing these components, as we cannot guarantee that the forked version will be kept up to date. You can do this locally or set up a CI/CD pipeline on GitHub; see https://github.com/open-webui/open-webui/pull/2322 and our forked change for more details on the build arguments to pass when building it yourself.
Putting it Together into the Service Definition
module "ecs_service" {
source = "terraform-aws-modules/ecs/aws//modules/service"
version = "5.11.0"
create = true
capacity_provider_strategy = {
"my-cluster-asg" = {
capacity_provider = module.ecs_cluster.autoscaling_capacity_providers["my-cluster-asg"].name
weight = 100
base = 1
}
}
cluster_arn = module.ecs_cluster.arn
desired_count = 1
name = "text-generation-inference"
placement_constraints = [{
type = "distinctInstance"
}]
create_task_definition = true
container_definitions = {
text_generation_inference = {
name = "text_generation_inference"
image = "https://ghcr.io/huggingface/text-generation-inference:2.0.4"
environment = [
{
name = "NUMBA_CACHE_DIR"
value = "/tmp/numba_cache"
},
{
name = "MODEL_ID"
value = "teknium/OpenHermes-2.5-Mistral-7B”
},
{
# Assumption: TGI's launcher reads the PORT environment variable; set it so TGI
# listens on the port used by the port mappings and NGINX proxy rather than its default.
name = "PORT"
value = "11434"
},
{
name = "QUANTIZE"
value = "bitsandbytes"
}
]
port_mappings = [{
name = "http"
containerPort = 11434
hostPort = 11434
protocol = "tcp"
}]
resource_requirements = [
{
type = "GPU"
value = "1"
}
],
}
}
}
Note that we will come back to this service definition later to add the fields necessary for service-to-service communication with the UI.
Deploying Open WebUI
Open WebUI is a simple frontend similar to ChatGPT’s interface, but it lets us interact with any OpenAI-compatible API, that is, a service exposing the same endpoints as OpenAI, which can therefore act as a drop-in replacement in the OpenAI client libraries simply by configuring the API URL. Although TGI provides a compatible chat completions endpoint, a little extra work is needed to make the two work together flawlessly.
Integration with TGI
First we use an NGINX sidecar to provide an OpenAI-compatible /v1/models endpoint. This setup allows seamless interaction between the UI and the TGI service. To do this, we add two new container definitions to the service above: one which creates the appropriate NGINX configuration on disk, and one that runs NGINX.
NGINX Configuration
The following NGINX configuration returns a static response from the /v1/models endpoint that allows Open WebUI to discover the model we are running with TGI.
resource "time_static" "activation_date" {}
locals {
nginx_config = <<EOF
server {
listen 80;
location /v1/models {
default_type application/json;
return 200 '{ "object": "list", "data": [ { "id": "teknium/OpenHermes-2.5-Mistral-7B", "object": "model", "created": ${time_static.activation_date.unix}, "owned_by": "system" } ]}';
}
location / {
proxy_pass http://localhost:11434/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
EOF
}
Empty Service Volume
In order for the init container to populate the NGINX configuration, we must create an empty volume that is mounted into each of the containers involved. We can do this by adding the following input to the service module above:
volume = {
nginx_config = {}
}
Add Container Definitions
In the service above we specified the container definition for the TGI container; let’s add two more for NGINX. We’ll start with the init container:
container_definitions = {
init_nginx = {
entrypoint = [
"bash",
"-c",
"set -ueo pipefail; mkdir -p /etc/nginx/conf.d/; echo ${base64encode(local.nginx_config)} | base64 -d > /etc/nginx/conf.d/default.conf; cat /etc/nginx/conf.d/default.conf",
]
image = "public.ecr.aws/docker/library/bash:5"
name = "init_nginx"
mount_points = [{
containerPath = "/etc/nginx/conf.d"
sourceVolume = "nginx_config"
readOnly = false
}]
},
...
And continue with the main NGINX container, which depends on the init container completing successfully:
nginx = {
dependencies = [
{
containerName = "init_nginx"
condition = "SUCCESS"
}
]
image = "nginx:stable-alpine"
name = "nginx"
port_mappings = [{
name = "http-proxy"
containerPort = 80
hostPort = 80
protocol = "tcp"
}]
mount_points = [{
containerPath = "/etc/nginx/conf.d"
sourceVolume = "nginx_config"
readOnly = false
}]
},
...
Now NGINX will start and serve the required models endpoint. But we’re not quite done: we still need the UI service, which must be able to communicate with TGI securely.
ECS Service Connect
Set up ECS Service Connect to keep the TGI endpoint private and secure. This ensures that only authorized users and services can access your AI model.
Create a new top-level resource for the service connect namespace:
resource "aws_service_discovery_http_namespace" "this" {
name = "mynamespace"
}
On the service module we defined above for TGI, add a new input as follows:
service_connect_configuration = {
enabled = true
namespace = aws_service_discovery_http_namespace.this.arn
service = {
client_alias = {
port = 80
}
port_name = "http-proxy"
discovery_name = "text-generation-inference"
}
}
Open WebUI Service
Next, we’ll create an ECS service for Open WebUI (in its own module block, so the name doesn’t collide with the TGI service module) and configure it to communicate with TGI over Service Connect.
Some interesting things to note:
- We do not actually need an API key for connecting to our TGI deployment, but Open WebUI expects one so we include a “fake” key.
- As a result of using Service Connect, we can refer to text-generation-inference.$namespace to connect to the TGI instance in the namespace we created.
module "ecs_service" {
source = "terraform-aws-modules/ecs/aws//modules/service"
version = "5.11.0"
create = true
cluster_arn = module.ecs_cluster.arn
desired_count = 1
name = "open-webui"
create_task_definition = true
container_definitions = {
open_webui = {
name = "open_webui"
image = "ghcr.io/open-webui/open-webui:main"
environment = [
{
name = "WEBUI_URL"
value = "https://open-webui.mydomain.com"
},
{
name = "PORT"
value = "8080"
},
{
name = "OPENAI_API_KEY"
value = "fake"
},
{
name = "OPENAI_API_BASE_URL"
value = "text-generation-inference.${aws_service_discovery_http_namespace.this.name}"
},
],
port_mappings = [{
name = "http"
containerPort = 8080
hostPort = 8080
protocol = "tcp"
}]
}
}
load_balancer = {
service = {
target_group_arn = module.alb.target_groups["open_webui"].arn
container_name = "open_webui"
container_port = 8080
}
}
}
At this point, we should be good to apply the Terraform! Assuming it completes successfully, we can move on to the next section. If you’ve run into any issues, you may have better luck using our Terraform module by following the quickstart guide.
Application Load Balancer (ALB) for the Service
Certificate
We look up the Route 53 zone ID for the registered domain, in this case “mydomain.com”, and create a certificate using the AWS ACM Terraform module:
data "aws_route53_zone" "this" {
name = "mydomain.com"
# The zone must be public so that the ACM DNS validation records and the ALB
# alias records created below are resolvable from the internet.
private_zone = false
}
module "acm" {
source = "terraform-aws-modules/acm/aws"
version = "5.0.0"
create_certificate = true
domain_name = "mydomain.com"
validate_certificate = true
validation_method = "DNS"
zone_id = data.aws_route53_zone.this.zone_id
}
ALB
We create an ALB that supports SSL and listens on port 443 using the ALB module. It is configured with the previously created certificate:
module "alb" {
source = "terraform-aws-modules/alb/aws"
version = "9.1.0"
create = true
load_balancer_type = "application"
name = "my-cluster-tgi-alb"
internal = false
listeners = {
http-https-redirect = {
port = 80
protocol = "HTTP"
redirect = {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
https = {
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-Res-2021-06"
certificate_arn = module.acm.acm_certificate_arn
# Forward traffic by default to the Open WebUI target group defined below
forward = {
target_group_key = "open_webui"
}
}
}
target_groups = {
text_generation_inference = {
name = "my-cluster-tgi"
protocol = "HTTP"
port = 11434
create_attachment = false
target_type = "ip"
deregistration_delay = 10
load_balancing_cross_zone_enabled = true
health_check = {
enabled = true
healthy_threshold = 5
interval = 30
matcher = "200"
path = "/health"
port = "traffic-port"
protocol = "HTTP"
timeout = 5
unhealthy_threshold = 2
}
}
# Target group for the Open WebUI service, referenced above as
# module.alb.target_groups["open_webui"]
open_webui = {
name = "my-cluster-open-webui"
protocol = "HTTP"
port = 8080
create_attachment = false
target_type = "ip"
deregistration_delay = 10
load_balancing_cross_zone_enabled = true
health_check = {
enabled = true
healthy_threshold = 5
interval = 30
matcher = "200"
path = "/health"
port = "traffic-port"
protocol = "HTTP"
timeout = 5
unhealthy_threshold = 2
}
}
}
create_security_group = true
vpc_id = var.vpc_id
security_group_ingress_rules = {
http = {
from_port = 80
to_port = 80
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
}
https = {
from_port = 443
to_port = 443
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
}
}
security_group_egress_rules = {
all = {
ip_protocol = "-1"
cidr_ipv4 = "0.0.0.0/0"
}
}
route53_records = {
A = {
name = "open-webui"
type = "A"
zone_id = var.route53_zone_id
}
AAAA = {
name = "open-webui"
type = "AAAA"
zone_id = var.route53_zone_id
}
}
}
Create an ECS Cluster with GPU Capacity
To simplify some of the common cluster setup tasks, we continue making use of the community-maintained terraform-aws-modules collection. It includes a really amazing autoscaling module, which we’ll make use of in this section.
Autoscaling Group Setup
By creating an autoscaling group to manage your EC2 instances, you will ensure that the appropriate number of instances are running to meet your workload demands.
When doing this you need to configure the instances to join the ECS cluster you will create. This includes setting up the necessary IAM roles and security groups to allow communication between the instances and the ECS service.
Below is an example of what that Terraform would look like when using the autoscaling module.
ECS Optimized GPU AMI
The following gets the AMI (Amazon Machine Image) ID for the recommended ECS optimized GPU AMI that will be used to launch the EC2 instances:
data "aws_ssm_parameter" "ecs_optimized_ami" {
name = "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended"
}
Security Group
Using the security group ID of the ALB we created earlier, we can create a security group that allows the ALB to communicate with our autoscaling group’s instances. This is really only necessary if you want to expose the TGI service via the ALB or plan on deploying the UI to the EC2 instances as well.
module "autoscaling_sg" {
source = "terraform-aws-modules/security-group/aws"
version = "~> 5.0"
create = true
name = "my-cluster-asg-sg"
vpc_id = var.vpc_id
computed_ingress_with_source_security_group_id = [
{
rule = "http-80-tcp"
source_security_group_id = var.alb_security_group_id
}
]
number_of_computed_ingress_with_source_security_group_id = 1
egress_rules = ["all-all"]
}
The Autoscaling Group
Now that we have our security group and AMI ID, we can use the autoscaling module with some reasonable settings, referencing the values above.
We attach the AWS-managed AmazonEC2ContainerServiceforEC2Role policy so that our EC2 instances can join the ECS cluster we have yet to create:
module "autoscaling" {
source = "terraform-aws-modules/autoscaling/aws"
version = "~> 6.5"
create = true
name = "my-cluster-asg"
image_id = jsondecode(data.aws_ssm_parameter.ecs_optimized_ami.value)["image_id"]
instance_type = "g4dn.xlarge"
security_groups = [module.autoscaling_sg.security_group_id]
user_data = base64encode(local.user_data)
# Scale-in protection is required because the ECS capacity provider defined
# below enables managed termination protection.
protect_from_scale_in = true
create_iam_instance_profile = true
iam_role_policies = {
AmazonEC2ContainerServiceforEC2Role = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
vpc_zone_identifier = var.vpc_private_subnets
health_check_type = "EC2"
min_size = 1
max_size = 1
desired_capacity = 1
# https://github.com/hashicorp/terraform-provider-aws/issues/12582
autoscaling_group_tags = {
AmazonECSManaged = true
}
}
User Data
You may have noticed above that the user_data field refers to a local value. User data is a script that an EC2 instance runs at startup; here we use it to populate the ECS agent configuration file, which the ECS-optimized AMI reads when it boots.
Here’s an example of what the user data should look like for this autoscaling group, assuming a cluster named “my-cluster” will be created:
locals {
# https://github.com/aws/amazon-ecs-agent/blob/master/README.md#environment-variables
user_data = <<-EOT
#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=my-cluster
ECS_LOGLEVEL=info
ECS_ENABLE_TASK_IAM_ROLE=true
EOF
EOT
}
ECS Cluster
Capacity Provider
We create a local representing the capacity provider for the EC2 autoscaling group created above; its managed scaling configuration targets 100% capacity utilization:
locals {
default_autoscaling_capacity_providers = {
"my-cluster-asg" = {
auto_scaling_group_arn = module.autoscaling.autoscaling_group_arn
managed_termination_protection = "ENABLED"
managed_scaling = {
maximum_scaling_step_size = 2
minimum_scaling_step_size = 1
status = "ENABLED"
target_capacity = 100
}
default_capacity_provider_strategy = {
weight = 0
}
}
}
}
Cluster
We can make use of that capacity provider when defining the cluster:
module "ecs_cluster" {
source = "terraform-aws-modules/ecs/aws//modules/cluster"
version = "5.11.0"
create = true
# Cluster
cluster_name = "my-cluster"
# Capacity providers
default_capacity_provider_use_fargate = true
autoscaling_capacity_providers = local.default_autoscaling_capacity_providers
}
We are here to help!
If you found the modules helpful, we encourage you to star the repo and follow us on GitHub for future updates! And if you need help implementing your own AI models or incorporating LLMs and other types of AI into your product, send us a message or schedule a call — we look forward to learning about your work and exploring how we can help!