APrime’s LLM Kit, Part 3: In-Depth Walkthrough of Self-Hosted AI Setup
This guide provides a detailed breakdown of the various steps required to launch a self-hosted AI Model in AWS via our free quickstart script and Terraform modules. If you have additional questions or want to learn more about anything in this guide, please reach out to us via llm@aprime.io.
This is Part 3 of our three-part series on hosting your own LLM in AWS. It is a direct follow-up to Part 2, our quickstart guide. For most users, the quickstart guide will suffice to get you up and running with default settings. To learn more about why it makes sense to host your own model, please visit the introduction in Part 1.
Prerequisites
This detailed walkthrough assumes that you have cloned our demo repo and inspected the quickstart.sh script and our open-source Terraform module. If you would rather skip the step-by-step detail and the discussion of decisions made within the module, please go back to the previous blog post in this series, the quickstart guide.
Deploying the Text Generation Inference (TGI) Service
Resource Requirements
For optimal performance, we recommend allocating a single ECS task per EC2 instance. This setup ensures that each task has exclusive access to any GPU resources. Given this dynamic, we decided to allocate CPU and memory to just a little less than what is available on the EC2 instance, leaving a bit of buffer room for other processes to run on that instance.
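As a rough illustration of the sizing described above, the sketch below derives task-level CPU and memory values for a g4dn.xlarge (4 vCPUs, 16 GiB). The local names and the size of the buffer are our own assumptions, not values taken from the module. The memory that an instance actually registers with ECS is somewhat lower than its nominal size, so check the registered resources of your container instance before settling on numbers.

```hcl
locals {
  # Nominal g4dn.xlarge capacity: 4 vCPUs (1 vCPU = 1024 ECS CPU units) and 16 GiB
  instance_cpu_units  = 4 * 1024
  instance_memory_mib = 16 * 1024

  # Hypothetical buffer left for the ECS agent, the OS, and other host processes
  reserved_cpu_units  = 512
  reserved_memory_mib = 2048

  tgi_cpu    = local.instance_cpu_units - local.reserved_cpu_units   # 3584
  tgi_memory = local.instance_memory_mib - local.reserved_memory_mib # 14336
}
```

These locals could then be passed to the service module as cpu = local.tgi_cpu and memory = local.tgi_memory.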
Model and Quantization Settings
What is Quantization? Quantization involves reducing the precision of the model’s weights with the goals of significantly speeding up inference while reducing memory usage.
Choosing a Quantization Setting: Select a quantization setting that balances performance and accuracy based on your specific use case and hardware capabilities.
In our own experimentation, we have had the best luck with the bitsandbytes strategy on NVIDIA T4 GPUs. This strategy is compatible with all model formats, but newer GPUs may have access to better quantization methods. It’s also important to check whether the model you’re using was created with a specific quantization method in mind; if so, you’ll likely need to use that quantization setting.
You can read more about quantization here.
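One way to keep the quantization choice easy to change per environment is to expose it as a Terraform variable. The sketch below is an assumption on our part: the variable name quantize is hypothetical, and the allowed list is only an illustrative subset of TGI options that should be confirmed against the TGI version you deploy.

```hcl
variable "quantize" {
  description = "Value passed to TGI through the QUANTIZE environment variable"
  type        = string
  default     = "bitsandbytes"

  validation {
    # Illustrative subset of TGI quantization options; confirm against your TGI version
    condition = contains(
      ["bitsandbytes", "bitsandbytes-nf4", "bitsandbytes-fp4", "gptq", "awq", "eetq"],
      var.quantize
    )
    error_message = "Unsupported quantize value for this example."
  }
}
```

The QUANTIZE entry in the container environment would then use value = var.quantize instead of a hard-coded string.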
Container Environment Variables
Specify your environment variables with the environment option on the container definition for TGI. We set NUMBA_CACHE_DIR to work around a Docker startup issue with the official TGI image, where Numba (an open source JIT compiler that speeds up the Python code used in the image) complains about a cache directory not being set (see GitHub for more details):
environment = [
{
name = "NUMBA_CACHE_DIR"
value = "/tmp/numba_cache"
},
{
name = "MODEL_ID"
value = "teknium/OpenHermes-2.5-Mistral-7B"
},
{
name = "QUANTIZE"
value = "bitsandbytes"
}
]
EFS Persistence
To speed up future launches, we recommend using Amazon Elastic File System (EFS) for persistent storage of the /data directory. We do not dive into the exact details for setting up an EFS volume in this example, but our open source Terraform module (see repo) described below sets this up automatically.
In order for EFS persistence to work properly in this setup, you’ll need to build a custom image that runs as a non-root user, which avoids privileged container execution on the host. For ease of getting started I’ve built one from a fork of the TGI repo: open-webui. However, we recommend building your own when fully productionizing these components, as we cannot guarantee that the forked version will be kept up to date. You can do this locally or set up a CI/CD pipeline on GitHub; see https://github.com/open-webui/open-webui/pull/2322 and my forked change for more details on the build arguments to pass when building the image yourself.
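For readers who want to see roughly what the EFS wiring involves, here is a minimal sketch. It is not the module's actual implementation: the resource names, the tgi_data volume name, and var.efs_security_group_id are hypothetical, and we assume the service module accepts an efs_volume_configuration entry in its volume input.

```hcl
# File system that persists downloaded model weights across task restarts
resource "aws_efs_file_system" "tgi_data" {
  encrypted = true

  tags = {
    Name = "tgi-data"
  }
}

# One mount target per private subnet so tasks in any subnet can reach the file system
resource "aws_efs_mount_target" "tgi_data" {
  for_each = toset(var.vpc_private_subnets)

  file_system_id  = aws_efs_file_system.tgi_data.id
  subnet_id       = each.value
  security_groups = [var.efs_security_group_id] # hypothetical variable; must allow NFS (TCP 2049)
}

locals {
  # Candidate value for the service module's volume input
  tgi_efs_volume = {
    tgi_data = {
      efs_volume_configuration = {
        file_system_id     = aws_efs_file_system.tgi_data.id
        transit_encryption = "ENABLED"
      }
    }
  }

  # Candidate mount_points entry for the TGI container definition
  tgi_efs_mount_points = [{
    containerPath = "/data"
    sourceVolume  = "tgi_data"
    readOnly      = false
  }]
}
```

If the service already defines other volumes, the two maps can be combined with merge() before being passed to the volume input.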
Putting it Together into the Service Definition
module "ecs_service" {
source = "terraform-aws-modules/ecs/aws//modules/service"
version = "5.11.0"
create = true
capacity_provider_strategy = {
"my-cluster-asg" = {
capacity_provider = module.ecs_cluster.autoscaling_capacity_providers["my-cluster-asg"].name
weight = 100
base = 1
}
}
cluster_arn = module.ecs_cluster.arn
desired_count = 1
name = "text-generation-inference"
placement_constraints = [{
type = "distinctInstance"
}]
create_task_definition = true
container_definitions = {
text_generation_inference = {
name = "text_generation_inference"
image = "ghcr.io/huggingface/text-generation-inference:2.0.4"
environment = [
{
name = "NUMBA_CACHE_DIR"
value = "/tmp/numba_cache"
},
{
name = "MODEL_ID"
value = "teknium/OpenHermes-2.5-Mistral-7B"
},
{
name = "QUANTIZE"
value = "bitsandbytes"
}
]
port_mappings = [{
name = "http"
containerPort = 11434
hostPort = 11434
protocol = "tcp"
}]
resource_requirements = [
{
type = "GPU"
value = 1
}
],
}
}
}
Note, we will come back to this service definition later to add the necessary fields for service communication with the UI.
Deploying Open WebUI
Open WebUI is a simple frontend that is similar to ChatGPT’s interface but allows us to interact with any OpenAI-compatible API, meaning a service that provides the same endpoints as OpenAI. Such a service can act as a drop-in replacement in any of the OpenAI client libraries just by configuring the API URL. Although TGI provides a compatible chat/completion endpoint, we still need to do a little work to make the two work together flawlessly.
Integration with TGI
First we use an NGINX sidecar to provide an OpenAI-compatible /v1/models endpoint. This setup allows seamless interaction between the UI and the TGI service. To do this, we add two new container definitions to the service above: one which creates the appropriate NGINX configuration on disk, and one that runs NGINX.
NGINX Configuration
The following NGINX configuration returns a static response from the /v1/models endpoint that allows Open WebUI to discover the model we are running with TGI.
resource "time_static" "activation_date" {}
locals {
nginx_config = <<EOF
server {
listen 80;
location /v1/models {
default_type application/json;
return 200 '{ "object": "list", "data": [ { "id": "teknium/OpenHermes-2.5-Mistral-7B", "object": "model", "created": ${time_static.activation_date.unix}, "owned_by": "system" } ]}';
}
location / {
proxy_pass http://localhost:11434/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
EOF
}
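The time_static resource used above comes from the hashicorp/time provider, so the configuration needs that provider alongside the AWS provider. A minimal sketch follows; the version constraints are assumptions and should be adjusted to whatever your modules require.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # assumed constraint
    }
    time = {
      source  = "hashicorp/time"
      version = ">= 0.9" # assumed constraint
    }
  }
}
```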
Empty Service Volume
In order to populate the NGINX configuration from the init container, we must create an empty volume that is mounted into both the init container and the NGINX container. We can do this by adding the following input to the service module above:
volume = {
nginx_config = {}
}
Add Container Definitions
In the service above we specified the container definition for the TGI container. Let’s add two more for NGINX, starting with the init container:
container_definitions = {
init_nginx = {
entrypoint = [
"bash",
"-c",
"set -ueo pipefail; mkdir -p /etc/nginx/conf.d/; echo ${base64encode(local.nginx_config)} | base64 -d > /etc/nginx/conf.d/default.conf; cat /etc/nginx/conf.d/default.conf",
]
image = "public.ecr.aws/docker/library/bash:5"
name = "init_nginx"
# The init container exits after writing the config, so it must not be essential
essential = false
mount_points = [{
containerPath = "/etc/nginx/conf.d"
sourceVolume = "nginx_config"
readOnly = false
}]
},
...
We continue with the main NGINX container, which depends on the init container exiting successfully:
nginx = {
dependencies = [
{
containerName = "init_nginx"
condition = "SUCCESS"
}
]
image = "nginx:stable-alpine"
name = "nginx"
port_mappings = [{
name = "http-proxy"
containerPort = 80
hostPort = 80
protocol = "tcp"
}]
mount_points = [{
containerPath = "/etc/nginx/conf.d"
sourceVolume = "nginx_config"
readOnly = false
}]
},
...
Now NGINX will start and serve the required models endpoint. But we’re not quite done, as we still need the UI service, which must be able to communicate with TGI securely.
ECS Service Connect
Set up ECS Service Connect to keep the TGI endpoint private and secure. Other ECS services in the same namespace can reach TGI by name, without exposing it through a public endpoint.
Create a new top-level resource for the service connect namespace:
resource "aws_service_discovery_http_namespace" "this" {
name = "mynamespace"
}
On the service module we defined above for TGI, add a new input as follows:
service_connect_configuration = {
enabled = true
namespace = aws_service_discovery_http_namespace.this.arn
service = {
client_alias = {
port = 80
}
port_name = "http-proxy"
discovery_name = "text-generation-inference"
}
}
Open WebUI Service
Next we’ll create an ECS service for the Open WebUI service and have it configured to communicate with TGI over Service Connect.
Some interesting things to note:
- We do not actually need an API key for connecting to our TGI deployment, but Open WebUI expects one so we include a “fake” key.
- As a result of using Service Connect, we can refer to text-generation-inference.$namespace to connect to the TGI instance in the namespace we created.
module "open_webui_service" {
source = "terraform-aws-modules/ecs/aws//modules/service"
version = "5.11.0"
create = true
cluster_arn = module.ecs_cluster.arn
desired_count = 1
name = "open-webui"
create_task_definition = true
container_definitions = {
open_webui = {
name = "open_webui"
image = "ghcr.io/open-webui/open-webui:main"
environment = [
{
name = "WEBUI_URL"
value = "https://open-webui.mydomain.com"
},
{
name = "PORT"
value = "8080"
},
{
name = "OPENAI_API_KEY"
value = "fake"
},
{
name = "OPENAI_API_BASE_URL"
value = "http://text-generation-inference.${aws_service_discovery_http_namespace.this.name}/v1"
},
],
port_mappings = [{
name = "http"
containerPort = 8080
hostPort = 8080
protocol = "tcp"
}]
}
}
# load_balancer is an input of the service module, not of the container definition
load_balancer = {
service = {
target_group_arn = module.alb.target_groups["open_webui"].arn
container_name = "open_webui"
container_port = 8080
}
}
}
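For the Service Connect name to resolve from Open WebUI, our understanding is that the Open WebUI service must also join the namespace, as a client only. The fragment below is a sketch of the extra input we would expect on the Open WebUI service module. It is an assumption based on how ECS Service Connect clients are normally configured, not something shown in the original module.

```hcl
# Added as an input to the Open WebUI service module.
# Client-only configuration: no "service" block, so nothing is published to the namespace.
service_connect_configuration = {
  enabled   = true
  namespace = aws_service_discovery_http_namespace.this.arn
}
```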
At this point, we should be good to apply the Terraform! Assuming that it completes successfully, we can move on to the next section. If you’ve run into any issues, you may have better luck using our Terraform module as described in the quickstart guide.
Application Load Balancer (ALB) for the Service
Certificate
We look up the Route 53 zone ID for the registered domain (in this case, “mydomain.com”) and create a certificate using the ACM module from the community-maintained terraform-aws-modules project:
data "aws_route53_zone" "this" {
name = "mydomain.com"
# ACM DNS validation needs a publicly resolvable hosted zone
private_zone = false
}
module "acm" {
source = "terraform-aws-modules/acm/aws"
version = "5.0.0"
create_certificate = true
# Must cover the hostname served by the ALB (see WEBUI_URL and the Route 53 records)
domain_name = "open-webui.mydomain.com"
validate_certificate = true
validation_method = "DNS"
zone_id = data.aws_route53_zone.this.zone_id
}
ALB
We create an ALB that supports SSL and listens on port 443 using the ALB module. It is configured with the certificate we created previously:
module "alb" {
source = "terraform-aws-modules/alb/aws"
version = "9.1.0"
create = true
load_balancer_type = "application"
name = "my-cluster-tgi-alb"
internal = false
listeners = {
http-https-redirect = {
port = 80
protocol = "HTTP"
redirect = {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
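https = {
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-Res-2021-06"
certificate_arn = module.acm.acm_certificate_arn
# Forward HTTPS traffic to the target group that the Open WebUI service registers with
forward = {
target_group_key = "open_webui"
}
}
}
target_groups = {
# The key must match the module.alb.target_groups["open_webui"] reference in the Open WebUI service
open_webui = {
name = "my-cluster-open-webui"
protocol = "HTTP"
port = 8080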
create_attachment = false
target_type = "ip"
deregistration_delay = 10
load_balancing_cross_zone_enabled = true
health_check = {
enabled = true
healthy_threshold = 5
interval = 30
matcher = "200"
path = "/health"
port = "traffic-port"
protocol = "HTTP"
timeout = 5
unhealthy_threshold = 2
}
}
}
create_security_group = true
vpc_id = var.vpc_id
security_group_ingress_rules = {
http = {
from_port = 80
to_port = 80
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
}
https = {
from_port = 443
to_port = 443
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
}
}
security_group_egress_rules = {
all = {
ip_protocol = "-1"
cidr_ipv4 = "0.0.0.0/0"
}
}
route53_records = {
A = {
name = "open-webui"
type = "A"
zone_id = var.route53_zone_id
}
AAAA = {
name = "open-webui"
type = "AAAA"
zone_id = var.route53_zone_id
}
}
}
Create an ECS Cluster with GPU Capacity
To simplify some of the common cluster setup tasks, we continue making use of the community-maintained terraform-aws-modules. They provide an autoscaling module, which we’ll use in this section.
Autoscaling Group Setup
By creating an autoscaling group to manage your EC2 instances, you will ensure that the appropriate number of instances are running to meet your workload demands.
When doing this you need to configure the instances to join the ECS cluster you will create. This includes setting up the necessary IAM roles and security groups to allow communication between the instances and the ECS service.
Below is an example of what that Terraform would look like when using the autoscaling module.
ECS Optimized GPU AMI
The following gets the AMI (Amazon Machine Image) ID for the recommended ECS optimized GPU AMI that will be used to launch the EC2 instances:
data "aws_ssm_parameter" "ecs_optimized_ami" {
name = "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended"
}
Security Group
Using the ID of the ALB’s security group, we can create a security group that allows the ALB to communicate with our autoscaling group’s instances. This is really only necessary if you want to expose the TGI service via the ALB or plan on deploying the UI to the EC2 instances as well.
module "autoscaling_sg" {
source = "terraform-aws-modules/security-group/aws"
version = "~> 5.0"
create = true
name = "my-cluster-asg-sg"
vpc_id = var.vpc_id
computed_ingress_with_source_security_group_id = [
{
rule = "http-80-tcp"
source_security_group_id = var.alb_security_group_id
}
]
number_of_computed_ingress_with_source_security_group_id = 1
egress_rules = ["all-all"]
}
The Autoscaling Group
Now that we have our security group and AMI ID, we can configure the autoscaling module with some reasonable settings that refer to the values above.
We attach the AWS-managed AmazonEC2ContainerServiceforEC2Role policy to allow our EC2 instances to join our yet-to-be-created ECS cluster:
module "autoscaling" {
source = "terraform-aws-modules/autoscaling/aws"
version = "~> 6.5"
create = true
name = "my-cluster-asg"
image_id = jsondecode(data.aws_ssm_parameter.ecs_optimized_ami.value)["image_id"]
instance_type = "g4dn.xlarge"
security_groups = [module.autoscaling_sg.security_group_id]
user_data = base64encode(local.user_data)
create_iam_instance_profile = true
iam_role_policies = {
AmazonEC2ContainerServiceforEC2Role = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
vpc_zone_identifier = var.vpc_private_subnets
health_check_type = "EC2"
min_size = 1
max_size = 1
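desired_capacity = 1
# Required when the capacity provider sets managed_termination_protection = "ENABLED"
protect_from_scale_in = true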
# https://github.com/hashicorp/terraform-provider-aws/issues/12582
autoscaling_group_tags = {
AmazonECSManaged = true
}
}
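The snippets in this guide reference several input variables (var.vpc_id, var.vpc_private_subnets, var.route53_zone_id, and var.alb_security_group_id) without declaring them. A minimal sketch of the declarations is shown below, with types inferred from how the values are used. The descriptions are ours.

```hcl
variable "vpc_id" {
  description = "ID of the VPC that hosts the ALB, the autoscaling group, and the ECS services"
  type        = string
}

variable "vpc_private_subnets" {
  description = "IDs of the private subnets used by the autoscaling group"
  type        = list(string)
}

variable "route53_zone_id" {
  description = "ID of the Route 53 hosted zone where the open-webui records are created"
  type        = string
}

variable "alb_security_group_id" {
  description = "ID of the security group attached to the ALB"
  type        = string
}
```

If the ALB is created in the same configuration, the last variable can likely be replaced with the ALB module's security group output rather than being passed in.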
User Data
You may have noticed above that the user_data field refers to a local value. User data on an EC2 instance is a script that runs at first boot, and it can be used to populate files at instance startup time. In this case, we use it to populate the ECS configuration file, which the ECS agent on the ECS-optimized AMI reads at startup.
Here’s an example of what user data should look like for this autoscaling group assuming a cluster name of “my-cluster” will be created:
locals {
# https://github.com/aws/amazon-ecs-agent/blob/master/README.md#environment-variables
user_data = <<-EOT
#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=my-cluster
ECS_LOGLEVEL=info
ECS_ENABLE_TASK_IAM_ROLE=true
EOF
EOT
}
ECS Cluster
Capacity Provider
We create a local representing the capacity provider for the EC2 autoscaling group created above. With target_capacity set to 100, ECS managed scaling attempts to keep the group’s instances 100% utilized:
locals {
default_autoscaling_capacity_providers = {
"my-cluster-asg" = {
auto_scaling_group_arn = module.autoscaling.autoscaling_group_arn
managed_termination_protection = "ENABLED"
managed_scaling = {
maximum_scaling_step_size = 2
minimum_scaling_step_size = 1
status = "ENABLED"
target_capacity = 100
}
default_capacity_provider_strategy = {
weight = 0
}
}
}
}
Cluster
We can make use of that capacity provider when defining the cluster:
module "ecs_cluster" {
source = "terraform-aws-modules/ecs/aws//modules/cluster"
version = "5.11.0"
create = true
# Cluster
cluster_name = "my-cluster"
# Capacity providers
default_capacity_provider_use_fargate = true
autoscaling_capacity_providers = local.default_autoscaling_capacity_providers
}
We are here to help!
If you found the modules helpful, we encourage you to star the repo and follow us on GitHub for future updates! And if you need help implementing your own AI models or incorporating LLMs and other types of AI into your product, send us a message or schedule a call — we look forward to learning about your work and exploring how we can help!