Why Self-Host Llama 3 on AWS
Three reasons teams self-host instead of using an API. First, cost: at high query volumes (10,000+ per day), self-hosting is cheaper than paying per-token API fees. Second, data privacy: your prompts and responses never leave your infrastructure. Third, control: you pick the model version, the quantization level, and the serving parameters. No surprise model updates from an API provider.
The tradeoff is operational work. You manage the infrastructure, the scaling, and the uptime. This guide covers how to do that properly on AWS.
Choosing the Right Instance
Llama 3 comes in 8B and 70B parameter versions. The instance you need depends on the model size and your latency requirements.
| Model | Instance | GPU Memory | On-Demand Cost |
|---|---|---|---|
| Llama 3 8B (FP16) | g5.2xlarge (1x A10G) | 24 GB | ~$1.21/hr |
| Llama 3 8B (INT4) | g5.xlarge (1x A10G) | 24 GB | ~$1.01/hr |
| Llama 3 70B (INT4) | g5.12xlarge (4x A10G) | 96 GB | ~$5.67/hr |
| Llama 3 70B (FP16) | p4d.24xlarge (8x A100) | 320 GB | ~$32.77/hr |
For most production workloads, the 8B model on a g5.2xlarge is the sweet spot. It handles 30 to 50 concurrent requests with sub-second latency and costs about $880/month on demand, or $550/month with a 1-year reserved instance.
Setting Up vLLM for Inference
vLLM is the fastest open-source inference engine for LLMs. It uses PagedAttention to maximize GPU utilization and serves requests through an OpenAI-compatible API endpoint.
# SSH into your EC2 instance
pip install vllm
# Start the server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9
The server exposes /v1/chat/completions and /v1/completions endpoints, so you can swap it into any application that currently calls the OpenAI API by changing the base URL.
Adding a Load Balancer
For production traffic, put an Application Load Balancer (ALB) in front of your vLLM instances. Configure health checks on the /health endpoint. Set connection draining to 300 seconds so in-flight requests complete before an instance is removed.
If you need more than one instance, use an Auto Scaling Group with a target tracking policy based on GPU utilization (CloudWatch custom metric via nvidia-smi). Scale out when average GPU utilization exceeds 70%. Scale in when it drops below 30%.
Docker and ECS Deployment
For teams that prefer containers, package vLLM in a Docker image and deploy on ECS with GPU-enabled task definitions.
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN pip install vllm
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
"--model", "meta-llama/Meta-Llama-3-8B-Instruct", \
"--host", "0.0.0.0", "--port", "8000"]
ECS manages container placement on GPU instances. You still need to handle model weight downloads; either bake them into the Docker image (large image, slow deploy) or mount them from an EFS volume (fast deploy, shared storage).
Cost Comparison: Self-Hosted vs API
At 30,000 queries per day with average 500 input + 200 output tokens:
| Option | Monthly Cost |
|---|---|
| GPT-4o API | ~$6,750 |
| GPT-4o-mini API | ~$675 |
| Llama 3 8B on g5.2xlarge (reserved) | ~$550 |
| Llama 3 70B on g5.12xlarge (reserved) | ~$4,100 |
The 8B model beats both GPT-4o and GPT-4o-mini on cost. The 70B model is cheaper than GPT-4o but more expensive than GPT-4o-mini. The quality gap between Llama 3 8B and GPT-4o is real, so the decision depends on whether your use case needs the stronger model.
Production Checklist
- Use Spot Instances for non-critical workloads (60% to 70% savings, but instances can be interrupted)
- Enable CloudWatch alarms for GPU memory, inference latency, and 5xx error rates
- Set up model weight versioning in S3 so you can roll back to a previous model checkpoint
- Configure request queuing so traffic spikes do not crash the inference server
- Run load tests before launch. vLLM's benchmark suite (benchmark_serving.py) is a good starting point.
Frequently Asked Questions
Can I use SageMaker instead of raw EC2?
Yes. SageMaker endpoints handle autoscaling and deployment for you, but they cost 15% to 30% more than raw EC2 because of the managed service markup. For teams without DevOps expertise, SageMaker is worth the premium.
Should I use the 8B or 70B model?
Start with the 8B. If your eval suite shows it falls short on your tasks, try the 70B. Many production chatbots and RAG systems run fine on the 8B model with good prompts and context.