LoRA Adapters for LLMInferenceService
Overview
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that allows you to adapt large language models to specific tasks without modifying the base model weights. LLMInferenceService provides native support for serving multiple LoRA adapters alongside a base model, enabling efficient multi-tenant deployments and task-specific model specialization.
Why Use LoRA Adapters?
- Storage Efficiency: Share a single base model across multiple task-specific adaptations (typically 50-500MB per adapter vs 10-100GB for full models)
- Multi-Tenancy: Serve multiple specialized versions of the same model from a single deployment
- Fast Iteration: Update task-specific adapters without redeploying the base model
- Cost Optimization: Reduce GPU memory and storage costs compared to deploying multiple full models
LoRA adapters are loaded at service startup and vLLM can switch between them dynamically per request with minimal overhead (~1-5ms).
Prerequisites
Before configuring LoRA adapters, ensure:
- vLLM Runtime: LoRA support requires vLLM (default runtime for LLMInferenceService)
- Storage Initializer: Enabled for hf:// and s3:// adapters (enabled by default)
- Base Model Compatibility: Adapters must be trained against the same architecture as your base model
- Kubernetes Resources: Sufficient GPU memory to load base model + all adapters
Note: Each adapter typically requires 50-500MB of GPU memory depending on rank and model size.
Configuration
Basic LoRA Configuration
Add LoRA adapters to your LLMInferenceService using the spec.model.lora.adapters field:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: my-llm-service
spec:
model:
uri: hf://Qwen/Qwen2.5-7B-Instruct
name: Qwen/Qwen2.5-7B-Instruct
lora:
adapters:
- name: sql-adapter
uri: hf://my-org/qwen-sql-lora
- name: code-adapter
uri: s3://my-bucket/adapters/code-lora
Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
| spec.model.lora.adapters | array | No | List of LoRA adapters to attach to the base model |
| spec.model.lora.adapters[].name | string | Yes | Unique adapter name used for inference requests |
| spec.model.lora.adapters[].uri | string | Yes | Adapter source URI (must use hf://, s3://, or pvc:// scheme) |
| spec.model.lora.maxRank | integer | No | Maximum LoRA rank supported by the runtime (maps to vLLM --max-lora-rank). If not set, vLLM's default applies (16). |
| spec.model.lora.maxAdapters | integer | No | Maximum number of LoRA adapters in GPU memory simultaneously (maps to vLLM --max-loras). If not set, vLLM's default applies (1). |
| spec.model.lora.maxCpuAdapters | integer | No | Maximum number of LoRA adapters cached in CPU memory (maps to vLLM --max-cpu-loras). If not set, vLLM defaults this to maxAdapters. |
Constraints
- Adapter names must be unique within a service
- Adapter names must differ from the base model name
- Adapter names are case-sensitive
- All adapters are loaded at startup (no dynamic loading)
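These constraints are easy to check before applying a spec; a minimal sketch in Python (a hypothetical helper, not part of KServe):
# Minimal sketch of the naming constraints above (hypothetical helper,
# not KServe code). Comparisons are deliberately case-sensitive.
def validate_adapter_names(base_model_name: str, adapter_names: list[str]) -> None:
    if len(set(adapter_names)) != len(adapter_names):
        raise ValueError("adapter names must be unique within a service")
    if base_model_name in adapter_names:
        raise ValueError("adapter names must differ from the base model name")

validate_adapter_names("Qwen/Qwen2.5-7B-Instruct", ["sql-adapter", "code-adapter"])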
Supported URI Schemes
HuggingFace Hub (hf://)
Download adapters directly from HuggingFace Hub.
Format: hf://organization/repository or hf://organization/repository/subdirectory
lora:
adapters:
- name: my-adapter
uri: hf://edbeeching/opt-125m-lora
Authentication (for private repositories):
template:
containers:
- name: storage-initializer
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-secret
key: token
HuggingFace adapters require the storage-initializer to be enabled (default behavior).
S3-Compatible Storage (s3://)
Use adapters from S3, MinIO, Ceph, or any S3-compatible object storage.
Format: s3://bucket-name/path/to/adapter
lora:
adapters:
- name: my-adapter
uri: s3://my-bucket/adapters/domain-lora
S3 Configuration with Credentials:
template:
containers:
- name: storage-initializer
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: s3-config
key: AWS_ACCESS_KEY_ID
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: s3-config
key: AWS_SECRET_ACCESS_KEY
- name: S3_ENDPOINT
value: "https://minio.example.com"
- name: S3_USE_HTTPS
value: "1"
Supported S3-Compatible Providers:
- AWS S3
- MinIO
- Ceph Object Gateway
- Google Cloud Storage (S3 interoperability mode)
- Azure Blob Storage (through an S3-compatible gateway)
PersistentVolumeClaim (pvc://)
Use pre-downloaded adapters from a Kubernetes PVC for the fastest startup or for air-gapped environments.
Format: pvc://pvc-name/path/within/pvc
lora:
adapters:
- name: my-adapter
uri: pvc://adapter-pvc/domain-lora
PVC Requirements:
- PVC must exist in the same namespace
- Access mode: ReadOnlyMany or ReadWriteMany (for multiple replicas)
- Contains adapter files in safetensors or PyTorch format
Example PVC Setup:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: adapter-pvc
spec:
accessModes:
- ReadOnlyMany
resources:
requests:
storage: 10Gi
storageClassName: nfs-storage
PVC adapters provide the fastest service startup time since no download phase is required. This is ideal for production deployments and air-gapped environments.
Complete Examples
Example 1: Single HuggingFace Adapter
Simple deployment with one public adapter for SQL code generation:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: qwen-sql
spec:
model:
uri: hf://Qwen/Qwen2.5-7B-Instruct
name: Qwen/Qwen2.5-7B-Instruct
lora:
adapters:
- name: sql-adapter
uri: hf://my-org/qwen-sql-lora
replicas: 2
template:
containers:
- name: main
image: vllm/vllm-openai:latest
resources:
limits:
nvidia.com/gpu: "1"
cpu: "8"
memory: 32Gi
Example 2: Multiple Adapters from Different Sources
Multi-tenant deployment serving adapters for different tasks:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
name: qwen-multi-tenant
spec:
model:
uri: hf://Qwen/Qwen2.5-7B-Instruct
name: Qwen/Qwen2.5-7B-Instruct
lora:
adapters:
- name: sql-adapter
uri: hf://my-org/qwen-sql-lora
- name: code-adapter
uri: s3://my-bucket/adapters/code-lora
- name: domain-adapter
uri: pvc://adapter-pvc/domain-lora
replicas: 3
template:
containers:
- name: main
image: vllm/vllm-openai:latest
resources:
limits:
nvidia.com/gpu: "1"
cpu: "8"
memory: 32Gi
- name: storage-initializer
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: s3-config
key: AWS_ACCESS_KEY_ID
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: s3-config
key: AWS_SECRET_ACCESS_KEY
Usage at Inference Time
OpenAI-Compatible API
Once deployed, select adapters by specifying the adapter name in the model parameter:
Using an Adapter:
curl -k https://<service-url>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sql-adapter",
"messages": [
{"role": "user", "content": "Generate SQL to find all active users"}
]
}'
Using the Base Model (no adapter):
curl -k https://<service-url>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "What is Kubernetes?"}
]
}'
vLLM automatically switches between adapters per request with minimal latency overhead. No service restart is required to switch adapters.
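The same requests can be issued from Python with the official openai client (illustrative sketch; the base_url is a placeholder and the api_key is a dummy value, since auth is typically handled by the gateway):
# Equivalent chat request using the openai Python client (v1.x API).
from openai import OpenAI

client = OpenAI(base_url="https://<service-url>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sql-adapter",  # adapter name from spec.model.lora.adapters[].name
    messages=[{"role": "user", "content": "Generate SQL to find all active users"}],
)
print(response.choices[0].message.content)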
How It Works
Automatic Integration
When you configure LoRA adapters, the LLMInferenceService controller automatically:
- Download Phase (hf:// and s3:// adapters):
  - Injects storage-initializer as an init container
  - Downloads all adapters in parallel
  - Mounts adapters to /mnt/lora/<adapter-name>
- Mount Phase (pvc:// adapters):
  - Creates volume mounts for each PVC adapter
  - Mounts to /mnt/lora/<adapter-name> (read-only)
  - No download required
- vLLM Configuration:
  - Automatically adds the --enable-lora flag
  - Sets --max-lora-rank, --max-loras, and --max-cpu-loras only when explicitly configured in spec.model.lora; vLLM's own defaults apply otherwise
  - Adds --lora-modules <name>=<path> <name2>=<path2> ...
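Putting the configuration step together, the arguments for a two-adapter service would assemble roughly as follows (a hypothetical Python sketch, not the controller's actual code; paths follow the /mnt/lora/<adapter-name> convention above):
# Hypothetical sketch of the vLLM argument assembly described above.
# Parameter names mirror spec.model.lora; the real controller differs.
def build_lora_args(adapters, max_rank=None, max_adapters=None, max_cpu_adapters=None):
    args = ["--enable-lora"]
    if max_rank is not None:
        args += ["--max-lora-rank", str(max_rank)]
    if max_adapters is not None:
        args += ["--max-loras", str(max_adapters)]
    if max_cpu_adapters is not None:
        args += ["--max-cpu-loras", str(max_cpu_adapters)]
    args.append("--lora-modules")
    args += [f"{name}=/mnt/lora/{name}" for name in adapters]
    return args

print(build_lora_args(["sql-adapter", "code-adapter"]))
# ['--enable-lora', '--lora-modules',
#  'sql-adapter=/mnt/lora/sql-adapter', 'code-adapter=/mnt/lora/code-adapter']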
Path Sanitization
Adapter names are sanitized for filesystem compatibility:
- Invalid characters (/, :, etc.) are replaced with -
- Example: my/adapter:v1 becomes my-adapter-v1
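A minimal sketch of such a rule (assuming a simple character-class replacement; the controller's exact implementation may differ):
import re

def sanitize_adapter_name(name: str) -> str:
    # Replace anything outside a filesystem-safe character set with "-".
    return re.sub(r"[^A-Za-z0-9._-]", "-", name)

print(sanitize_adapter_name("my/adapter:v1"))  # -> my-adapter-v1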
Resource Considerations
GPU Memory Usage:
- Each adapter typically requires 50-500MB GPU memory depending on rank and model size
- Formula: adapter_memory ≈ rank × num_layers × hidden_dim × 2 × sizeof(fp16), as worked through below
- All adapters are loaded simultaneously into GPU memory
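As a rough worked example of this formula (illustrative numbers for a 7B-class model; the per-layer module count is an assumption, since the formula counts one adapted matrix pair per layer while adapters typically target several projection matrices):
# Back-of-the-envelope estimate of LoRA adapter GPU memory.
rank = 16            # LoRA rank
num_layers = 32      # 7B-class transformer
hidden_dim = 4096
bytes_fp16 = 2
target_modules = 7   # q/k/v/o + MLP projections per layer (assumption; varies)

per_module = rank * hidden_dim * 2 * bytes_fp16   # A (d x r) + B (r x d)
total_bytes = per_module * num_layers * target_modules
print(f"~{total_bytes / 2**20:.0f} MiB per adapter")  # ~56 MiB for this config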
Download Time:
- Depends on adapter size and network bandwidth
- HuggingFace: typically 10-60 seconds per adapter
- S3: depends on endpoint proximity and bandwidth
- PVC: no download time (instant)
Advanced Configuration
Tuning LoRA Runtime Parameters
Use the spec fields to configure vLLM's LoRA runtime settings:
spec:
model:
lora:
maxRank: 128 # increase if adapters were trained with rank > 16 (vLLM default)
maxAdapters: 3 # max adapters in GPU memory simultaneously (vLLM default: 1)
maxCpuAdapters: 6 # max adapters cached in CPU memory (vLLM default: maxAdapters)
adapters:
- name: sql-adapter
uri: hf://my-org/qwen-sql-lora
Manual LoRA Configuration
If you need full control, you can disable automatic configuration by including --lora-modules in your container args:
template:
containers:
- name: main
args:
- "--model"
- "/mnt/models"
- "--enable-lora"
- "--lora-modules"
- "my-adapter=/custom/path"
When you manually specify --lora-modules, the controller skips automatic LoRA configuration. You are responsible for ensuring adapters are downloaded and paths are correct.
Monitoring and Troubleshooting
Verification
Check that adapters loaded successfully by viewing pod logs:
kubectl logs <pod-name> -c storage-initializer
# Look for: "Successfully downloaded adapter to /mnt/lora/<name>"
kubectl logs <pod-name> -c main
# Look for: "Loading LoRA adapters" and adapter names
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Download failure | Invalid HF/S3 credentials | Verify HF_TOKEN or S3 credentials in environment variables |
| PVC mount failure | PVC doesn't exist or wrong namespace | Ensure PVC exists in same namespace as LLMInferenceService |
| Adapter not found at inference | Adapter name mismatch | Use exact adapter name from spec.model.lora.adapters[].name in model parameter |
| OOM errors | Too many adapters or insufficient GPU memory | Reduce number of adapters or increase GPU memory allocation |
| Adapter name conflict | Duplicate adapter names | Ensure all adapter names are unique |
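For the adapter-not-found case, it can help to list the model names the server actually registered; vLLM's OpenAI-compatible endpoint reports LoRA adapters alongside the base model (a sketch reusing the client setup from the inference examples, with a placeholder endpoint):
from openai import OpenAI

client = OpenAI(base_url="https://<service-url>/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)  # expect the base model name plus each adapter name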
Storage Initializer Dependency
If you disable the storage-initializer (storageInitializer.enabled: false), hf:// and s3:// adapters will fail to download. Only pvc:// adapters will work.
Limitations
Unsupported URI Schemes
OCI Registries (oci://): Currently not supported for LoRA adapters.
Workaround: Download the adapter to a PVC manually and use pvc:// scheme:
# Pre-download job
apiVersion: batch/v1
kind: Job
metadata:
name: download-adapter
spec:
template:
spec:
containers:
- name: downloader
image: python:3.11
command: ["sh", "-c"]
args:
- |
pip install huggingface-hub
python -c "from huggingface_hub import snapshot_download; snapshot_download('my-org/my-lora', local_dir='/mnt/adapter')"
volumeMounts:
- name: adapter-storage
mountPath: /mnt/adapter
volumes:
- name: adapter-storage
persistentVolumeClaim:
claimName: adapter-pvc
restartPolicy: Never
Related Documentation
- Configuration Guide: Detailed spec reference for LLMInferenceService
- Model Storage: Supported storage backends for base models
- Dependencies: Required infrastructure components
Summary
LoRA adapters in LLMInferenceService provide:
- ✅ Three URI schemes: HuggingFace Hub, S3-compatible storage, and PVC
- ✅ Automatic integration: Controller handles downloads, mounts, and vLLM configuration
- ✅ Dynamic switching: Per-request adapter selection with minimal overhead
- ✅ Multi-tenancy: Serve multiple task-specific models from a single deployment
- ✅ Production-ready: Support for private repositories, custom endpoints, and air-gapped deployments
For complete working examples, see the KServe samples repository.