Performance and sizing
This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.
Resource requirements
Baseline resources
Minimal deployment (development/testing):
- CPU: 100m (0.1 cores)
- Memory: 128Mi
Production deployment (recommended):
- CPU: 500m (0.5 cores)
- Memory: 512Mi
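These baselines map onto the container resources in the vMCP pod template. The fragment below is a minimal sketch applying the recommended production values; the limits shown are illustrative headroom rather than a documented requirement, and the structure follows the pod template examples later in this guide.
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: 500m # recommended production baseline
              memory: 512Mi
            limits:
              cpu: '1' # illustrative headroom; tune for your workload
              memory: 1Gi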
Scaling factors
Resource needs increase based on:
- Number of backends: Each backend adds minimal overhead (~10-20MB memory)
- Request volume: Higher traffic requires more CPU for request processing
- Composite tool complexity: Workflows with many parallel steps consume more memory
- Token caching: Authentication token cache grows with unique client count
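As a rough estimate from these figures, a production instance fronting 15 backends would start around 512Mi + 15 × ~20MB ≈ 800Mi of memory, before accounting for request volume, workflow execution, and token caching.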
Backend scale recommendations
vMCP performs well across different scales:
| Backend Count | Use Case | Notes |
|---|---|---|
| 1-5 | Small teams, focused toolsets | Minimal resource overhead |
| 5-15 | Medium teams, diverse tools | Recommended range for most use cases |
| 15-30 | Large teams, comprehensive toolsets | Increase health check interval |
| 30+ | Enterprise-scale deployments | Consider multiple vMCP instances |
Performance characteristics
Backend discovery
- Timing: Happens once per client session
- Duration: Typically completes in 1-3 seconds for 10 backends
- Timeout: 15 seconds (returns HTTP 504 on timeout)
- Parallelism: Backends queried concurrently for capabilities
Health checks
- Interval: Every 30 seconds by default (configurable)
- Impact: Minimal overhead on backend servers
- Timeout: 10 seconds by default (configurable via `healthCheckTimeout`)
- Configuration: See Configure health checks
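These settings map onto the operational failure-handling configuration. The fragment below is a sketch that makes the defaults explicit; placing `healthCheckTimeout` alongside `healthCheckInterval` under `failureHandling` is an assumption based on the examples later in this guide, so check Configure health checks for the authoritative schema.
spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 30s # default; raise this for large backend counts
        healthCheckTimeout: 10s # default; placement here is assumed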
Tool routing
- Overhead: Single-digit millisecond latency for routing and conflict resolution
- Caching: Routing table cached per session for consistent behavior
- Lookup: O(1) hash table lookup for tool/resource/prompt routing
Composite workflows
- Parallelism: Up to 10 parallel step executions by default (configurable)
- Execution model: DAG-based with dependency resolution
- Bottleneck: Limited by slowest backend response time in each level
- Memory: Step results cached in memory during workflow execution
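To make the DAG model concrete, the sketch below shows three steps: the first two have no dependencies, so they run in the same parallel level, and the third starts once both complete. The field and tool names are illustrative assumptions, not the actual workflow schema.
# Hypothetical workflow sketch; field and tool names are illustrative only
steps:
  - id: fetch-issues # level 1: no dependencies, runs in parallel with fetch-prs
    tool: github.list_issues
  - id: fetch-prs # level 1: no dependencies
    tool: github.list_prs
  - id: summarize # level 2: waits for the slower of the two level-1 steps
    tool: summarizer.summarize
    dependsOn: [fetch-issues, fetch-prs]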
Token caching
- Reduction: 90%+ reduction in authentication overhead for repeated requests
- Duration: Tokens cached until expiration
- Scope: Per-client, per-backend token cache
- Impact: Significantly improves response times for authenticated backends
Horizontal scaling
vMCP is stateless and supports horizontal scaling:
Scaling characteristics
- Independence: Each vMCP instance operates independently
- Session affinity: Client sessions are sticky to a single instance (via session ID)
- State: No shared state between instances
- Method: Scale by increasing replicas in the Deployment
Example scaling configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Scale to 3 instances
  # ... rest of deployment spec
Load balancing
When using multiple replicas, ensure your load balancer supports session affinity:
- Kubernetes Service: Use `sessionAffinity: ClientIP`
- Ingress: Configure session affinity/sticky sessions at the Ingress level
- Gateway API: Use appropriate session affinity configuration
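For a plain Kubernetes Service, client-IP affinity is set directly on the Service spec. The example below is a sketch: the Service name, selector label, and port are assumptions and must match your actual vMCP Deployment.
apiVersion: v1
kind: Service
metadata:
  name: vmcp-my-vmcp
spec:
  selector:
    app: vmcp-my-vmcp # assumed label; match your vMCP pod labels
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # affinity window (Kubernetes default: 3 hours)
  ports:
    - port: 8080 # assumed vMCP listening port
      targetPort: 8080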
When to scale
Scale up (increase resources)
Increase CPU and memory when you observe:
- High CPU usage (>70% sustained) during normal operations
- Memory pressure or OOM (out-of-memory) kills
- Slow response times (>1 second) for simple tool calls
- Health check timeouts or frequent backend unavailability
Scale out (increase replicas)
Add more vMCP instances when:
- CPU usage remains high despite increasing resources
- You need higher availability and fault tolerance
- Request volume exceeds capacity of a single instance
- You want to distribute load across multiple availability zones
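To spread replicas across availability zones, the standard Kubernetes topologySpreadConstraints field can be added to the vMCP pod template. This is a sketch; the label selector value is an assumption and must match your vMCP pod labels.
spec:
  podTemplateSpec:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway # prefer an even spread without blocking scheduling
          labelSelector:
            matchLabels:
              app: vmcp-my-vmcp # assumed pod label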
Scale configuration
Adjust operational settings when scaling:
Configuration for large backend counts (15+)
spec:
  config:
    operational:
      failureHandling:
        # Reduce health check frequency to minimize overhead
        healthCheckInterval: 60s
        # Increase thresholds for better stability
        unhealthyThreshold: 5
Configuration for high request volumes
spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi
Performance optimization
Reduce backend discovery time
- Use inline mode for static backend configurations (eliminates Kubernetes API queries)
- Minimize backend count by grouping related tools in fewer servers
- Ensure fast backend responses to initialize requests
Reduce authentication overhead
- Enable token caching (enabled by default)
- Use unauthenticated mode for internal/trusted backends
- Configure appropriate token expiration in your OIDC provider
Optimize composite workflows
- Minimize dependencies between steps to maximize parallelism
- Use `failureMode: continue` when appropriate to avoid blocking entire workflows
- Set appropriate timeouts for slow backends
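As a sketch of how these two settings might combine on a single step (only `failureMode: continue` is taken from this guide; the other field names are illustrative assumptions):
steps:
  - id: optional-enrichment # hypothetical step; schema shown is illustrative
    tool: slow-backend.enrich # illustrative tool reference
    timeout: 30s # generous per-step timeout for a slow backend
    failureMode: continue # a failure here doesn't block the rest of the workflow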
Monitor performance
Use the vMCP telemetry integration to monitor:
- Backend request latency and error rates
- Workflow execution times and failure patterns
- Health check success/failure rates
See Telemetry and metrics for configuration details.