# Serving Service

## Overview

The Serving Service (`ServingService`) provides model deployment and serving capabilities for the Kamiwaza AI Platform. Located in `kamiwaza_client/services/serving.py`, this service manages Ray cluster operations, model deployment, and inference requests.
## Key Features
- Ray Service Management
- Model Deployment
- Model Instance Management
- Model Loading/Unloading
- Text Generation
- Health Monitoring
- VRAM Estimation
## Ray Service Management

### Available Methods

- `start_ray() -> Dict[str, Any]`: Initialize the Ray service
- `get_status() -> Dict[str, Any]`: Get Ray cluster status
```python
# Start Ray service
status = client.serving.start_ray()

# Check Ray status
ray_status = client.serving.get_status()
```
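A common pattern is to start Ray only when it is not already running. The helper below is a hypothetical convenience wrapper, not part of the SDK, and the `"running"` key is an assumption about the shape of the dict `get_status()` returns; adjust it to the actual status fields.

```python
# Hypothetical guard (not part of the SDK): start Ray only when the
# cluster does not already report itself as running. The "running" key
# is an assumption about the get_status() dict; adjust as needed.
def ensure_ray_running(client):
    status = client.serving.get_status()
    if not status.get("running", False):
        status = client.serving.start_ray()
    return status
```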
## Model Deployment

### Available Methods

- `estimate_model_vram(model_id: UUID) -> int`: Estimate model VRAM requirements
- `deploy_model(deployment: CreateModelDeployment) -> ModelDeployment`: Deploy a model
- `list_deployments() -> List[ModelDeployment]`: List all deployments
- `get_deployment(deployment_id: UUID) -> ModelDeployment`: Get deployment details
- `stop_deployment(deployment_id: UUID)`: Stop a deployment
- `get_deployment_status(deployment_id: UUID) -> DeploymentStatus`: Get deployment status
```python
# Estimate VRAM requirements
vram_needed = client.serving.estimate_model_vram(model_id)

# Deploy a model
deployment = client.serving.deploy_model(CreateModelDeployment(
    model_id=model_id,
    name="my-deployment",
    replicas=1,
    max_concurrent_requests=4,
))

# List deployments
deployments = client.serving.list_deployments()

# Get deployment status
status = client.serving.get_deployment_status(deployment_id)

# Stop deployment
client.serving.stop_deployment(deployment_id)
```
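Deployment is typically not instantaneous, so callers often poll `get_deployment_status` until the model is ready before sending requests. The helper below is a hypothetical sketch, not part of the SDK: the terminal state names `"DEPLOYED"` and `"FAILED"` and the `status` attribute on `DeploymentStatus` are assumptions to adjust against the real model.

```python
import time

# Hypothetical helper (not part of the SDK): poll a deployment until it
# reaches a terminal state. The state names "DEPLOYED"/"FAILED" and the
# `status` attribute are assumptions about DeploymentStatus.
def wait_for_deployment(client, deployment_id, timeout=300, interval=5):
    """Poll get_deployment_status until a terminal state or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.serving.get_deployment_status(deployment_id)
        state = getattr(status, "status", status)
        if state in ("DEPLOYED", "FAILED"):
            return state
        time.sleep(interval)
    raise TimeoutError(f"Deployment {deployment_id} not ready in {timeout}s")
```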
## Model Instance Management

### Available Methods

- `list_model_instances() -> List[ModelInstance]`: List all model instances
- `get_model_instance(instance_id: UUID) -> ModelInstance`: Get instance details
- `get_health(deployment_id: UUID) -> Dict[str, Any]`: Get deployment health
- `unload_model(deployment_id: UUID)`: Unload model from memory
- `load_model(deployment_id: UUID)`: Load model into memory
```python
# List model instances
instances = client.serving.list_model_instances()

# Get instance details
instance = client.serving.get_model_instance(instance_id)

# Check deployment health
health = client.serving.get_health(deployment_id)

# Load/unload model
client.serving.unload_model(deployment_id)
client.serving.load_model(deployment_id)
```
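Since `load_model` and `unload_model` always come in pairs when managing memory around bursts of work, a small context manager can guarantee the unload happens even if the work raises. This is a hypothetical convenience wrapper built on the two documented calls, not part of the SDK.

```python
from contextlib import contextmanager

# Hypothetical wrapper (not part of the SDK): keep a model loaded only
# for the duration of a block of work, unloading it afterwards even if
# the block raises an exception.
@contextmanager
def model_loaded(client, deployment_id):
    client.serving.load_model(deployment_id)
    try:
        yield
    finally:
        client.serving.unload_model(deployment_id)
```

Usage: `with model_loaded(client, deployment_id): ...` around a batch of generation requests.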
## Text Generation

### Available Methods

- `simple_generate(deployment_id: UUID, prompt: str) -> str`: Simple text generation
- `generate(deployment_id: UUID, request: GenerationRequest) -> GenerationResponse`: Advanced text generation
```python
# Simple text generation
response = client.serving.simple_generate(
    deployment_id=deployment_id,
    prompt="Once upon a time",
)

# Advanced text generation
response = client.serving.generate(
    deployment_id=deployment_id,
    request=GenerationRequest(
        prompt="Once upon a time",
        max_tokens=100,
        temperature=0.7,
        top_p=0.9,
    ),
)
```
## Error Handling

The service includes built-in error handling for common scenarios:
```python
try:
    deployment = client.serving.deploy_model(deployment_config)
except DeploymentError as e:
    print(f"Deployment failed: {e}")
except ResourceError as e:
    print(f"Resource allocation failed: {e}")
except APIError as e:
    print(f"Operation failed: {e}")
```
## Best Practices
- Always estimate VRAM requirements before deployment
- Monitor deployment health regularly
- Use an appropriate number of replicas for your load
- Implement proper error handling
- Clean up unused deployments
- Consider using advanced generation parameters for better control
- Load/unload models to manage memory efficiently
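The first practice above can be sketched as a pre-flight check built on the documented `estimate_model_vram` call. This is a hypothetical helper, not part of the SDK, and the 8 GiB default budget is purely illustrative.

```python
GIB = 1024 ** 3

# Hypothetical pre-flight check (not part of the SDK): verify the
# estimated VRAM fits a configured budget before deploying. The 8 GiB
# default is illustrative; set it to your GPU's available memory.
def vram_fits(client, model_id, budget_bytes=8 * GIB):
    """Return True when the model's estimated VRAM fits the budget."""
    return client.serving.estimate_model_vram(model_id) <= budget_bytes
```

Call this before `deploy_model` and skip (or reconfigure) deployments that would not fit.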