Deploying the Mistral-7B-Instruct-v0.3 Chat Model on Ray Serve
With all the node pools provisioned, we can now proceed to deploy the Mistral-7B-Instruct-v0.3 chatbot infrastructure.
Let's begin by deploying the ray-service-mistral.yaml file:
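A minimal sketch of the deploy step, assuming the manifest sits in the current working directory (adjust the path to wherever the file lives in your checkout):

kubectl apply -f ray-service-mistral.yaml

Applying it creates the namespace and the RayService: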
namespace/mistral created
rayservice.ray.io/mistral created
Creating the Ray Service Pods for Inference
The ray-service-mistral.yaml file defines the Kubernetes configuration for deploying the Ray Serve service for the mistral7bv0.3 AI chatbot:
apiVersion: v1
kind: Namespace
metadata:
  name: mistral
---
#----------------------------------------------------------------------
# NOTE: For deployment instructions, refer to the DoEKS website.
#----------------------------------------------------------------------
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: mistral
  namespace: mistral
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 600
  serveConfigV2: |
    applications:
      - name: mistral-deployment
        import_path: "mistral1:entrypoint"
        route_prefix: "/"
        deployments:
          - name: mistral-7b
            autoscaling_config:
              min_replicas: 1
              max_replicas: 1
              target_num_ongoing_requests_per_replica: 1
  rayClusterConfig:
    rayVersion: '2.22.0'
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: "0" # this is to ensure no tasks or actors are scheduled on the head Pod
        num-gpus: "0"
      template:
        metadata:
          labels:
            ray.io/node-type: head
        spec:
          containers:
            - name: head
              image: public.ecr.aws/aws-containers/aiml/mistral-7b:0.1.0
              imagePullPolicy: Always # Ensure the image is always pulled when updated
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "4"
                  memory: 10Gi
                requests:
                  cpu: "2"
                  memory: 10Gi
              env:
                - name: PORT
                  value: "8000"
                - name: LD_LIBRARY_PATH
                  value: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH"
          nodeSelector:
            instanceType: mixed-x86
            provisionerType: Karpenter
            workload: rayhead
          tolerations:
            - key: node.kubernetes.io/not-ready
              operator: Exists
              effect: NoExecute
              tolerationSeconds: 300
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
      - groupName: worker-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams:
          resources: '"{\"neuron_cores\": 2}"'
        template:
          metadata:
            labels:
              ray.io/node-type: worker
          spec:
            affinity:
              podAffinity:
                preferredDuringSchedulingIgnoredDuringExecution:
                  - weight: 100
                    podAffinityTerm:
                      labelSelector:
                        matchExpressions:
                          - key: ray.io/node-type
                            operator: In
                            values:
                              - head
                      topologyKey: kubernetes.io/zone
            containers:
              - name: worker
                image: public.ecr.aws/aws-containers/aiml/mistral-7b:0.1.0
                imagePullPolicy: Always # Ensure the image is always pulled when updated
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                # We are using 2 Neuron cores per HTTP request hence this configuration handles 6 requests per second
                resources:
                  limits:
                    cpu: 5
                    memory: 26Gi
                    aws.amazon.com/neuron: "1"
                  requests:
                    cpu: 4
                    memory: 26Gi
                    aws.amazon.com/neuron: "1"
                env:
                  # Model and Neuron configuration
                  - name: MODEL_ID
                    value: "askulkarni2/neuron-mistral7bv0.3"
                  - name: NEURON_CORES
                    value: "2" # Number of Neuron cores to use
                  - name: NEURON_RT_ASYNC_EXEC
                    value: "1" # Enable asynchronous execution
                  - name: NEURON_RT_NUM_CORES
                    value: "2" # Total number of Neuron cores available
                  - name: NEURON_RT_VISIBLE_CORES
                    value: "0,1" # Which specific cores to use (cores 0 and 1)
                  # Compilation settings
                  - name: NEURON_CC_FLAGS
                    value: "-O1" # Optimization level for compilation
                  - name: NEURON_COMPILE_ONLY
                    value: "0" # Don't just compile, also run the model
                  # Cache configuration
                  - name: NEURON_COMPILE_CACHE_URL
                    value: "/tmp/model/cache" # Where to store compiled artifacts
                  - name: NEURON_RT_CACHE_DIRECTORY
                    value: "/tmp/model/cache" # Runtime cache location
                  - name: NEURON_RT_USE_PREFETCHED_NEFF
                    value: "1" # Use pre-compiled neural network files
                  # System paths
                  - name: NEURON_RT_LOG_LEVEL
                    value: "INFO" # Change to INFO or DEBUG when troubleshooting
                  - name: LD_LIBRARY_PATH
                    value: "/home/ray/anaconda3/lib:$LD_LIBRARY_PATH" # Library path
                  - name: PORT
                    value: "8000" # Service port
                  # Ray
                  - name: RAY_gcs_server_request_timeout_seconds
                    value: "120"
                  - name: RAY_SERVE_KV_TIMEOUT_S
                    value: "120"
                volumeMounts:
                  - mountPath: /tmp/ray
                    name: ray-logs
                  - mountPath: /dev/shm
                    name: dshm
                  - mountPath: /tmp/model
                    name: nvme-storage
            volumes:
              - name: ray-logs
                emptyDir: {}
              - name: dshm
                emptyDir:
                  medium: Memory
              - name: nvme-storage
                hostPath:
                  path: /mnt/k8s-disks/0
                  type: Directory
            nodeSelector:
              instanceType: trn1.2xlarge
              provisionerType: Karpenter
              neuron.amazonaws.com/neuron-device: "true"
            tolerations:
              - key: "aws.amazon.com/neuron"
                operator: "Exists"
                effect: "NoSchedule"
              - key: "node.kubernetes.io/not-ready"
                operator: "Exists"
                effect: "NoExecute"
                tolerationSeconds: 300
This configuration accomplishes the following:
- Creates a Kubernetes namespace named mistral for resource isolation
- Deploys a RayService named rayservice.ray.io/mistral that uses a Python script to create the Ray Serve component
- Provisions a Head Pod and Worker Pods that pull their Docker images from Amazon Elastic Container Registry (ECR)
After applying the configurations, we'll monitor the progress of the head and worker pods:
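One way to watch the rollout is a streaming get on the namespace created above (the -w flag keeps the listing updated as pod phases change):

kubectl get pods -n mistral -w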
NAME READY STATUS RESTARTS AGE
mistral-raycluster-xxhsj-head-l6zwx 0/2 ContainerCreating 0 3m4s
mistral-raycluster-xxhsj-worker-group-worker-b8wqf 0/1 Init:0/1 0 3m4s
...
mistral-raycluster-xxhsj-head-l6zwx 1/2 Running 0 3m48s
mistral-raycluster-xxhsj-head-l6zwx 2/2 Running 0 3m59s
mistral-raycluster-xxhsj-worker-group-worker-b8wqf 0/1 Init:0/1 0 4m25s
mistral-raycluster-xxhsj-worker-group-worker-b8wqf 0/1 PodInitializing 0 4m36s
mistral-raycluster-xxhsj-worker-group-worker-b8wqf 0/1 Running 0 4m37s
mistral-raycluster-xxhsj-worker-group-worker-b8wqf 1/1 Running 0 4m48s
It may take up to 5-8 minutes for both pods to be ready.
We can also use the following command to wait for the pods to get ready:
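A sketch of that wait, using the pods' Ready condition; the timeout value is an illustrative assumption and can be adjusted:

kubectl wait pod --all --for=condition=Ready -n mistral --timeout=600s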
pod/mistral-raycluster-xxhsj-head-l6zwx condition met
pod/mistral-raycluster-xxhsj-worker-group-worker-b8wqf condition met
Once the pods are fully deployed, we'll verify that everything is in place:
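One way to get this combined view is a single get across the relevant resource types (the raycluster and rayservice kinds assume the KubeRay CRDs are installed, which they are in this setup):

kubectl get pod,svc,raycluster,rayservice -n mistral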
NAME READY STATUS RESTARTS AGE
pod/mistral-raycluster-xxhsj-head-l6zwx 2/2 Running 0 5m34s
pod/mistral-raycluster-xxhsj-worker-group-worker-b8wqf 1/1 Running 0 5m34s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mistral ClusterIP 172.20.112.247 <none> 6379/TCP,8265/TCP,10001/TCP,8000/TCP,8080/TCP 2m6s
NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE
raycluster.ray.io/mistral-raycluster-xxhsj 1 1 6 36Gi 0 ready 5m36s
NAME SERVICE STATUS NUM SERVE ENDPOINTS
rayservice.ray.io/mistral WaitForServeDeploymentReady
Note that the service status is WaitForServeDeploymentReady. This indicates that Ray is still working to get the model deployed.
Configuring RayService may take up to 10 minutes.
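While waiting, an optional way to watch progress is to port-forward the Ray dashboard port (8265, exposed by the mistral service in the manifest above) and check the Serve application status at http://localhost:8265:

kubectl port-forward svc/mistral 8265:8265 -n mistral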
We can wait for the RayService to be running with this command:
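A sketch of that wait, assuming the KubeRay version in use exposes a Ready condition on the RayService (the condition name and timeout are assumptions to adapt to your cluster):

kubectl wait rayservice/mistral --for=condition=Ready -n mistral --timeout=900s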
rayservice.ray.io/mistral condition met
With everything properly deployed, we can now proceed to create the web interface for the chatbot.