InferVision NN model deployment webinar
Model life span
- build time: load model into memory (zip, encode, send, read, unzip, decode, certificate…)
- launch time: load model onto the GPU (API call, check load, free, status verification, launch config)
- run time: read input, provide output (input API, output API, execute)
- free: instance is tied to its process -> kill the process -> memory is freed
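As a rough illustration, the four stages map onto a small state machine. A minimal Python sketch with invented names (ModelState, ModelInstance), not InferVision's actual API:

```python
from enum import Enum, auto

class ModelState(Enum):
    """Illustrative lifecycle states from the notes above."""
    BUILT = auto()      # archive read + unzipped + decoded into host memory
    LAUNCHED = auto()   # weights loaded onto a GPU, instance process running
    RUNNING = auto()    # serving inference requests
    FREED = auto()      # process killed, memory reclaimed

class ModelInstance:
    def __init__(self, archive_path: str):
        # build time: unpack the archive into host memory
        self.archive_path = archive_path
        self.state = ModelState.BUILT

    def launch(self, gpu_id: int) -> None:
        # launch time: copy weights to the GPU, verify status, apply launch config
        self.gpu_id = gpu_id
        self.state = ModelState.LAUNCHED

    def infer(self, batch):
        # run time: read input, execute, provide output
        self.state = ModelState.RUNNING
        return [f"output for {x}" for x in batch]  # placeholder execution

    def free(self) -> None:
        # free: the instance lives in a process; killing it frees the memory
        self.state = ModelState.FREED
```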
gRPC: inter-process communication
Scheduler center
- Register (receive a model archive, return an id)
- Inference (receive input, return output)
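The two RPCs could look roughly like the following. A hedged sketch: in a real deployment these would be gRPC methods generated from a .proto service definition; the class and method names here are assumptions for illustration.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class SchedulerCenter:
    """Sketch of the scheduler center's two RPCs; in practice each method
    would be a gRPC handler behind a .proto service (names are assumed)."""
    models: dict = field(default_factory=dict)  # model_id -> archive bytes

    def register(self, archive: bytes) -> str:
        # Register: receive a model archive, return an id
        model_id = str(uuid.uuid4())
        self.models[model_id] = archive
        return model_id

    def inference(self, model_id: str, input_blob: bytes) -> bytes:
        # Inference: receive input, route to a GPU instance, return output
        if model_id not in self.models:
            raise KeyError(f"unknown model {model_id}")
        return b"output"  # placeholder for dispatch to a running instance
```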
Instance status:
- working; can still receive new tasks
- marked to be killed; cannot receive new tasks
Assumption: a model can be fully loaded onto a single GPU
How to find an instance (dispatch logic, sketched below):
- a matching instance is working now -> just send the request to it
- no matching instance, but an available GPU -> launch a new instance
- no matching instance, no available GPU -> kill a working instance (find it and kill it), launch a new instance …
- no matching instance, no available GPU, nobody to kill -> wait… [bounded by a timeout from the client]
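A minimal sketch of this dispatch logic, folding in the two instance statuses above. All names (Status, Instance, dispatch) are illustrative, and a real scheduler would wait for the drained victim's in-flight work before reusing its GPU:

```python
from enum import Enum, auto
from typing import Optional

class Status(Enum):
    WORKING = auto()   # serving; can still accept new tasks
    DRAINING = auto()  # marked to be killed; must not accept new tasks

class Instance:
    def __init__(self, model_id: str, gpu_id: int):
        self.model_id, self.gpu_id = model_id, gpu_id
        self.status = Status.WORKING

def dispatch(model_id: str, instances: list, free_gpus: list) -> Optional[Instance]:
    """Return an instance for model_id, following the four cases above;
    None means the caller should wait (the client-side timeout applies)."""
    # Case 1: a matching instance is working -> just send
    for inst in instances:
        if inst.model_id == model_id and inst.status == Status.WORKING:
            return inst
    # Case 2: no match, but a GPU is available -> launch a new instance
    if free_gpus:
        inst = Instance(model_id, free_gpus.pop())
        instances.append(inst)
        return inst
    # Case 3: no match, no free GPU -> pick a victim, drain it, relaunch
    for victim in instances:
        if victim.status == Status.WORKING:
            victim.status = Status.DRAINING   # stops accepting new tasks
            inst = Instance(model_id, victim.gpu_id)  # simplification: reuse at once
            instances.append(inst)
            return inst
    # Case 4: nobody to kill -> wait; the client's timeout bounds the wait
    return None
```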
Scheduler challenges
- Better monitoring and analysis
- Better instance matching
- Better load balancing
- Avoid race conditions (see the lock sketch after this list)
- Error handling
- Logging
- Mimic production environment (load, profile, test, optimization)
- Test (unit test, integration test, etc.)
- Interface
- Distributed
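On race conditions: two concurrent Inference calls must not both claim the last free GPU, so find-and-claim has to be one atomic step. A small sketch, assuming a single-process scheduler whose shared state is guarded by a threading lock:

```python
import threading

class SchedulerState:
    """Guard shared scheduler state so concurrent requests cannot both
    claim the same free GPU (one source of the race conditions above)."""
    def __init__(self, gpu_ids):
        self._lock = threading.Lock()
        self._free_gpus = list(gpu_ids)

    def claim_gpu(self):
        # check-then-take must happen under the lock, not as two steps
        with self._lock:
            return self._free_gpus.pop() if self._free_gpus else None

    def release_gpu(self, gpu_id):
        with self._lock:
            self._free_gpus.append(gpu_id)
```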
Client
- I/O cost -> highly efficient IPC (inter-process communication via shared memory)
- Client sends no requests, server sits idle -> keep the scheduler supplied with requests
Based on available client memory, determine the number of in-flight requests n -> size the shared memory accordingly
Preprocess input on the client CPU -> write to shared memory -> push onto the request queue -> hand off to the scheduler (sketched below)
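A sketch of that client-side pipeline using Python's multiprocessing.shared_memory. The function name and descriptor format are assumptions; on POSIX the segment outlives the writer until it is unlinked, which is what lets the scheduler read it without copying the tensor over the wire:

```python
import numpy as np
from multiprocessing import shared_memory

def preprocess_to_shm(raw: np.ndarray, name: str) -> dict:
    """Preprocess on the client CPU and write the result into shared
    memory, so only a small descriptor travels through the request queue."""
    tensor = raw.astype(np.float32) / 255.0  # example preprocessing step
    shm = shared_memory.SharedMemory(create=True, size=tensor.nbytes, name=name)
    dst = np.ndarray(tensor.shape, dtype=tensor.dtype, buffer=shm.buf)
    dst[:] = tensor
    del dst      # drop the view so the handle can be closed cleanly
    shm.close()  # on POSIX the segment stays alive; the reader unlinks it later
    # this descriptor is all that goes onto the request queue / over gRPC
    return {"shm_name": name, "shape": tensor.shape, "dtype": "float32"}

# usage (hypothetical request id):
# desc = preprocess_to_shm(np.zeros((3, 224, 224), dtype=np.uint8), "req-0001")
```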
Same model on one GPU: e.g. 4 instances in parallel vs. batching
Operator combining -> send inputs as batches to one instance, not across multiple instances (see the batching sketch below)
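A minimal sketch of that batching choice: a worker drains up to max_batch queued inputs and runs them as one batched call on a single instance, instead of spreading them over multiple instances. The names and the 5 ms wait are assumptions:

```python
import queue

def batch_worker(requests: queue.Queue, run_batch, max_batch: int = 4,
                 max_wait: float = 0.005):
    """Collect up to max_batch requests for the same model and execute
    them as one batch on a single GPU instance."""
    while True:
        batch = [requests.get()]          # block until at least one request
        while len(batch) < max_batch:
            try:
                batch.append(requests.get(timeout=max_wait))
            except queue.Empty:
                break                     # send a partial batch rather than wait
        run_batch(batch)
```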
docker container layers vs. model layers [LARGE vs. SMALL]