Problem: Model latency was too high for real-time edge deployment.
Solution: Engineered a custom inference server in C++ on ONNX Runtime, achieving a 40x speedup over the Python baseline. Implemented dynamic batching and request queuing to maximize GPU throughput under high load (see the sketch below).
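A minimal sketch of how dynamic batching and request queuing might be wired together with the ONNX Runtime C++ API. This is illustrative only, not the project's actual implementation: the model path (`model.onnx`), tensor names (`input`, `output`), feature dimension, batch size, and wait budget are all placeholder assumptions.

```cpp
// Sketch: callers enqueue requests; a worker thread gathers them into one
// batched Run() call and scatters the output rows back to the callers.
// All names, dimensions, and limits below are placeholders for illustration.
#include <onnxruntime_cxx_api.h>

#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

constexpr size_t kFeatureDim = 128;  // hypothetical per-request input width
constexpr size_t kMaxBatch   = 32;   // flush once this many requests are queued
constexpr auto   kMaxWait    = std::chrono::milliseconds(5);  // ...or after this delay

struct Request {
    std::vector<float> features;              // one row of the input batch (kFeatureDim floats)
    std::promise<std::vector<float>> result;  // fulfilled with this request's output row
};

std::mutex g_mu;
std::condition_variable g_cv;
std::deque<Request> g_queue;

// Client-side helper: enqueue a request and block until its result row is ready.
std::vector<float> infer(std::vector<float> features) {
    Request req{std::move(features), {}};
    std::future<std::vector<float>> fut = req.result.get_future();
    {
        std::lock_guard<std::mutex> lock(g_mu);
        g_queue.push_back(std::move(req));
    }
    g_cv.notify_one();
    return fut.get();
}

// Server-side loop: collect up to kMaxBatch requests (waiting at most kMaxWait
// once the first one arrives), run them as a single batch, then split the output.
void batching_loop(Ort::Session& session) {
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    const char* input_names[]  = {"input"};   // placeholder tensor names
    const char* output_names[] = {"output"};

    while (true) {
        std::vector<Request> batch;
        {
            std::unique_lock<std::mutex> lock(g_mu);
            g_cv.wait(lock, [] { return !g_queue.empty(); });
            auto deadline = std::chrono::steady_clock::now() + kMaxWait;
            while (batch.size() < kMaxBatch &&
                   (!g_queue.empty() ||
                    g_cv.wait_until(lock, deadline, [] { return !g_queue.empty(); }))) {
                batch.push_back(std::move(g_queue.front()));
                g_queue.pop_front();
            }
        }

        // Pack the rows into one [batch, kFeatureDim] tensor.
        const int64_t n = static_cast<int64_t>(batch.size());
        std::vector<float> flat(n * kFeatureDim);
        for (int64_t i = 0; i < n; ++i)
            std::copy(batch[i].features.begin(), batch[i].features.end(),
                      flat.begin() + i * kFeatureDim);

        int64_t shape[2] = {n, static_cast<int64_t>(kFeatureDim)};
        Ort::Value input = Ort::Value::CreateTensor<float>(
            mem, flat.data(), flat.size(), shape, 2);

        auto outputs = session.Run(Ort::RunOptions{nullptr},
                                   input_names, &input, 1, output_names, 1);

        // Assumes a 2-D output [batch, out_dim]; scatter each row to its caller.
        auto out_shape = outputs[0].GetTensorTypeAndShapeInfo().GetShape();
        const int64_t out_dim = out_shape[1];
        const float* out = outputs[0].GetTensorMutableData<float>();
        for (int64_t i = 0; i < n; ++i)
            batch[i].result.set_value(
                std::vector<float>(out + i * out_dim, out + (i + 1) * out_dim));
    }
}

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "edge-server");
    Ort::SessionOptions opts;
    Ort::Session session(env, "model.onnx", opts);  // placeholder model path
    std::thread worker(batching_loop, std::ref(session));
    worker.join();
}
```

The key trade-off the sketch illustrates: the batcher flushes either when the queue reaches kMaxBatch or after a small wait budget (kMaxWait), spending a few milliseconds of added latency to keep the GPU fed with larger batches under high load.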