Deployment

Inference

Definition

The process of running a trained model on new data to produce predictions or generated outputs. Inference cost and latency are often the dominant operational concerns in production AI, particularly for large generative models, where each request can cost on the order of cents and total spend grows with request volume.

Related Terms

Model Serving

The process of deploying trained ML models to production environments where they can receive inputs and return predictions at scale. Model serving infrastructure must address throughput, latency, versioning, and cost while meeting service-level agreements (SLAs).

Inference Cost

The compute and financial cost of running a model to produce a single prediction or generated response. Inference cost is often the dominant AI operational expenditure at scale and is managed through model compression (such as quantization), caching, and batching.
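
To make the arithmetic concrete, the following is a minimal sketch of a per-request and monthly cost estimate. It assumes token-based pricing, which is only one possible pricing model; the function names, the prices, and the cache hit rate are hypothetical placeholders, not figures from any real provider.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the cost of one request under assumed per-1,000-token prices."""
    return (input_tokens / 1000) * price_in_per_1k \
        + (output_tokens / 1000) * price_out_per_1k


def monthly_cost(requests_per_month: int, per_request_cost: float,
                 cache_hit_rate: float = 0.0) -> float:
    """Estimate monthly spend, assuming cached responses cost nothing to serve."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_requests = requests_per_month * (1.0 - cache_hit_rate)
    return billable_requests * per_request_cost


# Hypothetical numbers, chosen only to illustrate the calculation.
per_request = cost_per_request(1000, 500, price_in_per_1k=0.01, price_out_per_1k=0.03)
without_cache = monthly_cost(1_000_000, per_request)
with_cache = monthly_cost(1_000_000, per_request, cache_hit_rate=0.2)
print(f"per request: {per_request:.4f}")
print(f"monthly, no cache: {without_cache:.2f}; with 20% cache hits: {with_cache:.2f}")
```

Under these assumptions, a request costing a few cents adds up to a large monthly bill at a million requests, which is why caching and the other levers named above matter.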

Batch Inference

Running a model on a large dataset in a single scheduled job rather than in real time. Batch inference is more cost-efficient than real-time serving for use cases such as nightly report generation, bulk document classification, or periodic customer scoring.
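
As an illustration, a batch inference job can be sketched as a loop that splits a dataset into fixed-size chunks and sends each chunk to the model in one call. This is a simplified sketch: `run_batch_inference`, `chunked`, and `toy_predict_batch` are hypothetical names, and the toy classifier stands in for a real model, which in practice would be loaded once and invoked by a scheduler.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def run_batch_inference(items: Iterable[T],
                        predict_batch: Callable[[Sequence[T]], Sequence[R]],
                        batch_size: int = 64) -> List[R]:
    """Run predict_batch over all items, one chunk at a time, preserving order."""
    results: List[R] = []
    for batch in chunked(items, batch_size):
        results.extend(predict_batch(batch))
    return results


def toy_predict_batch(texts: Sequence[str]) -> List[str]:
    """Stand-in for a real model: labels each document by length."""
    return ["long" if len(t) > 20 else "short" for t in texts]


documents = ["invoice", "x" * 30, "memo"]
labels = run_batch_inference(documents, toy_predict_batch, batch_size=2)
print(labels)
```

Processing items in chunks lets the model amortize per-call overhead across many inputs, which is one reason a scheduled job can be cheaper than answering each request individually in real time.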

Understanding the terms is only the first step; putting them into practice is the second.

Book a Physical AI fit conversation to explore how these AI concepts translate to your specific industry and business challenges.