Latency
In the context of machine learning, latency refers to the time a system takes to process a single inference request.
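To make that concrete, here is a minimal sketch (in Python, and not part of AIXPRT itself) of how per-request latency is commonly measured; run_inference is a hypothetical stand-in for a real model call in whatever framework you use.

import time

def run_inference(image):
    # Hypothetical placeholder for a real model call (e.g., a framework's predict/forward).
    time.sleep(0.01)  # simulate roughly 10 ms of inference work
    return "label"

def measure_latency(image):
    # Wall-clock time, in milliseconds, for one inference request.
    start = time.perf_counter()
    run_inference(image)
    return (time.perf_counter() - start) * 1000.0

print(f"Latency: {measure_latency(image=None):.1f} ms")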
What is a good latency result?
In real-time or near real-time use cases such as performing image recognition on individual photos being captured by a camera, lower latency is important because it improves the user experience. In other cases, such as performing image recognition on a large library of photos, achieving higher throughput might be preferable; designating larger batch sizes or running concurrent instances might allow the overall workload to complete more quickly.
The dynamics of these performance tradeoffs mean that there is no single “good” score for all machine learning scenarios. Some testers might prefer lower latency, while others would sacrifice latency to achieve the higher level of throughput that their use case demands.
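To see the tradeoff in numbers, consider a purely illustrative example: suppose a model needs 10 ms to process a batch of one image but 160 ms to process a batch of 32. The larger batch makes each request wait longer, yet it moves far more images per second; the short Python calculation below (with made-up timings) works this out.

# Illustrative timings only; real values depend on the model, precision, and hardware.
batch_times_ms = {1: 10.0, 32: 160.0}  # batch size -> time to process the whole batch

for batch_size, batch_ms in batch_times_ms.items():
    throughput = batch_size / (batch_ms / 1000.0)  # images processed per second
    print(f"batch {batch_size:>2}: latency {batch_ms:6.1f} ms, throughput {throughput:6.1f} images/s")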
When latency is your top priority
Suppose you’re a data science firm helping a client optimize manufacturing yield by speeding up the inspection process. Cameras photograph each product and upload the pictures to a server running software that queries a machine learning model. Because your goal is to identify defective items as quickly as possible, your priority is latency, and you could adjust the AIXPRT variables to use small batch sizes and a single instance.
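As a rough sketch of what that tuning might look like, the Python snippet below writes a latency-oriented profile to a JSON file. The key names (batch_sizes, concurrent_instances) are placeholders chosen for illustration, not the verbatim AIXPRT configuration schema; consult the AIXPRT documentation for the exact settings and file format.

import json

# Hypothetical latency-oriented settings; key names are illustrative placeholders.
latency_profile = {
    "batch_sizes": [1],         # one image per request keeps per-request time low
    "concurrent_instances": 1,  # a single instance avoids contention for the hardware
}

with open("latency_profile.json", "w") as f:
    json.dump(latency_profile, f, indent=2)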
Throughput
In the context of machine learning, throughput refers to the number of inputs a system processes in a given time period.
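A minimal sketch of how throughput is typically computed: divide the number of inputs processed by the elapsed wall-clock time. As before, run_inference_batch is a hypothetical stand-in for a real batched model call.

import time

def run_inference_batch(batch):
    # Hypothetical placeholder for a real batched model call.
    time.sleep(0.001 * len(batch))  # simulate work proportional to batch size
    return ["label"] * len(batch)

def measure_throughput(inputs, batch_size):
    # Inputs processed per second when running in fixed-size batches.
    start = time.perf_counter()
    for i in range(0, len(inputs), batch_size):
        run_inference_batch(inputs[i:i + batch_size])
    return len(inputs) / (time.perf_counter() - start)

images = [None] * 256  # stand-ins for real image tensors
print(f"Throughput: {measure_throughput(images, batch_size=32):.1f} images/s")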
What is a good throughput result?
In use cases such as performing image recognition on a large library of photos, higher throughput is usually preferable; designating larger batch sizes or running concurrent instances might allow the overall workload to complete more quickly. In real-time or near real-time use cases, such as performing image recognition on individual photos as a camera captures them, lower latency matters more because it improves the user experience.
As with latency, there is no single “good” throughput score for all machine learning scenarios. Some testers would sacrifice latency to achieve the higher level of throughput that their use case demands, while others prefer to keep per-request latency as low as possible.
When throughput is your top priority
Suppose you’re an archivist with hundreds of thousands of historical photographs you must categorize. Because your goal is to process this large volume of images efficiently, you care most about throughput, and you could adjust the AIXPRT variables to use large batch sizes and run concurrent instances.
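In contrast with the latency profile sketched earlier, a throughput-oriented setup would push batch sizes up and run several instances at once. Again, the key names below are illustrative placeholders rather than the verbatim AIXPRT schema.

import json

# Hypothetical throughput-oriented settings; key names are illustrative placeholders.
throughput_profile = {
    "batch_sizes": [32, 64],    # larger batches keep the hardware fully utilized
    "concurrent_instances": 4,  # several instances process batches in parallel
}

with open("throughput_profile.json", "w") as f:
    json.dump(throughput_profile, f, indent=2)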
Why does this matter for servers and cloud instances?
Currently, servers and cloud instances handle the bulk of machine learning inference work for many common apps. For app publishers, the process of choosing an ideal back-end setup can be difficult. Depending on the purpose of an app, publishers may want to prioritize low latency, high throughput, or a finely tuned balance between the two. Regardless of the priority, AIXPRT provides users with accurate and reliable data about these critical inference measures. That data can help businesses make successful infrastructure decisions and avoid costly periods of trial and error or customer dissatisfaction.
Why does this matter for PCs?
As desktop and laptop computing power continues to grow, an increasing number of app publishers are moving inference workloads from the back end to the client side. With inference work on the client side, apps can provide users with features such as voice and facial recognition, image classification, object detection, recommendations, and pattern recognition, even while offline or in areas that have slow or spotty network connections. Users who rely on these powerful apps need to know whether their gear can handle the load, and AIXPRT can help identify which PCs are up to the task.