PTChatterly

Meet PTChatterly: a new AI benchmarking and sizing framework from PT that can quantify the performance and user experience your customers can expect from an underlying solution running their in-house LLMs.

In-house large language models (LLMs) allow organizations to build AI-assisted chatbots trained on their own private data, opening up new opportunities for innovation and efficiency. They’re a hot topic in AI today, but for customers trying to find the right underlying cloud or on-prem hardware for their in-house LLMs, it’s difficult to know what to choose. Similarly, organizations selling computing solutions face challenges quantifying their solutions’ performance for LLM use cases.

To answer this need, we built PTChatterly.

As a new LLM benchmark and sizing framework, PTChatterly can tell your customers exactly what performance they can expect from the solution running their private LLMs. Utilizing an existing LLM and the retrieval augmented generation (RAG) method, it searches a local corpus of data and constructs responses in AI-assisted chatbot conversations with multiple simulated users. It generates meaningful, real-world metrics, e.g., “32 people can have simultaneous conversations with at worst XX response time.”

This tool is exclusively available from PT—and we’re ready to run it on your solution. Contact us to get started.

Want to learn more about PTChatterly?
Send us your questions and check back here often for more details.

What is PTChatterly and what is its purpose?

Organizations of all sizes, from SMB to enterprise, are considering implementing in-house generative artificial intelligence (GenAI) chatbots that use a combination of an existing large language model (LLM) and the organization’s own data. Teams might use these in-house GenAI chatbots for customer service, to improve productivity and efficiency, to help train new staff, or for a host of other applications. Building such chatbots internally keeps the organization’s data on premises and therefore private, a key advantage for many groups. Before they can reap the benefits of such a chatbot, however, they’ll need to choose technologies to run that solution.

PTChatterly is the testing service these organizations need to determine what hardware solutions will be the best fit for their in-house AI chatbots.

PTChatterly helps you size and understand a solution’s performance for an in-house chatbot that utilizes retrieval augmented generation (RAG) with a popular LLM and a private database of business information. Its primary claim is one that is simple and clear for buyers and users of all types: The solution supported up to X simultaneous chatbot users with acceptable performance.

As a benchmark and sizing framework, PTChatterly couples a full-stack AI implementation of an LLM, augmented with in-house data, with a testing harness that lets you determine how many people the chatbot can support. Rather than reporting results in technical measurements that few users would understand, it provides a metric that is meaningful and simple to grasp: For example, it might say that the server under test supports 32 people having simultaneous conversations with a response time of 10 seconds or less.

This repeatable, real-world testing service gives organizations hard data on how solutions will support a given in-house LLM solution.

Who can benefit from using PTChatterly?

Almost any organization working with or wanting to work with in-house chatbots that combine their data with an LLM! PTChatterly is a valuable tool for those selling GenAI LLM-focused hardware solutions and for anyone building their own in-house AI-assisted chatbot and trying to determine what hardware they need to support it.

If you sell technology solutions, PTChatterly can quantify your solutions’ advantage over the competition or over previous-generation solutions for this use case. It can also help demonstrate the benefits of adding higher-bin CPUs, more GPUs, more or faster networking, or potentially other additional resources. For your customers who want real-world AI proof points, PTChatterly delivers clear measures they can understand and trust.

If you’re building your own in-house LLM solution, PTChatterly can act as a sizing tool to help guide your hardware solution purchases. You can use it to determine what kind of solution—and how much computing power—you are likely to need for the user count you anticipate and the response time those users require. Like any such tool, it is, of course, an approximation, but if you want a quick and affordable way to figure out how many users a solution you’re considering can support, PTChatterly is the answer.

Why PTChatterly? Can’t I measure this on my own?

If you sell solutions, your engineers can almost certainly do this kind of testing—but it may well take a great deal of time and work. For a very reasonable price, PTChatterly can give you answers quickly and provide insight into how your different offerings compare in performance. We know of no faster or better way to learn the performance of your solutions for an AI-assisted chatbot use case.

If you’re shopping for a solution, you can use PTChatterly to assess performance on different hardware platforms or with different response times and user counts on a single platform. If you’re an organization that has set up a proof of concept but needs to understand how a more powerful solution might help you grow your AI implementation, PTChatterly can provide useful approximations using its corpus and LLM. And it can help you plan resources if you haven’t yet deployed such a solution at all.

PTChatterly can tell you how many users a given hardware solution can support under a desired response time threshold with an organization’s data and a publicly available LLM; offer performance data on CPU, GPU, disk, network, and other system components; and allow you to compare those results between multiple different hardware solutions. It can be the fastest path to choosing a solution that meets your needs.

Why does PTChatterly use a retrieval augmented generation (RAG) architecture?

RAG is an efficient method of extending the ability of LLMs to use local business information. It analyzes the supplementary information and transforms it into an efficient format for fast, accurate retrieval. RAG AI chatbots provide better and more contextual responses, drawing on data that is not in the LLM or that is more up to date, without the high costs of retraining the model.

What are the primary components of PTChatterly?

PTChatterly is a full-stack AI implementation of an AI-assisted chatbot, utilizing a RAG architecture, a local LLM, and in-house data. It consists of nine primary components:

Corpus: The in-house knowledge base that the PTChatterly AI-assisted chatbot uses, along with the LLM, for answering questions from multiple users
Bulk loader: A set of custom Python modules that reads and parses the corpus, as a set of documents or dataset; stores it in a vector database; and creates a vector index for semantic searches
Multi-threaded benchmark harness: Custom client harness written in Go that simulates multiple users having simultaneous conversations with the AI-assisted chatbot, provides corpus-specific conversations to the broker, and collects user-experience response times
Broker service: Custom Go code that orchestrates the data flow, receiving queries from clients and using the framework's services to generate a response
Embedding service: Handles calls to the embedding model via an API and enables efficient and more accurate semantic searching of the corpus
Vector database: An information store that provides efficient search for structured and unstructured documents. It stores the original data, its embedding, and a vector index.
Semantic search: Performs a non-keyword, sentence-based query of the documents in the vector database via an API endpoint.
Reranking service: Uses AI to compare the results of the semantic search of the corpus to the original query and to form a better context for answering the query.
Large language model using retrieval augmented generation: An AI model that recognizes and generates text, utilizing the specific information the search gathered from the corpus and forming a response as a chatbot. The LLM employs RAG to retrieve supplementary data to use when generating responses, helping to provide better and more contextual responses based on more up-to-date data, without the high costs of retraining the model.
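
To make the bulk loader, vector database, and semantic search roles more concrete, here is a minimal Python sketch. It is an illustration only, not PTChatterly code: a sparse word-count vector stands in for the dense embedding a real embedding model would return, an in-memory list stands in for the vector database, and every name (`embed`, `ToyVectorStore`, `bulk_load`, `search`) is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding service: an L2-normalized word-count vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


class ToyVectorStore:
    """In-memory stand-in for a vector database: each row keeps a document and its embedding."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, dict[str, float]]] = []

    def bulk_load(self, documents: list[str]) -> None:
        # Parse, embed, and store each document (the bulk loader's job).
        for doc in documents:
            self.rows.append((doc, embed(doc)))

    def search(self, query: str, top_k: int = 2) -> list[str]:
        q = embed(query)

        def similarity(row: tuple[str, dict[str, float]]) -> float:
            # Both vectors are unit length, so the dot product is the cosine similarity.
            return sum(weight * row[1].get(word, 0.0) for word, weight in q.items())

        ranked = sorted(self.rows, key=similarity, reverse=True)
        return [doc for doc, _ in ranked[:top_k]]


store = ToyVectorStore()
store.bulk_load([
    "Sunny loft near the beach with fast wifi",
    "Quiet cabin in the mountains with a fireplace",
    "Downtown studio close to museums and cafes",
])
top_hits = store.search("cabin with a fireplace", top_k=1)
print(top_hits)
```

A production vector database would replace the linear scan with a vector index, which is what keeps searches fast as the corpus grows.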

What is the roadmap for future features?

Like the AI field in general, PTChatterly is evolving constantly. We are improving its features and optimizations, adding new features, and continuing to test it. Currently, we’re working on:

A multimodal version of PTChatterly using multimodal RAG and unimodal (text) questions with multimodal (images and text) responses
Adding new corpuses of data to address the needs of different types of businesses
Improving support for clusters of systems and clusters of cloud instances
A version of PTChatterly using unimodal textual agentic RAG, which can orchestrate more complex dataflows than the basic RAG we are currently using

We currently have support for AMD and NVIDIA GPUs; as more manufacturers make GPUs and those GPUs grow in market popularity, we will expand our support to include additional GPU offerings.

What is the key claim PTChatterly generates?

PTChatterly helps you size and understand the performance of a solution for an in-house, AI-assisted chatbot that utilizes retrieval augmented generation (RAG) with a popular LLM and a private database of business information. Its primary claim is one that is simple and clear for buyers and users of all types:

The solution supported up to X simultaneous chatbot users with acceptable performance.

Acceptable performance is something that you get to define. Learn more in the section titled What response time choices must I make when running PTChatterly, and what do they mean for users? below.

For those who want to dig deeper into a solution’s performance, PTChatterly delivers a great deal of detailed information on response time and other performance metrics. Explore those details in the section titled What data does PTChatterly capture, and what does it mean to users? below.

How long does it take to get a result from PTChatterly?

From the time we have complete access to a single system configuration and the supporting client VM or system, we can deliver a result (the median of three test runs) in two weeks.

The two-week period assumes that the system in question uses AMD or Intel CPUs and that any GPUs are from AMD or NVIDIA. Systems with different CPU or GPU architectures may take additional time and cost. It also assumes that PTChatterly uses its default components: vector database, reranking service, embedding service, and semantic search method, as well as one of its multiple default LLMs and corpuses. Changing any of these components will add time and cost.

What data does PTChatterly capture, and what does it mean to users?

PTChatterly delivers a simple claim: How many simultaneous users of an in-house chatbot the solution under test, or SUT, can support with acceptable response time. It does this by simulating sets of users of varying sizes.

PTChatterly captures three different major technical response time metrics for each set of simultaneous users.

The first is total response time for each question posed to the chatbot, or the amount of time the AI-assisted chatbot takes to provide a full response to a single question. This metric tells how long the hypothetical users of PTChatterly waited to receive a complete answer to each question.

The second is time to first token (TTFT), where a “token” is a portion of a word. (In AI, tokens represent portions of data on which an AI algorithm can operate.) Like many AI-assisted chatbots, PTChatterly works in streaming mode, meaning that the chatbot “writes” an answer to the user as it is generating that answer in real time. TTFT tells us how long the hypothetical users waited before the chatbot started “writing” its answer. For AI-assisted chatbots that use streaming mode—as many do, including ChatGPT—a long TTFT could lead to users losing interest and moving on, so this is an important measure to know.

The final metric is tokens per second (TPS), or the speed at which the chatbot produced tokens. This metric tells users how quickly the chatbot displayed its scrolling answer.

PTChatterly captures each of these metrics for differing numbers of simultaneous users asking questions in conversations with the chatbot. All three contribute to an understanding of what real people would experience. As an example, let's say a solution delivers a total response time of 20 seconds, a TTFT of 2 seconds, and a TPS rate of 4 tokens per second. 20 seconds may seem like a long time to wait for an answer, but a user would start seeing their answer scroll in just 2 seconds—not long at all—and see the scrolling answer move reasonably quickly. On the other hand, a total response time of 10 seconds with a TTFT of 5 seconds might actually feel slower to the user, even though they'd receive their full answer faster.
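
The relationship among the three metrics can be sketched from the arrival times of streamed tokens. The Python below is a minimal illustration, not PTChatterly's measurement code. It assumes TPS is averaged over the whole response, including the wait for the first token; a harness could instead count only the generation phase. The `StreamMetrics` and `stream_metrics` names are hypothetical. Under that assumption, the 20-second, 2-second-TTFT, 4-TPS example above corresponds to an 80-token answer.

```python
from dataclasses import dataclass


@dataclass
class StreamMetrics:
    total_response_time: float  # seconds from question to final token
    ttft: float                 # seconds from question to first token
    tps: float                  # tokens per second over the whole response (assumption)


def stream_metrics(question_sent: float, token_times: list[float]) -> StreamMetrics:
    """Derive the three response time metrics from token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token arrival time")
    total = token_times[-1] - question_sent
    if total <= 0:
        raise ValueError("tokens must arrive after the question was sent")
    ttft = token_times[0] - question_sent
    return StreamMetrics(total, ttft, len(token_times) / total)


# The example from the text: first token at 2 s, last token at 20 s, 80 tokens in all.
token_times = [2.0 + 18.0 * i / 79 for i in range(80)]
metrics = stream_metrics(0.0, token_times)
print(metrics)
```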

To this point, when PTChatterly generates the final claim for your use, it takes into account that users have finite patience and expect that answers will appear in a timely manner. Accordingly, PTChatterly enforces a specific response-time threshold, based on one of these metrics, to determine the maximum number of simultaneous chatbot users the solution under test can support with acceptable performance. (You can choose both the threshold and the metric on which to base it, or we can help you decide.)

There is no definitive industry standard for acceptable response time for an AI-assisted chatbot. Our estimates of typical total response time thresholds vary from one second to about three and a half minutes depending on the type of data corpus, the length and complexity of the question, and the use case.

Because total response times depend on the complexity of the question, by default PTChatterly uses the 95th percentile of the total response time to determine the maximum user capacity of the chatbot solution. This means that the AI-assisted chatbot fully answered 95 percent of the questions from the simulated users in under the threshold time. You can change this percentile to whatever you find acceptable for your use case.

You may also choose to use the average total response time to determine maximum user capacity. In this case, you would set the percentile to 100%.

Regardless of what percentile you select for the threshold, PTChatterly will provide data that includes the average response time, 5th percentile response time, and 95th percentile response time, plus a chart that highlights how response times change as the number of simultaneous users changes. This data will allow those interested in performance details to see how the solution's performance scaled as PTChatterly added users. (PTChatterly provides this data for all three response time metrics: total response time, TTFT, and TPS.) PTChatterly also provides raw timing data for each user count, which lets you see whether any simulated users were major outliers in the quality of service the solution delivered to them.
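
As a rough illustration of how raw timings turn into the average, 5th percentile, and 95th percentile figures, the Python sketch below summarizes response times per user count. It is not PTChatterly's reporting code: the nearest-rank percentile method is an assumption (other tools interpolate between samples), the sample timings are made up for the example, and the `percentile` and `summarize` names are hypothetical.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need samples and 0 < pct <= 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(times_by_users: dict[int, list[float]]) -> dict[int, dict[str, float]]:
    """Average, 5th, and 95th percentile response time for each simulated user count."""
    return {
        users: {
            "average": mean(times),
            "p5": percentile(times, 5),
            "p95": percentile(times, 95),
        }
        for users, times in sorted(times_by_users.items())
    }


# Made-up timings in seconds: 20 questions at each of two user counts.
times_by_users = {
    1: [float(i) for i in range(1, 21)],
    2: [2.0 * i for i in range(1, 21)],
}
summary = summarize(times_by_users)
for users, stats in summary.items():
    print(users, stats)
```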

Though mostly technical folks will dig into this data, you can also get useful claims from it. For example, you might note the lowest possible total response time the system delivered with a minimum user count.

For those interested in hardware comparisons and performance details, PTChatterly also uses standard system performance-monitoring tools to collect a variety of key CPU, GPU (if present), and network statistics on the systems it tests.

What response time choices must I make when running PTChatterly, and what do they mean for users?

As we discussed in the previous section, when you start to run the PTChatterly service, you must make the following choices:

Response time threshold: The acceptable limit for the maximum time it can take the chatbot to respond to a question. As we noted above, our estimates of typical response time thresholds vary from one second to about three and a half minutes depending on the type of corpus and the use case.
Response time threshold percentile: The percentage of questions that must come in below the response time threshold for PTChatterly to consider the average response time acceptable. The PTChatterly default is 95 percent, meaning that the solution under test answered 95 percent of simulated questions within the response time threshold, while 5 percent of the simulated questions required more time to answer.
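
The two choices above combine into a single capacity figure. The Python sketch below shows one way that could work, under stated assumptions: it uses a nearest-rank percentile, walks the tested user counts in ascending order, and stops at the first count that misses the threshold. The timings are made up, and the function names are hypothetical rather than PTChatterly's own.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need samples and 0 < pct <= 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def max_supported_users(
    times_by_users: dict[int, list[float]], threshold_s: float, pct: float = 95.0
) -> int:
    """Largest tested user count whose pct-th percentile response time meets the threshold."""
    supported = 0
    for users in sorted(times_by_users):
        if percentile(times_by_users[users], pct) > threshold_s:
            break  # assumption: once a user count fails, larger counts are not considered
        supported = users
    return supported


def sample(typical: float, slowest: float) -> list[float]:
    """Made-up run of 20 questions: 19 typical response times and one slow outlier."""
    return [typical] * 19 + [slowest]


results = {
    8: sample(4.0, 6.0),
    16: sample(6.0, 9.0),
    32: sample(9.5, 14.0),
    64: sample(18.0, 30.0),
}
print(max_supported_users(results, threshold_s=10.0))
print(max_supported_users(results, threshold_s=5.0))
```

With these made-up numbers, a 10-second threshold at the 95th percentile yields 32 users, while tightening the threshold to 5 seconds drops the figure to 8.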

We are happy to offer guidance on what real-world options will work best to highlight your system's strengths.

Are there any privacy concerns?

The source for the default corpus of data for PTChatterly, which we based on an Airbnb data set, was public and open source at the time we downloaded it, meaning we do not have privacy concerns. As an extra step, we anonymized the host data.

For an additional charge, we can test using your own corpus of private data. If we do not already have a standing NDA with your organization, we are happy to sign one.

When I buy the service to test a single configuration, what do I get?

When you choose to have PT run PTChatterly on a single configuration, you get a complete PTChatterly result set on the number of users that configuration can support at a given response time on a given set of data. Those results will be the median of three test runs. (To ensure repeatability, at PT we generally execute three runs for any performance testing and report the median result.)

As with all PT testing, the results are confidential to you and PT. If you would like to publish the results, we’d be happy to discuss the options for doing so.

What is the pricing structure for PTChatterly?

The cost to receive a single publishable result on a single configuration is $20,000. This base cost assumes that the system in question uses AMD or Intel processors and that any GPUs are from AMD or NVIDIA. It also assumes that PTChatterly uses its default vector database, reranking service, embedding service, and semantic search method, as well as one of its default LLMs and corpuses.

We are happy to discuss switching any of these components for an additional cost.

Can we customize the benchmark parameters (number of simulated users, size of corpus, etc.)?

Yes. You can customize the following parameters at no cost:

Number of simulated users
Size of corpus (within some limitations)
Text-based large language model it should use (current options include Mistral, Llama 2, Llama 3.1, Llama 3.2, and Llama 3.3)
Response time threshold
Response time threshold percentile
Limit on answer size

Some of these parameters, such as response time threshold, define what acceptable performance means, while others, such as number of simulated users and limit on answer size, will affect performance itself. After we know the details of the configuration we'll be testing, we can recommend the parameters that we believe will yield optimal performance—which will differ depending on the configuration.

For an additional cost, we can also customize for you the corpus, the embedding service, the vector database, and/or the semantic-search option, or port the service to use a different large language model than the ones we support.

What benchmark parameters do you recommend?

The benchmark parameters we recommend for your use case will depend on the configuration we’re testing, your goals for the testing, and any particular parameters that you’ve told us are critical for your goals (e.g., testing with a specific LLM or a specific corpus size). We’ll work with you to craft a path that gives us the best chance of optimizing PTChatterly performance.

Can we provide our own corpus? If we do so, what are the privacy implications?

Yes, we can run PTChatterly using a corpus you provide. We’ll ensure that we manage your data securely and will not use it outside the bounds of the test. (And of course, if we do not already have a standing NDA with your organization, we are happy to sign one.) You must ensure that you provide only data you have permission to use.

Testing using your own corpus entails an additional cost.

May I buy the benchmark and do the testing myself?

No. Due to the complexity and the many moving parts, we are offering PTChatterly only as a service at this time.

That said, if you’d like to become a partner with full access to its components, we can discuss licensing pricing and restrictions.

Do you offer consulting for businesses setting up their own AI system?

Absolutely. If you're considering building an AI-assisted chatbot supplemented with in-house data and are deciding among multiple possible backing solutions, we can utilize PTChatterly to:

Quantify the performance of the solutions—whether on-premises or in the cloud—that you are considering
Correctly size the right solution, including the right configuration of CPUs, GPUs, storage, networking, and memory, for your corpus of data
Compare the performance of a potential new solution to that of an existing solution, if you need to know whether you can run your chatbot with the technologies you currently utilize

For an additional cost, we also offer consulting services that go beyond PTChatterly to help you decide what solutions will best suit your needs and evaluate them. Contact us to learn more.

How does PTChatterly work?

At its core, PTChatterly simulates multiple users simultaneously asking questions of a chatbot and receiving answers, which PTChatterly generates using the data in the corpus as a primary source.
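
The idea of simulated users holding simultaneous conversations can be sketched in a few lines. The actual harness is written in Go; the Python below is only an illustration of the concept, with a hypothetical `fake_chatbot` function that sleeps briefly in place of a real system under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_chatbot(question: str) -> str:
    """Hypothetical stand-in for the system under test; sleeps to mimic work."""
    time.sleep(0.01)
    return f"answer to: {question}"


def run_conversation(user_id: int, questions: list[str]) -> list[float]:
    """Ask the questions in order, as one user would, and time each full answer."""
    timings = []
    for question in questions:
        start = time.perf_counter()
        fake_chatbot(question)
        timings.append(time.perf_counter() - start)
    return timings


def simulate_users(user_count: int, questions: list[str]) -> list[float]:
    """Run one conversation per simulated user concurrently; return every response time."""
    with ThreadPoolExecutor(max_workers=user_count) as pool:
        per_user = pool.map(run_conversation, range(user_count), [questions] * user_count)
        return [t for timings in per_user for t in timings]


all_timings = simulate_users(4, ["Where is it?", "Is it pet friendly?", "How much is it?"])
print(f"{len(all_timings)} responses, slowest {max(all_timings):.3f} s")
```

Repeating the run at increasing user counts produces the per-count timing sets from which response time percentiles can be computed.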

PTChatterly uses a RAG architecture. As simulated users ask questions of the AI-assisted chatbot, the LLM uses its knowledge base and the data it obtains from the corpus to answer those user questions. The questions go through a broker service to an embedding model, which transforms the questions into a data format PTChatterly can work with. The question, in its new format, returns to the broker service, which performs a semantic search against the corpus, which we have stored in an efficient vector database. The search returns sections of the corpus related to the user’s question. (Optionally, at this stage the broker might route the data through a reranking service to reduce and summarize the answer; for our Airbnb corpus, this step is not necessary.)

The broker service then sends the original question, any previous conversations, and the matching knowledge from the corpus to the LLM. The LLM makes sense of the question and returns an answer to the user via the broker service.

PTChatterly measures the time it takes from the simulated user posting a question to the user receiving an answer, among other performance metrics.
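
The data flow described above can be sketched as a single broker function. This is a minimal Python illustration with stub services, not the actual broker (which is custom Go code): word sets stand in for embeddings, a list stands in for the vector database, and a string-formatting function stands in for the LLM. All names here are hypothetical.

```python
from typing import Callable, Optional

Passages = list[str]


def build_prompt(question: str, history: list[tuple[str, str]], passages: Passages) -> str:
    """Combine retrieved context, earlier turns, and the new question into one prompt."""
    lines = ["Context:"] + [f"- {p}" for p in passages]
    lines += [f"Earlier Q: {q} / A: {a}" for q, a in history]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def handle_question(
    question: str,
    history: list[tuple[str, str]],
    embed: Callable[[str], set[str]],
    search: Callable[[set[str]], Passages],
    generate: Callable[[str], str],
    rerank: Optional[Callable[[str, Passages], Passages]] = None,
) -> str:
    query_vector = embed(question)                 # 1. embedding service
    passages = search(query_vector)                # 2. semantic search of the corpus
    if rerank is not None:                         # 3. optional reranking step
        passages = rerank(question, passages)
    answer = generate(build_prompt(question, history, passages))  # 4. LLM
    history.append((question, answer))             # keep the turn for follow-up questions
    return answer


# Stub services for the demo (assumptions, not real components).
CORPUS = ["Listing 12 has a hot tub", "Listing 40 allows pets", "Listing 7 is near the airport"]


def stub_embed(text: str) -> set[str]:
    return set(text.lower().replace("?", "").split())


def stub_search(query_vector: set[str]) -> Passages:
    return [p for p in CORPUS if query_vector & stub_embed(p)]


def stub_rerank(question: str, passages: Passages) -> Passages:
    q = stub_embed(question)
    return sorted(passages, key=lambda p: len(q & stub_embed(p)), reverse=True)[:1]


def stub_llm(prompt: str) -> str:
    context = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    return "Stub answer from: " + "; ".join(context)


history: list[tuple[str, str]] = []
answer = handle_question(
    "Which listing allows pets?", history, stub_embed, stub_search, stub_llm, stub_rerank
)
print(answer)
```

Passing the services in as callables keeps the reranking step optional, which matches the note above that some corpuses do not need it.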

What are the system requirements for PTChatterly?

PTChatterly should work on a system if it uses AMD or Intel CPUs and, at this time, if any GPUs in the system are from relatively recent offerings from AMD or NVIDIA that we have tested. (The GPUs must support sufficiently advanced GPU computing instructions and have optimized GPU computing libraries available, so they can be neither too old nor the very newest on the market. We plan to expand support for additional CPU and/or GPU architectures in future versions.) This includes workstations and laptops as well as servers, multi-node clusters, and cloud instances, though initially we are focusing on single-node systems. In the case of clusters or cloud instances, each non-AI component of PTChatterly should run in a separate VM or instance.

In addition, ideally (and as a requirement when running PTChatterly on a cluster and/or in the cloud), you should also have a client VM or system on the same network so that the chatbot client, which simulates users, does not run on the SUT.

Because PTChatterly simulates a real-world, in-house, AI-assisted chatbot, we recommend running it only on systems that you are seriously considering for that use case. (You likely would not back your chatbot with an old laptop without a GPU, for example, and if you ran PTChatterly on such a system, we anticipate that the results would support that choice.)

What corpus does PTChatterly currently use and why?

The current single-system version of PTChatterly uses a text-only corpus of Airbnb rentals. The corpus includes details about home listings and reviews from customers, from which we scrubbed any obvious personal information (e.g., host names) before ingesting. We selected this corpus because it is more than a simple table of data, such as you might find in a basic SQL database; it also had a non-trivial, nested, multilayered schema.

The multimodal version of PTChatterly, which is currently in development, uses a corpus of past Principled Technologies test and research reports. This corpus will include the text of the reports, charts and graphs created by PT designers, and photographs taken by PT photographers.

We chose these corpuses of data in part because they do not present licensing or privacy concerns. At the time we downloaded it several years ago, the Airbnb corpus was free and open source, and the text, charts, graphs, and photographs in the PT corpus are our intellectual property.

If you’d like to bring your own corpus of data to PTChatterly, we would be happy to use it in your testing for an additional cost and will keep your data secure. This work will take more than the standard turnaround time.

Can you benchmark other models besides Llama 3 using PTChatterly?

Yes. In addition to multiple versions of Llama 3 (3.1, 3.2, or 3.3), at no extra cost, we can benchmark Mistral, DeepSeek, or Llama 2 using PTChatterly.

If you’d like to use a different model, we would be happy to extend PTChatterly to support that model at an additional cost. We will be adding support for more LLMs over time.

Last updated 2/10/25

Dell PowerEdge XE9680 servers with AMD Instinct MI300X Accelerators: the power to host GenAI with Llama 3.1 405B LLMs

Run your in-house AI chatbot on an AMD EPYC 9534 processor-powered Dell PowerEdge R6615 server

Running your in-house chatbot using Llama 3.1 405B LLMs on Dell PowerEdge XE9680 servers with NVIDIA H100 GPUs