(Or, How We Learned to Stop Worrying and Love the Cloud)

Have you ever wondered how a cloud-based, multi-agent AI service handles a flurry of incoming user requests without spontaneously combusting? Well, wonder no more! In this blog post, we’ll embark on a fun, light-hearted journey through a simulation that tests just how gracefully an AI agentic system (think: a set of AI agents working together) scales, splits, and handles all sorts of query traffic. We’ll talk about our framework, assumptions, code, and results—complete with a snazzy chart to show off our data.

The Task at Hand

Imagine you’re running a magical multi-agent AI system in the cloud. Users keep sending in queries, each with a certain number of tokens, and your AI agents must figure out how to handle them. Sometimes they split requests that require custom processing across sub-agents; sometimes they drop requests when too many arrive at once. We want to figure out how many requests get handled directly, split, or dropped, how many tokens flow through the system, what it all costs, and how many cluster nodes we need to keep up.

In other words, we want to simulate the day-to-day life of our AI agentic system and see what happens when the going gets tough—or at least, when the queries keep rolling in.

Our Simulation Framework

For our simulation, we put on our Python hats and set up a playful environment. We used SimPy, a discrete-event simulation framework built on standard Python (there’s a minimal sketch a little further below). Our main steps looked a bit like this:

  1. Generate incoming requests: We produce a batch of user queries with random sizes (in tokens).
  2. Route the requests: Some requests are handled directly by the agent that receives them; others are split across multiple sub-agents to simulate more complex queries.
  3. Cost modeling: We keep track of how many tokens each request uses, then calculate LLM (or multi-agent) costs and AKS (Azure Kubernetes Service) costs based on usage—both in dollars.
  4. Statistics: We record everything—number of direct requests, split requests, dropped requests, total tokens, and costs.
  5. Repeat: We run this entire simulation multiple times (100 runs, in our case) to get nice summary statistics.

This framework gives us a robust way to see how our hypothetical cloud-based AI system behaves under a variety of random conditions.
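
To make that flow a bit more concrete, here is a minimal SimPy sketch of the request loop. It is not the notebook’s code: the arrival rate, token sizes, split threshold, agent pool size, and queue limit below are all illustrative assumptions.

import random
import simpy

ARRIVAL_RATE = 10        # assumed: mean requests per time unit
MEAN_TOKENS = 110        # assumed: mean tokens per user query
SPLIT_THRESHOLD = 80     # assumed: queries above this get split across sub-agents
AGENT_POOL_SIZE = 8      # assumed: concurrent agent "slots"
MAX_QUEUE = 50           # assumed: beyond this backlog, new requests are dropped

stats = {"direct": 0, "split": 0, "dropped": 0, "tokens": 0}

def handle_request(env, agents, tokens):
    """One request: grab an agent slot and 'work' for a token-proportional time."""
    with agents.request() as slot:
        yield slot
        yield env.timeout(tokens / 1000)

def request_source(env, agents):
    """Generate requests with exponential inter-arrival times and random sizes."""
    while True:
        yield env.timeout(random.expovariate(ARRIVAL_RATE))
        tokens = max(1, int(random.expovariate(1 / MEAN_TOKENS)))
        stats["tokens"] += tokens
        if len(agents.queue) >= MAX_QUEUE:
            stats["dropped"] += 1                 # too much backlog: drop it
        elif tokens > SPLIT_THRESHOLD:
            stats["split"] += 1                   # fan out to sub-agents
            for _ in range(3):                    # assumed: split into 3 sub-requests
                env.process(handle_request(env, agents, tokens // 3))
        else:
            stats["direct"] += 1
            env.process(handle_request(env, agents, tokens))

env = simpy.Environment()
agents = simpy.Resource(env, capacity=AGENT_POOL_SIZE)
env.process(request_source(env, agents))
env.run(until=100)
print(stats)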

Assumptions & Numbers

Now for the fun part—how did we decide which numbers to plug in?

Our goal was not to create a perfectly realistic production environment, but rather to highlight the trade-offs and behaviors that might emerge under a plausible set of assumptions about request volumes, token counts, and pricing.
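
To give a feel for the knobs involved, here is the kind of parameter set such a simulation needs. Every value below is a placeholder we made up for illustration (a couple are back-of-the-envelope figures consistent with the results further down); they are not the numbers actually used in the notebook.

ASSUMPTIONS = {
    "runs": 100,                       # from the post: 100 simulation runs
    "mean_user_query_tokens": 110,     # placeholder average user-query size
    "split_probability": 0.80,         # placeholder, in line with the ~80% split rate seen later
    "drop_queue_threshold": 50,        # placeholder backlog at which requests get dropped
    "llm_price_per_1k_tokens": 0.003,  # placeholder blended LLM rate, USD
    "aks_price_per_node_hour": 0.10,   # placeholder node rate, USD
    "min_nodes": 2,                    # placeholder autoscaler floor
    "max_nodes": 11,                   # cap matching the max_nodes row in the results
}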

A Quick Peek at the Code

We won’t overwhelm you with all the gritty details, but here’s the general shape of our Python code:

import numpy as np
import pandas as pd
from tqdm import trange

NUM_RUNS = 100
results = []

for run in trange(NUM_RUNS):
    # 1. Generate requests
    user_queries = np.random.poisson(lam=5000, size=some_size)  # just an example; some_size = how many queries arrive this run

    # 2. Route requests
    direct_requests = 0
    split_requests = 0
    dropped_requests = 0
    total_tokens = 0
    
    # ... Some logic for splitting or dropping requests ...
    
    # 3. Track costs (in dollars)
    llm_cost = ...   # e.g. total tokens * price per 1K tokens
    aks_cost = ...   # e.g. node-hours * price per node-hour
    total_cost = llm_cost + aks_cost

    # 4. Append results for each run
    results.append({
        'direct_requests': direct_requests,
        'split_requests': split_requests,
        'dropped_requests': dropped_requests,
        'total_requests': direct_requests + split_requests,
        'llm_cost': llm_cost,
        'aks_cost': aks_cost,
        'total_cost': total_cost,
        # ... other stats ...
    })

# Convert results to a DataFrame and compute summary statistics
df = pd.DataFrame(results)
summary = df.describe()
print(summary)

Essentially:

  1. We generate a batch of requests with random sizes.
  2. We decide whether to direct them, split them, or drop them.
  3. We calculate how many tokens went through and the costs in dollars (there’s a cost sketch right after this list).
  4. We collect all these stats in a DataFrame and spit out summary statistics.
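
As a concrete (though hypothetical) version of step 3, here is how a simple cost model could look. The per-token and per-node-hour prices are assumptions picked for illustration; you would substitute your actual LLM and AKS rates.

LLM_PRICE_PER_1K_TOKENS = 0.003   # assumed blended $/1K tokens (prompt + completion)
AKS_PRICE_PER_NODE_HOUR = 0.10    # assumed $/node-hour for the cluster VMs

def llm_cost_usd(total_llm_tokens: int) -> float:
    """LLM spend scales linearly with the total tokens pushed through the models."""
    return total_llm_tokens / 1000 * LLM_PRICE_PER_1K_TOKENS

def aks_cost_usd(node_hours: float) -> float:
    """AKS spend scales with how many node-hours the autoscaler provisioned."""
    return node_hours * AKS_PRICE_PER_NODE_HOUR

# Round-number example: 140M total LLM tokens and 8 nodes running for a day
print(llm_cost_usd(140_000_000))   # 420.0
print(aks_cost_usd(8 * 24))        # 19.2 -- entirely dependent on the assumed rate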

You can find the notebook here.

Results: The Grand Reveal

After 100 simulation runs, we got some neat summary statistics:

Metric                Mean        Std       Min         Max
direct_requests       6479        85.76     6279        6641
split_requests        25931       157       25462       26347
dropped_requests      591         38.74     506         696
total_requests        32410       182       31945       32823
percent_split (%)     80.01       0.23      79.46       80.60
user_query_tokens     3561909     53503     3445529     3669273
system_query_tokens   95085770    517611    93430542    96296984
llm_response_tokens   42782889    225980    42077728    43267966
total_llm_tokens      141430570   743063    139073264   143077558
llm_cost (USD)        424         2.23      417         429
aks_cost (USD)        858         2.32      852         864
total_cost (USD)      1283        4.35      1269        1293
max_nodes             11.00       0.00      11.00       11.00
avg_nodes             8.01        0.02      7.95        8.07

What does all this mean?

In short: roughly 80% of incoming requests end up split across sub-agents, under 2% get dropped, and a typical run costs about $1,283, with the AKS cluster (~$858) accounting for more of the bill than the LLM calls (~$424). The cluster hovers around 8 nodes on average and tops out at 11.
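
If you’re curious how rows like percent_split come about, here is a small sketch that derives them from the per-run counters collected earlier (it assumes the results list and DataFrame from the code above). The drop_rate column is something we add purely for illustration; it isn’t in the table.

import pandas as pd

df = pd.DataFrame(results)  # as built in the snippet above
df["total_requests"] = df["direct_requests"] + df["split_requests"]
df["percent_split"] = 100 * df["split_requests"] / df["total_requests"]
df["drop_rate"] = 100 * df["dropped_requests"] / (df["total_requests"] + df["dropped_requests"])

print(df[["percent_split", "drop_rate", "total_cost"]].agg(["mean", "std", "min", "max"]))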

And here’s the visual side of things (see the chart below). It shows the request load, the number of failed requests, the AKS node count, and a PCA-based correlation curve over time.

From the chart, we can see that whenever the load spikes (green line), the AKS node usage (blue line) tends to respond. The failed requests (orange line) stay relatively low, indicating our scaling strategy handles most spikes well. The PCA correlation curve (gray dotted line) helps confirm which periods show the strongest relationships among these metrics.

[Chart: request load, failed requests, AKS node count, and PCA correlation over time]
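
If you are wondering what kind of scaling rule produces that pattern, here is a minimal sketch of a threshold-based policy. The per-node capacity and the node floor are assumptions; only the cap of 11 nodes comes from the results table, and a real cluster autoscaler is of course more involved.

import math

MIN_NODES = 2            # assumed floor
MAX_NODES = 11           # matches the max_nodes row in the results
REQS_PER_NODE = 500      # assumed: in-flight requests one node can absorb

def target_nodes(in_flight_requests: int) -> int:
    """Scale the node pool proportionally to load, clamped to [MIN_NODES, MAX_NODES]."""
    wanted = math.ceil(in_flight_requests / REQS_PER_NODE)
    return max(MIN_NODES, min(MAX_NODES, wanted))

for load in (300, 2000, 4000, 9000):
    print(load, "->", target_nodes(load), "nodes")
# Once load pushes past MAX_NODES * REQS_PER_NODE, extra requests start to fail,
# which is why the failed-request line only ticks up during the sharpest spikes.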

Wrapping Up (and a Question for You!)

That’s it, folks! Our little simulation shows how a cloud-based agentic AI system might behave under varying loads, token sizes, and cost constraints. We made some assumptions about how queries are split, what gets dropped, and how costs scale—but in reality, you’d tweak these parameters to match your production environment.

Question for you: If you’re running a similar system of AI agents, do your numbers correlate with ours? Maybe you see a different ratio of split vs. direct requests, or maybe your costs behave differently. Let us know how our model compares to your real-world experiences!

We hope this gave you a fun peek into the world of AI simulation and resource planning. If you’re ever curious about whether your next big idea might break the bank (or break your cluster), a quick simulation like this can give you some peace of mind—or at least a direction for more fine-grained modeling!

Thanks for joining us on this cloud-surfing, token-crunching, agent-wrangling adventure. Until next time, happy simulating!