The Foundation: Little's Law

L = λ × W

The entire simulation is anchored by Little's Law — the fundamental theorem of queuing theory. It guarantees that for any stable system, the average concurrency (L) equals throughput (λ) times latency (W).

L
Concurrency

In-flight requests active at any instant — bounded by your connection pool.

λ
Throughput

Successfully served requests per second after dropping errors.

W
Latency

Full residence time: queuing wait + service time across every hop.

Practical intuition: Little's Law is stated in averages, so if your average latency doubles (W↑), your effective throughput (λ) must halve for a fixed thread pool (L). This is why a single slow database query can starve your entire web tier of connections.

Saturation Formula

ρ = (RPS × Latency_ms / 1000) / MaxConnections

When ρ > 1.0 the node is over-saturated. The engine adds a queuing penalty of (ρ − 1) × 25 ms, i.e. 25 ms per unit of overflow. If ρ > 2.5, each request has a 40% chance of being dropped (simulating connection queue exhaustion).
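
As a rough illustration, the saturation check can be sketched in Python as follows. The function names are hypothetical and not taken from the engine; only the constants (25 ms per unit of overflow, the 2.5 threshold, the 40% drop chance) come from the text above.

```python
def saturation(rps: float, latency_ms: float, max_connections: int) -> float:
    """rho = in-flight requests (Little's Law) divided by the connection pool size."""
    in_flight = rps * latency_ms / 1000.0  # L = lambda * W
    return in_flight / max_connections


def queuing_penalty_ms(rho: float) -> float:
    """Extra latency: 25 ms per unit of overflow above rho = 1.0."""
    return max(0.0, rho - 1.0) * 25.0


def drop_probability(rho: float) -> float:
    """40% drop chance once rho exceeds 2.5, otherwise none."""
    return 0.4 if rho > 2.5 else 0.0
```

For example, 2 000 RPS at 100 ms against a 100-connection pool keeps 200 requests in flight, so ρ = 2.0 and the penalty is 25 ms.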

Nested Private Services

In a Project Workspace you can drop one private service inside another as a Microservice node. The engine does not replay a frozen benchmark snapshot for that node. Instead, before sampling begins, it simulates the nested service's entire internal architecture against the exact RPS it receives in the current run, then feeds the resulting latency, error rate, and saturation back into the parent simulation.

How a Microservice node is resolved

effRPS = systemRPS × trafficWeight(node)
innerResult = simulate(service.canvas, effRPS) ← full recursive run
node.latency = innerResult.p50
node.errorRate = innerResult.errorRatePct
node.saturation = peak internal node saturation

Because the nested error rate drives both the node and the request path, per-node and overall metrics always agree. A service that is healthy at 100 RPS but collapses at 100k RPS will show exactly that — its internal Server/DB tiers saturate under the real load instead of reporting infinite capacity. Recursion is depth-guarded (up to 4 levels) so deeply composed systems resolve without runaway loops.

Each Microservice node also exposes a Request Absorption Rate. When a service references another service internally, this rate is the probability that an inbound request cascades to the downstream service — modelling caching, early returns, or validation that "absorbs" traffic at the edge.

Example: if user-service sits at the project root and is also nested inside both payment-service and lifestyle-service with an absorption rate of 1.0, it accumulates traffic from all three paths — direct plus both cascades — rather than only its direct share.

Network Physics

Every hop between nodes pays a network cost before any component logic runs. The cost depends on whether the nodes share a VPC and region.

Same VPC: 0.5 ms. Private subnet routing, essentially free.
Same region, different VPC: 3–5 ms. VPC peering or transit gateway.
Cross-region: 60–100 ms. Backbone transit over geographic distances.
Public Internet (no VPC): 15–25 ms. At least one node outside a private network.
HTTPS / TLS: +3 ms. Handshake + encryption overhead on top of network.
mTLS (strict): +2 ms. Mutual certificate exchange (service mesh).
gRPC / protobuf: −30%. Binary serialization reduces payload size, giving 30% lower transmission latency.

Server / VM

CPU performance degrades exponentially as effective RPS per core approaches a saturation threshold (~800 RPS/vCPU). Low RAM adds an additional multiplier from page-fault pressure.

nodeLat += BASE(5ms) × 1.5^(effRPS / (vCPU × 800)) × ramPenalty
CPU Safety Limit

800 RPS per vCPU is the stable zone. Beyond it, context-switching overhead causes non-linear latency spikes.

RAM Penalty

Applies only when RAM < 1 GB: penalty = 1.5 / ram. At 0.5 GB → 3× multiplier; at 0.25 GB → 6×.

cpu (vCPU count): Sets the per-core RPS ceiling. More cores = linear scaling until memory is the bottleneck.
ram (GB): Below 1 GB RAM triggers the page-fault penalty multiplier.
conns (max connections): Connection pool size. Controls saturation threshold via Little's Law.
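
A minimal Python sketch of the Server/VM formula, assuming the 5 ms base and 800 RPS/vCPU threshold quoted above. The function and parameter names are illustrative, not the engine's own.

```python
def ram_penalty(ram_gb: float) -> float:
    """Page-fault multiplier: 1.5 / ram below 1 GB, otherwise no penalty."""
    return 1.5 / ram_gb if ram_gb < 1.0 else 1.0


def server_latency_ms(eff_rps: float, vcpu: float, ram_gb: float,
                      base_ms: float = 5.0, rps_per_core: float = 800.0) -> float:
    """Latency added by the node: BASE x 1.5^(load per core) x ramPenalty."""
    load = eff_rps / (vcpu * rps_per_core)
    return base_ms * (1.5 ** load) * ram_penalty(ram_gb)
```

At exactly 800 RPS on one vCPU with 2 GB of RAM the node adds 5 × 1.5 = 7.5 ms; an idle node with 0.5 GB still pays the 3× RAM multiplier and adds 15 ms.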

Container

Uses the same exponential CPU model as Server/VM. Containers typically run with fractional CPU shares and low RAM, so the RAM penalty is almost always active at default values (0.5 GB → 3× multiplier).

nodeLat += BASE(5ms) × 1.5^(effRPS / (cpuShares × 800)) × ramPenalty
cpu (CPU shares, 0.1–8): Fractional CPU allocation. 0.5 = half a vCPU.
ram (GB, 0.1–16): Almost always below 1 GB, so the RAM penalty is nearly always active.

Serverless Function

Every invocation incurs a cold-start penalty. Higher memory allocation gives proportionally more CPU, so 1 GB functions start roughly 2.8× faster than 128 MB functions.

coldStart = (100ms + rand(0–500ms)) / √(memory_MB / 128)
Cold Start Range

100–600 ms at 128 MB. Scales down inversely with the square root of memory allocation.

Hard Timeout

If cumulative path latency exceeds timeout × 1000 ms, the request is dropped as a 504 error.

memory (MB, 128–10240): Higher memory = more CPU = faster cold starts and execution.
timeout (seconds, 1–900): Absolute ceiling on total path latency. Exceeded → dropped request.
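
The cold-start formula and the hard timeout can be sketched as below. This is only an illustration: the function names are made up, and the optional `jitter_ms` argument is an addition so the random term can be pinned when reasoning about the bounds.

```python
import math
import random


def cold_start_ms(memory_mb, jitter_ms=None):
    """(100 ms + rand(0-500 ms)) / sqrt(memory_MB / 128)."""
    if jitter_ms is None:
        jitter_ms = random.uniform(0.0, 500.0)  # random part of the cold start
    return (100.0 + jitter_ms) / math.sqrt(memory_mb / 128.0)


def is_timed_out(path_latency_ms, timeout_s):
    """A request is dropped (504) once cumulative latency exceeds the timeout."""
    return path_latency_ms > timeout_s * 1000.0
```

With the jitter pinned to 0 and 500 ms, a 128 MB function lands on the 100 ms and 600 ms ends of the stated range; at 1 024 MB the same draw is divided by √8 ≈ 2.8.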

WebSocket Server

Long-lived connections have very low per-message overhead. The main constraint is the total concurrent connection ceiling — once saturated, new handshakes are rejected. Heartbeat adds 0.5 ms overhead only when heartbeatInterval > 0 (i.e. actively configured). Setting it to 0 disables ping/pong entirely.

nodeLat += 1ms + (heartbeatInterval > 0 ? 0.5ms : 0)

Utilisation uses Little's Law directly: concurrent connections ≈ throughput × avg session (30 s). At 1 000 new connections/s with a 30 s session lifetime → 30 000 concurrent, on a 50 000-conn server → 60% load.

conns (max connections): Total simultaneous WebSocket clients. Saturation = dropped handshakes.
heartbeatInterval (seconds, 0 = disabled): Set to 0 to disable ping/pong. Any positive value adds 0.5 ms per-message overhead.

Service Worker

Models a background worker pool — Node.js worker_threads, Python Celery, Go goroutine pool, Java ExecutorService. Each worker processes one job at a time with a deterministic service time. The engine uses the M/D/1 Pollaczek–Khinchine (P-K) formula: Poisson arrivals, deterministic (fixed) service time — the most accurate closed-form model for a fixed thread pool.

serviceRate = workerCount × (1 000 / jobDurationMs) [jobs/s]
ρ = effRPS / serviceRate
if ρ < 1 (stable): Wq = ρ × jobDuration / (2 × (1 − ρ)) ← P-K mean wait
if ρ ≥ 1 (overloaded): Wq = (ρ − 1) × jobDuration × 2 ← linear backlog divergence
nodeLat += jobDurationMs + Wq
if backlog > queueDepth → DROP (proportional probability)

4 workers, 100ms job, 100 RPS: ρ = 100 / (4×10) = 2.5 → 100 + (1.5×100×2) = 400ms. Severely overloaded, queue explodes.
4 workers, 100ms job, 30 RPS: ρ = 30 / 40 = 0.75 → 100 + (0.75×100/(2×0.25)) = 250ms. Healthy but queue building.
8 workers, 50ms job, 100 RPS: ρ = 100 / 160 = 0.625 → 50 + (0.625×50/(2×0.375)) ≈ 92ms. Comfortable, low wait.
Memory ceiling: If workerCount × memoryPerWorkerMb exceeds 16 GB (assumed host), the engine silently clamps effectiveWorkers to fit. At 64 MB/worker the ceiling is 256 workers; at 512 MB/worker only 32 workers run regardless of config.
workerCount (1–1024): Thread / process pool size. Directly sets serviceRate ceiling.
jobDurationMs (ms, 1–60 000): Average job execution time. Lower = higher throughput per worker. Sets the D in M/D/1.
queueDepth (1–100 000): Max backlog before jobs are dropped. Drop probability scales with how far backlog exceeds this.
retryEnabled (toggle): On failure, 50% of jobs are retried once. Clears transient errors but adds one jobDuration of latency.
memoryPerWorkerMb (MB, 16–8 192): Heap per worker. Clamps effective worker count if total exceeds 16 GB host memory.
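
The Service Worker latency model and the memory clamp can be sketched in Python as follows. The names are hypothetical, and queue-depth drops and retries are deliberately left out of this sketch.

```python
def service_worker_latency_ms(eff_rps, worker_count, job_duration_ms):
    """Job time plus queueing wait: P-K (M/D/1) below saturation, linear above."""
    service_rate = worker_count * (1000.0 / job_duration_ms)  # jobs per second
    rho = eff_rps / service_rate
    if rho < 1.0:
        wq = rho * job_duration_ms / (2.0 * (1.0 - rho))  # P-K mean wait
    else:
        wq = (rho - 1.0) * job_duration_ms * 2.0  # linear backlog divergence
    return job_duration_ms + wq


def effective_workers(worker_count, memory_per_worker_mb, host_mb=16 * 1024):
    """Clamp the pool so total worker memory fits the assumed 16 GB host."""
    return min(worker_count, host_mb // memory_per_worker_mb)
```

Calling `service_worker_latency_ms(100, 4, 100)` gives 400 ms and `service_worker_latency_ms(30, 4, 100)` gives 250 ms, in line with the worked examples for this node.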

Container Orchestrator (K8s / ECS)

Models a cluster of worker nodes managed by a control plane. Throughput capacity scales linearly with node count, CPU per node, and overcommit ratio. It simulates scheduling overhead and optional autoscaling latency.

capacity = nodeCount × cpuPerNode × 800 × overcommitRatio
util = effRPS / capacity
nodeLat += 5ms + (util > 0.8 ? (util − 0.8) × 50ms : 0)
Overcommit: In cloud environments, CPU is often over-provisioned. A ratio of 2.0 means the orchestrator expects to handle 1600 RPS per physical core by assuming not all pods are peaking simultaneously.
nodeCount (nodes): Number of worker nodes in the cluster.
cpuPerNode (cores): Compute capacity per node.
overcommitRatio (multiplier): CPU over-provisioning factor. Higher = more capacity but higher saturation risk.
autoScaling (toggle): When enabled, high utilization triggers a 10% pod startup latency penalty.

GraphQL Server

GraphQL introduces a parsing and resolution layer. Performance is sensitive to query depth and the number of underlying resolvers invoked per request.

resolverLat = resolversPerQuery × (dataLoaderBatching ? 1.2ms : 2.0ms)
nodeLat += 5ms (parsing) + resolverLat + (maxQueryDepth × 0.5ms)
DataLoader Batching

Reduces resolver latency by 40% by collapsing multiple downstream N+1 requests into single batches.

Query Caching

If queryCacheRatio hits, the entire resolution is skipped, returning in a fixed 2ms.

resolversPerQuery (count): Average number of data fetchers triggered per root query.
maxQueryDepth (layers): Complexity ceiling. Deeper queries add recursive overhead.
queryCacheRatio (0–1): Probability of a whole-query result being served from memory.

Microservice (Nested Service)

A Microservice node represents another private service composed inside this one. It is a black box at this level — but not a static one. The engine recursively simulates the referenced service's full internal canvas at the effective RPS this node receives, so its latency, error rate, and saturation always reflect the real load. See Nested Private Services for the full model.

effRPS = systemRPS × trafficWeight
inner = simulate(service.canvas, effRPS)
nodeLat += inner.p50
drop ~ inner.errorRatePct
Request Absorption Rate: when this service internally calls another service, the absorption rate is the probability a request cascades downstream. 1.0 = every request propagates; 0.2 = 80% is absorbed at the edge (cache hits, early returns, validation drops).
hitRatio (0–1, Request Absorption Rate): Fraction of inbound traffic that cascades to internally-referenced services. Lower = more traffic absorbed before reaching downstream services.

API Gateway

The gateway adds a fixed 2 ms routing overhead and enforces rate limits. Requests above the rate limit are immediately dropped and counted as errors.

if (rateLimit > 0 && effRPS > rateLimit) → DROP (error)
else nodeLat += 2ms
rateLimit (req/s, 0 = disabled): Hard ceiling on accepted RPS. Overflow → error rate increase.
protocol (http | https): HTTPS adds +3 ms TLS overhead via the global network physics layer.

Load Balancer

The LB routes each request to exactly one downstream using a weighted random selection. The routing weight depends on the algorithm configured.

round-robin: weight 1 (equal). Each backend gets equal share regardless of capacity.

weighted-round-robin: explicit edge weight, fallback → target CPU cores. If you set a weight on the connection edge, that takes precedence. Otherwise the target's CPU count is used automatically.

weighted-least-conn: same as weighted-round-robin. Same weight logic. Approximates least-connections via capacity.

resource-based: cpu + (ram / 4). Combines CPU and RAM to compute a composite capacity weight.

ip-hash / random: weight 1 (equal). Equal probability per backend; hash affinity is approximated as uniform distribution.

Auto-weight example: Load balancer with weighted-round-robin, one target is 32-core "High" and another is 2-core "Standard". No edge weights set → weights become 32 and 2 → High receives 94% of traffic, Standard receives 6%.
Cache-path behaviour: When an LB has a caching connection (isCaching) to a cache node (Redis, Memcached), it acts as a cache-first router. On a cache hit the request stops at the cache. On a miss the LB routes to exactly one backend using the configured algorithm, not to all backends simultaneously.
algorithm (round-robin | weighted-* | resource-based | ip-hash | random): Routing strategy. See algorithm cards above.
conns (max connections, 1 000–1 000 000): Connection pool ceiling. Default 100 000. At high RPS, a low value causes saturation errors; real LBs handle hundreds of thousands of concurrent connections.
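
One way to picture the weighted random selection is the Python sketch below. The helper names are hypothetical and the engine's actual routing code may differ; `random.choices` from the standard library stands in for the weighted pick.

```python
import random


def routing_shares(weights):
    """Expected fraction of traffic each backend receives, given its weight."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def pick_backend(weights, rng=random):
    """Route one request to exactly one backend by weighted random selection."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def resource_based_weight(cpu, ram_gb):
    """Composite weight used by the resource-based algorithm: cpu + ram / 4."""
    return cpu + ram_gb / 4
```

With weights `{"high": 32, "standard": 2}`, `routing_shares` returns roughly 0.94 and 0.06, the split described in the auto-weight example.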

WAF / Firewall

Deep packet inspection adds latency proportional to the number of active rules. Shallow mode skips payload analysis.

rulesPenalty = log₁₀(rulesCount) × 2ms
nodeLat += (deep ? 8ms : 2ms) + rulesPenalty
Shallow Mode

Header-only inspection. Base +2 ms + rules overhead. Faster but misses payload-level attacks.

Deep Inspection

Full packet reassembly and payload matching. Base +8 ms + rules overhead.

Example: 100 active rules → log₁₀(100) × 2 = 4 ms. Deep mode total: 8 + 4 = 12 ms per hop. 5 000 rules → log₁₀(5000) × 2 ≈ 7.4 ms penalty.

rulesCount (10–5000): Active WAF rules. Logarithmic overhead (doubling rules does not double latency).
inspectionMode (shallow | deep): Shallow = +2 ms base. Deep = +8 ms base. Both add the rules penalty on top.
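
The WAF cost is small enough to check by hand; a minimal Python sketch with a hypothetical function name:

```python
import math


def waf_latency_ms(rules_count, deep):
    """Base cost (8 ms deep, 2 ms shallow) plus log10(rules) x 2 ms."""
    base = 8.0 if deep else 2.0
    return base + math.log10(rules_count) * 2.0
```

`waf_latency_ms(100, deep=True)` gives 12 ms, and going from 100 to 5 000 rules adds only about 3.4 ms more, which is the point of the logarithmic penalty.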

DNS Resolver

DNS resolution latency is modelled with a probabilistic cache-hit using a hyperbolic saturation curve. TTL=30 s gives a 50% hit rate; TTL=300 s gives ~91%; TTL=3 600 s gives ~99.2%. This reflects real resolver behaviour better than a linear cap — most of the caching benefit is captured in the first few minutes of TTL.

hitProbability = ttl / (ttl + 30)
if (hit) nodeLat += 0.5ms (local cache)
else nodeLat += 5ms + rand(0–45ms) (upstream lookup)

TTL = 0s: 5–50 ms (27.5 ms avg). No caching, every request hits the upstream resolver.

TTL = 30s: ~14 ms avg. Short TTL, half of traffic hits the resolver (0.5 × 0.5 ms + 0.5 × 27.5 ms).

TTL = 300s: ~3 ms avg. Default, most traffic served from cache (0.91 × 0.5 ms + 0.09 × 27.5 ms).

ttl (seconds, 0–86400): Record time-to-live. Cache hit probability = ttl / (ttl + 30). TTL=0 forces every request upstream.
failoverEnabled (toggle): When enabled, a failed primary DNS target retries an alternate route (+5ms failover penalty).
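
To see what the hit-probability curve implies on average, the sketch below computes the mean latency from the formula above. It assumes the 0–45 ms jitter is uniform, so a miss costs 5 + 22.5 = 27.5 ms on average; the function names are illustrative only.

```python
def dns_hit_probability(ttl_s):
    """Hyperbolic saturation curve: ttl / (ttl + 30)."""
    return ttl_s / (ttl_s + 30.0)


def dns_expected_latency_ms(ttl_s):
    """Mean latency: hits cost 0.5 ms, misses cost 5 ms plus uniform 0-45 ms."""
    p_hit = dns_hit_probability(ttl_s)
    miss_avg_ms = 5.0 + 45.0 / 2.0  # assumed uniform jitter, so 27.5 ms mean
    return p_hit * 0.5 + (1.0 - p_hit) * miss_avg_ms
```

Because the miss cost is fixed, the mean falls in step with the miss rate: each extra second of TTL matters most while the TTL is still short.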

CDN

On every request the CDN rolls a cache-hit against the configured ratio. Hits are served from the nearest edge PoP (fast); misses add a 75 ms origin-pull penalty on top of the 5 ms edge base, giving ~80 ms total on miss.

nodeLat += 5ms (edge base always)
if (rand() ≥ cacheHitRatio) nodeLat += 75ms (origin pull)
cacheHitRatio (0–1 slider): Probability of serving from edge. 0.8 = 80% pay only 5 ms; 20% pay 80 ms.
ttl (seconds): Cache object lifetime. Not directly used in latency; it informs what hit ratio to configure.

External Service (SaaS / API)

Models 3rd-party dependencies (Twilio, Stripe, Auth0). These services have their own latency distributions, error rates, and throughput limits outside your direct control.

baseLat = rand() > 0.9 ? p99Ms : p50Ms
if (effRPS > throughputRps) nodeLat += 100ms (congestion penalty)
Circuit Breaker

If error rate exceeds threshold, the CB opens and fails all requests immediately for a 30s cooldown, protecting the rest of your system.

Jitter Distribution

Simulated as a bifurcated distribution: 90% of requests hit p50; 10% hit p99 tail latency.

throughputRps (req/s): The SLA ceiling. Exceeding this triggers 100ms congestion latency or 30% random drops.
latencyP50Ms (ms): Median latency. Typical duration for 9 out of 10 requests.
latencyP99Ms (ms): Tail latency. Applied to 10% of requests to model network jitter or cold paths.
errorRatePct (%): Baseline probability of the external service returning an error (e.g. 5xx).

SQL Database

SQL performance is bound by IOPS (disk I/O throughput). Partitioning doubles effective IOPS. RAM above 8 GB unlocks a 1.5× buffer-pool bonus (hot pages served from memory). Sharding multiplies total capacity linearly across shard nodes. The utilisation metric uses the same effective IOPS formula as latency — partitioning and RAM bonuses are reflected in both so the load gauge is always consistent.

effectiveIOPS = iops × (partitioning ≠ none ? 2 : 1) × (ram > 8 ? 1.5 : 1) × (sharding ? shardCount : 1)
nodeLat += (effRPS / effectiveIOPS) × 20ms
iops (100–50 000): Disk I/O operations/sec. Primary bottleneck metric for SQL.
ram (GB, 1–256): Buffer pool size. >8 GB unlocks 1.5× effective IOPS via hot-page caching.
conns (max connections): Connection pool. Default 100, which matches PostgreSQL's default max_connections (MySQL's default is 151).
partitioning (none | range | hash): Range/hash partitioning doubles effective IOPS and halves utilisation at the same RPS.
sharding (toggle): Enables horizontal sharding. Multiplies effective IOPS by shard count.
shardCount (2–256, shown when sharding on): Number of shards. Each shard independently handles effRPS / shardCount queries.
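
The effective-IOPS formula composes three independent multipliers. A minimal Python sketch, with hypothetical function names and default arguments chosen only for illustration:

```python
def sql_effective_iops(iops, partitioning="none", ram_gb=4.0,
                       sharding=False, shard_count=1):
    """iops x partitioning bonus x buffer-pool bonus x shard count."""
    partition_mult = 2.0 if partitioning != "none" else 1.0
    ram_mult = 1.5 if ram_gb > 8 else 1.0
    shard_mult = shard_count if sharding else 1
    return iops * partition_mult * ram_mult * shard_mult


def sql_latency_ms(eff_rps, effective_iops):
    """Latency added by the node: (effRPS / effectiveIOPS) x 20 ms."""
    return eff_rps / effective_iops * 20.0
```

A 1 000-IOPS disk with hash partitioning, 16 GB of RAM and 4 shards reaches 1 000 × 2 × 1.5 × 4 = 12 000 effective IOPS, so 6 000 RPS adds 10 ms.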

NoSQL Database

NoSQL stores are optimized for high write throughput and horizontal scale. Sharding multiplies base throughput capacity by distributing data across nodes.

baseCapacity = 2 000 × (sharding ? shardCount : 1) RPS
nodeLat += (effRPS / (baseCapacity × (ram > 8 ? 1.5 : 1))) × 10ms
ram (GB, 1–256): In-memory document cache. >8 GB gives 1.5× effective capacity.
sharding (toggle): Distributes data across multiple shard nodes. Multiplies base capacity by shard count.
shardCount (2–256, shown when sharding on): Default 3. Each additional shard adds 2 000 RPS to total capacity.

Elasticsearch

Search latency decreases with more cluster nodes (parallel shard execution). Low RAM per node causes JVM heap pressure and frequent GC pauses (+10 ms penalty). Utilisation is modelled against a capacity of 1 000 QPS per node — a conservative real-world estimate for mixed search/index workloads on moderate hardware.

nodeLat += (20ms / nodes) + (ram < 8 ? 10ms : 2ms)
Single-node, adequate RAM: 22 ms
3-node cluster, default: 8.7 ms
10 nodes but heap-starved: 12 ms
nodes (1–50): Cluster size. Latency scales as 20/nodes, so doubling nodes halves base search time. Capacity = nodes × 1 000 QPS.
ram (GB per node): JVM heap. Below 8 GB adds a 10 ms GC pressure penalty; 8 GB or more adds only 2 ms overhead.

Object Storage (S3 / GCS / Azure Blob)

Provider-agnostic object store model. Latency is composed of six additive steps: storage class retrieval tier, auth overhead (standard SigV4 vs. presigned), optional KMS encryption round-trip, Transfer Acceleration edge routing, per-prefix request-rate throttling, and bandwidth saturation. All steps respect the configured workload type (read vs. write).

nodeLat += classLatency (50 | 100 | 150 ms)
nodeLat += presignedUrl ? 0 : 10ms (SigV4 auth)
nodeLat += encryption === 'kms' ? 8ms : encryption === 'sse-s3' ? 1ms : 0
if (transferAcceleration) nodeLat ×= 0.6
prefixCap = prefixCount × (write ? 3 500 : 5 500) req/s
sat = effRPS / prefixCap
if (sat > 1) nodeLat += (sat − 1) × 200ms [+ error risk]
bwSat = (effRPS × 0.1 MB) / throughput_MBs
if (bwSat > 1) nodeLat += (bwSat − 1) × 100ms
throughput (MB/s, 1–10 000): Bandwidth cap. Assumes ~0.1 MB avg payload per request.
storageClass (standard | ia | glacier-instant): standard: 50 ms. Infrequent Access: 100 ms. Glacier Instant Retrieval: 150 ms.
presignedUrl (toggle): Off: +10 ms server-side SigV4 validation. On: signing is delegated to the client (0 ms auth overhead).
prefixCount (1–1 000): Number of key prefixes. Each prefix handles 5 500 GET/s or 3 500 PUT/s before 503 SlowDown backoff fires.
encryption (none | sse-s3 | kms): SSE-S3: +1 ms (AES-256 inline). SSE-KMS: +8 ms (extra KMS GenerateDataKey API call).
transferAcceleration (toggle): Routes via nearest CDN edge PoP → private backbone. Multiplies effective latency by 0.6 (40% reduction).
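
The steps of the object-storage model can be strung together as in the Python sketch below. It follows the order of the formula lines above (acceleration is applied before the throttling and bandwidth penalties); the function name, argument names and defaults are assumptions made for the example, and the error risk on throttling is not modelled.

```python
CLASS_LATENCY_MS = {"standard": 50.0, "ia": 100.0, "glacier-instant": 150.0}
ENCRYPTION_MS = {"none": 0.0, "sse-s3": 1.0, "kms": 8.0}


def object_storage_latency_ms(eff_rps, storage_class="standard",
                              presigned_url=False, encryption="none",
                              transfer_acceleration=False, prefix_count=1,
                              write=False, throughput_mbs=1000.0):
    lat = CLASS_LATENCY_MS[storage_class]      # storage class retrieval tier
    lat += 0.0 if presigned_url else 10.0      # SigV4 auth overhead
    lat += ENCRYPTION_MS[encryption]           # SSE-S3 / KMS round-trip
    if transfer_acceleration:
        lat *= 0.6                             # 40% reduction via edge routing
    prefix_cap = prefix_count * (3500.0 if write else 5500.0)
    sat = eff_rps / prefix_cap                 # per-prefix request-rate saturation
    if sat > 1.0:
        lat += (sat - 1.0) * 200.0
    bw_sat = (eff_rps * 0.1) / throughput_mbs  # ~0.1 MB average payload
    if bw_sat > 1.0:
        lat += (bw_sat - 1.0) * 100.0
    return lat
```

At low load a standard-class read with KMS encryption costs 50 + 10 + 8 = 68 ms; turning on Transfer Acceleration scales that to about 40.8 ms.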

Redis

Redis is single-threaded and memory-bound. Its compute cost is negligible (0.5 ms) on top of the network round-trip. Enable sharding (Redis Cluster) to scale throughput linearly across shard nodes — each shard independently handles its key-space slice at the same 50 000 ops/s per GB rate.

cap = memory_GB × 50 000 × (sharding ? shardCount : 1) ops/s
util = effRPS / cap
nodeLat += 0.5ms + (util > 0.8 ? (util − 0.8) × 10ms : 0)
At 1 GB RAM, single instance handles 50 000 RPS. With sharding enabled and 3 shards, capacity grows to 150 000 RPS — matching Redis Cluster's horizontal scale behaviour.
memory (GB, 0.1–64): Determines ops/s capacity: 50 000 × memory GB per shard.
conns (max connections): Concurrent client connections. Feeds the global saturation check.
cacheHitRatio (0–1 slider): Used when Redis is the target of a caching connection (isCaching flag).
sharding (toggle): Enables Redis Cluster mode. Throughput scales linearly with shard count.
shardCount (2–256, shown when sharding on): Default 3 (matching Redis Cluster minimum). Each shard adds full memory × 50k capacity.
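
A short Python sketch of the Redis capacity and latency formulas, using hypothetical function names:

```python
def redis_capacity_ops(memory_gb, sharding=False, shard_count=3):
    """50 000 ops/s per GB, multiplied by shard count when sharding is on."""
    return memory_gb * 50_000 * (shard_count if sharding else 1)


def redis_latency_ms(eff_rps, capacity_ops):
    """0.5 ms compute cost plus 10 ms per unit of utilisation above 80%."""
    util = eff_rps / capacity_ops
    return 0.5 + max(0.0, util - 0.8) * 10.0
```

A 1 GB instance has a capacity of 50 000 ops/s; at 45 000 RPS it sits at 90% utilisation and adds 0.5 + 0.1 × 10 = 1.5 ms.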

Memcached

Memcached is a distributed, multi-threaded, volatile cache. Unlike Redis it has no persistence or complex data structures — just fast key/value storage. Its key advantage is multi-threading: throughput scales with both memory allocation and thread count. Sharding adds further horizontal capacity by distributing keys across independent Memcached nodes (client-side consistent hashing).

cap = memory_GB × threads × 40 000 × (sharding ? shardCount : 1) ops/s
util = effRPS / cap
nodeLat += 0.3ms + (util > 0.8 ? (util − 0.8) × 10ms : 0)
1 GB RAM, 4 threads → 160 000 ops/s capacity. Compared to Redis (50 000 ops/s at 1 GB), Memcached wins on raw throughput for simple get/set workloads due to multi-threading. Trade-off: no persistence, no data structures, no pub/sub.
memory (GB, 0.1–64): Allocated RAM. Capacity = memory × threads × 40 000 ops/s.
threads (1–64): Worker threads. Default 4. Each thread independently processes requests.
maxItemSize (MB, 0.1–128): Maximum cacheable object size. Default 1 MB (Memcached default slab ceiling).
cacheHitRatio (0–1 slider): Used when Memcached is the target of a caching connection (isCaching flag).
sharding (toggle): Client-side consistent hashing across multiple Memcached nodes. Multiplies total capacity by shard count.
shardCount (2–256, shown when sharding on): Default 2. Each node runs independently, with no cross-node replication.

Time-Series DB

Optimized for append-only streaming writes and window-based reads. It models heavy replication and background downsampling (merging old data points).

effCapacity = (writeRateKps × 1000) / replicationFactor
nodeLat += (effRPS / effCapacity) × (downsampling ? 0.6 : 1.0)
Downsampling bonus: Enabling background downsampling reduces read overhead by 40% (0.6× multiplier) by pre-calculating aggregates (min, max, avg) at the cost of disk space.
writeRateKps (k req/s): Ingestion ceiling. Shared across nodes before sharding.
downsamplingEnabled (toggle): Reduces read-path latency by using pre-aggregated data windows.

Graph Database

Relies on pointer-following for deep relationship traversals. Latency grows non-linearly with traversal depth (hops) and average node degree (connections per node).

traversalCost = 10ms + degree^(depth / 2)
nodeLat += traversalCost + (consistency === 'strong' ? 50ms : 0)
Strong Consistency

Adds 50ms sync-penalty to model global lock acquisition or ACID coordination across shards.

Traversal Depth

3 hops = baseline. 5 hops = exponential blowup. High-degree nodes (20+ edges) saturate CPUs rapidly.

avgDegree (edges): Density of the graph. More edges = more data fetched per hop.
traversalDepth (hops): Search radius. Traversal cost = 10 ms + degree^(depth/2).

Key-Value Store

Models simple O(1) lookups (DynamoDB, etcd, Consul). Highly parallel, deterministic performance intended for transient configuration or distributed locking.

nodeLat += 0.5ms + (consistency === 'strong' ? 2ms : 0)
consistencyLevel (eventual | strong): Strong consistency (like etcd/Raft) adds a 2ms consensus penalty.
evictionPolicy (lru | lfu | ttl): Behavior when RAM is full. Affects utilization metrics only.

Blob Storage

Optimized for large object storage (images, MP4, backups). Unlike the general Object Storage node, it specifically models bandwidth-bound transfer latency with CDN bypass, multipart upload, replication penalties, and storage-class-aware first-byte cost.

transferLat = (5MB × 8) / (maxMbps / 1000)
nodeLat += 20ms (metadata lookup) + transferLat
Bandwidth Scaling: While metadata is fast, the bits-in-flight time (assume 5MB avg) dominates. Increasing Max Throughput Mbps directly reduces transfer latency.
maxThroughputMbps (Mbps): Available network bandwidth. 1000 Mbps = 1 Gbps.

Data Warehouse (Snowflake / BigQuery)

OLAP store for multi-terabyte queries. Columnar storage and massive query parallelism reduce scan times for "Very Complex" analytics jobs.

baseCost = complexityCount [simple=1k, complex=20k]
nodeLat += baseCost / parallelismDegree
Columnar Storage

Enabled by default. Optimizes wide-table aggregation by reading only necessary columns from disk.

Query Parallelism

Total workers assigned to one job. 64 parallelism means the compute cost is divided by 64.

parallelismDegree (threads): Compute units dedicated to one query execution.
avgQueryComplexity (select): Simple (1s) to Very Complex (60s) base costs.

Kafka / Event Stream

Kafka latency is shaped by partition count (parallelism) and broker count (replication overhead). More partitions reduce per-partition queue depth via log₂ bonus; more brokers add a small ISR sync penalty. Latency is floored at 1 ms — even a massively over-provisioned cluster has serialization overhead.

partitionBonus = log₂(partitions) × 2ms
brokerPenalty = brokers × 0.5ms
nodeLat += max(1ms, 10ms − partitionBonus + brokerPenalty)
3 brokers, 10 partitions

10 − (log₂(10)×2) + (3×0.5) = 10 − 6.64 + 1.5 = 4.9 ms

5 brokers, 100 partitions

10 − (log₂(100)×2) + (5×0.5) = 10 − 13.29 + 2.5 = −0.8, floored: max(1, −0.8) = 1 ms

brokers (1–100): Broker count. Each broker adds 0.5 ms ISR replication overhead.
partitions (1–1000): Partition count. Each doubling gives a fixed log₂ reduction in per-message queuing.
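
The Kafka latency formula, including the 1 ms floor, fits in a few lines of Python. The function name is illustrative:

```python
import math


def kafka_latency_ms(brokers, partitions):
    """10 ms base, minus the log2 partition bonus, plus 0.5 ms per broker, floored at 1 ms."""
    partition_bonus = math.log2(partitions) * 2.0
    broker_penalty = brokers * 0.5
    return max(1.0, 10.0 - partition_bonus + broker_penalty)
```

`kafka_latency_ms(3, 10)` comes out at about 4.9 ms, while `kafka_latency_ms(5, 100)` would be negative without the floor and is clamped to 1 ms.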

Message Queue (SQS)

FIFO queues guarantee exactly-once ordered delivery at the cost of extra coordination overhead. Standard queues trade ordering guarantees for lower latency.

nodeLat += type === 'fifo' ? 15ms : 5ms
type (standard | fifo): Standard: +5 ms. FIFO: +15 ms (ordering/dedup overhead).
visibilityTimeout (seconds): How long a message is invisible after being consumed. Does not affect request-path latency.

Pub / Sub

Fan-out adds a per-subscriber delivery scheduling cost to publish latency. 0.5 ms per subscriber reflects the broker's cost of dispatching to each downstream delivery agent.

nodeLat += 3ms + fanoutCount × 0.5ms
fanoutCount (1–100): Number of downstream subscribers. Each adds 0.5 ms fan-out scheduling overhead.

Mail Server

SMTP is inherently high-latency (TCP handshake + EHLO + DATA phase + server ACK: 100–250 ms baseline). The connection pool queuing follows M/M/c queueing theory: latency diverges sharply as utilisation approaches and exceeds 1.0. Above 1.5× saturation, each connection has a 40% chance of being dropped as an error.

util = effRPS / concurrentConns
baseLat = 100ms + rand(0–150ms)
if (util > 1) baseLat += (util − 1)² × 500ms
if (util > 1.5 && rand() < 0.4) → DROP (error)
concurrentConns (1–500): Maximum simultaneous SMTP sessions. Queuing diverges above 100% utilisation; errors above 150%.
dailyLimit (messages/day): Gradual throttling kicks in at 10× fair-share per second. At 100× fair-share, 50% drop probability.
protocol (smtp | imap | pop3): Protocol label for documentation. SMTP overhead is uniform across variants in simulation.

RabbitMQ

An AMQP broker implementing smart-routing via exchanges. Performance depends on persistence settings, acknowledgment modes, and the total consumer pool size across all queues.

nodeLat += 2ms + (manualACK ? 3ms : 0) + (persistence ? 5ms : 0)
cap = consumerCount × 5000 msgs/s
util = effRPS / cap
if (util > 1) nodeLat += (util − 1) × 20ms
Smart Routing: RabbitMQ handles logical exchange-to-queue routing. Each consumer is assumed to process 5,000 messages/second baseline capacity. Manual ACK modes add round-trip overhead to the delivery path.
consumerCount (count): Total workers pulling from the queue. Multiplies throughput capacity.
acknowledgmentMode (auto | manual): Manual ACK adds 3ms per-message coordination latency.
persistenceEnabled (toggle): Durable queues add 5ms disk-sync latency to the publish path.

Scheduler (Cron / Job Runner)

Models a background task scheduler. Unlike the Service Worker (which focuses on thread pooling), the Scheduler focuses on long-running jobs and concurrent execution limits. It follows a variation of Little's Law for queueing wait times.

load = (effRPS × jobDurationMs / 1000) / maxConcurrentJobs
nodeLat += jobDurationMs + (load > 1 ? (load − 1) × jobDurationMs : 0)
Concurrency Limit: If the total work (RPS × duration) exceeds available concurrency slots, jobs start queueing linearly. A 500ms job on a 10-job server at 40 RPS results in (2.0 − 1) × 500 = 500ms extra wait.
maxConcurrentJobs (count): Maximum simultaneous tasks. Throughput ceiling = concurrency / duration.
avgJobDurationMs (ms): Mean execution time per task. Determines how quickly concurrency slots are freed.
cronResolutionSec (sec): The granularity of the clock (e.g. 60s for standard crontab). Informational.
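
The Scheduler's load calculation is Little's Law divided by the slot count. A minimal Python sketch with a hypothetical function name:

```python
def scheduler_latency_ms(eff_rps, job_duration_ms, max_concurrent_jobs):
    """Job duration plus a linear queueing wait once load exceeds 1.0."""
    # Jobs in flight (RPS x duration in seconds) relative to available slots.
    load = (eff_rps * job_duration_ms / 1000.0) / max_concurrent_jobs
    wait_ms = (load - 1.0) * job_duration_ms if load > 1.0 else 0.0
    return job_duration_ms + wait_ms
```

For a 500 ms job with 10 slots at 40 RPS the load is 2.0, so the call returns 500 ms of execution plus 500 ms of waiting, 1 000 ms in total.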

LLM / AI Model

LLM latency is dominated by token generation speed. The engine estimates output tokens as 10% of the context window size and converts generation time to milliseconds.

outputTokens = contextWindow × 0.1
nodeLat += (outputTokens / tokensPerSec) × 1000ms
4k ctx, 50 t/s: 8 ms. Small context, fast model.
128k ctx, 50 t/s: 256 ms. Default settings.
128k ctx, 10 t/s: 1 280 ms. Slow/large model.
contextWindow (k tokens, 4–1024): Max input context. The figures above follow from plugging the value in thousands straight into the formula (4k → 4). Proxy for output length (10% assumed as completion).
tokensPerSec (t/s, 1–200): Generation throughput. Primary bottleneck for LLM latency.
temperature (0–2 slider): Sampling randomness. Does not affect latency in simulation.
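
The LLM figures can be reproduced with the sketch below. It assumes, based only on the example values for this node, that the context window enters the formula as its value in thousands of tokens; the function name is hypothetical.

```python
def llm_latency_ms(context_window_k, tokens_per_sec):
    """(contextWindow x 0.1 / tokensPerSec) x 1000 ms, contextWindow in k tokens."""
    output_tokens = context_window_k * 0.1  # 10% of the context assumed as output
    return output_tokens / tokens_per_sec * 1000.0
```

Under that assumption a 4k context at 50 t/s gives about 8 ms and a 128k context at 10 t/s gives about 1 280 ms.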

Vector Database

ANN search latency grows with vector dimensionality. The index type determines the search algorithm: HNSW (graph-based) is fastest; Flat (brute-force) is slowest.

dimPenalty = dimensions / 512
indexBonus = hnsw ? 0.5 : flat ? 5 : 1 (ivf)
nodeLat += 5ms × dimPenalty × indexBonus
HNSW: 7.5 ms at 1536 dims
IVF: 15 ms at 1536 dims
FLAT: 75 ms at 1536 dims
dimensions (64–3072): Embedding vector size. Latency scales linearly with dimensions/512.
indexType (hnsw | ivf | flat): HNSW: 0.5× penalty. IVF: 1× (baseline). Flat: 5× (brute-force scan).
similarity (cosine | l2 | dot): Distance metric. Not differentiated in latency; use for documentation.

Embedding API

Embedding generation is a fixed-cost operation whose per-item cost is amortized by batching. Larger batches reduce effective per-request latency.

nodeLat += latencyMs / log₂(batchSize)
batch = 1

Full 50 ms per request (no batch savings). Since log₂(1) = 0, this implies the divisor is floored at 1 rather than applied literally.

batch = 128

50 / log₂(128) = 50 / 7 ≈ 7.1 ms per item.

latencyMs (ms, 5–500): Baseline embedding call latency at batch size 1.
batchSize (1–2048): Items per API call. Higher batch → logarithmically lower per-item latency.
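
A small Python sketch of the batching formula. Note the assumption: log₂(1) is 0, so the divisor is floored at 1 here to reproduce the "full latency at batch = 1" behaviour described above; how the engine actually guards this case is not stated, and the function name is hypothetical.

```python
import math


def embedding_latency_ms(latency_ms, batch_size):
    """latencyMs / log2(batchSize), with the divisor floored at 1 (assumed guard)."""
    divisor = max(1.0, math.log2(batch_size))
    return latency_ms / divisor
```

With a 50 ms baseline, a batch of 1 returns the full 50 ms and a batch of 128 returns 50 / 7 ≈ 7.1 ms per item.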

AI Agent / Orchestrator

Agentic workflows call the LLM multiple times per user request. The orchestrator models only its own routing overhead — each downstream LLM call accrues its own latency separately.

strategyPenalty = tree ? 5 : reflex ? 2 : 1
nodeLat += 10ms × avgSteps × strategyPenalty
chain (×1): Linear tool calls. 3 steps = 30 ms orchestration overhead.
reflex (×2): Feedback loops. Agent evaluates its own output before proceeding.
tree (×5): Tree-of-thought. Explores multiple reasoning paths in parallel, which is expensive.
multiStepFactor (1–20): Average number of LLM tool-call iterations per user request.
strategy (chain | reflex | tree): Reasoning strategy. Multiplies step overhead: chain×1, reflex×2, tree×5.

Document Parser / ETL

Document parsing latency grows with chunk size (more data per unit) and shrinks with parallelism.

chunkPenalty = chunkSizeBytes / 512
nodeLat += (25ms × chunkPenalty) / parallelJobs
chunkSizeBytes (128–8192): Size of each processed chunk. Larger chunks = more work per task = higher latency.
parallelJobs (1–100): Concurrent workers. Linearly reduces effective per-request latency.

Workflow Engine

Workflow engines (Temporal, n8n, Airflow) persist state at every step, adding a fixed 10 ms checkpoint overhead. Queuing latency activates above 80% concurrency saturation and grows as the overflow portion using an M/D/1 model (fixed service time, variable arrival). The penalty is (sat − 0.8) × 25ms — only the overflow fraction is penalised, not the full saturation value.

nodeLat += 10ms (state persistence overhead)
execSaturation = effRPS / concurrency
if (execSaturation > 0.8) nodeLat += (execSaturation − 0.8) × 25ms
Utilisation uses Little's Law: active executions ≈ throughput × avg workflow duration (capped at 10 s). At default concurrency of 50: 100 RPS × 10 s / 50 slots = 20× over-subscribed → 2000% → clamped to 100%. At 2 RPS × 10 s = 20 active / 50 slots → 40% utilisation.
concurrency (10–5000): Maximum simultaneous workflow executions. Above 80% triggers overflow queuing penalty.
timeout (seconds, 1–3600): Per-step execution ceiling. Long steps may cascade into downstream timeouts.
retryPolicy (toggle): Auto-retry on step failure adds 50 ms retry overhead but clears transient errors.
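
The two Workflow Engine calculations (the latency penalty and the utilisation gauge) can be sketched separately in Python. The function names are hypothetical, and the retry overhead is left out.

```python
def workflow_latency_ms(eff_rps, concurrency):
    """10 ms checkpoint overhead plus 25 ms per unit of saturation above 0.8."""
    exec_saturation = eff_rps / concurrency
    return 10.0 + max(0.0, exec_saturation - 0.8) * 25.0


def workflow_utilisation(eff_rps, avg_duration_s, concurrency):
    """Little's Law: active executions / slots, duration capped at 10 s, clamped to 100%."""
    active = eff_rps * min(avg_duration_s, 10.0)
    return min(1.0, active / concurrency)
```

At the default concurrency of 50, `workflow_utilisation(2, 10, 50)` gives 0.4 and `workflow_utilisation(100, 10, 50)` is clamped to 1.0.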