Simulation Engine Documentation

The Foundation: Little's Law

L = λ × W

The entire simulation is anchored by Little's Law — the fundamental theorem of queuing theory. It guarantees that for any stable system, the average concurrency (L) equals throughput (λ) times latency (W).

Concurrency

In-flight requests active at any instant — bounded by your connection pool.

Throughput

Successfully served requests per second after dropping errors.

Latency

Full residence time: queuing wait + service time across every hop.

Practical intuition: If your average latency doubles (W↑), your sustainable throughput (λ) must halve for a fixed thread pool (L). This is why a single slow database query can starve your entire web tier of connections.

Saturation Formula

ρ = (RPS × Latency_ms / 1000) / MaxConnections

When ρ > 1.0 the node is over-saturated. The engine adds a queuing penalty of (ρ − 1) × 25 ms, i.e. 25 ms per unit of overflow. If ρ > 2.5, there is a 40% chance of a dropped request (simulating connection queue exhaustion).
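
As a rough illustration of the saturation rule above, the following Python sketch computes ρ and the overflow penalty. The function names are hypothetical and not taken from the engine, and the 40% drop chance above ρ = 2.5 is left out.

```python
def saturation(rps: float, latency_ms: float, max_connections: int) -> float:
    """rho = (RPS * Latency_ms / 1000) / MaxConnections."""
    return (rps * latency_ms / 1000.0) / max_connections


def queuing_penalty_ms(rho: float) -> float:
    """Overflow penalty: (rho - 1) * 25 ms, applied only when rho > 1."""
    return max(0.0, rho - 1.0) * 25.0


# 2 000 RPS at 100 ms needs 200 concurrent slots; a pool of 100 gives rho = 2.0.
rho = saturation(rps=2000, latency_ms=100, max_connections=100)
print(rho, queuing_penalty_ms(rho))
```

Doubling the pool to 200 connections brings ρ back to 1.0 and removes the penalty entirely.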

Nested Private Services

In a Project Workspace you can drop one private service inside another as a Microservice node. The engine does not replay a frozen benchmark snapshot for that node. Instead, before sampling begins, it simulates the nested service's entire internal architecture against the exact RPS it receives in the current run, then feeds the resulting latency, error rate, and saturation back into the parent simulation.

How a Microservice node is resolved

effRPS = systemRPS × trafficWeight(node)
innerResult = simulate(service.canvas, effRPS) ← full recursive run
node.latency = innerResult.p50
node.errorRate = innerResult.errorRatePct
node.saturation = peak internal node saturation

Because the nested error rate drives both the node and the request path, per-node and overall metrics always agree. A service that is healthy at 100 RPS but collapses at 100k RPS will show exactly that — its internal Server/DB tiers saturate under the real load instead of reporting infinite capacity. Recursion is depth-guarded (up to 4 levels) so deeply composed systems resolve without runaway loops.

Each Microservice node also exposes a Request Absorption Rate. When a service references another service internally, this rate is the probability that an inbound request cascades to the downstream service — modelling caching, early returns, or validation that "absorbs" traffic at the edge.

Example: if user-service sits at the project root and is also nested inside both payment-service and lifestyle-service with an absorption rate of 1.0, it accumulates traffic from all three paths — direct plus both cascades — rather than only its direct share.

Network Physics

Every hop between nodes pays a network cost before any component logic runs. The cost depends on whether the nodes share a VPC and region.

Same VPC	Private subnet routing, essentially free	0.5 ms
Same region, different VPC	VPC peering or transit gateway	3–5 ms
Cross-region	Backbone transit over geographic distances	60–100 ms
Public Internet (no VPC)	At least one node outside a private network	15–25 ms
HTTPS / TLS	Handshake + encryption overhead on top of network	+3 ms
mTLS (strict)	Mutual certificate exchange (service mesh)	+2 ms
gRPC / protobuf	Binary serialization reduces payload size → 30% lower transmission latency	−30%

Server / VM

CPU performance degrades exponentially as effective RPS per core approaches a saturation threshold (~800 RPS/vCPU). Low RAM adds an additional multiplier from page-fault pressure.

nodeLat += BASE(5ms) × 1.5^(effRPS / (vCPU × 800)) × ramPenalty

CPU Safety Limit

800 RPS per vCPU is the stable zone. Beyond it, context-switching overhead causes non-linear latency spikes.

RAM Penalty

Applies only when RAM < 1 GB: penalty = 1.5 / ram. At 0.5 GB → 3× multiplier; at 0.25 GB → 6×.

cpu	vCPU count	Sets the per-core RPS ceiling. More cores = linear scaling until memory is the bottleneck.
ram	GB	Below 1 GB RAM triggers the page-fault penalty multiplier.
conns	max connections	Connection pool size. Controls saturation threshold via Little's Law.
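
The CPU and RAM terms above can be combined in a few lines of Python. This is a minimal sketch of the documented formula only; the function names are hypothetical, and the connection-pool saturation check is not included.

```python
def ram_penalty(ram_gb: float) -> float:
    """Page-fault multiplier: 1.5 / ram below 1 GB, otherwise no penalty."""
    return 1.5 / ram_gb if ram_gb < 1.0 else 1.0


def server_latency_ms(eff_rps: float, vcpu: float, ram_gb: float) -> float:
    """BASE(5 ms) * 1.5^(effRPS / (vCPU * 800)) * ramPenalty."""
    load = eff_rps / (vcpu * 800.0)
    return 5.0 * (1.5 ** load) * ram_penalty(ram_gb)


# Two vCPUs running exactly at the 800 RPS/vCPU threshold.
print(server_latency_ms(1600, vcpu=2, ram_gb=4))    # ample RAM
print(server_latency_ms(1600, vcpu=2, ram_gb=0.5))  # 3x page-fault multiplier
```

Passing a fractional value for `vcpu` gives the same calculation for CPU shares.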

Container

Uses the same exponential CPU model as Server/VM. Containers typically run with fractional CPU shares and low RAM, so the RAM penalty is almost always active at default values (0.5 GB → 3× multiplier).

nodeLat += BASE(5ms) × 1.5^(effRPS / (cpuShares × 800)) × ramPenalty

cpu	CPU shares (0.1–8)	Fractional CPU allocation. 0.5 = half a vCPU.
ram	GB (0.1–16)	Almost always below 1 GB — RAM penalty nearly always active.

Serverless Function

Every invocation incurs a cold-start penalty. Higher memory allocation gives proportionally more CPU, so 1 GB functions start roughly 2.8× faster than 128 MB functions.

coldStart = (100ms + rand(0–500ms)) / √(memory_MB / 128)

Cold Start Range

100–600 ms at 128 MB. Scales down inversely with the square root of memory allocation.

Hard Timeout

If cumulative path latency exceeds timeout × 1000 ms, the request is dropped as a 504 error.

memory	MB (128–10240)	Higher memory = more CPU = faster cold starts and execution.
timeout	seconds (1–900)	Absolute ceiling on total path latency. Exceeded → dropped request.
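
To see how memory shifts the cold-start window, the sketch below evaluates the formula at its two extremes and draws one random sample. It assumes rand(0–500ms) is uniformly distributed, which the text does not state; the function names are illustrative.

```python
import math
import random


def cold_start_bounds_ms(memory_mb: float) -> tuple:
    """Min and max cold start: (100 ms .. 600 ms) / sqrt(memory_MB / 128)."""
    scale = math.sqrt(memory_mb / 128.0)
    return 100.0 / scale, 600.0 / scale


def sample_cold_start_ms(memory_mb: float, rng: random.Random) -> float:
    """One random draw; the jitter term is assumed to be uniform."""
    return (100.0 + rng.uniform(0.0, 500.0)) / math.sqrt(memory_mb / 128.0)


print(cold_start_bounds_ms(128))
print(cold_start_bounds_ms(1024))
print(sample_cold_start_ms(512, random.Random(42)))
```

At 1 024 MB the divisor is √8 ≈ 2.83, which is where the "roughly 2.8× faster" figure comes from.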

WebSocket Server

Long-lived connections have very low per-message overhead. The main constraint is the total concurrent connection ceiling — once saturated, new handshakes are rejected. Heartbeat adds 0.5 ms overhead only when heartbeatInterval > 0 (i.e. actively configured). Setting it to 0 disables ping/pong entirely.

nodeLat += 1ms + (heartbeatInterval > 0 ? 0.5ms : 0)

Utilisation uses Little's Law directly: concurrent connections ≈ throughput × avg session (30 s). At 1 000 new connections/s with a 30 s session lifetime → 30 000 concurrent, on a 50 000-conn server → 60% load.

conns	max connections	Total simultaneous WebSocket clients. Saturation = dropped handshakes.
heartbeatInterval	seconds (0 = disabled)	Set to 0 to disable ping/pong. Any positive value adds 0.5 ms per-message overhead.

Service Worker

Models a background worker pool — Node.js worker_threads, Python Celery, Go goroutine pool, Java ExecutorService. Each worker processes one job at a time with a deterministic service time. The engine uses the M/D/1 Pollaczek–Khinchine (P-K) formula: Poisson arrivals, deterministic (fixed) service time — the most accurate closed-form model for a fixed thread pool.

serviceRate = workerCount × (1 000 / jobDurationMs) [jobs/s]
ρ = effRPS / serviceRate
if ρ < 1 (stable): Wq = ρ × jobDurationMs / (2 × (1 − ρ)) ← P-K mean wait
if ρ ≥ 1 (overloaded): Wq = (ρ − 1) × jobDurationMs × 2 ← linear backlog divergence
nodeLat += jobDurationMs + Wq
if backlog > queueDepth → DROP (proportional probability)

4 workers, 100ms job, 100 RPS	ρ = 100 / (4×10) = 2.5	100 + (1.5×100×2) = 400ms	Severely overloaded, queue explodes
4 workers, 100ms job, 30 RPS	ρ = 30 / 40 = 0.75	100 + (0.75×100/(2×0.25)) = 250ms	Healthy but queue building
8 workers, 50ms job, 100 RPS	ρ = 100 / 160 = 0.625	50 + (0.625×50/(2×0.375)) ≈ 92ms	Comfortable, low wait

Memory ceiling: If workerCount × memoryPerWorkerMb exceeds 16 GB (assumed host), the engine silently clamps effectiveWorkers to fit. At 64 MB/worker the ceiling is 256 workers; at 512 MB/worker only 32 workers run regardless of config.

workerCount	1–1024	Thread / process pool size. Directly sets serviceRate ceiling.
jobDurationMs	ms (1–60 000)	Average job execution time. Lower = higher throughput per worker. Sets the D in M/D/1.
queueDepth	1–100 000	Max backlog before jobs are dropped. Drop probability scales with how far backlog exceeds this.
retryEnabled	toggle	On failure, 50% of jobs are retried once. Clears transient errors but adds one jobDuration of latency.
memoryPerWorkerMb	MB (16–8 192)	Heap per worker. Clamps effective worker count if total exceeds 16 GB host memory.
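
The three worked examples and the memory clamp can be reproduced with a short Python sketch. Function names are hypothetical; queueDepth drops and retries are omitted, so this covers only the latency and worker-count parts of the model.

```python
HOST_MEMORY_MB = 16 * 1024  # the 16 GB host assumed by the engine


def effective_workers(worker_count: int, memory_per_worker_mb: int) -> int:
    """Clamp the pool so that workers * heap fits into host memory."""
    return min(worker_count, HOST_MEMORY_MB // memory_per_worker_mb)


def worker_latency_ms(eff_rps: float, workers: int, job_ms: float) -> float:
    """jobDurationMs + Wq: P-K wait below rho = 1, linear rule at or above it."""
    service_rate = workers * (1000.0 / job_ms)  # jobs per second
    rho = eff_rps / service_rate
    if rho < 1.0:
        wq = rho * job_ms / (2.0 * (1.0 - rho))
    else:
        wq = (rho - 1.0) * job_ms * 2.0
    return job_ms + wq


for rps, workers, job_ms in [(100, 4, 100), (30, 4, 100), (100, 8, 50)]:
    print(rps, workers, job_ms, round(worker_latency_ms(rps, workers, job_ms)))
print(effective_workers(1024, 64), effective_workers(1024, 512))
```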

Container Orchestrator (K8s / ECS)

Models a cluster of worker nodes managed by a control plane. Throughput capacity scales linearly with node count, CPU per node, and overcommit ratio. It simulates scheduling overhead and optional autoscaling latency.

capacity = nodeCount × cpuPerNode × 800 × overcommitRatio
util = effRPS / capacity
nodeLat += 5ms + (util > 0.8 ? (util − 0.8) × 50ms : 0)

Overcommit: In cloud environments, CPU is often over-provisioned. A ratio of 2.0 means the orchestrator expects to handle 1600 RPS per physical core by assuming not all pods are peaking simultaneously.

nodeCount	nodes	Number of worker nodes in the cluster.
cpuPerNode	cores	Compute capacity per node.
overcommitRatio	multiplier	CPU over-provisioning factor. Higher = more capacity but higher saturation risk.
autoScaling	toggle	When enabled, high utilization triggers a 10% pod startup latency penalty.

GraphQL Server

GraphQL introduces a parsing and resolution layer. Performance is sensitive to query depth and the number of underlying resolvers invoked per request.

resolverLat = resolversPerQuery × (dataLoaderBatching ? 1.2ms : 2.0ms)
nodeLat += 5ms (parsing) + resolverLat + (maxQueryDepth × 0.5ms)

DataLoader Batching

Reduces resolver latency by 40% by collapsing multiple downstream N+1 requests into single batches.

Query Caching

On a query-cache hit (probability queryCacheRatio), the entire resolution is skipped and the response returns in a fixed 2 ms.

resolversPerQuery	count	Average number of data fetchers triggered per root query.
maxQueryDepth	layers	Complexity ceiling. Deeper queries add recursive overhead.
queryCacheRatio	0–1	Probability of a whole-query result being served from memory.

Microservice (Nested Service)

A Microservice node represents another private service composed inside this one. It is a black box at this level — but not a static one. The engine recursively simulates the referenced service's full internal canvas at the effective RPS this node receives, so its latency, error rate, and saturation always reflect the real load. See Nested Private Services for the full model.

effRPS = systemRPS × trafficWeight
inner = simulate(service.canvas, effRPS)
nodeLat += inner.p50
drop probability ~ inner.errorRatePct

Request Absorption Rate: when this service internally calls another service, the absorption rate is the probability a request cascades downstream. 1.0 = every request propagates; 0.2 = 80% is absorbed at the edge (cache hits, early returns, validation drops).

hitRatio	0–1 (Request Absorption Rate)	Fraction of inbound traffic that cascades to internally-referenced services. Lower = more traffic absorbed before reaching downstream services.

API Gateway

The gateway adds a fixed 2 ms routing overhead and enforces rate limits. Requests above the rate limit are immediately dropped and counted as errors.

if (rateLimit > 0 && effRPS > rateLimit) → DROP (error)
else nodeLat += 2ms

rateLimit	req/s (0 = disabled)	Hard ceiling on accepted RPS. Overflow → error rate increase.
protocol	http | https	HTTPS adds +3 ms TLS overhead via the global network physics layer.

Load Balancer

The LB routes each request to exactly one downstream using a weighted random selection. The routing weight depends on the algorithm configured.

round-robin	1 (equal)	Each backend gets an equal share regardless of capacity.
weighted-round-robin	explicit edge weight, fallback → target CPU cores	If you set a weight on the connection edge, that takes precedence. Otherwise the target's CPU count is used automatically.
weighted-least-conn	same as weighted-round-robin	Same weight logic. Approximates least-connections via capacity.
resource-based	cpu + (ram / 4)	Combines CPU and RAM to compute a composite capacity weight.
ip-hash / random	1 (equal)	Equal probability per backend; hash affinity is approximated as a uniform distribution.

Auto-weight example: a load balancer with weighted-round-robin has one 32-core "High" target and one 2-core "Standard" target. With no edge weights set, the weights become 32 and 2, so High receives about 94% of traffic and Standard about 6%.
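
The auto-weight split can be checked with a small Python sketch of weighted random selection. The helper names are hypothetical, and the engine's own selection code may differ; only the weights (32 and 2) come from the example above.

```python
import random


def traffic_shares(weights: dict) -> dict:
    """Normalise routing weights into the fraction of traffic per backend."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def pick_backend(weights: dict, rng: random.Random) -> str:
    """Route one request to exactly one backend by weighted random selection."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


cpu_weights = {"High": 32, "Standard": 2}  # fallback weights = CPU cores
print(traffic_shares(cpu_weights))
print(pick_backend(cpu_weights, random.Random(7)))
```

For the resource-based algorithm the same helpers apply with each weight computed as cpu + (ram / 4).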

Cache-path behaviour: When an LB has a caching connection (isCaching) to a cache node (Redis, Memcached), it acts as a cache-first router. On a cache hit the request stops at the cache. On a miss the LB routes to exactly one backend using the configured algorithm — not all backends simultaneously.

algorithm	round-robin | weighted-* | resource-based | ip-hash | random	Routing strategy. See algorithm cards above.
conns	max connections (1 000–1 000 000)	Connection pool ceiling. Default 100 000. At high RPS, a low value causes saturation errors — real LBs handle hundreds of thousands of concurrent connections.

WAF / Firewall

Deep packet inspection adds latency proportional to the number of active rules. Shallow mode skips payload analysis.

rulesPenalty = log₁₀(rulesCount) × 2ms
nodeLat += (deep ? 8ms : 2ms) + rulesPenalty

Shallow Mode

Header-only inspection. Base +2 ms + rules overhead. Faster but misses payload-level attacks.

Deep Inspection

Full packet reassembly and payload matching. Base +8 ms + rules overhead.

Example: 100 active rules → log₁₀(100) × 2 = 4 ms. Deep mode total: 8 + 4 = 12 ms per hop. 5 000 rules → log₁₀(5000) × 2 ≈ 7.4 ms penalty.

rulesCount	10–5000	Active WAF rules. Logarithmic overhead (doubling rules does not double latency).
inspectionMode	shallow | deep	Shallow = +2 ms base. Deep = +8 ms base. Both add the rules penalty on top.
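
A brief Python sketch of the WAF formula shows how slowly the rules penalty grows. The function name is illustrative, not part of the engine.

```python
import math


def waf_latency_ms(rules_count: int, deep: bool) -> float:
    """Base cost (8 ms deep, 2 ms shallow) plus log10(rulesCount) * 2 ms."""
    return (8.0 if deep else 2.0) + math.log10(rules_count) * 2.0


for rules in (10, 100, 1000, 5000):
    print(rules, round(waf_latency_ms(rules, deep=True), 1))
```

Each tenfold increase in rules adds a flat 2 ms, so going from 100 to 1 000 rules costs the same as going from 10 to 100.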

DNS Resolver

DNS resolution latency is modelled with a probabilistic cache-hit using a hyperbolic saturation curve. TTL=30 s gives a 50% hit rate; TTL=300 s gives ~91%; TTL=3 600 s gives ~99.2%. This reflects real resolver behaviour better than a linear cap — most of the caching benefit is captured in the first few minutes of TTL.

hitProbability = ttl / (ttl + 30)
if (hit) nodeLat += 0.5ms (local cache)
else nodeLat += 5ms + rand(0–45ms) (upstream lookup)

TTL = 0s

5–50ms

No caching — every request hits upstream resolver

TTL = 30s

~14ms avg

Short TTL, half of traffic hits the upstream resolver (0.5 × 0.5 ms + 0.5 × 27.5 ms)

TTL = 300s

~3ms avg

Default: about 91% of traffic served from cache (0.91 × 0.5 ms + 0.09 × 27.5 ms)

ttl	seconds (0–86400)	Record time-to-live. Cache hit probability = ttl / (ttl + 30). TTL=0 forces every request upstream.
failoverEnabled	toggle	When enabled, a failed primary DNS target retries an alternate route (+5ms failover penalty).
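
Combining the hit probability with the two latency branches gives an expected latency per TTL. The sketch below assumes rand(0–45ms) is uniform, so an upstream lookup averages 5 + 22.5 = 27.5 ms; the function names are hypothetical and the failover penalty is ignored.

```python
def dns_hit_probability(ttl_s: float) -> float:
    """Hyperbolic saturation curve: ttl / (ttl + 30)."""
    return ttl_s / (ttl_s + 30.0)


def dns_expected_latency_ms(ttl_s: float) -> float:
    """Probability-weighted mean of the cache-hit and upstream-lookup branches."""
    hit = dns_hit_probability(ttl_s)
    miss_avg_ms = 5.0 + 45.0 / 2.0  # mean of 5 ms + uniform(0, 45) ms
    return hit * 0.5 + (1.0 - hit) * miss_avg_ms


for ttl in (0, 30, 300, 3600):
    print(ttl, round(dns_hit_probability(ttl), 3), round(dns_expected_latency_ms(ttl), 2))
```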

CDN

On every request the CDN rolls a cache-hit against the configured ratio. Hits are served from the nearest edge PoP (fast); misses add a 75 ms origin-pull penalty on top of the 5 ms edge base, giving ~80 ms total on miss.

nodeLat += 5ms (edge base always)
if (rand() ≥ cacheHitRatio) nodeLat += 75ms (origin pull)

cacheHitRatio	0–1 slider	Probability of serving from edge. 0.8 = 80% pay only 5 ms; 20% pay 80 ms.
ttl	seconds	Cache object lifetime. Not directly used in latency — informs what hit ratio to configure.

External Service (SaaS / API)

Models 3rd-party dependencies (Twilio, Stripe, Auth0). These services have their own latency distributions, error rates, and throughput limits outside your direct control.

baseLat = rand() > 0.9 ? p99Ms : p50Ms
if (effRPS > throughputRps) nodeLat += 100ms (congestion penalty)

Circuit Breaker

If error rate exceeds threshold, the CB opens and fails all requests immediately for a 30s cooldown, protecting the rest of your system.

Jitter Distribution

Simulated as a two-point distribution: 90% of requests take the p50 latency; 10% take the p99 tail latency.

throughputRps	req/s	The SLA ceiling. Exceeding this triggers 100ms congestion latency or 30% random drops.
latencyP50Ms	ms	Median latency. Typical duration for 9 out of 10 requests.
latencyP99Ms	ms	Tail latency. Applied to 10% of requests to model network jitter or cold paths.
errorRatePct	%	Baseline probability of the external service returning an error (e.g. 5xx).

SQL Database

SQL performance is bound by IOPS (disk I/O throughput). Partitioning doubles effective IOPS. RAM above 8 GB unlocks a 1.5× buffer-pool bonus (hot pages served from memory). Sharding multiplies total capacity linearly across shard nodes. The utilisation metric uses the same effective IOPS formula as latency — partitioning and RAM bonuses are reflected in both so the load gauge is always consistent.

effectiveIOPS = iops × (partitioning ≠ none ? 2 : 1) × (ram > 8 ? 1.5 : 1) × (sharding ? shardCount : 1)
nodeLat += (effRPS / effectiveIOPS) × 20ms

iops	100–50 000	Disk I/O operations/sec. Primary bottleneck metric for SQL.
ram	GB (1–256)	Buffer pool size. >8 GB unlocks 1.5× effective IOPS via hot-page caching.
conns	max connections	Connection pool. Default 100 — typical PostgreSQL/MySQL default.
partitioning	none | range | hash	Range/hash partitioning doubles effective IOPS and halves utilisation at the same RPS.
sharding	toggle	Enables horizontal sharding. Multiplies effective IOPS by shard count.
shardCount	2–256 (shown when sharding on)	Number of shards. Each shard independently handles effRPS / shardCount queries.
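
The way the multipliers stack is easiest to see in code. The following Python sketch mirrors the effectiveIOPS formula; the function and argument names are hypothetical, and connection-pool saturation is not modelled here.

```python
def effective_iops(iops: float, partitioning: str = "none", ram_gb: float = 4,
                   sharding: bool = False, shard_count: int = 1) -> float:
    """iops * (partitioned ? 2 : 1) * (ram > 8 ? 1.5 : 1) * (sharding ? shardCount : 1)."""
    result = float(iops)
    if partitioning != "none":
        result *= 2.0
    if ram_gb > 8:
        result *= 1.5
    if sharding:
        result *= shard_count
    return result


def sql_latency_ms(eff_rps: float, eff_iops: float) -> float:
    """Latency contribution: (effRPS / effectiveIOPS) * 20 ms."""
    return eff_rps / eff_iops * 20.0


tuned = effective_iops(3000, partitioning="hash", ram_gb=16, sharding=True, shard_count=4)
print(tuned, sql_latency_ms(9000, tuned))
print(effective_iops(3000), sql_latency_ms(9000, effective_iops(3000)))
```

With all bonuses active the same 3 000 IOPS disk behaves like 36 000 effective IOPS, a 12× difference in latency at the same RPS.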

NoSQL Database

NoSQL stores are optimized for high write throughput and horizontal scale. Sharding multiplies base throughput capacity by distributing data across nodes.

baseCapacity = 2 000 × (sharding ? shardCount : 1) RPS
nodeLat += (effRPS / (baseCapacity × (ram > 8 ? 1.5 : 1))) × 10ms

ram	GB (1–256)	In-memory document cache. >8 GB gives 1.5× effective capacity.
sharding	toggle	Distributes data across multiple shard nodes. Multiplies base capacity by shard count.
shardCount	2–256 (shown when sharding on)	Default 3. Each additional shard adds 2 000 RPS to total capacity.

Elasticsearch

Search latency decreases with more cluster nodes (parallel shard execution). Low RAM per node causes JVM heap pressure and frequent GC pauses (+10 ms penalty). Utilisation is modelled against a capacity of 1 000 QPS per node — a conservative real-world estimate for mixed search/index workloads on moderate hardware.

nodeLat += (20ms / nodes) + (ram < 8 ? 10ms : 2ms)

22 ms

Single-node, adequate RAM

8.7 ms

3-node cluster, default

12 ms

10 nodes but heap-starved

nodes	1–50	Cluster size. Latency scales as 20/nodes — doubling nodes halves base search time. Capacity = nodes × 1 000 QPS.
ram	GB per node	JVM heap. Below 8 GB adds a 10 ms GC pressure penalty; 8 GB or more adds only 2 ms overhead.
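
The three reference figures above follow directly from the formula, as this small Python sketch shows. The function name is illustrative, and 16 GB and 4 GB are assumed stand-ins for "adequate" and "heap-starved" RAM.

```python
def es_latency_ms(nodes: int, ram_gb: float) -> float:
    """(20 ms / nodes) plus 10 ms under heap pressure (ram < 8 GB), else 2 ms."""
    return 20.0 / nodes + (10.0 if ram_gb < 8 else 2.0)


print(es_latency_ms(1, 16))             # single node, adequate RAM
print(round(es_latency_ms(3, 16), 1))   # 3-node cluster
print(es_latency_ms(10, 4))             # 10 nodes but heap-starved
```

The last case illustrates that adding nodes cannot buy back the fixed 10 ms heap penalty.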

Object Storage (S3 / GCS / Azure Blob)

Provider-agnostic object store model. Latency is composed of six sequential steps: storage class retrieval tier, auth overhead (standard SigV4 vs. presigned), optional KMS encryption round-trip, Transfer Acceleration edge routing (a multiplier rather than an additive term), per-prefix request-rate throttling, and bandwidth saturation. All steps respect the configured workload type (read vs. write).

nodeLat += classLatency (50 | 100 | 150 ms)
nodeLat += presignedUrl ? 0 : 10ms (SigV4 auth)
nodeLat += encryption === 'kms' ? 8ms : encryption === 'sse-s3' ? 1ms : 0
if (transferAcceleration) nodeLat ×= 0.6
prefixCap = prefixCount × (write ? 3 500 : 5 500) req/s
sat = effRPS / prefixCap
if (sat > 1) nodeLat += (sat − 1) × 200ms [+ error risk]
bwSat = (effRPS × 0.1 MB) / throughput_MBs
if (bwSat > 1) nodeLat += (bwSat − 1) × 100ms

throughput	MB/s (1–10 000)	Bandwidth cap. Assumes ~0.1 MB avg payload per request.
storageClass	standard | ia | glacier-instant	standard: 50 ms. Infrequent Access: 100 ms. Glacier Instant Retrieval: 150 ms.
presignedUrl	toggle	Off: +10 ms server-side SigV4 validation. On: signing is delegated to the client — 0 ms auth overhead.
prefixCount	1–1 000	Number of key prefixes. Each prefix handles 5 500 GET/s or 3 500 PUT/s before 503 SlowDown backoff fires.
encryption	none | sse-s3 | kms	SSE-S3: +1 ms (AES-256 inline). SSE-KMS: +8 ms (extra KMS GenerateDataKey API call).
transferAcceleration	toggle	Routes via nearest CDN edge PoP → private backbone. Multiplies effective latency by 0.6 (40% reduction).
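
Because the steps are order-dependent (the 0.6 multiplier applies before the throttling terms are added), a code sketch may help. This Python version follows the formula line by line; the function and argument names are hypothetical and the error-risk side effect of prefix throttling is omitted.

```python
CLASS_LATENCY_MS = {"standard": 50.0, "ia": 100.0, "glacier-instant": 150.0}
ENCRYPTION_MS = {"none": 0.0, "sse-s3": 1.0, "kms": 8.0}


def object_storage_latency_ms(eff_rps: float, storage_class: str = "standard",
                              presigned_url: bool = False, encryption: str = "none",
                              transfer_acceleration: bool = False, prefix_count: int = 1,
                              write: bool = False, throughput_mbs: float = 100.0) -> float:
    lat = CLASS_LATENCY_MS[storage_class]          # storage class tier
    lat += 0.0 if presigned_url else 10.0          # SigV4 auth
    lat += ENCRYPTION_MS[encryption]               # encryption round-trip
    if transfer_acceleration:
        lat *= 0.6                                 # 40% reduction so far
    prefix_cap = prefix_count * (3500.0 if write else 5500.0)
    sat = eff_rps / prefix_cap
    if sat > 1.0:
        lat += (sat - 1.0) * 200.0                 # per-prefix throttling
    bw_sat = (eff_rps * 0.1) / throughput_mbs      # ~0.1 MB per request
    if bw_sat > 1.0:
        lat += (bw_sat - 1.0) * 100.0              # bandwidth saturation
    return lat


print(object_storage_latency_ms(100))
print(object_storage_latency_ms(100, encryption="kms"))
print(object_storage_latency_ms(11000, throughput_mbs=10000))
```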

Redis

Redis is single-threaded and memory-bound. Its compute cost is negligible (0.5 ms) on top of the network round-trip. Enable sharding (Redis Cluster) to scale throughput linearly across shard nodes — each shard independently handles its key-space slice at the same 50 000 ops/s per GB rate.

cap = memory_GB × 50 000 × (sharding ? shardCount : 1) ops/s
util = effRPS / cap
nodeLat += 0.5ms + (util > 0.8 ? (util − 0.8) × 10ms : 0)

At 1 GB RAM, single instance handles 50 000 RPS. With sharding enabled and 3 shards, capacity grows to 150 000 RPS — matching Redis Cluster's horizontal scale behaviour.

memory	GB (0.1–64)	Determines ops/s capacity: 50 000 × memory GB per shard.
conns	max connections	Concurrent client connections. Feeds the global saturation check.
cacheHitRatio	0–1 slider	Used when Redis is the target of a caching connection (isCaching flag).
sharding	toggle	Enables Redis Cluster mode. Throughput scales linearly with shard count.
shardCount	2–256 (shown when sharding on)	Default 3 (matching Redis Cluster minimum). Each shard adds full memory × 50k capacity.

Memcached

Memcached is a distributed, multi-threaded, volatile cache. Unlike Redis it has no persistence or complex data structures — just fast key/value storage. Its key advantage is multi-threading: throughput scales with both memory allocation and thread count. Sharding adds further horizontal capacity by distributing keys across independent Memcached nodes (client-side consistent hashing).

cap = memory_GB × threads × 40 000 × (sharding ? shardCount : 1) ops/s
util = effRPS / cap
nodeLat += 0.3ms + (util > 0.8 ? (util − 0.8) × 10ms : 0)

1 GB RAM, 4 threads → 160 000 ops/s capacity. Compared to Redis (50 000 ops/s at 1 GB), Memcached wins on raw throughput for simple get/set workloads due to multi-threading. Trade-off: no persistence, no data structures, no pub/sub.

memory	GB (0.1–64)	Allocated RAM. Capacity = memory × threads × 40 000 ops/s.
threads	1–64	Worker threads. Default 4. Each thread independently processes requests.
maxItemSize	MB (0.1–128)	Maximum cacheable object size. Default 1 MB (Memcached default slab ceiling).
cacheHitRatio	0–1 slider	Used when Memcached is the target of a caching connection (isCaching flag).
sharding	toggle	Client-side consistent hashing across multiple Memcached nodes. Multiplies total capacity by shard count.
shardCount	2–256 (shown when sharding on)	Default 2. Each node runs independently — no cross-node replication.
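
For a side-by-side view of the two cache models, the sketch below computes capacity for Redis and Memcached and applies the shared 80% utilisation rule. Names are hypothetical; the base latencies (0.5 ms and 0.3 ms) and per-GB rates come from the formulas in this document.

```python
def redis_capacity(memory_gb: float, sharding: bool = False, shard_count: int = 1) -> float:
    """memory_GB * 50 000 ops/s, multiplied by shard count when sharding is on."""
    return memory_gb * 50_000 * (shard_count if sharding else 1)


def memcached_capacity(memory_gb: float, threads: int, sharding: bool = False,
                       shard_count: int = 1) -> float:
    """memory_GB * threads * 40 000 ops/s, multiplied by shard count when sharding is on."""
    return memory_gb * threads * 40_000 * (shard_count if sharding else 1)


def cache_latency_ms(eff_rps: float, capacity: float, base_ms: float) -> float:
    """Base cost plus (util - 0.8) * 10 ms once utilisation passes 80%."""
    util = eff_rps / capacity
    return base_ms + max(0.0, util - 0.8) * 10.0


print(redis_capacity(1), redis_capacity(1, sharding=True, shard_count=3))
print(memcached_capacity(1, threads=4))
print(cache_latency_ms(50_000, redis_capacity(1), base_ms=0.5))
print(cache_latency_ms(50_000, memcached_capacity(1, threads=4), base_ms=0.3))
```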

Time-Series DB

Optimized for append-only streaming writes and window-based reads. It models heavy replication and background downsampling (merging old data points).

effCapacity = (writeRateKps × 1000) / replicationFactor
nodeLat += (effRPS / effCapacity) × (downsampling ? 0.6 : 1.0)

Downsampling bonus: Enabling background downsampling reduces read overhead by 40% (0.6x multiplier) by pre-calculating aggregates (min, max, avg) at the cost of disk space.

writeRateKps	k req/s	Ingestion ceiling. Shared across nodes before sharding.
downsamplingEnabled	toggle	Reduces read-path latency by using pre-aggregated data windows.

Graph Database

Relies on pointer-following for deep relationship traversals. Latency grows non-linearly with traversal depth (hops) and average node degree (connections per node).

traversalCost = 10ms + degree^(depth / 2)
nodeLat += traversalCost + (consistency === 'strong' ? 50ms : 0)

Strong Consistency

Adds 50ms sync-penalty to model global lock acquisition or ACID coordination across shards.

Traversal Depth

3 hops = baseline. 5 hops = exponential blowup. High-degree nodes (20+ edges) saturate CPUs rapidly.

avgDegree	edges	Density of the graph. More edges = more data fetched per hop.
traversalDepth	hops	Search radius. Traversal cost = 10 ms + degree^(depth/2).

Key-Value Store

Models simple O(1) lookups (DynamoDB, etcd, Consul). Highly parallel, deterministic performance intended for transient configuration or distributed locking.

nodeLat += 0.5ms + (consistency === 'strong' ? 2ms : 0)

consistencyLevel	eventual | strong	Strong consistency (like etcd/Raft) adds a 2ms consensus penalty.
evictionPolicy	lru | lfu | ttl	Behavior when RAM is full. Affects utilization metrics only.

Blob Storage

Optimized for large object storage (images, MP4, backups). Unlike the general Object Storage node, it specifically models bandwidth-bound transfer latency with CDN bypass, multipart upload, replication penalties, and storage-class-aware first-byte cost.

transferLat = (5MB × 8) / (maxMbps / 1000)
nodeLat += 20ms (metadata lookup) + transferLat

Bandwidth Scaling: While metadata is fast, the bits-in-flight time (assume 5MB avg) dominates. Increasing Max Throughput Mbps directly reduces transfer latency.

maxThroughputMbps	Mbps	Available network bandwidth. 1000 Mbps = 1 Gbps.
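
The unit handling in the transfer formula (megabytes to megabits, Mbps to megabits per millisecond) is easy to get wrong, so here is a minimal Python sketch. The function name is hypothetical and the 5 MB payload is the document's stated assumption.

```python
def blob_latency_ms(max_mbps: float, payload_mb: float = 5.0) -> float:
    """20 ms metadata lookup plus transfer time for the payload."""
    payload_megabits = payload_mb * 8.0
    megabits_per_ms = max_mbps / 1000.0
    return 20.0 + payload_megabits / megabits_per_ms


for mbps in (100, 1000, 10000):
    print(mbps, round(blob_latency_ms(mbps), 1))
```

At 1 Gbps the 40 megabit payload takes 40 ms, already twice the metadata cost; at 100 Mbps the transfer takes ten times longer.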

Data Warehouse (Snowflake / BigQuery)

OLAP store for multi-terabyte queries. Columnar storage and massive query parallelism reduce scan times for "Very Complex" analytics jobs.

baseCost = complexityCount [simple=1k, complex=20k]
nodeLat += baseCost / parallelismDegree

Columnar Storage

Enabled by default. Optimizes wide-table aggregation by reading only necessary columns from disk.

Query Parallelism

Total workers assigned to one job. 64 parallelism means the compute cost is divided by 64.

parallelismDegree	threads	Compute units dedicated to one query execution.
avgQueryComplexity	select	Simple (1s) to Very Complex (60s) base costs.

Kafka / Event Stream

Kafka latency is shaped by partition count (parallelism) and broker count (replication overhead). More partitions reduce per-partition queue depth via log₂ bonus; more brokers add a small ISR sync penalty. Latency is floored at 1 ms — even a massively over-provisioned cluster has serialization overhead.

partitionBonus = log₂(partitions) × 2ms
brokerPenalty = brokers × 0.5ms
nodeLat += max(1ms, 10ms − partitionBonus + brokerPenalty)

3 brokers, 10 partitions

10 − (log₂(10)×2) + (3×0.5) = 10 − 6.64 + 1.5 = 4.9 ms

5 brokers, 100 partitions

10 − (log₂(100)×2) + (5×0.5) = 10 − 13.29 + 2.5 = −0.79 → max(1, −0.79) = 1 ms

brokers	1–100	Broker count. Each broker adds 0.5 ms ISR replication overhead.
partitions	1–1000	Partition count. Each doubling of partitions removes a fixed 2 ms of per-message queuing latency (the log₂ bonus).
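
Both worked examples, including the 1 ms floor, can be reproduced with a few lines of Python. The function name is illustrative only.

```python
import math


def kafka_latency_ms(brokers: int, partitions: int) -> float:
    """max(1 ms, 10 ms - log2(partitions) * 2 ms + brokers * 0.5 ms)."""
    partition_bonus = math.log2(partitions) * 2.0
    broker_penalty = brokers * 0.5
    return max(1.0, 10.0 - partition_bonus + broker_penalty)


print(round(kafka_latency_ms(3, 10), 1))
print(kafka_latency_ms(5, 100))
print(kafka_latency_ms(1, 1))
```

A single broker with a single partition gets no bonus at all and lands at 10.5 ms.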

Message Queue (SQS)

FIFO queues guarantee exactly-once ordered delivery at the cost of extra coordination overhead. Standard queues trade ordering guarantees for lower latency.

nodeLat += type === 'fifo' ? 15ms : 5ms

type	standard | fifo	Standard: +5 ms. FIFO: +15 ms (ordering/dedup overhead).
visibilityTimeout	seconds	How long a message is invisible after being consumed. Does not affect request-path latency.

Pub / Sub

Fan-out adds a per-subscriber delivery scheduling cost to publish latency. 0.5 ms per subscriber reflects the broker's cost of dispatching to each downstream delivery agent.

nodeLat += 3ms + fanoutCount × 0.5ms

fanoutCount

1–100

Number of downstream subscribers. Each adds 0.5 ms fan-out scheduling overhead.

Mail Server

SMTP is inherently high-latency (TCP handshake + EHLO + DATA phase + server ACK: 100–250 ms baseline). Connection-pool queuing is modelled after M/M/c queueing theory: latency diverges sharply once utilisation exceeds 1.0. Above 1.5× saturation, each request has a 40% chance of being dropped as an error.

util = effRPS / concurrentConns
baseLat = 100ms + rand(0–150ms)
if (util > 1) baseLat += (util − 1)² × 500ms
if (util > 1.5 && rand() < 0.4) → DROP (error)

concurrentConns	1–500	Maximum simultaneous SMTP sessions. Queuing diverges above 100% utilisation; errors above 150%.
dailyLimit	messages/day	Gradual throttling kicks in at 10× fair-share per second. At 100× fair-share, 50% drop probability.
protocol	smtp | imap | pop3	Protocol label for documentation. SMTP overhead is uniform across variants in simulation.

RabbitMQ

An AMQP broker implementing smart-routing via exchanges. Performance depends on persistence settings, acknowledgment modes, and the total consumer pool size across all queues.

nodeLat += 2ms + (manualACK ? 3ms : 0) + (persistence ? 5ms : 0)
cap = consumerCount × 5000 msgs/s
util = effRPS / cap
if (util > 1) nodeLat += (util − 1) × 20ms

Smart Routing: RabbitMQ handles logical exchange-to-queue routing. Each consumer is assumed to process 5,000 messages/second baseline capacity. Manual ACK modes add round-trip overhead to the delivery path.

consumerCount	count	Total workers pulling from the queue. Multiplies throughput capacity.
acknowledgmentMode	auto | manual	Manual ACK adds 3ms per-message coordination latency.
persistenceEnabled	toggle	Durable queues add 5ms disk-sync latency to the publish path.

Scheduler (Cron / Job Runner)

Models a background task scheduler. Unlike the Service Worker (which focuses on thread pooling), the Scheduler focuses on long-running jobs and concurrent execution limits. It follows a variation of Little's Law for queueing wait times.

load = (effRPS × jobDurationMs / 1000) / maxConcurrentJobs
nodeLat += jobDurationMs + (load > 1 ? (load − 1) × jobDurationMs : 0)

Concurrency Limit: If the total work (RPS × duration) exceeds available concurrency slots, jobs start queueing linearly. A 500ms job on a 10-job server at 40 RPS results in (2.0 - 1) × 500 = 500ms extra wait.

maxConcurrentJobs	count	Maximum simultaneous tasks. Throughput ceiling = concurrency / duration.
avgJobDurationMs	ms	Mean execution time per task. Determines how quickly concurrency slots are freed.
cronResolutionSec	sec	The granularity of the clock (e.g. 60s for standard crontab). Informational.
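
The concurrency example above (500 ms job, 10 slots, 40 RPS) works out as follows in a short Python sketch. The function names are hypothetical.

```python
def scheduler_load(eff_rps: float, job_ms: float, max_concurrent_jobs: int) -> float:
    """Little's Law: jobs in flight (RPS * duration) divided by available slots."""
    return (eff_rps * job_ms / 1000.0) / max_concurrent_jobs


def scheduler_latency_ms(eff_rps: float, job_ms: float, max_concurrent_jobs: int) -> float:
    """Job duration plus a linear wait of (load - 1) * duration once load exceeds 1."""
    load = scheduler_load(eff_rps, job_ms, max_concurrent_jobs)
    extra = (load - 1.0) * job_ms if load > 1.0 else 0.0
    return job_ms + extra


print(scheduler_load(40, 500, 10), scheduler_latency_ms(40, 500, 10))
print(scheduler_load(10, 500, 10), scheduler_latency_ms(10, 500, 10))
```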

LLM / AI Model

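LLM latency is dominated by token generation speed. The engine estimates output tokens as 10% of the contextWindow setting and converts generation time to milliseconds. The formula uses the setting's numeric value in k tokens (for example 128 for a 128k window), which is what produces the reference figures in this section.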

outputTokens = contextWindow × 0.1
nodeLat += (outputTokens / tokensPerSec) × 1000ms

8 ms

Small context, fast model

4k ctx, 50 t/s

256 ms

Default settings

128k ctx, 50 t/s

1 280 ms

Slow/large model

128k ctx, 10 t/s

contextWindow	k tokens (4–1024)	Max input context. Proxy for output length (10% assumed as completion).
tokensPerSec	t/s (1–200)	Generation throughput. Primary bottleneck for LLM latency.
temperature	0–2 slider	Sampling randomness. Does not affect latency in simulation.
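
The reference figures in this section can be reproduced if contextWindow is fed into the formula as its numeric value in k tokens (4, 128, ...). That reading is an inference from the published numbers, not something the engine source confirms; the function name is hypothetical.

```python
def llm_latency_ms(context_window_k: float, tokens_per_sec: float) -> float:
    """outputTokens = contextWindow * 0.1; latency = outputTokens / tokensPerSec * 1000 ms."""
    output_tokens = context_window_k * 0.1
    return output_tokens / tokens_per_sec * 1000.0


for ctx_k, tps in [(4, 50), (128, 50), (128, 10)]:
    print(f"{ctx_k}k ctx, {tps} t/s -> {round(llm_latency_ms(ctx_k, tps))} ms")
```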

Vector Database

ANN search latency grows with vector dimensionality. The index type determines the search algorithm: HNSW (graph-based) is fastest; Flat (brute-force) is slowest.

dimPenalty = dimensions / 512
indexBonus = hnsw ? 0.5 : flat ? 5 : 1 (ivf)
nodeLat += 5ms × dimPenalty × indexBonus

HNSW

7.5 ms

at 1536 dims

IVF

15 ms

at 1536 dims

FLAT

75 ms

at 1536 dims

dimensions	64–3072	Embedding vector size. Latency scales linearly with dimensions/512.
indexType	hnsw | ivf | flat	HNSW: 0.5× penalty. IVF: 1× (baseline). Flat: 5× (brute-force scan).
similarity	cosine | l2 | dot	Distance metric. Not differentiated in latency; use for documentation.

Embedding API

Embedding generation is a fixed-cost operation whose per-item cost is amortized by batching. Larger batches reduce effective per-request latency.

nodeLat += latencyMs / max(1, log₂(batchSize))

batch = 1

Full 50 ms per request (no batch savings).

batch = 128

50 / log₂(128) = 50 / 7 ≈ 7.1 ms per item.

latencyMs	ms (5–500)	Baseline embedding call latency at batch size 1.
batchSize	1–2048	Items per API call. Higher batch → logarithmically lower per-item latency.
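
A literal log₂ divisor is zero at batch size 1, so any implementation needs a guard to return the full latency there. The Python sketch below assumes the divisor is clamped to at least 1; that clamp is an assumption made to match the batch = 1 behaviour described above, and the function name is hypothetical.

```python
import math


def embedding_latency_ms(latency_ms: float, batch_size: int) -> float:
    """latencyMs / log2(batchSize), with the divisor clamped to >= 1 (assumed guard)."""
    divisor = max(1.0, math.log2(batch_size))
    return latency_ms / divisor


for batch in (1, 2, 16, 128, 2048):
    print(batch, round(embedding_latency_ms(50, batch), 1))
```

Under this clamp, batch sizes 1 and 2 both pay the full 50 ms, and savings only begin at batch size 4.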

AI Agent / Orchestrator

Agentic workflows call the LLM multiple times per user request. The orchestrator models only its own routing overhead — each downstream LLM call accrues its own latency separately.

strategyPenalty = tree ? 5 : reflex ? 2 : 1
nodeLat += 10ms × avgSteps × strategyPenalty

chain	×1	Linear tool calls. 3 steps = 30 ms orchestration overhead.
reflex	×2	Feedback loops. Agent evaluates its own output before proceeding.
tree	×5	Tree-of-thought: explores multiple reasoning paths in parallel, which is expensive.

multiStepFactor	1–20	Average number of LLM tool-call iterations per user request (avgSteps in the formula).
strategy	chain | reflex | tree	Reasoning strategy. Multiplies step overhead: chain×1, reflex×2, tree×5.

Document Parser / ETL

Document parsing latency grows with chunk size (more data per unit) and shrinks with parallelism.

chunkPenalty = chunkSizeBytes / 512
nodeLat += (25ms × chunkPenalty) / parallelJobs

chunkSizeBytes	128–8192	Size of each processed chunk. Larger chunks = more work per task = higher latency.
parallelJobs	1–100	Concurrent workers. Linearly reduces effective per-request latency.

Workflow Engine

Workflow engines (Temporal, n8n, Airflow) persist state at every step, adding a fixed 10 ms checkpoint overhead. Queuing latency activates above 80% concurrency saturation and grows as the overflow portion using an M/D/1 model (fixed service time, variable arrival). The penalty is (sat − 0.8) × 25ms — only the overflow fraction is penalised, not the full saturation value.

nodeLat += 10ms (state persistence overhead)
execSaturation = effRPS / concurrency
if (execSaturation > 0.8) nodeLat += (execSaturation − 0.8) × 25ms

Utilisation uses Little's Law: active executions ≈ throughput × avg workflow duration (capped at 10 s). At default concurrency of 50: 100 RPS × 10 s / 50 slots = 20× over-subscribed → 2000% → clamped to 100%. At 2 RPS × 10 s = 20 active / 50 slots → 40% utilisation.

concurrency	10–5000	Maximum simultaneous workflow executions. Above 80% triggers overflow queuing penalty.
timeout	seconds (1–3600)	Per-step execution ceiling. Long steps may cascade into downstream timeouts.
retryPolicy	toggle	Auto-retry on step failure adds 50 ms retry overhead but clears transient errors.
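
The workflow model uses two different ratios: execSaturation for the latency penalty and a Little's Law figure for the utilisation gauge. The Python sketch below keeps them apart; function names are hypothetical and the retry overhead is not included.

```python
def workflow_latency_ms(eff_rps: float, concurrency: int) -> float:
    """10 ms checkpoint overhead plus (sat - 0.8) * 25 ms above 80% saturation."""
    exec_saturation = eff_rps / concurrency
    return 10.0 + max(0.0, exec_saturation - 0.8) * 25.0


def workflow_utilisation(eff_rps: float, avg_duration_s: float, concurrency: int) -> float:
    """Active executions (RPS * duration, duration capped at 10 s) per slot, clamped to 100%."""
    active = eff_rps * min(avg_duration_s, 10.0)
    return min(1.0, active / concurrency)


print(workflow_latency_ms(100, 50), workflow_utilisation(100, 10, 50))
print(workflow_latency_ms(2, 50), workflow_utilisation(2, 10, 50))
```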