
The Ayoob AI Architecture: Merging CPU, Workers, and WebGPU

15 min read · Husain Ayoob
Architecture · WebGPU · Web Workers · Heterogeneous Compute · Enterprise

One engine, three tiers, zero configuration

This article is the architectural capstone of a 29-article series on browser-based heterogeneous compute. Each previous article examined one component in depth. This article shows how they connect.

The system we built solves a single problem: given an operation and a dataset, execute it on the fastest available hardware while guaranteeing correctness, fault tolerance, and mathematical precision. The application developer calls one function. The engine handles everything else.

const result = await engine.dispatch(operation, data);

Behind that single call, the engine runs four stages in sequence. The total decision time is under 0.1 ms. The result is a dispatch to one of three compute tiers, with automatic fallback if the chosen tier fails.

Stage 1: Workload Characterisation

Before the engine considers hardware, it characterises the workload. Three analyses run in parallel.

Control flow analysis

The engine inspects the operation's control flow topology and classifies it into one of three categories:

Uniform. Every element follows the same instruction path. Examples: element-wise arithmetic, radix sort scatter/gather, parallel histogram construction. GPU-safe. No SIMD branch divergence.

Bounded. The operation branches on data, but the number of distinct paths is small and predictable. Examples: clamping, enum-based classification, dictionary-encoded filter with IN clause. GPU-viable with predication cost (10% to 30% throughput reduction).

Categorical. Every element may follow a unique execution path. Examples: NFA regex traversal, Levenshtein distance, trie lookup, UTF-8 case-insensitive search. Categorical GPU Inhibition: penalty of negative infinity. GPU dispatch blocked unconditionally.

Output density profiling

For operations that produce atomic writes (counters, histogram bins, compaction pointers), the engine estimates the fraction of threads that will execute an atomic per cycle.

The estimation uses column statistics: histogram-based selectivity for filters, Chao1 cardinality estimation for group-by, and Phase 1 popcount for text search.

If estimated density exceeds 10% of the GPU's thread capacity per cycle, atomic contention would collapse throughput non-linearly. The engine assigns a categorical penalty of negative infinity. GPU dispatch blocked.

Encoding detection

For text operations, the engine samples the corpus for multi-byte UTF-8 sequences. If multi-byte content is detected and the search is case-insensitive, the combination of variable-width encoding and Unicode case folding produces categorical divergence. GPU dispatch blocked.

For string columns in structured queries, the engine verifies dictionary encoding is in place. String predicates are resolved to integer comparisons at compile time. The GPU never processes a string.
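
The multi-byte probe described above can be sketched in a few lines. This is a minimal illustration under assumed names (`detectEncodingClass` and `gpuBlockedForTextSearch` are not the engine's real identifiers): any byte at or above 0x80 starts or continues a multi-byte UTF-8 sequence, so a single pass over a sample is enough to classify the corpus.

```typescript
type EncodingClass = 'ascii' | 'utf8_multibyte';

function detectEncodingClass(sample: Uint8Array): EncodingClass {
  for (let i = 0; i < sample.length; i++) {
    // 0x00-0x7F is pure ASCII; anything else belongs to a multi-byte sequence.
    if (sample[i] >= 0x80) return 'utf8_multibyte';
  }
  return 'ascii';
}

// The dispatch rule from the text: multi-byte corpus plus case-insensitive
// search is categorically divergent, so the GPU path is blocked.
function gpuBlockedForTextSearch(sample: Uint8Array, caseInsensitive: boolean): boolean {
  return caseInsensitive && detectEncodingClass(sample) === 'utf8_multibyte';
}
```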

Characterisation output

The three analyses produce a workload profile:

interface WorkloadProfile {
  controlFlow: 'uniform' | 'bounded' | 'categorical';
  outputDensity: number;              // 0.0 to 1.0
  encodingClass: 'ascii' | 'utf8_multibyte';
  categoricalInhibition: boolean;     // true if any analysis triggers -Infinity
  arithmeticIntensity: number;        // FLOPs per byte
}

If categoricalInhibition is true, the engine skips Stages 2 and 3 entirely. The operation routes to the CPU tier (Workers or main thread depending on dataset size). No GPU resources are allocated. No further analysis is needed.
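
The short-circuit can be sketched against the WorkloadProfile shape above. The routing function, its return values, and the exact placement of the 10,000-element worker threshold are illustrative assumptions, not the engine's actual API.

```typescript
interface WorkloadProfile {
  controlFlow: 'uniform' | 'bounded' | 'categorical';
  outputDensity: number;
  encodingClass: 'ascii' | 'utf8_multibyte';
  categoricalInhibition: boolean;
  arithmeticIntensity: number;
}

type Stage1Route = 'cpu_main' | 'cpu_workers' | 'continue_to_stage_2';

// Hypothetical threshold; the article places the main-thread/worker
// crossover near 10,000 elements.
const WORKER_THRESHOLD = 10_000;

function routeAfterStage1(profile: WorkloadProfile, elementCount: number): Stage1Route {
  if (!profile.categoricalInhibition) return 'continue_to_stage_2';
  // Inhibited workloads skip Stages 2 and 3 and go straight to a CPU tier.
  return elementCount >= WORKER_THRESHOLD ? 'cpu_workers' : 'cpu_main';
}
```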

Stage 2: Precision Sufficiency Analysis

For operations that survive Stage 1, the Precision Sufficiency Analyser evaluates whether Float32 arithmetic can produce results within the caller's tolerance.

Three sensitivity tiers

High sensitivity (linear algebra). Matrix solve, eigenvalue decomposition, least-squares. The analyser estimates the condition number κ via Hager's O(n^2) algorithm. Expected relative error: κ * 1.19 x 10^-7 (Float32 machine epsilon). If this exceeds the caller's tolerance (default 10^-9 for financial workloads), the Float32 Safety Guard assigns negative infinity. GPU blocked.

Medium sensitivity (accumulation). SUM, AVG, running totals, windowed aggregations. The analyser estimates maximum intermediate accumulation and compares against the Float32 safe integer threshold (16,777,216). If the accumulation exceeds this boundary, GPU blocked for the numeric output. Additionally, operations that pass the pre-dispatch check receive post-dispatch spot-check verification: 16 sampled elements are re-computed in Float64 on the CPU. If any sample's relative error exceeds 10^-4, the GPU result is discarded and the operation re-executes on the CPU.

Low sensitivity (comparison). Filters, sorts, classifications. The analyser estimates the minimum gap between adjacent values and compares against the Float32 ULP at the relevant magnitude. If comparisons are unaffected by rounding, GPU dispatch is safe. The output is boolean or ordinal, not numeric.
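
The medium-sensitivity checks above can be sketched as follows. The 2^24 boundary and the 10^-4 spot-check tolerance come from the text; the function names and the worst-case accumulation estimate (rowCount × maxAbsValue) are illustrative assumptions.

```typescript
// Largest integer Float32 represents exactly.
const FLOAT32_SAFE_INTEGER = 16_777_216; // 2^24

function accumulationExceedsFloat32(rowCount: number, maxAbsValue: number): boolean {
  // Worst case: every row contributes the maximum magnitude with the same sign.
  return rowCount * maxAbsValue > FLOAT32_SAFE_INTEGER;
}

// Post-dispatch spot check: compare GPU samples against Float64 re-computation
// and reject the batch if any relative error exceeds the tolerance.
function spotCheckPasses(gpuSamples: number[], cpuFloat64: number[], tol = 1e-4): boolean {
  return gpuSamples.every((g, i) => {
    const ref = cpuFloat64[i];
    const denom = Math.abs(ref) || 1; // avoid division by zero on zero references
    return Math.abs(g - ref) / denom <= tol;
  });
}
```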

Analysis output

interface PrecisionProfile {
  sensitivityTier: 'high' | 'medium' | 'low';
  riskScore: number;
  precisionInhibition: boolean;      // true if risk exceeds tolerance
  requiresPostDispatchVerification: boolean;
}

If precisionInhibition is true, the operation routes to CPU Float64. If requiresPostDispatchVerification is true, the GPU result will be spot-checked after execution.

Stage 3: Dispatch Scoring

For operations that survive Stages 1 and 2, a multi-factor scoring function computes the final dispatch score. The exact factors vary by domain: structured queries use a 6-factor formula with SQL-specific metrics, while sort operations use a 7-factor dispatch that includes sort-specific inputs. The following factors illustrate the query scoring path, which is the most general example.

The factors (structured query example)

Factor 1: Dataset cardinality. The number of elements entering the operator. Larger datasets favour GPU dispatch (more parallel work to amortise fixed overhead).

Factor 2: Predicate selectivity. For filters, the estimated fraction of rows that pass the predicate. Low selectivity means most GPU threads produce no output. Affects compaction efficiency and downstream operator cardinality.

Factor 3: Group cardinality. For GROUP BY operators, the estimated number of distinct groups via Chao1. Low cardinality (under 1,024) enables shared memory accumulators. High cardinality forces global memory atomics with severe contention.

Factor 4: Arithmetic intensity. The ratio of FLOPs to memory bytes. Compute-bound operations (GEMM: n/6 FLOPs/byte) justify GPU dispatch at small data sizes. Memory-bound operations (element-wise: 0.25 FLOPs/byte) require large datasets.

Factor 5: Memory access pattern. Sequential (coalesced GPU reads, full bandwidth) versus random (uncoalesced, 10% to 25% bandwidth). Filters and sorts are sequential. Hash joins and index lookups are random.

Factor 6: Hardware calibration ratio. The device-specific break-even between CPU and GPU, derived from runtime microbenchmarks at session start: memory bandwidth probe, dispatch overhead measurement, adapter capability query. Normalises scoring across hardware.

The formula

operatorScore = (cardinality * arithmeticIntensity * accessPatternWeight)
              / (selectivityPenalty * groupCardinalityPenalty * calibrationRatio)

Score > 1.0: GPU dispatch. The GPU's compute or bandwidth advantage outweighs all overhead.

Score 0.3 to 1.0: Web Worker dispatch. The GPU's advantage is marginal or negative, but the dataset is large enough to benefit from multi-threaded CPU execution.

Score < 0.3: CPU main thread dispatch. The dataset is small enough that single-threaded execution is fastest (no worker wake overhead, no GPU dispatch overhead).
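
The formula and thresholds above translate directly into code. This is a sketch: the interface fields mirror the six factors, but every concrete weight and penalty value would come from column statistics and runtime calibration, not from constants like the ones in the test below.

```typescript
interface ScoringInputs {
  cardinality: number;
  arithmeticIntensity: number;     // FLOPs per byte
  accessPatternWeight: number;     // e.g. near 1.0 sequential, much lower for random
  selectivityPenalty: number;      // >= 1, grows as selectivity drops
  groupCardinalityPenalty: number; // >= 1, grows past shared-memory limits
  calibrationRatio: number;        // device-specific CPU/GPU break-even
}

type Tier = 'gpu' | 'workers' | 'main_thread';

function operatorScore(s: ScoringInputs): number {
  return (s.cardinality * s.arithmeticIntensity * s.accessPatternWeight)
       / (s.selectivityPenalty * s.groupCardinalityPenalty * s.calibrationRatio);
}

function tierFor(score: number): Tier {
  if (score > 1.0) return 'gpu';         // GPU advantage outweighs overhead
  if (score >= 0.3) return 'workers';    // marginal GPU gain, worth multi-threading
  return 'main_thread';                  // too small to pay any dispatch overhead
}
```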

Pipeline fusion bonus

If the preceding operator in a fused pipeline was GPU-dispatched, the current operator's data is already resident on the GPU. The upload cost is zero. The scoring function adds a retention bonus that increases the score, pulling borderline operators onto the GPU path. This extends the fused segment, eliminating PCIe transfers between consecutive GPU operators.

Stage 4: Tier Routing and Execution

The score maps to one of three compute tiers.

Tier 1: CPU main thread

When used: Score < 0.3. Datasets under ~10,000 elements. Trivial post-aggregation sorts on small result sets. Also used for precision-sensitive operations that require Float64.

How it works: The operation executes synchronously on the calling thread. No worker spawn. No message passing. No buffer allocation. The data is already in JavaScript heap memory. The result is returned directly.

Performance: Sub-0.5 ms for typical small-dataset operations. Zero overhead. L1 cache locality for tight loops.

When it is the only option: VDI environments with no GPU and hardwareConcurrency = 1. The engine degrades gracefully to single-threaded execution.

Tier 2: SharedArrayBuffer Web Worker pool

When used: Score 0.3 to 1.0. Datasets between ~10,000 and ~500,000 elements (thresholds vary by hardware calibration). Also used as fallback when GPU is unavailable or device loss occurs.

How it works: A pre-warmed pool of threads (sized to navigator.hardwareConcurrency, typically 4 to 16) communicates via SharedArrayBuffer for zero-copy data sharing. Workers are parked on Atomics.wait() and wake in under 0.05 ms on Atomics.notify().

Each worker receives a contiguous chunk of the SharedArrayBuffer. For text search, the boundary overlap protocol extends each chunk by (patternLength - 1) bytes to prevent missed matches at partition boundaries.
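
The boundary overlap protocol can be sketched as a partitioning function: each chunk is extended by (patternLength - 1) bytes into its neighbour so a match straddling a partition boundary is still fully visible to one worker. The function name and chunk shape are assumed for illustration.

```typescript
interface Chunk { start: number; end: number; } // [start, end) into the shared buffer

function chunkWithOverlap(totalBytes: number, workers: number, patternLength: number): Chunk[] {
  const base = Math.ceil(totalBytes / workers);
  const overlap = patternLength - 1;
  const chunks: Chunk[] = [];
  for (let i = 0; i < workers; i++) {
    const start = i * base;
    if (start >= totalBytes) break;
    // Extend into the next partition by overlap bytes, clamped to the buffer end.
    const end = Math.min(start + base + overlap, totalBytes);
    chunks.push({ start, end });
  }
  return chunks;
}
```

Matches are attributed to the chunk in which they start, so the overlap never double-counts.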

Each worker independently selects its algorithm based on chunk characteristics: counting sort for bounded integers (range < 65,536), LSD radix-256 for wide-range numerics, insertion sort for chunks under 64 elements.
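
The per-chunk selection rule above, sketched with the article's own thresholds (the function name is assumed):

```typescript
type SortAlgorithm = 'insertion' | 'counting' | 'radix256';

function selectSortAlgorithm(chunk: Int32Array): SortAlgorithm {
  if (chunk.length < 64) return 'insertion';   // tiny chunk: simple quadratic sort wins
  let min = chunk[0], max = chunk[0];
  for (const v of chunk) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  if (max - min < 65_536) return 'counting';   // bounded range: O(n + range)
  return 'radix256';                           // wide-range numerics: LSD radix-256
}
```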

After all workers complete (signalled via Atomics.notify()), the main thread performs a k-way merge (for sorts) or result concatenation (for filters and aggregations).

Performance: 3x to 6x speedup over main thread for medium datasets. Consistent across hardware (CPU core counts vary less than GPU capabilities).

Tier 3: WebGPU compute pipeline

When used: Score > 1.0. Datasets above ~500,000 elements on discrete GPUs, ~2,000,000 on integrated GPUs (thresholds set by calibration). Compute-bound operations (GEMM) dispatch at smaller sizes due to high arithmetic intensity.

How it works: Data is uploaded to the GPU via a size-bucketed buffer pool (eliminating repeated allocation overhead). Operations execute as compute shader dispatches. Consecutive GPU-routed operators are pipeline-fused: intermediate results stay in GPU storage buffers, reducing transfers from 2N to N+1. Results are read back via mapAsync().

For text search, the two-phase pipeline runs a character frequency histogram pre-filter in 16 KB shared memory (Phase 1), eliminating up to 97% of candidates before byte-level matching (Phase 2). For streaming data, the searched-frontier mechanism ensures only new data is processed.

For structured queries, dictionary-encoded string columns are processed as integer arrays. WHERE clauses compile to u32 comparisons. GROUP BY uses Chao1-estimated shared memory accumulators for low-cardinality groups.

For sorting, the IEEE 754 bit-transform converts floats to sort-order-preserving unsigned integers, enabling O(n) radix-256 sort or local bitonic sort with global rank merge.
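
One common form of that bit transform, shown here on the CPU side for clarity (in the pipeline the same logic runs in WGSL): negative floats have all their bits flipped, which reverses their descending bit order, while positive floats just get the sign bit set, so unsigned integer comparison then matches float ordering.

```typescript
// Reused scratch views for reinterpreting a Float32's bits as a Uint32.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

function floatToSortableU32(x: number): number {
  f32[0] = x;
  const bits = u32[0];
  // Negative: flip every bit. Non-negative: set the sign bit.
  // >>> 0 keeps the result in the unsigned 32-bit range.
  return ((bits & 0x80000000) ? ~bits : (bits | 0x80000000)) >>> 0;
}
```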

Performance: 10x to 75x speedup over Array.prototype.sort(). 2x to 20x over Web Workers. Sub-5 ms for 500,000-element operations on discrete hardware.

The cascading fallback

The three tiers are not independent options. They are a cascade. If the primary tier fails, the engine falls through to the next without application-level intervention.

GPU to Workers

If the GPU device is lost (driver crash, watchdog timeout, eGPU disconnection, power management, background tab throttling), the engine:

  1. Invalidates all cached state (pipeline cache, buffer pool, bind groups) within a single microtask. Time: under 0.1 ms.
  2. Re-dispatches pending operations to the Web Worker tier. The input data is intact in the SharedArrayBuffer (the GPU received a copy). Time: 0.1 to 0.5 ms for re-dispatch.
  3. Schedules hardware re-probe for the next invocation. The engine calls navigator.gpu.requestAdapter(), compares adapter info for hardware changes, re-runs calibration microbenchmarks, and resumes GPU dispatch with updated thresholds. Time: under 200 ms on next invocation.

The caller's promise resolves with correct results. Latency increases (GPU speed to Worker speed), but execution never fails.

Workers to main thread

If SharedArrayBuffer is unavailable (missing COOP/COEP headers, legacy browser) or navigator.hardwareConcurrency === 1 (single-core device), the Worker tier degrades to main-thread execution. The engine uses postMessage with transferable objects instead of SharedArrayBuffer, accepting 20% to 40% performance degradation. On single-core devices, all computation runs synchronously on the main thread.

GPU to main thread (direct)

If both GPU and Workers are unavailable (no WebGPU adapter, no SharedArrayBuffer, single-core CPU), the engine runs everything on the main thread. This is the lowest-performance path but guarantees that the engine functions on every browser, on every device, with no external dependencies.
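
The full cascade can be sketched as a pure function over capability flags. In the browser these flags would come from `navigator.gpu` adapter acquisition, `crossOriginIsolated` (which gates SharedArrayBuffer), and `navigator.hardwareConcurrency`; the struct and function names here are assumptions for illustration.

```typescript
interface PlatformCaps {
  gpuAdapter: boolean;         // a WebGPU adapter was acquired
  sharedArrayBuffer: boolean;  // cross-origin isolated, SharedArrayBuffer available
  hardwareConcurrency: number; // logical cores
}

type ExecutionStrategy =
  | 'gpu'
  | 'workers_shared_memory'
  | 'workers_postmessage'      // transferables; 20% to 40% slower per the text
  | 'main_thread';

function degradePath(caps: PlatformCaps): ExecutionStrategy[] {
  const cascade: ExecutionStrategy[] = [];
  if (caps.gpuAdapter) cascade.push('gpu');
  if (caps.hardwareConcurrency > 1) {
    cascade.push(caps.sharedArrayBuffer ? 'workers_shared_memory' : 'workers_postmessage');
  }
  cascade.push('main_thread'); // always present: the guaranteed floor
  return cascade;
}
```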

The safety systems

Three independent systems can block GPU dispatch. Each evaluates separately. Any one can override the dispatch score.

Safety system | What it detects | Penalty | Articles
Branch divergence classifier | Per-element conditional branching (NFA, DP, trie) | -Infinity | #2, #21
Atomic contention profiler | Output density > 10% causing non-linear throughput collapse | -Infinity | #10, #17
Precision Sufficiency Analyser | Float32 error exceeding caller tolerance | -Infinity | #6, #14, #29

The GPU path runs only when all three systems confirm: no categorical divergence, no contention cliff, no precision risk. This layered architecture means the engine never dispatches a workload that is divergent, contended, or imprecise.

The resource management layer

Two systems manage GPU resources to prevent memory leaks and allocation overhead.

Buffer pool. Size-bucketed (power-of-two) pool of GPU storage buffers. Checkout/return protocol eliminates per-query allocation cost (0.35 ms per buffer reduced to 0.01 ms). Leak detection with configurable timeout. Force-destroy unreturned buffers. Pool budget set to 25% of maxStorageBufferBindingSize.

Memory limit checking. Before any GPU allocation, the engine verifies the dataset fits within maxStorageBufferBindingSize (128 MB to 4 GB depending on hardware). Oversized datasets route to CPU unconditionally. No allocation attempt. No out-of-memory risk.
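
The two mechanisms above reduce to a pair of small decisions, sketched here under assumed names (the 25% pool budget and the hard limit come from the text; the function shapes do not):

```typescript
// Round a requested size up to the next power of two so returned buffers are
// reusable across queries of similar size.
function bucketSize(bytes: number): number {
  return 2 ** Math.ceil(Math.log2(Math.max(bytes, 1)));
}

// Hard gate: oversized datasets route to CPU before any allocation is attempted.
function gpuAllocationAllowed(datasetBytes: number, maxStorageBufferBindingSize: number): boolean {
  return datasetBytes <= maxStorageBufferBindingSize;
}
```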

The full pipeline in one diagram

Query enters
    |
    v
[Stage 1: Workload Characterisation]
    |-- Control flow analysis -> categorical? --> CPU (Workers or main thread)
    |-- Output density profiling -> >10%? -----> CPU (Workers or main thread)
    |-- Encoding detection -> UTF-8 + case-insensitive? -> CPU Workers
    |
    v (passed Stage 1)
[Stage 2: Precision Sufficiency Analysis]
    |-- High sensitivity (linear algebra) -> κ * ε > tolerance? --> CPU Float64
    |-- Medium sensitivity (accumulation) -> exceeds 16,777,216? --> CPU Float64
    |-- Low sensitivity (comparison) -> gap < ULP? --> flag, but usually passes
    |
    v (passed Stage 2)
[Stage 3: Dispatch Scoring]
    |-- Multi-factor formula (domain-specific): cardinality, selectivity,
    |   group cardinality, arithmetic intensity, access pattern, calibration
    |-- Pipeline fusion retention bonus (if preceding op was GPU)
    |
    +--> Score > 1.0  --> [Tier 3: WebGPU Compute]
    |                         |-- Buffer pool allocation
    |                         |-- Pipeline-fused dispatch
    |                         |-- Post-dispatch verification (if medium sensitivity)
    |                         |-- On device loss: cascade to Tier 2
    |
    +--> Score 0.3-1.0 --> [Tier 2: Web Worker Pool]
    |                         |-- SharedArrayBuffer zero-copy
    |                         |-- Atomics.wait/notify coordination
    |                         |-- Per-chunk adaptive algorithm
    |                         |-- Boundary overlap for text search
    |
    +--> Score < 0.3  --> [Tier 1: CPU Main Thread]
                              |-- Synchronous execution
                              |-- Float64 precision
                              |-- Zero overhead

Performance across the full hardware spectrum

The same application, the same query, on five different devices:

Device | Hardware | Tier selected | 500K-row filter time | 500K-row sort time
Developer workstation | RTX 4060 | GPU | 1.1 ms | 3.1 ms
MacBook Air | M2 integrated | GPU | 1.8 ms | 4.5 ms
Enterprise laptop | Intel Iris Xe | Workers | 4.8 ms | 11.8 ms
Corporate tablet | Adreno 730 | Workers | 5.2 ms | 13.4 ms
VDI terminal | No GPU | Workers | 6.1 ms | 14.2 ms

No device crashes. No device runs the wrong tier. No device is penalised by a threshold calibrated for different hardware. The engine measured each device at session start and routed accordingly.

The developer on the RTX 4060 gets 1.1 ms filter times. The employee on the VDI terminal gets 6.1 ms. Both are within a single animation frame at 60 fps. Both see a responsive dashboard. The 5.5x performance gap is invisible to the user because both are below the perception threshold.

What ties it all together

This architecture did not emerge from a single design decision. It emerged from 29 specific engineering problems, each solved individually, then composed into a unified system:

  • Article #1: The adaptive dispatch engine and hardware calibration.
  • Article #2: SIMD divergence detection and categorical inhibition.
  • Article #3: IEEE 754 bit-transform for O(n) float sorting.
  • Article #4: The multi-factor scoring function and per-operator routing.
  • Article #5: Two-phase GPU text search with histogram pre-filter.
  • Article #6: Float32 precision risks in financial data.
  • Article #7: GPU device loss detection and recovery.
  • Article #8: Bitonic sort with asymmetric binary search merge.
  • Article #9: SharedArrayBuffer zero-copy parallel processing.
  • Article #10: Atomic contention mitigation and categorical threshold.
  • Article #11: Gaming anti-cheat with real-time pattern detection.
  • Article #12: Sub-200 ms hospitality CRM with face-scan recognition.
  • Article #13: Pipeline fusion eliminating PCIe transfer overhead.
  • Article #14: Float32 Safety Guard with condition number analysis.
  • Article #15: Searched-frontier tracking for streaming corpora.
  • Article #16: On-device cost analysis versus cloud APIs.
  • Article #17: Deep-dive on GPU synchronization primitives.
  • Article #18: Radix-256 sort bypassing Array.prototype.sort().
  • Article #19: Fault-tolerant AI workflows with device loss recovery.
  • Article #21: UTF-8 variable-width encoding detection and routing.
  • Article #22: Worker boundary overlap for zero missed matches.
  • Article #23: Dictionary encoding for GPU SQL string filtering.
  • Article #25: Self-calibrating dispatch thresholds.
  • Article #26: Chao1 estimator for GROUP BY cardinality prediction.
  • Article #27: GPU memory limits and buffer pool management.
  • Article #28: Arithmetic intensity and GEMM dispatch thresholds.
  • Article #29: Post-dispatch Float32 verification.

Each article solves one constraint. Together, they form an enterprise AI automation infrastructure that runs correct, fast, and fault-tolerant computation on any browser, on any hardware, without configuration.

The GPU makes it fast. The Workers make it parallel. The CPU makes it safe. The engine makes it automatic.

That is the architecture. No guessing. No hardcoded thresholds. No crossed fingers. Just measurement, scoring, dispatch, and verification. On every operation. On every device. Every time.

Where this ships

We are Ayoob AI, a Newcastle-based team building heterogeneous compute infrastructure for UK enterprises whose browser-side tooling needs to run correctly on any device, from a locked-down laptop to a discrete-GPU workstation. If you are hitting the limits of a single-tier execution model, we build the engine that escapes them. This architecture is the backbone of how we build AI software at Ayoob AI. Book a discovery call.

About the author
Husain Ayoob

Founder & CEO, Ayoob AI Ltd

BSc Computer Science with AI, Northumbria University 2024. 5 UK patents pending covering the Ayoob AI stack. ISO 27001:2022 certified (organisation).


Frequently asked questions

Why does heterogeneous compute matter in the browser?

Because the hardware is different on every user's device and the right backend for a given operation depends on both the workload shape and the hardware. A fixed Web Worker pool runs well on a 16-thread workstation and crawls on a 4-core enterprise laptop. A fixed WebGPU shader flies on a discrete NVIDIA GPU and falls flat on Intel UHD integrated graphics. Static backend selection is dead. The right question is which backend, for this dataset, on this hardware, right now. Our engine answers that question automatically through workload characterisation and dispatch scoring, so the same application code runs optimally across the full hardware spectrum.

What does workload characterisation measure?

Three analyses run in parallel before any dispatch decision. Control flow analysis classifies operations as uniform (GPU-safe), bounded (GPU with predication cost), or divergent (CPU-only), because SIMD branch divergence destroys GPU performance on branchy workloads. Output density profiling estimates atomic contention, because high-density output on a GPU creates L2 cache line arbitration bottlenecks that collapse throughput. Encoding detection identifies variable-width text encoding that forces serial processing. Together these analyses produce a categorical dispatch signal that prevents the engine from ever routing a divergence-prone or contention-prone workload to the GPU.

How does precision sufficiency analysis work?

WebGPU operates on Float32, while JavaScript uses Float64. Narrowing Float64 to Float32 loses precision above 2^24 (16,777,216), which is a compliance violation for finance data. The Precision Sufficiency Analyser estimates condition numbers for linear algebra, checks accumulation bounds against the Float32 safe integer threshold, and calculates expected relative error. For HIGH-sensitivity operations (matrix solve, eigenvalue decomposition), it blocks GPU dispatch when error would exceed tolerance. For LOW-sensitivity operations (filters, sorts, comparisons), it confirms that Float32 and Float64 produce identical results via range checking.

What are the crossover thresholds between tiers?

CPU main thread handles datasets under 10,000 elements with sub-0.5 ms execution and zero overhead. Web Worker pool with SharedArrayBuffer handles 10,000 to 500,000 elements with zero-copy parallelism and adaptive per-chunk algorithms. WebGPU compute handles 500,000-plus elements on capable hardware with thousands of parallel cores and pipeline-fused operations. These thresholds are not static. They are calibrated per-device at startup using runtime microbenchmarks, so a user on integrated Intel graphics sees different crossover points than a user on a discrete NVIDIA GPU. The application does not configure any of this manually.

Why does fallback matter for enterprise workflows?

Because browser GPUs fail. Drivers crash, background tabs lose GPU access, external GPUs disconnect, and device loss fires. If the application does not handle this, the user sees a blank screen or an unhandled promise rejection. Our cascading fallback invalidates GPU state on device loss and re-dispatches to Workers within a microtask. If SharedArrayBuffer is unavailable (older browsers, cross-origin isolation not set), Workers fall back to main-thread execution. Every operation completes with correct results regardless of tier. For UK enterprise AI workflows deployed to unknown user hardware, this is the reliability baseline that makes WebGPU viable in production.

Want to discuss how this applies to your business?

Book a Discovery Call