Deterministic exact query engine

Exact state-signature queries over complex finite-state data.

Signature Index turns structured operational data into a finite state-signature query problem. It is designed to answer exact, threshold, near-miss, drilldown, segment-conditioned and outcome-count questions quickly and verifiably.


Core role: Exact query layer

Answer type: Counts, matches, near-misses, outcomes

Primary fit: Rich repeated state-signature workloads

Not a: Prediction model or database replacement

What it is

A specialized engine for asking exact questions about states.

Many industrial, operational and research systems do not only need storage, dashboards or prediction. They also need fast memory over complex situations: whether a state has appeared before, what was close to it, where it occurred and what outcomes followed.

Signature Index is built for that layer. It does not decide which domain question is important. It makes large families of exact state-signature questions practical once the domain is translated into objects, states, signatures, segments and measurable outcomes.

Domain data → Finite states → SI engine → Exact answers
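As a rough illustration of the translation step, the sketch below bins raw readings into a small number of levels and composes them into a signature tuple. The field names (`temp_c`, `vibration_mm_s`), the bin edges and the helper names are hypothetical; the public materials do not describe SI's actual encoding or internals, so this shows only the general idea of mapping domain data to finite states.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges: each raw reading is mapped to a small integer level.
BIN_EDGES = {
    "temp_c": [60.0, 80.0],         # levels 0, 1, 2
    "vibration_mm_s": [2.0, 5.0],   # levels 0, 1, 2
}

def to_signature(record):
    """Compose one finite state signature (a tuple of levels) from a raw record."""
    return tuple(
        bisect_right(edges, record[name]) for name, edges in BIN_EDGES.items()
    )

def build_counts(records):
    """Count how often each observed signature occurs."""
    return Counter(to_signature(r) for r in records)

# Made-up records, for illustration only.
records = [
    {"temp_c": 55.0, "vibration_mm_s": 1.0},
    {"temp_c": 85.0, "vibration_mm_s": 6.5},
    {"temp_c": 86.0, "vibration_mm_s": 7.0},
]
counts = build_counts(records)
print(counts[(2, 2)])  # exact-match count for the signature (2, 2)
```

Once every record is reduced to a tuple like this, an exact-match question becomes a lookup over observed signatures rather than a scan over raw rows.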

Query families

Designed for rich repeated workloads.

Exact match

How many objects matched this exact state signature?

Threshold / roll-up

How many matched a broader state condition?

Near-miss

Which states are close to the selected signature, and what happened next?

Broad-to-exact

Which exact sub-states sit inside a broad alert or condition?

Segment-conditioned

How does this signature behave by line, asset, site, shift, model or region?

Multi-outcome counts

What were the counts for failure, defect, latency, alert, cost or other outcomes?
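The query families above can be sketched over a toy table of observed signatures. This is a minimal semantic illustration only: the rows, segment names and function names are made up, and the linear passes over observed signatures stand in for whatever SI does internally, which the public materials do not disclose.

```python
from collections import Counter, defaultdict

# Toy observations: (segment, signature, outcome). All values are made up.
ROWS = [
    ("line_a", (2, 2), "failure"),
    ("line_a", (2, 2), "ok"),
    ("line_a", (2, 1), "ok"),
    ("line_b", (2, 2), "failure"),
    ("line_b", (0, 0), "ok"),
]

by_signature = Counter()
by_segment = Counter()
outcomes = defaultdict(Counter)
for segment, sig, outcome in ROWS:
    by_signature[sig] += 1
    by_segment[(segment, sig)] += 1
    outcomes[sig][outcome] += 1

def exact(sig):
    """Exact match: how many rows carry exactly this signature."""
    return by_signature[sig]

def broad_to_exact(min_levels):
    """Exact sub-states (with counts) whose every level is >= min_levels."""
    return {
        s: n for s, n in by_signature.items()
        if all(a >= b for a, b in zip(s, min_levels))
    }

def threshold(min_levels):
    """Threshold / roll-up: total count under the same broad condition."""
    return sum(broad_to_exact(min_levels).values())

def near_miss(sig, radius=1):
    """Observed signatures within L1 distance `radius`, excluding sig itself."""
    return {
        s: n for s, n in by_signature.items()
        if 0 < sum(abs(a - b) for a, b in zip(s, sig)) <= radius
    }

def segment_count(segment, sig):
    """Segment-conditioned: count of a signature inside one segment."""
    return by_segment[(segment, sig)]

print(exact((2, 2)))
print(threshold((2, 1)))
print(near_miss((2, 2)))
print(segment_count("line_a", (2, 2)))
print(dict(outcomes[(2, 2)]))
```

The roll-up and the broad-to-exact drilldown share one condition here, which is why the first is simply the sum of the second.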

100M-event benchmark

Public ClickBench workload: 500,000 hidden state-signature queries.

The benchmark tested Signature Index as a deterministic exact state-signature query engine. It did not test prediction, anomaly detection or model accuracy.

Rows / events: 99,997,497

Total queries: 500,000

Query families: 5

Correctness mismatches: 0

Build time (MacBook Air M4, 24 GB): ~19.7 min

Peak RSS: ~2.42 GB

Query family	SI speedup vs named baseline	SI speedup vs reference	Interpretation
Exact	~439,441×	~704,270×	Exact state-signature lookup
Segment-conditioned	~372,104×	~624,308×	Signature behavior within a segment
Threshold / roll-up	~95.2×	~65.4×	Broader state-condition queries
Near-miss L1	~21.4×	~98.2×	Close state-signature neighborhoods
Broad-to-exact	~591.8×	~4,672×	Exact sub-states under a broad condition

The benchmark represents a large repeated query workload over a public clickstream dataset. Speedups are shown against the fastest named baseline used for each query family and against the reference-check path. Exact correctness was required first: 0 mismatches across 500,000 queries.

HEP / scientific data validation

Repeated selection and region queries over HEP-style event data.

The public technical report includes a HEP-oriented validation pass focused on repeated exact selection, region, cut-grid, neighborhood and segment-conditioned count workloads. The purpose is not to replace ROOT, RDataFrame or experiment-specific analysis frameworks. The purpose is to test whether an observed-support memory layer can accelerate repeated query families once event-level data have been translated into finite signatures.

Correctness mismatches: 0

Canonical Z-region vs repeated scan: ~652.6×

Segment-conditioned vs repeated scan: ~2,136×

Calibrated break-even: ~1.12 queries

100M-event scenario, 100 queries: ~89×

1B-event scenario, 1,000 queries: ~892×

What was tested

The HEP validation emphasizes query families that naturally recur during exploratory analysis: region selections, mass-bucket scans, cut-grid families, near-neighborhood counts and segment-conditioned selections. Results in the report distinguish measured validation summaries from calibrated scaling and break-even estimates.

How to read the results

SI is strongest when many related queries are asked repeatedly over the same observed support. Narrow pre-aggregates can still win for a single fixed aggregate. The HEP result should be read as evidence for a complementary repeated-query layer, not as a claim to replace established HEP analysis stacks.
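One simple way to read a break-even figure is an amortization model in which a one-time build cost is spread over N repeated queries. The sketch below assumes a constant per-query scan cost, a negligible per-query SI cost and costs expressed in units of one repeated scan. These are illustrative assumptions and the function names are hypothetical; the report's own calibration may use a different model.

```python
def break_even_queries(build_cost, scan_cost, si_cost):
    """Number of queries N at which build_cost + N*si_cost equals N*scan_cost."""
    if scan_cost <= si_cost:
        raise ValueError("no break-even: an SI query is not cheaper than a scan")
    return build_cost / (scan_cost - si_cost)

def build_inclusive_speedup(n_queries, build_cost, scan_cost, si_cost):
    """Total repeated-scan time divided by total SI time, build included."""
    return (n_queries * scan_cost) / (build_cost + n_queries * si_cost)

# Units: one repeated scan = 1.0. Assumed build cost of 1.12 scans and a
# per-query SI cost treated as zero, so the break-even is 1.12 queries.
scan, si, build = 1.0, 0.0, 1.12
print(break_even_queries(build, scan, si))
print(build_inclusive_speedup(100, build, scan, si))
print(build_inclusive_speedup(1000, build, scan, si))
```

Under these assumptions the build-inclusive speedup approaches N / 1.12, roughly 89 for 100 queries and roughly 893 for 1,000, which is in line with the scenario figures quoted above. The model ignores how build and scan costs change with event count, so it is a reading aid rather than a reconstruction of the report's estimates.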

Open report on Zenodo · Open reference artifact

Evidence across domains

One engine pattern, several state-signature workloads.

These materials show SI as a reusable exact query layer. The domain changes the translation into states, segments and outcomes; the public claim remains the same: exact, fast, verifiable state-signature query serving.

Scale benchmark

Public ClickBench 100M-event workload

500,000 hidden exact, threshold, near-miss, broad-to-exact and segment-conditioned queries over nearly 100M public clickstream events.

99,997,497 rows / events
500,000 total hidden queries
0 correctness mismatches
Speedups up to ~439,441× vs named baseline

Open benchmark summary →

AI infrastructure

GPU cluster telemetry risk memory

Public Alibaba GPU Cluster Trace workload: recurring telemetry-state signatures, tail bottlenecks, fail-slow states, near-miss incidents and outcome counts.

3,033,232 telemetry-state rows
300 hidden queries
0 mismatches vs reference
513.7× median speedup vs fastest public baseline

Open AI infrastructure deck →

HEP / research data

HEP-style repeated selection workload

Scientific event-level validation for repeated exact selections, region queries, cut-grid families, neighborhood counts and segment-conditioned queries.

0 correctness mismatches
~652.6× canonical Z-region selection vs repeated scan
~2,136× segment-conditioned family count vs repeated scan
Calibrated scaling estimates up to ~892× in repeated-query scenarios

Open technical report →

Quant research

Financial-state query memory

Controlled post-dataset query layer for exploratory financial analysis: exact regimes, threshold variants, near-miss cohorts and historical outcome counts.

7.5M instrument-date rows
375M formula values at 50 formulas
0 mismatches in internal checks
Build-inclusive break-even within normal exploratory query counts

Internal case material available on request

Industrial fit

Where the engine is naturally useful.

Signature Index is strongest where the problem can be expressed as finite objects, multi-level states, composed signatures, segments and measurable outcomes.

Asset performance management

Exact and near-miss memory over asset states, alarms, conditions and historical outcomes.

Predictive maintenance support

A deterministic evidence layer underneath predictive or diagnostic systems.

Manufacturing quality

Drilldown from broad defect classes to exact process-state signatures.

Process-state exploration

Fast queries over thresholds, operating regimes, recipes, units, sites and outcomes.

Private evaluation without disclosure

SI can be evaluated as a controlled black-box engine.

A practical evaluation can be scoped around an agreed state-signature workload and validated by reference counts, hidden answer keys or agreed audit outputs. The public artifact includes a semantic reference implementation; private evaluations can use domain-oriented evaluation harnesses without exposing proprietary internals.

Case pack

Agree state dimensions, segments, outcomes and query families upfront.

Black-box run

Run SI as an engine on a public, synthetic, anonymized or partner-defined state matrix.

Validation

Check exactness by reference counts, hidden answer keys, hashes or agreed output schema.

Metrics

Report mismatch count, latency, throughput, memory, build time and speedup vs agreed baselines.
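The validation and metrics steps above can be sketched as a small harness that treats the engine as a black box and compares its counts with a hashed answer key. Everything here is an assumption for illustration: the hashing scheme, the salt handling, the report fields and the toy engine are hypothetical and are not taken from the SI evaluation harnesses.

```python
import hashlib
import time

def answer_hash(query_id, count, salt):
    """Hash one answer so a key can be shared without listing raw counts."""
    return hashlib.sha256(f"{salt}|{query_id}|{count}".encode("utf-8")).hexdigest()

def evaluate(engine, queries, answer_key, salt):
    """Run every query through a black-box engine and report basic metrics."""
    mismatches = 0
    engine_seconds = 0.0
    for query_id, query in queries.items():
        start = time.perf_counter()
        count = engine(query)               # only the engine call is timed
        engine_seconds += time.perf_counter() - start
        if answer_hash(query_id, count, salt) != answer_key[query_id]:
            mismatches += 1
    return {
        "queries": len(queries),
        "mismatches": mismatches,
        "engine_seconds": engine_seconds,
    }

# Toy stand-ins: reference counts from a trusted scan and a dict-backed engine.
reference_counts = {(2, 2): 3, (0, 0): 1}

def toy_engine(signature):
    return reference_counts.get(signature, 0)

queries = {"q1": (2, 2), "q2": (0, 0), "q3": (1, 1)}
salt = "demo-salt"  # in practice held by whoever prepares the answer key
answer_key = {
    qid: answer_hash(qid, reference_counts.get(sig, 0), salt)
    for qid, sig in queries.items()
}

report = evaluate(toy_engine, queries, answer_key, salt)
print(report["queries"], report["mismatches"])
```

Speedup against an agreed baseline would then be the baseline's measured time for the same queries divided by `engine_seconds`, with build time and memory reported separately.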

Looking for 2–3 private evaluation cases

The best fit is a repeated exact state-signature workload: asset telemetry, process states, alarm histories, quality events, downtime records, GPU-cluster telemetry or high-dimensional research states.

Boundaries

What Signature Index does not claim.

It is not a general replacement for databases, SQL engines, dataframes or historians.
It is not an end-to-end anomaly detector, predictive model or autonomous decision system.
It does not decide which domain question is important; it serves exact query families once defined.
It is intended as a specialized engine underneath diagnostic, monitoring, research or planning products.
The public repository contains a semantic reference implementation, not a production-optimized or domain-specific engine.
The public materials do not disclose proprietary mathematical internals or private optimized evaluation harnesses.

Materials

Download the short technical materials.

The public materials include the technical report, reference artifact, benchmark summaries and domain-oriented evidence. The public code is a semantic reference implementation, not a production-optimized engine.

Technical report / DOI · GitHub artifact · 100M benchmark · AI infrastructure