Temporal safety

A backtest is only honest if the data it sees is the data the model would have seen at trade time. In weather settlement, this is harder than it looks: the NWS CLI overnight final for date D is not published until roughly 10:00 ET on D+1, and METAR corrections trickle in for days. Calling research() on a past date returns whatever is in the cache today — including corrections that did not exist when the contract settled.
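To make the failure mode concrete, here is a minimal sketch in plain Python. The record layout (`observed_at`, `knowledge_time`, `temp_f`), the sample values, and the `latest_report` helper are illustrative assumptions, not part of the SDK. The point is only that a lookup with no cutoff picks up a correction issued after settlement.

```python
from datetime import datetime, timezone

def utc(s: str) -> datetime:
    """Parse an ISO-8601 string into a tz-aware UTC datetime."""
    return datetime.fromisoformat(s).astimezone(timezone.utc)

# Hypothetical cache contents: one original report and a later correction
# for the same observation time. The values are made up.
cache = [
    {"observed_at": utc("2025-01-06T20:51:00+00:00"),
     "knowledge_time": utc("2025-01-06T20:52:00+00:00"), "temp_f": 41},
    {"observed_at": utc("2025-01-06T20:51:00+00:00"),
     "knowledge_time": utc("2025-01-08T13:00:00+00:00"), "temp_f": 43},  # correction
]

def latest_report(rows, as_of=None):
    """Return the most recently published row, optionally bounded by as_of."""
    visible = [r for r in rows if as_of is None or r["knowledge_time"] <= as_of]
    return max(visible, key=lambda r: r["knowledge_time"])

settlement = utc("2025-01-07T15:00:00+00:00")
naive = latest_report(cache)                      # picks up the correction
honest = latest_report(cache, as_of=settlement)   # only what existed at settlement
```

The two lookups disagree on the temperature, and only the bounded one reflects what a model could have known at trade time.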

mostlyright.core.temporal ships two primitives for this. KnowledgeView is the silent filter: it returns only rows knowable by your chosen cutoff. LeakageDetector (and the assert_no_leakage() helper) is the loud audit: it raises as soon as any row is not knowable by the cutoff.

KnowledgeView — the silent filter

KnowledgeView wraps a DataFrame and exposes dataframe(), which returns a defensive copy filtered to knowledge_time <= as_of. Rows past the cutoff are dropped. The original DataFrame is never mutated.

Python

```python
import pandas as pd
from mostlyright.core import KnowledgeView, TimePoint

df = pd.DataFrame({
    "knowledge_time": pd.to_datetime([
        "2025-01-01T00:00:00Z",
        "2025-01-02T00:00:00Z",
        "2025-01-03T00:00:00Z",
    ], utc=True),
    "value": [10, 20, 30],
})

view = KnowledgeView(df, TimePoint("2025-01-02T12:00:00Z"))
filtered = view.dataframe()
# 2 rows: only the rows with knowledge_time <= 2025-01-02T12:00:00Z
```

TypeScript

```typescript
import { KnowledgeView, TimePoint } from "@mostlyrightmd/core/temporal";

const rows = [
  { knowledge_time: "2025-01-01T00:00:00Z", value: 10 },
  { knowledge_time: "2025-01-02T00:00:00Z", value: 20 },
  { knowledge_time: "2025-01-03T00:00:00Z", value: 30 },
];

const view = new KnowledgeView(rows, new TimePoint("2025-01-02T12:00:00Z"));
const filtered = view.rows();
// 2 rows: only the rows with knowledge_time <= 2025-01-02T12:00:00Z
```

The knowledge_time column is the load-bearing contract. Every row produced by the SDK’s catalog adapters carries it — for a METAR, knowledge_time is the moment the report became visible upstream (METAR observed_at plus typical IEM/AWC propagation). For a CLI record, it is cli_available_at(date, station) — roughly midnight LST + 10h for the overnight final.

KnowledgeView validates eagerly:

knowledge_time must exist as a column (Python) or field on every row (TypeScript).
The values must be tz-aware UTC. Naive timestamps raise SchemaValidationError with the violating row index.

These checks happen at construction time, not at filter time, so a misshapen DataFrame fails before any downstream code sees it.
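The eager-validation behavior can be pictured with a small stand-in class. This is a sketch under assumptions, not the SDK's implementation: `SketchView` and `SketchSchemaError` are hypothetical names, rows are plain dicts rather than a DataFrame, and "tz-aware UTC" is interpreted as a datetime whose UTC offset is zero.

```python
from datetime import datetime, timedelta, timezone

class SketchSchemaError(ValueError):
    """Hypothetical stand-in for the SDK's SchemaValidationError."""
    def __init__(self, message: str, row_index: int):
        super().__init__(f"row {row_index}: {message}")
        self.row_index = row_index

class SketchView:
    """Illustrative stand-in: validate at construction, filter later."""
    def __init__(self, rows, as_of: datetime):
        for i, row in enumerate(rows):
            kt = row.get("knowledge_time")
            if kt is None:
                raise SketchSchemaError("missing knowledge_time", i)
            # A naive datetime has utcoffset() None, so it fails this check too.
            if not isinstance(kt, datetime) or kt.utcoffset() != timedelta(0):
                raise SketchSchemaError("knowledge_time must be tz-aware UTC", i)
        self._rows = rows
        self._as_of = as_of

    def rows(self):
        # Copy each row so callers cannot mutate the input through the result.
        return [dict(r) for r in self._rows if r["knowledge_time"] <= self._as_of]

cutoff = datetime(2025, 1, 2, 12, tzinfo=timezone.utc)
good = [{"knowledge_time": datetime(2025, 1, 1, tzinfo=timezone.utc), "value": 10}]
bad = good + [{"knowledge_time": datetime(2025, 1, 2), "value": 20}]  # naive timestamp

try:
    SketchView(bad, cutoff)   # fails here, before any filtering happens
    failed_index = None
except SketchSchemaError as err:
    failed_index = err.row_index
```

Because the check runs in the constructor, the naive timestamp is reported with its row index before `rows()` is ever called.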

assert_no_leakage — the loud audit

Where KnowledgeView drops rows past the cutoff, assert_no_leakage() raises. Use it as a guard at training-set boundaries: if any row has a knowledge_time later than the cutoff, the call raises LeakageError carrying the violation count and a sample of up to 10 violating rows.

Python

```python
import pandas as pd
from mostlyright.core import TimePoint, assert_no_leakage, LeakageError

df = pd.DataFrame({
    "knowledge_time": pd.to_datetime([
        "2025-01-01T00:00:00Z",
        "2025-01-03T00:00:00Z",   # past the cutoff
    ], utc=True),
    "value": [10, 99],
})

try:
    assert_no_leakage(df, TimePoint("2025-01-02T00:00:00Z"))
except LeakageError as err:
    print(err.violating_count)   # 1
    print(err.sample)            # list of up to 10 violating rows
```

TypeScript

```typescript
import { assertNoLeakage, TimePoint, LeakageError } from "@mostlyrightmd/core/temporal";

const rows = [
  { knowledge_time: "2025-01-01T00:00:00Z", value: 10 },
  { knowledge_time: "2025-01-03T00:00:00Z", value: 99 },  // past the cutoff
];

try {
  assertNoLeakage(rows, new TimePoint("2025-01-02T00:00:00Z"));
} catch (err) {
  if (err instanceof LeakageError) {
    console.log(err.violatingCount);   // 1
    console.log(err.sample);           // up to 10 violating rows
  }
}
```

The error payload (violating_count in Python, violatingCount in TypeScript, plus a sample capped at 10 rows) has the same shape in both SDKs. Sample row indices line up with the original DataFrame so callers can surface the bad rows without dumping the entire frame.
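One way to picture that payload is the sketch below, which mimics the documented shape: a count plus a sample capped at 10, keyed by original row position. `SketchLeakageError` and `sketch_assert_no_leakage` are hypothetical stand-ins written for illustration, and the SDK's internals may differ.

```python
from datetime import datetime, timedelta, timezone

SAMPLE_CAP = 10  # the documented cap on the sample size

class SketchLeakageError(Exception):
    """Hypothetical stand-in for the SDK's LeakageError."""
    def __init__(self, violating_count, sample):
        super().__init__(f"{violating_count} row(s) past the cutoff")
        self.violating_count = violating_count
        self.sample = sample  # list of (original_index, row) pairs

def sketch_assert_no_leakage(rows, as_of):
    """Raise if any row's knowledge_time is later than as_of."""
    violations = [(i, r) for i, r in enumerate(rows) if r["knowledge_time"] > as_of]
    if violations:
        raise SketchLeakageError(len(violations), violations[:SAMPLE_CAP])

cutoff = datetime(2025, 1, 2, tzinfo=timezone.utc)
# 3 clean rows followed by 25 rows published after the cutoff.
rows = [{"knowledge_time": cutoff - timedelta(hours=h)} for h in range(3)]
rows += [{"knowledge_time": cutoff + timedelta(hours=h + 1)} for h in range(25)]

try:
    sketch_assert_no_leakage(rows, cutoff)
    caught = None
except SketchLeakageError as err:
    caught = err  # keep a reference; `err` is unbound after the except block
```

The count reports every violation while the sample stays small, and each sampled entry keeps its position in the input so the offending rows can be looked up directly.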

When to use which

Both primitives operate on the same input shape, but they answer different questions:

KnowledgeView — “Give me only the rows I would have seen at time T.” Use it when building a training set or a point-in-time inference input.
assert_no_leakage() — “Did I accidentally include rows from the future?” Use it as a guard after every join or merge that could introduce a row whose knowledge_time was not bounded.

Common pattern: build a training set with KnowledgeView, then call assert_no_leakage() once after the training-set join to verify no rows slipped through a transform that did not preserve the cutoff.
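The pattern can be sketched end to end in plain Python. Everything here is an illustrative assumption rather than SDK code: rows are dicts with made-up values, `filter_as_of` and `audit` are hypothetical stand-ins for KnowledgeView and assert_no_leakage(), and a joined row's knowledge_time is taken to be the later of its two inputs.

```python
from datetime import datetime, timezone

def ts(day: int, hour: int = 0) -> datetime:
    """Build a tz-aware UTC timestamp in January 2025."""
    return datetime(2025, 1, day, hour, tzinfo=timezone.utc)

def filter_as_of(rows, as_of):
    """Stand-in for the silent filter: keep copies of rows knowable by as_of."""
    return [dict(r) for r in rows if r["knowledge_time"] <= as_of]

def audit(rows, as_of):
    """Stand-in for the loud audit: return the rows that violate the cutoff."""
    return [r for r in rows if r["knowledge_time"] > as_of]

as_of = ts(2, 12)
observations = [
    {"station": "KXXX", "knowledge_time": ts(1), "temp_f": 38},
    {"station": "KXXX", "knowledge_time": ts(3), "temp_f": 44},  # future row
]
# A feature table joined in later; it was never passed through the filter.
station_features = {"KXXX": {"knowledge_time": ts(4), "bias_correction": 0.5}}

train = filter_as_of(observations, as_of)

joined = []
for row in train:
    feat = station_features[row["station"]]
    merged = {**row, "bias_correction": feat["bias_correction"]}
    # A joined row is only knowable once both of its inputs are.
    merged["knowledge_time"] = max(row["knowledge_time"], feat["knowledge_time"])
    joined.append(merged)

violations = audit(joined, as_of)
```

Here the filter removed the future observation, but the join reintroduced a late-published value, which is the case the audit is meant to catch.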

What `knowledge_time` actually means

knowledge_time is “the earliest UTC moment at which a downstream consumer could have observed this row.” It is distinct from observed_at (when the weather event happened) and from valid_at (the moment a forecast targets). For settlement use, the gap matters:

A METAR at observed_at=2025-01-06T05:51:00Z typically has knowledge_time ≈ observed_at + 1 minute, because METARs propagate to AWC almost instantly.
A CLI overnight final for observation_date=2025-01-06 (LST) has knowledge_time ≈ 2025-01-07T15:00:00Z (midnight LST + 10h for an East Coast station). Calling research() with a cutoff before that returns no CLI record for that date — the SDK surfaces a climate_unavailable_reason of “CLI not yet published” rather than guessing.

mostlyright.snapshot.cli_available_at(date_str, station) is the canonical helper for the CLI-publication delay. The DataSnapshot dataclass returned by build_snapshot() uses it to gate climate inclusion against the requested as_of.
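The arithmetic behind the East Coast example can be checked with a short sketch. `approx_cli_available_at` is a hypothetical approximation written for this page, not the SDK helper: it takes the station's fixed standard-time UTC offset as an explicit argument (the real helper takes a station) and applies the "midnight LST + 10h" rule of thumb, reading "midnight" as the one that ends the observation date.

```python
from datetime import date, datetime, timedelta, timezone

def approx_cli_available_at(date_str: str, lst_utc_offset_hours: int) -> datetime:
    """Approximate publication time: midnight LST ending the date, plus 10 hours."""
    lst = timezone(timedelta(hours=lst_utc_offset_hours))  # LST ignores daylight saving
    obs_date = date.fromisoformat(date_str)
    next_day = obs_date + timedelta(days=1)
    midnight_lst = datetime(next_day.year, next_day.month, next_day.day, tzinfo=lst)
    return (midnight_lst + timedelta(hours=10)).astimezone(timezone.utc)

# UTC-5 is Eastern standard time.
available = approx_cli_available_at("2025-01-06", -5)

# Gate climate inclusion against a requested as_of, in the spirit of build_snapshot().
cutoff = datetime(2025, 1, 7, 12, tzinfo=timezone.utc)
include_cli = available <= cutoff
```

With these inputs the approximation lands on 2025-01-07T15:00:00Z, matching the figure above, so a cutoff of 12:00Z on the same day excludes the CLI record.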
