Temporal safety
A backtest is only honest if the data it sees is the data the model would have seen at trade time. In weather settlement, this is harder than it looks: the NWS CLI overnight final for date D is not published until roughly 10:00 ET on D+1, and METAR corrections trickle in for days. Calling research() on a past date returns whatever is in the cache today — including corrections that did not exist when the contract settled.
mostlyright.core.temporal ships two primitives for this. KnowledgeView is the silent filter: it returns only the rows knowable by your chosen cutoff. LeakageDetector (and the assert_no_leakage() helper) is the loud audit: it raises if any row has a knowledge_time later than the cutoff.
KnowledgeView — the silent filter
KnowledgeView wraps a DataFrame and exposes dataframe(), which returns a defensive copy filtered to knowledge_time <= as_of. Rows past the cutoff are dropped. The original DataFrame is never mutated.
Python:

    import pandas as pd
    from mostlyright.core import KnowledgeView, TimePoint

    df = pd.DataFrame({
        "knowledge_time": pd.to_datetime([
            "2025-01-01T00:00:00Z",
            "2025-01-02T00:00:00Z",
            "2025-01-03T00:00:00Z",
        ], utc=True),
        "value": [10, 20, 30],
    })

    view = KnowledgeView(df, TimePoint("2025-01-02T12:00:00Z"))
    filtered = view.dataframe()
    # 2 rows: only the rows with knowledge_time <= 2025-01-02T12:00:00Z

TypeScript:

    import { KnowledgeView, TimePoint } from "@mostlyrightmd/core/temporal";

    const rows = [
      { knowledge_time: "2025-01-01T00:00:00Z", value: 10 },
      { knowledge_time: "2025-01-02T00:00:00Z", value: 20 },
      { knowledge_time: "2025-01-03T00:00:00Z", value: 30 },
    ];

    const view = new KnowledgeView(rows, new TimePoint("2025-01-02T12:00:00Z"));
    const filtered = view.rows();
    // 2 rows: only the rows with knowledge_time <= 2025-01-02T12:00:00Z

The knowledge_time column is the load-bearing contract. Every row produced by the SDK’s catalog adapters carries it. For a METAR, knowledge_time is the moment the report became visible upstream (METAR observed_at plus typical IEM/AWC propagation). For a CLI record, it is cli_available_at(date, station), roughly midnight LST + 10h for the overnight final.
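As a rough illustration of what filtering on knowledge_time amounts to, here is a minimal stdlib-only sketch. It is not the SDK's implementation: filter_knowable is a hypothetical helper, and plain dicts stand in for DataFrame rows. It shows the two properties described above: rows past the cutoff are dropped, and the result is a copy, so mutating it leaves the input alone.

```python
from copy import deepcopy
from datetime import datetime, timezone


def filter_knowable(rows, as_of):
    """Return deep copies of the rows whose knowledge_time is <= as_of.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    return [deepcopy(row) for row in rows if row["knowledge_time"] <= as_of]


rows = [
    {"knowledge_time": datetime(2025, 1, 1, tzinfo=timezone.utc), "value": 10},
    {"knowledge_time": datetime(2025, 1, 2, tzinfo=timezone.utc), "value": 20},
    {"knowledge_time": datetime(2025, 1, 3, tzinfo=timezone.utc), "value": 30},
]
as_of = datetime(2025, 1, 2, 12, tzinfo=timezone.utc)

filtered = filter_knowable(rows, as_of)
filtered[0]["value"] = -1  # mutate the copy; the input rows must not change

print([row["value"] for row in filtered])  # [-1, 20]
print([row["value"] for row in rows])      # [10, 20, 30]
```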
KnowledgeView validates eagerly:
- knowledge_time must exist as a column (Python) or as a field on every row (TypeScript).
- The values must be tz-aware UTC. Naive timestamps raise SchemaValidationError with the violating row index.
These checks happen at construction time, not at filter time, so a misshapen DataFrame fails before any downstream code sees it.
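The eager checks can be pictured with a small stdlib-only sketch. The names here are assumptions made for illustration: SchemaError stands in for the SDK's SchemaValidationError, validate_knowledge_time is hypothetical, and rows are plain dicts holding datetime values. The point is that validation runs once, up front, and reports the index of the first offending row.

```python
from datetime import datetime, timedelta, timezone


class SchemaError(ValueError):
    """Stand-in for the SDK's SchemaValidationError (hypothetical)."""


def validate_knowledge_time(rows):
    """Raise SchemaError on the first row that breaks the contract."""
    for index, row in enumerate(rows):
        if "knowledge_time" not in row:
            raise SchemaError(f"row {index}: knowledge_time is missing")
        offset = row["knowledge_time"].utcoffset()
        # utcoffset() returns None for a naive datetime.
        if offset is None:
            raise SchemaError(f"row {index}: knowledge_time is naive")
        if offset != timedelta(0):
            raise SchemaError(f"row {index}: knowledge_time is not UTC")


good = [{"knowledge_time": datetime(2025, 1, 1, tzinfo=timezone.utc)}]
bad = good + [{"knowledge_time": datetime(2025, 1, 2)}]  # second row is naive

validate_knowledge_time(good)  # passes silently

try:
    validate_knowledge_time(bad)
except SchemaError as err:
    message = str(err)

print(message)  # row 1: knowledge_time is naive
```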
assert_no_leakage — the loud audit
Where KnowledgeView drops rows past the cutoff, assert_no_leakage() raises. Use it as a guard at training-set boundaries: if any row has a knowledge_time later than the cutoff, the call raises LeakageError with the violation count and a sample of up to 10 violating rows.
Python:

    import pandas as pd
    from mostlyright.core import TimePoint, assert_no_leakage, LeakageError

    df = pd.DataFrame({
        "knowledge_time": pd.to_datetime([
            "2025-01-01T00:00:00Z",
            "2025-01-03T00:00:00Z",  # past the cutoff
        ], utc=True),
        "value": [10, 99],
    })

    try:
        assert_no_leakage(df, TimePoint("2025-01-02T00:00:00Z"))
    except LeakageError as err:
        print(err.violating_count)  # 1
        print(err.sample)           # list of up to 10 violating rows

TypeScript:

    import { assertNoLeakage, TimePoint, LeakageError } from "@mostlyrightmd/core/temporal";

    const rows = [
      { knowledge_time: "2025-01-01T00:00:00Z", value: 10 },
      { knowledge_time: "2025-01-03T00:00:00Z", value: 99 }, // past the cutoff
    ];

    try {
      assertNoLeakage(rows, new TimePoint("2025-01-02T00:00:00Z"));
    } catch (err) {
      if (err instanceof LeakageError) {
        console.log(err.violatingCount); // 1
        console.log(err.sample);         // up to 10 violating rows
      }
    }

The error payload (violating_count + sample capped at 10) is the same across both SDKs. Sample row indices line up with the original DataFrame so callers can surface the bad rows without dumping the entire frame.
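To show how a count plus a capped sample keeps the error payload small, here is a minimal stdlib-only sketch of such an audit. Everything in it is an assumption for illustration: LeakError and audit_no_leakage are hypothetical stand-ins for LeakageError and assert_no_leakage(), and each sample entry is an (index, row) pair so the indices refer back to the input.

```python
from datetime import datetime, timezone


class LeakError(Exception):
    """Stand-in for the SDK's LeakageError (hypothetical)."""

    def __init__(self, violating_count, sample):
        super().__init__(f"{violating_count} row(s) past the cutoff")
        self.violating_count = violating_count
        self.sample = sample


def audit_no_leakage(rows, as_of, sample_size=10):
    """Raise LeakError if any row has knowledge_time later than as_of."""
    violations = [
        (index, row)
        for index, row in enumerate(rows)
        if row["knowledge_time"] > as_of
    ]
    if violations:
        # Report the full count but only the first few offending rows.
        raise LeakError(len(violations), violations[:sample_size])


utc = timezone.utc
rows = [{"knowledge_time": datetime(2025, 1, 1, tzinfo=utc), "value": 10}]
rows += [
    {"knowledge_time": datetime(2025, 1, 3, tzinfo=utc), "value": 99}
    for _ in range(12)
]
cutoff = datetime(2025, 1, 2, tzinfo=utc)

try:
    audit_no_leakage(rows, cutoff)
except LeakError as err:
    caught = err

print(caught.violating_count)  # 12
print(len(caught.sample))      # 10
print(caught.sample[0][0])     # 1 (index of the first violating input row)
```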
When to use which
Both primitives operate on the same input shape, but they answer different questions:
- KnowledgeView: “Give me only the rows I would have seen at time T.” Use it when building a training set or a point-in-time inference input.
- assert_no_leakage(): “Did I accidentally include rows from the future?” Use it as a guard after every join or merge that could introduce a row whose knowledge_time was not bounded.
Common pattern: build a training set with KnowledgeView, then call assert_no_leakage() once after the training-set join to verify no rows slipped through a transform that did not preserve the cutoff.
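The sketch below walks through that pattern with plain dicts instead of the SDK types. The station, values, and the rule that a joined row takes the later of its inputs' knowledge_time values are all assumptions of this example, not documented SDK behaviour. It shows why the audit belongs after the join: the feature side was filtered, but the label side was not.

```python
from datetime import datetime, timezone

UTC = timezone.utc
as_of = datetime(2025, 1, 2, 12, tzinfo=UTC)

# Illustrative data only.
observations = [
    {"station": "KNYC", "knowledge_time": datetime(2025, 1, 1, tzinfo=UTC), "temp_f": 41},
    {"station": "KNYC", "knowledge_time": datetime(2025, 1, 3, tzinfo=UTC), "temp_f": 39},
]
labels = [
    # This label only became knowable after the cutoff.
    {"station": "KNYC", "knowledge_time": datetime(2025, 1, 4, tzinfo=UTC), "high_f": 44},
]

# Step 1: silent filter on the feature side.
features = [row for row in observations if row["knowledge_time"] <= as_of]

# Step 2: a join that ignores the cutoff. Assumption: the joined row is
# knowable only once both inputs are, so it takes the later knowledge_time.
joined = [
    {**feat, **lab, "knowledge_time": max(feat["knowledge_time"], lab["knowledge_time"])}
    for feat in features
    for lab in labels
    if feat["station"] == lab["station"]
]

# Step 3: loud audit after the join catches the unfiltered label.
leaked = [row for row in joined if row["knowledge_time"] > as_of]

print(len(features), len(joined), len(leaked))  # 1 1 1
```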
What knowledge_time actually means
knowledge_time is “the earliest UTC moment at which a downstream consumer could have observed this row.” It is distinct from observed_at (when the weather event happened) and from valid_at (the moment a forecast targets). For settlement use, the gap matters:
- A METAR at observed_at=2025-01-06T05:51:00Z typically has knowledge_time ≈ observed_at + 1 minute, because METARs propagate to AWC almost instantly.
- A CLI overnight final for observation_date=2025-01-06 (LST) has knowledge_time ≈ 2025-01-07T15:00:00Z (midnight LST + 10h for an East Coast station). Calling research() with a cutoff before that returns no CLI record for that date: the SDK surfaces a climate_unavailable_reason of “CLI not yet published” rather than guessing.
mostlyright.snapshot.cli_available_at(date_str, station) is the canonical helper for the CLI-publication delay. The DataSnapshot dataclass returned by build_snapshot() uses it to gate climate inclusion against the requested as_of.
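The "midnight LST + 10h" arithmetic can be checked by hand with the standard library. This is only a sketch of the rule of thumb quoted above, not the logic inside cli_available_at(): approx_cli_available_at is a hypothetical function, and the fixed UTC-5 offset is an assumption for an East Coast station on local standard time.

```python
from datetime import date, datetime, time, timedelta, timezone


def approx_cli_available_at(observation_date, lst_offset_hours, delay_hours=10):
    """Approximate UTC time at which the CLI overnight final is knowable.

    Hypothetical helper: midnight LST closing the observation date, plus
    a fixed publication delay.
    """
    lst = timezone(timedelta(hours=lst_offset_hours))
    # Midnight LST at the end of the observation date.
    end_of_day = datetime.combine(
        observation_date + timedelta(days=1), time(0), tzinfo=lst
    )
    return (end_of_day + timedelta(hours=delay_hours)).astimezone(timezone.utc)


available = approx_cli_available_at(date(2025, 1, 6), lst_offset_hours=-5)
print(available.isoformat())  # 2025-01-07T15:00:00+00:00

# Gating against a cutoff earlier than publication: the CLI record is excluded.
as_of = datetime(2025, 1, 7, 12, tzinfo=timezone.utc)
print(available <= as_of)  # False
```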
See also
- Source identity: Mode 1 vs Mode 2 and the source-mismatch invariant.