Research · Beta · vN.3 · offline · human-gated

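AI Agent Research Pipeline

How a set of single-purpose AI agents turns the current macro regime, step by step, into a reviewable strategy proposal: fully audited, run offline, and forced to stop at the human gate.

This is the offline research pipeline (vN.3) behind the portfolio, not an automated trader. It reuses the tested backtest engine (vN.1) and the bounded search (vN.2), then layers a red-team critique and a human gate on top. Every provider call is written to the audit log; the pipeline writes only to research/proposals/ and never touches the live book in data/.

Pipeline flow

Read it as an orchestrator-workers system: one control plane (the Orchestrator) drives a set of single-purpose agents in the data plane from left to right, every call is logged, and the only decision point is the human gate. The machine itself never writes to the live book.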

%%{init: {"theme":"base","themeVariables":{"fontFamily":"ui-sans-serif, system-ui, -apple-system, Segoe UI, sans-serif","fontSize":"14px","background":"#faf9f5","lineColor":"#8a8576","primaryTextColor":"#141413"},"flowchart":{"htmlLabels":true,"nodeSpacing":50,"rankSpacing":58,"padding":16,"useMaxWidth":true,"curve":"basis"}}}%%
flowchart TB
  IN(["data/regime.yaml + price panel offline inputs"]):::store
  subgraph CTRL["control plane"]
    ORCH["Orchestrator · the conductor drives every call logs audit · hashes proposal_id"]:::det
  end
  subgraph DP["single-purpose agents · data plane (offline)"]
    direction LR
    A1["RegimeAgent pure read → views no LLM · no RNG"]:::det
    A2["Signals factor z-scores"]:::det
    A3["HypothesisAgent falsifiable thesis LLM-optional"]:::llm
    A4["Generator + Search bounded replay search"]:::det
    A5["CriticAgent · red team DSR gate + stress re-sim produces accept flag"]:::redteam
    A6["CuratorAgent base / bull / bear drafts"]:::det
    A1 --> A2 --> A3 --> A4 --> A5 --> A6
  end
  ART[("research/proposals/ID/ 5 artifacts + audit.jsonl never writes data/")]:::store
  GATE{"HUMAN GATE reviews drafts + verdict"}:::human
  LIVE(["human manually copies weights → data/ live book"]):::term
  ARCH(["archived · zero deploy"]):::term
  IN --> A1
  A6 --> ART --> GATE
  GATE -- approve --> LIVE
  GATE -- decline --> ARCH
  ORCH -. drives + logs .-> DP
  classDef det fill:#efede4,stroke:#a39e8f,color:#141413;
  classDef llm fill:#dbe8f4,stroke:#6a9bcc,color:#234a68,stroke-width:2px,stroke-dasharray:5 3;
  classDef redteam fill:#f6ddd0,stroke:#d97757,color:#8a3a1d,stroke-width:2px;
  classDef store fill:#ece8dc,stroke:#b0aea5,color:#57534b;
  classDef human fill:#e1e8d3,stroke:#788c5d,color:#3c4a28,stroke-width:2px;
  classDef term fill:#faf9f5,stroke:#cdc9bc,color:#57534b;
  style CTRL fill:#f4f2ea,stroke:#dcd8cb,color:#57534b;
  style DP fill:#f4f2ea,stroke:#dcd8cb,color:#57534b;

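Figure A · Orchestration topology. Node colors mark each step's trust level (see the legend below); dashed arrows are control plus logging, solid arrows are data flow.

Deterministic: rule engine, no LLM, no randomness
LLM-optional: rule engine by default; Claude is wired in only when a key is configured
Red team: a critic that looks for every reason to reject
Storage / artifacts: fully audited, writes only to research/proposals/
Human gate: the only decision point; the machine never auto-accepts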

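Step-by-step breakdown

1

RegimeAgent

Pure read of data/regime.yaml → coarse asset-class views (no LLM, no randomness).

2

Signals

Computes real cross-asset factor z-scores (momentum, defensive) from the price panel.

3

HypothesisAgent

regime + signals → an explicit, falsifiable hypothesis (direction / statement / falsification conditions).

4

Generator + Search

Builds a bounded vN.2 search space from the hypothesis and runs a multi-window replay search (not yet a true walk-forward; see the limitations disclosure below).

5

CriticAgent · red team

DSR gate (Deflated Sharpe Ratio) plus a real asset-shock stress re-simulation using the finalist's own weights.

6

Curator + orchestration

Compiles the drafts, writes the proposal and the audit log, and hands off to a human for review.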

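Single-purpose agents

RegimeAgent

Owns: numbers

Aggregates the fine-grained tactical matrix (OW=+1/N=0/UW=−1) into coarse asset-class scores. Fully deterministic; produces no prose.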
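
A minimal sketch of that aggregation, assuming the "mean per coarse class" rule that the worked example on this page describes. The function name `coarse_views`, the sleeve names, and the sleeve-to-class mapping are hypothetical and exist only for illustration; only the OW/N/UW scoring comes from the page.

```python
from collections import defaultdict
from statistics import mean

# Score mapping stated on this page: OW=+1, N=0, UW=-1.
SCORE = {"OW": 1, "N": 0, "UW": -1}


def coarse_views(tactical_matrix, coarse_class_of):
    """Average fine-grained tactical calls into one score per coarse class."""
    buckets = defaultdict(list)
    for sleeve, call in tactical_matrix.items():
        buckets[coarse_class_of[sleeve]].append(SCORE[call])
    # Sorted keys keep the output order stable from run to run.
    return {cls: mean(scores) for cls, scores in sorted(buckets.items())}


# Hypothetical sleeves and mapping, for illustration only.
matrix = {
    "us_large_cap": "UW",
    "em_equity": "UW",
    "gold": "OW",
    "energy": "N",
    "ig_credit": "N",
}
classes = {
    "us_large_cap": "equities",
    "em_equity": "equities",
    "gold": "commodities",
    "energy": "commodities",
    "ig_credit": "credit",
}
views = coarse_views(matrix, classes)
```

No randomness and no provider call are involved, which matches the agent's "fully deterministic" description; the sign of each score is what later orients the search bounds.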

HypothesisAgent

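Owns: hypothesis

Turns the regime views plus the real signal z-scores into an explicit direction for each asset class, a one-paragraph statement, and 4–6 falsification conditions. Every call is logged.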

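Generator + Search

Owns: candidate strategies

Derives a bounded search space from the hypothesis (direction fixed, magnitude searched), then ranks trials by the out-of-sample objective.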
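
One way to picture "direction fixed, magnitude searched" is a per-class tilt interval whose sign is pinned by the view. The sketch below is an assumption about the shape of such a space, not the vN.2 search spec itself: `tilt_bounds`, the `max_tilt` cap, and the choice to pin neutral classes at zero are all hypothetical.

```python
def tilt_bounds(views, max_tilt=0.10):
    """Map signed coarse views to (low, high) tilt bounds per asset class.

    The sign of the view fixes which side of zero the search may explore;
    the magnitude is left free inside [0, max_tilt]. max_tilt is a
    hypothetical cap chosen for this example.
    """
    bounds = {}
    for cls, score in views.items():
        if score > 0:        # overweight: search upward tilts only
            bounds[cls] = (0.0, max_tilt)
        elif score < 0:      # underweight: search downward tilts only
            bounds[cls] = (-max_tilt, 0.0)
        else:                # neutral: assumed to carry no tilt
            bounds[cls] = (0.0, 0.0)
    return bounds


# Views shaped like the worked example on this page (OW / N / UW / OW).
bounds = tilt_bounds({"commodities": 1, "credit": 0, "equities": -1, "rates": 1})
```

Because the sign never flips inside an interval, a trial cannot contradict the hypothesis it was generated from; it can only vary how strongly the tilt is expressed.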

CriticAgent

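Owns: red team

A strict DSR gate (zero or negative evidence is rejected outright) plus a real stress re-simulation: computes r_group = Σ wᵢ·shockᵢ from the finalist's own weights.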
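
The stress formula r_group = Σ wᵢ·shockᵢ is a weighted sum, sketched below. The weights, the shock magnitudes, the scenario name `demo_scenario`, and the `loss_limit` flagging rule are hypothetical; the page does not state how a scenario becomes a flag, so that part is an assumption.

```python
def stress_return(weights, shocks):
    """r_group = sum_i w_i * shock_i for one scenario."""
    missing = set(weights) - set(shocks)
    if missing:
        raise KeyError(f"no shock defined for: {sorted(missing)}")
    return sum(w * shocks[asset] for asset, w in weights.items())


def flag_scenarios(weights, scenarios, loss_limit):
    """Return names of scenarios whose re-simulated return breaches loss_limit.

    loss_limit is a hypothetical threshold used only in this sketch.
    """
    return sorted(
        name for name, shocks in scenarios.items()
        if stress_return(weights, shocks) < loss_limit
    )


# Hypothetical finalist weights and shock magnitudes, for illustration only.
weights = {"equities": 0.6, "rates": 0.4}
scenarios = {
    "demo_scenario": {"equities": -0.20, "rates": -0.10},
    "mild_scenario": {"equities": -0.02, "rates": 0.01},
}
r = stress_return(weights, scenarios["demo_scenario"])
flags = flag_scenarios(weights, scenarios, loss_limit=-0.10)
```

Using the finalist's own weights, rather than a generic portfolio, is what makes the re-simulation specific to the candidate under review.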

CuratorAgent

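Owns: drafts

Compiles the base/bull/bear weights (each set sums to 1 and passes the constraints) and the decision-time fields. Reuses the hypothesis's statement and falsification conditions.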
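
A small sketch of the "sums to 1 and passes the constraints" check. The page does not list the actual constraints, so the long-only per-asset bounds used here (`min_w`, `max_w`) are an assumption, as are the function name `check_draft` and the three example drafts.

```python
import math


def check_draft(weights, min_w=0.0, max_w=1.0):
    """Return True if the weights sum to 1 and each sits inside [min_w, max_w].

    The bounds are hypothetical stand-ins for the pipeline's real constraints.
    """
    total_ok = math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9)
    bounds_ok = all(min_w <= w <= max_w for w in weights.values())
    return total_ok and bounds_ok


# Hypothetical drafts, for illustration only.
drafts = {
    "base": {"equities": 0.60, "rates": 0.40},
    "bull": {"equities": 0.70, "rates": 0.30},
    "bear": {"equities": 0.45, "rates": 0.55},
}
results = {name: check_draft(w) for name, w in drafts.items()}
```

Checking each scenario draft independently means a broken bear case cannot hide behind a valid base case.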

Orchestrator

Owns: provenance

Hashes a reproducible proposal_id, replays the audit log, writes the 5 artifacts, and appends to the leaderboard. Writes only to research/proposals/.
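
A reproducible ID can be derived by hashing a canonical serialization of the run's inputs. The sketch below assumes SHA-256 over sorted-key JSON, truncated to 12 hex characters to match the length of the ID shown in the worked example. Which fields the real Orchestrator hashes is not stated on this page, so the `config` / `code` / `data` payload is hypothetical and this function is not expected to reproduce the real ID.

```python
import hashlib
import json


def proposal_id(config, code_rev, data_rev, length=12):
    """Hash the run inputs into a short, order-independent hex ID."""
    payload = json.dumps(
        {"config": config, "code": code_rev, "data": data_rev},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


# Revisions copied from the worked example; the config dict is hypothetical.
pid_a = proposal_id({"provider": "rulebased", "deterministic": True}, "a0515a0", "376e75e")
pid_b = proposal_id({"deterministic": True, "provider": "rulebased"}, "a0515a0", "376e75e")
pid_c = proposal_id({"provider": "rulebased", "deterministic": True}, "a0515a1", "376e75e")
```

Nothing time-dependent or random enters the payload, so the same inputs give the same ID on any machine, while a change in code or data revision gives a different one.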

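Invariant 1 · Human gate

The pipeline writes only to research/proposals/ and never creates or modifies anything under data/. A test asserts that `git status data/` is exactly unchanged before and after a run. A human reviews the drafts and, only after accepting, manually copies the weights into the live book.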
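
The invariant can be illustrated with a before/after comparison of data/. The project's test compares `git status data/`; the sketch below swaps in a content-hash snapshot so that it runs without a git repository. The helper names and the temporary directory layout are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root):
    """Map each file under root to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_paths(data_dir, run):
    """Call run() and list files under data_dir that were added, removed, or modified."""
    before = snapshot(data_dir)
    run()
    after = snapshot(data_dir)
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp, "data")
    proposals = Path(tmp, "research", "proposals")
    data.mkdir()
    proposals.mkdir(parents=True)
    (data / "book.yaml").write_text("weights: {}\n", encoding="utf-8")

    # A well-behaved run writes only under research/proposals/.
    clean = changed_paths(
        data, lambda: (proposals / "proposal.md").write_text("draft\n", encoding="utf-8")
    )
    # A misbehaving run touches the live book and is caught.
    dirty = changed_paths(
        data, lambda: (data / "book.yaml").write_text("tampered\n", encoding="utf-8")
    )
```

A pipeline test would assert that the list is empty after a full run; any non-empty result names the files that broke the invariant.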

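Invariant 2 · Offline

No network access is required. The default provider is a deterministic rule engine; the Claude provider imports the Anthropic SDK internally only when a key is configured. CI runs the deterministic path, so proposal_id is reproducible.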
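
The "import only when a key is configured" behavior is a lazy import. The sketch below shows the pattern under stated assumptions: the class names, the `complete` method, and the use of the `ANTHROPIC_API_KEY` environment variable as the key source are hypothetical; only the `anthropic` package and its `Anthropic(api_key=...)` client constructor are real.

```python
import os


class RuleBasedProvider:
    """Deterministic default: no network, no SDK import."""
    name = "rulebased"
    model = None

    def complete(self, prompt):
        return f"[rulebased] {prompt}"


class ClaudeProvider:
    """Optional provider; the SDK is imported only when this class is instantiated."""
    name = "claude"

    def __init__(self, api_key):
        import anthropic  # lazy import: never executed on the default path
        self.client = anthropic.Anthropic(api_key=api_key)


def make_provider(env=None):
    """Pick the provider from the environment (key name is an assumption)."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY")
    return ClaudeProvider(key) if key else RuleBasedProvider()


# With no key configured, the deterministic path is selected.
provider = make_provider(env={})
```

Keeping the import inside the constructor means a CI machine without the SDK installed can still run the whole deterministic path.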

Worked example · the latest real proposal

proposal_id c4e3bf45fbb2 · provider rulebased · grid/seed · deterministic=true · code a0515a0 · data 376e75e

regime → hypothesis

Regime quadrant Q4 (growth momentum -0.519, inflation momentum +0.275). Overweight tilt orientation: commodities, rates. Underweight tilt orientation: equities. Coarse-class views are aggregated from the fine tactical_matrix (OW=+1/N=0/UW=-1, mean per coarse class); the sign sets the search bound orientation, the magnitude is searched.

commodities: OW · credit: N · equities: UW · rates: OW

Active signal factors: defensive, momentum

Selected strategy

base_allocator: 60_40
tilt_strength: 0
Replay Sharpe (multi-window): 0.9339
Sample: 85 obs · 3 splits · 162 trials

Red-team verdict

Deflated Sharpe: 0.3361 (SR0 1.67)
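Accepted: false
Stress-test method: finalist_asset_shock_resim
Stress flag: inf2022

This time the red team re-simulated all 5 historical scenarios with the finalist's own weights and flagged inf2022; the strict DSR gate returned accept=false. This is the system working as designed: on thin, single-regime data it should refuse to rubber-stamp. The drafts still stop at the human gate.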
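
For readers unfamiliar with the Deflated Sharpe Ratio, the sketch below follows the standard Bailey and López de Prado formulation: SR0 is the expected maximum Sharpe across trials under the null, and DSR is the probability that the observed Sharpe exceeds it. It is not the pipeline's own code. The 252 periods per year, the normal-return defaults (skew 0, kurtosis 3), and the function names are assumptions; the 0.95 threshold and the annualized-to-per-period conversion are taken from the disclosure on this page.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(sr_std, n_trials):
    """SR0: expected maximum per-period Sharpe over n_trials (>= 2) under the null."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr, sr0, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds sr0; sr and sr0 are per period."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return _NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


def dsr_gate(sr_annual, sr0_annual, n_obs, periods_per_year=252, threshold=0.95):
    """Strict gate: zero or negative evidence is rejected before any statistics."""
    if sr_annual <= 0:
        return False, 0.0
    scale = math.sqrt(periods_per_year)  # annualized SR -> per-period SR
    dsr = deflated_sharpe(sr_annual / scale, sr0_annual / scale, n_obs)
    return dsr >= threshold, dsr


# Inputs from the worked example; the daily frequency is an assumption.
accept, dsr = dsr_gate(sr_annual=0.9339, sr0_annual=1.67, n_obs=85)
```

Because the observed Sharpe sits below SR0, the probability is below 0.5 whatever the higher moments are, so the gate rejects the candidate under these assumptions.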

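⚠ What "replay Sharpe" actually measures here (honest disclosure)

The search ranks static weight vectors by marking them to market over several test windows. Signal z-scores are now recomputed per fold, using only data from before each window's start, so look-ahead is eliminated. It still ranks static weights, though (no per-fold model retraining beyond the signals), and transaction costs are not yet modeled (a single rebalance means turnover is always 0). So this is a "multi-window replay with train-only signals", not a validated expanding walk-forward. Read every number on this page as illustrative under the in_frame_pass / R0-R1 framework. The example above was regenerated with this train-only code and a dimension-corrected Deflated Sharpe (annualized SR converted to per-period, SR0 taken from the measured cross-trial variance): on thin, single-regime data, DSR ≈ 0.34, far below the 0.95 significance threshold, so the red-team critic rejected its own candidate. The system refuses to rubber-stamp, as designed. The remaining fixes (transaction costs and multiple rebalances) are tracked in docs/09 §0.2.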
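
The train-only rule can be sketched as a cross-asset momentum z-score that only ever reads prices strictly before the window start. The momentum definition (simple return over a lookback), the function names, and the toy price panel are assumptions made for illustration, not the engine's actual signal code.

```python
from statistics import mean, pstdev


def momentum(prices, window_start, lookback):
    """Simple return over `lookback` periods ending at the last pre-window price."""
    last = window_start - 1
    if last - lookback < 0:
        raise ValueError("not enough history before the window start")
    return prices[last] / prices[last - lookback] - 1.0


def fold_zscores(panel, window_start, lookback):
    """Cross-asset z-scores of momentum, using only data before window_start."""
    raw = {asset: momentum(p, window_start, lookback) for asset, p in panel.items()}
    values = list(raw.values())
    mu, sd = mean(values), pstdev(values)
    return {asset: 0.0 if sd == 0 else (v - mu) / sd for asset, v in raw.items()}


# Hypothetical price panel; indices 4 and 5 fall inside the test window.
panel = {
    "A": [100, 101, 102, 103, 110, 120],
    "B": [100, 100, 99, 98, 90, 80],
    "C": [50, 50, 50, 50, 50, 50],
}
z = fold_zscores(panel, window_start=4, lookback=3)

# Changing in-window prices must not move the signal (no look-ahead).
tampered = {a: p[:4] + [1, 1] for a, p in panel.items()}
z_tampered = fold_zscores(tampered, window_start=4, lookback=3)
```

The equality of the two results is the property a look-ahead test would assert for every fold.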

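Audit trail · what the orchestrator actually ran

Every provider call the orchestrator drove for this proposal, replayed from audit.jsonl in the order it happened. model is empty because this run used the deterministic rule engine; the Claude provider is imported only when a key is configured. RegimeAgent (pure read) and the Orchestrator (writes artifacts only) make no provider calls, so they do not appear here.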

%%{init: {"theme":"base","themeVariables":{"fontFamily":"ui-sans-serif, system-ui, -apple-system, Segoe UI, sans-serif","fontSize":"16px","background":"#faf9f5","actorBkg":"#efede4","actorBorder":"#a39e8f","actorTextColor":"#141413","actorLineColor":"#c4c0b2","signalColor":"#8a8576","signalTextColor":"#3c382f","noteBkg":"#dbe8f4","noteBorderColor":"#6a9bcc","noteTextColor":"#234a68","activationBkgColor":"#f6ddd0","activationBorderColor":"#d97757","sequenceNumberColor":"#faf9f5","labelBoxBkgColor":"#e1e8d3","labelBoxBorderColor":"#788c5d","labelTextColor":"#3c4a28","loopTextColor":"#3c4a28"},"sequence":{"useMaxWidth":false,"actorMargin":60,"boxMargin":14,"noteMargin":12,"messageMargin":42,"mirrorActors":true}}}%%
sequenceDiagram
  autonumber
  participant O as Orchestrator
  participant R as RegimeAgent
  participant S as Signals
  participant H as HypothesisAgent
  participant G as Generator+Search
  participant C as CriticAgent
  participant U as CuratorAgent
  participant L as audit.jsonl
  participant Hum as Human
  Note over O,L: offline · deterministic by default · provider=rulebased · model=none
  O->>R: read regime to coarse views
  R-->>O: views (no provider call, not logged)
  O->>S: compute factor z-scores
  S-->>O: momentum / defensive
  O->>H: state hypothesis + falsification
  H-->>L: log regime_summary, falsification
  H-->>O: thesis + 4 to 6 falsifiers
  O->>G: bounded multi-window replay search
  G-->>L: log search_space
  G-->>O: finalist params
  O->>C: critique (DSR + stress re-sim)
  C-->>L: log critique → accept=false, flag inf2022
  C-->>O: verdict
  O->>U: compile drafts
  U-->>L: log rationale
  U-->>O: base / bull / bear drafts
  O->>O: hash proposal_id · write 5 artifacts
  O->>Hum: hand off drafts + verdict
  Note over Hum: machine NEVER auto-accepts
  alt human approves
    Hum->>Hum: manually copy weights → data/
  else human declines
    Hum->>Hum: archive · zero deploy
  end

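Figure B · Sequence trace of the same run (scroll horizontally to see all 9 lanes). The blue note is the run mode; the orange activation bar marks the red-team critic; only the 4 agents that actually call a provider write to audit.jsonl; the run ends at the human gate (approve / decline).

Call-by-call log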

01

hypothesis · regime_summary provider=rulebased model=—

state the macro hypothesis from the regime view
02

hypothesis · falsification provider=rulebased model=—

falsification conditions for the hypothesis
03

generator · search_space provider=rulebased model=—

generate vN.2 search_spec for the current regime
04

critic · critique provider=rulebased model=—

critique the finalist from DSR + stress context
05

curator · rationale provider=rulebased model=—

decision rationale prose
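
An append-only JSON Lines file is enough to produce a log like the one above. The sketch below is an assumption about how such a log could be written and replayed: the field names (`agent`, `purpose`, `provider`, `model`, `prompt`) mirror the columns shown in the call log, but the real audit.jsonl schema is not given on this page.

```python
import json
import tempfile
from pathlib import Path


def log_call(path, agent, purpose, provider, model, prompt):
    """Append one provider call to the audit log as a single JSON line."""
    record = {
        "agent": agent,
        "purpose": purpose,
        "provider": provider,
        "model": model,      # None on the deterministic rule-engine path
        "prompt": prompt,
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def replay(path):
    """Read the log back in the order the calls were written."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    audit = Path(tmp, "audit.jsonl")
    log_call(audit, "hypothesis", "regime_summary", "rulebased", None,
             "state the macro hypothesis from the regime view")
    log_call(audit, "critic", "critique", "rulebased", None,
             "critique the finalist from DSR + stress context")
    records = replay(audit)
```

The sketch writes no timestamps, which is one way to keep the log byte-identical across deterministic runs; whether the real log does the same is not stated here.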

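Honest caveats

The committed price history is thin and single-regime (~120 trading days, one Q4 macro regime). The sample contains no bull/bear switches, so the regime tilt cannot be validated across regimes.
The stress shocks are range-level magnitudes estimated from per-scenario benchmarks, not measured per-ETF values; they serve only to validate the framework.
These proposals are illustrative and not robust. Do not deploy on this evidence alone; this is exactly why everything stops at the human gate.

Source: research/agents/* (orchestration + agents), research/engine/* (vN.1 engine + signals), research/search/* (vN.2 search). Each proposal produces 5 artifacts (proposal.md, rebalance_draft.yaml, decision_draft.yaml, audit.jsonl, config.yaml), all committed to the public repo with a reproducible proposal_id.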