W3 · drafted
2026.05.02
log · 2026.05.02 · ~2,400 words · 12 min read · commit a4f2e91

Pythia-6.9B emotion probing — a post-mortem

Planned as a clean replication; zero of fourteen findings ship as-is.

§ 0 TL;DR

I replicated the concept-vector probing from Anthropic's 2026 emotion paper on Pythia-6.9B. A second-pass audit knocked out all 14 of my findings — not because they are wrong, but because none clears the bar for "ship as-is." This log walks through how each one collapsed; the lesson is bigger than the findings would have been.

§ 1 setup

The replication targets the "emotion vectors" of the paper's §3.2: build a direction v in some residual-stream layer such that the projection score x · v correlates strongly with downstream emotion labels.

Formally, given a dataset D = {(x_i, y_i)}, take the last-token hidden state h_i ∈ ℝ^d per prompt and solve:

$$ v^\star = \arg\max_{v} \; \frac{1}{n}\sum_{i=1}^{n} y_i \,(h_i \cdot v) \;-\; \lambda \|v\|_2^2 $$

In practice I used the weight vector of a logistic probe, which approximates the maximizer above under ±1 labels, modulo regularization.

train_probe.py · L42 · python

# probe per layer, simple ridge-style logistic probe
from sklearn.linear_model import LogisticRegression

scores = {}
for layer in range(model.cfg.n_layers):
    H = cache[f"blocks.{layer}.hook_resid_post"][:, -1, :]          # last-token resid, train split
    H_val = cache_val[f"blocks.{layer}.hook_resid_post"][:, -1, :]  # validation-split cache (assumed, mirrors `cache`)
    probe = LogisticRegression(C=0.1, max_iter=2000)
    probe.fit(H.cpu().numpy(), y)
    scores[layer] = probe.score(H_val.cpu().numpy(), y_val)
Fig 01 · placeholder — per-layer probe val accuracy on Pythia-6.9B (real curve TBD)
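To turn a fitted probe into the direction v from §1's objective, normalize its weight vector. A minimal sketch, assuming binary labels (so probe.coef_ has shape (1, d)) and reusing probe and H from the loop above:

import numpy as np

w = probe.coef_[0]             # logistic weight vector, shape (d,)
v = w / np.linalg.norm(w)      # unit "emotion vector"
proj = H.cpu().numpy() @ v     # projection scores h_i · v, one per prompt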

§ 2 the 14 findings

The first pass yielded 14 "interesting" findings, sliced by layer, emotion, and model size. Three samples below.

  1. The layer-18 probe peaks at 0.83 accuracy, close to the paper's layer-19 peak on the 8B model.[1]
  2. The anger and sadness directions have cosine similarity 0.71, far above the random baseline of 0.04.
  3. Steering with +α along the "joy" direction lifts positive-adjective frequency 3.2× on neutral prompts (sketch after this list).
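For finding 3, the intervention adds α·v to the residual stream during generation. A minimal transformer-lens sketch; LAYER, alpha, and neutral_prompt are illustrative placeholders, not the values from my runs:

import torch

def steer_hook(resid, hook, v=v, alpha=8.0):  # alpha is a made-up magnitude
    # add alpha * v to the residual stream at every position
    return resid + alpha * torch.as_tensor(v, dtype=resid.dtype, device=resid.device)

LAYER = 18  # illustrative; use wherever the probe peaked
with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post", steer_hook)]):
    out = model.generate(neutral_prompt, max_new_tokens=50)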

§ 3 the audit · 14 holes

Second pass: I played reviewer on myself. For each finding I wrote down (a) what was controlled, and (b) what wasn't. Verdict: 14 of 14 missed at least one control that mattered.

"复现一篇论文,最容易自欺的地方是:你已经知道了答案。" "The easiest way to fool yourself when replicating a paper is that you already know the answer."

Example: the "anger/sadness cosine 0.71" finding above. I never controlled for prompt length — longer prompts have systematically larger hidden-state L2 norms, which biases the cosine high. After length normalization the number falls to 0.31: still significant, no longer dramatic.
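The fix is one line before fitting: divide each hidden state by its own L2 norm, then re-derive both directions. A sketch, assuming H_np is the stacked hidden states as a NumPy array and y_anger / y_sadness are the two binary label sets (names are mine):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_direction(H, y):
    # same ridge-style logistic probe as train_probe.py, returned as a unit vector
    w = LogisticRegression(C=0.1, max_iter=2000).fit(H, y).coef_[0]
    return w / np.linalg.norm(w)

H_norm = H_np / np.linalg.norm(H_np, axis=1, keepdims=True)  # the length control
cos = fit_direction(H_norm, y_anger) @ fit_direction(H_norm, y_sadness)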

§ 4 why none ships

Not because they're wrong. Because I hadn't separated "I saw a number" from "this number means X." Once I wrote the second sentence down, the missing controls walked themselves into the room.

§ 5 next

  • Rank the 14 by cost to fix, pick 3, and re-run them with proper controls.
  • Build a minimal replication script with length-norm, null-space, and cross-seed checks as default flags (skeleton after this list).
  • Next post: "5 checklist items I learned auditing myself" — more useful than the 14 findings would have been.
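A rough shape for those default-on flags; replicate.py, the flag names, and run_replication are all placeholders for a script that doesn't exist yet:

# replicate.py (sketch only; every name here is provisional)
import argparse

p = argparse.ArgumentParser()
p.add_argument("--length-norm", action=argparse.BooleanOptionalAction, default=True)
p.add_argument("--null-space", action=argparse.BooleanOptionalAction, default=True)
p.add_argument("--seeds", type=int, nargs="+", default=[0, 1, 2])
args = p.parse_args()

for seed in args.seeds:
    run_replication(seed=seed,                        # hypothetical entry point
                    length_norm=args.length_norm,
                    null_space_control=args.null_space)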
  1. EleutherAI/pythia-6.9b-deduped, step143000, loaded via transformer-lens.