2026.05.02 W3 · drafted
Pythia-6.9B emotion probing: a post-mortem
Planned as a clean replication. Zero of fourteen findings ship as-is.
§ 0 TL;DR
I replicated the concept-vector probing from Anthropic's 2026 emotion paper ("The Function of Emotion Concepts in Large Models") on Pythia-6.9B. A second-pass audit knocked out all 14 of my findings: not because they are wrong, but because none clears the bar for "ship as-is." This log walks through each failure, and the lesson matters more than the 14 conclusions themselves would have.
§ 1 setup
The replication targets the "emotion vectors" of §3.2 of the paper: build a direction v in some residual-stream layer such that the projection score x · v correlates strongly with downstream emotion labels.
Formally, given D = {(x_i, y_i)}, take the last-token hidden state h_i ∈ ℝ^d for each prompt and solve:
$$ v^\star = \arg\max_{\|v\|=1} \frac{1}{n}\sum_{i=1}^{n} y_i \,(h_i \cdot v) $$
In practice I used the weight vector of a logistic probe, which approximates the maximizer above under ±1 labels, modulo regularization.
train_probe.py · L42:

```python
# probe per layer, simple ridge-logistic
for layer in range(model.cfg.n_layers):
    H = cache[f"blocks.{layer}.hook_resid_post"][:, -1, :]
    probe = LogisticRegression(C=0.1, max_iter=2000)
    probe.fit(H.cpu().numpy(), y)
    scores[layer] = probe.score(H_val.cpu().numpy(), y_val)
```
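For context, this is how a fitted probe turns into the unit direction v from the formula above. A self-contained sketch on synthetic data; the variable names and the synthetic labels are mine, not the repo's:

```python
# Sketch: extract a unit "emotion direction" v from a fitted logistic probe.
# Synthetic stand-in data -- real H comes from the residual-stream cache.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 64))                    # fake hidden states, d=64
y = (H @ rng.normal(size=64) > 0).astype(int)     # labels from a hidden linear rule

probe = LogisticRegression(C=0.1, max_iter=2000).fit(H, y)
v = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit direction
score = H @ v                                        # projection score x . v
```

The normalization matters: scikit-learn's `coef_` has an arbitrary scale set by the regularizer, so only the direction is meaningful for the cosine comparisons later.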
§ 2 the 14 findings
The first pass yielded 14 "interesting" findings, sliced by layer, emotion, and model size. Three samples below.
- The layer-18 probe peaks at 0.83 accuracy, close to the paper's layer-19 peak on the 8B model.[1]
- The anger and sadness directions have cosine similarity 0.71, far above the random baseline of 0.04.
- Steering with +α along the "joy" direction lifts positive-adjective frequency 3.2× on neutral prompts.
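The steering arithmetic behind the third finding can be shown with a numpy toy; the direction, the α value, and the stand-in state here are invented for illustration, and the real intervention hooks the model's residual stream instead:

```python
# Toy sketch of activation steering: add alpha * v to a hidden state and
# watch the projection score move. Pure numpy stand-in for the real hook.
import numpy as np

rng = np.random.default_rng(0)
d = 64
v_joy = rng.normal(size=d)
v_joy /= np.linalg.norm(v_joy)        # unit "joy" direction (illustrative)

h = rng.normal(size=d)                # a last-token hidden state
alpha = 4.0
h_steered = h + alpha * v_joy         # the +alpha intervention

# The projection along v moves by exactly alpha; everything orthogonal
# to v is untouched, which is the selling point of direction steering.
delta = h_steered @ v_joy - h @ v_joy
```

That exactness only holds at the layer you intervene on; downstream nonlinearities mix the change back into other directions, which is why the effect has to be measured behaviorally (here, adjective frequency).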
§ 3 the audit · 14 holes
Second pass: I played reviewer on my own work. For each finding I wrote down (a) what was controlled and (b) what was not. Verdict: 14 of 14 missed at least one control that mattered.
"The easiest way to fool yourself when replicating a paper is that you already know the answer."
Example: the "anger/sadness cosine 0.71" finding above. I never controlled for prompt length. Long prompts have systematically larger hidden-state L2 norms, and that shared length-correlated component leaks into both probe directions, biasing their cosine upward. After length normalization the number falls to 0.31. Still significant, no longer dramatic.
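A toy model of that confound, with invented numbers: plant a shared component in two otherwise-independent "directions," and their cosine inflates; project the shared component out and it collapses back toward the random baseline:

```python
# Toy confound demo: a shared component (standing in for a length-
# correlated feature) inflates cosine between two unrelated directions.
import numpy as np

rng = np.random.default_rng(1)
d = 512
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)          # the confounding direction

u = 20.0 * shared + rng.normal(size=d)    # "anger" probe, contaminated
w = 20.0 * shared + rng.normal(size=d)    # "sadness" probe, contaminated

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw = cos(u, w)                           # inflated by the shared part

# Control: project the shared component out of both directions.
u_ctl = u - (u @ shared) * shared
w_ctl = w - (w @ shared) * shared
controlled = cos(u_ctl, w_ctl)            # collapses toward ~1/sqrt(d)
```

This is not the exact pipeline (there the fix is normalizing the hidden states before fitting), but the mechanism is the same: anything both probes absorb from a nuisance feature shows up as spurious alignment.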
§ 4 why none ships
Not because they are wrong. Because I had not separated "I saw a number" from "the number means X." The moment I wrote the second sentence down, the missing controls walked themselves into the room.
§ 5 next
- Rank the 14 by cost to fix, pick 3, and re-run them with proper controls.
- Build a minimal replication script with length-norm, null-space, and cross-seed checks as default flags.
- Next post: "5 checklist items I learned auditing myself," which will be more useful than the 14 findings would have been.
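A sketch of what controls-on-by-default could look like for that script. The flag names and CLI shape are my invention, not an existing tool:

```python
# Hypothetical CLI skeleton: the three audit controls default to ON,
# so skipping one requires an explicit --no-* flag in the run log.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="emotion-probe replication (sketch)")
    p.add_argument("--length-norm", action=argparse.BooleanOptionalAction,
                   default=True,
                   help="unit-normalize last-token hidden states before probing")
    p.add_argument("--null-space", action=argparse.BooleanOptionalAction,
                   default=True,
                   help="also fit probes on label-shuffled data as a null baseline")
    p.add_argument("--seeds", type=int, nargs="+", default=[0, 1, 2],
                   help="re-run across seeds and report variance")
    return p

args = build_parser().parse_args([])  # defaults: all controls enabled
```

The point of `BooleanOptionalAction` (Python 3.9+) is that every run's command line records which control was disabled, so "I forgot the control" becomes visible in the log instead of silent.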