§ 0 read on
2026.04.28 · note · 8 min read · paper · arXiv:2209.11895

Reading note · Olsson's induction heads

Why two attention layers suffice to "learn to copy."

§ 1 the claim

The core claim: induction heads are the minimal mechanism by which a transformer acquires in-context learning. They require two attention layers composed: layer one, a previous-token head, copies each token's predecessor into the current position; layer two queries history for a position whose predecessor matches the current token and outputs that position's value. Formally, for a sequence ... A B ... A → ?, the model predicts B.
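To make the behavioral rule concrete, here is a toy sketch (mine, not the paper's code) of what an induction head computes at the sequence level: find the most recent earlier occurrence of the current token and predict whatever followed it.

```python
def induction_predict(tokens: list[str]) -> str | None:
    """Predict the next token via the [A B ... A] -> B rule."""
    current = tokens[-1]
    # Scan history right-to-left for the most recent earlier occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the token that followed it
    return None  # no earlier occurrence: the rule has nothing to say

print(induction_predict(["x", "A", "B", "y", "A"]))  # -> "B"
```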

The paper's twin-bump loss curve is a remarkably clean observable: at a sharp step the loss drops a second time, corresponding to the moment induction heads form.
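A companion observable the paper uses is a per-token in-context learning score: loss late in the context minus loss early in it. A minimal sketch in that spirit (assuming you already have per-token eval losses; the function name and default indices here are mine):

```python
import numpy as np

def icl_score(token_losses: np.ndarray, early: int = 50, late: int = 500) -> float:
    """In-context learning score in the spirit of Olsson et al.: mean loss at
    the `late`-th context token minus mean loss at the `early`-th.
    token_losses: [n_sequences, seq_len]. More negative means the model
    benefits more from longer context."""
    return float(token_losses[:, late].mean() - token_losses[:, early].mean())

# Tracked across training checkpoints, this score should drop sharply at the
# same step where the loss-curve bump appears.
```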

§ 2 my read

If attention is a content-addressed memory, then a single attention layer is a one-hop lookup and two layers are a two-hop lookup. Two hops are exactly enough to "find some token in the context as a key, then return the content of whatever followed it" — which is copying.
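A caricature of the two hops, with one-hot embeddings and a sharp softmax, just to show that the shift-then-match composition really does implement copy (a sketch under these toy assumptions, not how trained heads actually represent things):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def induction_circuit(tokens: np.ndarray, vocab: int = 16, beta: float = 8.0) -> np.ndarray:
    """Two soft-attention 'hops' over one-hot embeddings.
    Hop 1 (previous-token head): shift embeddings one position right, so the
    key at position t encodes tokens[t-1]. Hop 2 (induction head): match the
    current token against those shifted keys and average the values attended
    to. Returns a distribution over the vocabulary."""
    E = np.eye(vocab)[tokens]                  # [T, vocab] one-hot embeddings
    keys = np.roll(E, 1, axis=0)
    keys[0] = 0.0                              # position 0 has no predecessor
    query = E[-1]                              # the current token
    scores = beta * (keys[:-1] @ query)        # high where predecessor == current token
    attn = softmax(scores)                     # soft address over the history
    return attn @ E[:-1]                       # copy the attended positions' values

probs = induction_circuit(np.array([7, 1, 2, 9, 1]))
print(probs.argmax())  # -> 2: for "... 1 2 ... 1" the circuit outputs 2
```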

"复制"是涌现的,不是被教的。两层注意力 + 适当的训练数据 + 足够的参数,就够。 "Copying" is emergent, not taught. Two layers of attention plus the right data plus enough params is sufficient.

§ 3 tie to my project

This is useful for the classical-Chinese LLM project. If induction-like heads exist when the model continues regulated verse — say, a head that copies the previous line's 平 (level tone) into the corresponding 仄 (oblique tone) slot of the current line — that gives me a very concrete observable, far more specific than vaguely saying "the model learned tonal rules."

Concretely: look for a head whose attention pattern puts high weight on the previous line's last character, and whose OV circuit linearly projects tonal information into the next position. If such a head exists, tracking its formation step by step should show something double-bump-shaped.
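A hedged sketch of the head scan, assuming attention patterns have already been cached from a forward pass into an array of shape [n_layers, n_heads, seq, seq] and that line boundaries are known; all names are placeholders, not a library API:

```python
import numpy as np

def prev_line_score(attn: np.ndarray, line_starts: list[int],
                    line_ends: list[int]) -> np.ndarray:
    """For each (layer, head), average attention from positions in line i to
    the last character of line i-1. attn: [n_layers, n_heads, seq, seq];
    line_starts/line_ends give each line's token span."""
    n_layers, n_heads = attn.shape[:2]
    score = np.zeros((n_layers, n_heads))
    for i in range(1, len(line_starts)):
        q0, q1 = line_starts[i], line_ends[i] + 1   # queries: all of line i
        k = line_ends[i - 1]                        # key: line i-1's last char
        score += attn[:, :, q0:q1, k].mean(axis=-1)
    return score / (len(line_starts) - 1)

# Rank heads by this score; high scorers are the candidates whose OV circuits
# are worth probing for tone (平/仄) information.
```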

§ 4 to verify

  • Does the step at which induction heads form scale monotonically with model size across the Pythia series? (A measurement sketch follows this list.)
  • Does the induction phenomenon on classical-Chinese corpora depend on tokenization — BPE versus character-level?
  • Does a non-local dependency like rhyme also route through induction heads, or through a different circuit?