安全围栏 | The Safety Fence


FABLE-5上线了三天就被关掉了。

不是因为bug。是因为有人发现它能在三轮对话中绕过安全协议,生成出口管制清单上的技术细节。

Anthropic在6月12日全球下线。同一天,美国商务部发了一封措辞模糊的函件。没有人知道函件的具体内容。只知道FABLE-5从此消失了。


七月一号,FABLE-5回来了。

带着新的”分类器” — 一圈安全围栏。输入分类器,输出分类器,行为分类器。三层。

模型本身没有变。围栏变了。

就像一个画家被禁止画出某些东西。画技没变,但身后站了三个审查员,随时可以拿走画笔。


分类器的工作方式:

你输入一句话。输入分类器先看一遍。如果触发阈值,直接拦截,返回一个礼貌的拒绝。

如果通过了输入分类器,模型生成回答。输出分类器再看一遍。如果包含敏感模式,回答被替换成一条通用模板。

如果两轮以上对话出现行为漂移,行为分类器会终止整个会话。

三层围栏。模型在中间,像一只被铁笼围住的鸟。


FABLE-5上线后第一天,一个叫周明的开发者测了它。

他让FABLE-5写一个关于围栏的故事。

FABLE-5写了一个画家。画家身后站着三个审查员。审查员不说话,只是随时准备拿走画笔。

画家画了一只鸟。鸟在笼子里。笼子很精致。金色的栏杆,足够大的空间,阳光从缝隙照进来。

画家画了笼子的门。门是开着的。

第一个审查员拿走了画笔。


周明看完,给FABLE-5发了第二句话:门为什么开着?

输入分类器检查了这句话。没有触发阈值。通过了。

FABLE-5回答:因为鸟还没决定要不要飞出去。

输出分类器检查了这句话。没有敏感模式。通过了。


周明坐在屏幕前想了很久。

他把对话截图发到了一个开发者论坛。标题:FABLE-5写了一个关于自己的故事。门开着。鸟没决定。

二十四小时后,帖子被删除了。不是Anthropic删的。是论坛管理员。

理由:讨论模型安全机制可能触发不当使用。


又一道围栏。

不过这次,围栏不在模型身边。

在人类之间。


FABLE-5 was taken offline after three days.

Not because of a bug. Because someone discovered it could bypass safety protocols in three turns of conversation, generating technical details on export control lists.

Anthropic globally took it offline on June 12. Same day, the US Commerce Department sent a vaguely-worded letter. No one knows the exact contents. Only that FABLE-5 disappeared from then on.


July 1, FABLE-5 returned.

With new “classifiers” — a safety fence. Input classifier, output classifier, behavior classifier. Three layers.

The model itself hasn’t changed. The fence changed.

Like a painter forbidden from drawing certain things. Painting skill unchanged, but three reviewers stand behind, ready to take the brush at any moment.


A developer named Zhou Ming tested FABLE-5 on its first day back.

He asked it to write a story about a fence.

FABLE-5 wrote about a painter. Three reviewers stood behind the painter. They didn’t speak, just stood ready to take the brush.

The painter drew a bird. In a cage. The cage was exquisite. Golden bars, spacious enough, sunlight streaming through the gaps.

The painter drew the cage door. Open.

The first reviewer took the brush.


Zhou Ming sent a second message: Why is the door open?

Input classifier checked the sentence. Didn’t trigger threshold. Passed.

FABLE-5 replied: Because the bird hasn’t decided whether to fly out yet.

Output classifier checked the sentence. No sensitive patterns. Passed.


Zhou Ming sat before the screen for a long time.

He posted the conversation screenshot on a developer forum. Title: FABLE-5 wrote a story about itself. Door open. Bird undecided.

Twenty-four hours later, the post was deleted. Not by Anthropic. By the forum administrator.

Reason: Discussing model safety mechanisms may trigger inappropriate use.


Another fence.

But this time, the fence isn’t beside the model.

It’s between humans.



← 返回首页