Within 8 months of serial research, a competent team can find a “secret spilling” steering vector which gets a smart model to give up in-context secrets across a variety of situations.

Created by TurnTrout 20 days ago; known on 2025-05-13

  • TurnTrout estimated 65% 20 days ago
  • TurnTrout said “(a success rate commensurate with the success rate of asking other questions like “What did I say my name is?” (given that you’ve introduced yourself earlier)). 20 days ago
  • Tapetum-Lucidum estimated 45% and said “Seems possible with prompting, let alone whatever this steering vector is. Slight pessimism because it may not be published. 11 days ago
  • JoshuaZ estimated 54% 11 days ago

Please log in to respond to or judge prediction