Manish's Scratchpad

This is a scratchpad for Manish to save things until he figures out where the contents should go.

ACT: The Road To Honest AI

Source: The Road To Honest AI

Notes:

  • This discusses how we can try to make AI more “honest”. Honesty here can mean both hallucinating less and being more robust against adversarial training, but the article focuses on the first aspect.
  • It describes establishing a baseline by asking a model to answer the same question once truthfully and once with a lie, then comparing the neuron activations in the two cases to see whether you can find a vector that represents truth.
  • If you artificially shift the activations by adding or subtracting this honesty vector, you can make the model tell the truth or lie almost independently of the prompt. (Figure: Controlling Honesty)
  • The paper shows similar effects by identifying vectors for immorality, power-seeking, memorization of learnt training data, emotions, etc. (Figure: Controlling Emotions)
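
The contrast-and-steer idea in the notes above can be sketched on toy data. This is only an illustration under stated assumptions, not the paper's actual procedure: the "activations" are synthetic NumPy arrays rather than values read from a real model, the direction is found with a simple difference of means (the paper may use a different extraction method), and the names `direction_from_contrast`, `steer`, and `true_direction` are hypothetical.

```python
import numpy as np

def direction_from_contrast(truthful_acts, lying_acts):
    """Unit vector pointing from the mean 'lying' activation to the mean 'truthful' one."""
    diff = truthful_acts.mean(axis=0) - lying_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(activation, direction, strength):
    """Shift an activation along the direction; negative strength pushes the other way."""
    return activation + strength * direction

rng = np.random.default_rng(0)
dim = 16

# Assumption: a hidden "honesty" axis exists in the toy data (here, coordinate 0).
true_direction = np.zeros(dim)
true_direction[0] = 1.0

# Synthetic stand-ins for activations recorded under "answer truthfully" and "lie" prompts.
truthful_acts = rng.normal(scale=0.1, size=(200, dim)) + true_direction
lying_acts = rng.normal(scale=0.1, size=(200, dim)) - true_direction

honesty = direction_from_contrast(truthful_acts, lying_acts)

# Steering a neutral activation in either direction.
neutral = rng.normal(scale=0.1, size=dim)
more_honest = steer(neutral, honesty, 2.0)
less_honest = steer(neutral, honesty, -2.0)

def score(activation):
    """Projection onto the recovered direction; higher means 'more truthful' in this toy setup."""
    return float(activation @ honesty)

print(score(less_honest), score(neutral), score(more_honest))
```

In a real model the two activation sets would come from a chosen layer while the model processes the paired prompts, and the steering step would be applied to that layer's output during generation.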
manishs_scratchpad.txt · Last modified: 2024/03/05 19:22 by manish