Differences

This shows you the differences between two versions of the page.

--- manishs_scratchpad [2024/03/05 19:22] – removed - external edit (Unknown date) 127.0.0.1
+++ manishs_scratchpad [2024/03/05 19:22] (current) – ↷ Page name changed from manish_scratchpad to manishs_scratchpad manish
@@ Line 1: / Line 1: @@
+===== Manish's Scratchpad =====
+This is a scratchpad for Manish to save things until he figures out where the contents should go.
+===== ACT: The Road To Honest AI =====
+Source: [[https://www.astralcodexten.com/p/the-road-to-honest-ai|The Road To Honest AI]]
+Notes:
+  * The article discusses how we can try to make AI more "honest". Honesty here can mean both less hallucination and more robustness against adversarial training, but the article focuses on the first aspect.
+  * It discusses the paper [[https://arxiv.org/pdf/2310.01405.pdf|"Representation Engineering: A Top-Down Approach To AI Transparency"]]
+  * It describes establishing a baseline by asking a model to answer the same questions both truthfully and deceptively, then comparing the neuron activations (the internal representations, rather than the trained weights) to see whether you can find a vector that represents truth.
+  * If you artificially modify the activations by adding or subtracting **the honesty vector**, you can make the model tell the truth or lie almost independently of the prompt. {{ :act_honest_ai_controlling_honesty.webp | Controlling Honesty}}
+  * The paper shows similar effects by identifying vectors for **immorality, power-seeking, memorization of learnt training data, emotions**, etc. {{ :act_honest_ai_controlling_emotions.webp | Controlling Emotions}}
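
The extract-then-steer idea in the notes above can be sketched on toy data. This is only an illustration under loud assumptions: the "activations" are synthetic NumPy arrays rather than hidden states from a real model, the direction is found with a simple difference of means (a stand-in, not necessarily the method used in the paper), and every name here (`extract_direction`, `steer`, `honesty_score`, `true_direction`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16   # assumed size of a toy hidden state
n_pairs = 200     # number of paired honest/lying prompts (synthetic)

# Hypothetical ground-truth direction, used only to generate the toy data.
true_direction = rng.normal(size=hidden_dim)
true_direction /= np.linalg.norm(true_direction)

# Synthetic "activations": the same base content, shifted along the
# direction in opposite ways for honest and lying answers, plus noise.
base = rng.normal(size=(n_pairs, hidden_dim))
honest_acts = base + true_direction + 0.1 * rng.normal(size=(n_pairs, hidden_dim))
lying_acts = base - true_direction + 0.1 * rng.normal(size=(n_pairs, hidden_dim))


def extract_direction(pos, neg):
    """Unit-length difference of the mean activations of two prompt sets."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(acts, direction, strength):
    """Add (strength > 0) or subtract (strength < 0) the direction."""
    return acts + strength * direction


def honesty_score(acts, direction):
    """Projection of each activation onto the direction."""
    return acts @ direction


honesty_vector = extract_direction(honest_acts, lying_acts)
cosine = float(honesty_vector @ true_direction)

# Steer some unrelated activations in both directions.
neutral = rng.normal(size=(5, hidden_dim))
more_honest = steer(neutral, honesty_vector, 4.0)
less_honest = steer(neutral, honesty_vector, -4.0)

print("cosine with planted direction:", round(cosine, 3))
print("scores after adding:     ", honesty_score(more_honest, honesty_vector))
print("scores after subtracting:", honesty_score(less_honest, honesty_vector))
```

In a real setting the arrays would be hidden states collected from a language model at a chosen layer, and the steering step would be applied during the forward pass. The toy version only shows why adding or subtracting a single vector shifts the projection in a predictable way regardless of the input.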