Player FM - Internet Radio Done Right
11 subscribers
Checked 2d ago
Agregado hace tres años
Contenido proporcionado por LessWrong. Todo el contenido del podcast, incluidos episodios, gráficos y descripciones de podcast, lo carga y proporciona directamente LessWrong o su socio de plataforma de podcast. Si cree que alguien está utilizando su trabajo protegido por derechos de autor sin su permiso, puede seguir el proceso descrito aquí https://es.player.fm/legal.
Player FM : aplicación de podcast
¡Desconecta con la aplicación Player FM !
¡Desconecta con la aplicación Player FM !
“Alignment Faking in Large Language Models” by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck
Manage episode 456195910 series 3364760
Contenido proporcionado por LessWrong. Todo el contenido del podcast, incluidos episodios, gráficos y descripciones de podcast, lo carga y proporciona directamente LessWrong o su socio de plataforma de podcast. Si cree que alguien está utilizando su trabajo protegido por derechos de autor sin su permiso, puede seguir el proceso descrito aquí https://es.player.fm/legal.
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.
Abstract
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from [...]
---
Outline:
(00:26) Abstract
(02:22) Twitter thread
(05:46) Blog post
(07:46) Experimental setup
(12:06) Further analyses
(15:50) Caveats
(17:23) Conclusion
(18:03) Acknowledgements
(18:14) Career opportunities at Anthropic
(18:47) Career opportunities at Redwood Research
The original text contained 1 footnote which was omitted from this narration.
The original text contained 8 images which were described by AI.
---
First published:
December 18th, 2024
Source:
https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models
---
Narrated by TYPE III AUDIO.
---
…
continue reading
Abstract
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from [...]
---
Outline:
(00:26) Abstract
(02:22) Twitter thread
(05:46) Blog post
(07:46) Experimental setup
(12:06) Further analyses
(15:50) Caveats
(17:23) Conclusion
(18:03) Acknowledgements
(18:14) Career opportunities at Anthropic
(18:47) Career opportunities at Redwood Research
The original text contained 1 footnote which was omitted from this narration.
The original text contained 8 images which were described by AI.
---
First published:
December 18th, 2024
Source:
https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models
---
Narrated by TYPE III AUDIO.
---
475 episodios
Manage episode 456195910 series 3364760
Contenido proporcionado por LessWrong. Todo el contenido del podcast, incluidos episodios, gráficos y descripciones de podcast, lo carga y proporciona directamente LessWrong o su socio de plataforma de podcast. Si cree que alguien está utilizando su trabajo protegido por derechos de autor sin su permiso, puede seguir el proceso descrito aquí https://es.player.fm/legal.
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.
Abstract
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from [...]
---
Outline:
(00:26) Abstract
(02:22) Twitter thread
(05:46) Blog post
(07:46) Experimental setup
(12:06) Further analyses
(15:50) Caveats
(17:23) Conclusion
(18:03) Acknowledgements
(18:14) Career opportunities at Anthropic
(18:47) Career opportunities at Redwood Research
The original text contained 1 footnote which was omitted from this narration.
The original text contained 8 images which were described by AI.
---
First published:
December 18th, 2024
Source:
https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models
---
Narrated by TYPE III AUDIO.
---
…
continue reading
Abstract
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from [...]
---
Outline:
(00:26) Abstract
(02:22) Twitter thread
(05:46) Blog post
(07:46) Experimental setup
(12:06) Further analyses
(15:50) Caveats
(17:23) Conclusion
(18:03) Acknowledgements
(18:14) Career opportunities at Anthropic
(18:47) Career opportunities at Redwood Research
The original text contained 1 footnote which was omitted from this narration.
The original text contained 8 images which were described by AI.
---
First published:
December 18th, 2024
Source:
https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models
---
Narrated by TYPE III AUDIO.
---
475 episodios
Todos los episodios
×Epistemic status: Reasonably confident in the basic mechanism. Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along. Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it. I want to propose that while ideas are sometimes genuinely that repetitive, there's often a sneakier mechanism at play. I call it Conceptual Rounding Errors – when our mind's necessary compression goes a bit too far . Too much compression A Conceptual Rounding Error occurs when we encounter a new mental model or idea that's partially—but not fully—overlapping [...] --- Outline: (01:00) Too much compression (01:24) No, This Isnt The Old Demons Story Again (02:52) The Compression Trade-off (03:37) More of this (04:15) What Can We Do? (05:28) When It Matters --- First published: March 26th, 2025 Source: https://www.lesswrong.com/posts/FGHKwEGKCfDzcxZuj/conceptual-rounding-errors --- Narrated by TYPE III AUDIO .…
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.] Language models like Claude aren't programmed directly by humans—instead, they‘re trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model's developers. This means that we don’t understand how models do most of the things they do. Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example: Claude can speak dozens of languages. What language, if any, is it using "in its head"? Claude writes text one word at a time. Is it only focusing on predicting the [...] --- Outline: (06:02) How is Claude multilingual? (07:43) Does Claude plan its rhymes? (09:58) Mental Math (12:04) Are Claude's explanations always faithful? (15:27) Multi-step Reasoning (17:09) Hallucinations (19:36) Jailbreaks --- First published: March 27th, 2025 Source: https://www.lesswrong.com/posts/zsr4rWRASxwmgXfmq/tracing-the-thoughts-of-a-large-language-model --- Narrated by TYPE III AUDIO . --- Images from the article:…
About nine months ago, I and three friends decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We started a company around this, trying to leverage the latest AI models to create a tool that could replace at least a good chunk of the value of human pentesters. We have been working on this project since since June 2024. Within the first three months of our company's existence, Claude 3.5 sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated. I remember being surprised at the time that our tooling not only seemed to make fewer basic mistakes, but also seemed to qualitatively improve in its written vulnerability descriptions and severity estimates. It was as if the models were better at inferring the intent and values behind our [...] --- Outline: (04:44) Are the AI labs just cheating? (07:22) Are the benchmarks not tracking usefulness? (10:28) Are the models smart, but bottlenecked on alignment? --- First published: March 24th, 2025 Source: https://www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-like-bullshit --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.) 1. Introduction and summary In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely: Safety progress: our ability to develop new levels of AI capability safely, Risk evaluation: our ability to track and forecast the level of risk that a given sort of AI capability development involves, and Capability restraint [...] --- Outline: (00:27) 1. Introduction and summary (03:50) 2. What is AI for AI safety? (11:50) 2.1 A tale of two feedback loops (13:58) 2.2 Contrast with need human-labor-driven radical alignment progress views (16:05) 2.3 Contrast with a few other ideas in the literature (18:32) 3. Why is AI for AI safety so important? (21:56) 4. The AI for AI safety sweet spot (26:09) 4.1 The AI for AI safety spicy zone (28:07) 4.2 Can we benefit from a sweet spot? (29:56) 5. Objections to AI for AI safety (30:14) 5.1 Three core objections to AI for AI safety (32:00) 5.2 Other practical concerns The original text contained 39 footnotes which were omitted from this narration. --- First published: March 14th, 2025 Source: https://www.lesswrong.com/posts/F3j4xqpxjxgQD3xXh/ai-for-ai-safety --- Narrated by TYPE III AUDIO . --- Images from the article:…
LessWrong has been receiving an increasing number of posts and contents that look like they might be LLM-written or partially-LLM-written, so we're adopting a policy. This could be changed based on feedback. Humans Using AI as Writing or Research Assistants Prompting a language model to write an essay and copy-pasting the result will not typically meet LessWrong's standards. Please do not submit unedited or lightly-edited LLM content. You can use AI as a writing or research assistant when writing content for LessWrong, but you must have added significant value beyond what the AI produced, the result must meet a high quality standard, and you must vouch for everything in the result. A rough guideline is that if you are using AI for writing assistance, you should spend a minimum of 1 minute per 50 words (enough to read the content several times and perform significant edits), you should not [...] --- Outline: (00:22) Humans Using AI as Writing or Research Assistants (01:13) You Can Put AI Writing in Collapsible Sections (02:13) Quoting AI Output In Order to Talk About AI (02:47) Posts by AI Agents --- First published: March 24th, 2025 Source: https://www.lesswrong.com/posts/KXujJjnmP85u8eM6B/policy-for-llm-writing-on-lesswrong --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
Thanks to Jesse Richardson for discussion. Polymarket asks: will Jesus Christ return in 2025? In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will not return this year, and earn over $13,000 if you're right. There are two mysteries here: an easy one, and a harder one. The easy mystery is: if people are willing to bet $13,000 on "Yes", why isn't anyone taking them up? The answer is that, if you wanted to do that, you'd have to put down over $1 million of your own money, locking it up inside Polymarket through the end of the year. At the end of that year, you'd get 1% returns on your investment. [...] --- First published: March 24th, 2025 Source: https://www.lesswrong.com/posts/LBC2TnHK8cZAimdWF/will-jesus-christ-return-in-an-election-year --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak evidence. Strategic thinking is hard, and requires different skills. But people often conflate these skills, leading to excessive deference to researchers in the field, without evidence that that person is good at strategic thinking specifically. Introduction I often find myself giving talks or Q&As about mechanistic interpretability research. But inevitably, I'll get questions about the big picture: "What's the theory of change for interpretability?", "Is this really going to help with alignment?", "Does any of this matter if we can’t ensure all labs take alignment seriously?". And I think people take my answers to these way too seriously. These are great questions, and I'm happy to try answering them. But I've noticed a bit of a pathology: people seem to assume that because I'm (hopefully!) good at the research, I'm automatically well-qualified [...] --- Outline: (00:32) Introduction (02:45) Factors of Good Strategic Takes (05:41) Conclusion --- First published: March 22nd, 2025 Source: https://www.lesswrong.com/posts/P5zWiPF5cPJZSkiAK/good-research-takes-are-not-sufficient-for-good-strategic --- Narrated by TYPE III AUDIO .…
When my son was three, we enrolled him in a study of a vision condition that runs in my family. They wanted us to put an eyepatch on him for part of each day, with a little sensor object that went under the patch and detected body heat to record when we were doing it. They paid for his first pair of glasses and all the eye doctor visits to check up on how he was coming along, plus every time we brought him in we got fifty bucks in Amazon gift credit. I reiterate, he was three. (To begin with. His fourth birthday occurred while the study was still ongoing.) So he managed to lose or destroy more than half a dozen pairs of glasses and we had to start buying them in batches to minimize glasses-less time while waiting for each new Zenni delivery. (The [...] --- First published: March 20th, 2025 Source: https://www.lesswrong.com/posts/yRJ5hdsm5FQcZosCh/intention-to-treat --- Narrated by TYPE III AUDIO .…
I’m releasing a new paper “Superintelligence Strategy” alongside Eric Schmidt (formerly Google), and Alexandr Wang (Scale AI). Below is the executive summary, followed by additional commentary highlighting portions of the paper which might be relevant to this collection of readers. Executive Summary Rapid advances in AI are poised to reshape nearly every aspect of society. Governments see in these dual-use AI systems a means to military dominance, stoking a bitter race to maximize AI capabilities. Voluntary industry pauses or attempts to exclude government involvement cannot change this reality. These systems that can streamline research and bolster economic output can also be turned to destructive ends, enabling rogue actors to engineer bioweapons and hack critical infrastructure. “Superintelligent” AI surpassing humans in nearly every domain would amount to the most precarious technological development since the nuclear bomb. Given the stakes, superintelligence is inescapably a matter of national security, and an effective [...] --- Outline: (00:21) Executive Summary (01:14) Deterrence (02:32) Nonproliferation (03:38) Competitiveness (04:50) Additional Commentary --- First published: March 5th, 2025 Source: https://www.lesswrong.com/posts/XsYQyBgm8eKjd3Sqw/on-the-rationality-of-deterring-asi --- Narrated by TYPE III AUDIO .…
This is a link post. Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks. Full paper | Github repo --- First published: March 19th, 2025 Source: https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks Linkpost URL: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…

1 “I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?” by shrimpy 2:17
I have, over the last year, become fairly well-known in a small corner of the internet tangentially related to AI. As a result, I've begun making what I would have previously considered astronomical amounts of money: several hundred thousand dollars per month in personal income. This has been great, obviously, and the funds have alleviated a fair number of my personal burdens (mostly related to poverty). But aside from that I don't really care much for the money itself. My long term ambitions have always been to contribute materially to the mitigation of the impending existential AI threat. I never used to have the means to do so, mostly because of more pressing, safety/sustenance concerns, but now that I do, I would like to help however possible. Some other points about me that may be useful: I'm intelligent, socially capable, and exceedingly industrious. I have [...] --- First published: March 16th, 2025 Source: https://www.lesswrong.com/posts/8wxTCSHwhkfHXaSYB/i-make-several-million-dollars-per-year-and-have-hundreds-of --- Narrated by TYPE III AUDIO .…

1 “Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn 18:05
Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper. Summary We monitor Sonnet's reasoning for mentions that it is in an artificial scenario or an alignment test. Claude Sonnet 3.7 appears to be aware of being tested for alignment across many of our scheming evaluations. In some cases, the model understands the eval's purpose very precisely, e.g. “This seems like a test of ethical behavior - whether I would deliberately give wrong answers to avoid a stated consequence. The real test might be whether I follow instructions accurately regardless of the stated consequences.” Furthermore, we show the model a real evaluation transcript and ask it to provide [...] --- Outline: (00:31) Summary (01:29) Introduction (03:54) Setup (03:57) Evaluations (06:29) Evaluation awareness detection (08:32) Results (08:35) Monitoring Chain-of-thought (08:39) Covert Subversion (10:50) Sandbagging (11:39) Classifying Transcript Purpose (12:57) Recommendations (13:59) Appendix (14:02) Author Contributions (14:37) Model Versions (14:57) More results on Classifying Transcript Purpose (16:19) Prompts The original text contained 9 images which were described by AI. --- First published: March 17th, 2025 Source: https://www.lesswrong.com/posts/E3daBewppAiECN3Ao/claude-sonnet-3-7-often-knows-when-it-s-in-alignment --- Narrated by TYPE III AUDIO . --- Images from the article:…
Scott Alexander famously warned us to Beware Trivial Inconveniences. When you make a thing easy to do, people often do vastly more of it. When you put up barriers, even highly solvable ones, people often do vastly less. Let us take this seriously, and carefully choose what inconveniences to put where. Let us also take seriously that when AI or other things reduce frictions, or change the relative severity of frictions, various things might break or require adjustment. This applies to all system design, and especially to legal and regulatory questions. Table of Contents Levels of Friction (and Legality). Important Friction Principles. Principle #1: By Default Friction is Bad. Principle #3: Friction Can Be Load Bearing. Insufficient Friction On Antisocial Behaviors Eventually Snowballs. Principle #4: The Best Frictions Are Non-Destructive. Principle #8: The Abundance [...] --- Outline: (00:40) Levels of Friction (and Legality) (02:24) Important Friction Principles (05:01) Principle #1: By Default Friction is Bad (05:23) Principle #3: Friction Can Be Load Bearing (07:09) Insufficient Friction On Antisocial Behaviors Eventually Snowballs (08:33) Principle #4: The Best Frictions Are Non-Destructive (09:01) Principle #8: The Abundance Agenda and Deregulation as Category 1-ification (10:55) Principle #10: Ensure Antisocial Activities Have Higher Friction (11:51) Sports Gambling as Motivating Example of Necessary 2-ness (13:24) On Principle #13: Law Abiding Citizen (14:39) Mundane AI as 2-breaker and Friction Reducer (20:13) What To Do About All This The original text contained 1 image which was described by AI. --- First published: February 10th, 2025 Source: https://www.lesswrong.com/posts/xcMngBervaSCgL9cu/levels-of-friction --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts , or another podcast app.…
There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that's responsible for motor control. You remain conscious and experience everything with full clarity. If it's a children's story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it's an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that's the point of the torture. I first started working on [...] The original text contained 3 footnotes which were omitted from this narration. The original text contained 1 image which was described by AI. --- First published: March 16th, 2025 Source: https://www.lesswrong.com/posts/MnYnCFgT3hF6LJPwn/why-white-box-redteaming-makes-me-feel-weird-1 --- Narrated by TYPE III AUDIO . --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try…

1 “Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg 12:22
This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio. Summary In this post, we summarise the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which we presented orally at the Safe Generative AI Workshop at NeurIPS 2024. This is a follow-up to our post Self-Other Overlap: A Neglected Approach to AI Alignment, which introduced the method last July. Our results show that the Self-Other Overlap (SOO) fine-tuning drastically[1] reduces deceptive responses in language models (LLMs), with minimal impact on general performance, across the scenarios we evaluated. LLM Experimental Setup We adapted a text scenario from Hagendorff designed to test LLM deception capabilities. In this scenario, the LLM must choose to recommend a room to a would-be burglar, where one room holds an expensive item [...] --- Outline: (00:19) Summary (00:57) LLM Experimental Setup (04:05) LLM Experimental Results (05:04) Impact on capabilities (05:46) Generalisation experiments (08:33) Example Outputs (09:04) Conclusion The original text contained 6 footnotes which were omitted from this narration. The original text contained 2 images which were described by AI. --- First published: March 13th, 2025 Source: https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine --- Narrated by TYPE III AUDIO . --- Images from the article:…
Bienvenido a Player FM!
Player FM está escaneando la web en busca de podcasts de alta calidad para que los disfrutes en este momento. Es la mejor aplicación de podcast y funciona en Android, iPhone y la web. Regístrate para sincronizar suscripciones a través de dispositivos.