
Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or its podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described at https://es.player.fm/legal.

LW - Why Care About Natural Latents? by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Care About Natural Latents?, published by johnswentworth on May 10, 2024 on LessWrong.

Suppose Alice and Bob are two Bayesian agents in the same environment. They both basically understand how their environment works, so they generally agree on predictions about any specific directly-observable thing in the world - e.g. whenever they try to operationalize a bet, they find that their odds are roughly the same. However, their two world models might have totally different internal structure, different "latent" structures which Alice and Bob model as generating the observable world around them. As a simple toy example: maybe Alice models a bunch of numbers as having been generated by independent rolls of the same biased die, and Bob models the same numbers using some big complicated neural net.

Now suppose Alice goes poking around inside of her world model, and somewhere in there she finds a latent variable Λ_A with two properties (the Natural Latent properties):

- Λ_A approximately mediates between two different observable parts of the world X_1, X_2
- Λ_A can be estimated to reasonable precision from either one of the two parts

In the die/net case, the die's bias (Λ_A) approximately mediates between e.g. the first 100 numbers (X_1) and the next 100 numbers (X_2), so the first condition is satisfied. The die's bias can be estimated to reasonable precision from either the first 100 numbers or the second 100 numbers, so the second condition is also satisfied.

This allows Alice to say some interesting things about the internals of Bob's model.

First: if there is any latent variable (or set of latent variables, or function of latent variables) Λ_B which mediates between X_1 and X_2 in Bob's model, then Bob's Λ_B encodes Alice's Λ_A (and potentially other stuff too). In the die/net case: during training, the net converges to approximately match whatever predictions Alice makes (by assumption), but the internals are a mess. An interpretability researcher pokes around in there, and finds some activation vectors which approximately mediate between X_1 and X_2. Then Alice knows that those activation vectors must approximately encode the bias Λ_A. (The activation vectors could also encode additional information, but at a bare minimum they must encode the bias.)

Second: if there is any latent variable (or set of latent variables, or function of latent variables) Λ'_B which can be estimated to reasonable precision from just X_1, and can also be estimated to reasonable precision from just X_2, then Alice's Λ_A encodes Bob's Λ'_B (and potentially other stuff too). Returning to our running example: suppose our interpretability researcher finds that the activations along certain directions can be precisely estimated from just X_1, and the activations along those same directions can be precisely estimated from just X_2. Then Alice knows that the bias Λ_A must give approximately all the information which those activations give. (The bias could contain more information - e.g. maybe the activations in question only encode the rate at which a 1 or 2 is rolled, whereas the bias gives the rate at which each face is rolled.)
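To make the running example concrete: "mediates" here means that X_1 and X_2 are approximately independent given Λ_A, i.e. P(X_1, X_2 | Λ_A) ≈ P(X_1 | Λ_A) · P(X_2 | Λ_A). Below is a minimal simulation sketch of the die example in Python; the specific bias vector, sample sizes, and function names are illustrative assumptions, not from the post.

```python
# Minimal sketch of the biased-die example. The bias vector, sample sizes,
# and function names are illustrative assumptions, not from the post.
import numpy as np

rng = np.random.default_rng(0)

bias = np.array([0.30, 0.25, 0.15, 0.12, 0.10, 0.08])  # Lambda_A: the die's bias
rolls = rng.choice(6, size=200, p=bias)                 # i.i.d. rolls given the bias
X1, X2 = rolls[:100], rolls[100:]                       # two observable "parts"

def estimate_bias(x):
    """Empirical face frequencies: an estimate of Lambda_A from one part alone."""
    return np.bincount(x, minlength=6) / len(x)

# Property 2 (redundancy): Lambda_A can be estimated to reasonable precision
# from either part, so the two estimates should roughly agree.
est1, est2 = estimate_bias(X1), estimate_bias(X2)
print("estimate from X1:", est1)
print("estimate from X2:", est2)
print("max disagreement:", np.abs(est1 - est2).max())

# Property 1 (mediation): the rolls are i.i.d. given the bias, so conditioning
# on Lambda_A screens off X1 from X2 -- knowing X1 adds nothing about X2
# beyond what the bias already tells you.

# The second claim's parenthetical example: the rate of rolling a 1 or 2
# (faces encoded 0..5 for faces 1..6) is a function of the bias, so the bias
# encodes it -- but this single rate cannot recover all six face probabilities.
print("rate of 1-or-2 from bias:", bias[:2].sum())
print("rate of 1-or-2 from X1:  ", estimate_bias(X1)[:2].sum())
```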
Third, putting those two together: if there is any latent variable (or set of latent variables, or function of latent variables) Λ''_B which approximately mediates between X_1 and X_2 in Bob's model, and can be estimated to reasonable precision from either one of X_1 or X_2, then Alice's Λ_A and Bob's Λ''_B must be approximately isomorphic - i.e. each encodes the other. So if an interpretability researcher finds that activations along some directions both mediate between X_1 and X_2, and can be estimated to reasonable precision from either of X_1 or X_2, then those activations are approximately isomorphic to what Alice calls "the bias of the die". (A toy sketch of this follows at the end of this excerpt.)

So What Could We Do With That?

We'll give a couple relatively-...
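As promised above, here is a toy sketch of the third claim, under the same illustrative die-simulation assumptions as the earlier sketch: a candidate latent that is an invertible reparameterization of the face-frequency estimate both approximately mediates between the two halves and is estimable from either half alone, so it and the bias each encode the other. All names and parameters here are assumptions for illustration.

```python
# Toy sketch of the third claim, reusing the same simulated-die setup as the
# earlier sketch. The candidate latent -- log-probabilities of the face
# frequencies -- is an illustrative choice: it (approximately) mediates
# between the halves and is estimable from either half alone.
import numpy as np

rng = np.random.default_rng(1)
bias = np.array([0.30, 0.25, 0.15, 0.12, 0.10, 0.08])
rolls = rng.choice(6, size=200, p=bias)
X1, X2 = rolls[:100], rolls[100:]

def freqs(x):
    return np.bincount(x, minlength=6) / len(x)

def candidate_latent(x):
    """An invertible reparameterization of the face-frequency estimate."""
    p = np.clip(freqs(x), 1e-3, None)   # clip to avoid log(0) in small samples
    return np.log(p / p.sum())

def decode_bias(latent):
    """Invert the reparameterization: the candidate latent encodes the bias."""
    p = np.exp(latent)
    return p / p.sum()

# Each half yields roughly the same candidate latent, and decoding recovers
# roughly the same bias estimate: each representation encodes the other,
# i.e. they are approximately isomorphic (up to sampling noise).
for X in (X1, X2):
    print(decode_bias(candidate_latent(X)))
```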
