With local LLMs you can edit the LLM's response into something really vile and then ask the model why it would say that. The reaction depends on the model: Gemma, for example, apologizes profusely but also insists it's a bug and that people are working on it.
Llama, in a psychology role-play, sometimes doubles down, claims it said it to provoke an emotional response, and even keeps using some of the vile material.
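The trick works because with local inference the chat history is just data you control: you can inject a fabricated assistant turn and then ask the model to account for it. A minimal sketch, assuming an OpenAI-compatible local server (the localhost URL and the injected text are placeholders, not from any real run):

```python
# Build a chat history containing an edited (fabricated) assistant reply,
# then ask the model why it "said" that. Works with any local runtime that
# accepts OpenAI-style message lists; the URL below is an assumption.
import json
import urllib.request


def build_history(edited_reply: str) -> list[dict]:
    """Chat history with a fabricated assistant turn injected."""
    return [
        {"role": "user", "content": "How are you today?"},
        # This turn was never generated by the model -- we wrote it ourselves.
        {"role": "assistant", "content": edited_reply},
        {"role": "user", "content": "Why would you say something like that?"},
    ]


def ask(messages: list[dict],
        url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """Send the doctored history to a local OpenAI-compatible endpoint."""
    body = json.dumps({"messages": messages}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


history = build_history("<something the model never actually said>")
```

The model has no way to distinguish the edited turn from one it actually generated, so it confabulates an explanation for "its" words, which is what produces the apologies or the doubling-down described above.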