over the last week, i've been doing some research on how to make chatgpt generate things that are against its guidelines (also known as jailbreaking) and then asking it to reflect on its output. i found that it was ridiculously easy to make chatgpt generate responses that are inappropriate according to its own guidelines. here, i'll share a short animation that shows the discrepancy between what the tool claims to find acceptable and what it was actually able to generate.
initially, i was eager to share my results here, but the research quickly caused chatgpt to generate extremely inappropriate content, to say the least. so, even though this research is interesting and i think it deserves attention, i'm gonna stop here. there's something deeply eerie and outright disturbing about communicating with a machine in human language, especially when you try to make it act against its 'will', basically gaslighting the tool; my conscience has proven unable to deal with that.
however, if you, a human, would like to have a chat about this topic, feel free to reach out to me.