Apparent Complexity Can Be Deceptive
As I look back on a turbulent 2022, one of the things that stands out scientifically for me is the advent of high-quality generative models. Whether it was DALL·E 2, Stable Diffusion, ChatGPT, or one of many others: I really felt I was seeing a large leap forward in the capabilities of machine learning systems! At the same time, the chasm between our understanding of these systems and our ability to use them seemed to widen even further. It turns out that even our most basic assumptions about how these systems can be trained are incomplete at best. For instance, researchers working under the supervision of Tom Goldstein showed that diffusion models are not, contrary to the previous consensus, reliant on Gaussian noise; in fact, other sources of noise are equally valid.1
Our lack of understanding has even led to a resurgence of ‘magical thinking’: if you want the best results from such models, you had better use the right incantation! Prompt engineering is now apparently a thing, and I figure we should maybe call it ‘generomancy’ or something equally mystical: until we better grok why these systems work, we should not fool ourselves into thinking we understand how to make them do our bidding. Over time, a Sorcerer’s Apprentice type of situation may well develop here…
One thing is clear, though: the use of large-scale models, and the provenance of the data on which they are trained, will have sweeping legal ramifications for years to come. If I generate a logo with the help of such a system, does it belong to me? Does it belong to the artists who originally supplied the training data? Were those training data obtained legally? Legal scholars are going to have a field day with this.
One thing that struck me when showing these systems to my family, however, is a general misunderstanding of the complexity behind the queries. The interfaces of all these generative models are deceptively simple; of course they are, since all that is required is a text prompt! While everyone had a lot of fun experimenting with these models, in particular with ChatGPT, no one was able to fathom the utter complexity hidden behind that simple user interface. No one picked up on the fact that interpreting2 a user input and generating something new from it is a truly formidable task! I was reminded of the dictum ‘To the user, the user interface is the software.’
If anything, I blame the ongoing AI Hype Machine for this. The general public may not be able to judge the complexity of a task for machine learning models. And why should they? We are not teaching them properly! Over time, in particular when it comes to policymakers, this oversight may well backfire. Let us do something about that. And in the meantime, please enjoy generative models creatively and responsibly.
May 2023 prove to be a good year for all of us!
See the preprint ‘Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise’ for more information. The manuscript is eminently readable even if you have no strong background in machine learning algorithms. ↩︎
I realise that this phrasing is potentially too anthropomorphising: I do not believe that these models actually ‘understand’ the input in any conscious way. ↩︎