Two months ago, OpenAI released DALL-E, an advanced neural network that generates images from text prompts and a natural extension of its powerful language model GPT-3.
Last summer, I was privileged to be among the first in the world to get access to GPT-3. Like others in the AI community, I was blown away by its performance. Recently, OpenAI has been signaling that it will soon open access to DALL-E’s API. In light of this upcoming release, it’s worth understanding how DALL-E was built, what’s unique about it compared to current trends in ML research, and its inherent risks.
The ability to generalize
To build DALL-E, GPT-3’s creators refined their model so it would be able to manipulate visual concepts through language. To achieve that, they trained the model on 400 million unlabeled image-text pairs collected from the internet. To demonstrate DALL-E’s capabilities, OpenAI shared a few examples in a blog post, using a series of interactive visuals. For example, the prompt “A cube made of porcupine. A cube with the texture of a porcupine” yielded the following results:
These images showcase the model’s impressive capabilities: it was able to capture and render the shape of a specific animal, understand and map textures, and combine them onto a three-dimensional cube.
This raises an interesting question: out of the 512 candidate results generated for every textual prompt, how does the model rank the best matches? How does it prioritize the most realistic generations?
The algorithm responsible for this ranking is called CLIP. In short, CLIP is another model from OpenAI that scores how well a textual description matches an image’s content. It’s based on (1) the Transformer architecture, (2) a combination of more than one input type, in this case text and image, and (3) zero-shot learning, i.e. the ability to generalize and perform tasks it wasn’t specifically trained to perform.
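Since OpenAI has open-sourced CLIP (github.com/openai/CLIP), this reranking step can be sketched with its `clip` Python package. The candidate image files below are hypothetical stand-ins for DALL-E’s 512 generations; the DALL-E blog post shows the top 32 of the 512 samples.

```python
# A minimal sketch of CLIP-based reranking. Candidate file names are
# hypothetical stand-ins for DALL-E's generations.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompt = "a cube with the texture of a porcupine"
candidate_paths = ["gen_000.png", "gen_001.png"]  # ... up to 512 generations

# Encode the prompt and all candidates into CLIP's shared embedding space.
text = clip.tokenize([prompt]).to(device)
images = torch.stack([preprocess(Image.open(p)) for p in candidate_paths]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)
    image_features = model.encode_image(images)

# Cosine similarity between the prompt and each candidate image.
text_features /= text_features.norm(dim=-1, keepdim=True)
image_features /= image_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(1)

# Keep the best-matching generations.
top = scores.topk(k=min(2, len(candidate_paths))).indices
print([candidate_paths[i] for i in top])
```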
Anticipating skepticism about DALL-E’s ability to generalize, with critics claiming the model is merely copying existing images from the internet, OpenAI published image-text combinations that have a very low probability of being found publicly, such as “a snail made of cabbage”.
The architecture behind it all
GPT-3 has shown that vast neural networks trained on immense datasets can master a complex domain such as language understanding and perform intricate tasks like text generation. In turn, DALL-E has proven that the same kind of neural network is capable of generating high-quality images. Both were built on the Transformer architecture, an approach published in 2017 by researchers from Google Brain that exhibited promising results across multiple NLP tasks. Since then, Google has incorporated it into multiple products, including its search engine, translation service, and more.
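At the heart of that architecture is a single operation, scaled dot-product self-attention, which lets every token in a sequence weigh every other token. Below is a minimal sketch of one attention head in PyTorch; the dimensions are illustrative, not DALL-E’s actual configuration.

```python
# A minimal sketch of one head of scaled dot-product self-attention, the
# core operation of the Transformer ("Attention Is All You Need", 2017).
# Dimensions are illustrative, not DALL-E's actual configuration.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every position attends to every other; the softmax weights say how much.
    scores = (q @ k.T) / math.sqrt(k.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

seq_len, d_model, d_head = 10, 64, 16
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([10, 16])
```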
The future of creativity
Naturally, DALL-E’s early adopters will be the developer community, research institutes, and startups. Specifically, designers could enhance websites and build landing pages in a matter of seconds, saving hours of tinkering with the creation of icons and illustrations. Business intelligence specialists could create complex graphs and dashboards based on a table description, and marketing professionals could come up with assets for their next campaign within minutes.
OpenAI has declared the model will also be able to generate results in 3D, which can then be used by architects to visualize buildings or by archaeologists to bring ancient structures to life. Additionally, it provided a sneak peek at the potential impact on the fashion world by demonstrating the output of the prompt “a female mannequin dressed in a navy leather jacket and red mini skirt”.
It’s still too early to proclaim that professions will be replaced by this technology, but one thing is certain: companies adopting it can unlock a whole new level of creativity, eventually leading to faster results and a shorter time to market.
Human intelligence and general-purpose models
These days, most AI applications in production are vertical, focusing on solving a specific task such as extracting information from an invoice. By now it’s commonly understood that narrow AI is nowhere near equivalent to human intelligence, even when conducting one specific task. For example, even a state-of-the-art deep learning model for early-stage cancer detection (vision) is limited in its performance when it’s missing the patient’s charts (text) from her electronic health records.
On the promise of combining language and vision, OpenAI’s Chief Scientist Ilya Sutskever said that in 2021 OpenAI will strive to expose its models to new stimuli: “Text alone can express a great deal of information about the world, but it’s incomplete because we live in a visual world as well”. He then added: “this ability to process text and images together should make models smarter. Humans are exposed to not only what they read but also what they see and hear. If you can expose models to data similar to those absorbed by humans, they should learn concepts in a way that’s more similar to humans”.
Models that combine more than one input source are considered multimodal. Although this field has been extensively researched in the past, it has been picking up interest once again in the last several years, with promising results such as Facebook’s recent research in automatic speech recognition (ASR), which showcased major progress by combining audio and text.
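To make “multimodal” concrete, here is a minimal sketch of the contrastive objective that CLIP-style models use to align two modalities in one shared embedding space. The encoders below are stand-in linear layers and the dimensions are arbitrary; real models use a Transformer for text and a ResNet or Vision Transformer for images.

```python
# A minimal sketch of a CLIP-style contrastive objective aligning two
# modalities. Encoders and dimensions are illustrative stand-ins.
import torch
import torch.nn.functional as F

batch, d_text, d_image, d_shared = 8, 512, 768, 256
text_encoder = torch.nn.Linear(d_text, d_shared)
image_encoder = torch.nn.Linear(d_image, d_shared)

text_inputs = torch.randn(batch, d_text)    # stand-in text features
image_inputs = torch.randn(batch, d_image)  # stand-in image features

# Project both modalities into the shared space and L2-normalize.
t = F.normalize(text_encoder(text_inputs), dim=-1)
i = F.normalize(image_encoder(image_inputs), dim=-1)

# Each (text, image) pair in the batch is a positive; all others are negatives.
logits = t @ i.T / 0.07  # 0.07 is an illustrative temperature hyperparameter
labels = torch.arange(batch)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
print(loss.item())
```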
Regulation and risks
Unlike other, narrow AI applications, DALL-E’s ability to support a wide variety of prompts with human-like quality makes it hard to distinguish whether its outputs were made by a human. Moreover, newer and more advanced versions might support the generation of full videos, or incorporate image prompts in a way that will make deepfake generation accessible to the masses.
In an era where polarization is at its peak and deepfakes are rising in popularity, DALL-E might be the tipping point for some kind of regulation to take place. OpenAI is already implicitly acknowledging the harmful potential of its technology by self-regulating and by maintaining a thorough screening process. To mitigate these shortcomings in production, OpenAI has developed its own technique for making its model more reliable through a combination of reinforcement learning and human feedback. In this workflow, a human operator curates the model both by labeling harmful content, e.g. a completion that contains biased or toxic text, and by flagging false alarms, cases where the model mistakenly judges a safe generation to be unsafe.
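As a rough illustration of that workflow, here is a minimal, hypothetical sketch of the curation loop. OpenAI has not published this interface; all names below are assumptions.

```python
# A hypothetical sketch of the human-in-the-loop curation workflow described
# above; OpenAI has not published this interface. An operator reviews each
# generation alongside the automatic safety filter's verdict, producing
# labels that can later train a better filter or serve as a reward signal.
from dataclasses import dataclass

@dataclass
class Review:
    prompt: str
    generation_id: str
    filter_flagged_unsafe: bool  # the model's own safety verdict
    operator_label: str          # "harmful", "safe", or "false_alarm"

def curate(prompt: str, generation_id: str, filter_flagged_unsafe: bool) -> Review:
    # The operator inspects the generation and labels it by hand.
    label = input(f"[{generation_id}] filter unsafe={filter_flagged_unsafe}. Label (harmful/safe): ")
    # A "false alarm" is a safe generation the filter mistakenly flagged.
    if filter_flagged_unsafe and label == "safe":
        label = "false_alarm"
    return Review(prompt, generation_id, filter_flagged_unsafe, label)
```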
With no consensus in sight, the main challenge will be understanding in which ways, if at all, regulation is an effective tool for ensuring safe AI adoption. Nevertheless, for the time being: can we trust a commercial company to self-regulate? What happens once such a company faces a trade-off between ethics and revenue?
OpenAI was founded in 2015 by Elon Musk, Sam Altman, and others, with the promise to build artificial general intelligence (AGI) that is safe and benefits all of humanity. In its charter, OpenAI stated that if another project comes close to achieving AGI before it does, it commits to stop competing with that project and start assisting it.
Although both DALL-E and CLIP are not yet fully stable and reliable, and many will write about their shortcomings and risks, there’s no doubt both are another stepping stone on the way to achieving AGI. Both were developed within less than a year of GPT-3’s release, with the main breakthrough being significantly more robust performance thanks to exposure to both text and images, i.e. multimodality. This old-new approach can take us to the next level of generalization, surpassing previous levels of AI performance, which had plateaued in recent years.
Since the DALL-E blog post last January, the AI community has been eagerly awaiting access to the API. People already have plenty of ideas for what can be built with it, and judging by the months that followed GPT-3’s release, we can all expect our Twitter feeds to be filled with mind-blown people in front of mind-blowing GIFs.
Written by Sahar Mor, founder of AirPaper.ai