Recently I was amazed by the current level of AI drawing. The rapid progress of today's AI painting may have far surpassed everyone's expectations. In this article, we will go through the history of AI drawing, its recent breakthroughs, and the glories and controversies surrounding it.
If you are just curious and want to try right now, you can go to our free AI Text to Image tool to play around.
Table of Contents
- 2023 is a year of AI Painting
- The History of AI Painting
- Why AI painting advances dramatically
- Comparison of top AI painting models: Stable Diffusion vs. MidJourney
- What a Breakthrough in AI Painting Means for Humanity
Since the beginning of this year, AI painting tools that take text prompts and automatically generate pictures have suddenly sprung up.
The first was Disco Diffusion.
Disco Diffusion is an AI model that converts text to images, using prompts that describe the scene.
In April 2022, the famous AI team OpenAI also released a new model called DALL-E 2. The name combines the painter Salvador Dalí and the robot WALL-E from the film of the same name.
Many people, however, probably began to pay special attention to AI drawing after the following news about one AI work:
This is a digital oil painting generated by the AI painting service MidJourney. Jason Allen, who generated the painting, entered it in the art competition at the Colorado State Fair in the United States and won first place. This triggered a huge debate on the Internet.
At present, the technology of AI painting is developing rapidly and iterating at an astounding pace. Even if you compare the AI painting of the beginning of this year with the current one, you can see a huge difference. At the beginning of the year, Disco Diffusion could generate some atmospheric sketches, but it was basically unable to generate human faces. Only two months later, DALL-E 2 could already generate accurate facial features. Right now, the most powerful model, Stable Diffusion, has made tremendous progress in both the sophistication and the speed of its painting.
The technology of AI painting is not new, but since the beginning of this year the quality of AI-produced works has been improving at lightning speed. The drawing time has also been shortened from hours to just a few seconds.
What happened behind the scenes? Let us first review the history of AI painting and then look at how the breakthrough in AI painting technology has come to shake things up.
AI painting may have appeared earlier than many people think.
Computers were still a young technology when, in the 1970s, the artist Harold Cohen (a painter and professor at the University of California, San Diego) began to create "AARON", a computer program for making paintings. Unlike current AI drawing tools, AARON really acted as an artist: it controlled a robotic arm to paint.
Harold's improvement of AARON continued for decades until his death. In the 1980s, AARON "mastered" the drawing of three-dimensional objects. In the 1990s, AARON was able to paint with multiple colors. It is said that AARON is still painting today.
However, AARON is not open source, so the details of how it paints remain unknown. We can infer that AARON simply encodes Harold's own understanding of painting in a complex program. That is why, even after decades of iteration, AARON is still limited to generating colorful abstract paintings, which is exactly Harold Cohen's own painting style. Harold spent decades expressing his own understanding of art through the program, which guided the robotic arm on the canvas.
Although it is difficult to say how intelligent AARON is, as the first program to paint automatically on canvas, it well deserves the title of the originator of AI painting.
In 2006, a similar painting product called “the Painting Fool” appeared. It can observe photos, extract block color information in photos, and use real painting materials such as paint, pastels, or pencils to create.
The above two examples are relatively "classical" forms of automatic computer painting. They are a bit like a toddler learning how to walk. Their intelligence is still quite rudimentary.
Today, the concept of "AI painting/AI drawing" refers more to computer programs that draw automatically using deep learning models. This way of drawing is actually relatively new.
In 2012, two of Google's famous AI experts, Andrew Y. Ng and Jeff Dean, conducted an unprecedented experiment. They used 16,000 CPU cores to train what was then the world's largest deep learning network, with the aim of teaching a computer to draw cat pictures. They used 10 million images taken from YouTube videos, trained for 3 days, and finally got a model that, excitingly enough, could produce a very blurry cat face.
By today's standards, the training efficiency and output quality of this model are not worth mentioning. But for the AI research field at that time, it was a historic breakthrough: it officially opened the era of "new" AI painting supported by deep learning models.
Let us dive into a few technical details here. How complex is AI painting based on a deep learning model? Why did multi-day training on a large-scale modern computer cluster in 2012 produce such a limited result?
Many readers will have a basic understanding of deep learning. To simplify, training a deep learning model means feeding it a large amount of externally labeled training data and repeatedly adjusting the model's internal parameters until its output matches the expected output. Teaching AI to paint, then, means feeding training data made of existing paintings to an AI model and iteratively adjusting its parameters.
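As a minimal sketch of that "adjust the parameters until the output matches" loop, the toy example below fits a one-parameter model `y = w * x` to labeled pairs using gradient descent. The data, the learning rate, and the function name `train` are illustrative assumptions; a real painting model has many millions of parameters, but the loop has the same shape.

```python
def train(pairs, lr=0.01, steps=500):
    """Fit y = w * x to labeled (x, y) pairs by gradient descent."""
    w = 0.0  # the model's single internal parameter
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad  # nudge the parameter toward the expected output
    return w

# Labeled training data produced by the "true" rule y = 3x (made up for this sketch).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
learned_w = train(data)
print(round(learned_w, 3))
```

After enough steps the learned parameter settles very close to 3, the value that generated the labels.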
But how much information does a painting carry? At the lowest level, there are width x height RGB pixels. The simplest starting point for a computer learning to paint is to train an AI model to output regular pixel combinations. However, a painting is much more than a simple combination of RGB pixels. Every stroke, with its position, shape, color and many other parameters, contributes to the painting, so the information contained in a painting is enormous. The computational cost of training a deep model grows very rapidly as the number of possible input combinations increases. That is why the training process is much harder than it may appear.
After Andrew Ng and Jeff Dean's groundbreaking cat face model, AI scientists began to invest in this new challenge. In 2014, the AI community proposed a very important deep learning model, the famous GAN (Generative Adversarial Network).
As the name "Generative Adversarial" suggests, the core idea of this deep learning model is to train two internal models, the "generator" and the "discriminator", against each other in a zero-sum game until the discriminator is fooled about half the time, meaning the generator is producing plausible examples.
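To make the "fooled about half the time" condition concrete, here is a small sketch. It relies on a standard result from the original GAN analysis: for a fixed generator, the best possible discriminator outputs `p_data(x) / (p_data(x) + p_gen(x))` for a sample `x`. The three-item distributions below are made-up toy values, not data from any real model.

```python
def optimal_discriminator(p_data, p_gen):
    """Best achievable 'this is real' score for each possible sample."""
    return {x: p_data[x] / (p_data[x] + p_gen[x]) for x in p_data}

# Toy probability distributions over three possible "images" (hypothetical).
p_data = {"a": 0.5, "b": 0.3, "c": 0.2}
early_generator = {"a": 0.1, "b": 0.1, "c": 0.8}  # a poor imitation of p_data
trained_generator = dict(p_data)                  # a perfect imitation

early_scores = optimal_discriminator(p_data, early_generator)
final_scores = optimal_discriminator(p_data, trained_generator)
print(early_scores)
print(final_scores)
```

With the poor generator the discriminator can tell samples apart with confidence; once the generator matches the data distribution, every score collapses to 0.5, which is the coin-flip situation described above.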
The GAN model rapidly gained popularity and has been widely used in many fields. It has also become the basic framework of many AI painting models, in which the generator produces pictures and the discriminator judges their quality. The appearance of GAN greatly advanced the development of AI painting.
However, using the basic GAN model for AI painting also has obvious defects. On the one hand, control over the output is weak: it tends to generate random images, while the output of an AI artist should be stable. Another problem is that the generated images have low resolution. While the resolution problem is relatively easy to solve, GAN still has a fundamental defect that stems from its own core feature. By design, the discriminator judges whether a generated image is consistent with the images it has been shown. That means that even in the best case, the output image can only be an imitation of existing artworks, rather than an innovation.
In addition to GAN, researchers have also begun to try other types of deep learning models to teach AI to paint.
A well-known example is “Deep Dream”, an image tool released by Google in 2015. Deep Dream produced a series of paintings that attracted a lot of attention at the time. Google even curated an art exhibition of Deep Dream's work.
But to be honest, Deep Dream is not so much an AI painter as an advanced AI filter. Just take a look at the above images and you can see that they are more like filters applied to existing artworks. In 2017 Google released another model, trained on thousands of hand-drawn pictures, which can draw simple stroke-based drawings (Google, "A Neural Representation of Sketch Drawings"). One reason this model received widespread attention is that Google open-sourced it, so third-party developers could build interesting stroke-based AI drawing applications on top of it. One online application, "Draw Together with a Neural Network", lets the AI automatically complete an entire drawing for you from just a few simple strokes.
It is worth noting that the Internet giants are the main pioneers in this field. In addition to the above-mentioned research done by Google, in July 2017 Facebook, together with Rutgers University and the Art History Department of the College of Charleston, released a new model called CAN (Creative Adversarial Networks).
(Facebook, "CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms")
As can be seen from the images below, CAN tries to generate artworks that look more like original works by artists, rather than imitations of existing art.
The creativity in these artworks shocked the researchers at that time, as they looked very similar to popular abstract paintings. So they organized a Turing test and asked the audience to judge which works were created by AI and which by humans. As a result, 53% of the audience believed that the AI artworks of the CAN model came from human hands, a first for Turing tests of this kind.
However, the AI painting of CAN is still limited to abstract expression. It is far from reaching the level of human masters, let alone creating realistic or figurative paintings. In fact, even at the beginning of 2021, the famous DALL-E system released by OpenAI still had many limitations. The following is a fox drawn by DALL-E, which can barely be recognized.
However, with DALL-E, AI began to acquire an important ability: it can follow text prompts to create!
Next, we return to the question raised at the beginning. Why has the level of AI painting suddenly surged this year? Compared with previous works, there has been a fundamental leap in the quality of AI drawings. So what happened? Let's take a look.
In many sci-fi movies, there is a scene where the protagonist speaks to a futuristic computer AI, and the AI then generates a 3D image that appears in front of the protagonist as a VR/AR/holographic projection.
Aside from the fancy visual effects, the core capability here is that a human gives language input, and the computer AI understands the language, generates a graphic image and displays it. If you think about it carefully, the most basic form of this ability is AI painting. (Of course, there is still some distance between a two-dimensional painting and 3D generation, but that gap is small compared with the difficulty of having AI create a concrete and meaningful painting on the fly.) Therefore, whether it is controlled by speech or by some mysterious brain wave, the cool scenes in science fiction actually describe one AI ability: automatically converting "language" into images through AI understanding. Since voice-to-text technology is already mature, this is essentially an AI painting process from text to image.
It is quite amazing, right? Relying only on some text, without any reference pictures, AI can understand and automatically draw the corresponding content, and the drawings are getting better and better! Yesterday this seemed a little far away, but now it has appeared in front of everyone.
How did all this happen?
First of all, we must mention the birth of a new model. In January 2021, the OpenAI team open-sourced a new deep learning model, CLIP (Contrastive Language-Image Pre-Training), a state-of-the-art AI for image classification. CLIP trains AI to do two things at the same time: natural language understanding and computer vision analysis. It is designed as a powerful tool with a specific purpose, which is to classify images. CLIP can determine how well an image and a text match. For example, it can match an image of a cat to the word "cat".
The CLIP model is trained on already annotated "text-image" pairs. One model is trained for the text and another for the images. Training constantly adjusts the internal parameters of the two models so that the text feature vector and the image feature vector they produce for a matching "text-image" pair can be confirmed as a match by a simple comparison.
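The "simple comparison" between a text feature vector and an image feature vector is commonly a cosine similarity. The sketch below illustrates the matching step only. The three-dimensional vectors and file names are hypothetical placeholders; a real CLIP model computes much longer vectors with its two trained encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors, standing in for the encoders' outputs.
text_vectors = {
    "a cat": [0.9, 0.1, 0.0],
    "a dog": [0.1, 0.9, 0.1],
    "a car": [0.0, 0.1, 0.9],
}
image_vectors = {
    "cat.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.0, 0.95],
}

def best_caption(image_vec):
    """Pick the text whose vector is most similar to the image vector."""
    return max(text_vectors, key=lambda t: cosine(text_vectors[t], image_vec))

matches = {name: best_caption(vec) for name, vec in image_vectors.items()}
print(matches)
```

Training pushes matching pairs toward high similarity and mismatched pairs toward low similarity, so that a lookup like `best_caption` returns the right text.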
Here comes the key point. People had tried to train "text-image" matching models before, but the biggest difference with CLIP is that it collected 400 million "text-image" pairs as training data! With that much data and a great deal of expensive training time, the CLIP model finally achieved unprecedented results.
You may ask, who can annotate so many "text-image" pairs? It is almost impossible to manually annotate 400 million images. This is exactly the smartest thing about CLIP. It uses images that are widely scattered across the Internet!
Pictures on the Internet generally come with various texts such as titles, descriptions, comments, and even tags, which makes them perfect training samples. In this way, CLIP avoids the most expensive and time-consuming part, manual labeling. In other words, Internet users all over the world have already done the labeling work in advance.
CLIP is powerful. However, at first glance, it seems to have little to do with art creation.
But just a few days after the open-source release of CLIP, some machine learning engineers realized that this model could be used to do more. For example, Ryan Murdock figured out how to connect another AI model to CLIP to create an AI image generator. Ryan Murdock said in an interview: "After I played with it for a few days, I realized that I could generate images." In the end he chose BigGAN, a variant of the GAN model, and published the code on Colab under the name “The Big Sleep.”
(Colab Notebook is a very convenient online interactive programming notebook provided by Google, powered by Google Cloud computing. Users with a little technical knowledge can edit and run Python scripts in a notebook-like web interface and get the output. Importantly, these notebooks can be shared.)
The drawings created by Big Sleep are actually a little spooky and abstract, but it's a good start.
Soon the Spanish player @RiversHaveWings published CLIP+VQGAN and its tutorial on Twitter, which attracted great attention from the AI community and enthusiasts. The person behind this handle is data scientist Katherine Crowson.
In the past, generation tools like VQ-GAN could synthesize similar new images after being trained on a large number of images. However, as mentioned earlier, GAN-type models cannot by themselves generate new images from text prompts, and they are not good at creating genuinely new content. The idea of grafting CLIP onto a GAN to generate images is actually straightforward.
Since CLIP can calculate how well an image's features match any string of text, it is only necessary to link this matching check to the AI model responsible for generating images (for example, VQ-GAN here). The generating model then searches, guided by the check, for an image whose features pass the matching verification. As a result, you get a picture that matches the text description.
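The linking idea can be sketched as an optimization loop: keep adjusting the generator's input until the matching score against the text goes up. The example below is only a toy illustration under strong assumptions. The `generate` function is a hypothetical stand-in for an image generator such as VQ-GAN, the "text vector" is made up, and the gradient is estimated with finite differences instead of the backpropagation a real implementation would use.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def generate(latent):
    """Hypothetical generator: maps a latent input to 'image features'."""
    return [latent[0] + 0.5 * latent[1], latent[1] - 0.2 * latent[2], latent[2]]

def guided_search(text_vec, latent, lr=0.1, steps=300, eps=1e-4):
    """Nudge the latent so the generated features match the text better."""
    for _ in range(steps):
        base = cosine(generate(latent), text_vec)
        grad = []
        for i in range(len(latent)):
            bumped = list(latent)
            bumped[i] += eps
            # Finite-difference estimate of how the score reacts to latent[i].
            grad.append((cosine(generate(bumped), text_vec) - base) / eps)
        latent = [z + lr * g for z, g in zip(latent, grad)]
    return latent

text_vec = [0.2, 0.9, 0.4]   # made-up text feature vector
start = [1.0, 0.0, 0.0]      # arbitrary starting latent
score_before = cosine(generate(start), text_vec)
final_latent = guided_search(text_vec, start)
score_after = cosine(generate(final_latent), text_vec)
print(round(score_before, 3), round(score_after, 3))
```

The matching model never draws anything itself. It only scores candidates, and the generator's input is steered by that score.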
Some argue that CLIP+VQGAN is the biggest innovation in the AI art field since Deep Dream in 2015. The beauty is that CLIP+VQGAN is available to anyone who wants to use it. Following Katherine Crowson's online tutorial and Colab Notebook, a slightly technical user can run the system in minutes.
Interestingly, as mentioned in the previous chapter, at the same time (early 2021) the OpenAI team that open-sourced CLIP also released its own image generation engine, DALL-E. DALL-E also uses CLIP internally, but DALL-E is not open source!
Therefore, in terms of influence and contribution, DALL-E cannot be compared with the open-source implementation of CLIP+VQGAN. Of course, open-sourcing CLIP was already a huge contribution to the community by OpenAI.
When it comes to open-source contributions, we have to mention LAION. LAION is a global non-profit machine learning research organization. In March this year it opened LAION-5B, the largest open-source cross-modal dataset, to the public. It contains nearly 6 billion (5.85 billion) image-text pairs, which can be used to train text-to-image generative models as well as CLIP, the model used to score how well text and images match. Both are now at the core of current AI image generation models. In addition to providing this massive training data, LAION also trained an AI to score the pictures in LAION-5B for artistic sense and visual beauty, and collected the high-scoring pictures into a subset called LAION-Aesthetics. In fact, the latest AI painting models, including the most famous one, Stable Diffusion (which will be introduced later), are trained on the high-quality LAION-Aesthetics dataset.
CLIP+VQGAN has led the trend of a new generation of AI image generation technology. Now all open-source TTI (Text to Image) models thank Katherine Crowson in their introductions. She well deserves to be called a founder of the new generation of AI painting models.
Technology enthusiasts began to form a community around CLIP+VQGAN, and the code was continuously optimized and improved. There were Twitter accounts dedicated to collecting and publishing AI paintings. The earliest practitioner, Ryan Murdock, was also recruited by Adobe as a machine learning algorithm engineer.
However, the players in this wave of AI painting were mainly AI technology enthusiasts.
Although the threshold for running CLIP+VQGAN on Colab Notebooks is relatively low compared with a local AI development environment, you still need to request a GPU to run the code and call an AI program to generate pictures. You also need to deal with code errors from time to time. It is not something that the general public, especially art creators without a technical background, can do. This is why zero-threshold, foolproof paid AI creation services such as MidJourney are now shining so brightly.
But the excitement is far from over here. You have probably noticed that the powerful combination of CLIP+VQGAN was released early last year and spread in small circles, while the wild spread of AI painting happened at the beginning of this year, triggered by Disco Diffusion. That is a gap of half a year or more. What caused the delay?
One reason is that the image generation part used by the CLIP+VQGAN model, and therefore the generated images, were not always satisfactory. AI researchers noticed another way to generate images. If you review the principle of the GAN model, its image output is the result of the contest between the internal generator and discriminator. But there is another way of thinking, and that is the Diffusion model.
The word Diffusion sounds sophisticated, but the basic idea is simple. It is actually "denoising". If we repeat a denoising process again and again, is it possible to restore a completely noisy picture to a clear one?
Of course, it is not possible to rely on people to denoise, and a simple denoising program cannot do it either. However, it is feasible for an AI to "guess" while denoising. This is the basic idea of the Diffusion model. The Diffusion model is becoming more and more influential in the field of computer vision. It can synthesize visual data efficiently, and the quality of its images has overtaken that of GAN models. It also shows good potential in other fields such as video generation and audio synthesis.
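The "guess the noise, remove a little, repeat" idea can be sketched in a few lines. This is a simplified illustration, not any particular product's code. It uses the standard forward formula from the diffusion literature (noisy = sqrt(a) * clean + sqrt(1 - a) * noise, where `a` is the share of signal kept) and a deterministic reverse step. The noise schedule and the five-value "image" are made up, and an oracle that already knows the noise stands in for the neural network that would have to predict it.

```python
import math
import random

def add_noise(x0, noise, alpha_bar):
    """Forward process: blend the clean signal with Gaussian noise."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * n
            for x, n in zip(x0, noise)]

def denoise_step(xt, predicted_noise, alpha_bar_t, alpha_bar_prev):
    """One reverse step: estimate the clean signal from the predicted
    noise, then re-noise that estimate to the next, lower noise level."""
    x0_est = [(x - math.sqrt(1 - alpha_bar_t) * n) / math.sqrt(alpha_bar_t)
              for x, n in zip(xt, predicted_noise)]
    return add_noise(x0_est, predicted_noise, alpha_bar_prev)

random.seed(0)
clean = [0.0, 0.5, 1.0, 0.5, 0.0]            # a tiny five-"pixel" image
noise = [random.gauss(0, 1) for _ in clean]
schedule = [0.01, 0.25, 0.5, 0.75, 1.0]      # share of signal kept per level

x = add_noise(clean, noise, schedule[0])      # almost pure noise
for level, next_level in zip(schedule, schedule[1:]):
    # Oracle stand-in: a trained network would have to *predict* this noise.
    x = denoise_step(x, noise, level, next_level)
recovered = x
print([round(v, 6) for v in recovered])
```

With a perfect noise prediction the clean values come back exactly. A real model only approximates the noise, which is why it takes many small steps and why the result is a plausible new image rather than a stored one.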
Disco Diffusion, the AI painting product that became known to the public at the beginning of this year, is the first practical AI painting product based on the CLIP + Diffusion model. However, the shortcomings of Disco Diffusion are still obvious. For example, Stijn Windig, a professional artist, tried Disco Diffusion repeatedly and concluded that it cannot replace manual creation. There are two main reasons:
- Disco Diffusion cannot render specific details. The rendered images are stunning at first glance, but if you look closely, you will find that most of them are vague and do not reach a commercial level.
- The initial rendering in Disco Diffusion takes hours, and reinforcing the details of a rendered image is equivalent to redrawing the entire image. Such a process takes more time and energy than painting directly by hand.
However, Stijn Windig is still optimistic about the development of AI painting. He thinks that although it is not feasible to use Disco Diffusion directly for commercial creation, it is still very good as a source of inspiration: "...I find it more suitable as an idea generator. Give it the prompt: 'fantasy city on a sunny day, game of thrones, massive castle' and it'll return something that at least sparks the imagination and can be used to paint on top of, as a sketch."
In effect, Stijn raised two major technical pain points: 1) AI paintings lack fine detail, and 2) the rendering time is too long. Both come from an inherent shortcoming of the Diffusion model. Generating an image by reverse denoising is slow, and the model computes in pixel space, which leads to a huge demand for computing time and memory. As a result, generating high-resolution images becomes prohibitively expensive. (Pixel space means that the model does its calculations directly on raw pixel information.)
Therefore, this model cannot generate pictures with enough detail within a time that a normal user would accept. Even a sketch takes hours. But in any case, the painting quality of Disco Diffusion is superior to all previous AI painting models. It is already at a level that most ordinary people cannot reach. However, Stijn probably never expected that the two major pain points he pointed out would be almost perfectly solved by AI researchers within a few months!
Speaking of which, the most powerful AI painting model in the world today, Stable Diffusion, finally made its debut!
Stable Diffusion started testing in July this year, and it solved the above pain points very well. Compared with the previous Diffusion model, Stable Diffusion focuses on one thing: using a mathematical transformation to move the model's computation from pixel space into a low-dimensional space known as the Latent Space, while retaining as many details as possible. It then performs the heavy model training and image generation in the Latent Space.
How much impact has this "simple" transformation had?
Compared with the pixel space Diffusion model, the Diffusion model based on the latent space greatly reduces the memory and computation requirements. For example, the latent space downsampling factor used by Stable Diffusion is 8. In simple words, the height and width of the image are each reduced by a factor of 8. A 512x512 image becomes 64x64 in the latent space, cutting the number of spatial positions, and with it the memory needed, by roughly 8x8=64 times!
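The arithmetic above can be checked in a few lines. The helper name `latent_side` is made up for this sketch. The second ratio additionally assumes 3 color channels in pixel space and 4 channels in the latent space, which is the figure commonly cited for Stable Diffusion v1 but is not stated in this article.

```python
def latent_side(pixels, factor=8):
    """Side length after downsampling by the given factor."""
    return pixels // factor

width = height = 512
lat_w, lat_h = latent_side(width), latent_side(height)

# Ratio of spatial positions: (512 * 512) / (64 * 64).
spatial_saving = (width * height) // (lat_w * lat_h)

# Ratio of stored values, assuming 3 pixel channels and 4 latent channels.
value_saving = (width * height * 3) // (lat_w * lat_h * 4)

print(lat_w, lat_h, spatial_saving, value_saving)  # 64 64 64 48
```

Counting only positions gives the article's factor of 64. Under the channel assumption the saving in stored values is 48 times, which is the same order of magnitude.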
That is why Stable Diffusion is so fast and so good. It can generate a detailed 512x512 image in seconds, and all it needs is a consumer-grade 8GB 2060 graphics card!
Let’s do some simple math here. Without this space compression, achieving the same image generation experience as Stable Diffusion would take a super graphics card with 8 GB x 64 = 512 GB of video memory. At the current pace of graphics card development, you would need to wait at least 8-10 years before such a consumer-grade card comes to market.
An important algorithm iteration by AI researchers has brought the AI painting results that we may enjoy 10 years later directly to the computers of all ordinary users!
Therefore, it is completely normal for everyone to be surprised by the current progress of AI painting, as the technology has indeed seen continuous breakthroughs since last year. From the emergence of the CLIP model to the AI drawing wave triggered by its open-sourcing, and from the diffusion model as a better image generation module to the improved latent space method, all of this happened within about a year.
The biggest beneficiaries of this process are AI technology enthusiasts and art creators. (Well, some of them hate it too.) Everyone has watched AI painting technology, which had been stagnant for many years, rush to the top at rocket speed. There is no doubt that this is a high point in the history of AI development.
For all of us, the happiest thing is of course enjoying the great fun of using today's top painting AI such as Stable Diffusion or MidJourney to generate professional-level paintings.
Interestingly, the birth of Stable Diffusion is also related to the two pioneers mentioned above, Katherine Crowson and Ryan Murdock. They became core members of EleutherAI, a decentralized open-source AI R&D team. Although it calls itself a grassroots team, EleutherAI is currently a leader in the fields of ultra-large-scale prediction models and AI image generation.
It was EleutherAI that supported Stability.AI, an AI solution provider founded in London, England. These pioneers came together and, building on the above-mentioned breakthroughs in AI painting technology, launched the most powerful AI painting model, Stable Diffusion. Importantly, Stable Diffusion was fully open-sourced in August as promised! As soon as it was open-sourced, it took first place on the GitHub trending list.
Stability.AI has fully honored the slogan on its homepage, "AI by the people, for the people".
We will have another article introducing MidJourney, an online AI painting tool. Its biggest advantages are zero-threshold interaction and superior output. Creators can use the Discord-based MidJourney bot for interactive painting without any technical background (in English, of course).
As for output style, MidJourney has obviously made some optimizations for portraits. After running it multiple times, MidJourney's stylistic tendency becomes obvious. To put it nicely, it is more delicate and flattering; put another way, it is a little bit greasy. Stable Diffusion's works, by contrast, are noticeably more elegant and more artistic.
The following is a comparison of AI works created on the two platforms using the same text description. Readers may wish to experience it directly.
Stable Diffusion (left) vs. MidJourney (right):
Which style is better? In fact, each of them has its own fans.
Because of its targeted optimization, MidJourney is more convenient if you want to produce portraits. But after comparing many works, we think that Stable Diffusion is clearly superior in terms of both artistic expression and style diversity.
However, MidJourney has gone through rapid iteration in the past few months (after all, it is a paid service, and is very motivated). Now that Stable Diffusion is fully open source, its technical advantages are expected to be absorbed into MidJourney soon. On the other hand, the training of the Stable Diffusion model is still ongoing, and we can expect future versions of Stable Diffusion to make great progress as well.
This is a great thing for all of us.
In 2022, AI painting models based on text-to-image generation are the protagonists. It started with Disco Diffusion in February; DALL-E 2 and MidJourney opened invitation-only testing in April; Google released two major models, Imagen and Parti, in May and June (papers only, no open testing); and at the end of July, Stable Diffusion was born.
It is really dazzling how much the level of AI painting has improved recently. In fact, AI painting has made revolutionary progress within just one year, which can be deemed a historic breakthrough. What will happen next in AI painting, or more broadly in the field of AI-generated content (image, sound, video, 3D content, etc.), is full of uncertainty and excitement.
But even without seeing the future, from what advanced AI painting models such as Stable Diffusion can do now, we can basically conclude that "imagination" and "creativity", the two mysterious words that are also the last pride of mankind, can actually be decoded by technology.
For advocates of the divine supremacy of the human soul, the creativity displayed by today's AI painting models is merciless. So-called inspiration, creativity, and imagination will soon be (or have already been) decoded by the powerful combination of supercomputing + big data + mathematical models.
In fact, one of the core ideas of generative AI models like Stable Diffusion, and of many other deep learning models, is to represent content created by humans as a vector in a high-dimensional or low-dimensional mathematical space (to simplify, a string of numbers). If the "content -> vector" transformation is designed well enough, all human creations can be expressed as some of the vectors in a certain mathematical space. The other vectors in that space correspond to content that humans could theoretically create but have not yet created. Through the reverse transformation "vector -> content", this uncreated content can be excavated by AI.
This is exactly what the latest AI painting models such as MidJourney and Stable Diffusion are currently doing. We can say that AI is creating new content, but we can also say that AI is just bringing new paintings to the surface. The new paintings generated by AI have always been there mathematically. They have only been recovered from the mathematical space by AI in a very clever way.
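The "content -> vector -> content" idea can be illustrated with a deliberately trivial sketch. Here the "transformation" just flattens a tiny grayscale image into a list of numbers, and the two "existing works" are made-up 2x2 images. Real models learn a far richer transformation, but the principle of picking an unoccupied vector and decoding it is the same.

```python
def encode(image):
    """content -> vector: flatten a 2D grid of gray levels into one list."""
    return [value for row in image for value in row]

def decode(vector, width):
    """vector -> content: cut the list back into rows of the given width."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]

def blend(u, v, t):
    """A point on the straight line between two vectors (t from 0 to 1)."""
    return [(1 - t) * a + t * b for a, b in zip(u, v)]

# Two hypothetical "existing works", each a 2x2 grayscale image.
work_a = [[0.0, 0.2], [0.4, 0.6]]
work_b = [[1.0, 0.8], [0.6, 0.4]]

# A vector that neither work occupies, decoded back into an image.
new_work = decode(blend(encode(work_a), encode(work_b), 0.5), width=2)
print(new_work)
```

The decoded image was never drawn by anyone, yet it was always "there" as a point in the same space as the two existing works.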
In Math we trust:-)
Right now, the "creativity" of the latest AI paintings has begun to catch up with, and even almost equal, that of human beings. Since AlphaGo, human pride in our "wisdom" has been shrinking. The breakthrough in AI painting has further shaken human pride in "imagination" and "creativity". It may not be completely broken, but it is already full of cracks and crumbling.
We have always maintained a neutral view of the development of science and technology. Although we hope that technology will make human life better, in fact, just like the invention of the nuclear bomb, some technologies can be beneficial or deadly. From a practical point of view, a super AI that completely replaces humans does not seem entirely impossible. What humans need to think about is how, in the not-too-distant future, when AI surpasses humans in all fields, we can maintain our dominance over the world.
If AI finally learns to write code (there seem to be no inevitable barriers to prevent this from happening), then the story of the movie "Terminator" may come true. If that is too pessimistic, then humans must at least consider how to get along with an AI world that surpasses all their intelligence and creativity.
Of course, from an optimistic point of view, the future world will only be better. Human beings can live in their personal metaverse through AR/VR. We just need to move our lips, and an omnipotent AI assistant will automatically generate content, and even directly generate stories, games, and virtual lives that humans can experience.
Is this a better Inception, or a better Matrix?
In any case, the breakthrough and transcendence of AI painting ability that we witnessed today is just the first step.
Finally, we would like to share a "Mushroom city" series that we generated using Stable Diffusion. The images have completely different details, a completely consistent style, and high quality. Looking at these exquisite AI works, we have a feeling that AI creation has a "soul". Do you feel the same?