AI and Photography: Part 3 - Midjourney vs Stable Diffusion

Written By Yan Zhang
Published by Yvette Depaepe, the 24th of May 2024

 

"Machines will be capable, within twenty years, of doing any work a man can do.” ~Herbert A. Simon (1965)

In July 2023, I attended an AI research forum where an Amazon researcher introduced several AI projects currently under way at Amazon. During the event, we had lunch together, and when she learned that I was also a photographer, she bluntly said to me: "Midjourney ended photography!" Blunt as this statement was, her words represent the view of many professionals engaged in cutting-edge research on generative AI. In this article, from my perspective as both an AI scientist and a professional photographer, I try to thoroughly explore the profound impact that generative AI is having on traditional photography, and how we, as photographers, should face this challenge.

Next week: Part 4 - The Photographer's Confusion

Midjourney vs Stable Diffusion

2023 will definitely be a year written into the history of AI.

In early 2023, ChatGPT, a large language model (LLM) launched by OpenAI, reached 100 million users in just two months. By mid-2023, the applications of ChatGPT and its successor GPT-4 had expanded significantly, from initial question answering, document editing, and creation to a wider range of fields including finance, health care, education, and software development.

At the same time, research on diffusion-model-based image generation, represented by Midjourney, Stable Diffusion, and DALL·E 2, has also achieved major breakthroughs. The main function of these models is to generate images of various styles from prompts. Most amazing of all, the Midjourney and Stable Diffusion models can generate realistic images similar to photographs.

Images Generated by Midjourney

Generally speaking, Midjourney can use relatively simple and direct prompts to generate high-quality, photorealistic images. Here we demonstrate several images generated by versions v5.0 and v6.0.

“Everest Base camp”. Generated on Midjourney, by Yan Zhang.

 

“A young woman portrait”. Generated on Midjourney, by Yan Zhang.


“Mysterious forest”. Generated on Midjourney, by Yan Zhang.

 

“Dream seascape”. Generated on Midjourney, by Yan Zhang.

From the pictures above, we can see that Midjourney can produce nearly perfect "photographs". Midjourney is also good at generating non-photographic artworks, even in the styles of specific artists, as shown in the following.

“Picasso’s women”. Generated on Midjourney, by Yan Zhang.

Midjourney's power in image generation has been widely recognised. However, since it is a fully closed system, its model structure and training methods are unknown to the public, and users must pay fees to use it through the Discord platform.

Stable Diffusion Model Structure

Stable Diffusion is an image generation diffusion model launched by Stability AI in July 2022. Unlike Midjourney, Stable Diffusion is a completely open system, so we can examine all the technical details of this model, from its structure to its training process.

Figure 6. The main model structure of Stable Diffusion.

After we know the basic idea of the diffusion model (see Figure 4 and Figure 5), it is not difficult to understand the structure of the Stable Diffusion main model in Figure 6. The training image x is compressed into a latent vector z by the encoder, and the forward diffusion process begins. During this process, noise is gradually added to the latent vector, which is finally transformed into a noise latent vector zT; then the reverse diffusion process begins. At this point, the additional "text/image" condition is converted into a latent-vector representation through a transformer and injected into the reverse diffusion process. During reverse diffusion, the U-Net neural network uses a specific algorithm to gradually remove the noise and restore a latent vector z, from which the decoder finally generates a new image x̂.
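To make the forward diffusion step concrete, here is a minimal Python sketch of the standard closed-form noising used by diffusion models of this family; the schedule values and latent shape below are illustrative assumptions, not Stable Diffusion's exact settings.

    import numpy as np

    T = 1000                              # number of diffusion steps
    betas = np.linspace(1e-4, 0.02, T)    # linear noise schedule (assumed)
    alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal fraction at each step
    rng = np.random.default_rng(0)

    def add_noise(z0, t):
        """Jump directly from the clean latent z0 to the noisy latent z_t."""
        eps = rng.standard_normal(z0.shape)  # Gaussian noise
        zt = np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps
        return zt, eps                       # eps is what the U-Net learns to predict

    z0 = rng.standard_normal((4, 64, 64))    # a 512x512 image compresses to a 4x64x64 latent
    zT, _ = add_noise(z0, T - 1)             # by the final step, almost pure noise

By the last step, almost all of the original signal is gone, which is exactly the noise latent vector zT described above.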

It should be noted that after the model completes training, we only need the reverse diffusion process, acting as an inference engine, to generate images. At this point, the input text/image is converted into a latent vector through the transformer, and reverse diffusion through U-Net begins, producing a new image. A minimal sketch of this inference step is shown below.
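As an illustration only (not the training setup described in this article), the whole inference pipeline can be driven in a few lines with the open-source Hugging Face diffusers library; the model name, prompt, and settings here are assumptions for the sketch, and a CUDA-capable GPU is assumed.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load the trained VAE, U-Net, and conditioning transformer as one pipeline.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # The prompt is encoded by the conditioning transformer; the U-Net then
    # denoises a random latent step by step, and the VAE decoder produces the image.
    image = pipe(
        "mysterious forest, morning mist, photorealistic",
        num_inference_steps=30,
        guidance_scale=7.5,
    ).images[0]
    image.save("mysterious_forest.png")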

The Stable Diffusion model in Figure 6 can also be roughly divided into three major components: the leftmost red VAE module, the middle green U-Net module, and the rightmost Conditioning transformer. This structural view will make it easier to describe the Stable Diffusion extensions we discuss later.

Figure 7. The three modules of Stable Diffusion, corresponding to the main structure in Figure 6. The VAE (Variational AutoEncoder) compresses and restores images; the U-Net neural network performs the reverse diffusion process, which we also call inference; the Conditioning transformer is an encoder that converts text and image conditions, attached to the reverse diffusion process.

Stability AI used 5 billion (image, text) pairs collected by LAION as the training dataset, where each image is 512×512 pixels. The computing resources used for model training were 256 Nvidia A100 GPUs on Amazon Web Services (AWS), each with 80 GB of memory; the initial model training took 150,000 GPU hours and cost USD $600,000.

Images Generated by Stable Diffusion

Generally speaking, given the same prompts, the quality of the pictures generated by Stable Diffusion is not as good as Midjourney's. For example, using the same prompts as the "Mysterious forest" picture generated by Midjourney above, the picture generated by SD v1.5 is as follows:

"Mysterious forests". Generated on Stable Diffusion (use the same prompts as the same titled image shown above), by Yan Zhang.

Obviously, the quality of the picture above is not as good as the one generated by Midjourney, both in terms of photographic aesthetics and image quality. However, it would be a mistake to think that Stable Diffusion is far inferior to Midjourney.

Because it is open source, Stable Diffusion offers unlimited possibilities for subsequent research and development in various directions. We briefly outline this work below.

Using a rich prompt structure and various extensions, Stable Diffusion can also generate realistic "photographic works" comparable to Midjourney's.



“Future city”. Generated on Stable Diffusion, by Yan Zhang.


“A young woman portrait”. Generated on Stable Diffusion, by Yan Zhang.

 

“Alaska Snow Mountain Night”. Generated on Stable Diffusion, by Yan Zhang.

Stable Diffusion Extensions

The open-source nature of Stable Diffusion allows AI researchers to study its structure and source code carefully, and thus to build various extensions that enhance the model's functions and applications.

The expanded research and development of Stable Diffusion basically focuses on the U-Net part (see Figure 7). There are two main lines of work: (1) Based on the original Stable Diffusion U-Net, train a personalized U-Net sub-model with a small, specific dataset. When this sub-model is embedded in Stable Diffusion, it can generate images in the personalized styles users want. Dreambooth, LoRA, Hypernetworks, etc., all belong to this type of work.

(2) Enhance control over Stable Diffusion's image generation process. Research in this area designs and trains specific neural-network control modules so that users can directly intervene during image generation according to their own requirements, such as changing a character's pose or replacing a face or background. ControlNet, ROOP, etc., are control-module extensions in this category. A minimal sketch of both kinds of extension follows.
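As a hedged sketch of how both extension types plug into the pipeline (again using the open-source diffusers library; the checkpoint names, LoRA path, and edge-map file below are placeholders, not specific models recommended by this article):

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    # (2) A control module: ControlNet conditions generation on, e.g., Canny
    # edges extracted from a reference photo.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # (1) A personalized sub-model: LoRA weights trained on a few images are
    # merged into the U-Net at load time (the path is hypothetical).
    pipe.load_lora_weights("./my-style-lora")

    edges = load_image("./reference_edges.png")  # preprocessed edge map
    image = pipe("portrait in my personalized style", image=edges).images[0]
    image.save("controlled_portrait.png")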

In addition, we can also revise the original U-Net structure of Stable Diffusion and use a specific training dataset to train part or all of the modified diffusion model. The underlying diffusion model trained in this way can be targeted at specific application domains, such as medicine, environmental science, etc.

Stable Diffusion sub-model example. The author of this article downloaded 7 photos of Tom Hanks from the Internet, shown in (a), then used the Dreambooth extension to train on just these 7 photos and generate an "AI-TomHanks" sub-model. Embedding this sub-model in Stable Diffusion can generate AI versions of Tom Hanks pictures, as shown in (b).

In addition to U-Net, further modifications and extensions can be made to the VAE and Conditioning transformer parts of Stable Diffusion, which we will not detail here.

 

Comparisons between Midjourney and Stable Diffusion

Here, based on my own experience, I compare the two systems on the following main features.

User friendliness: From a user's perspective, I think Midjourney is easier to use than Stable Diffusion, and it is easier to generate satisfactory pictures with it. If you are a Stable Diffusion user, you will find that to generate a high-quality image, in addition to working on prompts, you also need a suitable sub-model (also called a checkpoint), whether you are using SD v1.5 or SD XL v1.0; it is therefore relatively difficult.

Flexibility: In the process of image generation, Midjourney and Stable Diffusion provide different ideas and methods for controlling and modifying the final output image. However, I think Midjourney's methods are more intuitive and practical, giving users more flexibility. Although Stable Diffusion provides more complex and richer image-editing capabilities, such as inpainting, outpainting, and upscaling, they are not very easy for ordinary users to apply in practice.

Functionality diversity: Because of its open source code and scalability, Stable Diffusion's functions have been continuously enhanced, which has made it increasingly popular in application domains across business, education, medicine, and scientific research. Purely in terms of artistic picture generation, however, both Midjourney and Stable Diffusion can generate stunning artistic pictures (photography, painting, cartoon, 3D, sculpture, etc.).

Image quality: Both systems can generate high-quality artistic images of all types. However, as mentioned before, Midjourney is slightly better than Stable Diffusion in terms of the aesthetics and quality of the generated images.

Extendibility/Free use: First of all, Midjourney is not free to use, and it is not open source. For users who want to use generative AI software for free and have some IT background, I strongly recommend installing Stable Diffusion on your own computer, so that you can freely create anything you are interested in.

Photographers ask me, which one should we choose, Midjourney or Stable Diffusion?

My suggestions are as follows: (a) If you are limited by technology and/or resources (for example, you don't know how to install and use Stable Diffusion, or your computer lacks sufficient GPU capacity), then simply choose Midjourney. Although it requires a subscription fee, with some learning you will certainly be able to create great AI artworks, and you can also use it to enhance your photography post-processing workflow.

(b) If you are only interested in generating AI artwork and processing photos, I likewise recommend using only Midjourney, without considering Stable Diffusion at all.

(c) If you have some IT background and are interested in the technical details of generating a wide range of artistic images, especially if you want to generate personalized images, then I strongly recommend Stable Diffusion, because it is currently the most comprehensive generative AI software for image generation.

“Mountain sunrise”. Generated on Midjourney, by Yan Zhang.

“Silent valley”. Generated on Stable Diffusion, by Yan Zhang.

Mini AI knowledge: AI Winter refers to the period from 1974 to 2000 when AI research and development, mainly in the United States, was at a low ebb and research funding and investment were significantly reduced. The main cause was that, from the mid-1960s onwards, a series of large-scale AI research projects failed or failed to make substantial progress, including the failure of machine translation and single-layer neural network research projects in the late 1960s; the failure of speech-understanding research at Carnegie Mellon University in the mid-1970s; and the stagnation of fifth-generation computer research and large-scale expert system development during the 1980s and 1990s.

Stable Diffusion, see https://arxiv.org/pdf/2112.10752 - researchers from the University of Munich published the schematic picture in April 2022, which is shown above without reference, or did I miss it?
Dear Ulrike, in that respect they did not use it illegally; this picture is licenced by Hiroshima International University.
Ah, interesting! Does not make it better ;-)
...well, if a user starts producing ''WOW'' images from tomorrow, compared to the ''MMM...NOT BAD'' ones he produced until yesterday, he has certainly started using Midjourney....:):):)....
😊👍😊
Completely agree with Miro. What does the result from Midjourney have to do with photography? OK, it grabs parts of other people's photographs and combines them. Yes, dear Yan, you can use prompts very well, but the pictures created do not look like photographs but really artificial. I was also very astonished about the title, since AI is strictly forbidden here. Sorry to be really negative about this article.
😊👍😊
Yan Zhang, Thank you for the articles and images. I don't understand the technical details of how AI creates images, but it's clear that it will be a big change for visual artistry - much like the big change around 1840 when the invention of photography threatened to replace painting, and the more recent change when digital imaging and Photoshop began to replace film and darkroom. Painting survived, and Photography will too. Perhaps the evolution of technology will push us towards making meaningful images that can't be described with words.
Midjourney and Stable Diffusion are two leading generative AI applications that offer highly advanced functionality in their creation of images. While these two generative AI image creators share a similar focus, they are significantly different in their approach to AI image generation. Their difference boils down to a preference for artistic nuance versus extensive customization. Midjourney: best for creating artistic, visually compelling images. Stable Diffusion: best for extensive customization and technical control over image generation. This means both Midjourney and Stable Diffusion are AI applications with the ability to quickly generate images from text prompts. Use of AI for image generation is strictly forbidden on 1x. After reading this article (actually all 3 parts) I am not able to revise my previous comments on it. IMHO, this article is a promotion of AI; is it appropriate to promote something which was banned on this 1x platform? Further, part of this article originates from the (licenced) work of Shigekazu Ishihara, Rueikai Ruo and Keiko Ishihara (Hiroshima International University), without mentioning their names. Dear Yan, please do not be upset about my comment(s) on this article; this is just my personal opinion on this subject. I am an engineer and also a supporter of AI, but definitely not in photography. I wish you and all 1x readers a lovely weekend.
Dear Miro, thanks for reading this article and providing your comments. As an AI researcher, I am definitely an AI supporter. However, if you read the later parts of this article, you will know my position on the relationship between AI and photography. Most importantly, whether you like it or not, AI is here, its impact on photography is increasing, and most of the photography industry is embracing AI. When Adobe officially embedded generative AI into Photoshop for the first time in 2023, it became clear that the traditional meaning of photography is getting complicated, and we may need to re-define it.
Dear Yan, I appreciate your answer to my comment very much, and I understand it very well. I know that we can't stop any development, as you said, also not in photography. I'm very sad about it; for me photography is an art, created by the photographer and his camera. Now with AI we don't need the camera anymore; we can create beautiful pictures with words only. I tried it, and it works very well. I must repeat, IMHO it is pushing real photo work, or photo artwork, offside. I'm not a very good photographer, but photography is a part of my life. I was always very proud when I saw that one of my humble photos was published or even awarded, but now I have to reconsider whether I shall continue or not, because my chances in competition with AI-touched photos are rather slim. Once more, thank you for this most educative article. Have a very nice weekend.