In July I shared my thoughts on Dalle-2, OpenAI’s text-to-image generator. I steered clear of the ongoing narrative proclaiming the theft or death of art and creative practice, instead exploring the myriad benefits these AI-human collaborations will bring to creatives.
The last few months have validated that view, but even my optimistic stance underestimated the magnitude of the moment we’re in. The space has exploded, going from a fringe creative playground for techies and designers, to a rapidly evolving and super-democratised juggernaut of creativity.
Given the rapid pace of change, I thought I’d draft a new article sharing my evolving thoughts on the genre. I’ll be diving a little deeper, exploring how ecosystem diversification, combined with technological democratisation, has created the perfect breeding ground for blistering progression.
And don’t worry, I’ve also included loads more incredible examples!
A changing ecosystem
My previous article on text-to-image tools focused explicitly on Dalle-2 from OpenAI because it was, at that stage, the most high-profile example of the genre. Compared with OpenAI’s first-generation tool, Dalle-2 represented a step-change in the quality of the images being generated, offering 4x better image resolution, better caption matching, and better realism.
That being said, in many ways it was fundamentally the same model and approach as Dalle’s first incarnation - it was invite only, somewhat limited without development expertise, contained only basic editing functionality, and relied upon heavily filtered/curated reference material (rightfully so, to avoid generating harmful content or infringing on copyright).
As with any new technology, the first commercialised product to enter the market won’t be alone for long, and a whole host of alternative platforms have emerged. Whether by luck or design, the most successful new entrants have addressed many of the key pitfalls of OpenAI’s approach - if not always technologically, then certainly from a business model perspective. In this article I’ll be referencing two of the most successful new tools: Midjourney and Stable Diffusion.
Unlike Dalle-2, which is invite only, Midjourney is primarily accessed through Discord, though they do have a web app. Initially they only offered access via their own standalone server, but they’ve since created the Midjourney bot, which allows any server admin to add Midjourney to their Discord community.
Naturally there are some usage limits before you have to pay for further prompts, but the Discord-distribution model has been incredible for broader adoption of text-to-image as a category. I’d encourage readers to join their Discord server, even if simply to bear witness to the incredible community participation. It’s text-to-image, but as a social and collaborative experience.
Midjourney is also pointedly less focused on realism than Dalle-2, and deliberately defaults to a stylised, artistic aesthetic. As Midjourney founder David Holz told The Verge, “We have a default style and look, and it’s artistic and beautiful, and it’s hard to push [the model] away from that”.
Lastly, unlike Dalle-2, Midjourney is slightly more flexible regarding content restrictions. You’re still protected from generating overtly x-rated content, but you can, for example, write prompts that include celebrity likenesses. In the example below I asked Midjourney to create an image of ‘The queen of England smiling in a forest, dark and moody’. If you compare that picture of the Queen to Dalle-2’s attempt, you’ll quickly see why that difference matters:
Other community members have had fun creating entire stories, like John Oliver falling in love with a cabbage:
The other fast-growing text-to-image model is Stable Diffusion, which has many similarities with Dalle-2 and Midjourney, but its own distinct business approach.
You see, Stable Diffusion is open source, and this has made for an incredible distribution model. Anybody can fork the library and build their own tools from it, and this is exactly what is happening - I’ll share some impressive examples of this later in the article.
At face value, the downside of their open source approach is that it requires technical expertise to get started. However, Stable Diffusion’s open source approach has led to several standalone tools that are very consumer-friendly and accessible, and they’ve now created their own easy-to-use web app too.
Like Midjourney, Stable Diffusion is also a little more generous than Dalle-2 with regard to the sort of images it allows you to create. However, I don’t want to go too deep on the differences between the various distribution methodologies and libraries. Instead, I think it’s worth exploring the greatest ecosystem benefit: democratisation.
If you’re wondering why I’ve not focused on the technological aspects of text-to-image it’s because it’s arguably a moot point. There is naturally a lot of runway to improve on the underlying technology, just as there was with the early days of the internet, or the smartphone, but the most creative forms of text-to-image innovation have been taking place thanks to democratisation.
Whether it’s thanks to Midjourney giving everyone access to a simple text interface in their Discord channel, Stable Diffusion offering their model as open source, or self-started AI-art communities, the result is that more people are accessing these tools than ever before.
Not only are more people gaining access to interfaces that allow them to create AI-generated imagery, but every software developer on the planet can now access libraries and models that allow them to experiment with the technology. As with all things, increased access leads to faster progress.
Ecosystem diversification x democratisation = speed!
The potent combination of ecosystem diversification and technological democratisation has led to incredibly fast and dynamic evolution within the space. We saw the same thing with the internet in its first incarnation, and, for better or worse, it’s been the same in the blockchain space over the past decade. Each new application of the technology feeds into the system, providing new avenues of exploration, whilst simultaneously signposting opportunities for the established players to refine their own propositions. Then the cycle starts anew.
Examples of exciting new innovations
Now that I’ve covered what I believe is the driving force behind this accelerating pace of innovation, it’s time to look at some of the amazing recent manifestations of the technology and its application by people all over the world.
Some of these are live examples, and others are prototypes that signpost where the technology is going next - I’ve done my best to draw attention to the difference when necessary. You’ll also see why I avoided sharing Stable Diffusion examples earlier in the article: it has an outsized presence among the samples below, and this is because their open source model is driving the most innovation!
Firstly, there are quite a few people exploring image-to-image. I’m not going to pretend to understand how they’re using Stable Diffusion to do this, but it highlights the immense potential of quickly turning sketches and storyboards into fully realised visuals.
Using AI to create illustrated stories
Another use-case I’m extremely excited about is the application of text-to-image to translate written stories into more visual formats.
For example, Sharon Zhou has been undertaking some amazing experiments that she is referring to as Long-form text-to-images generation. You can find out more by visiting her GitHub repository, or subscribing to the Stories by AI newsletter.
Unfortunately, it’s not super easy to share or embed Sharon’s work in our article, but that’s not the case for this amazing project by Glenn Marshall:
It’s genuinely inspiring to see text-to-image being deployed experimentally to push the boundaries of literature and storytelling. And, if you want to explore how tools like Dalle-2, Midjourney, or Stable Diffusion could help you to create your own animations or illustrated stories, check out this handy little tool by Andreas Jansson.
Tile-able textures for 3D rendering
If you’ve ever worked with 3D design software you’ll know the struggle that comes from trying to find the right texture for your renders, let alone ensuring it’s tile-able! Well, this problem has been well and truly solved by text-to-image engines, and it’s hard to overstate quite how many hours this could potentially end up saving designers around the world:
Some of the new applications for text-to-image seem obvious, others less so. I was completely blown away by this eCommerce idea from Russ Maschmeyer at Shopify:
It’s only a mock-up, but the potential is undeniable, and it’s surely only a matter of time before live camera feeds, voice recognition, AI, and Machine Learning (ML) are combined to this effect.
Stable Diffusion for your design tools
As a primarily digital designer, I’m bound to get most excited when I see these tools translated into software I already know well. These next two examples show how powerful it could be to integrate Stable Diffusion with the standard digital design tools of today.
And Antonio Cao has made this plugin for Figma that shows huge potential too:
Naively, I’d assumed that limits on processing power, coupled with the issues with repeatability inherent within current text-to-image models, might delay the emergence of text-to-video until at least early 2023, but it’s already here.
I was first introduced to this concept by an amazing video from Malick Lombion and Manuel Sainsily. Admittedly it was created using a very manual process, developed frame by frame using text-to-image, but they still turned some amazing creations around extremely fast once the tools were readily available. Depending on your preferred platform you can view a great example on LinkedIn or Instagram.
What’s more, there are several tech companies vying to create tools that will put the power of text-to-video into the hands of every creative or consumer. A few of the most compelling examples have come from a relatively unknown company called Runway:
Of course, the tech giants aren’t sitting on the sidelines for this one either.
Google’s text-to-video debut
Runway’s demos above may set our expectations a little too high, and it’s worth saying that what they’re sharing is more akin to a heavily edited showcase that aims to build momentum for the eventual release of their products. For a more realistic insight into the current state of text-to-video research, Google’s efforts are probably as good a guide as you can find:
Meta’s text-to-video offering
Not to be outdone, Meta (a.k.a. Facebook) has also got in on the action, publishing the results of their latest research here. You can get a glimpse of what’s to come from this post shared on Twitter:
Insert yourself in the narrative
Last, but by no means least, I thought I’d share a slightly more comical use-case. Perhaps in an attempt to reflect the modern obsession with personal narrative, an Imgur user has trained Stable Diffusion on his own face so that he could add his own likeness to famous clips from film and television:
It goes without saying, but the pace of progress truly is staggering, and it’s hard to really imagine where these technologies will take us over the next couple of years. They’re unleashing and defining entirely new approaches to creative practice, and making access and adoption more widespread than ever before.
It’s also worth noting that these are only examples of AI and Machine Learning models that are focused on image and video generation. This category naturally takes the limelight because images and video are the easiest media formats to consume and appreciate, but the innovation in the space is much broader and equally rapid. The same challenges are being tackled in music and writing, in eCommerce, in shipping, in governance and sport. You name it, and AI and Machine Learning are being applied to revolutionise it. We can all easily see and appreciate the impact of text-to-image, or text-to-video, but it’s really just the flagship application of a technological wave that is sweeping our world, one that most of us are entirely blind to.
It’s scary and exciting, and frankly unstoppable, and in my view we should all be adapting, learning to embrace these new tools and technologies if we want to succeed going forward.