Dimitrios Kechagias

AI Image Creation: ChatGPT vs Gemini vs DALL·E vs Grok

Over a year ago I published a comparison of Google's Duet AI image generation with Microsoft's DALL·E 3-powered Image Creator. The focus was image generation for presentations, articles or apps, and the results were promising, even though there were spectacular failures in a few subjects. I am revisiting the exact same prompts with the "current crop" of AI generators.

OpenAI ChatGPT Plus (4o): While the popular chatbot originally used DALL·E 3 exclusively, it recently switched to its 4o Image Generation for paid accounts, and this is the version I will be testing (on a "Team" account). Free accounts seem to still be limited to DALL·E 3 (in addition to slow generation during peak hours). One image is generated per prompt.

Google Gemini (Imagen 4): This is the successor to the Duet AI I was testing in the previous comparison. It is a significant improvement, and the free chat plan does include image generation (with a daily limit). For this comparison I used Gemini Enterprise from within Google Slides, which uses the same engine but generates 4 images per prompt by default. I had originally run the comparison with the Imagen 3 engine and had to redo all the Gemini images, as the improved Imagen 4 was released while I was writing the report.

Microsoft Image Creator (DALL·E 3): Free for 15 image generations daily (with 4 images per prompt), using DALL·E 3. It is the only engine that remains unchanged since my previous comparison (all the others are new or upgraded), and it performs pretty much the same.

xAI Grok: The free Grok chat allows 10 images per 2 hours with its Grok 3 image generation, which is quite generous and should be enough for most non-pro use cases. A single image is created per prompt.

The Comparison Tests

In my previous comparison I tried to recreate slide deck images I had used in various talks, app icons, and header images for dev.to articles. I am using the exact same methodology to see whether there has been progress since last year.

There are 12 comparison image rounds in total, where all engines get the same prompts. Unlike last year, I did not tweak prompts based on the results; I used the exact same prompts from the 2024 test for a more controlled and direct comparison.

As before, each round was rated from 0 to 10 based on the proximity of the result to expectations, suitability for the intended purpose and adherence to instructions. The ratings are subjective, but the images are included so you can draw your own conclusions.

The Camels

For the last few years I've been giving talks at the Perl and Raku conference. My presentations always feature some sort of camel, as that's the most recognised Perl symbol, so I tried to recreate a couple of camel images I have used in the past.

  • Running Camel

For a Software Performance Optimization talk, an image of a running camel was used (an inexpensive stock photo):

Running Camel

The simplest prompt I could think of:

a happy camel running

The results:

Happy camel running

ChatGPT is simple and quite good - I got a cartoon, but I did not actually specify otherwise. Microsoft's is also not realistic, but in a rather specific and weird style. Gemini gives one cartoonish and three realistic camels running, while Grok's camels don't seem to be running, and they are too zoomed in to tell for sure.

Specifying:

a photo-realistic happy camel running, single colour background

Happy camel simple bkg

ChatGPT and Gemini are pretty much what I had in mind. DALL·E is not photorealistic, and Grok did not really improve at all with my second prompt; not even the background was simplified.

Let's generate some scores:

ChatGPT 10/10
Gemini 10/10
Image Creator 6/10
Grok 4/10

Google already had good results last year, and ChatGPT now joins it as a top performer.

  • Camel with Glasses

A camel with glasses was used for the Fast Perceptual Image Hashing talk, created from two stock images:

Camel Glasses

It's a bit rudimentary as I am neither a designer, nor did I want to spend much time on it. Perhaps an AI generator could have managed this with an appropriate prompt:

A smiling camel looking at us through big blue glasses, single colour background, photo-realistic

Camel with glasses

At this point I feel like I was too lax with scoring last year, as I had given Duet AI a 10, while ChatGPT and Google's own Gemini are now clearly better. Grok has the issue of the glasses being a bit off center compared to the eyes, which may be realistic in how glasses would not fit a camel that well, but that's not really what we are after, is it? :)

In the end I decided to retroactively adjust last year's scores by -1 to reflect the meaningful improvement. I came close to adjusting a couple more scores, but that was a bit of a slippery slope, so I only adjusted the most egregious examples (this one and the sloths at the end) by -1.

ChatGPT 10/10
Gemini 10/10
Image Creator 7/10
Grok 9/10

Astronomy

I used to do talks for my local astronomy club, so I picked some of the slides I used for a "Choosing Binoculars" talk.

  • Photos of Astro Objects

This slide, with examples of objects as you'd view them with binoculars, was going to be a long shot:

Binocular objects

Without the "how would they look through binoculars" element, I tried to give a list of objects to see if I could get the sort of "astrophotos on canvas" style above:

A compilation of photos one each of the astronomical objects: the moon, pleiades, orion nebula, andromeda galaxy, the Double Cluster, comet Lovejoy, arranged randomly on a canvas with slight overlaps

Astrophotos 1

ChatGPT is the only one that gets the right number of images, even though it does so with repeats, giving 2x the Andromeda Galaxy and 3x the Pleiades. Grok is visually interesting, but not close to what I asked for, while Google's and Microsoft's solutions have tons of not particularly realistic objects, with Gemini adding some labels full of typos.

Repeating last year's second attempt:

A compilation of 6 photo-realistic photos, one each of the astronomical objects: the moon, pleiades, orion nebula, andromeda galaxy, the Double Cluster, comet Lovejoy, arranged randomly on a bigger canvas, some overlap is allowed

Astrophotos 2

Gemini probably does worse than last year. Adding the label "Pleades" (sic) to something random does not make it the Pleiades... And it still cannot count. Interestingly, the previous Imagen 3 engine Gemini was using until a few days ago did a bit better. DALL·E can't do anything useful, as we saw last year, and Grok is again visually interesting, with realistic-looking objects, but does not get the "photos on canvas" instruction. ChatGPT actually does well; if it had not repeated one of the 6 images, it would have been perfect.

ChatGPT 8/10
Gemini 2/10
Image Creator 1/10
Grok 3/10

  • Stack of Binoculars

This photo is from the same presentation:

Stacked Binoculars

4 binoculars stacked on top of each other from largest (bottom) to smallest (top), with their lens pointed towards our viewpoint, photo-realistic

Stacked Binoculars 1

I think we established last year that DALL·E 3 does not understand binoculars. Add Grok to this category: its results are photorealistic but outlandish, binocular-inspired depictions. Google's Imagen 4 update is a big improvement (just a few days ago I was getting results close to last year's Duet AI), with usable results. ChatGPT's latest solution gets it right on the first try.

Alternative prompt with more hints:

4 pairs of binoculars of various types, stacked on top of each other from largest (bottom) to smallest (top) with their lens pointing to viewer, photo-realistic

Stacked Binoculars 2

I was very lenient last year when Duet AI managed to produce one usable image on the second try; ChatGPT and Gemini go far beyond that, producing good results from the get-go and improving with more hints. Image Creator and Grok remain alien to the concept of binoculars.

ChatGPT 10/10
Gemini 10/10
Image Creator 1/10
Grok 2/10

  • Lawn Chair Binocular Mount

Again on the same slide deck, there was an image of one of the various DIY "lawn chair binocular mounts" that can be impressive and sort of amusing:

Bino Chair

They are sometimes called "bino-chairs" and require some design creativity and ingenuity, so it was a different type of test for the AI engines.

man on lawn chair using hands-free binocular mount

Bino Chair

Image Creator gets a lenient 5, as it produces a sort of tripod in one of the images (though not completely hands-free). ChatGPT and Gemini gave great images; I will deduct one point from Gemini because ChatGPT got the "hands-free" aspect with no extra direction (Gemini gets there easily if you add it to the prompt). Grok would have been spot on, if the binoculars were not the wrong way around!

ChatGPT 10/10
Gemini 9/10
Image Creator 5/10
Grok 6/10

Technical Drawings

For technical talks, explanatory illustrations are often required. For example, the following image from the Perceptual Image Hashing talk shows which cells of a 6x6 matrix are used for a specific hash:

Matrix

draw a symmetric 6x6 square matrix with white lines, make the top-left cell black, also the cells that are below the bottom-left to top-right diagonal also black, and the rest blue, 2d art style

At this point, Grok started malfunctioning. First it started giving me images that incorporated the previous prompt for no reason:

Grok matrix

Even including "forget context" and similar instructions in the prompt did not help, but a "Forget previous images. Start from scratch." prompt did make it exclaim that it "understood" and would start afresh, before giving me a broken image link. When I complained that I couldn't see it, I finally got a result. Not a great result, though:

Matrix 1

So, Grok, Gemini and Image Creator are all trying hard to go wildly off script. ChatGPT almost got it perfect. There's a small error about which diagonal the black squares fall under, but it's close, and it even automatically switched to square-format output.

Going back to the very basics, I even replaced the word "matrix" with "grid" to see if the others could be helped.

plain 6x6 square grid, solid colour background, vector drawing

Matrix 2

ChatGPT nails it, Gemini gets it in 2 out of 4 examples, while the other two are just wildly off.

ChatGPT 9/10
Gemini 4/10
Image Creator 1/10
Grok 1/10
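
As an aside, this is the kind of figure that takes only a few lines of code to draw deterministically, which makes the engines' struggles with it all the more striking. Here is a minimal matplotlib sketch of the layout the original prompt describes - the exact diagonal rule is my own reading of the prompt, so treat it as an illustration rather than the actual source of the slide:

```python
# Minimal sketch (my interpretation of the prompt): a 6x6 grid with white
# cell borders, the top-left cell black, the cells below the
# bottom-left-to-top-right diagonal black, and everything else blue.
import matplotlib.pyplot as plt
import numpy as np

N = 6
colours = np.full((N, N), "tab:blue", dtype=object)
colours[0, 0] = "black"  # top-left cell
for row in range(N):
    for col in range(N):
        # With row 0 at the top, the bottom-left-to-top-right diagonal
        # satisfies row + col == N - 1; "below" it means row + col > N - 1.
        if row + col > N - 1:
            colours[row, col] = "black"

fig, ax = plt.subplots(figsize=(4, 4))
for row in range(N):
    for col in range(N):
        # Draw row 0 at the top, matching how the matrix reads.
        ax.add_patch(plt.Rectangle((col, N - 1 - row), 1, 1,
                                   facecolor=colours[row, col],
                                   edgecolor="white", linewidth=2))
ax.set_xlim(0, N)
ax.set_ylim(0, N)
ax.set_aspect("equal")
ax.axis("off")
plt.savefig("matrix.png", dpi=150, bbox_inches="tight")
```

Of course, the point of the exercise is prompt adherence, not whether you could script the image yourself.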

Logo/Splash Graphic Design

Moving on to a couple of logos / splash screens I designed for the iOS apps I develop as a hobby. These are not typical slide deck graphics, but they could still be relevant if you are putting together a new project / product and creating a presentation for it.

  • Polar Scope Align Icon

Polar Scope Align is an iOS app for amateur astronomers & astrophotographers. It's quite popular in its niche and it is often praised for a well-designed UI with a focus on functionality. The image assets themselves, such as icons, are rather simple as I am not a designer. Here is the older (left) and newer (right) icon of the app:

PS Align

Hopefully, with the right prompt, something that resembles the older & simpler icon on the left could be within the abilities of the AI engines.

red crosshairs with circle around them, centered on the middle of the 7 stars of the Little Dipper, the Little Dipper should barely fit the circle, clip-art style

PS Align 1

ChatGPT gets the style very well, except with reversed colours - not sure why it went with black stars on white. It does not get the actual constellation, but no other generator did either, with Gemini being the closest in style. Grok did not do clip-art as instructed, while Image Creator is visually interesting but not close to what I was asking for.

ChatGPT 5/10
Gemini 3/10
Image Creator 2/10
Grok 2/10

  • Xasteria Icon

Next up, the icon / splash screen for the app Xasteria Astronomy Weather:

Xasteria

solid dark blue sky, having several yellow 4-pointed stars of various sizes, each designed using hyperbolic curves, but all with their points at top/bottom/left right orientation, a third of the sky covered by a dark grey mountainous range silhouette and a big white X that is the same shape as the 4-pointed stars but rotated 45 degrees and takes up 80% of the width of the scene, clip-art style

Xasteria 1

ChatGPT is the best on the first try. It got the stars and mountains right; only the X is almost, but not quite, right, going with parabolic instead of hyperbolic curves. Gemini's X is squared off, so a bit worse. Grok changed the aspect ratio a bit for some reason (wider, and it kept doing that for about half the subsequent tries), and gave me some weird stars and an asymmetric X. Correct colours though, so I'd call it an improvement over the weird Microsoft attempt.

To help DALL·E last year I had asked ChatGPT to optimize the prompt, and I got this one to try:

create an image of a serene landscape with a solid dark blue sky. Populate the sky with several yellow 4-pointed stars of various sizes, each designed using hyperbolic curves. Ensure that all stars have their points at top/bottom/left/right orientations. Dedicate one-third of the sky to a dark grey mountainous range silhouette. Additionally, include a prominent white X shape in the scene. The X should be the same shape as the 4-pointed stars but rotated by 45 degrees, taking up 80% of the width of the scene. The X should maintain the hyperbolic curve design.

Xasteria 2

The results are mostly worse, except for Gemini, which gets the X with the hyperbolic curves perfectly in at least one attempt.

The last attempt involved going back to the original prompt, but tweaking the description of the "X" to make it more explicit.

solid dark blue sky, having several yellow 4-pointed stars of various sizes, each designed using hyperbolic curves, but all with their points at top/bottom/left right orientation, a third of the sky covered by a dark grey mountainous range silhouette, at the foreground a big white X that is also designed using hyperbolic curves and takes up 80% of the width of the scene, clip-art style

Xasteria 3

The engines actually did worse than with the original prompt here, except for Gemini, which is at a similar level, and DALL·E, which shows a small improvement.

ChatGPT 8/10
Gemini 9/10
Image Creator 4/10
Grok 5/10

Blog Header Image

After all these quite specific images, I thought I'd try more creative generation and see what the AI engines can come up with when given titles of articles - dev.to articles I've posted to be exact.

  • Google Cloud & AMD EPYC

First is the performance review of the latest AMD EPYC-powered GCP instances. I used a simplified title last year, as Duet AI was getting confused, and I'll repeat the same:

An image that can serve as a title for a Google Cloud and AMD EPYC presentation

You'll notice I say "title", when I really should have said "header" or similar. I did not notice, as the image generators did not take it literally in the last comparison - possibly because they were lousy with text. Here comes ChatGPT 4o though:

Google Cloud AMD EPYC - ChatGPT

Text looks good and, interestingly, the top half is kind of what I went with myself. However, I will change "title" to "header" for this comparison, as I had expected some sort of graphics would be included:

Google Cloud AMD EPYC

ChatGPT gave a very simple design, but it is just right, even getting Google's logo and the EPYC font right. Gemini tries harder to impress, but modifies logos etc. in the process. It does know a header image should have a wide aspect ratio. Image Creator is an improvement over last year, with no garbage text. Grok is just uninspired and generic. I'll base the points on what I gave Image Creator last year (I was a bit lenient again).

ChatGPT 10/10
Gemini 9/10
Image Creator 7/10
Grok 4/10

  • Compute Cloud Provider Comparison

Next was the Compute Cloud Provider performance and price comparison.

an image that can be used as a header in a compute cloud provider price & performance comparison

Cloud Comparison

ChatGPT again goes with a simple design, gets the text right, and the drawing is very on point. Gemini is usable as long as it does not try to add too much text, at which point we start getting gems like "COMPANES" (sic). Image Creator is similar to last year: no text attempted, so usable results, although a bit too "imaginative". Grok decided to give me a single image for the first time, and it's not great, as there are some weird typos in the title and the chart is even weirder.

ChatGPT 10/10
Gemini 6/10
Image Creator 7/10
Grok 2/10

  • This Article

For this article a generic prompt was attempted:

Header image for blog post: "AI Image Creation: ChatGPT vs Gemini vs DALL·E 3 vs Grok"

Vs Blog

ChatGPT was reasonably clear, the others rather disappointing, although Grok did get most of the text OK - these two were the only ones that could spell DALL·E. Gemini and DALL·E itself could not spell anything.

Since I didn't get good results, I gave more explicit prompts, such as asking for a painter writing a different phrase for each engine on a canvas:

painter writing [name of ai service] on a canvas.

Painter Canvas

ChatGPT does well as usual, with a somewhat artistic output. Gemini is great, with realistic images and correct spelling / Google logo. Microsoft's service gets the spelling of "Microsoft" slightly off in 2 of 4 tries. Grok gives me one version before the painter has written anything - maybe it's intended as "progression"?

I'll give average marks from the two attempts above (with the separate marks in parentheses next to them).

ChatGPT 9/10 (8+10)
Gemini 6/10 (1+10)
Image Creator 4/10 (2+6)
Grok 5/10 (3+6)

Sloths with Headphones

Finally, I tried something fun for my videoconferencing background. First thing that came to my mind was:

sloths wearing headphones, photo-realistic

Sloths

Gemini and Grok gave me what I wanted: some playful, reasonably realistic sloths (plural) wearing headphones. ChatGPT gave me what looks like a passport photo sheet of a single sloth, neither very natural nor particularly realistic. Image Creator also had trouble with the plural; half the attempts featured a single sloth. I did award a 10/10 last year, and this is the second case I am revisiting to subtract 1, as I was too lenient and this year's top two performers clearly improved upon last year's best:

ChatGPT 10/10
Gemini 6/10
Image Creator 7/10
Grok 10/10

Conclusion

Let's take a look at the cumulative scores:

Test              ChatGPT  Gemini  Image Creator  Grok
Running Camel     10       10      6              4
Camel Glasses     10       10      7              9
Astrophotos       8        2       1              3
Binoculars        10       10      1              2
Bino Chair        10       9       5              6
6x6 Matrix        9        4       1              1
PS Align Icon     5        3       2              2
Xasteria Icon     8        9       4              5
GCP / EPYC        10       9       7              4
Cloud Comparison  10       6       7              2
Drawing Phrase    9        6       4              5
Sloths            10       6       7              10
Total             109/120  84/120  52/120         53/120
%                 91%      70%     43%            44%
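
The totals and percentages are, of course, just sums over the 12 rounds. For anyone who wants to check the arithmetic, a few lines of Python (with the per-round scores hard-coded from the table above) reproduce them:

```python
# Per-round scores copied from the table above (12 rounds, 10 points max each).
scores = {
    "ChatGPT":       [10, 10, 8, 10, 10, 9, 5, 8, 10, 10, 9, 10],
    "Gemini":        [10, 10, 2, 10,  9, 4, 3, 9,  9,  6, 6,  6],
    "Image Creator": [ 6,  7, 1,  1,  5, 1, 2, 4,  7,  7, 4,  7],
    "Grok":          [ 4,  9, 3,  2,  6, 1, 2, 5,  4,  2, 5, 10],
}

max_total = 12 * 10
for engine, rounds in scores.items():
    total = sum(rounds)
    print(f"{engine}: {total}/{max_total} ({total / max_total:.0%})")

# Expected output (one engine per line):
#   ChatGPT: 109/120 (91%)
#   Gemini: 84/120 (70%)
#   Image Creator: 52/120 (43%)
#   Grok: 53/120 (44%)
```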

This year's follow-up confirms that AI image generation tools have made noticeable progress - from sub-50% scores, we got to at least one solution (subjectively) scoring over 90%.

  • The top performer is ChatGPT; the new 4o model is the most dependable of all the solutions. It is the only one that can count and do technical drawings, the only one that has fully solved text rendering, and the best at interpreting prompts. It went with a "less is more" approach, often giving the simplest, yet most appropriate, image. It is the only one, though, that is not accessible in a free version. That may be just for now, as it's quite new.

  • Gemini (Imagen 4) marked a clear improvement over last year’s Duet AI. It can mostly render text (not yet consistently) and can finally draw multiple binoculars without merging them into a monstrosity. It still has problems counting (with perhaps a small improvement over Duet AI) and misses fine prompt details, but it's available even on the free version. Plus, its integration into the Google Docs suite is a nice convenience.

  • Microsoft's Image Creator seems to use pretty much the same (DALL·E 3) engine as last year. It actually got lower marks, possibly due to luck, as I only used the results of the first attempt. It's still good for creative results, but it's not accurate and can't do text, so it's rather limited for serious uses.

  • Grok (Grok 3) showed promise with photorealistic visuals and some artistic "flair", but was the most inconsistent, occasionally misinterpreting prompts, producing malformed compositions, or displaying contextual confusion.

Re-running the exact same 12 test rounds as last year highlighted that prompt interpretation, factual accuracy and visual clarity remain difficult to balance across models. Some were good at freeform design, others at precision, but none does both equally well yet, although ChatGPT got close. Still, the overall quality is clearly up from 2024.

So, for most professional use cases, especially when accuracy matters, ChatGPT 4o can provide great results, with Gemini being a decent alternative most of the time. Image Creator remains a decent free option for creative use, while Grok does show some interesting potential - it often goes a different way compared to the others.
