AI is evolving fast, with new "best" AI models appearing every few weeks. 🥴
Among them, we have three top models leading the leaderboards: Claude Opus 4, Gemini 2.5 Pro, and OpenAI's o3 Pro.
Which one do you really pick for coding?
If you're a dev who cares about clean, efficient code, choosing between them can get overwhelming really quickly.
So I put them head-to-head. Let’s see which one actually works for coding. 👀
TL;DR
If you want to skip to the results: Claude Opus 4 is much better than the other two in terms of code quality, implementation, following prompts, and, most importantly, understanding your exact needs. It really gets what you want, down to the specifics.
But you won't go wrong choosing Gemini 2.5 Pro for coding either. It's an excellent model and gives great results. If price is a factor, I'd suggest you stick with Gemini 2.5 Pro over any other model. It's worth every penny.
Here, we didn't compare the models on anything other than coding. o3 Pro may well be better at things like reasoning and math than both Claude Opus 4 and Gemini 2.5 Pro, but based on our tests, it's a disaster for coding. Just don't choose this model if you are looking for a good coding model.
Coding Comparison
1. 3D Town Simulation
Prompt: Make a 3D game where I can place buildings of various designs and sizes and drive through the town I create. Add traffic to the road as well.
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
Look how beautiful this is for a single shot. Everything works, with a nice UI and shadows. I can switch from build mode to drive mode, add buildings, and continue driving. There are no logic issues in any part it implemented.
One thing that could be improved is the collision detection between the vehicle and the buildings, but I didn't ask for it, so that's fine.
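For anyone who wants to bolt collision detection on themselves, a simple axis-aligned bounding box check is usually enough for a blocky town like this. Here's a minimal sketch in plain JavaScript (the box shapes and helper names are my own, not from the generated code):

```javascript
// Axis-aligned bounding box overlap test: two boxes collide when they
// overlap on every axis. Boxes are hypothetical {x, z, w, d} records
// (center position on the ground plane, width, depth).
function aabbOverlap(a, b) {
  return Math.abs(a.x - b.x) * 2 < a.w + b.w &&
         Math.abs(a.z - b.z) * 2 < a.d + b.d;
}

// Block the move if the car's next position would hit any building.
function canMoveTo(car, nextX, nextZ, buildings) {
  const next = { ...car, x: nextX, z: nextZ };
  return !buildings.some(b => aabbOverlap(next, b));
}
```

In an actual Three.js scene you'd more likely compute the boxes with `new THREE.Box3().setFromObject(mesh)` and test them with `box.intersectsBox(other)`, but the logic is the same.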
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
This is great too. You can place buildings in valid spots but not on the road. The overall user interface and feel are somewhat average.
One thing I love that it did, which Claude Opus 4 didn't, is collision detection. It was also quick with the response, finishing within seconds. For a model that's this affordable and performs this well, what more could you ask for, right?
Response from OpenAI o3 Pro
You can find the code it generated here: Link
Here’s the output of the program:
As you can see, this is a much worse result than both of the other models'. There's no validation for where you can place buildings, and even when you try placing them in a valid spot, they end up on the road.
The overall UI and feel are terrible, flickering everywhere. The controls are completely inverted, and the colors are really weird.
Overall, this is a very disappointing result from this model for such a simple question. 👎
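For contrast, the placement validation that o3 Pro skipped is only a few lines in a tile-based town. A rough sketch, assuming a hypothetical grid of cell types (not taken from any of the generated code):

```javascript
// Hypothetical tile grid: each cell is "road" or "empty". A building
// footprint (col, row, width, height in cells) is only valid if every
// cell it covers exists and is empty.
function canPlaceBuilding(grid, col, row, width, height) {
  for (let r = row; r < row + height; r++) {
    for (let c = col; c < col + width; c++) {
      // Reject off-map cells and anything already occupied or paved.
      if (!grid[r] || grid[r][c] !== "empty") return false;
    }
  }
  return true;
}
```

Running a check like this before committing a building would prevent exactly the on-the-road placements seen above.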
Follow-up prompt: Now I need you to make me a firetruck and set some buildings on fire randomly. I must be able to extinguish the fire and add an alert in the UI when a building is on fire. Also, add a rival helicopter that detects the fire and tries to come to extinguish the fire before me. Additionally, make the buildings look a bit more realistic and add fire effects to the affected buildings.
Now, let's see how well these models can understand their own code and add some features on top.
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
This is amazing. I can easily switch from a normal car to a fire truck, see alerts, and everything works. The logic is all fine, except that the helicopter doesn't put out the fire.
This isn't a typical question it might have been trained on; it's a completely random idea, and for something this random, the result is impressive. With a few follow-up prompts, I'm pretty sure it could fix even that.
Overall, this is a rock-solid response and probably the best you can expect from any model in one shot.
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
Now, this is interesting. For some reason, the controls are now completely inverted. It has improved the UI a bit but introduced a whole lot of new bugs in the code.
Not the best you can expect, but it's good enough for this feature request.
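As an aside, inverted controls like these almost always trace back to a single sign or axis mix-up where keyboard input gets mapped to steering. A tiny illustration in plain JavaScript (a hypothetical helper, not Gemini's actual code):

```javascript
// Map arrow-key input to a steering delta. A common regression in
// generated game code is flipping the sign here (or rotating the
// camera 180 degrees) so that "left" turns the vehicle right.
// Passing inverted = true reproduces the bug; the fix is a
// one-character sign change.
function steeringDelta(key, inverted = false) {
  const dir = key === "ArrowLeft" ? -1 : key === "ArrowRight" ? 1 : 0;
  return inverted ? -dir : dir;
}
```

When a model inverts the controls in a follow-up, this input-to-movement mapping is the first place worth checking.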
Response from OpenAI o3 Pro
You can find the code it generated here: Link
Here’s the output of the program:
This is another disaster. The controls are inverted, the fire truck does not work, the buildings request an image that does not exist, and many more issues. Bugs are everywhere, and literally nothing works.
Even if you are using this as a starting point for a project like this one, you're better off building it all on your own.
2. Bike Racing
Prompt: You can find the prompt I've used here: Link
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
Unlike the earlier tests, this one has a few problems. On the plus side, the UI looks extremely good and it has added all the features I asked for, but the game never ends and the map is incomplete.
There isn't much more to say about the implementation, but there's room for improvement.
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
Similar result, but with bad UI. The game is endless, which is what I want, but it stops rendering the roads midway and has no boundary for driving, allowing you to move anywhere.
Not the best you'd expect from Gemini 2.5 Pro for this question.
Response from OpenAI o3 Pro
You can find the code it generated here: Link
Here’s the output of the program:
This is a tiny bit better than its earlier implementation, but still not good. The position calculation is incorrect, and the game force-starts at a single point, sort of mimicking an endless run.
It's still not right and nowhere near the Claude Opus 4 or Gemini 2.5 Pro implementations.
3. Black Hole Simulation
Let's end our test with a quick animation challenge.
Prompt: Build an interactive 3D black hole visualization using Three.js/WebGL. It should feature different colors to add to the ring and a button that triggers the 'Disk Echo' effect.
This is an idea I got from Puneet from this tweet:
Here, he used the earlier Gemini 2.5 Pro model (05-06) with several follow-up prompts to achieve this result. Now, let's see what kind of results we can get from our models, which are supposedly better, in a single shot.
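Before looking at the outputs: the "different colors to add to the ring" part of this prompt usually reduces to mapping each accretion-disk particle's radius to a color. A minimal sketch of that mapping in plain JavaScript (the color values and radii are hypothetical, independent of any model's output):

```javascript
// Linearly blend between an inner (hot, blue-white) and outer
// (cooler, orange) RGB color based on a particle's distance from
// the event horizon. The specific colors and radii are arbitrary
// choices for illustration.
function diskColor(radius, inner = 1, outer = 5) {
  const t = Math.min(Math.max((radius - inner) / (outer - inner), 0), 1);
  const hot = [0.8, 0.9, 1.0];  // near the horizon
  const cool = [1.0, 0.5, 0.1]; // outer edge of the disk
  return hot.map((h, i) => h + (cool[i] - h) * t);
}
```

In Three.js you'd typically feed values like these into a per-vertex color `BufferAttribute`, with `vertexColors: true` set on the disk's material.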
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
This is okay, and honestly, it's about the best you could expect from a model in one go. Perhaps the black hole itself could look a bit better, but this is a completely valid and great response.
It has implemented everything correctly, from different color support to the Disk Echo effect. But when you see a result like the one in the tweet above from a slightly less capable model, you expect a bit more, right? 🥴 Still, this is fine.
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
Honestly, I prefer this result over Claude Opus 4's, mainly because the black hole and the animation here look a bit more realistic.
This still does not look as good as the one in the tweet, but it's a one-shot, don't forget. Once we iterate on the code and ask it to change accordingly, we should get closer to that or even better.
Response from OpenAI o3 Pro
You can find the code it generated here: Link
Here’s the output of the program:
Nah, this is complete dogshit. In what sense does this look like a black hole animation? Yes, this model benchmarks great on research, reasoning, and the like, but when it comes to coding, it always seems to disappoint.
And I got this result after it thought for over 5 minutes, the longest any model took in the entire test. Super disappointing.
Conclusion
Claude Opus 4 is the clear winner in all of our tests. It's a great model, and it justifies its reputation as the best model for coding. 🥱
To me, Gemini 2.5 Pro is the real find. In most cases, it does a pretty solid job, and its benchmarks are remarkable for a model priced this low.
Don't forget, it's about to get a 2M token context window update. Imagine how good this model is going to be with a 2M token context window. 😵
o3 Pro was kind of disappointing, but it isn't really positioned as a coding model either. You might be better off using it for research and the like, but not for coding.
I've compared the earlier version, o3, as well, and got pretty much similar results. If you'd like to take a look, here you go:


🔥Claude Opus 4 vs. Gemini 2.5 Pro vs. OpenAI o3 Coding Comparison 🚀
Shrijal Acharya for Composio ・ May 26
What do you think, and which one do you prefer based on your needs and budget? Let me know in the comments! 🙌