You'll find plenty of AI model comparison articles that test the models on a handful of selected questions, but very few show how the models actually compare when working on a real-world project, and by that I mean a real production application. 💪
I think it's high time we start testing models on this front too, as AI models are improving rapidly, with one of the models we're going to test (Claude Sonnet 4) reaching about 72.7% accuracy on SWE-bench.
So, in this article, I'll test two recent models: Claude Sonnet 4 (a drop-in replacement for Claude 3.7 Sonnet) against Gemini 2.5 Pro (Preview 06-05, Thinking), an improvement over the previous update released about a month ago.
Gemini 2.5 Pro posted a 24-point Elo jump on LMArena (to 1470) and a 35-point Elo jump to take the lead on WebDevArena (at 1443), but it scores a slightly lower ~67.2% on SWE-bench compared to Claude Sonnet 4.
Let's jump into it without any further ado!
TL;DR
If you want to jump straight to the result: yes, Claude Sonnet 4 is definitely an improvement over Claude 3.7 Sonnet and performs better than pretty much any other coding model, including Gemini 2.5 Pro, though there are exceptions in some cases (as you can see for yourself in the comparison below).
And, there can be cases like this as well: 😮💨
Overall, you won't go wrong choosing Claude Sonnet 4 over Gemini 2.5 Pro (for Coding!).
However, all of this comes at a cost. Claude Sonnet 4 is the pricier of the two at $3/M input tokens and $15/M output tokens, while Gemini 2.5 Pro is $1.25/M input tokens and $10/M output tokens.
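To put that difference in rough terms (the token counts here are just an illustrative example): a task that burns through 100K input tokens and 10K output tokens would cost about $0.30 + $0.15 = $0.45 on Claude Sonnet 4, versus roughly $0.13 + $0.10 = $0.23 on Gemini 2.5 Pro, i.e., about half the price.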
But judging on overall model quality and pricing, not just coding, Gemini 2.5 Pro is one of the cheapest yet most powerful models you can get. So, unless you're using the model for heavy-duty coding tasks, I'd suggest sticking with Gemini 2.5 Pro.
Brief on Claude Code
💁 First, let me tell you that we are going to be testing Claude Sonnet 4 with the help of Claude Code. If you don't know what this is, let me give you a quick overview.
Claude Code is a command-line tool developed by Anthropic that runs directly in your terminal. It uses agentic search to understand your project, including its dependencies, without you having to manually select files to give it context.
The best part is that it integrates well with GitHub, GitLab, and other dev tools, allowing you to read issues, write code, make PRs, and even run tests directly from the terminal.
🗣️ "Claude Code’s understanding of your codebase and dependencies enables it to make powerful, multi-file edits that actually work." - Anthropic
Here's a quick demo to get an idea of how powerful this tool is and what it can do:
Why use Claude Code when you can use the Claude Web UI?
This is a completely valid question, and there's a reason I've chosen Claude Code over the typical Web UI chat. The main reason is that even if your project has a million lines of code, Claude Code can search the codebase and pull in whatever it needs to give a better response, which simply isn't possible with the Web UI.
And the other reason is, what could be better for a developer than to test the models right from the terminal? 😉
Brief on Jules Coding Agent
💁 We'll use this agent to test Gemini 2.5 Pro. If you don't know what this is, let me give you a quick overview.
This is a coding AI agent from Google. Think of it like Claude Code, but it runs on the web and not locally.
If you've been following Google I/O, you probably already know about this. For those of you who don't know and if you're interested, check out this quick demo: 👇
One downside to this agent is that you must connect it with GitHub and your repository to use it. That's how the agent gets context on your codebase, and we're using it for the same reason we use Claude Code.
Testing on a Real-World Project
As I said, we'll be testing these models on a real repository: asking them to implement changes, fix bugs, and demonstrate their understanding of the codebase.
Here's the project that I used: collaborative-editor. Think of it like Google Docs: it allows sharing and editing documents collaboratively in real time. All kudos to JavaScript Mastery. ✌️
It is a fairly complex project and should be enough to test the models on.
1. Codebase Understanding
Let's start off with something simple; we'll ask the models to understand the codebase.
With Claude Code, it's pretty simple. You don't need to prompt it differently; you just run /init, which generates a CLAUDE.md file with everything the model can understand from the codebase.
Response from Claude Sonnet 4
You can find the response it generated here: Link
It goes without saying that the model understands the project well, as that's the main purpose of Claude Code: to understand the codebase without you providing manual context. Claude Sonnet 4 does a pretty good job at it.
There's not much more I can add; it simply went through all the files and wrote a clean CLAUDE.md file with a great understanding of the project.
Response from Gemini 2.5 Pro
You can find the response it generated here: Link
Again, it's the same story with Gemini 2.5 Pro; it did pretty much the same thing and wrote a great README.md file with everything correct.
The only visible difference between the two models' responses is that Claude Sonnet 4's output is a bit more thorough and explains the key architectural patterns, while Gemini 2.5 Pro wrote the README much like what you'd find in any other repository on GitHub (clean and concise).
But I can't see any major difference in the output between the two models, and I don't think you can either. It mostly comes down to wording.
2. Find the Buggy Commit and Fix It
This is a real problem that most of us face, especially those who don't write many tests. We make some changes, everything works, but then in some commit we do something that breaks the functionality. 😴
Let's see if these models can find and fix the issues I introduce. I'll add the bugs in a few commits on top of the working code and then ask each model to fix them.
Prompt: Hey, the project had everything working, but recently a few bugs were introduced. A document name never updates even though you rename it, and a user seems to be able to remove themselves as a collaborator from their own document. Also, a user who should not have permission to view a document is able to view it. Help me fix all of this.
Response from Claude Sonnet 4
You can find the raw response it generated here: Link
Great, we got a nice response, and it really fixed all the bugs that I intentionally added in some commits on top.
Even better, it added a point to its TODO list for running the lint and test commands, which is a good addition as it improves overall trust in the changes it made.
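Claude's actual changes are in the linked response above, but to give a flavor of the kind of fix these bugs call for, here's a small hypothetical TypeScript sketch of access-control and rename helpers. The names (DocumentRecord, canView, removeCollaborator) are illustrative assumptions of mine, not the collaborative-editor project's real code or Claude's actual diff.

```typescript
// Hypothetical sketch only: illustrates the kind of checks the three reported
// bugs boil down to. Not taken from the collaborative-editor repository.

type Role = "owner" | "editor" | "viewer";

interface DocumentRecord {
  id: string;
  title: string;
  ownerId: string;
  collaborators: Record<string, Role>; // userId -> role
}

function canView(doc: DocumentRecord, userId: string): boolean {
  // The "unauthorized user can view the document" bug is typically a missing check like this.
  return doc.ownerId === userId || userId in doc.collaborators;
}

function removeCollaborator(
  doc: DocumentRecord,
  targetId: string
): DocumentRecord {
  // Guard against the owner being removed from their own document.
  if (targetId === doc.ownerId) {
    throw new Error("The document owner cannot be removed as a collaborator.");
  }
  const { [targetId]: _removed, ...rest } = doc.collaborators;
  return { ...doc, collaborators: rest };
}

function renameDocument(doc: DocumentRecord, newTitle: string): DocumentRecord {
  // The rename bug was the new title not propagating; returning (and persisting)
  // an updated record rather than a stale copy is the usual fix.
  return { ...doc, title: newTitle };
}
```

The point is simply that all three bugs reduce to small, well-scoped checks like these, which is exactly the kind of localized, multi-file fix both agents handled well.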
Response from Gemini 2.5 Pro
You can find the raw response it generated here: Link
Pretty much the same result we got with Claude Sonnet 4: Gemini 2.5 Pro fixed all of the issues. The way Jules works is that once you approve the changes, it commits them and pushes them to a new branch on your GitHub repository. This is pretty handy but also pretty slow, as everything runs inside a virtual machine (VM).
Claude Code got the same thing done within 2-3 minutes with Claude Sonnet 4, but Jules with Gemini 2.5 Pro took over 10 minutes. 😮💨
But we're interested in the model itself, and Gemini 2.5 Pro is doing pretty well, keeping pace with Claude Sonnet 4.
3. Add a New Feature (Focus Mode)
The models are doing pretty well at finding and fixing bugs, but now let's see how good they are at adding new features on top of the existing code.
Prompt: I need you to implement "Focus Mode" in the doc. Here's the thing you need to do: hide the navigation bar, add a full-screen toggle option, hide the formatting toolbar until selection, and hide the comments shown on the side (the doc needs to take the whole width of the page).
Response from Claude Sonnet 4
Here's the output of the program:
Once again, Claude Sonnet 4 did it in a matter of a few seconds. It almost got everything correct as I asked, but you can never be sure with an AI model, right?
Everything seemed to go well, and it did all work properly, except for the fact that the text I wrote in the doc in Focus Mode does not persist.
As for the code quality, it's great and follows best practices; most importantly, it's not junior-level code. Everything is kept minimal and to the point.
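For illustration only, here's a minimal sketch of how a Focus Mode toggle might be wired up in a React/Next.js editor shell. The component names (EditorShell, NavBar, CommentsSidebar) are placeholders I made up, not Claude's actual implementation, and the hide-toolbar-until-selection behavior is omitted for brevity.

```tsx
"use client";

// Hypothetical sketch of a Focus Mode toggle, not the generated code.
import { useCallback, useState, type ReactNode } from "react";

// Placeholder components standing in for the real navigation bar and comments panel.
const NavBar = () => <nav>navigation</nav>;
const CommentsSidebar = () => <aside>comments</aside>;

export function EditorShell({ children }: { children: ReactNode }) {
  const [focusMode, setFocusMode] = useState(false);

  const toggleFocusMode = useCallback(async () => {
    // Enter or leave the browser's full-screen mode alongside the UI changes.
    if (!focusMode) {
      await document.documentElement.requestFullscreen?.();
    } else if (document.fullscreenElement) {
      await document.exitFullscreen();
    }
    setFocusMode((prev) => !prev);
  }, [focusMode]);

  return (
    // In Focus Mode the document takes the full width of the page.
    <div className={focusMode ? "w-full" : "mx-auto max-w-4xl"}>
      {!focusMode && <NavBar />}
      <button onClick={toggleFocusMode}>
        {focusMode ? "Exit Focus Mode" : "Focus Mode"}
      </button>
      <main>{children}</main>
      {!focusMode && <CommentsSidebar />}
    </div>
  );
}
```

The design point is that Focus Mode is mostly conditional rendering plus a full-screen request, which is why a model can get it nearly right in one shot; persisting the text typed while in this mode is the part Claude missed.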
Response from Gemini 2.5 Pro
For some reason, Jules kept running into a problem midway through adding the feature. I tried more than three times and got the same result, and I'm not sure what the exact reason is.
So, as a last resort, I went back to Google AI Studio and provided the context manually. Overall, it was able to implement the functionality I asked for, though the UI wasn't very polished, as you can see in the demo below. 👇
Overall, it added a well-functioning feature, and that's what counts; unlike Claude Sonnet 4, it didn't introduce any logic bugs.
It's fair to say Gemini 2.5 Pro got this win even without Jules and with manual context.
4. Build an AI Agent with Composio
Up until now, we've mostly done one-shot questions with no follow-up prompts.
In this section, I'll try to build a fully functioning AI agent, using multiple prompts if required, with both models to see how it goes and which one gets us to a result faster.
Prompt: I need you to create an AI agent for Google Sheet that basically helps with Google Sheet tasks. Use Composio for the integration and do it in Python. Use packages like openai, composio_llamaindex, llamaindex that gets the job done. Also better if you can use uv as the package manager. Refer to the docs if required: https://docs.composio.dev/getting-started/welcome
Response from Claude Sonnet 4
You can find the raw response it generated here: Link
Here's the output of the program:
If I'm honest, I didn't quite expect the model to be this good with the code. It wrote the code in one shot, and I didn't have to iterate with follow-up prompts. It followed all the right approaches, similar to how the Composio docs suggest building a Google Sheets agent. 🙀
Overall, it's a great result.
Response from Gemini 2.5 Pro
Here's the output of the program:
It was quick this time, and I got a working AI agent for Google Sheets in 6 minutes. That was genuinely surprising: writing over 2,500 lines of code with no errors in one shot is impressive. Believe me, I've tried this with Claude 3.7 Sonnet, and it failed badly.
You can find the response Claude 3.7 generated here: Link
It's impressive that the model could refer to the docs, use the packages I specified, follow best practices, and get us a working AI agent.
Conclusion
Claude Sonnet 4 comes out ahead of Gemini 2.5 Pro in coding. The margin may not be large, but it's a rock-solid coding model.
A handful of tests might not be enough to judge conclusively, but we have some fair results. The pricing is the same as Claude 3.7 Sonnet while the model brings clear improvements, so it really is a drop-in replacement for 3.7 Sonnet.
By the way, if you want to see how the Claude 4 lineup (Claude Opus 4 and Claude Sonnet 4) competes against each other and Gemini 2.5 Pro in coding, check out this blog post:


🔥Claude Opus 4 vs. Gemini 2.5 Pro vs. OpenAI o3 Coding Comparison 🚀
Shrijal Acharya for Composio ・ May 26
Which kind of test do you prefer for judging an LLM: real-world problems or a mix of smaller coding questions? Let me know in the comments below!