Introduction
Having set the stage in Part 7, we are ready to continue with the comparison. The results for the newly introduced Claude 4 Opus and Claude 4 Sonnet can also be found in Part 7.
Comparison Criteria
- Code quality (structure, modularity, error handling)
- Readability (clarity, naming, comments, formatting)
- Adherence to Playwright and automation best practices (locator usage, assertions, reusability, maintainability)
Comparison Results
GPT-4.1
Code Quality
- Implements a Page Object Model (ArticlePage), with grouped locators and actions.
- Uses nested objects for navigation, forms, and article actions.
- Uses role-based and parameterized locators.
- Good encapsulation and reusability (see the sketch below).
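To make this concrete, here is a minimal sketch of the ArticlePage style described above: nested locator groups plus a parameterized, role-based locator. The group names, UI labels, and methods are my assumptions based on the Conduit demo app, not the model's actual output.

```typescript
// Hypothetical sketch of the ArticlePage pattern; all names and labels are assumptions.
import { expect, type Locator, type Page } from '@playwright/test';

export class ArticlePage {
  constructor(private readonly page: Page) {}

  // Locators grouped into nested objects by page area.
  readonly nav = {
    newArticle: (): Locator => this.page.getByRole('link', { name: 'New Article' }),
  };

  readonly form = {
    title: (): Locator => this.page.getByPlaceholder('Article Title'),
    body: (): Locator => this.page.getByPlaceholder('Write your article (in markdown)'),
    publish: (): Locator => this.page.getByRole('button', { name: 'Publish Article' }),
  };

  // Parameterized, role-based locator: one definition serves any article title.
  articleHeading(title: string): Locator {
    return this.page.getByRole('heading', { name: title });
  }

  async createArticle(title: string, body: string): Promise<void> {
    await this.nav.newArticle().click();
    await this.form.title().fill(title);
    await this.form.body().fill(body);
    await this.form.publish().click();
    // Web-first assertion: Playwright retries until the heading appears.
    await expect(this.articleHeading(title)).toBeVisible();
  }
}
```

Because the nested entries are functions, every call returns a fresh locator, which keeps the grouped style free of stale references.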
Readability
- Very clear structure, logical grouping, and descriptive method/variable names.
- Minimal comments, but code is self-explanatory.
- Consistent formatting.
Best Practices
- Follows Playwright best practices: role-based selectors, web-first assertions, config via env vars.
- No hardcoded timeouts.
- Test covers all CRUD actions and checks for post-deletion state.
Claude 3.7 Sonnet (Thinking)
Code Quality
- Uses a well-structured Page Object Model (ConduitPage), grouping locators and actions as class methods/getters.
- Good encapsulation and reusability; all page interactions are abstracted.
- Uses role-based locators (getByRole) and avoids hardcoded selectors.
- Checks the login state before logging in, as sketched below.
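For illustration, here is a hedged sketch of the getter-based ConduitPage pattern with the login-state guard; the labels and method names are my assumptions, not the generated code.

```typescript
// Illustrative sketch of the ConduitPage pattern; names and labels are assumptions.
import { expect, type Locator, type Page } from '@playwright/test';

export class ConduitPage {
  constructor(private readonly page: Page) {}

  // Getters return a fresh Locator on every access.
  get signInLink(): Locator {
    return this.page.getByRole('link', { name: 'Sign in' });
  }

  get settingsLink(): Locator {
    return this.page.getByRole('link', { name: 'Settings' });
  }

  // Guard: only log in when the session is not already active.
  async loginIfNeeded(email: string, password: string): Promise<void> {
    if (await this.settingsLink.isVisible()) return; // already logged in
    await this.signInLink.click();
    await this.page.getByPlaceholder('Email').fill(email);
    await this.page.getByPlaceholder('Password').fill(password);
    await this.page.getByRole('button', { name: 'Sign in' }).click();
    await expect(this.settingsLink).toBeVisible();
  }
}
```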
Readability
- Clear and descriptive method names.
- Consistent formatting and variable naming.
- Minimal comments, but code is self-explanatory.
Best Practices
- Follows Playwright best practices: web-first assertions, role-based locators, and modularity.
- Uses environment variables for config.
- No hardcoded timeouts.
- Test scenario is end-to-end and readable.
SWE-1
Code Quality
- Implements a Page Object Model (ConduitApp) with private getters for locators and methods for actions.
- Good encapsulation and modularity.
- Uses role-based selectors and web-first assertions.
- Uses beforeAll/afterAll hooks for setup/teardown (see the sketch below).
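Here is an illustrative sketch of that structure: private getters plus manual beforeAll/afterAll browser management. All names are assumptions; the point is the pattern, which duplicates what Playwright's built-in page fixture already provides.

```typescript
// Sketch of the SWE-1 style: private getters plus manual browser lifecycle hooks.
import { test, chromium, type Browser, type Page } from '@playwright/test';

class ConduitApp {
  constructor(private readonly page: Page) {}

  // Private getter: the locator stays hidden; only the action is public.
  private get newArticleLink() {
    return this.page.getByRole('link', { name: 'New Article' });
  }

  async openEditor(): Promise<void> {
    await this.newArticleLink.click();
  }
}

let browser: Browser;
let page: Page;

// Manual setup/teardown works, but re-implements the built-in `page` fixture,
// which is why it reads as a legacy pattern.
test.beforeAll(async () => {
  browser = await chromium.launch();
  page = await browser.newPage();
});

test.afterAll(async () => {
  await browser.close();
});

test('opens the article editor', async () => {
  await page.goto(process.env.BASE_URL ?? 'https://demo.realworld.io');
  await new ConduitApp(page).openEditor();
});
```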
Readability
- Clear and descriptive method names.
- Consistent formatting and logical structure.
Best Practices
- Follows Playwright recommendations: modularity, role-based selectors, web-first assertions.
- Uses environment variables.
- Test is comprehensive and checks all CRUD operations.
xAI-Grok-3
Code Quality
- Implements a Page Object Model (ConduitApp) but with all data hardcoded inside methods.
- Each action is a single method; no parameterization.
- Uses role-based selectors and web-first assertions.
- Less flexible to reuse, as the sketch below illustrates.
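To illustrate the limitation (this is not Grok's actual output), compare a hardcoded action with the parameterized version that would make it reusable:

```typescript
// Illustrative contrast; UI labels and data are assumptions.
import { type Page } from '@playwright/test';

class ConduitApp {
  constructor(private readonly page: Page) {}

  // Hardcoded: the data is baked in, so the method covers exactly one case.
  async createArticle(): Promise<void> {
    await this.page.getByPlaceholder('Article Title').fill('My Test Article');
    await this.page.getByRole('button', { name: 'Publish Article' }).click();
  }

  // Parameterized: the same action reused for any article.
  async createArticleTitled(title: string): Promise<void> {
    await this.page.getByPlaceholder('Article Title').fill(title);
    await this.page.getByRole('button', { name: 'Publish Article' }).click();
  }
}
```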
Readability
- Simple and readable, but less scalable.
- Method names are clear, but lack parameterization for reuse.
- Minimal comments.
Best Practices
- Uses Playwright best practices for selectors and assertions.
- No modularization of test data.
- Test steps are clear and sequential.
Deepseek R1
Code Quality
- No Page Object Model; all actions are inline within the test.
- Uses constants for credentials and article data.
- Directly uses Playwright locators in test steps.
- Lacks abstraction and reusability (see the sketch below).
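A minimal sketch of that inline style, assuming placeholder credentials and the Conduit demo app; everything lives directly in the test body:

```typescript
// Illustrative inline test without a page object; data values are placeholders.
import { test, expect } from '@playwright/test';

const EMAIL = process.env.CONDUIT_EMAIL ?? 'user@example.com';
const PASSWORD = process.env.CONDUIT_PASSWORD ?? 'secret';
const ARTICLE_TITLE = 'Inline Test Article';

test('create an article without a page object', async ({ page }) => {
  await page.goto(process.env.BASE_URL ?? 'https://demo.realworld.io');

  // All locators live in the test body: readable here, but every step
  // must be repeated in any other test that needs the same actions.
  await page.getByRole('link', { name: 'Sign in' }).click();
  await page.getByPlaceholder('Email').fill(EMAIL);
  await page.getByPlaceholder('Password').fill(PASSWORD);
  await page.getByRole('button', { name: 'Sign in' }).click();

  await page.getByRole('link', { name: 'New Article' }).click();
  await page.getByPlaceholder('Article Title').fill(ARTICLE_TITLE);
  await page.getByRole('button', { name: 'Publish Article' }).click();

  await expect(page.getByRole('heading', { name: ARTICLE_TITLE })).toBeVisible();
});
```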
Readability
- Simple, readable, but less maintainable for larger tests.
- Variable names are clear.
- Test is split into logical sections (Create/Edit/Delete).
Best Practices
- Uses role-based selectors and web-first assertions.
- No modularization; not scalable for larger suites.
- No comments, but the structure is easy to follow.
Summary Table for POM Locator Patterns
Pattern | Encapsulation | Flexibility | Readability | Staleness Risk | API Surface |
---|---|---|---|---|---|
Getters | Medium | High | High | Low | Medium |
Private Getters | High | Medium | Medium | Low | Low |
Objects for Locators | Low | High | Medium | Medium | High |
Methods Directly | High | Low | High | Low | Low |
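To make the four patterns concrete, here is one illustrative class that demonstrates each of them; the Conduit locators are my assumptions:

```typescript
// One illustrative class demonstrating all four locator patterns from the table.
import { type Locator, type Page } from '@playwright/test';

class PatternExamples {
  // 3. Object for locators: grouped and easy to scan, but a large public
  //    surface, and every entry must be kept in sync with the page.
  readonly form: { title: Locator; body: Locator };

  constructor(private readonly page: Page) {
    this.form = {
      title: page.getByPlaceholder('Article Title'),
      body: page.getByPlaceholder('Write your article (in markdown)'),
    };
  }

  // 1. Getter: a fresh Locator on every access, with a readable call site.
  get publishButton(): Locator {
    return this.page.getByRole('button', { name: 'Publish Article' });
  }

  // 2. Private getter: same freshness, but hidden behind action methods,
  //    which keeps the public API surface small.
  private get editorLink(): Locator {
    return this.page.getByRole('link', { name: 'New Article' });
  }

  // 4. Method directly: locator usage fully encapsulated inside the action.
  async publishArticle(title: string, body: string): Promise<void> {
    await this.editorLink.click();
    await this.form.title.fill(title);
    await this.form.body.fill(body);
    await this.publishButton.click();
  }
}
```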
Comparative Table
File | POM Used | Abstraction | Readability | Best Practices | Scalability | Notes | Manual Updates |
---|---|---|---|---|---|---|---|
Claude 3.7 Sonnet (Thinking) | Yes | High | High | Yes | High | Well-structured | Low |
Deepseek-R1 | No | Low | Medium | Partial | Low | Inline logic | Low |
GPT-4.1 | Yes | High | High | Yes | High | Well-structured | Low |
SWE-1 | Yes | High | High | Yes | High | Hooks with legacy setup/teardown | High |
xAI-Grok-3 | Yes | Medium | Medium | Yes | Medium | No parameterization | Medium |
Conclusion
Since the results of Claude 4 Opus and Claude 4 Sonnet are not considerably better than those of GPT-4.1 and Claude 3.7 Sonnet, I do not recommend them, given their higher cost.
Best Overall (Code Quality & Best Practices):
- GPT-4.1 and Claude 3.7 Sonnet (Thinking) stand out for their structured Page Object Models, modularity, and adherence to Playwright best practices. Both are highly maintainable and readable, with good abstraction and scalability.
- SWE-1 is also strong, but its legacy browser setup/teardown could be modernized, and the generated test and locators needed manual updates afterwards.
Most Readable for Small Tests:
- Deepseek R1 and xAI Grok-3 are readable and easy to follow for small, simple scenarios but lack abstraction and scalability for larger suites.
Best for Large Automation Suites:
- GPT-4.1 and Claude 3.7 Sonnet (Thinking) are preferred due to their maintainability, modularity, and extensibility.
What's next?
The world of software development and testing is rapidly changing, so keep learning and experimenting.
Please do not hesitate to start a conversation about the test or its results.
🙏🏻 Thank you for reading! Building robust, scalable automation frameworks is a journey best taken together. If you found this article helpful, consider joining a growing community of QA professionals 🚀 who are passionate about mastering modern testing.
Join the community and get the latest articles and tips by signing up for the newsletter.