On April 17, 2025, OpenAI introduced two new reasoning models, o3 and o4-mini. The tech world reacted quickly, with many claiming these models reach “genius-level” capabilities. But how much of this is genuine innovation, and how much is just hype?
Over the past year, the AI space has seen a rapid succession of launches, including GPT-4.1, GPT-4.5, and OpenAI’s image generation tools. With new versions arriving faster than most developers can test, keeping up has become a challenge. The latest models continue this trend, and the naming conventions have started to cause confusion.
What’s New in o3 and o4-mini?
o3 and o4-mini aim to be stronger reasoning engines, especially useful for writing and analyzing code. Alongside these models, OpenAI also released an open-source command-line tool called Codex, designed to assist developers by writing, running, and analyzing code directly within the terminal or IDE, similar to Anthropic’s Claude Code.
Testing Codex
Installing Codex is simple: a quick npm command, followed by setting your OpenAI API key. The testing goal: build a basic YouTube clone.
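For context, the setup amounts to a couple of terminal commands. Here is a minimal sketch, assuming the published @openai/codex npm package and an OPENAI_API_KEY environment variable; the prompt wording is illustrative rather than the exact one used:

```bash
# Install the Codex CLI globally (requires a recent Node.js)
npm install -g @openai/codex

# Provide an OpenAI API key via the environment
export OPENAI_API_KEY="sk-..."   # replace with your own key

# Kick off the test task; the prompt wording here is illustrative
codex "Build a basic YouTube clone with a video grid homepage and a watch page"
```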
While the tool responded to the prompt and attempted to structure the project, it ended up creating empty directories without successfully generating usable code. Even more specific requests—such as generating Svelte 5 code with Runes—failed.
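The narrower follow-up looked something like this (wording again illustrative). Runes such as $state and $derived are Svelte 5’s new reactivity primitives, which is exactly where the generated code fell short:

```bash
# Single quotes keep the shell from expanding $state / $derived
codex 'Create a Svelte 5 video card component using runes ($state, $derived) for reactivity'
```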
These failures suggest platform compatibility may still be an issue, particularly on Windows; results could differ on macOS or Linux systems.
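One workaround worth trying on Windows (an untested assumption on my part, not something verified in this experiment) is running the CLI inside WSL rather than natively:

```bash
# From an elevated PowerShell or CMD: install WSL with the default Ubuntu distro
wsl --install

# Then, inside the WSL shell, set up Node.js (e.g. via nvm) and reinstall the CLI
npm install -g @openai/codex
```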
Claude Code vs. Firebase Studio
For comparison, the same task was run using Claude Code and Firebase Studio (formerly Project IDX by Google).
Claude Code showed some initial promise and successfully executed basic commands, though it also failed to properly generate Svelte 5 with Runes.
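For reference, the Claude Code side of the comparison boils down to similar steps. The @anthropic-ai/claude-code npm package and the -p (print) flag are Anthropic’s documented defaults, while the prompt wording is again illustrative:

```bash
# Install Claude Code globally
npm install -g @anthropic-ai/claude-code

# Authenticate with an Anthropic API key (interactive login is also supported)
export ANTHROPIC_API_KEY="sk-ant-..."

# Run a single prompt non-interactively from the project directory
claude -p "Build a basic YouTube clone with Svelte 5 and runes"
```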
Firebase Studio stood out for its speed but ignored the Svelte prompt entirely, defaulting to Next.js instead. This behavior may reflect a tendency of some AI tools to favor popular frameworks over developer-specific requests.
Who Comes Out on Top?
None of the tools tested performed flawlessly. Each has its strengths, but also significant limitations. However, this isn’t necessarily a negative outcome.
What we’re seeing is the natural result of an AI tooling boom—companies like OpenAI, Google, Microsoft, and Anthropic are all aggressively pushing innovations in developer productivity. Microsoft, in particular, has made a strong move with its upgraded Copilot Agent Mode, which combines file creation, command execution, and deep context-awareness for a more integrated developer experience.
Despite these advancements, many developers still consider Gemini 2.5 to be the most effective programming model, especially when paired with Firebase Studio for end-to-end development and deployment.
Conclusion
The AI tooling landscape is evolving rapidly. While none of today’s models or assistants can yet deliver flawless results on complex tasks, the potential they show is undeniable.
Now is a great time for developers to explore, test, and integrate these tools into their workflows. Even with their imperfections, they offer meaningful productivity boosts and lay the groundwork for the next generation of coding experiences.