I ran Claude Sonnet 4 and GPT-4o on the same three client tasks: Supabase RLS policies for a multi-tenant B2B app, a Stripe webhook handler with idempotency, and a Next.js App Router route that composes auth + billing. Same prompts, same repo context, same acceptance criteria. This is not a benchmark blog post; it is what I actually use in Cursor this week.
The Tasks (Controlled)
Task A: RLS for organizations, memberships, and child tables scoped by org_id. Task B: Stripe checkout.session.completed + subscription lifecycle with a webhook ledger table. Task C: Protected dashboard layout with server-side getUser() and subscription gate. I scored each output on: compiles first try, security correctness, edge cases mentioned, and hallucinated APIs.
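To make Task C concrete, here is a minimal sketch of the subscription gate as a pure decision function, so it can be tested without Next.js. The types, the `gateDashboard` name, and the `/login` and `/billing` paths are hypothetical and not taken from the client repo. Letting `trialing` users through is also an assumption, and the status union lists only a subset of Stripe's subscription statuses.

```typescript
// Hypothetical shapes: the post does not show its user or subscription types.
type User = { id: string };

// Subset of Stripe subscription statuses; null means no subscription row exists.
type SubscriptionStatus = "active" | "trialing" | "past_due" | "canceled" | null;

type GateResult =
  | { kind: "allow" }
  | { kind: "redirect"; to: string };

// Decides what the protected layout should do for this request.
// "/login" and "/billing" are placeholder paths, not the author's routes.
function gateDashboard(user: User | null, status: SubscriptionStatus): GateResult {
  // No server-side user: send to login before looking at billing at all.
  if (user === null) {
    return { kind: "redirect", to: "/login" };
  }
  // Assumption: trialing counts as paid access.
  if (status === "active" || status === "trialing") {
    return { kind: "allow" };
  }
  // Everything else (past_due, canceled, no subscription) goes to billing.
  return { kind: "redirect", to: "/billing" };
}
```

In a real layout, the server component would call getUser(), load the subscription status, and act on the returned `GateResult`. Keeping the decision pure means the edge cases can be covered by plain unit tests.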
Results Summary
- RLS: Claude produced correct policies with auth.uid() membership checks on first pass. GPT-4o missed INSERT policies on junction tables twice and suggested disabling RLS “temporarily for debugging.”
- Stripe webhooks: GPT-4o was faster on boilerplate (signature verify, raw body). Claude added idempotency keys and race notes without being asked.
- Next.js routes: Tie on happy path. Claude caught async cookies in Next 15; GPT-4o still generated sync cookies once.
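The idempotency point in the Stripe item can be sketched roughly as follows. This is an in-memory stand-in for the webhook ledger table, not the handler either model produced. `WebhookLedger`, `handleEvent`, and the event shape are hypothetical names, and `apply` is synchronous only to keep the sketch short. In Postgres, the claim step would typically rely on a unique constraint on the event id rather than a `Map`.

```typescript
// Hypothetical event shape: only the fields this sketch needs.
type WebhookEvent = { id: string; type: string };

type LedgerStatus = "processing" | "done";

// Stand-in for the webhook ledger table, keyed by Stripe event id.
class WebhookLedger {
  private rows = new Map<string, LedgerStatus>();

  // Returns true only for the first caller that claims this event id.
  claim(eventId: string): boolean {
    if (this.rows.has(eventId)) return false;
    this.rows.set(eventId, "processing");
    return true;
  }

  // Records that the side effects for this event completed.
  markDone(eventId: string): void {
    this.rows.set(eventId, "done");
  }

  // Drops the claim so a later retry of the same event can be processed.
  release(eventId: string): void {
    this.rows.delete(eventId);
  }
}

// Applies an event at most once per event id.
// A real handler would be async and would run after signature verification.
function handleEvent(
  ledger: WebhookLedger,
  event: WebhookEvent,
  apply: (event: WebhookEvent) => void,
): "processed" | "duplicate" {
  if (!ledger.claim(event.id)) return "duplicate";
  try {
    apply(event);
    ledger.markDone(event.id);
    return "processed";
  } catch (err) {
    // Release on failure so the event is not silently lost.
    ledger.release(event.id);
    throw err;
  }
}
```

Claiming before applying is what closes the race between two deliveries of the same event: the second delivery sees the existing row and returns early instead of repeating the side effects.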
Hallucination Rate (My Notebook)
Over 30 generations each: Claude invented Supabase helpers or wrong RPC names in ~8% of outputs. GPT-4o came in at ~14%, often confidently wrong on Stripe API version flags or deprecated Next patterns. For architecture reasoning, Claude wins. For “give me the route handler skeleton in 90 seconds,” GPT-4o wins.
Comparison Table
| Dimension | Claude Sonnet 4 | GPT-4o |
|---|---|---|
| Architecture / security | Strong | Adequate |
| Speed on boilerplate | Good | Faster |
| RLS correctness | Best | Needs review |
| Hallucinations | Lower | Higher |
How I Test in Cursor
I never accept multi-file diffs without running npm run type-check and running the Stripe webhook test from the CLI. For RLS, I use two test users in the Supabase SQL editor and verify that cross-tenant reads return zero rows. Models that skip verification steps get rejected even if the code “looks fine.”
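The two-user RLS check can be modeled in a few lines. This is only an in-memory illustration of what the policy is supposed to guarantee. It does not replace running the check against Postgres. The `Membership` and `Project` shapes, the function names, and the fixture data are all made up for the example.

```typescript
// Hypothetical rows mirroring the memberships table and one org-scoped child table.
type Membership = { userId: string; orgId: string };
type Project = { id: string; orgId: string };

// Models the effect of a SELECT policy: a row is visible only when the
// current user has a membership in the row's org.
function visibleProjects(
  userId: string,
  memberships: Membership[],
  projects: Project[],
): Project[] {
  const orgIds = new Set(
    memberships.filter((m) => m.userId === userId).map((m) => m.orgId),
  );
  return projects.filter((p) => orgIds.has(p.orgId));
}

// Acceptance check: rows from a foreign org that the user can still see.
// A correct policy yields an empty array here.
function leakedRows(
  userId: string,
  foreignOrgId: string,
  memberships: Membership[],
  projects: Project[],
): Project[] {
  return visibleProjects(userId, memberships, projects).filter(
    (p) => p.orgId === foreignOrgId,
  );
}

// Made-up fixtures: two test users, each in their own org.
const sampleMemberships: Membership[] = [
  { userId: "alice", orgId: "org_a" },
  { userId: "bob", orgId: "org_b" },
];
const sampleProjects: Project[] = [
  { id: "p1", orgId: "org_a" },
  { id: "p2", orgId: "org_b" },
];

const leaks = leakedRows("alice", "org_b", sampleMemberships, sampleProjects);
console.log(`cross-tenant rows visible to alice: ${leaks.length}`);
```

The same shape applies in the SQL editor: query as the first test user for rows owned by the second user's org and expect an empty result.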
Temperature stays low for security tasks. For in-app marketing copy, I raise it. The model choice is only half the equation; the acceptance tests are what keep client MVPs out of incident reports.
