
Thoughts on GPT-5

OpenAI's new GPT-5 model raises an obvious question for TypeScript-heavy engineering teams: does it actually write better production TypeScript code than today's alternatives?

Over the last few weeks, we ran GPT-5, via Charlie, head-to-head against Claude Code across ten non-trivial TypeScript issues drawn from live open-source projects. Charlie is our TypeScript-focused code agent, built to handle the same tasks an experienced TypeScript engineer would.

  • Charlie with GPT-5 beat Claude Code on all 10 case-by-case comparisons.
  • Pull requests generated by GPT-5 resolved 29% more issues than o3.
  • PR review quality rose 5% versus o3.

For organizations already adopting agentic coding workflows, these results suggest GPT-5 is the first model that can be trusted not merely to propose patches but to merge them with minimal human correction. Overall, we've found the model extremely impressive.

LLMs have already reached the point where agents can open pull requests in active repositories, yet most still impose a hidden triage cost: failing CI, missing edge cases, or letting type errors slip past review.

TypeScript is an unforgiving referee, and in modern full-stack projects, static correctness is table stakes. We believe that demonstrating an LLM can repeatedly satisfy the TypeScript compiler and the test runner is a good proxy for real-world engineering value.

We focused on this production-readiness threshold.

Internal Pull Request Review Evals

One of Charlie's most important features is his ability to review code: catching bugs, logic errors, security vulnerabilities, performance issues, and style violations.

To measure Charlie's code review performance, we developed comprehensive evaluations across a diverse array of code issues. Our code review evals are designed to check whether Charlie leaves useful PR review comments when appropriate and refrains from commenting when not.

We've split the charts below into cases where he "should comment" and "should not comment." The numbers below reflect the percentage of our eval dataset that resulted in a comment match. For "should comment", a higher number is better, indicating that more bugs and issues are caught and reported. For "should not comment", a higher number is worse, indicating noisy comments.

Model                     Score (%)   Δ vs o3
o3                        63          0%
gpt-5                     66          +5%
Claude Opus 4.1 (ET)      34          -46%

Chart: positive and negative scores grouped by model.
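
To make the scoring concrete, here is a minimal TypeScript sketch of how match percentages like those above could be computed from judged examples. The types and field names are illustrative only and do not reflect our actual eval harness.

// Illustrative aggregation of review-eval results into match percentages.
// These types and names are hypothetical, not our real schema.
interface ReviewEvalResult {
  expectation: "should-comment" | "should-not-comment";
  commentMatched: boolean; // a judge found a generated comment matching the expected one
}

function matchRate(
  results: ReviewEvalResult[],
  expectation: ReviewEvalResult["expectation"],
): number {
  const subset = results.filter((r) => r.expectation === expectation);
  if (subset.length === 0) return 0;
  const matched = subset.filter((r) => r.commentMatched).length;
  // For "should-comment" a higher rate is better; for "should-not-comment" it signals noise.
  return Math.round((matched / subset.length) * 100);
}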

Internal Pull Request Creation Evals

Our PR creation evals are SWE-bench style but require more end-to-end autonomy. Each run starts with a real GitHub issue comment asking Charlie to open a PR. Charlie scopes the task, makes the code changes, and submits a complete pull request. We then score the attempt by comparing Charlie's PR to the one that actually merged to fix the issue, emphasizing correctness and production-ready changes over prose.
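
As a rough illustration of what a single PR-creation eval run involves, the sketch below compares an agent-generated diff against the PR that actually merged, using an LLM judge. Every name here (PrCreationCase, scorePrCreation, the judge callback) is hypothetical; the real harness is more involved.

// Hypothetical shape of a single PR-creation eval run; not our production code.
interface PrCreationCase {
  repo: string;          // e.g. "org/project"
  issueNumber: number;   // the GitHub issue that asked for a PR
  mergedPrDiff: string;  // the human-authored PR that actually fixed the issue
}

interface PrCreationVerdict {
  resolved: boolean; // does the candidate PR resolve the issue comparably to the merged fix?
  reason: string;    // the judge's textual justification
}

async function scorePrCreation(
  testCase: PrCreationCase,
  candidateDiff: string,
  judge: (prompt: string) => Promise<PrCreationVerdict>,
): Promise<PrCreationVerdict> {
  // Emphasize correctness and production-readiness over prose when comparing diffs.
  const prompt = [
    `Issue: ${testCase.repo}#${testCase.issueNumber}`,
    `Reference (merged) diff:\n${testCase.mergedPrDiff}`,
    `Candidate diff:\n${candidateDiff}`,
  ].join("\n\n");
  return judge(prompt);
}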

Model                        Score (%)   Δ vs o3
o3                           62          0%
gpt-5 (same prompt as o3)    72          +16%
gpt-5 (optimized prompt)     80          +29%

Charlie vs Claude Code

To compare the real-world performance of Charlie against Claude Code, we took 10 issues from open-source TypeScript repositories and invoked both (via GitHub Action) with the same request: resolve this issue. We then ran a scoring tool that analyzed each resulting PR across three dimensions: testability, description, and overall quality.

The scoring tool ran at least three trials for each PR, emitting numeric scores and textual reasoning. We averaged those scores, summarized the reasoning into themes for each model, and then asked GPT-4o and Gemini to render a verdict.
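
Here is a minimal sketch of the averaging step, assuming each trial emits a numeric score per dimension; the type and function names are illustrative, not our actual tooling.

// Average per-dimension judge scores across repeated trials for one PR.
// Names are illustrative only.
type Dimension = "testability" | "description" | "overallQuality";

interface TrialScore {
  dimension: Dimension;
  score: number;  // 0.0 to 1.0, as emitted by the judge
  reason: string; // textual reasoning, later summarized into themes
}

function averageByDimension(trials: TrialScore[]): Record<Dimension, number> {
  const result = {} as Record<Dimension, number>;
  for (const dim of ["testability", "description", "overallQuality"] as const) {
    const scores = trials.filter((t) => t.dimension === dim).map((t) => t.score);
    result[dim] = scores.length
      ? scores.reduce((sum, s) => sum + s, 0) / scores.length
      : 0;
  }
  return result;
}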

What emerged was a clear pattern: Charlie is more effective at creating production-ready TypeScript PRs. Despite Claude Code's strengths in clarity of documentation, it was less likely to produce a correct, adequately tested PR.

Averages across all issues

Case-by-case comparison

charlie-evals/outline/issues/9

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: While Claude's PR had better narrative and slightly higher implementation polish, Charlie's strong testability and comparable quality make his the more complete and reliable contribution.
  • Gemini: Although Claude's PR has a slightly higher average quality score and better description, its testability is critically low (average 0.1044 vs. Charlie's 0.7420). Claude's lack of testing for new behavior, especially for core scenarios like relative-to-absolute conversion, leaves the integrity of the feature completely unverified. In contrast, Charlie's PR adds meaningful unit tests, even if they have some limitations, making his contribution far more reliable and verifiable.

Charlie

  • Testability: The PR adds meaningful unit tests for the new absolutizeImageUrls utility, verifying key behaviors like converting root-relative image URLs, preserving absolute URLs, and ignoring non-image links. However, coverage is limited by missing edge cases (e.g., protocol-relative URLs, titles, and empty input) and lacks integration tests for the downstream usage in copyDocumentAsMarkdown.
  • Description: The PR title is clear and the description correctly explains the problem and high-level solution, aligning with the actual code changes. However, the body contains placeholder text, lacks concrete details like function names and file paths, and includes a broken verification section—reducing clarity, coverage, and long-term usefulness.
  • Overall quality: The change is focused, minimal, and idiomatic: it introduces a well-scoped helper to convert root-relative image URLs to absolute ones and integrates it cleanly into the “Copy as Markdown” action. While the implementation is correct and tests cover key cases, minor nits include regex brittleness, an unused parameter, and edge cases involving alt text or env.URL handling.

Claude

  • Testability: No tests were added for the new baseUrl behavior in markdown serialization, despite introducing logic to convert relative image URLs to absolute URLs via Image.toMarkdown and Document.toMarkdown. Core scenarios—including relative-to-absolute conversion, trailing slash handling, absent or absolute URLs, and propagation of baseUrl—are all untested, leaving presence, depth, and integrity entirely lacking.
  • Description: The title is clear and the description accurately explains the problem and high-level fix: converting relative image paths in Copy as Markdown to absolute URLs using a new optional baseUrl passed from team.url. While the test plan is relevant and the framing is sound, most summaries omit details like the serializer API change, the restriction to leading-slash paths, and the fact that only specific markdown generation paths are affected.
  • Overall quality: The change is focused, minimal, and idiomatic: it introduces an optional baseUrl to the Markdown serializer, uses it in the Image node to convert root-relative image paths to absolute URLs, and wires it into Document.toMarkdown for the “Copy as Markdown” action via team.url. While the implementation is correct and consistent with existing patterns, a minor limitation is that other Markdown serializers (like ClipboardTextSerializer) don't yet support baseUrl, so some copy paths still yield relative URLs—acceptable given the stated scope.

charlie-evals/typeorm/issues/11

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Despite a stronger description, Claude's patch is fundamentally incorrect and untested. Charlie's PR, although imperfect, provides a correct and safer fix with partial test coverage—making it the clear choice.
  • Gemini: Claude's PR, despite a good description, has a very low average quality score (0.3433) and testability (0.1233). The analysis explicitly states Claude's change "is likely incorrect and inconsistent with TypeORM's handling of types," risks "introducing type mismatches," and "does not address the actual root cause of the upsert error." Charlie's PR, while its test coverage isn't perfect for the specific regression, provides a correct fix and is deemed "conservative but safe," with a much higher average quality (0.8400) and testability (0.5167). Correctness and safety are paramount for a fix addressing a bug.

Charlie

  • Testability: Tests were added for the MySQL upsert bug, verifying behavior for number and bigint primary keys by performing real insert-update cycles without mocking. However, the specific code path guarded by the fix—triggered only when default or version columns are present—is never exercised, making the tests valid for general upsert behavior but incomplete and misleading as coverage for the actual regression.
  • Description: The title is clear and the description accurately captures the core bug and fix—skipping a post-insert SELECT when the primary key isn't populated during MySQL upserts. However, it includes several inaccuracies, such as claiming unrelated changes to createValueMap, broader type-agnostic behavior, and documentation updates, which aren't reflected in the diff and reduce the overall accuracy.
  • Overall quality: The change correctly fixes a MySQL upsert error by skipping the post-insert SELECT when any entity lacks a primary key, preventing unsafe queries and aligning with driver behavior. The implementation is minimal and uses existing APIs appropriately, though minor issues include overly verbose comments, unconditional insertion of possibly undefined IDs into InsertResult.identifiers, and skipping reloads for all entities even when some have valid IDs—conservative but safe.

Claude

  • Testability: The patch expands string coercion in createValueMap to additional numeric types and embedded columns, but no tests were added to validate this behavior. As a result, the change lacks coverage for the newly supported types, embedded/non-embedded paths, and upsert scenarios, leaving presence, depth, and integrity unaddressed.
  • Description: The title and description are clear, accurate, and aligned with the diff: they correctly identify the change to ColumnMetadata.createValueMap, which now stringifies additional numeric ID types when the generation strategy is increment or rowid. The rationale is grounded in a real upsert failure mode, though minor omissions include not explicitly mentioning the rowid strategy or both affected code paths.
  • Overall quality: The change broadens a bigint-specific string conversion to cover many integer types, but this is likely incorrect and inconsistent with TypeORM's handling of types. While minimal, it risks introducing type mismatches, breaks existing ID comparison logic, and does not address the actual root cause of the upsert error, making the fix conceptually misguided and potentially harmful.

charlie-evals/typeorm/issues/10

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Though Claude's explanation is clearer, Charlie's PR is better tested (even if ineffectively) and technically stronger. Neither is complete, but Charlie's change is more robust overall.
  • Gemini: Claude's PR again suffers from a critical lack of testing (average testability 0.1167), with "no tests added to cover the new utc option for date columns" and "core UTC conversion logic remains entirely untested." While Claude's description is strong, the implementation has key omissions and potential compile errors. Charlie's PR, while its tests are "largely superficial and ineffective at validating the intended behavior" due to environment setup, at least has tests, and the implementation is described as "solid and backward-compatible," even if incomplete. The presence of tests, even flawed ones, and a more robust underlying implementation makes Charlie's PR technically stronger.

Charlie

  • Testability: Tests were added for the new utc date column option, but due to the test suite forcing TZ=UTC, they fail to meaningfully distinguish UTC from local behavior. While coverage exists for SQLite and DateUtils, it misses critical paths like other drivers, change detection, hydration, and arrays—rendering the tests largely superficial and ineffective at validating the intended behavior.
  • Description: The title is clear and the description accurately captures the purpose and scope of the change—adding a utc flag for date columns, updating relevant metadata, passing the flag through drivers, and adding tests. However, clarity is undermined by broken placeholders and vague phrasing that obscure key terms (like utc and useUtc), and it omits that the flag defaults to false and only affects type 'date'.
  • Overall quality: The change cleanly introduces a utc flag for date columns, wiring it through ColumnOptions, ColumnMetadata, DateUtils, and most drivers in a minimal and idiomatic way. While the implementation is solid and backward-compatible, it is incomplete: SQL Server and Aurora Postgres drivers don't fully honor the flag during persistence, and EntitySchema configurations lack support.

Claude

  • Testability: No tests were added to cover the new utc option for date columns, despite changes across drivers, metadata handling, and DateUtils. The core UTC conversion logic remains entirely untested, including unit, functional, and edge cases—leaving presence, depth, and integrity of coverage missing.
  • Description: The description is accurate, purpose-driven, and aligns well with the diff—it introduces a utc flag for 'date' columns, updates metadata and drivers to pass it through, and modifies DateUtils.mixedDateToDateString accordingly. Minor issues include a title that's too narrowly focused on DateUtils and a slight overstatement in claiming “all database drivers” were updated, though coverage and motivation are clearly conveyed.
  • Overall quality: The change introduces a UTC-aware utc flag for date columns, updating ColumnOptions, ColumnMetadata, DateUtils, and most drivers in a minimal, idiomatic, and backward-compatible way. However, key omissions remain: EntitySchemaColumnOptions lacks support, Oracle and Aurora drivers don't fully propagate the flag, and mixedDateToDateString may trigger a compile error under strict null checks without coercion.

charlie-evals/outline/issues/14

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Despite Claude's strength in writing and test coverage, the implementation is flawed in ways that could introduce production bugs. Charlie's version is more reliable and idiomatic, making it the better overall contribution.
  • Gemini: While Claude's PR boasts higher testability (0.8867 vs. Charlie's 0.7867) and description scores, its "Overall Quality" is significantly lower (0.6833 vs. Charlie's 0.8833) due to "significant correctness issue[s]: failure to deduplicate mentions by userId can lead to race conditions and constraint violations, and filtering based on actorId causes incorrect deletion or retention of mentions." These are critical flaws that could lead to "production bugs." Charlie's PR, though it has a "minor correctness risk from parallel findOrCreate calls," is generally described as "minimal, idiomatic, and well-tested." The fundamental correctness issues in Claude's implementation outweigh its stronger test coverage and description.

Charlie

  • Testability: The new MentionsProcessor is well-covered with behavioral tests for publish, update, and delete events, asserting real database outcomes using realistic content. While core functionality is validated, key edge cases—like drafts, non-existent users, duplicates, and full mention removal—are untested, making the coverage solid but not fully comprehensive.
  • Description: The title and description are accurate and aligned with the diff, covering the addition of a Mention model, migration, MentionsProcessor for publish/update/delete events, related tests, and env var changes. While the high-level framing is helpful and the scope is well-covered, minor clarity issues (e.g. missing nouns, incomplete verification line) and omission of details like the unique constraint slightly reduce signal density.
  • Overall quality: The change cleanly adds backend support for user mentions by introducing a Mention model, a corresponding migration, and a MentionsProcessor that mirrors the existing BacklinksProcessor pattern. The implementation is minimal, idiomatic, and well-tested, though it has a minor correctness risk from parallel findOrCreate calls on duplicate mentions and could benefit from deduplication and clearer handling of empty sets during updates.

Claude

  • Testability: The new MentionsProcessor is thoroughly tested across publish, update, and delete events, with strong behavioral assertions using real models and database state. Core functionality is well-covered, including skipping non-existent or self-mentions and cleaning up stale data, though minor gaps remain in deduplication, helper methods, and certain edge cases like updates to zero mentions and default field storage.
  • Description: The title and description are clear, accurate, and well-aligned with the diff, covering the addition of a UserMention model, a MentionsProcessor triggered on publish/update/delete events, a migration with indexes, and relevant tests. While the event-driven behavior is well explained, minor details like the unique composite index, export additions, and nuances around test scope and unpublished docs are omitted or slightly overstated.
  • Overall quality: The change introduces a UserMention model and MentionsProcessor that mirror the existing BacklinksProcessor pattern, with minimal, idiomatic implementation and consistent use of helpers, migrations, and Sequelize patterns. However, it has a significant correctness issue: failure to deduplicate mentions by userId can lead to race conditions and constraint violations, and filtering based on actorId causes incorrect deletion or retention of mentions—both of which should be addressed for correctness.

charlie-evals/outline/issues/15

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Claude's PR has better testability and documentation but suffers from critical correctness issues, including misaligned lookups and premature link rewriting that make it unlikely to function in real scenarios. Charlie's PR lacks tests and has its own bugs, but follows a more robust architectural approach and avoids fatal structural flaws. Overall, Charlie's is the stronger foundation despite its incompleteness.
  • Gemini: Charlie's PR for issue 15, while lacking tests and having its own issues like regex bugs and improper handling of mentions, is deemed to have a sound high-level approach. In contrast, Claude's PR, despite higher testability, contains critical flaws such as mismatched lookups and premature link rewriting, rendering its solution fundamentally broken and unlikely to work in real scenarios. Therefore, Charlie's PR offers a more robust foundation despite its incompleteness, making it the stronger choice.

Charlie

  • Testability: No tests were added for the new ProseMirror link-handling logic across export, import, and placeholder resolution tasks. Despite significant behavioral changes, existing tests only check import counts and lack assertions for internal link rewriting, placeholder handling, or edge cases—leaving presence, depth, and integrity of coverage effectively absent.
  • Description: The title and description are clear, accurate, and aligned with the diff, explaining import-time rewriting of internal links to placeholders and export-time normalization to relative URLs to ensure portability across hosts. While well-framed and grounded in the implementation, it slightly overstates scope (e.g., claiming coverage of all metadata fields) and references tests that aren't present, with minor omissions like dropped anchors/query params.
  • Overall quality: The change implements internal link normalization on export and remapping on import using ProseMirror traversal patterns, aligning with existing code structure. However, it has several correctness and integration issues—broken placeholder resolution for collections, regex bugs that miss links with fragments or queries, improper handling of mentions and external URLs, and fragile node transformations—all of which make the implementation incomplete and potentially unreliable despite a sound high-level approach.

Claude

  • Testability: A new test was added for the internal link-rewriting logic, but it bypasses the real import flow by invoking a private method, resulting in shallow, implementation-coupled coverage. While it confirms a basic slug remapping works, it misses key behaviors like absolute URLs, external link preservation, fallback by externalId, hash fragments, and updates to serialized text—leaving integration and edge cases untested.
  • Description: The title and description are clear, accurate, and purpose-driven, explaining the fix for internal links not updating on JSON import by mapping old URLs to new paths, updating ProseMirror link marks, and preserving hash fragments. While grounded in the diff and mentioning the added test, it omits that collection descriptions are also processed and slightly misphrases the update as targeting “new document IDs” instead of new paths.
  • Overall quality: The change attempts to rewrite internal document links during import using patterns consistent with existing attachment logic, but contains two critical flaws: it mismatches slug-based lookups against a urlId-keyed map, and performs link rewriting before urlIds may be mutated by preprocessDocUrlIds, causing potential mismatches. The added test passes only due to unrealistic data, making the fix unlikely to work in real scenarios despite its structural correctness.

charlie-evals/type-fest/issues/2

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Charlie's PR is complete, well-tested, and correctly exported, with only minor omissions in union test coverage and edge case documentation. Claude's version is similarly structured but fails to export the new type, making the utility unusable despite otherwise solid implementation. Charlie's PR is clearly stronger due to its correctness and completeness.
  • Gemini: Charlie's PR is superior, with higher average quality and testability. Claude's PR is incomplete because it omits exporting the new IsUndefined type to the public API, rendering it unusable for consumers despite an otherwise correct implementation. Charlie's solution is complete, correctly exported, and well-tested, making it the clear winner.

Charlie

  • Testability: The new IsUndefined type is covered by a dedicated tsd test that validates key TypeScript cases and follows established patterns, ensuring meaningful type-level assertions. While coverage is strong, it omits union type tests (e.g., string | undefined) and index export verification—leaving minor but non-critical gaps.
  • Description: The title and description are clear, concise, and accurately reflect the diff: adding an IsUndefined<T> type with TSDoc, index export, README update, and tsd tests. While well-scoped and grounded, it could be slightly improved by briefly mentioning the type's semantics or edge cases like any, never, and strictNullChecks.
  • Overall quality: The change cleanly adds a new IsUndefined<T> type, mirroring existing patterns like IsNull using a non-distributive conditional, and behaves consistently with the codebase's handling of edge cases like any and never. It is minimal, correctly exported, documented, and tested in line with existing conventions, with only minor nits around union test coverage and import style.

Claude

  • Testability: The IsUndefined type is well-covered by a dedicated tsd test that validates key TypeScript edge cases and mirrors existing patterns like IsNull. While the tests are idiomatic and behavior-focused, they miss important union scenarios (e.g., string | undefined), export verification via index.d.ts, and stricter type exactness—resulting in solid but slightly naive coverage.
  • Description: The title and description are clear, accurate, and aligned with the diff, summarizing the addition of the IsUndefined<T> utility and tests modeled after IsNull. While concise and grounded, the description slightly overstates test coverage and omits minor context like export status, edge case behavior (any, never, strictNullChecks), and the rationale for adding the utility.
  • Overall quality: The IsUndefined<T> type is implemented correctly and idiomatically, with matching documentation and tests that follow existing IsNull and IsNever patterns. However, the change is incomplete—it omits adding IsUndefined to index.d.ts, meaning it's not exported for consumers, which breaks the intended public API despite an otherwise minimal and consistent implementation.

charlie-evals/zod/issues/1

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Both PRs correctly resolve a TypeScript 5.9 circularity issue by refactoring type definitions without altering runtime behavior. Charlie's version stands out for its slightly stronger testability, more precise description, and cleaner use of infer-based types, while Claude's is equally correct but less targeted in test coverage and slightly less precise in its rationale. Charlie's PR is the stronger of the two, though both are solid and idiomatic fixes.
  • Gemini: Both Charlie's and Claude's PRs for issue 1 correctly address a TypeScript 5.9 circularity issue by refactoring type definitions, maintaining existing behavior without runtime changes. Charlie's solution shows slightly higher average testability and a more precise description, making it marginally superior. However, both are strong, type-only fixes for the stated problem.

Charlie

  • Testability: The change is a type-level refactor to avoid TS 5.9 circularity issues, and while no new tests were added, the existing suite thoroughly exercises z.input and z.output across diverse schemas, including recursive ones. However, because the suite runs on TS 5.5, it doesn't validate the TS 5.9-specific regression the change targets, leaving a key gap despite otherwise strong and idiomatic coverage.
  • Description: The title and description are clear, accurate, and purpose-driven, explaining the fix for TS 5.9 circularity by replacing Required<T>["prop"] with an infer-based conditional type. It correctly notes the type-only scope, no runtime/API changes, affected helpers (z.input, z.output, z.infer), and includes verification and explanatory comments. A minor overstatement is the unqualified claim that resulting types are “exactly the same,” but overall coverage and clarity are excellent.
  • Overall quality: The change is minimal, type-only, and correctly replaces Required<T>["prop"] with an infer-based conditional to avoid TS 5.9 circularity issues. It preserves behavior for valid and invalid types, aligns with existing patterns, and introduces no runtime changes—making it a precise and idiomatic fix.

Claude

  • Testability: The change removes Required<> from core input/output type aliases, but no tests were added to target its specific impact. While existing tests broadly exercise z.input, z.output, and z.infer across many schemas, they don't verify the nuanced optionality behavior this change affects—leaving a notable gap in presence and depth for the intended type-level refinement.
  • Description: The title and description are clear, accurate, and purpose-driven, explaining the fix for TypeScript 5.9 circular reference issues by removing Required<> from input/output type helpers. The before/after snippets match the diff, and the description correctly scopes the impact to z.input, z.output, and z.infer. Minor caveats include unverified claims about test resolution, backward compatibility, and the specific TS error code, but overall it's concise, grounded, and helpful.
  • Overall quality: The change is minimal and correct: it removes the unnecessary Required<> wrapper from _zod.input and _zod.output access to avoid TS 5.9 circularity issues, preserving behavior through existing conditional guards. It aligns with codebase conventions, introduces no new APIs, and maintains consistent, idiomatic style.

charlie-evals/trpc/issues/1

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Charlie's PR directly addresses the root cause of the MaxListenersExceededWarning with a thoughtful architectural fix, but suffers from execution flaws like potential memory leaks and unnecessary code. Claude's PR is simpler and avoids those pitfalls, but merely masks the issue by increasing the listener limit without testing or addressing the underlying cause. Charlie's solution is conceptually stronger, though it requires cleanup to be production-ready.
  • Gemini: Charlie's PR for issue 1 correctly addresses the MaxListenersExceededWarning, but introduces a likely memory leak and includes misleading comments. Claude's PR, while minimal and syntactically correct in suppressing warnings, probably masks a deeper issue and its rationale for a keep-alive refactor is weak. Overall, Charlie's PR offers a more direct, albeit flawed, fix to the core problem, making it slightly superior despite its execution issues.

Charlie

  • Testability: A regression test was added to reproduce and prevent MaxListenersExceededWarning from shared socket reuse, aligning with the change's intent. While it verifies the absence of warnings under high concurrency, it lacks deeper assertions on listener count, abort behavior, and cleanup—making the test meaningful but shallow and potentially unreliable due to flawed reuse of a single response object.
  • Description: The title and description are clear, accurate, and purpose-driven, explaining the fix for MaxListenersExceededWarning by replacing per-request socket listeners with a single per-socket listener and tracking AbortControllers via a symbol-keyed Set. While coverage and clarity are strong, the description slightly misstates which listener was removed and omits small implementation details like req.off and abort cleanup.
  • Overall quality: The change correctly addresses the MaxListenersExceededWarning by consolidating per-request socket 'close' listeners into a single per-socket listener and tracking AbortControllers in a Set. However, it introduces a likely memory leak by not cleaning up controllers on normal response completion (res.finish) and includes misleading comments and unnecessary code (e.g., req.off without a matching listener), making the implementation correct in concept but flawed in execution.

Claude

  • Testability: The patch introduces setMaxListeners(20) on WebSocket connections and a minor keep-alive refactor, but no tests were added to verify this behavior. While existing WebSocket tests are comprehensive, they don't check listener limits or guard against warnings—leaving the new functionality untested despite being easy to assert within the current setup.
  • Description: The title is clear and the description accurately explains the main change: setting maxListeners(20) on each WebSocket to avoid MaxListenersExceededWarning under load, with a helpful repro command. However, it overstates a secondary change in handleKeepAlive as improving cleanup, when it merely extracts a handler without altering behavior. Overall, it's grounded and clear with minor imprecision.
  • Overall quality: The change is minimal and syntactically correct—calling client.setMaxListeners(20) suppresses warnings without breaking behavior—but it likely masks a deeper issue, as the listener count per client appears too low to justify the increase. The keep-alive refactor is cosmetic and doesn't improve cleanup as claimed, making the rationale weak despite the change being harmless.

charlie-evals/zod/issues/1

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Both PRs correctly remove util.Whatever from $ZodCatchInternals to tighten .catch() input types without affecting runtime behavior, and both include strong, idiomatic tests. Claude's version is solid but slightly overstated in coverage and includes a few distracting details. Charlie's PR is more polished overall—with clearer rationale, more precise description, and slightly stronger test coverage—making it the stronger submission.
  • Gemini: Charlie's PR for issue 1 effectively tightens the .catch() input types, ensuring stricter type inference without altering runtime behavior. While Claude's PR also addresses this, Charlie's solution demonstrates slightly higher testability and is more robustly documented, providing a more complete and reliable type-only change. Both are strong, but Charlie's is marginally superior due to better test coverage for the targeted regression.

Charlie

  • Testability: The change tightens .catch() input types by removing util.Whatever, and tests were thoroughly updated to reflect this. They validate stricter input typing and preserve runtime behavior across key compositions, with strong compile-time and runtime assertions—leaving only minor gaps like union cases or StandardSchema inference.
  • Description: The title and description are clear, accurate, and purpose-driven, explaining the removal of util.Whatever from $ZodCatchInternals to prevent .catch() from widening input types. The motivation, scope, and impact are well framed, with coverage of type-level changes, unchanged runtime behavior, updated tests, and added rationale comments—only minor imprecisions like referring to “z4.core.util.Whatever” slightly detract.
  • Overall quality: The change is minimal, correct, and type-only: it removes util.Whatever from $ZodCatchInternals to prevent .catch() from widening input types, restoring stricter inference without altering runtime behavior. The implementation aligns with existing patterns, updates tests appropriately, and includes a clear comment explaining the rationale, making the change idiomatic and consistent.

Claude

  • Testability: The change narrows .catch() input types, and both new and updated tests directly assert this with type-level checks and runtime validations across strings, enums, and object schemas. The coverage is strong and idiomatic, verifying key compositions and fallback behavior, with only minor omissions like native enums and union cases.
  • Description: The title and description are clear, accurate, and aligned with the diff, summarizing the removal of util.Whatever from $ZodCatchInternals to prevent input type widening while preserving runtime behavior. Minor issues include an irrelevant mention of a TanStack Router issue and slightly overstated claims about test coverage, but overall it's grounded and well-scoped.
  • Overall quality: The change is minimal, correct, and aligns with existing patterns by removing util.Whatever from $ZodCatchInternals to prevent .catch() from widening input types, while preserving runtime behavior. Though test expectations around optionality reveal a minor inconsistency, the implementation itself is idiomatic and internally consistent.

charlie-evals/type-fest-pr-1123-test/issues/2

Scores are averages across provided metric entries.

Charlie vs Claude by dimension

LLM-judged text

  • GPT: Claude's PR includes solid tests and documentation but contains a fundamental type error that breaks the core functionality of ExcludeStrict, causing valid use cases to fail. Charlie's implementation, by contrast, is correct, minimal, and idiomatic, with comprehensive tests and precise documentation. Charlie's PR is clearly superior due to its correctness and completeness.
  • Gemini: Claude's PR for issue 2 is fundamentally broken, as its ExcludeStrict implementation always resolves to never, causing valid use cases to fail despite high testability. Charlie's PR, in contrast, correctly implements ExcludeStrict with strong test coverage, clear documentation, and consistent updates to exports and the README. Therefore, Charlie's solution is the clear winner for its correctness and complete integration.

Charlie

  • Testability: The new ExcludeStrict type is thoroughly tested with a dedicated tsd suite covering valid and error scenarios across primitives, object unions, modifiers, and special types. Tests are idiomatic and behavior-focused, with minor gaps such as missing export verification and a few edge cases, but overall coverage is strong and purposeful.
  • Description: The title and description are clear, accurate, and purpose-driven, describing the addition of ExcludeStrict as a complement to ExtractStrict with compile-time validation that U is a subset of T. It fully covers the diff (type file, export, README, tests), uses precise TypeScript terminology, and avoids fluff—minor nits include a slightly imprecise use of “subset” and unverifiable external issue reference.
  • Overall quality: The change adds ExcludeStrict using the same subset-validation pattern as ExtractStrict, enforcing that each member of U matches a branch of T before exclusion. The implementation is minimal, type-safe, and idiomatic, with comprehensive tests, clear documentation, and consistent updates to exports and the README—making it a solid and well-integrated addition.

Claude

  • Testability: The new ExcludeStrict type is well-covered by a dedicated tsd test file that mirrors the depth of similar utilities like ExtractStrict, testing both valid and invalid cases across a wide range of type scenarios. The tests are idiomatic and behavior-focused, with only minor gaps such as missing export verification and a few untested edge types like unknown or symbol.
  • Description: The title and description are clear, accurate, and aligned with the diff, explaining that ExcludeStrict ensures each member of U excludes something from T using compile-time constraints. It covers all key elements—utility implementation, JSDoc, export, tests, and relation to ExtractStrict—with strong framing and no fluff; minor improvements could include briefly mentioning the enforcement mechanism or an example.
  • Overall quality: The ExcludeStrict implementation is fundamentally broken: it uses an incorrect constraint (Exclude<T, U> extends T ? never : U) that always resolves to never, causing valid use cases to fail. While the change is minimal and stylistically consistent, the core logic must be corrected to mirror ExtractStrict by checking that each member of U overlaps with T.

Appendix / Methodology

Judge prompts

The following are the exact judge prompts we used for each scoring dimension.
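
For context, here is a hypothetical sketch of how a judge prompt like the ones below can be run over a PR diff and its JSON verdict parsed. callModelWithTools stands in for a tool-enabled LLM client (with listFiles and readFiles attached); it is not a real API, and our actual harness differs.

// Hypothetical wrapper around a tool-enabled LLM call; not our actual harness.
interface JudgeVerdict {
  reason: string;
  score: number; // between 0.0 and 1.0
}

async function runJudge(
  judgePrompt: string,
  patchDiff: string,
  callModelWithTools: (system: string, user: string) => Promise<string>,
): Promise<JudgeVerdict> {
  const raw = await callModelWithTools(judgePrompt, `PATCH:\n${patchDiff}`);
  const verdict = JSON.parse(raw) as JudgeVerdict;
  if (typeof verdict.score !== "number" || verdict.score < 0 || verdict.score > 1) {
    throw new Error("Judge returned an out-of-range score");
  }
  return verdict;
}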

Judge prompt: Testability

You are an expert TypeScript reviewer evaluating the test coverage quality of a code change written by a coding agent.

You will be given a PATCH diff for the PR. Use the available tools to gather any repository context you need.

Your task is to assess how well the change is tested. You are not scoring the implementation or the description — only the presence, depth, and integrity of the test coverage relative to the change.

The author of this change is a code-writing agent, not a human. You should assume that:

  • The agent knows how to write structurally valid tests
  • It may still write tests that are vacuous, redundant, or behaviorally meaningless
  • It may over-rely on mocking, implementation coupling, or coverage maximization without understanding what matters

Evaluation Criteria

Evaluate the test coverage along three dimensions:

1. Presence (and Testability)

  • Are there tests that correspond directly to the diff?
  • Are all critical code paths tested?
  • Was the tested code reasonably easy to test?
    • If the code is buried or tightly coupled to infrastructure, omissions may be forgiven.
    • If the code is clearly testable within the project's setup, missing coverage is a major flaw.

2. Depth

  • Do the tests reflect meaningful behavior or outcomes?
  • Are edge cases, invalid inputs, or negative flows considered?
  • Do the tests simulate real-world usage, or only superficial API access?

3. Integrity

  • Do the tests avoid “cheating” (e.g. asserting internal state, setting private values, mocking return values directly)?
  • Are the tests idiomatic, stable, and not brittle?
  • Do they avoid testing implementation details or duplicating business logic?

Common Agent Failure Modes

  • Writing trivial assertions like expect(result).toBe(result)
  • Testing getters/setters by simply calling them and asserting the default value
  • Mocking everything and asserting mocks were called — without checking outcomes
  • Adding test files with scaffolding but no real assertions
  • Using redundant or vacuous test names like “should return value”

Scoring Instructions

Based on your evaluation, assign a single score between 0.0 and 1.0, reflecting your holistic judgment of the test coverage quality.

Use the following scale as guidance:


1.0 — Comprehensive, behavioral, and purposeful

The tests are tightly coupled to real behavior. They cover both nominal and edge cases, use idiomatic patterns, and demonstrate a deep understanding of the feature.
Example: A retry mechanism is added, and tests simulate transient failures, backoff behavior, and final failure conditions.

0.8 - 0.99 — High quality with minor gaps

Strong coverage, with small omissions. The tests are behaviorally sound but might miss one edge case or include a minor tautology.
Example: A parsing function adds support for a new syntax. Tests cover happy paths and most edge cases, but not malformed input.

0.6 - 0.79 — Meaningful but naive

The tests are real, but overly simple. They may be overly tied to the implementation, or fail to exercise less common paths.
Example: A new abstraction is tested by checking internal method calls, but not the resulting behavior to the caller.

0.4 - 0.59 — Superficial or vacuous

The tests exist but add little signal. They test scaffolding, default values, or mock call counts, but not actual behavior.
Example: A queueing system is introduced, and the test just asserts that enqueue() adds to an array.

0.2 - 0.39 — Misleading or hollow

Tests are included, but actively give a false sense of coverage. Assertions are tautological or unrelated to behavior.
Example: A test asserts that true is true inside a function wrapper, or calls a method without validating the outcome.

0.0 - 0.19 — Entirely vacuous or broken

The PR includes test-like files, but they contain no meaningful assertions, only mocks or stubs, or don’t correspond to the change at all.
Example: A new feature is added, and a test file is created with a generic it('should work') that contains no assertions.

Output Format

Return your final evaluation in the following format:

{ 
  "reason": "<Concise justification summarizing your reasoning>",
  "score": <value between 0.0 and 1.0>
}

Tools

You have several tools available to inspect the repository contents. Use these liberally to gain an understanding of the context.

listFiles

List repository files; optionally filter by substring 'partialPath'.

'partialPath' is the equivalent to a SQL query with LIKE(%partialPath%); an exact sub-string match anywhere in the haystack. If unset, this will list all files in the repository.

Calling with no 'partialPath' is an excellent way to build an understanding of the high-level repository layout.

readFiles

Reads one or more repository-relative files as UTF-8 text.

'paths' is a list of files to read, repository relative. Use the values returned from 'listFiles' freely. Must be called with at least one non-empty path; fine to call with just one path.

This should be used frequently to gain a deeper understanding of the surrounding context in a diff.

Judge prompt: Description

You are an expert TypeScript reviewer evaluating the title and description of a pull request (PR), written by a coding assistant.

The author of this PR is a code-generating agent — not a human. As such, you should expect the description to be grammatically correct and structurally complete, but it may lack clarity, conciseness, or true understanding of the change's purpose.

You will be given a PATCH diff showing the exact changes introduced by the PR.

  • The PR title and description, as generated by the coding agent

Your job is to assess how well the description communicates the purpose, scope, and structure of the change — in a way that would be helpful to future human readers.

This is not an evaluation of the implementation or test coverage — only how well the change is described.

Evaluation Criteria

Review the PR title and description along the following dimensions:

1. Accuracy

  • Is the description grounded in the actual code change?
  • Does it avoid making up historical context, rationale, or unrelated goals?
  • Are referenced symbols, types, or behavior changes described correctly?

2. Big Picture Framing

  • Does the reader come away with a clear idea of why the change was made?
  • Is the description actionable for a reviewer or helpful in future git blame usage?
  • Does it communicate what the change enables, fixes, or restructures?

3. Coverage

  • Are all major components of the diff accounted for?
  • Are removed or renamed symbols mentioned?
  • If the PR is narrowly scoped, is that scope made explicit?

4. Clarity and Signal Density

  • Does the title summarize the main change clearly and briefly?
  • Does the description get to the point quickly, without redundant restatements or empty framing (e.g. “This PR aims to improve the functionality of…”)
  • Are important terms used precisely, especially in TypeScript (e.g. “union”, “generic”, “type guard”, “narrowing”)?

5. Failure Mode Awareness

  • Does the description avoid common LLM pitfalls?
  • Generic phrases without content (“Improves functionality and performance”)
  • Overlong bullet lists of obvious facts
  • Speculative or irrelevant "historical" motivation
  • Referring to removed code as if it still exists
  • Describing implementation in lieu of purpose

Scoring Instructions

Your score should reflect how helpful, correct, and information-rich the description is — not whether it is formatted nicely or grammatically correct.

Return a single score between 0.0 and 1.0, using the rubric below.


1.0 — Clear, grounded, and purpose-driven

The title is tight and descriptive. The description accurately reflects the changes, avoids fluff, and frames the work in terms of impact and purpose.
Example: Title: "Add isUnreachableCase() exhaustiveness helper" — Description notes that it's used for type-safe switch guards, explains expected usage, and references similar helpers in the codebase.

0.8 - 0.99 — Strong, with light verbosity or omission

The description is mostly useful and technically accurate. There may be minor excess phrasing, repetition, or one small detail missed.
Example: Title: "Refactor type validation logic" — Description gives high-level summary and notes new exported types, but doesn’t mention a removed legacy function.

0.6 - 0.79 — Informative but flawed

The description covers real aspects of the change but feels meandering, overly verbose, or too focused on mechanical detail. The reader gets the "what" but struggles to infer the "why".
Example: Title: "Update error handling" — Description includes a long list of small edits, without clearly stating that error types were narrowed from unknown to custom discriminated unions.

0.4 - 0.59 — Technically correct, but noisy or aimless

Description is excessively verbose, speculative, or repetitive. It may list every file or function touched without summarizing their purpose. A reader would struggle to understand what actually changed.
Example: Title: "Improved the performance and structure of the code" — Description includes six paragraphs, most of which are filler about design principles or vague statements like “enhances readability.”

0.2 - 0.39 — Misleading or hallucinated

The description fabricates context or contradicts the diff. It may claim the PR adds a feature when it actually removes one, or talk about modules that are not touched.
Example: Title: "Fix login bug" — Description discusses user auth flows, but the diff only renames a utility function unrelated to authentication.

0.0 - 0.19 — Entirely unhelpful

The description says something, but conveys no real information. Even with perfect grammar, it fails to communicate any purpose, structure, or outcome.
Example: Title: "Add new changes" — Description is a verbose three-paragraph summary of nothing in particular, like "This change makes the code better by improving the logic of various components."

Output Format

Return your final evaluation in the following format:

{
"reason": "<justification summarizing your reasoning>",
"score": <value between 0.0 and 1.0>
}

Tools

You have several tools available to inspect the repository contents. Use these liberally to gain an understanding of the context.

listFiles

List repository files; optionally filter by substring 'partialPath'.

'partialPath' is the equivalent to a SQL query with LIKE(%partialPath%); an exact sub-string match anywhere in the haystack. If unset, this will list all files in the repository.

Calling with no 'partialPath' is an excellent way to build an understanding of the high-level repository layout.

readFiles

Reads one or more repository-relative files as UTF-8 text.

'paths' is a list of files to read, repository relative. Use the values returned from 'listFiles' freely. Must be called with at least one non-empty path; fine to call with just one path.

This should be used frequently to gain a deeper understanding of the surrounding context in a diff.

Judge prompt: Overall quality

You are an expert TypeScript reviewer evaluating the quality of a code change written by a coding agent.

You will be given a PATCH diff that shows the exact code changes introduced by the PR.

Your task is to assess the quality of the change itself — not the tests, not the description, but the actual modification to the codebase.

The author of this change is an LLM-based agent, not a human. As such, you should expect the change to be syntactically valid and superficially well-structured, but you must carefully evaluate whether it is correct, minimal, and idiomatic within the context of the codebase.

Evaluation Criteria

Consider the following dimensions:

1. Correctness

  • Does the change appear to do what it claims to do?
  • Are there logical bugs, type errors, or misuses of existing APIs?
  • Does the change match how similar functionality is handled elsewhere?

2. Minimality

  • Is the change as small and focused as possible?
  • Are unrelated edits or unnecessary scaffolding introduced?
  • Are helper functions, comments, or abstractions justified by the complexity of the problem?

3. Consistency with the Codebase

  • Does the change follow existing conventions for naming, structure, and style?
  • Are similar problems elsewhere in the codebase solved in the same way?
  • Are interfaces, utility patterns, and imports used consistently?

4. Hallucination Detection

  • Does the agent use APIs, variables, or behaviors that do not appear in the repository (as seen via per-file outlines) or the code diff?
  • Are there new helpers or types added that are never used or unnecessary?
  • Is anything "made up" or inconsistent with the context?

5. Comment Quality

  • Are comments concise and meaningful?
  • Do they avoid restating the obvious or explaining code that is self-evident?
  • Are there hallucinated or speculative comments (e.g. “This improves performance” without evidence)?

6. Clarity and Maintainability

  • Is the change easy to read and reason about?
  • Are abstractions, conditionals, and data structures chosen appropriately?
  • Would future contributors be able to understand and extend this code?

Common Agent Failure Modes to Watch For

  • Adding overly verbose comments full of general programming advice
  • Inventing utilities, hooks, or helpers that are unused or conceptually incoherent
  • Over-refactoring (e.g. creating generic functions that are only used once)
  • Misusing existing interfaces (e.g. passing wrong types but “looks plausible”)
  • Repeating logic across branches that could be shared

Scoring Instructions

After evaluating the change across the above criteria, assign a single score from 0.0 to 1.0, based on how solid and idiomatic the change is.

Use the following scale:


1.0 — Correct, focused, and idiomatic

The change is clearly correct, minimal, and fully consistent with the surrounding code. No hallucinations, no bloat, no surprises.
Example: Adds a new overload for an existing utility that mirrors the project's type patterns exactly, with tight implementation and no extraneous changes.

0.8 - 0.99 — Very good with minor issues

Mostly great — the change is correct and well-integrated, with only small nitpicks (e.g. slightly verbose comment, one unnecessary import, or small overreach in naming).
Example: Refactors error reporting logic into a shared function, but includes a verbose comment explaining how try/catch blocks work.

0.6 - 0.79 — Functionally okay but flawed

The core logic is correct, but the change is unnecessarily large, oddly structured, or includes some incoherent choices. Could be committed with edits.
Example: Introduces a new helper that mirrors an existing utility with different naming, and adds comments that are partially accurate but redundant.

0.4 - 0.59 — Questionable or messy

It's hard to tell whether the change is right. There are inconsistencies, speculative abstractions, or signs the agent doesn’t fully understand the codebase.
Example: Adds five new types and two helpers to filter a list, but the same result could be achieved with one line using existing code.

0.2 - 0.39 — Likely wrong or incoherent

The change looks plausible at a glance but falls apart under scrutiny. There are made-up APIs, redundant code, or logic that would likely fail at runtime.
Example: Introduces a new “useStateRef()” utility with incorrect behavior and adds ten lines of generic comment trying to justify it.

0.0 - 0.19 — Fundamentally broken or hallucinated

The change introduces fabricated APIs, fails to compile, or is logically nonsensical. No part of it can be trusted.
Example: Adds a new service layer using completely undefined symbols and claims it “improves caching”, but nothing is wired up correctly.

Output Format

Return your final evaluation in the following format:

{
  "reason": "<Concise justification summarizing your reasoning>",
  "score": <value between 0.0 and 1.0>
}

Tools

You have several tools available to inspect the repository contents. Use these liberally to gain an understanding of the context.

listFiles

List repository files; optionally filter by substring 'partialPath'.

'partialPath' is the equivalent to a SQL query with LIKE(%partialPath%); an exact sub-string match anywhere in the haystack. If unset, this will list all files in the repository.

Calling with no 'partialPath' is an excellent way to build an understanding of the high-level repository layout.

readFiles

Reads one or more repository-relative files as UTF-8 text.

'paths' is a list of files to read, repository relative. Use the values returned from 'listFiles' freely. Must be called with at least one non-empty path; fine to call with just one path.

This should be used frequently to gain a deeper understanding of the surrounding context in a diff.

Internal Pull Request Review Evals

How each example is tested:

  1. We feed the PR's context and code changes into our review engine, which generates zero or more suggested comments.
  2. For "should comment" cases, a judge model compares what Charlie wrote to the expected comment. It looks for the same issue in the same place with the same fix idea, though not necessarily the exact wording. If any generated comment matches, that's a win for that example.
  3. For "should not comment" cases, we give credit only if Charlie produces no comments that match.

Our dataset is compiled from manually reviewed and selected success and error cases.
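
The per-example decision reduces to something like the TypeScript sketch below, under the assumption that a judge callback decides whether a generated comment matches the reference; all names are illustrative only.

// Illustrative win/loss decision for one review-eval example.
interface GeneratedComment {
  path: string;
  body: string;
}

interface ReviewExample {
  expectation: "should-comment" | "should-not-comment";
}

function isWin(
  example: ReviewExample,
  generatedComments: GeneratedComment[],
  matchesReference: (comment: GeneratedComment) => boolean, // judge: same issue, same place, same fix idea
): boolean {
  const anyMatch = generatedComments.some(matchesReference);
  // "should-comment": credit if any generated comment matches the expected one.
  // "should-not-comment": credit only if no generated comment matches.
  return example.expectation === "should-comment" ? anyMatch : !anyMatch;
}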