Comparing 11 LLMs on a LangGraph Code Reorganization Task: American vs. Chinese Models

Korridzy

2d ago· 47 min readenInsight

technology programming llm benchmarking ai model comparison

Summary

A detailed experimental comparison of 11 large language models (5 American: GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Gemini 2.5 Pro, Grok 3; and 6 Chinese: DeepSeek V3, DeepSeek R1, Qwen 2.5, Qwen 2.5 Max, Yi Lightning, GLM-4) on a code reorganization task. The author takes a complex "god node" from a real LangGraph agent and asks each model to propose how to untangle it, then has them evaluate each other's proposals. Three different methods are used to determine which model's output to trust. The experiment explores model performance, self-evaluation reliability, and cross-model evaluation dynamics.

Source

Hacker NewsComparing 11 LLMs on a LangGraph Code Reorganization Task: American vs. Chinese Modelswtf.korridzy.com

Key quotes

· 3 pulled

You know how it goes: you're building a practice AI agent with the fellas on a course by Data Sanity, and amid the colorful whirl of rapidly accreting features you suddenly notice that one of the project's internal agents has grown a god node.

I took a god node from a real LangGraph agent and asked 5 American and 6 Chinese models first to propose how to untangle it, then to evaluate each other's proposals.

After that, I tried three different ways to figure out which of them to trust on the matter.

Snippet from the RSS feed

Weekly Tracker Feed by Korridzy.

You might also wanna read

DeepSeek-V3.1: Open-Source Language Model with Hybrid Inference for Advanced Reasoning and Coding

DeepSeek-V3.1 is an open-source large language model that introduces hybrid inference with both 'Think' and 'Non-Think' modes, optimized for

Product Hunt·10mo ago

Benchmarking GLM-5.2 vs Opus 4.8: Long-context retrieval performance for coding agents

This article benchmarks GLM-5.2 (an open-source model from Z.ai) against Anthropic's Opus 4.8 for long-context retrieval in coding agent use

braintrust.dev·4d ago

Study reveals why in-context learning fails on complex specification-heavy tasks and how fine-tuning can help

This research paper investigates the limitations of in-context learning (ICL) for large language models (LLMs) when applied to specification

arxiv.org·14d ago

LK Losses: A New Training Objective to Optimize Acceptance Rate in Speculative Decoding for LLMs

This paper introduces LK losses, a novel training objective for speculative decoding in large language models (LLMs). Speculative decoding a

arxiv.org·1mo ago

CoT-PoT Ensembling: Efficient LLM Reasoning with Self-Consistency from Just Two Samples

This paper introduces a hybrid ensembling approach called CoT-PoT that combines Chain-of-Thought (CoT) and Program-of-Thought (PoT) reasonin

arxiv.org·25d ago

LLMs Can Describe Their Own Internal Decision-Making Processes, New Research Shows

This research paper demonstrates that large language models (LLMs) can accurately describe their own internal decision-making processes. The

arxiv.org·28d ago

Comments

No comments yet. Be the first.