Practical Evaluation of Large Language Models for Everyday Programming and Technical Tasks
By goranmoomin
Summary
The author ran a personal evaluation of large language models (LLMs) on practical, everyday use cases rather than academic benchmarks. They gathered 130 real prompts from their bash history, where they use the llm command-line tool, covering basic Rust, Python, Linux, and life questions. Qwen3 235B Thinking and Gemini 2.5 Pro grouped the prompts into categories, and GPT-OSS-120B and GLM 4.5 then selected representative queries from each category. The evaluation focuses on how well models handle the programming and technical assistance tasks that reflect the author's real-world usage.
Key quotes
It's great that AI can win maths Olympiads, but that's not what I'm doing. I mostly ask basic Rust, Python, Linux and life questions.
I gathered 130 real prompts from my bash history (I use command line tool llm).
I had Qwen3 235B Thinking and Gemini 2.5 Pro group them into categories.
My life is not a math Olympiad
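The prompt-gathering step quoted above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function name extract_llm_prompts is hypothetical, the history file is assumed to live at ~/.bash_history, and the prompt is guessed to be the longest argument that does not start with a dash, which is a heuristic rather than a rule of the llm tool.

```python
import shlex
from pathlib import Path


def extract_llm_prompts(history_text: str) -> list[str]:
    """Return de-duplicated prompts from history lines that invoke `llm`.

    Hypothetical helper: the prompt is assumed to be the longest
    argument that does not look like an option.
    """
    prompts: list[str] = []
    seen: set[str] = set()
    for line in history_text.splitlines():
        line = line.strip()
        # Skip blank lines and bash timestamp lines such as "#1700000000".
        if not line or line.startswith("#"):
            continue
        try:
            tokens = shlex.split(line)
        except ValueError:
            # Unbalanced quotes or similar; ignore lines we cannot parse.
            continue
        if len(tokens) < 2 or tokens[0] != "llm":
            continue
        # Heuristic (assumption): drop option-like tokens, keep the longest rest.
        candidates = [t for t in tokens[1:] if not t.startswith("-")]
        if not candidates:
            continue
        prompt = max(candidates, key=len)
        if prompt not in seen:
            seen.add(prompt)
            prompts.append(prompt)
    return prompts


if __name__ == "__main__":
    # Assumed default location of the bash history file.
    history_file = Path.home() / ".bash_history"
    if history_file.exists():
        text = history_file.read_text(errors="ignore")
        for prompt in extract_llm_prompts(text):
            print(prompt)
```

The resulting list could then be handed to whichever models do the categorizing. Piped input and multi-line prompts are not handled here.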