
Practical Evaluation of Large Language Models for Everyday Programming and Technical Tasks

By goranmoomin

9mo ago · 11 min read

Summary

The author evaluated large language models (LLMs) on their own everyday use cases rather than on academic benchmarks. They gathered 130 real prompts from their bash history, where they invoke the command-line tool llm, covering Rust, Python, Linux, and life questions. They had Qwen3 235B Thinking and Gemini 2.5 Pro group the prompts into categories, then had GPT-OSS-120B and GLM 4.5 select representative queries from each category. The evaluation measures how well the models handle practical programming and technical questions that reflect real-world usage.
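The prompt-gathering step could look roughly like the sketch below. It is not the author's script: it assumes each relevant history line has the form `llm [options] "prompt"` with the prompt as the final argument, and the function name `extract_llm_prompts` and the sample history lines are made up for illustration.

```python
import shlex


def extract_llm_prompts(history_lines):
    """Collect prompt strings from shell history lines that invoke `llm`.

    Assumption (not from the article): the prompt is the last argument,
    as in `llm [options] "prompt"`.
    """
    prompts = []
    seen = set()
    for line in history_lines:
        try:
            tokens = shlex.split(line)
        except ValueError:
            # shlex raises ValueError on unbalanced quotes; skip such lines.
            continue
        if len(tokens) < 2 or tokens[0] != "llm":
            continue
        candidate = tokens[-1]
        # Skip invocations that end in an option flag, and skip repeats.
        if candidate.startswith("-") or candidate in seen:
            continue
        seen.add(candidate)
        prompts.append(candidate)
    return prompts


# Hypothetical sample history, for illustration only.
sample_history = [
    "ls -la",
    'llm "how do I borrow a slice in Rust?"',
    'llm -m some-model "what does chmod 755 mean?"',
    'llm "how do I borrow a slice in Rust?"',
    'llm "unterminated',
    "git status",
]

prompts = extract_llm_prompts(sample_history)
for prompt in prompts:
    print(prompt)
```

In practice the input would be the lines of a history file such as `~/.bash_history`; the categorization and selection steps were done by the models themselves and are not sketched here.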

Key quotes
It's great that AI can win maths Olympiads, but that's not what I'm doing. I mostly ask basic Rust, Python, Linux and life questions.
I gathered 130 real prompts from my bash history (I use command line tool llm).
I had Qwen3 235B Thinking and Gemini 2.5 Pro group them into categories.
My life is not a math Olympiad