Graders

11mo ago

Source

OpenAI · Graders · openai.com
Snippet from the RSS feed
Explains grader types and how to score model outputs. — evals

You might also wanna read

A Teacher Built an AI Grading Assistant—Then Removed Its Most Automated Feature to Keep Humans in Charge

A teacher returns from two days of chaperoning field trips to find 450 ungraded assignments. Rather than simply marking them credit/no credi

edsurge.com·18d ago

Three AI tools all made the same mistake grading student exit tickets — and what it reveals about teaching

A teacher gave three AI tools the task of analyzing 16 student exit tickets from an 8th grade math class on solving systems of linear equati

pattypapers.wordpress.com·27d ago

Grade (YC W2026) Builds API for Performance-Based Payroll Targeting AI Agents and Remote Contractors

Grade (YC W2026) is building an API infrastructure for performance-based payroll, enabling companies to pay AI agents, remote contractors, a

startuphub.ai·11d ago

BINEVAL: A Binary Question Framework for Interpretable LLM Evaluation and Self-Improvement

This paper introduces BINEVAL, a framework for evaluating LLM outputs that decomposes evaluation criteria into atomic binary questions. Inst

arxiv.org·7d ago

Systematic evaluation of 21 LLM-as-a-Judge models reveals reliability flaws and position bias across 541,000 judgments

This paper presents the largest systematic evaluation of LLM-as-a-Judge models to date, analyzing 21 judges from nine providers across three

arxiv.org·12d ago

Dr Jake Clark on STEM education and all the things!

adsei.org·1mo ago
