MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
[Submitted on 15 Sep 2025 (v1), last revised 29 May 2026 (this version, v3)]
