Analyzed 780+ graded mathematical proofs to compare the consistency and accuracy of LLM and human graders. Identified errors in the feedback and applied statistical testing to evaluate grading reliability.
Sep 10, 2025