Research

Analyzed 780+ graded mathematical proofs to compare the consistency and accuracy of LLM and human graders. Identified errors in feedback and applied statistical tests to evaluate grading reliability.

Sep 10, 2025