Definition
A confidence score is a numerical value indicating how likely a piece of text is to have been AI-generated. Typically expressed as a percentage, it represents the detector's degree of certainty based on the statistical signals it has measured.
When an AI detector analyzes a piece of text, it does not return a simple yes or no. It returns a confidence score – a probability estimate reflecting how closely the text matches statistical patterns associated with AI-generated content. A score of 95% does not mean the text is definitely AI-generated. It means the detector assigns a 95% probability to that outcome based on the signals it has measured.
Understanding confidence scores as probability estimates rather than definitive verdicts is essential for anyone using AI detection in an academic context. A high score should prompt further investigation – not immediate action.
How It Works
The confidence score is calculated by aggregating multiple detection signals – most commonly perplexity, burstiness, and token probability – and mapping the combined result onto a probability scale. Some detectors use a single underlying model; others combine predictions from multiple models and average the results.
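As a rough illustration of the aggregation described above, the sketch below combines three signals with a weighted sum, maps the result onto a 0 to 1 scale with a logistic function, and then averages the scores of several models. The weights, the bias, the signal scaling, and the function names are all assumptions made for this example. It is not the method of any particular detector.

```python
import math

# Hypothetical weights: negative for perplexity and burstiness because lower
# values of both are usually read as more AI-like; positive for the mean
# token probability. None of these numbers come from a real detector.
WEIGHTS = {"perplexity": -0.05, "burstiness": -1.0, "token_probability": 4.0}
BIAS = 0.0  # hypothetical offset


def confidence_score(signals):
    """Combine signals into a weighted sum and squash it into the range 0-1."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in signals.items())
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_score(model_scores):
    """Average the scores returned by several models."""
    return sum(model_scores) / len(model_scores)


low_perplexity = confidence_score(
    {"perplexity": 12.0, "burstiness": 0.2, "token_probability": 0.7}
)
high_perplexity = confidence_score(
    {"perplexity": 60.0, "burstiness": 0.9, "token_probability": 0.3}
)
print(f"{low_perplexity:.0%} vs {high_perplexity:.0%}")
print(f"ensemble: {ensemble_score([0.9, 0.6]):.0%}")
```

With these illustrative weights, the predictable, uniform sample scores higher than the varied one, and the ensemble score is simply the mean of the individual model scores.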
Scores are sensitive to text length: samples under 100 words produce less reliable results because there is insufficient text to establish a stable statistical profile. Longer documents with consistent patterns produce more reliable and actionable scores.
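One way to keep the length caveat visible is to attach a word count and a short-sample flag to every score. The sketch below assumes whitespace-separated word counting and uses the 100-word figure from the paragraph above; the function and field names are hypothetical.

```python
MIN_WORDS = 100  # the 100-word figure mentioned above


def annotate_score(text, score):
    """Return the score together with a flag that marks short samples."""
    word_count = len(text.split())  # simple whitespace tokenization (assumption)
    return {
        "score": score,
        "word_count": word_count,
        "short_sample": word_count < MIN_WORDS,
    }


result = annotate_score("A short paragraph of student writing.", 0.91)
print(result)
```

A reviewer reading the result can then see at a glance that a high score came from a sample too short to be relied on.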
Why It Matters for AI Detection
Confidence scores matter because they are what educators actually see and act on. Treating 80% as “definitely AI” rather than “probably AI with meaningful uncertainty” leads to over-confident decisions that can unfairly harm students.
Proofademic displays confidence scores alongside sentence-level breakdowns. This gives educators the context to understand not just the overall score but also which specific passages contributed most to it, enabling more informed and defensible assessment decisions.
FAQs
What confidence score threshold should educators use before taking action?
There is no universal standard. Most practitioners treat scores above 80-85% as warranting further investigation, while scores above 95% on longer documents are considered stronger indicators. Institutional policies should define specific thresholds and require corroborating evidence before any disciplinary action is taken, rather than allowing action based on detection results alone.
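A policy like this can be written down as a small triage rule. The sketch below is only an example: it uses 80% and 95% as default thresholds, taken from the ranges mentioned above, and assumes that a "longer document" means at least 100 words. The function name, the parameter names, and the returned labels are hypothetical, and an institution would substitute its own values.

```python
def triage(score_percent, word_count,
           investigate_at=80.0, strong_at=95.0, long_doc_words=100):
    """Map a score to a next step; no outcome authorizes action on its own."""
    if score_percent > strong_at and word_count >= long_doc_words:
        return "stronger indicator: seek corroborating evidence"
    if score_percent > investigate_at:
        return "investigate further"
    return "no flag"


print(triage(97.0, 1200))
print(triage(97.0, 60))
print(triage(50.0, 1200))
```

Note that even the highest tier only asks for corroborating evidence; the rule never returns a disciplinary outcome.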
Can a document score differently on different detectors?
Yes, often significantly. Different detectors use different reference models, signal combinations, and threshold calibrations. A document scoring 90% on one tool might score 60% on another. This is why using multiple tools and treating scores as probabilistic indicators is recommended practice.
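When several tools are used, the gap between their scores is itself useful information. The sketch below reports the mean and the spread for a set of scores and flags large disagreement. The detector names and the 20-point tolerance are assumptions chosen for illustration.

```python
def compare_detectors(scores_percent, max_spread=20.0):
    """Summarize scores from several detectors and flag large disagreement."""
    values = list(scores_percent.values())
    spread = max(values) - min(values)
    return {
        "mean": sum(values) / len(values),
        "spread": spread,
        "detectors_disagree": spread > max_spread,  # hypothetical tolerance
    }


# The 90% vs 60% case from the answer above, with made-up tool names.
summary = compare_detectors({"tool_a": 90.0, "tool_b": 60.0})
print(summary)
```

For the 90% and 60% example, the mean is 75 and the spread is 30 points, so the sketch flags the tools as disagreeing.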