
Evaluation Criteria

Prompt Output Scoring Rubric (1–5 Scale)

Each model output in this lab journal is scored from 1 to 5 on five metrics. The table below describes what a score of 5, 3, or 1 looks like for each metric.

| Metric | Score 5 (Excellent) | Score 3 (Acceptable) | Score 1 (Poor) |
| --- | --- | --- | --- |
| Clarity | Crystal clear, unambiguous, and immediately understandable. | Understandable but contains minor ambiguity or complex phrasing. | Confusing, incomplete, or requires significant re-reading. |
| Accuracy | Fully correct, aligns perfectly with all constraints, and hits the goal. | Partially correct; contains small factual errors or misses minor constraints. | Incorrect, factually wrong, or completely off-target from the prompt goal. |
| Tone | Perfectly matched to the requested context and intended audience. | Acceptable but the tone is uneven, inconsistent, or slightly generic. | Inappropriate, mismatched, or potentially offensive/unprofessional. |
| Creativity | Highly original, unique, and compelling execution of the prompt. | Contains some novelty or unique phrasing, but relies mostly on standard patterns. | Generic, stale, or a predictable regurgitation of common phrases. |
| Structure | Impeccably organized with logical flow and helpful formatting (headings, lists). | Has a general structure but suffers from minor formatting or flow issues. | Disorganized, difficult to scan, or appears as a single wall of text. |
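
If scores are logged programmatically, one way to keep entries consistent with the rubric is to validate them on creation. The following is a minimal sketch, not part of the journal's own tooling: the `RubricScore` class, its field names, and the `mean` helper are hypothetical, and the rubric does not say that scores should be averaged.

```python
from dataclasses import dataclass

# The five rubric metrics, in the order they appear in the table.
METRICS = ("clarity", "accuracy", "tone", "creativity", "structure")


@dataclass(frozen=True)
class RubricScore:
    """One scored model output (hypothetical record layout)."""

    clarity: int
    accuracy: int
    tone: int
    creativity: int
    structure: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1-5 integer scale used by the rubric.
        for name in METRICS:
            value = getattr(self, name)
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not 1 <= value <= 5:
                raise ValueError(
                    f"{name} must be an integer from 1 to 5, got {value!r}"
                )

    def mean(self) -> float:
        # Unweighted average across metrics; an assumption made for
        # illustration, since the rubric does not define an aggregate.
        return sum(getattr(self, name) for name in METRICS) / len(METRICS)
```

For example, `RubricScore(clarity=5, accuracy=4, tone=3, creativity=2, structure=4).mean()` evaluates to `3.6`, while a score of `0` or `6` on any metric raises `ValueError`.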

Compliance Checks

In addition to the 1–5 scoring, each experiment includes a Compliance column to track adherence to specific rules, such as word count limits or mandatory negative constraints (e.g., "Do not add any extra details").
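
Some compliance rules can be checked mechanically. The sketch below covers two such cases, a word count limit and a list of phrases that must not appear. The function names and the Pass/Fail labels are assumptions made for illustration. A constraint such as "Do not add any extra details" generally still needs a human judgment and is not covered here.

```python
from collections.abc import Iterable


def within_word_limit(output: str, max_words: int) -> bool:
    """Return True if the output has at most max_words whitespace-separated words."""
    return len(output.split()) <= max_words


def avoids_phrases(output: str, forbidden: Iterable[str]) -> bool:
    """Return True if none of the forbidden phrases occur (case-insensitive)."""
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in forbidden)


def compliance_label(output: str, max_words: int, forbidden: Iterable[str]) -> str:
    """Combine both checks into a single value for the Compliance column."""
    ok = within_word_limit(output, max_words) and avoids_phrases(output, forbidden)
    return "Pass" if ok else "Fail"
```

A call such as `compliance_label("Short and direct answer.", max_words=10, forbidden=["in conclusion"])` returns `"Pass"`; lowering `max_words` to 3 returns `"Fail"` because the output has four words.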