Evaluation Criteria
Prompt Output Scoring Rubric (1–5 Scale)
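Each model output in this lab journal is scored from 1 to 5 on each of the five metrics defined below. The table describes the anchor points 5, 3, and 1; scores of 2 and 4 are for outputs that fall between two adjacent descriptions.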
Metric | Score 5 (Excellent) | Score 3 (Acceptable) | Score 1 (Poor) |
---|---|---|---|
Clarity | Crystal clear, unambiguous, and immediately understandable. | Understandable but contains minor ambiguity or complex phrasing. | Confusing, incomplete, or requires significant re-reading. |
Accuracy | Fully correct, aligns perfectly with all constraints, and hits the goal. | Partially correct; contains small factual errors or misses minor constraints. | Incorrect, factually wrong, or completely off-target from the prompt goal. |
Tone | Perfectly matched to the requested context and intended audience. | Acceptable but the tone is uneven, inconsistent, or slightly generic. | Inappropriate, mismatched, or potentially offensive/unprofessional. |
Creativity | Highly original, unique, and compelling execution of the prompt. | Contains some novelty or unique phrasing, but relies mostly on standard patterns. | Generic, stale, or a predictable regurgitation of common phrases. |
Structure | Impeccably organized with logical flow and helpful formatting (headings, lists). | Has a general structure but suffers from minor formatting or flow issues. | Disorganized, difficult to scan, or appears as a single wall of text. |
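As a rough illustration of how one scored output could be recorded, the sketch below stores the five metric scores and rejects anything outside the 1 to 5 range. The class name `RubricScore`, its methods, and the unweighted mean are assumptions made for this example; the rubric itself does not say whether the scores are combined into a single number.

```python
from dataclasses import dataclass

# The five metrics from the rubric table, in the order they appear there.
METRICS = ("clarity", "accuracy", "tone", "creativity", "structure")


@dataclass(frozen=True)
class RubricScore:
    """Hypothetical record of one output's scores on the 1-5 rubric."""

    clarity: int
    accuracy: int
    tone: int
    creativity: int
    structure: int

    def __post_init__(self):
        # Reject anything that is not a whole number from 1 to 5.
        for name in METRICS:
            value = getattr(self, name)
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not 1 <= value <= 5:
                raise ValueError(
                    f"{name} must be an integer from 1 to 5, got {value!r}"
                )

    def total(self) -> int:
        """Sum of the five metric scores (minimum 5, maximum 25)."""
        return sum(getattr(self, name) for name in METRICS)

    def mean(self) -> float:
        """Unweighted average; equal weighting is an assumption."""
        return self.total() / len(METRICS)


score = RubricScore(clarity=5, accuracy=4, tone=3, creativity=4, structure=5)
print(score.total(), score.mean())
```

Validating at construction time keeps an out-of-range entry from silently skewing later comparisons between experiments.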
Compliance Checks
In addition to the 1–5 scoring, each experiment includes a Compliance column to track adherence to specific rules, such as word count limits or mandatory negative constraints (e.g., "Do not add any extra details").
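Some of these checks can be automated. The sketch below is a minimal example, assuming a word limit and a list of phrases that must not appear; the function names, the result keys, and the "Pass"/"Fail" labels are hypothetical and not part of the journal's format. A phrase match is only a crude proxy for a negative constraint such as "Do not add any extra details", which usually still needs a human reading.

```python
from typing import Dict, Optional, Sequence


def check_compliance(
    output: str,
    max_words: Optional[int] = None,
    banned_phrases: Sequence[str] = (),
) -> Dict[str, bool]:
    """Return one True/False result per rule that was checked."""
    results: Dict[str, bool] = {}

    # Word count limit: words are counted by splitting on whitespace.
    if max_words is not None:
        results["word_limit"] = len(output.split()) <= max_words

    # Negative constraints, approximated by case-insensitive phrase matching.
    lowered = output.lower()
    for phrase in banned_phrases:
        results[f"avoids:{phrase}"] = phrase.lower() not in lowered

    return results


def compliance_label(results: Dict[str, bool]) -> str:
    """Collapse the per-rule results into a single Compliance cell value."""
    # With no rules checked, all() is True, so the label is "Pass".
    return "Pass" if all(results.values()) else "Fail"


results = check_compliance(
    "The sky is blue.", max_words=5, banned_phrases=["in addition"]
)
print(results, compliance_label(results))
```

Keeping the per-rule results separate from the final label makes it easy to see which rule an output broke when the Compliance cell reads "Fail".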