Evaluation Criteria

Prompt Output Scoring Rubric (1–5 Scale)

Each model output in this lab journal is scored from 1 to 5 on each of five metrics. The table below describes the anchor points for scores 5, 3, and 1; scores 2 and 4 are not separately defined.

Metric	Score 5 (Excellent)	Score 3 (Acceptable)	Score 1 (Poor)
Clarity	Crystal clear, unambiguous, and immediately understandable.	Understandable but contains minor ambiguity or complex phrasing.	Confusing, incomplete, or requires significant re-reading.
Accuracy	Fully correct, aligns perfectly with all constraints, and hits the goal.	Partially correct; contains small factual errors or misses minor constraints.	Incorrect, factually wrong, or completely off-target from the prompt goal.
Tone	Perfectly matched to the requested context and intended audience.	Acceptable but the tone is uneven, inconsistent, or slightly generic.	Inappropriate, mismatched, or potentially offensive/unprofessional.
Creativity	Highly original, unique, and compelling execution of the prompt.	Contains some novelty or unique phrasing, but relies mostly on standard patterns.	Generic, stale, or a predictable regurgitation of common phrases.
Structure	Impeccably organized with logical flow and helpful formatting (headings, lists).	Has a general structure but suffers from minor formatting or flow issues.	Disorganized, difficult to scan, or appears as a single wall of text.
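
As an illustration of how one scored output might be recorded, here is a minimal Python sketch. The class name `RubricScore` and the `mean` helper are assumptions made for this example; the rubric itself does not say that the five metrics should be averaged.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    """One output's scores on the five rubric metrics (hypothetical container)."""

    clarity: int
    accuracy: int
    tone: int
    creativity: int
    structure: int

    def __post_init__(self) -> None:
        # Every metric must be a whole number on the 1-5 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be an integer from 1 to 5, got {value!r}")

    def mean(self) -> float:
        # Assumption: an unweighted average is one possible summary,
        # not something the rubric prescribes.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


if __name__ == "__main__":
    score = RubricScore(clarity=5, accuracy=3, tone=4, creativity=2, structure=5)
    print(score, score.mean())
```

Rejecting out-of-range values at construction time keeps a mistyped score (for example, 6 or 0) from reaching later comparisons between experiments.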

Compliance Checks

In addition to the 1–5 scores, each experiment includes a Compliance column that tracks adherence to specific rules, such as word-count limits or mandatory negative constraints (e.g., "Do not add any extra details").
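
Some of these rules can be checked mechanically. The sketch below is one hedged way to do that in Python; the function name `check_compliance`, its parameters, and the forbidden-phrase check are hypothetical and not part of the journal. A constraint such as "Do not add any extra details" would most likely still need manual review.

```python
from typing import Dict, Optional, Sequence


def check_compliance(
    output: str,
    max_words: Optional[int] = None,
    forbidden_phrases: Sequence[str] = (),
) -> Dict[str, bool]:
    """Return a pass/fail flag per rule (hypothetical helper).

    Only rules that were actually requested appear in the result.
    """
    results: Dict[str, bool] = {}

    if max_words is not None:
        # Words are counted as whitespace-separated tokens (an assumption).
        results["word_limit"] = len(output.split()) <= max_words

    if forbidden_phrases:
        lowered = output.lower()
        # Case-insensitive substring match; passes when no phrase is present.
        results["forbidden_phrases"] = not any(
            phrase.lower() in lowered for phrase in forbidden_phrases
        )

    return results


if __name__ == "__main__":
    text = "The model returned a short summary."
    print(check_compliance(text, max_words=10, forbidden_phrases=["in conclusion"]))
```

The Compliance column for an experiment could then be filled with a pass only when every flag in the returned dictionary is true.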