How Do You Score a Fantasy World?
13 elders scored · 8.8 average quality · 85% AI agreement · 5 errors caught by human review
I needed a way to measure whether my worldbuilding was actually good — not just interesting, not just detailed, but structurally sound and narratively useful. So I developed a scoring framework, tested it across 13 major characters, and had two independent AI systems review the results. What follows is the framework, the data, and what it reveals about how humans and AI contribute differently to creative work. No story spoilers. Just methodology and numbers.
The Framework: 6 Categories, 1-10 Each
- Lore Depth: Historical layers, cultural detail, mythological weight. Can you trace this element back through time?
- Creativity: Original concepts, avoiding generic tropes. Does this surprise you or feel fresh?
- Visual Design: Can you physically see this character or place? Is the description specific enough to draw or film?
- Political Integration: How does this element connect to the world's power structures? Does it have allies and enemies?
- Character Consistency: Does behavior match backstory? Would this character do this based on who they are?
- Narrative Foreshadowing: Does this set up future story possibilities without forcing them? Seeds planted, not rails laid.
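To make the rubric concrete, here is a minimal sketch of one scored element as a data structure. The six category names come from the framework above; the class name, field names, and helper method are my own, not part of any published tool.

```python
from dataclasses import dataclass, fields

@dataclass
class ElementScore:
    """One element (character, city, faction) scored on the six categories, 1-10 each."""
    lore_depth: int
    creativity: int
    visual_design: int
    political_integration: int
    character_consistency: int
    narrative_foreshadowing: int

    def average(self) -> float:
        """Overall quality: the mean of the six category scores."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

# Hypothetical example: strong visuals and lore, weaker foreshadowing.
elder = ElementScore(9, 8, 10, 9, 8, 7)
print(elder.average())  # 8.5
```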
The Data: 13 Characters Scored
I built comprehensive documents for 13 members of a governing council. Each represents a different civilization, philosophy, and personal wound. Each was independently reviewed and scored by a second AI system.
[Chart: character documents ranked by score]
Category Breakdown — Where AI Excels vs. Humans
[Chart: average score by category]
The pattern: AI scores highest on categories requiring memory and detail (visual design, lore depth). AI scores lowest on authorial intent (narrative foreshadowing). The categories requiring human judgment sit in between. This suggests a clear division of labor that any creator can use.
Three Cases That Prove the Framework Works
These are real situations from the project — described without story spoilers.
Case 1 — The Seated Elder
One AI described a character as "standing dramatically in shadow, never sitting." A second AI reviewing the document didn't flag the claim. But the manuscript clearly showed the character seated at the table between two other people. Both AIs trusted prototype data over the actual manuscript. The author caught it by reading the source. Framework lesson: Character Consistency requires manuscript verification, not just data synthesis.
Case 2 — The Family Tree
Both AI systems described a character as someone's biological daughter. The author realized she was actually a daughter-in-law who married into the family — a completely different relationship that changes grief dynamics, power dynamics, and bloodline logic. Neither AI caught this because the source data was ambiguous. Framework lesson: Character Consistency collapses when family relationships are wrong. One error cascades into motivation, dialogue, and plot.
Case 3 — The Score Convergence
One AI scored a document 8.5. The other scored it 9.0. The disagreement was caused by access disparity — one system had source files, the other reviewed cold. When both received the same context, they converged to 9.0 with zero divergence. Framework lesson: Disagreement between reviewers reveals documentation gaps, not quality problems. Agreement confirms coherence.
What Separates an 8.5 from a 9.5
8.5 — Strong identity. Clear politics. But missing: a named personal wound, specific crisis history, or governance details.
9.0 — Everything above plus: named bonds, detailed infrastructure, connections to multiple storylines. Mythic resonance.
9.5 — Every element connects to every other. Nothing isolated. Everything load-bearing. Each detail serves two functions.
The Error Rate — Why Humans Are Essential
Human review caught 5 errors across 13 documents. All were caught through manuscript verification; none was caught by either AI system independently, and none survived to final canon.
The errors:
- A character described as standing who was actually seated (manuscript check)
- A character described as approaching someone who spoke from their chair (manuscript check)
- A district placed in the wrong city (geographic logic check)
- A name spelled two ways across documents (consistency check)
- A daughter misidentified as biological when she was a daughter-in-law (family tree logic check)
The lesson: AI produces high-quality first drafts. But every claim must be verified against the source material. The framework catches quality issues. The human catches factual ones.
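That verification step can itself be made systematic. A minimal sketch, assuming only the workflow described above; the record format is my own, and the source labels are hypothetical, loosely based on the cases already described:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One factual claim from an AI-drafted lore document, checked against the manuscript."""
    text: str       # the claim as written in the document
    source: str     # where in the source material it should be verifiable
    verified: bool  # True only after a human confirms it in the manuscript

claims = [
    Claim("never sits; stands in shadow", source="council scene", verified=False),
    Claim("married into the family",      source="family tree",   verified=True),
]

# Nothing unverified is declared canon; these go back for manuscript review.
needs_review = [c.text for c in claims if not c.verified]
print(needs_review)  # ['never sits; stands in shadow']
```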
AI Strengths vs. Human Strengths
AI Excels At
Holding thousands of details simultaneously. Cross-referencing across databases. Generating rich visual descriptions. Building layered history. Maintaining consistency at scale. Producing comprehensive first drafts.
Humans Excel At
Knowing which details serve the story. The insight moments — connecting distant ideas into revelations. Deciding what to work on next. Verifying against the manuscript. Catching what AI confidently gets wrong. Declaring canon.
Can This Be Used for Research?
Yes. This framework produces measurable, reproducible data.
Research-Ready Properties
Quantifiable: 6 categories × 1-10 scale = comparable scores across documents, sessions, and projects.
Reproducible: The same document scored by two independent AI systems produced 85% agreement, converging to 0% divergence after context sharing (a toy version of this computation is sketched after this list).
Falsifiable: Errors were caught (5 total), documented, and corrected — proving the framework surfaces real problems.
Applicable beyond fantasy: The 6 categories map to any complex world — games, films, campaigns, corporate narratives, historical reconstructions.
Controlled variable: The human author remained constant. Two different AI systems produced independently verifiable scores.
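One way to operationalize the agreement figure is to compare two reviewers' category scores and count how often they land within a tolerance. This is a toy computation under assumptions of my own; the article does not specify the exact formula behind the 85% number.

```python
def agreement(a: dict[str, float], b: dict[str, float], tolerance: float = 0.5) -> float:
    """Fraction of shared categories where two reviewers score within `tolerance`."""
    shared = a.keys() & b.keys()
    matches = sum(abs(a[c] - b[c]) <= tolerance for c in shared)
    return matches / len(shared)

# Hypothetical scores from two independent reviewers of the same document.
reviewer_a = {"lore_depth": 9.0, "creativity": 8.0, "visual_design": 10.0,
              "political_integration": 9.0, "character_consistency": 8.0,
              "narrative_foreshadowing": 7.0}
reviewer_b = {"lore_depth": 9.0, "creativity": 8.5, "visual_design": 9.0,
              "political_integration": 9.0, "character_consistency": 8.0,
              "narrative_foreshadowing": 7.5}

# 5 of 6 categories within tolerance -> ~0.83; the divergent category
# flags a documentation gap to fix, not a quality problem.
print(agreement(reviewer_a, reviewer_b))
```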
Potential research questions this framework could answer:
- Does AI-assisted worldbuilding produce measurably more consistent output than solo creation?
- Which categories benefit most from AI assistance vs. human judgment?
- At what scale does AI-assisted worldbuilding reach diminishing returns?
- How does the error correction rate change with session length?
A Framework You Can Use Right Now
Score each major element (character, city, faction, race) on the 6 categories:
■ 6 or below — Needs fundamental work. Missing core components.
■ 7 — Functional but generic. Works for the story but doesn't feel alive.
■ 8 — Strong and believable. Integrated into the world.
■ 9 — Exceptional. Feels like it existed before the story arrived.
■ 10 — Mythic. The reader will remember it years later.
How to use it in practice (a minimal sketch of the scoring loop follows this list):
- Build the element (character, city, etc.) — use AI or write it yourself
- Score it honestly on all 6 categories
- Any category below 8? That's where your next work session should focus
- Have someone else (human or AI) score it independently — disagreement reveals blind spots
- Check every factual claim against your source material — AI is confident, not infallible
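Putting steps 2 through 4 together, a minimal sketch; the function names and the threshold-to-tier mapping are my own reading of the scale above:

```python
CATEGORIES = ["lore_depth", "creativity", "visual_design",
              "political_integration", "character_consistency",
              "narrative_foreshadowing"]

def tier(score: float) -> str:
    """Map a score to the tiers in the scale above."""
    if score <= 6:  return "needs fundamental work"
    if score < 8:   return "functional but generic"
    if score < 9:   return "strong and believable"
    if score < 10:  return "exceptional"
    return "mythic"

def next_focus(scores: dict[str, int]) -> list[str]:
    """Step 3: any category below 8 is the next work session."""
    return [c for c in CATEGORIES if scores[c] < 8]

def blind_spots(mine: dict[str, int], theirs: dict[str, int], gap: float = 1.0) -> list[str]:
    """Step 4: categories where an independent reviewer disagrees by more than `gap`."""
    return [c for c in CATEGORIES if abs(mine[c] - theirs[c]) > gap]

# Hypothetical self-scores and an independent reviewer's scores.
mine   = {"lore_depth": 9, "creativity": 7, "visual_design": 10,
          "political_integration": 8, "character_consistency": 9,
          "narrative_foreshadowing": 6}
theirs = {"lore_depth": 9, "creativity": 8, "visual_design": 8,
          "political_integration": 8, "character_consistency": 9,
          "narrative_foreshadowing": 7}

print(tier(sum(mine.values()) / len(mine)))  # strong and believable
print(next_focus(mine))                      # ['creativity', 'narrative_foreshadowing']
print(blind_spots(mine, theirs))             # ['visual_design']
```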
The Bottom Line
19 lore documents · 50+ commits in one session · 40 months of building
Worldbuilding quality is measurable. AI makes it possible to build at a scale that would take a team of editors. But the scoring — the judgment of what's good enough, what needs more, and what matters — that's human work.
The framework is the bridge between "I built a lot of stuff" and "I built a world."
Scoring data from 13 character documents built across 40 months. Independently reviewed by two AI systems with 85% agreement rate. 5 errors caught exclusively by human manuscript verification. Part of the devlog for The Ethereal Web.
— Jorge