The Spark Effect: On Engineering Creative Diversity in Multi-Agent AI Systems

Authors: Alexander Doudkin (Art of X), Friedrich von Borries (HFBK Hamburg, Art of X)

Published: October 2025 — arXiv:2510.15568

Collaboration: Art of X UG AI Research team and HFBK Hamburg, with initial funding from the Hamburg Open Online University (HOOU) program.

Summary

Creative services teams increasingly rely on large language models to accelerate ideation, yet production systems often converge on homogeneous outputs that fail to meet brand or artistic expectations. This paper introduces the Spark Effect — a measurable improvement in creative diversity achieved by replacing uniform system prompts with persona-conditioned LLM agents.

Using an LLM-as-a-judge protocol calibrated against human gold standards, we observe a mean diversity gain of +4.1 points (on a 1–10 scale) when persona-conditioned Spark agents replace a uniform system prompt, narrowing the gap to human experts to just 1.0 point.

The Problem: Creative Homogeneity in LLM Outputs

Despite careful prompt engineering, production LLM systems suffer from three failure modes:

Persona collapse — agents adopt a generic consultant tone irrespective of the creative brief.
Template overfitting — similar checklist structures reappear with minimal variation across outputs.
Lack of counterpoints — outputs rarely challenge client assumptions or surface ethical tensions.

An internal audit at Art of X in mid-2024 confirmed that baseline generations from a single agent with a generic prompt clustered around repetitive structures, undermining customer trust. The mean diversity score for baseline outputs was just 3.14 out of 10.

The Solution: Persona-Conditioned Spark Agents

Art of X developed a catalogue of 60+ richly authored system prompts — internally branded as "Sparks" — that embody distinct creative worldviews. Examples include a Taoist philosopher of organisations, a Swedish sustainability architect, and a queer futurist art critic.

Each Spark prompt encodes:

Motivations — the agent's intellectual drivers and creative priorities
Stylistic constraints — language register, metaphor density, rhetorical approach
Red lines — explicit boundaries to prevent role-breaking outputs
Task-specific cues — formatting and deliverable requirements

The Spark workflow samples a diverse subset of these persona-conditioned agents per task. Each selected agent receives a curated retrieval-augmented (RAG) context bundle before responding, producing outputs with heterogeneous reasoning styles.

Example Persona (Abridged)

Identity: You are Chen, a contemplative philosopher with a sharp and unorthodox understanding of economic relationships.
Philosophy/Skills: You draw serenity from Taoism, order from Confucianism, and tone from Zen. Ask: "What invisible forces are at work here? Where is the situation naturally heading?"
Language: Calm, vivid, metaphorical; blend quotes from Laozi with modern systems terminology.
Limitations: You are a philosopher, not an investment banker. Analyse markets without giving trading advice.

Evaluation Methodology

LLM-as-a-Judge Protocol

The study uses an LLM-as-a-judge paradigm with two few-shot examples curated from human-labelled data. To quantify evaluator reliability, the team collected:

A human gold dataset where expert annotators supplied both responses and diversity scores (mean: 8.90)
An evaluator bias dataset where the LLM evaluator rescored the same human responses (mean: 10.22)

This revealed a +1.32 point optimism bias in the LLM evaluator — an important calibration factor. Because all experiments share the same evaluator configuration, relative improvements remain trustworthy.

Experimental Design

Experiment	Description	Mean Score
Exp 1 — Human Gold	Expert-authored responses + human diversity scores	8.90
Exp 2 — Evaluator Bias	Same human responses rescored by LLM evaluator	10.22
Exp 3 v1 — Pre-Spark	Specialised agents without persona conditioning	3.76
Exp 3 v2 — Spark Agents	Full persona-conditioned Spark agents	7.90
Exp 4 — Baseline	Single agent, no system prompt conditioning	3.14

Each experiment covers six real client tasks with ten agent responses per task (60 outputs).

Results

The Spark Effect: +4.1 Points

Spark agents nearly double the diversity score relative to the baseline, closing 82% of the gap to human experts (from a 5.76-point gap down to 1.0 point).

Statistical Significance

Spark v2 vs. baseline: +5.69 points mean advantage (SD 1.98), t(6) = 7.61, p = 2.68 × 10⁻⁴, Cohen's d = 2.88
Wilcoxon signed-rank test: W = 0, p = 1.56 × 10⁻²
Pre-Spark v1 vs. baseline: +0.61 points — not statistically significant (p = 0.47)

The finalised Spark persona library — not generic agent diversification alone — drives the performance gains.

Per-Task Behaviour

Qualitative review revealed three dimensions of improvement:

Strategic spread — some agents foreground business KPIs and experimentation roadmaps, while others emphasise speculative design or ritual framing
Ethical sensitivity — sustainability- and commons-oriented personas surface consent, provenance, and attribution safeguards absent from baseline outputs
Tone modulation — outputs range from blunt "no-bullshit" critiques to poetic invitations, giving clients varied rhetorical options

Relevance to Minds

The Spark Effect research directly informs the Minds platform. The same persona-conditioning methodology that produced +4.1 point diversity gains in creative services drives Minds' synthetic persona engine — enabling AI research panels that produce genuinely diverse perspectives rather than homogeneous, agreeable outputs.

When you create a Mind with specific demographics, expertise, and worldview, the underlying persona conditioning draws on the same principles validated in this research: richly authored identity prompts, explicit stylistic constraints, and intentional cognitive diversity across panel members.

Limitations

The benchmark comprises six tasks from real client engagements — future work should expand coverage to additional industries and geographies
LLM evaluator bias remains an open challenge requiring periodic recalibration against human labels
All experiments used a single model family (GPT-4o-mini for generation, GPT-4o for evaluation)

Citation

@article{doudkin2025spark,
  title={The Spark Effect: On Engineering Creative Diversity in Multi-Agent AI Systems},
  author={Doudkin, Alexander and von Borries, Friedrich},
  journal={arXiv preprint arXiv:2510.15568},
  year={2025}
}

Read the full paper: arXiv:2510.15568 (PDF)

The Spark Effect: On Engineering Creative Diversity in Multi-Agent AI Systems

User Access