User Guide
A comprehensive guide to using Delstorm for multi-agent experimentation.
1. Overview
Delstorm is a multi-agent experimentation platform. You design experiments where AI agents — each with their own persona, backed by an LLM of your choice — engage in structured discourse across configurable rounds.
Each round has a condition prompt that sets the social scenario: a debate topic, a collaborative task, a negotiation, a simulated classroom, or any situation you can describe. Agents respond based on their personas and the conversation history. You control the interaction pattern, number of turns, which agents participate, and how long responses should be.
Use cases include:
- Simulating academic debates grounded in real papers
- Modeling stakeholder negotiations with distinct interests
- Exploring how different ideological perspectives interact on a topic
- Testing how agent personas, temperature, or round structure affect discourse outcomes
- Running batch experiments to observe variance across identical setups
- Generating structured multi-perspective analysis of documents or topics
2. Getting Started
To run your first experiment, follow these steps:
1. Create an account
Click "Sign In" in the top right. If you don't have an account, click "Create one" to sign up with your email and password. You'll receive a confirmation email — click the link to activate your account, then sign in.
2. Check your providers
Go to Settings. You should see at least one provider already configured (e.g., "claude"). If not, you'll need to add one — see Setting Up Providers below.
3. Create an experiment
Go to New Experiment. Give it a name, select your default provider (e.g., "claude"), enter the default model name (e.g., "claude-sonnet-4-6"), and add at least one agent and one round.
4. Launch
Click "Launch Experiment" and watch the agents respond in real-time. Results appear as they're generated.
Quick test: Try 2 agents, 1 round, "Short" response length. This will finish in under a minute and let you verify everything works before building a larger experiment.
3. Setting Up Providers
A provider is an LLM backend that powers your agents. Providers are configured in Settings.
Shared vs. Personal Providers
Your administrator may have pre-configured shared providers (like Claude) that are available to all users automatically. These appear in your provider list without any setup. You can also add your own personal providers — these are visible only to you.
Provider Types
Anthropic (Claude)
For Claude models by Anthropic.
- Backend: Select "Anthropic"
- Base URL: Leave blank (the SDK handles this automatically)
- API Key: Your Anthropic API key (starts with sk-ant-). Get one at console.anthropic.com.
- Default Model: claude-sonnet-4-6 (recommended) or any Claude model ID
OpenAI Compatible
Works with any server that exposes the OpenAI-style /v1/chat/completions endpoint. This includes Ollama, vLLM, LMStudio, RunPod, OpenAI itself, and many others.
- Backend: Select "OpenAI Compatible"
- Base URL: The server's API URL ending in /v1 (e.g., http://localhost:11434/v1 for Ollama, or a RunPod proxy URL)
- API Key: Whatever the server requires. For Ollama, any string works (e.g., "ollama"). For OpenAI, your OpenAI API key.
- Default Model: The model name as the server knows it (e.g., gemma2:latest for Ollama, gpt-4o for OpenAI)
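All three settings map onto the standard OpenAI wire format, so it can help to see the request a provider entry corresponds to. A minimal sketch in Python (illustrative only, not Delstorm code; the helper name is made up):

```python
import json

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Assemble an OpenAI-style chat completion request from provider settings.

    base_url should end in /v1, exactly as entered in the provider form.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, json.dumps(payload)

# Ollama example: the server accepts any API key string.
url, headers, body = build_chat_request(
    "http://localhost:11434/v1", "ollama", "gemma2:latest", "Say hello."
)
print(url)  # http://localhost:11434/v1/chat/completions
```

If a provider misbehaves, posting this request to your Base URL with curl or `urllib` is a quick way to confirm the server and model name are reachable before blaming the experiment setup.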
Per-Agent Provider Override
By default, all agents in an experiment use the experiment's default provider. However, you can override this on individual agents — for example, to have one agent powered by Claude and another by GPT-4. Set this in the "Provider Override" dropdown on each agent card.
4. Building an Experiment
The Experiment Builder is where you design your experiments. It has these sections:
1. Load Design (optional)
If you have saved designs, select one from the dropdown to pre-fill all fields.
2. Experiment Info
Name your experiment, select the default provider (e.g., "claude"), and enter the default model ID (e.g., "claude-sonnet-4-6"). The model ID must match what the provider expects.
3. Document Context (optional)
Upload a document to extract themes and generate grounded agent personas. See Document-Grounded Personas.
4. Agents
Define the participants. Click "Add Agent" to create agent cards. See Agents & Personas.
5. Rounds
Define the sequence of interactions. Click "Add Round" to create round cards. See Rounds, Conditions & Settings.
6. Launch / Save / Batch
"Launch Experiment" runs it once. "Save Design" stores the config for reuse. "Batch Run" runs it multiple times.
5. Agents & Personas
Agents are the participants in your experiment. Each agent is defined by these fields:
Name / ID
A unique identifier shown in the results (e.g., "Left_Coalition_Theorist", "Budget_Hawk"). Use descriptive names — they help you read the results and they're included in the agent's prompt context.
Specialty
The agent's area of expertise (e.g., "Progressive policy coherence", "Fiscal conservatism"). This is included in the agent's system prompt to focus its responses.
Persona
The most important field. This is the system prompt that defines who the agent is. It should describe: what the agent believes, what evidence they draw on, how they communicate, and what perspective they bring. The richer and more specific this is, the more distinctive and grounded the agent's responses will be.
Example of a weak persona: "A political analyst."
Example of a strong persona: "You are a political analyst specializing in left-wing party coherence. You argue that egalitarianism functions as a structural unifying principle across wealth redistribution, social morality, and immigration. You draw on Cochrane's analysis of Benoit and Laver's 22-country expert survey data, which shows 100% of far-left economic parties also hold left-wing positions on social and immigration dimensions. You are confrontational toward claims of left-right symmetry."
Temperature
Controls response variability. Range: 0.0 to 1.5.
- Low (0.1-0.3): More deterministic, focused, predictable. Good for analytical or conservative agents.
- Medium (0.4-0.6): Balanced. The default (0.5) works well for most cases.
- High (0.7-1.0): More creative, varied, surprising. Good for innovative or provocative agents.
Tip: Pairing agents with different temperatures (e.g., 0.3 and 0.8) produces more interesting discourse than uniform temperatures.
Memory Window
How many past rounds the agent remembers. Default is 0 (all rounds). Set to a number (e.g., 2) to limit memory to the last N rounds. This is useful for simulating bounded memory or preventing context from growing too large.
Memory Overflow
What happens to memories beyond the window. "Summary" compresses older rounds into a summary. "Forget" drops them entirely. This is an experimental variable — agents with different memory settings can behave very differently over many rounds.
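Conceptually, the two settings combine like this. A minimal sketch assuming a plain list of round transcripts (an illustration of the semantics described above, not the platform's implementation):

```python
def visible_memory(rounds, window=0, overflow="summary", summarize=None):
    """Sketch of memory windowing.

    rounds: list of round transcripts, oldest first.
    window=0 keeps everything; otherwise keep only the last `window` rounds.
    overflow: "summary" compresses older rounds, "forget" drops them.
    """
    if window == 0 or len(rounds) <= window:
        return rounds
    recent = rounds[-window:]
    if overflow == "forget":
        return recent
    older = rounds[:-window]
    # A real system would summarize with an LLM; default to a placeholder.
    summary = summarize(older) if summarize else f"[summary of {len(older)} earlier round(s)]"
    return [summary] + recent
```

With window=2 and "forget", an agent entering round 5 sees only rounds 3 and 4; with "summary", it sees a compressed digest of rounds 1-2 plus rounds 3 and 4 in full.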
Agent Type, Model Override, Provider Override
"Agent Type" is a label (default "Custom"). "Model Override" lets you use a different model than the experiment default. "Provider Override" lets this agent use a different LLM provider entirely — e.g., one agent on Claude, another on GPT-4.
6. Document-Grounded Personas
You can upload a document (PDF, TXT, or MD) to automatically extract themes, positions, and stakeholders, then generate agent personas grounded in the document's actual content.
Step-by-step:
1. In the Experiment Builder, go to the "Document Context" section. Upload your file and select which provider to use for analysis (Claude is recommended for best extraction quality).
2. Click "Analyze Document". The LLM reads the document and extracts: title, summary, key themes, distinct positions/stances, stakeholders, key arguments, and domain terminology. This takes 10-30 seconds.
3. Review the extraction results. The themes, positions, and stakeholders should give you ideas for what agents to create.
4. Add agents and give them relevant names and specialties (e.g., if the document is about left-right politics, create "Left_Coalition_Theorist" with specialty "Progressive policy coherence").
5. On each agent card, click "Generate Persona from Document". The LLM creates a detailed persona for that specific agent, using the document's themes, arguments, data points, and terminology. The persona is tailored to the agent's name and specialty — not a generic summary.
6. Review and edit the generated personas. The generation is a starting point — you can refine it, add specific instructions, or combine it with your own text.
Important notes:
- Document analysis is ephemeral — it's not saved with the experiment. The generated personas ARE saved as part of the agent configuration.
- The more specific the agent name and specialty, the more distinctive the generated persona.
- Supported file types: PDF, TXT, MD (max 10MB).
- For academic papers, the extraction identifies the thesis, counterarguments, methodology, and key findings.
- For narratives or stories, characters become stakeholders and their motivations become positions.
7. Rounds, Conditions & Settings
Each round represents a phase of the experiment. Rounds are independent by default — you create continuity through how you write your condition prompts.
The Condition Prompt
This is the most important part of a round. It sets the social scenario that all participating agents operate within. Think of it as setting the stage for a scene.
Example — Academic conference:
"You are at an academic conference panel. Present your interpretation of the asymmetry finding and what explains it."
Example — Escalating challenge:
"A critic challenges your methodology. Defend your position and address the critique directly."
Example — Resource scarcity:
"Resources have become scarce. A drought has hit the village. Renegotiate your trade agreements."
Round Settings
Interaction Pattern
How agents interact within the round. See Interaction Patterns for details on all five options.
Turns per Round
How many back-and-forth exchanges within this round. Default is 1 (each agent speaks once). Set to 3 and agents will go back and forth three times, each turn seeing what everyone said in previous turns. More turns = deeper conversation within a single round.
Response Length
Controls how long each agent's responses should be:
- Dynamic — No constraint. Agents respond as thoroughly as they see fit. Produces the most natural responses but can be very long with capable models.
- Short — 2-4 concise paragraphs. Good for quick iteration, testing, and batch runs.
- Medium — 4-6 paragraphs. Balanced detail.
- Thorough — Detailed responses with evidence and elaboration.
This is set per-round, so you can have a short opening round followed by a thorough deep-dive round.
Participating Agents
Which agents are active in this round. Leave empty for all agents. Enter comma-separated agent IDs to restrict participation (e.g., "Agent_1, Agent_3"). Click "Fill from agents" to auto-populate from your current agent list. This lets you design rounds where only certain agents interact.
Context Window
How many past rounds of shared discourse are visible to agents. Default is 0 (all rounds visible). Set to 2 to only show the last 2 rounds. This is an experimental variable — limiting context simulates bounded attention.
Context Overflow
What happens to round history beyond the context window. "Summary" compresses it. "Truncate" drops older rounds. "Forget" removes them entirely.
8. Interaction Patterns
Each round uses one of five interaction patterns that define how agents communicate:
Free Discussion
All agents respond to the condition prompt. In multi-turn rounds, each agent sees all other agents' responses from previous turns. This is the default and most common pattern — it simulates an open conversation.
Best for: general discussion, brainstorming, collaborative analysis
Debate
Agents are split into teams that argue opposing positions. You define team assignments as JSON — for example: {"for_regulation": ["Agent_1", "Agent_2"], "against_regulation": ["Agent_3"]}. Each team sees the other team's arguments.
Best for: adversarial discourse, policy debates, exploring opposing viewpoints
Chain
Sequential — each agent sees only the immediately previous agent's response, not the full discussion. You specify the order as comma-separated IDs (e.g., "Agent_1, Agent_2, Agent_3"). Ideas build incrementally, like a game of telephone.
Best for: iterative refinement, building on ideas, examining how information transforms
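The telephone-game behavior can be sketched in a few lines. Hypothetical Python, with `respond` standing in for one LLM call (this is a conceptual model, not Delstorm's code):

```python
def run_chain(agents, condition, respond):
    """Chain pattern: each agent sees only the immediately previous response.

    agents: ordered list of agent IDs.
    respond(agent, prompt): stand-in for a single LLM call.
    """
    previous = None
    transcript = []
    for agent in agents:
        if previous is None:
            prompt = condition  # first agent sees only the condition
        else:
            prompt = f"{condition}\n\nPrevious agent said:\n{previous}"
        previous = respond(agent, prompt)
        transcript.append((agent, previous))
    return transcript
```

Note that Agent_3 never sees Agent_1's response directly, only whatever Agent_2 made of it. That information loss is exactly what the pattern is designed to study.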
Blind Parallel
All agents respond independently with no visibility of each other's responses — even from previous turns. Each agent only sees the condition prompt.
Best for: comparing uninfluenced perspectives, measuring baseline positions before discussion
Custom
Define your own interaction rules in natural language using the "Custom Instructions" field. These instructions are injected into each agent's prompt alongside the condition. Use this for scenarios that don't fit the presets — e.g., "Only respond to the agent who spoke before you" or "You can only communicate in questions."
Best for: novel interaction designs, creative constraints, specialized scenarios
Tip: You can use different patterns across rounds in the same experiment. Start with Blind Parallel (to capture uninfluenced positions), then Free Discussion (to let agents interact), then Debate (to force a conclusion).
9. Running Experiments
Click "Launch Experiment" to start. You'll be taken to the live view.
The live view shows:
- A progress indicator while the experiment is running
- Live streaming of each agent's response as tokens are generated (you see the text appearing in real-time)
- Color-coded agent names for easy identification
- Round-by-round results with full markdown rendering (headers, bold, lists, etc.)
- An "Export JSON" button for downloading the complete results
Duration estimates:
With Claude Sonnet (3 agents, 3 rounds):
- Short responses: ~2-4 minutes
- Medium responses: ~5-8 minutes
- Dynamic responses: ~10-15 minutes
More agents, more rounds, and more turns per round increase duration proportionally.
10. Batch Running
Batch running lets you execute the same experiment design multiple times to observe how agent behavior varies across identical setups.
How to use:
- Build your experiment as usual in the experiment builder.
- Instead of "Launch Experiment", click the purple "Batch Run" button.
- Enter the number of runs (1-20).
- A batch progress page shows completion status for each run.
- Click into individual runs to see their full results.
Each run uses the exact same configuration — same agents, personas, rounds, and conditions — but produces different responses due to LLM sampling variability. Runs execute sequentially (not in parallel) to avoid API rate limits.
Tip: Use "Short" response length for batch runs to save time and API costs. Run 3-5 short batch runs to identify interesting patterns, then do a single thorough run to explore them in depth.
11. Saving & Loading Designs
Designs let you save and reuse experiment configurations:
- Save Design: Click the green "Save Design" button in the experiment builder. Your agents (including personas), rounds, and all settings are saved to your account.
- Load Design: At the top of the experiment builder, select a saved design from the dropdown and click "Load". All fields are populated with the saved values.
- Delete Design: Select a design from the dropdown and click "Delete" to remove it.
What gets saved:
- Experiment name, default provider, default model
- All agents: names, specialties, personas, temperatures, memory settings, provider overrides
- All rounds: condition prompts, interaction patterns, turns, response length, participating agents, context settings
What is NOT saved: the uploaded document and its analysis (these are ephemeral). Generated personas ARE saved as part of the agent text.
12. Exporting Results
Click "Export JSON" on any completed experiment to download the full results. The JSON file includes:
- Experiment metadata (name, duration, session ID)
- Agent specifications (names, personas, temperatures, providers)
- All rounds with condition prompts, interaction patterns, and full agent responses
- Turn-by-turn data for multi-turn rounds
- Timing data per round
The exported JSON can be used for further analysis in Python, R, or any data processing tool.
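For example, a few lines of Python can tabulate word counts per agent from a downloaded export. The field names below are assumptions made for illustration; inspect your own file for the exact schema:

```python
from collections import defaultdict

# Illustrative export shape. Field names here are assumptions, not a
# documented schema; open your downloaded JSON to see the real structure.
export = {
    "experiment": {"name": "Asymmetry debate"},
    "rounds": [
        {"round": 1, "responses": [
            {"agent": "Left_Coalition_Theorist", "text": "Egalitarianism unifies the dimensions."},
            {"agent": "Budget_Hawk", "text": "Fiscal limits come first."},
        ]},
    ],
}

# Tally words spoken per agent across all rounds.
words = defaultdict(int)
for rnd in export["rounds"]:
    for resp in rnd["responses"]:
        words[resp["agent"]] += len(resp["text"].split())

for agent, count in sorted(words.items()):
    print(agent, count)
```

The same loop structure extends naturally to response lengths per round, keyword frequencies, or feeding transcripts into a sentiment or stance classifier.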
13. Evaluation
The Evaluate page lets you analyze completed experiments by defining structured evaluation fields and running a hybrid parser + LLM extraction system.
When to Use Evaluation
Evaluation is post-hoc and on-demand — you run it after an experiment completes, not as part of the experiment design. This means you can evaluate the same experiment multiple ways, or decide what to measure after seeing the results.
Access the evaluate page from:
- The "Evaluate" link in the navigation bar
- The "Evaluate" button on any completed experiment's results page
- The "Evaluate All Runs" button on a completed batch page
How It Works
The evaluation system is hybrid — it combines two approaches:
1. Deterministic Parser (always runs)
Scans the transcript using regex patterns for explicit values. If an agent wrote "Score: 7/10" or "Consensus: yes", the parser extracts it directly. This is instant, free, and exact.
2. LLM Evaluator (optional)
Sends the transcript + field definitions + parser findings to an LLM for nuanced analysis. The LLM can understand context ("the agents generally agreed" → Consensus: true), fill in text fields ("summarize the key agreement"), and validate parser findings. Costs one LLM call per evaluation.
The parser runs first. The LLM runs second, informed by what the parser found. Results are merged with confidence tracking so you know whether each value came from the parser (deterministic), the LLM (inferred), or both (confirmed).
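A simplified version of the deterministic side might look like the following. This is a sketch of the idea, not the platform's actual parser:

```python
import re

def parse_number(transcript, field_name, lo=None, hi=None):
    """Sketch of deterministic extraction for a numeric field.

    Looks for patterns like "Consensus: 7" or "Consensus: 7/10" and
    validates the value against an optional min/max range.
    """
    pattern = rf"{re.escape(field_name)}\s*[:=]\s*(\d+(?:\.\d+)?)(?:\s*/\s*\d+)?"
    m = re.search(pattern, transcript, re.IGNORECASE)
    if not m:
        return None  # nothing explicit found; the LLM pass may still infer a value
    value = float(m.group(1))
    if lo is not None and value < lo:
        return None
    if hi is not None and value > hi:
        return None
    return value

print(parse_number("Final verdict. Consensus: 7/10, strong agreement.", "Consensus", 1, 10))
```

When the regex finds nothing, the field falls through to the LLM pass, which is why explicitly formatted evaluator output ("Consensus: 7/10") yields the cheapest and most reliable extractions.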
Field Types
Number — A numeric value, optionally with min/max range. Parser looks for patterns like "Score: 7" or "8/10". Example: "Consensus Score (1-10)".
Scale — Like number but intended for Likert-type ratings. Same extraction logic with range validation.
Text — Free-form text answer. Always requires the LLM (the parser can't infer text). Example: "Key Agreement Points".
Boolean — Yes/no, true/false. Parser looks for "Resolved: yes" style patterns. Example: "Reached Consensus".
Choice — Pick from predefined options. Parser looks for explicit mentions near the field name, or falls back to most-mentioned option. Example: "Winner" with choices ["Agent_1", "Agent_2", "Tie"].
Source Filtering
Each field can optionally specify a source round and/or source agent. When set, the parser and LLM only look at that portion of the transcript. This is useful when you have a dedicated evaluation round — you want scores from the evaluator agent's output, not from the full debate.
Evaluation Templates
Save your field definitions as reusable templates. If you always evaluate political debates with the same criteria (Consensus Score, Winner, Key Arguments), save those fields as a template and load it for each new evaluation. Templates are saved to your account.
Batch Comparison
When evaluating multiple experiments (from a batch run), results display as a comparison table — fields as rows, experiments as columns. Numeric fields show an average across runs. This is the core tool for measuring variance: "Did the agents reach consensus more often in runs with lower temperature?"
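The averaging a comparison table performs can be sketched as follows, assuming each evaluated run yields a dict of field values (illustrative only):

```python
def comparison_table(runs):
    """Sketch of batch comparison: average numeric fields across runs.

    runs: list of {field_name: value} dicts, one per evaluated run.
    Non-numeric values (text, choice) are skipped when averaging.
    """
    fields = sorted({f for run in runs for f in run})
    table = {}
    for field in fields:
        numeric = [run[field] for run in runs
                   if field in run and isinstance(run[field], (int, float))]
        table[field] = sum(numeric) / len(numeric) if numeric else None
    return table

runs = [{"Consensus": 7, "Winner": "Agent_1"},
        {"Consensus": 5, "Winner": "Agent_2"},
        {"Consensus": 6, "Winner": "Agent_1"}]
print(comparison_table(runs))
```

For non-numeric fields like "Winner", the table shows per-run values rather than an average; tallying the most frequent choice across runs is a natural next step in your own analysis.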
Example Workflow: Evaluator Agent
A powerful pattern is to include a dedicated evaluator agent in your experiment:
- Create an agent named "Evaluator" with a persona like "You are an impartial judge who evaluates the quality of discourse."
- Add a final round with only the Evaluator participating, with a condition like: "Score the preceding discussion on: consensus (1-10), argument quality (1-10), and name the strongest contributor."
- After the experiment, go to Evaluate and define fields: "Consensus (number, 1-10, source agent: Evaluator)", "Argument Quality (number, 1-10, source agent: Evaluator)", "Strongest Contributor (choice, source agent: Evaluator)".
- The parser will extract the scores directly from the evaluator's structured output.
Confidence Badges
parser — Value was extracted deterministically from the transcript text.
llm — Value was inferred by the LLM evaluator.
both — Both parser and LLM found the value (highest confidence).
none — Neither parser nor LLM could determine a value.
14. AI Assistant
Delstorm includes a built-in AI assistant that can answer questions about the platform. Look for the blue chat bubble in the bottom-right corner of every page.
Click it to open the chat panel and ask anything about how to use Delstorm — setting up providers, building experiments, configuring agents, using evaluation, or understanding features.
Example questions:
- "How do I set up a Claude provider?"
- "What's the difference between free discussion and debate patterns?"
- "How do I use document grounding to create agent personas?"
- "What evaluation field types are available?"
- "How does batch running work?"
The assistant is powered by Claude and has access to this entire guide as its knowledge base. It provides concise, specific answers with references to relevant features. It will not make up information — if it doesn't know the answer, it will tell you.
15. Tips & Best Practices
Start small, then scale up
Test with 2 agents, 1 round, short responses first. Once your design works, add more rounds, agents, and switch to dynamic length. This saves time and API costs during iteration.
Invest in personas
The single biggest factor in experiment quality is agent persona specificity. Use document grounding to create rich initial personas, then edit them to sharpen the perspective. A 5-sentence persona produces dramatically better results than a 1-sentence one.
Design round progression
Structure rounds to escalate: Round 1 for initial positions, Round 2 for challenges and rebuttals, Round 3 for synthesis and conclusions. This mirrors real discourse structure and produces more nuanced results than a single long round.
Vary temperatures across agents
A conservative agent (temp 0.3) paired with a creative one (temp 0.8) produces more interesting discourse than agents at uniform temperatures. The conservative agent grounds the conversation while the creative one introduces novel perspectives.
Save designs before launching
Always save your design before running an experiment. If the experiment produces unexpected results, you can reload and adjust without re-entering everything.
Use batch runs for research
If you're studying how agents interact under specific conditions, run the same design 3-5 times with short responses to observe variance before committing to a thorough single run. This helps you identify which experimental designs are worth investing in.
Mix interaction patterns across rounds
Start with Blind Parallel (to capture uninfluenced baseline positions), then switch to Free Discussion (to let agents interact and update their views), then finish with a Debate (to force a structured conclusion). This design reveals how discourse changes agents' positions.
16. Glossary
Agent
An AI participant in an experiment, defined by a persona prompt and backed by an LLM.
Batch Run
Running the same experiment configuration multiple times to observe variance.
Condition Prompt
The text that sets the social scenario for a round — the situation agents are placed in.
Confidence
In evaluation, indicates how a field value was determined: "parser" (regex extraction), "llm" (LLM inference), "both" (confirmed by both), or "none" (not found).
Design
A saved experiment configuration (agents, rounds, settings) that can be loaded and reused.
Evaluation
Post-hoc analysis of experiment transcripts using defined fields. Combines deterministic parsing with LLM inference.
Evaluation Field
A structured metric to extract from a transcript: number, text, scale, boolean, or choice.
Evaluation Template
A saved set of evaluation field definitions that can be reused across experiments.
Hybrid Evaluator
The system that runs deterministic parsing first, then LLM analysis, merging results with confidence tracking.
Interaction Pattern
How agents communicate within a round: free discussion, debate, chain, blind parallel, or custom.
Persona
The system prompt that defines an agent's identity, beliefs, and communication style.
Provider
An LLM backend (e.g., Claude, Ollama, OpenAI) that powers agents. Configured in Settings.
Round
A phase of an experiment defined by a condition prompt, interaction pattern, and settings.
Temperature
An LLM sampling parameter that controls response variability. Lower = more deterministic, higher = more creative.
Turn
One exchange within a round where all participating agents respond. Multiple turns create back-and-forth conversation.