User Guide
A comprehensive guide to using Delstorm for multi-agent experimentation.
1. Overview
Delstorm is a multi-agent experimentation platform. You design experiments where AI agents — each with their own persona, backed by an LLM of your choice — engage in structured discourse across configurable rounds.
Each round has a condition prompt that sets the social scenario: a debate topic, a collaborative task, a negotiation, a simulated classroom, or any situation you can describe. Agents respond based on their personas and the conversation history. You control the interaction pattern, number of turns, which agents participate, and how long responses should be.
Use cases include:
- Simulating academic debates grounded in real papers
- Modeling stakeholder negotiations with distinct interests
- Exploring how different ideological perspectives interact on a topic
- Testing how agent personas, temperature, or round structure affect discourse outcomes
- Running batch experiments to observe variance across identical setups
- Generating structured multi-perspective analysis of documents or topics
2. Getting Started
To run your first experiment, follow these steps:
1. Create an account
Click "Sign In" in the top right. If you don't have an account, click "Create one" to sign up with your email and password. You'll receive a confirmation email — click the link to activate your account, then sign in.
2. Check your providers
Go to Settings. You should see at least one provider already configured (e.g., "claude"). If not, you'll need to add one — see Setting Up Providers below.
3. Create an experiment
Go to New Experiment. Give it a name, select your default provider (e.g., "claude"), enter the default model name (e.g., "claude-sonnet-4-6"), and add at least one agent and one round.
4. Launch
Click "Launch Experiment" and watch the agents respond in real-time. Results appear as they're generated.
Quick test: Try 2 agents, 1 round, "Short" response length. This will finish in under a minute and let you verify everything works before building a larger experiment.
3. Setting Up Providers
A provider is an LLM backend that powers your agents. Providers are configured in Settings.
Shared vs. Personal Providers
Your administrator may have pre-configured shared providers (like Claude) that are available to all users automatically. These appear in your provider list without any setup. You can also add your own personal providers — these are visible only to you.
Provider Types
Anthropic (Claude)
For Claude models by Anthropic.
- Backend: Select "Anthropic"
- Base URL: Leave blank (the SDK handles this automatically)
- API Key: Your Anthropic API key (starts with sk-ant-). Get one at console.anthropic.com.
- Default Model: claude-sonnet-4-6 (recommended) or any Claude model ID
OpenAI Compatible
Works with any server that exposes the OpenAI-style /v1/chat/completions endpoint. This includes Ollama, vLLM, LMStudio, RunPod, OpenAI itself, and many others.
- Backend: Select "OpenAI Compatible"
- Base URL: The server's API URL ending in /v1 (e.g., http://localhost:11434/v1 for Ollama, or a RunPod proxy URL)
- API Key: Whatever the server requires. For Ollama, any string works (e.g., "ollama"). For OpenAI, your OpenAI API key.
- Default Model: The model name as the server knows it (e.g., gemma2:latest for Ollama, gpt-4o for OpenAI)
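All three settings map onto the standard OpenAI wire format, so it can help to see the request a provider entry corresponds to. A minimal sketch in Python (illustrative only, not Delstorm code; the helper name is made up):

```python
import json

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Assemble an OpenAI-style chat completion request from provider settings.

    base_url should end in /v1, exactly as entered in the provider form.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, json.dumps(payload)

# Ollama example: the server accepts any API key string.
url, headers, body = build_chat_request(
    "http://localhost:11434/v1", "ollama", "gemma2:latest", "Say hello."
)
print(url)  # http://localhost:11434/v1/chat/completions
```

If a provider misbehaves, posting this request to your Base URL with curl or `urllib` is a quick way to confirm the server and model name are reachable before blaming the experiment setup.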
Per-Agent Provider Override
By default, all agents in an experiment use the experiment's default provider. However, you can override this on individual agents — for example, to have one agent powered by Claude and another by GPT-4. Set this in the "Provider Override" dropdown on each agent card.
4. Building an Experiment
The Experiment Builder is where you design your experiments. It has these sections:
1. Load Design (optional)
If you have saved designs, select one from the dropdown to pre-fill all fields.
2. Experiment Info
Name your experiment, select the default provider (e.g., "claude"), and enter the default model ID (e.g., "claude-sonnet-4-6"). The model ID must match what the provider expects.
3. Document Context (optional)
Upload a document to extract themes and generate grounded agent personas. See Document-Grounded Personas.
4. Agents
Define the participants. Click "Add Agent" to create agent cards. See Agents & Personas.
5. Rounds
Define the sequence of interactions. Click "Add Round" to create round cards. See Rounds, Conditions & Settings.
6. Launch / Save / Batch
"Launch Experiment" runs it once. "Save Design" stores the config for reuse. "Batch Run" runs it multiple times.
5. Agents & Personas
Agents are the participants in your experiment. Each agent is defined by these fields:
Name / ID
A unique identifier shown in the results (e.g., "Left_Coalition_Theorist", "Budget_Hawk"). Use descriptive names — they help you read the results and they're included in the agent's prompt context.
Specialty
The agent's area of expertise (e.g., "Progressive policy coherence", "Fiscal conservatism"). This is included in the agent's system prompt to focus its responses.
Persona
The most important field. This is the system prompt that defines who the agent is. It should describe: what the agent believes, what evidence they draw on, how they communicate, and what perspective they bring. The richer and more specific this is, the more distinctive and grounded the agent's responses will be.
Example of a weak persona: "A political analyst."
Example of a strong persona: "You are a political analyst specializing in left-wing party coherence. You argue that egalitarianism functions as a structural unifying principle across wealth redistribution, social morality, and immigration. You draw on Cochrane's analysis of Benoit and Laver's 22-country expert survey data, which shows 100% of far-left economic parties also hold left-wing positions on social and immigration dimensions. You are confrontational toward claims of left-right symmetry."
Temperature
Controls response variability. Range: 0.0 to 1.5.
- Low (0.1-0.3): More deterministic, focused, predictable. Good for analytical or conservative agents.
- Medium (0.4-0.6): Balanced. The default (0.5) works well for most cases.
- High (0.7-1.0): More creative, varied, surprising. Good for innovative or provocative agents.
Tip: Pairing agents with different temperatures (e.g., 0.3 and 0.8) produces more interesting discourse than uniform temperatures.
Memory Window
How many past rounds the agent remembers. Default is 0 (all rounds). Set to a number (e.g., 2) to limit memory to the last N rounds. This is useful for simulating bounded memory or preventing context from growing too large.
Memory Overflow
What happens to memories beyond the window. "Summary" compresses older rounds into a summary. "Forget" drops them entirely. This is an experimental variable — agents with different memory settings can behave very differently over many rounds.
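Conceptually, the two settings combine like this. A minimal sketch assuming a plain list of round transcripts (an illustration of the semantics described above, not the platform's implementation):

```python
def visible_memory(rounds, window=0, overflow="summary", summarize=None):
    """Sketch of memory windowing.

    rounds: list of round transcripts, oldest first.
    window=0 keeps everything; otherwise keep only the last `window` rounds.
    overflow: "summary" compresses older rounds, "forget" drops them.
    """
    if window == 0 or len(rounds) <= window:
        return rounds
    recent = rounds[-window:]
    if overflow == "forget":
        return recent
    older = rounds[:-window]
    # A real system would summarize with an LLM; default to a placeholder.
    summary = summarize(older) if summarize else f"[summary of {len(older)} earlier round(s)]"
    return [summary] + recent
```

With window=2 and "forget", an agent entering round 5 sees only rounds 3 and 4; with "summary", it sees a compressed digest of rounds 1-2 plus rounds 3 and 4 in full.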
Agent Type, Model Override, Provider Override
"Agent Type" is a label (default "Custom"). "Model Override" lets you use a different model than the experiment default. "Provider Override" lets this agent use a different LLM provider entirely — e.g., one agent on Claude, another on GPT-4.
6. Document-Grounded Personas
You can upload a document (PDF, TXT, or MD) to automatically extract themes, positions, and stakeholders, then generate agent personas grounded in the document's actual content.
Step-by-step:
1. In the Experiment Builder, go to the "Document Context" section. Upload your file and select which provider to use for analysis (Claude is recommended for best extraction quality).
2. Click "Analyze Document". The LLM reads the document and extracts: title, summary, key themes, distinct positions/stances, stakeholders, key arguments, and domain terminology. This takes 10-30 seconds.
3. Review the extraction results. The themes, positions, and stakeholders should give you ideas for what agents to create.
4. Add agents and give them relevant names and specialties (e.g., if the document is about left-right politics, create "Left_Coalition_Theorist" with specialty "Progressive policy coherence").
5. On each agent card, click "Generate Persona from Document". The LLM creates a detailed persona for that specific agent, using the document's themes, arguments, data points, and terminology. The persona is tailored to the agent's name and specialty — not a generic summary.
6. Review and edit the generated personas. The generation is a starting point — you can refine it, add specific instructions, or combine it with your own text.
Important notes:
- Document analysis is ephemeral — it's not saved with the experiment. The generated personas ARE saved as part of the agent configuration.
- The more specific the agent name and specialty, the more distinctive the generated persona.
- Supported file types: PDF, TXT, MD (max 10MB).
- For academic papers, the extraction identifies the thesis, counterarguments, methodology, and key findings.
- For narratives or stories, characters become stakeholders and their motivations become positions.
7. Rounds, Conditions & Settings
Each round represents a phase of the experiment. Rounds are independent by default — you create continuity through how you write your condition prompts.
The Condition Prompt
This is the most important part of a round. It sets the social scenario that all participating agents operate within. Think of it as setting the stage for a scene.
Example — Academic conference:
"You are at an academic conference panel. Present your interpretation of the asymmetry finding and what explains it."
Example — Escalating challenge:
"A critic challenges your methodology. Defend your position and address the critique directly."
Example — Resource scarcity:
"Resources have become scarce. A drought has hit the village. Renegotiate your trade agreements."
Round Settings
Interaction Pattern
How agents interact within the round. See Interaction Patterns for details on all five options.
Turns per Round
How many back-and-forth exchanges within this round. Default is 1 (each agent speaks once). Set to 3 and agents will go back and forth three times, each turn seeing what everyone said in previous turns. More turns = deeper conversation within a single round.
Response Length
Controls how long each agent's responses should be:
- Dynamic — No constraint. Agents respond as thoroughly as they see fit. Produces the most natural responses but can be very long with capable models.
- Short — 2-4 concise paragraphs. Good for quick iteration, testing, and batch runs.
- Medium — 4-6 paragraphs. Balanced detail.
- Thorough — Detailed responses with evidence and elaboration.
This is set per-round, so you can have a short opening round followed by a thorough deep-dive round.
Participating Agents
Which agents are active in this round. Leave empty for all agents. Enter comma-separated agent IDs to restrict participation (e.g., "Agent_1, Agent_3"). Click "Fill from agents" to auto-populate from your current agent list. This lets you design rounds where only certain agents interact.
Context Window
How many past rounds of shared discourse are visible to agents. Default is 0 (all rounds visible). Set to 2 to only show the last 2 rounds. This is an experimental variable — limiting context simulates bounded attention.
Context Overflow
What happens to round history beyond the context window. "Summary" compresses it. "Truncate" drops older rounds. "Forget" removes them entirely.
8. Interaction Patterns
Each round uses one of five interaction patterns that define how agents communicate:
Free Discussion
All agents respond to the condition prompt. In multi-turn rounds, each agent sees all other agents' responses from previous turns. This is the default and most common pattern — it simulates an open conversation.
Best for: general discussion, brainstorming, collaborative analysis
Debate
Agents are split into teams that argue opposing positions. You define team assignments as JSON — for example: {"for_regulation": ["Agent_1", "Agent_2"], "against_regulation": ["Agent_3"]}. Each team sees the other team's arguments.
Best for: adversarial discourse, policy debates, exploring opposing viewpoints
Chain
Sequential — each agent sees only the immediately previous agent's response, not the full discussion. You specify the order as comma-separated IDs (e.g., "Agent_1, Agent_2, Agent_3"). Ideas build incrementally, like a game of telephone.
Best for: iterative refinement, building on ideas, examining how information transforms
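The telephone-game behavior can be sketched in a few lines. Hypothetical Python, with `respond` standing in for one LLM call (this is a conceptual model, not Delstorm's code):

```python
def run_chain(agents, condition, respond):
    """Chain pattern: each agent sees only the immediately previous response.

    agents: ordered list of agent IDs.
    respond(agent, prompt): stand-in for a single LLM call.
    """
    previous = None
    transcript = []
    for agent in agents:
        if previous is None:
            prompt = condition  # first agent sees only the condition
        else:
            prompt = f"{condition}\n\nPrevious agent said:\n{previous}"
        previous = respond(agent, prompt)
        transcript.append((agent, previous))
    return transcript
```

Note that Agent_3 never sees Agent_1's response directly, only whatever Agent_2 made of it. That information loss is exactly what the pattern is designed to study.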
Blind Parallel
All agents respond independently with no visibility of each other's responses — even from previous turns. Each agent only sees the condition prompt.
Best for: comparing uninfluenced perspectives, measuring baseline positions before discussion
Custom
Define your own interaction rules in natural language using the "Custom Instructions" field. These instructions are injected into each agent's prompt alongside the condition. Use this for scenarios that don't fit the presets — e.g., "Only respond to the agent who spoke before you" or "You can only communicate in questions."
Best for: novel interaction designs, creative constraints, specialized scenarios
Tip: You can use different patterns across rounds in the same experiment. Start with Blind Parallel (to capture uninfluenced positions), then Free Discussion (to let agents interact), then Debate (to force a conclusion).
9. Running Experiments
Click "Launch Experiment" to start. You'll be taken to the live view.
The live view shows:
- A progress indicator while the experiment is running
- Live streaming of each agent's response as tokens are generated (you see the text appearing in real-time)
- Color-coded agent names for easy identification
- Round-by-round results with full markdown rendering (headers, bold, lists, etc.)
- An "Export JSON" button for downloading the complete results
Duration estimates:
With Claude Sonnet (3 agents, 3 rounds):
- Short responses: ~2-4 minutes
- Medium responses: ~5-8 minutes
- Dynamic responses: ~10-15 minutes
More agents, more rounds, and more turns per round increase duration proportionally.
10. Batch Running
Batch running lets you execute the same experiment design multiple times to observe how agent behavior varies across identical setups.
How to use:
- Build your experiment as usual in the experiment builder.
- Instead of "Launch Experiment", click the purple "Batch Run" button.
- Enter the number of runs (1-20).
- A batch progress page shows completion status for each run.
- Click into individual runs to see their full results.
Each run uses the exact same configuration — same agents, personas, rounds, and conditions — but produces different responses due to LLM sampling variability. Runs execute sequentially (not in parallel) to avoid API rate limits.
Tip: Use "Short" response length for batch runs to save time and API costs. Run 3-5 short batch runs to identify interesting patterns, then do a single thorough run to explore them in depth.
11. Saving & Loading Designs
Designs let you save and reuse experiment configurations:
- Save Design: Click the green "Save Design" button in the experiment builder. Your agents (including personas), rounds, and all settings are saved to your account.
- Load Design: At the top of the experiment builder, select a saved design from the dropdown and click "Load". All fields are populated with the saved values.
- Delete Design: Select a design from the dropdown and click "Delete" to remove it.
What gets saved:
- Experiment name, default provider, default model
- All agents: names, specialties, personas, temperatures, memory settings, provider overrides
- All rounds: condition prompts, interaction patterns, turns, response length, participating agents, context settings
What is NOT saved: the uploaded document and its analysis (these are ephemeral). Generated personas ARE saved as part of the agent text.
12. Exporting Results
Click "Export JSON" on any completed experiment to download the full results. The JSON file includes:
- Experiment metadata (name, duration, session ID)
- Agent specifications (names, personas, temperatures, providers)
- All rounds with condition prompts, interaction patterns, and full agent responses
- Turn-by-turn data for multi-turn rounds
- Timing data per round
The exported JSON can be used for further analysis in Python, R, or any data processing tool.
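For example, a few lines of Python can tabulate word counts per agent from a downloaded export. The field names below are assumptions made for illustration; inspect your own file for the exact schema:

```python
from collections import defaultdict

# Illustrative export shape. Field names here are assumptions, not a
# documented schema; open your downloaded JSON to see the real structure.
export = {
    "experiment": {"name": "Asymmetry debate"},
    "rounds": [
        {"round": 1, "responses": [
            {"agent": "Left_Coalition_Theorist", "text": "Egalitarianism unifies the dimensions."},
            {"agent": "Budget_Hawk", "text": "Fiscal limits come first."},
        ]},
    ],
}

# Tally words spoken per agent across all rounds.
words = defaultdict(int)
for rnd in export["rounds"]:
    for resp in rnd["responses"]:
        words[resp["agent"]] += len(resp["text"].split())

for agent, count in sorted(words.items()):
    print(agent, count)
```

The same loop structure extends naturally to response lengths per round, keyword frequencies, or feeding transcripts into a sentiment or stance classifier.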
13. Evaluation
The Evaluate page lets you analyze completed experiments by defining structured evaluation fields and running a hybrid parser + LLM extraction system.
When to Use Evaluation
Evaluation is post-hoc and on-demand — you run it after an experiment completes, not as part of the experiment design. This means you can evaluate the same experiment multiple ways, or decide what to measure after seeing the results.
Access the evaluate page from:
- The "Evaluate" link in the navigation bar
- The "Evaluate" button on any completed experiment's results page
- The "Evaluate All Runs" button on a completed batch page
How It Works
The evaluation system is hybrid — it combines two approaches:
1. Deterministic Parser (always runs)
Scans the transcript using regex patterns for explicit values. If an agent wrote "Score: 7/10" or "Consensus: yes", the parser extracts it directly. This is instant, free, and exact.
2. LLM Evaluator (optional)
Sends the transcript + field definitions + parser findings to an LLM for nuanced analysis. The LLM can understand context ("the agents generally agreed" → Consensus: true), fill in text fields ("summarize the key agreement"), and validate parser findings. Costs one LLM call per evaluation.
The parser runs first. The LLM runs second, informed by what the parser found. Results are merged with confidence tracking so you know whether each value came from the parser (deterministic), the LLM (inferred), or both (confirmed).
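A simplified version of the deterministic side might look like the following. This is a sketch of the idea, not the platform's actual parser:

```python
import re

def parse_number(transcript, field_name, lo=None, hi=None):
    """Sketch of deterministic extraction for a numeric field.

    Looks for patterns like "Consensus: 7" or "Consensus: 7/10" and
    validates the value against an optional min/max range.
    """
    pattern = rf"{re.escape(field_name)}\s*[:=]\s*(\d+(?:\.\d+)?)(?:\s*/\s*\d+)?"
    m = re.search(pattern, transcript, re.IGNORECASE)
    if not m:
        return None  # nothing explicit found; the LLM pass may still infer a value
    value = float(m.group(1))
    if lo is not None and value < lo:
        return None
    if hi is not None and value > hi:
        return None
    return value

print(parse_number("Final verdict. Consensus: 7/10, strong agreement.", "Consensus", 1, 10))
```

When the regex finds nothing, the field falls through to the LLM pass, which is why explicitly formatted evaluator output ("Consensus: 7/10") yields the cheapest and most reliable extractions.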
Field Types
Number — A numeric value, optionally with min/max range. Parser looks for patterns like "Score: 7" or "8/10". Example: "Consensus Score (1-10)".
Scale — Like number but intended for Likert-type ratings. Same extraction logic with range validation.
Text — Free-form text answer. Always requires the LLM (the parser can't infer text). Example: "Key Agreement Points".
Boolean — Yes/no, true/false. Parser looks for "Resolved: yes" style patterns. Example: "Reached Consensus".
Choice — Pick from predefined options. Parser looks for explicit mentions near the field name, or falls back to most-mentioned option. Example: "Winner" with choices ["Agent_1", "Agent_2", "Tie"].
Source Filtering
Each field can optionally specify a source round and/or source agent. When set, the parser and LLM only look at that portion of the transcript. This is useful when you have a dedicated evaluation round — you want scores from the evaluator agent's output, not from the full debate.
Evaluation Templates
Save your field definitions as reusable templates. If you always evaluate political debates with the same criteria (Consensus Score, Winner, Key Arguments), save those fields as a template and load it for each new evaluation. Templates are saved to your account.
Batch Comparison
When evaluating multiple experiments (from a batch run), results display as a comparison table — fields as rows, experiments as columns. Numeric fields show an average across runs. This is the core tool for measuring variance: "Did the agents reach consensus more often in runs with lower temperature?"
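The averaging a comparison table performs can be sketched as follows, assuming each evaluated run yields a dict of field values (illustrative only):

```python
def comparison_table(runs):
    """Sketch of batch comparison: average numeric fields across runs.

    runs: list of {field_name: value} dicts, one per evaluated run.
    Non-numeric values (text, choice) are skipped when averaging.
    """
    fields = sorted({f for run in runs for f in run})
    table = {}
    for field in fields:
        numeric = [run[field] for run in runs
                   if field in run and isinstance(run[field], (int, float))]
        table[field] = sum(numeric) / len(numeric) if numeric else None
    return table

runs = [{"Consensus": 7, "Winner": "Agent_1"},
        {"Consensus": 5, "Winner": "Agent_2"},
        {"Consensus": 6, "Winner": "Agent_1"}]
print(comparison_table(runs))
```

For non-numeric fields like "Winner", the table shows per-run values rather than an average; tallying the most frequent choice across runs is a natural next step in your own analysis.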
Example Workflow: Evaluator Agent
A powerful pattern is to include a dedicated evaluator agent in your experiment:
- Create an agent named "Evaluator" with a persona like "You are an impartial judge who evaluates the quality of discourse."
- Add a final round with only the Evaluator participating, with a condition like: "Score the preceding discussion on: consensus (1-10), argument quality (1-10), and name the strongest contributor."
- After the experiment, go to Evaluate and define fields: "Consensus (number, 1-10, source agent: Evaluator)", "Argument Quality (number, 1-10, source agent: Evaluator)", "Strongest Contributor (choice, source agent: Evaluator)".
- The parser will extract the scores directly from the evaluator's structured output.
Confidence Badges
parser — Value was extracted deterministically from the transcript text.
llm — Value was inferred by the LLM evaluator.
both — Both parser and LLM found the value (highest confidence).
none — Neither parser nor LLM could determine a value.
14. AI Assistant
Delstorm includes a built-in AI assistant that can answer questions about the platform. Look for the blue chat bubble in the bottom-right corner of every page.
Click it to open the chat panel and ask anything about how to use Delstorm — setting up providers, building experiments, configuring agents, using evaluation, or understanding features.
Example questions:
- "How do I set up a Claude provider?"
- "What's the difference between free discussion and debate patterns?"
- "How do I use document grounding to create agent personas?"
- "What evaluation field types are available?"
- "How does batch running work?"
The assistant is powered by Claude and has access to this entire guide as its knowledge base. It provides concise, specific answers with references to relevant features. It will not make up information — if it doesn't know the answer, it will tell you.
15. Tips & Best Practices
Start small, then scale up
Test with 2 agents, 1 round, short responses first. Once your design works, add more rounds, agents, and switch to dynamic length. This saves time and API costs during iteration.
Invest in personas
The single biggest factor in experiment quality is agent persona specificity. Use document grounding to create rich initial personas, then edit them to sharpen the perspective. A 5-sentence persona produces dramatically better results than a 1-sentence one.
Design round progression
Structure rounds to escalate: Round 1 for initial positions, Round 2 for challenges and rebuttals, Round 3 for synthesis and conclusions. This mirrors real discourse structure and produces more nuanced results than a single long round.
Vary temperatures across agents
A conservative agent (temp 0.3) paired with a creative one (temp 0.8) produces more interesting discourse than agents at uniform temperatures. The conservative agent grounds the conversation while the creative one introduces novel perspectives.
Save designs before launching
Always save your design before running an experiment. If the experiment produces unexpected results, you can reload and adjust without re-entering everything.
Use batch runs for research
If you're studying how agents interact under specific conditions, run the same design 3-5 times with short responses to observe variance before committing to a thorough single run. This helps you identify which experimental designs are worth investing in.
Mix interaction patterns across rounds
Start with Blind Parallel (to capture uninfluenced baseline positions), then switch to Free Discussion (to let agents interact and update their views), then finish with a Debate (to force a structured conclusion). This design reveals how discourse changes agents' positions.
16. Glossary
Agent
An AI participant in an experiment, defined by a persona prompt and backed by an LLM.
Batch Run
Running the same experiment configuration multiple times to observe variance.
Condition Prompt
The text that sets the social scenario for a round — the situation agents are placed in.
Confidence
In evaluation, indicates how a field value was determined: "parser" (regex extraction), "llm" (LLM inference), "both" (confirmed by both), or "none" (not found).
Design
A saved experiment configuration (agents, rounds, settings) that can be loaded and reused.
Evaluation
Post-hoc analysis of experiment transcripts using defined fields. Combines deterministic parsing with LLM inference.
Evaluation Field
A structured metric to extract from a transcript: number, text, scale, boolean, or choice.
Evaluation Template
A saved set of evaluation field definitions that can be reused across experiments.
Hybrid Evaluator
The system that runs deterministic parsing first, then LLM analysis, merging results with confidence tracking.
Interaction Pattern
How agents communicate within a round: free discussion, debate, chain, blind parallel, or custom.
Persona
The system prompt that defines an agent's identity, beliefs, and communication style.
Provider
An LLM backend (e.g., Claude, Ollama, OpenAI) that powers agents. Configured in Settings.
Round
A phase of an experiment defined by a condition prompt, interaction pattern, and settings.
Temperature
An LLM sampling parameter that controls response variability. Lower = more deterministic, higher = more creative.
Turn
One exchange within a round where all participating agents respond. Multiple turns create back-and-forth conversation.