- Identify prompts that need improvement from your traces
- Store prompts in the Prompt Hub for version control
- Edit and test prompts in the Playground
- Pull optimized prompts back into your code
Follow along with code: This guide has a companion notebook with runnable code examples. Find it here.
Step 1: Locate Bad Spans in Traces
By inspecting our traces, we can find where the model made mistakes and pinpoint which prompt and step were responsible. This gives us the starting point for any meaningful improvement.
Open the ChatCompletion span (the classification step) to see the following (a code sketch of this step appears after the list):
- The system prompt with the list of categories
- The user’s support query
- The classification output
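To make this concrete, here is a minimal sketch of what the classification step behind that span might look like. Everything in it is illustrative: the category list, the prompt wording, the classify function, and the gpt-4o model string are assumptions, not the guide's actual code. With tracing instrumentation enabled, a call like this is what gets recorded as the ChatCompletion span.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative category list -- the real app's categories will differ.
CATEGORIES = ["Billing", "Bug Report", "Feature Request", "Account Access"]

SYSTEM_PROMPT = (
    "You are a support triage assistant. "
    f"Classify the user's query into exactly one of these categories: {', '.join(CATEGORIES)}. "
    "Respond with the category name only."
)

def classify(query: str) -> str:
    # With tracing enabled, this call is recorded as a ChatCompletion span,
    # capturing the system prompt, the user's query, and the model's output.
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the pre-upgrade model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```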
Step 2: Replay Span and Edit Prompt in Playground
Once we’ve identified a weak spot, the next step is to test and refine. The Playground lets us replay the same input, edit the prompt, and see how those edits change the model’s output, all without touching code. From the Playground, we can do each of the following (a request-level sketch appears after the list):
- Edit the prompt template
- Try different models
- Adjust parameters (temperature, max tokens)
- Re-run and compare outputs
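Each of these controls corresponds to a field on the underlying chat completion request. As a rough, self-contained illustration (an OpenAI-style API is assumed; the prompt, query, and parameter values are placeholders), the Playground is effectively editing a call like this:

```python
from openai import OpenAI

client = OpenAI()

# Placeholders standing in for the replayed span's actual inputs.
system_prompt = "Classify the user's support query into exactly one category..."
user_query = "I can't log in to my account."

response = client.chat.completions.create(
    model="gpt-4o",     # "Try different models"
    temperature=0.0,    # "Adjust parameters": low temperature for stable labels
    max_tokens=20,      # a category name only needs a few tokens
    messages=[
        {"role": "system", "content": system_prompt},  # "Edit the prompt template"
        {"role": "user", "content": user_query},
    ],
)
print(response.choices[0].message.content)  # "Re-run and compare outputs"
```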
Save Original Prompt to Prompt Hub
Before making changes, it’s important to save a baseline. Storing the original prompt in Prompt Hub ensures every version is tracked and recoverable, so you can compare edits later and avoid losing what worked before. In the Playground, click Save Prompt and give your prompt a name: support-classifier.
Edit Prompt and Re-Run Span
Then we’ll make two changes:
- Add a rule to our prompt that tightens the classification instructions (a hypothetical example of such a rule appears below).
- Upgrade our model to GPT-5 to see if a model upgrade helps with classification.
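The guide’s actual rule isn’t reproduced here; purely as a hypothetical illustration, a disambiguation rule for a support classifier might read:

```
If a query could plausibly fit more than one category, choose the category
that matches the customer's primary intent, and never answer with a category
outside the provided list.
```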
Save Edited Prompt as a New Prompt Version (Version 2)
Once you’ve verified the change works, save it as a new version. Versioning lets you track progress over time and roll back if future edits don’t perform as expected. Click Save Prompt and keep the same prompt name, support-classifier.
Now, we can see that both versions of our prompt are stored!
Step 3: Load Edited Prompt Back Into Your Code
The final step is to edit our actual code to use the new prompt version we just created, as sketched below.
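The exact call depends on your platform’s SDK; the prompt_hub module and its pull_prompt method below are hypothetical stand-ins for whatever client your platform provides, and the OpenAI wiring around them assumes the same classification step as before.

```python
from openai import OpenAI

# Hypothetical Prompt Hub client -- substitute your platform's SDK here.
from prompt_hub import Client  # hypothetical import

hub = Client()
llm = OpenAI()

# Fetch the latest saved version of the prompt (Version 2 in this guide).
system_prompt = hub.pull_prompt("support-classifier")

def classify(query: str) -> str:
    """Classify a support query using the prompt pulled from the hub."""
    response = llm.chat.completions.create(
        model="gpt-5",  # the model we switched to in Step 2
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(classify("I was charged twice for my subscription this month."))
```

Because the prompt now lives in Prompt Hub rather than in the source, future prompt changes can ship without a code deploy.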
Summary
Congratulations! You’ve improved your agent’s performance. You identified where your prompt was falling short, replayed that example, and refined it to produce a more accurate classification. By saving both versions in Prompt Hub, you’ve established a reliable, version-controlled workflow for prompt iteration - one you can reuse as your application evolves.

