Workflow Benchmarks
Benchmarks evaluate the final output of an AI Workflow, providing objective metrics that help you assess quality and identify areas for improvement. Use them to track quality, guide iteration, and estimate cost before scaling to scheduled jobs.
What Are Workflow Benchmarks?
Workflow Benchmarks are evaluation tools that analyze the quality of AI Workflow outputs. They provide:
- Objective Metrics: Quantitative scores across multiple quality dimensions
- Visual Dashboards: Interactive displays showing performance metrics
- Actionable Feedback: Specific recommendations for improvement
- Quality Tracking: Track quality improvements over time
Default Benchmark Dashboard
The default benchmark analyzes output across multiple dimensions and renders an interactive dashboard:
Overall Score
Overall Score: Single 0–100 score with a benchmark range.
Display: The overall score appears prominently at the top of the dashboard with:
- Score Value: Your workflow's overall quality score (0-100)
- Benchmark Range: Typical score range for comparison
- Color Coding: Visual indicators (success, info, warning, error) based on performance
Interpretation (see the sketch after this list):
- Above Upper Benchmark: Excellent performance
- Within Benchmark Range: Good performance
- Below Lower Benchmark: Needs improvement
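The score banding above can be summarized in a small sketch. The TypeScript below is illustrative only; the type and field names (`BenchmarkResult`, `benchmarkRange`, and so on) are assumptions chosen for clarity, not the product's actual data model or API.

```typescript
// Hypothetical shape of a benchmark result. Field names are assumptions
// chosen to mirror the dashboard elements described above.
interface BenchmarkRange {
  lower: number; // lower bound of the typical score range
  upper: number; // upper bound of the typical score range
}

interface BenchmarkResult {
  overallScore: number; // 0-100 overall quality score
  benchmarkRange: BenchmarkRange;
  metrics: { name: string; score: number; range: BenchmarkRange }[];
  feedback: { kind: "success" | "info" | "warning"; text: string }[];
}

// Map the overall score onto the interpretation bands listed above.
function interpretOverallScore(result: BenchmarkResult): string {
  const { overallScore, benchmarkRange } = result;
  if (overallScore > benchmarkRange.upper) return "Excellent performance";
  if (overallScore >= benchmarkRange.lower) return "Good performance";
  return "Needs improvement";
}
```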
Metric Bars
Metric Bars: Horizontal bars show per-dimension scores against benchmark ranges (shaded bands).
Display: Each metric appears as:
- Metric Name: Dimension being evaluated (e.g., "Clarity", "Readability")
- Score: Your score (0-100) for that dimension
- Visual Bar: Horizontal bar showing score position
- Benchmark Range: Shaded band showing typical score range
- Rating Chip: Color-coded rating (Excellent, Good, Fair, Poor); see the sketch at the end of this section
Common Dimensions:
- Readability Ease/Grade: How easy the text is to read
- Clarity: How clear and understandable the content is
- Cohesion: How well ideas flow together
- Tone Appropriateness: Whether the tone matches the intended audience
- Engagement: How engaging and interesting the content is
- Repetition: Level of unnecessary repetition
- Lexical Diversity: Variety of vocabulary used
- Complexity: Whether the complexity level suits the content and audience
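For illustration, one plausible way to derive the rating chip from a metric's score and its shaded benchmark band is sketched below. The cutoffs are assumptions; the dashboard may use different thresholds.

```typescript
// One plausible mapping from a metric score and its benchmark band to a
// rating chip. The thresholds here are assumptions, not documented behavior.
type RatingChip = "Excellent" | "Good" | "Fair" | "Poor";

function ratingChip(score: number, lower: number, upper: number): RatingChip {
  const bandWidth = upper - lower;
  if (score > upper) return "Excellent";         // above the shaded band
  if (score >= lower) return "Good";             // inside the shaded band
  if (score >= lower - bandWidth) return "Fair"; // just below the band
  return "Poor";                                 // well below the band
}

// Example: a Clarity score of 62 against a typical band of 55-75 rates "Good".
console.log(ratingChip(62, 55, 75)); // "Good"
```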
Feedback Section
Feedback: Actionable notes highlighting strengths and areas to improve.
Display: Feedback appears as a list of:
- Strengths: What the workflow does well
- Improvements: Specific areas to focus on
- Recommendations: Suggestions for enhancing output quality
Icons: Feedback items include icons indicating:
- Success: Strengths and positive aspects
- Info: Informational notes
- Warning: Areas needing attention
Running Benchmarks
From a Workflow
To configure benchmarks for a workflow:
1. Open your AI Workflow
2. Navigate to the Benchmark section
3. Choose your benchmark option:
   - Use Default Benchmark: Use the built-in default benchmark that analyzes multiple quality dimensions
   - Edit Benchmark: Create a custom benchmark with your own evaluation prompt
4. Configure benchmark settings (see the sketch after these steps):
   - Run on Completion: Enable to automatically run benchmarks after workflow execution
   - Use Default Prompt: Toggle between the default evaluation prompt and a custom one
5. Save your workflow
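Conceptually, these settings amount to a small configuration object. The sketch below is a hypothetical representation for reference; the property names are assumptions, and in practice the settings are managed through the workflow UI.

```typescript
// Hypothetical representation of the benchmark settings described above.
// Property names are assumptions; the product manages these through its UI.
interface BenchmarkSettings {
  useDefaultBenchmark: boolean; // built-in default benchmark vs. a custom one
  useDefaultPrompt: boolean;    // default evaluation prompt vs. a custom prompt
  customPrompt?: string;        // evaluation prompt used when the default is off
  runOnCompletion: boolean;     // evaluate automatically after each execution
}

const settings: BenchmarkSettings = {
  useDefaultBenchmark: true,
  useDefaultPrompt: true,
  runOnCompletion: true,
};
```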
Default Benchmark
What It Does: The default benchmark uses a comprehensive evaluation prompt that produces:
- An overall quality score
- Scores across multiple quality dimensions
- An inferred document type
- Actionable feedback
When to Use: Use the default benchmark for general quality assessment and when you want comprehensive evaluation.
Custom Benchmark
What It Does: Custom benchmarks use your own evaluation prompt to assess specific aspects of workflow output.
When to Use: Use custom benchmarks when you need to evaluate:
- Specific quality criteria
- Domain-specific requirements
- Custom evaluation metrics
- Specialized assessment needs
Creating Custom Benchmarks:
1. Select Edit Benchmark
2. Write your evaluation prompt
3. Use Handlebars variables to reference workflow output (see the example below)
4. Save the benchmark configuration
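A custom evaluation prompt might look like the example below. The variable name `workflow_output` is a placeholder for illustration; substitute whichever Handlebars variable your workflow exposes for its output.

```handlebars
Evaluate the following document for clarity, tone appropriateness, and
domain accuracy. Score each dimension from 0 to 100, then list the three
most important improvements.

Document to evaluate:
{{workflow_output}}
```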
Run on Completion
Run on Completion: Enable this option to automatically evaluate workflow output after each execution.
Benefits:
- Automatic quality tracking
- Immediate feedback on output quality
- Quality monitoring over time
- No manual benchmark execution needed
Configuration: Check the "Run on completion" checkbox in the Benchmark section.
From a Job
To run benchmarks for a specific workflow job:
1. Open the Workflow Job you want to evaluate
2. Navigate to the Benchmark Output section
3. Review the benchmark results if the benchmark has already run
4. Run the benchmark if it has not yet been executed
Use Cases:
- Evaluate specific job outputs
- Compare quality across different jobs
- Test workflow improvements
- Quality assurance for important outputs
Interpreting Results
Score Interpretation
Scores Near or Above Upper Benchmark: Strong Performance: Indicates excellent quality that meets or exceeds typical standards.
Action: Continue with current workflow configuration. Consider this as a quality baseline.
Mid-Range Scores: Acceptable Quality: Indicates good quality with room for improvement.
Action: Review feedback for specific improvement areas. Consider refining prompts or variables.
Scores Below Lower Benchmark: Needs Improvement: Highlights areas that should be prioritized for enhancement.
Action: Focus on low-scoring dimensions. Revise prompts, adjust variables, or modify output templates.
Metric Analysis
Individual Metrics: Review each metric bar to understand:
- Which dimensions perform well
- Which dimensions need improvement
- Relative strengths and weaknesses
Benchmark Comparison: Compare your scores to benchmark ranges to understand:
- How your output compares to typical quality
- Whether improvements are needed
- If quality meets expectations
Feedback Utilization
Strengths: Identify what your workflow does well to:
- Maintain successful aspects
- Understand what works
- Build on strengths
Improvements: Focus on feedback recommendations to:
- Address specific quality issues
- Enhance weak areas
- Refine workflow configuration
Using Benchmarks for Iteration
Iterative Improvement Process
1. Run Benchmark: Execute the workflow and review the benchmark results
2. Identify Issues: Review low-scoring metrics and feedback
3. Refine Workflow: Update prompts, variables, or output templates
4. Re-run: Execute the workflow again with the improvements
5. Compare: Compare the new benchmark results to the previous run (see the sketch after this list)
6. Iterate: Repeat until quality meets your standards
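To make the compare step concrete, the sketch below computes per-metric deltas between two runs. The shapes are minimal stand-ins that mirror the earlier hypothetical result sketch; they are assumptions, not the product's API.

```typescript
// Minimal shapes for illustration; assumptions, not the product's API.
interface MetricScore { name: string; score: number }
interface RunResult { metrics: MetricScore[] }

// Compare two benchmark runs metric by metric.
function compareRuns(previous: RunResult, current: RunResult) {
  return current.metrics.map((metric) => {
    const before = previous.metrics.find((m) => m.name === metric.name);
    return {
      name: metric.name,
      delta: before ? metric.score - before.score : metric.score,
    };
  });
}

// A positive delta means the dimension improved after refinement; a negative
// delta flags a regression to address in the next iteration.
```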
Refining Based on Benchmarks
Prompt Refinement: Use feedback to improve workflow prompts:
- Add clarity to instructions
- Specify tone requirements
- Include quality criteria
Variable Adjustment: Modify variables based on benchmark insights:
- Adjust content parameters
- Refine context variables
- Optimize input data
Output Template Updates: Enhance output templates:
- Improve structure
- Add formatting requirements
- Specify style guidelines
Best Practices
Regular Benchmarking
Consistent Evaluation: Run benchmarks regularly to track quality over time
Quality Monitoring: Monitor benchmark scores to detect quality degradation
Improvement Tracking: Track score improvements to measure workflow enhancements
Benchmark Configuration
Appropriate Benchmarks: Choose default or custom benchmarks based on your needs
Run on Completion: Enable automatic benchmarking for continuous quality monitoring
Custom Prompts: Create custom benchmarks for specialized evaluation requirements
Result Analysis
Comprehensive Review: Review all metrics, not just overall score
Feedback Focus: Pay attention to actionable feedback recommendations
Comparative Analysis: Compare benchmarks across different workflow versions
Cost Estimation
Before Scaling: Run benchmarks on a small number of jobs to gauge quality and per-run cost before scaling to scheduled jobs
Quality Assurance: Ensure quality meets standards before production use
Resource Planning: Use benchmark insights to plan workflow improvements
Limitations
Benchmark Scope
Output Evaluation: Benchmarks evaluate final workflow output, not intermediate steps
Quality Dimensions: Benchmarks focus on content quality, not functional correctness
Subjective Aspects: Some quality aspects may be subjective and vary by use case
Benchmark Accuracy
Model Dependency: Benchmark accuracy depends on the AI model used for evaluation
Prompt Dependency: Custom benchmark quality depends on evaluation prompt quality
Context Limitations: Benchmarks may not capture all contextual requirements
Related Documentation
- AI Workflows - Learn about creating and managing workflows
- Prompt Best Practices - Improve workflow prompts
- AI Credits - Understand credit usage for benchmarks