Workflow Benchmarks

Benchmarks evaluate the final output of an AI Workflow, providing objective metrics to assess output quality and identify areas for improvement. Use them to track quality, guide iteration, and estimate cost before scaling to scheduled jobs.

What Are Workflow Benchmarks?

Workflow Benchmarks are evaluation tools that analyze the quality of AI Workflow outputs. They provide:

  • Objective Metrics: Quantitative scores across multiple quality dimensions
  • Visual Dashboards: Interactive displays showing performance metrics
  • Actionable Feedback: Specific recommendations for improvement
  • Quality Tracking: Ability to track quality improvements over time

Default Benchmark Dashboard

The default benchmark analyzes output across multiple dimensions and renders an interactive dashboard:

Overall Score

Overall Score: Single 0–100 score with a benchmark range.

Display: The overall score appears prominently at the top of the dashboard with:

  • Score Value: Your workflow's overall quality score (0-100)
  • Benchmark Range: Typical score range for comparison
  • Color Coding: Visual indicators (success, info, warning, error) based on performance

Interpretation:

  • Above Upper Benchmark: Excellent performance
  • Within Benchmark Range: Good performance
  • Below Lower Benchmark: Needs improvement
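
The exact thresholds and band colors are computed by the dashboard itself. As a rough illustration of how an overall score relates to the bands above, here is a minimal sketch that assumes the benchmark range is given as a (lower, upper) pair:

    def interpret_score(score: float, lower: float, upper: float) -> str:
        # Map a 0-100 overall score onto the interpretation bands described above.
        # The (lower, upper) benchmark range is an assumed input here; the
        # dashboard derives and displays the actual range for you.
        if score > upper:
            return "Above Upper Benchmark: excellent performance"
        if score >= lower:
            return "Within Benchmark Range: good performance"
        return "Below Lower Benchmark: needs improvement"

    # Example: a score of 85 against an illustrative benchmark range of 60-80
    print(interpret_score(85, lower=60, upper=80))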

Metric Bars

Metric Bars: Horizontal bars show per‑dimension scores against benchmark ranges (shaded bands).

Display: Each metric appears as:

  • Metric Name: Dimension being evaluated (e.g., "Clarity", "Readability")
  • Score: Your score (0-100) for that dimension
  • Visual Bar: Horizontal bar showing score position
  • Benchmark Range: Shaded band showing typical score range
  • Rating Chip: Color-coded rating (Excellent, Good, Fair, Poor)

Common Dimensions:

  • Readability Ease/Grade: How easy the text is to read
  • Clarity: How clear and understandable the content is
  • Cohesion: How well ideas flow together
  • Tone Appropriateness: Whether the tone matches the intended audience
  • Engagement: How engaging and interesting the content is
  • Repetition: Level of unnecessary repetition
  • Lexical Diversity: Variety of vocabulary used
  • Complexity: Whether the complexity level is appropriate for the content and audience
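
The exact formulas behind these dimensions are not documented here; readability-ease scores of this kind are commonly based on classic indices such as Flesch Reading Ease. The sketch below shows that standard formula as general background only, with a naive syllable counter; the dashboard's own calculation may differ:

    import re

    def flesch_reading_ease(text: str) -> float:
        # Standard Flesch formula:
        #   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
        # Higher values mean easier-to-read text.
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        # Naive syllable estimate: count groups of consecutive vowels per word.
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        n_words = max(1, len(words))
        return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

    print(round(flesch_reading_ease("Short sentences are easy to read. Long ones are not."), 1))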

Feedback Section

Feedback: Actionable notes highlighting strengths and areas to improve.

Display: Feedback appears as a list of:

  • Strengths: What the workflow does well
  • Improvements: Specific areas to focus on
  • Recommendations: Suggestions for enhancing output quality

Icons: Feedback items include icons indicating:

  • Success: Strengths and positive aspects
  • Info: Informational notes
  • Warning: Areas needing attention

Running Benchmarks

From a Workflow

To configure benchmarks for a workflow:

  1. Open your AI Workflow
  2. Navigate to the Benchmark section
  3. Choose your benchmark option:
    • Use Default Benchmark: Use the built-in default benchmark that analyzes multiple quality dimensions
    • Edit Benchmark: Create a custom benchmark with your own evaluation prompt
  4. Configure benchmark settings:
    • Run on Completion: Enable to automatically run benchmarks after workflow execution
    • Use Default Prompt: Toggle to use the default or a custom evaluation prompt
  5. Save your workflow

Default Benchmark

What It Does: The default benchmark uses a comprehensive evaluation prompt that produces:

  • Overall quality score
  • Multiple quality dimensions
  • Document type inference
  • Actionable feedback

When to Use: Use the default benchmark for general quality assessment and when you want comprehensive evaluation.

Custom Benchmark

What It Does: Custom benchmarks use your own evaluation prompt to assess specific aspects of workflow output.

When to Use: Use custom benchmarks when you need to evaluate:

  • Specific quality criteria
  • Domain-specific requirements
  • Custom evaluation metrics
  • Specialized assessment needs

Creating Custom Benchmarks:

  1. Select Edit Benchmark
  2. Write your evaluation prompt
  3. Use Handlebars variables to reference workflow output, as in the example below
  4. Save the benchmark configuration
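
The variable names available to a custom prompt depend on how your workflow is configured, so the {{output}} placeholder in the sketch below is only illustrative; substitute whichever variable your workflow actually exposes for its final output. A custom evaluation prompt might look like this:

    {{! "output" is a hypothetical variable name used for illustration. }}
    You are reviewing a customer-facing product announcement.
    Evaluate the following output for factual consistency with the source
    material, adherence to the style guide, and suitability for an
    executive audience.

    Output to evaluate:
    {{output}}

    Return an overall score from 0 to 100, a score for each criterion,
    and two or three specific recommendations for improvement.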

Run on Completion

Run on Completion: Enable this option to automatically evaluate workflow output after each execution.

Benefits:

  • Automatic quality tracking
  • Immediate feedback on output quality
  • Quality monitoring over time
  • No manual benchmark execution needed

Configuration: Check the "Run on completion" checkbox in the Benchmark section.

From a Job

To run benchmarks for a specific workflow job:

  1. Open the Workflow Job you want to evaluate
  2. Navigate to the Benchmark Output section
  3. Review benchmark results if already run
  4. Run benchmark if not yet executed

Use Cases:

  • Evaluate specific job outputs
  • Compare quality across different jobs
  • Test workflow improvements
  • Quality assurance for important outputs

Interpreting Results

Score Interpretation

Scores Near or Above the Upper Benchmark: Strong performance; quality meets or exceeds typical standards.

Action: Continue with current workflow configuration. Consider this as a quality baseline.

Mid-Range Scores: Acceptable quality; the output is good but has room for improvement.

Action: Review feedback for specific improvement areas. Consider refining prompts or variables.

Scores Below the Lower Benchmark: Needs improvement; these are the areas to prioritize for enhancement.

Action: Focus on low-scoring dimensions. Revise prompts, adjust variables, or modify output templates.

Metric Analysis

Individual Metrics: Review each metric bar to understand:

  • Which dimensions perform well
  • Which dimensions need improvement
  • Relative strengths and weaknesses

Benchmark Comparison: Compare your scores to benchmark ranges to understand:

  • How your output compares to typical quality
  • Whether improvements are needed
  • If quality meets expectations

Feedback Utilization

Strengths: Identify what your workflow does well to:

  • Maintain successful aspects
  • Understand what works
  • Build on strengths

Improvements: Focus on feedback recommendations to:

  • Address specific quality issues
  • Enhance weak areas
  • Refine workflow configuration

Using Benchmarks for Iteration

Iterative Improvement Process

  1. Run Benchmark: Execute workflow and review benchmark results
  2. Identify Issues: Review low-scoring metrics and feedback
  3. Refine Workflow: Update prompts, variables, or output templates
  4. Re-run: Execute workflow again with improvements
  5. Compare: Compare new benchmark results to previous results
  6. Iterate: Repeat until quality meets your standards
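
The product shows benchmark results per job; if you also want a lightweight record of your own for step 5, appending each run's headline scores to a file is enough. The file name, column choice, and scores in this sketch are purely illustrative:

    import csv
    from datetime import date

    def log_benchmark_run(path, workflow_version, overall, clarity, readability):
        # Append one row per benchmark run so successive iterations are easy to compare.
        # Record whichever dashboard metrics you care about; these columns are arbitrary.
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([date.today().isoformat(), workflow_version,
                                    overall, clarity, readability])

    # Illustrative values only; copy the real scores from the benchmark dashboard.
    log_benchmark_run("benchmark_history.csv", "v2-shorter-prompt", 78, 81, 74)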

Refining Based on Benchmarks

Prompt Refinement: Use feedback to improve workflow prompts:

  • Add clarity to instructions
  • Specify tone requirements
  • Include quality criteria

Variable Adjustment: Modify variables based on benchmark insights:

  • Adjust content parameters
  • Refine context variables
  • Optimize input data

Output Template Updates: Enhance output templates:

  • Improve structure
  • Add formatting requirements
  • Specify style guidelines

Best Practices

Regular Benchmarking

Consistent Evaluation: Run benchmarks regularly to track quality over time

Quality Monitoring: Monitor benchmark scores to detect quality degradation

Improvement Tracking: Track score improvements to measure workflow enhancements

Benchmark Configuration

Appropriate Benchmarks: Choose default or custom benchmarks based on your needs

Run on Completion: Enable automatic benchmarking for continuous quality monitoring

Custom Prompts: Create custom benchmarks for specialized evaluation requirements

Result Analysis

Comprehensive Review: Review all metrics, not just overall score

Feedback Focus: Pay attention to actionable feedback recommendations

Comparative Analysis: Compare benchmarks across different workflow versions

Cost Estimation

Before Scaling: Run benchmarks to gauge quality and estimate cost before scaling to scheduled jobs

Quality Assurance: Ensure quality meets standards before production use

Resource Planning: Use benchmark insights to plan workflow improvements

Limitations

Benchmark Scope

Output Evaluation: Benchmarks evaluate final workflow output, not intermediate steps

Quality Dimensions: Benchmarks focus on content quality, not functional correctness

Subjective Aspects: Some quality aspects may be subjective and vary by use case

Benchmark Accuracy

Model Dependency: Benchmark accuracy depends on the AI model used for evaluation

Prompt Dependency: Custom benchmark quality depends on evaluation prompt quality

Context Limitations: Benchmarks may not capture all contextual requirements