Workflow Benchmarks

Benchmarks evaluate the final output of an AI Workflow, providing objective metrics to assess output quality and identify areas for improvement. Use them to track quality, guide iteration, and estimate cost before scaling to scheduled jobs.

What Are Workflow Benchmarks?

Workflow Benchmarks are evaluation tools that analyze the quality of AI Workflow outputs. They provide:

  • Objective Metrics: Quantitative scores across multiple quality dimensions
  • Visual Dashboards: Interactive displays showing performance metrics
  • Actionable Feedback: Specific recommendations for improvement
  • Quality Tracking: Ability to track quality improvements over time

Default Benchmark Dashboard

The default benchmark analyzes output across multiple dimensions and renders an interactive dashboard:

Overall Score

Overall Score: Single 0–100 score with a benchmark range.

Display: The overall score appears prominently at the top of the dashboard with:

  • Score Value: Your workflow's overall quality score (0-100)
  • Benchmark Range: Typical score range for comparison
  • Color Coding: Visual indicators (success, info, warning, error) based on performance

Interpretation:

  • Above Upper Benchmark: Excellent performance
  • Within Benchmark Range: Good performance
  • Below Lower Benchmark: Needs improvement
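
The exact thresholds and band colors are computed by the dashboard itself. As a rough illustration of how an overall score relates to the bands above, here is a minimal sketch that assumes the benchmark range is given as a (lower, upper) pair:

    def interpret_score(score: float, lower: float, upper: float) -> str:
        # Map a 0-100 overall score onto the interpretation bands described above.
        # The (lower, upper) benchmark range is an assumed input here; the
        # dashboard derives and displays the actual range for you.
        if score > upper:
            return "Above Upper Benchmark: excellent performance"
        if score >= lower:
            return "Within Benchmark Range: good performance"
        return "Below Lower Benchmark: needs improvement"

    # Example: a score of 85 against an illustrative benchmark range of 60-80
    print(interpret_score(85, lower=60, upper=80))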

Metric Bars

Metric Bars: Horizontal bars show per‑dimension scores against benchmark ranges (shaded bands).

Display: Each metric appears as:

  • Metric Name: Dimension being evaluated (e.g., "Clarity", "Readability")
  • Score: Your score (0-100) for that dimension
  • Visual Bar: Horizontal bar showing score position
  • Benchmark Range: Shaded band showing typical score range
  • Rating Chip: Color-coded rating (Excellent, Good, Fair, Poor)

Common Dimensions:

  • Readability Ease/Grade: How easy the text is to read
  • Clarity: How clear and understandable the content is
  • Cohesion: How well ideas flow together
  • Tone Appropriateness: Whether the tone matches the intended audience
  • Engagement: How engaging and interesting the content is
  • Repetition: Level of unnecessary repetition
  • Lexical Diversity: Variety of vocabulary used
  • Complexity: Whether the complexity level is appropriate for the content and audience
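
The exact formulas behind these dimensions are not documented here; readability-ease scores of this kind are commonly based on classic indices such as Flesch Reading Ease. The sketch below shows that standard formula as general background only, with a naive syllable counter; the dashboard's own calculation may differ:

    import re

    def flesch_reading_ease(text: str) -> float:
        # Standard Flesch formula:
        #   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
        # Higher values mean easier-to-read text.
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        # Naive syllable estimate: count groups of consecutive vowels per word.
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        n_words = max(1, len(words))
        return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

    print(round(flesch_reading_ease("Short sentences are easy to read. Long ones are not."), 1))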

Feedback Section

Feedback: Actionable notes highlighting strengths and areas to improve.

Display: Feedback appears as a list of:

  • Strengths: What the workflow does well
  • Improvements: Specific areas to focus on
  • Recommendations: Suggestions for enhancing output quality

Icons: Feedback items include icons indicating:

  • Success: Strengths and positive aspects
  • Info: Informational notes
  • Warning: Areas needing attention

Running Benchmarks

From a Workflow

To configure benchmarks for a workflow:

  1. Open your AI Workflow
  2. Navigate to the Benchmark section
  3. Choose your benchmark option:
    • Use Default Benchmark: Use the built-in default benchmark that analyzes multiple quality dimensions
    • Edit Benchmark: Create a custom benchmark with your own evaluation prompt
  4. Configure benchmark settings:
    • Run on Completion: Enable to automatically run benchmarks after workflow execution
    • Use Default Prompt: Toggle to use the default or a custom evaluation prompt
  5. Save your workflow

Default Benchmark

What It Does: The default benchmark uses a comprehensive evaluation prompt that produces:

  • Overall quality score
  • Multiple quality dimensions
  • Document type inference
  • Actionable feedback

When to Use: Use the default benchmark for general quality assessment and when you want comprehensive evaluation.

Custom Benchmark

What It Does: Custom benchmarks use your own evaluation prompt to assess specific aspects of workflow output.

When to Use: Use custom benchmarks when you need to evaluate:

  • Specific quality criteria
  • Domain-specific requirements
  • Custom evaluation metrics
  • Specialized assessment needs

Creating Custom Benchmarks:

  1. Select Edit Benchmark
  2. Write your evaluation prompt
  3. Use Handlebars variables to reference workflow output, as in the example below
  4. Save the benchmark configuration
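
The variable names available to a custom prompt depend on how your workflow is configured, so the {{output}} placeholder in the sketch below is only illustrative; substitute whichever variable your workflow actually exposes for its final output. A custom evaluation prompt might look like this:

    {{! "output" is a hypothetical variable name used for illustration. }}
    You are reviewing a customer-facing product announcement.
    Evaluate the following output for factual consistency with the source
    material, adherence to the style guide, and suitability for an
    executive audience.

    Output to evaluate:
    {{output}}

    Return an overall score from 0 to 100, a score for each criterion,
    and two or three specific recommendations for improvement.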

Run on Completion

Run on Completion: Enable this option to automatically evaluate workflow output after each execution.

Benefits:

  • Automatic quality tracking
  • Immediate feedback on output quality
  • Quality monitoring over time
  • No manual benchmark execution needed

Configuration: Check the "Run on completion" checkbox in the Benchmark section.

From a Job

To run benchmarks for a specific workflow job:

  1. Open the Workflow Job you want to evaluate
  2. Navigate to the Benchmark Output section
  3. Review benchmark results if already run
  4. Run benchmark if not yet executed

Use Cases:

  • Evaluate specific job outputs
  • Compare quality across different jobs
  • Test workflow improvements
  • Quality assurance for important outputs

Interpreting Results

Score Interpretation

Scores Near or Above the Upper Benchmark: Strong performance; quality meets or exceeds typical standards.

Action: Continue with current workflow configuration. Consider this as a quality baseline.

Mid-Range Scores: Acceptable quality; the output is good but has room for improvement.

Action: Review feedback for specific improvement areas. Consider refining prompts or variables.

Scores Below the Lower Benchmark: Needs improvement; these are the areas to prioritize for enhancement.

Action: Focus on low-scoring dimensions. Revise prompts, adjust variables, or modify output templates.

Metric Analysis

Individual Metrics: Review each metric bar to understand:

  • Which dimensions perform well
  • Which dimensions need improvement
  • Relative strengths and weaknesses

Benchmark Comparison: Compare your scores to benchmark ranges to understand:

  • How your output compares to typical quality
  • Whether improvements are needed
  • If quality meets expectations

Feedback Utilization

Strengths: Identify what your workflow does well to:

  • Maintain successful aspects
  • Understand what works
  • Build on strengths

Improvements: Focus on feedback recommendations to:

  • Address specific quality issues
  • Enhance weak areas
  • Refine workflow configuration

Using Benchmarks for Iteration

Iterative Improvement Process

  1. Run Benchmark: Execute workflow and review benchmark results
  2. Identify Issues: Review low-scoring metrics and feedback
  3. Refine Workflow: Update prompts, variables, or output templates
  4. Re-run: Execute workflow again with improvements
  5. Compare: Compare new benchmark results to previous results
  6. Iterate: Repeat until quality meets your standards
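
The product shows benchmark results per job; if you also want a lightweight record of your own for step 5, appending each run's headline scores to a file is enough. The file name, column choice, and scores in this sketch are purely illustrative:

    import csv
    from datetime import date

    def log_benchmark_run(path, workflow_version, overall, clarity, readability):
        # Append one row per benchmark run so successive iterations are easy to compare.
        # Record whichever dashboard metrics you care about; these columns are arbitrary.
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([date.today().isoformat(), workflow_version,
                                    overall, clarity, readability])

    # Illustrative values only; copy the real scores from the benchmark dashboard.
    log_benchmark_run("benchmark_history.csv", "v2-shorter-prompt", 78, 81, 74)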

Refining Based on Benchmarks

Prompt Refinement: Use feedback to improve workflow prompts:

  • Add clarity to instructions
  • Specify tone requirements
  • Include quality criteria

Variable Adjustment: Modify variables based on benchmark insights:

  • Adjust content parameters
  • Refine context variables
  • Optimize input data

Output Template Updates: Enhance output templates:

  • Improve structure
  • Add formatting requirements
  • Specify style guidelines

Best Practices

Regular Benchmarking

Consistent Evaluation: Run benchmarks regularly to track quality over time

Quality Monitoring: Monitor benchmark scores to detect quality degradation

Improvement Tracking: Track score improvements to measure workflow enhancements

Benchmark Configuration

Appropriate Benchmarks: Choose default or custom benchmarks based on your needs

Run on Completion: Enable automatic benchmarking for continuous quality monitoring

Custom Prompts: Create custom benchmarks for specialized evaluation requirements

Result Analysis

Comprehensive Review: Review all metrics, not just overall score

Feedback Focus: Pay attention to actionable feedback recommendations

Comparative Analysis: Compare benchmarks across different workflow versions

Cost Estimation

Before Scaling: Run benchmarks to gauge quality and estimate cost before scaling to scheduled jobs

Quality Assurance: Ensure quality meets standards before production use

Resource Planning: Use benchmark insights to plan workflow improvements

Limitations

Benchmark Scope

Output Evaluation: Benchmarks evaluate final workflow output, not intermediate steps

Quality Dimensions: Benchmarks focus on content quality, not functional correctness

Subjective Aspects: Some quality aspects may be subjective and vary by use case

Benchmark Accuracy

Model Dependency: Benchmark accuracy depends on the AI model used for evaluation

Prompt Dependency: Custom benchmark quality depends on evaluation prompt quality

Context Limitations: Benchmarks may not capture all contextual requirements