
The failure rate monitor detects tests based on failure rate over a rolling time window. Unlike pass-on-retry, which looks for a specific pattern on a single commit, the failure rate monitor identifies tests that fail too often over a period of time, even if no individual failure looks like a retry. You can create multiple failure rate monitors with different configurations. This is how you tailor detection to different branches, test volumes, sensitivity levels, and detection types.

Detection Type

Each failure rate monitor has a detection type — either flaky or broken — which controls what status a test receives when the monitor flags it:
  • Flaky monitors catch tests that fail intermittently (e.g., 20–50% failure rate). These are typically caused by timing issues, shared state, or non-deterministic behavior.
  • Broken monitors catch tests that fail consistently at a high rate (e.g., 80%+ failure rate). These usually indicate a real regression — something in the code or environment is genuinely broken and needs a fix.
The detection type is set at creation and cannot be changed afterward. If you need to switch a monitor’s type, create a new monitor with the desired type and disable the old one. This distinction matters because the two problems call for different responses. Flaky tests might be quarantined while you investigate the root cause. Broken tests represent real failures that should be fixed, not hidden.

How It Works

The monitor periodically calculates the failure rate for each test within a time window you define. If the rate meets or exceeds your activation threshold and the test has enough runs to be statistically meaningful, the test is flagged as flaky or broken depending on the monitor’s detection type.

Example

You configure a failure rate monitor with:

| Setting | Value |
| --- | --- |
| Detection type | Flaky |
| Activation threshold | 30% |
| Window | 6 hours |
| Minimum sample size | 50 runs |
| Branches | `main` |
Over the last 6 hours, here’s what the monitor observes:

| Test | Runs | Failures | Failure rate | Meets min sample? | Result |
| --- | --- | --- | --- | --- | --- |
| `test_checkout` | 120 | 42 | 35% | Yes (120 ≥ 50) | Flagged as flaky — rate exceeds 30% threshold |
| `test_signup` | 8 | 3 | 37.5% | No (8 < 50) | Not flagged — insufficient data |

test_checkout is flagged because its 35% failure rate exceeds the 30% threshold and it has enough runs to be statistically meaningful. test_signup has a higher failure rate but is skipped entirely — the monitor needs at least 50 runs before making a call.

Configuration

Detection Type

Choose Flaky or Broken. This determines the status a test receives when the monitor flags it. See Detection Type above for guidance on which to use.

Activation Threshold

The failure rate that triggers detection, expressed as a percentage. A test is flagged when its failure rate meets or exceeds this value within the time window. For flaky monitors, setting this lower (e.g., 10%) catches more intermittent failures but may produce false positives. Setting it higher (e.g., 50%) is more conservative and only flags tests that fail frequently. For broken monitors, a high threshold (e.g., 80–100%) is appropriate — you want to catch tests that are consistently failing, not ones with occasional failures.

Resolution Threshold

The failure rate a test must drop below to be resolved. If not set, it defaults to the activation threshold, meaning a test resolves as soon as its failure rate drops below the activation level. Setting this lower than the activation threshold creates a buffer that prevents tests from flapping between flagged and resolved. For example, if you activate at 30% and resolve at 15%, a test flagged at 30% must improve to below 15% before it’s marked healthy again. A test hovering at 20% failure rate stays flagged rather than flipping back and forth. The gap between activation (30%) and resolution (15%) is the buffer zone. A test with a failure rate in this range keeps its current status: a healthy test won’t be flagged, but a test already flagged won’t be resolved either.

Window Duration

The rolling time window (in hours) over which failure rate is calculated. Only test runs within this window are considered. A shorter window (e.g., 1 hour) reacts quickly to recent failures but may miss patterns that play out over longer periods. A longer window (e.g., 24 hours) smooths out short-term spikes and gives a more stable picture, but takes longer to detect new issues and longer to resolve.

Minimum Sample Size

The minimum number of test runs required within the time window before the monitor will evaluate a test. Tests with fewer runs are skipped entirely. They won’t be flagged or resolved until enough data accumulates. This prevents the monitor from making decisions on insufficient data. A test that ran 3 times with 2 failures is a 66% failure rate, but that’s not enough data to be confident. The right minimum depends on how often a test actually runs on the branches you’re monitoring. To get a sense of run frequency, open the test’s Test History and filter to the branch you care about — this shows how many runs accumulate over any given period. If your tests run hundreds of times per day, a minimum of 50 to 100 is reasonable. If tests only run a few times per day, a lower minimum may be necessary, but lower minimums mean less statistical confidence.

Stale Timeout

How long (in hours) a flagged test can go without any runs before it’s automatically resolved as stale. This clears out tests that have been deleted, renamed, or are no longer part of your test suite. When not set, flagged tests remain in their detected state indefinitely until they run enough times to recover through the normal threshold check. Setting a stale timeout (e.g., 24 hours) keeps abandoned tests from cluttering your test list. A test resolved as stale is no longer being tracked by this monitor. If the test starts running again and exceeds the activation threshold, it will be re-flagged.
Skipped tests count as not being run. If you have a stale timeout configured and a test starts being skipped rather than executed, the monitor will treat it as having no runs and resolve it as stale once the timeout elapses.

Branch Scope

Which branches the monitor evaluates. You can specify up to 10 branch patterns. Only test runs on matching branches are included in the failure rate calculation. Runs across all matching patterns are pooled together — the failure rate is calculated from the combined set of runs, not evaluated per-pattern individually. This means a monitor scoped to main and release/* will look at all runs on any of those branches together when determining the failure rate.

Branch Pattern Syntax

Branch patterns use glob-style matching with two special characters:

| Character | Meaning | Regex equivalent |
| --- | --- | --- |
| `*` | Zero or more of any character, including `/` | `.*` |
| `?` | Exactly one of any character | `.` |

All other characters are matched literally. Special regex characters (like ., +, (, ), [, ]) are treated as literal characters in patterns, not as regex operators. You don’t need to escape them.
Unlike some glob implementations, * matches across / separators. The pattern feature/* matches both feature/login and feature/api/auth.
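Under the rules above, a branch pattern can be translated to a regex roughly as follows. This is an illustrative sketch, not the product's matcher; `branch_matches` is a hypothetical helper:

```python
import re

def branch_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a branch pattern: * -> .*, ? -> ., everything else literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")          # matches across "/" separators
        elif ch == "?":
            parts.append(".")           # exactly one character
        else:
            parts.append(re.escape(ch))  # regex metacharacters stay literal
    return re.compile("".join(parts))

def branch_matches(pattern: str, branch: str) -> bool:
    # fullmatch: the pattern must cover the entire branch name.
    return branch_pattern_to_regex(pattern).fullmatch(branch) is not None
```

Note that because `re.escape` is applied to every literal character, a pattern like `release-?.?.?` treats its dots literally while each `?` still matches exactly one character.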

Pattern Examples

| Pattern | Matches | Does not match |
| --- | --- | --- |
| `main` | `main` | `main-v2`, `maint` |
| `feature/*` | `feature/login`, `feature/api/auth` | `feature` (no trailing path), `features/x` |
| `release-?.?.?` | `release-1.2.3` | `release-10.2.3` (10 is two characters), `release-1.2` |
| `*-hotfix` | `prod-hotfix`, `release/v1-hotfix` | `hotfix`, `hotfix-1` |
| `*` | All branches | |

A pattern with no special characters matches that exact branch name only. For example, main matches the branch named main and nothing else.

Stable Branch Patterns

For your main or stable branch, use the exact branch name:

| Your stable branch | Pattern |
| --- | --- |
| main | `main` |
| master | `master` |
| develop | `develop` |

Merge Queue Branch Patterns

If you use a merge queue, your queue creates temporary branches to test changes before merging. Each merge queue product uses a different branch naming convention:

| Merge queue | Branch pattern | Example branches matched |
| --- | --- | --- |
| Trunk Merge Queue | `trunk-merge/*` | `trunk-merge/main/1`, `trunk-merge/main/2` |
| GitHub Merge Queue | `gh-readonly-queue/*` | `gh-readonly-queue/main/pr-123-abc` |
| Graphite Merge Queue | `graphite-merge/*` | `graphite-merge/main/1` |

GitLab Merge Trains run on the target branch directly rather than creating separate branches. To monitor merge train runs, scope your monitor to the target branch (e.g., main).

Tips for Branch Scoping

  • You can add up to 10 patterns per monitor. A test run is included if its branch matches any of the patterns.
  • Since patterns can’t express “everything except a branch,” a practical approach is to create separate monitors: one scoped to main with strict settings, and another scoped to your PR branch naming patterns (e.g., feature/*, fix/*) with more lenient settings.
  • ** is treated as two consecutive * wildcards, which is functionally identical to a single *. There is no special multi-segment matching behavior.

Resolution Behavior

A flagged test resolves in one of two ways:
  • Healthy recovery: The test’s failure rate drops below the resolution threshold (or the activation threshold, if no resolution threshold is set) and it still has enough runs to meet the minimum sample size. This means the test is actively running and has improved.
  • Stale recovery: If a stale timeout is configured and the test has no runs on matching branches within that period, it resolves as stale. This is an automatic cleanup mechanism, not an indication that the test has improved.
Tests that are still running but haven’t accumulated enough runs to meet the minimum sample size remain in their current state. They won’t be resolved until there’s enough data to make a determination.

Muting

You can temporarily mute a failure rate monitor for a specific test case. See Muting monitors for details.

Recommended Setups

A common setup is to pair two failure rate monitors — one to catch broken tests quickly and one to catch flaky tests over a longer window:

| Monitor | Detection type | Activation threshold | Window | Purpose |
| --- | --- | --- | --- | --- |
| Broken on main | Broken | 80–100% | 1–6 hours | Catch tests that are reliably failing — real regressions that need immediate attention |
| Flaky on main | Flaky | 20–50% | 12–72 hours | Catch intermittently failing tests — candidates for investigation or quarantine |

You can create as many monitors as you need. For example, you might want separate monitors for your main branch and pull request branches, or different thresholds for different levels of severity. The following sections provide starting points for common scenarios.

Choosing a window: The window duration should match how often tests run on the branches you’re monitoring. A window needs enough runs to reach the minimum sample size before it can flag anything. If tests run infrequently, a longer window is necessary to accumulate enough data. A narrower window reacts more quickly — spikes of failures roll off faster, and tests recover to healthy more quickly once the underlying problem is resolved.

Main Branch: Catch Flakiness Early

Failures on your stable branch are a strong signal. Tests should be passing before code is merged, so failures here are unexpected and likely indicate flakiness.

| Setting | Suggested value | Why |
| --- | --- | --- |
| Activation threshold | 10 to 20% | Low threshold catches subtle flakiness early |
| Resolution threshold | 5 to 10% | Requires clear improvement before resolving |
| Window | 6 to 24 hours | Long enough to accumulate data, short enough to catch new issues |
| Min sample size | 20 to 50 | Depends on how often your tests run on main |
| Branches | `main` (or `master`, `develop`, etc.) | Use the exact name of your stable branch |

Pull Requests: Catch Broken Tests

On PR branches, tests are expected to fail — that’s part of active development. Analyzing failure rate for flakiness on PRs is generally not productive because a new failing test is likely caused by the code change under review, not non-deterministic behavior. Pass-on-retry already handles real flakiness on PRs: if a test fails and then passes on retry within the same commit, it will be detected regardless of branch. If you do want a failure rate monitor on PRs, scope it to catch broken tests rather than flaky ones — tests that are consistently failing at a high rate across many PRs, which may indicate a persistent regression or a broken test environment.

| Setting | Suggested value | Why |
| --- | --- | --- |
| Detection type | Broken | Focus on consistently failing tests, not intermittent ones |
| Activation threshold | 70 to 90% | High threshold distinguishes real breakage from expected development failures |
| Resolution threshold | 40 to 50% | Wide buffer prevents flapping |
| Window | 12 to 24 hours | Longer window smooths out short-lived development failures |
| Min sample size | 30 to 100 | Higher minimum avoids flagging tests that only ran a few times on PRs |
| Branches | `feature/*`, `fix/*`, `dependabot/*` | Match your team’s PR branch naming conventions |

Since branch patterns can’t express “everything except main,” create one monitor scoped to main with strict settings and a second monitor scoped to your PR branch naming patterns with more lenient settings.

Merge Queue: Strict Monitoring

Merge queue branches test code that has already passed PR checks. Failures here are suspicious. If you use a merge queue, consider a dedicated monitor with settings similar to or stricter than your main branch monitor. When sizing your window and minimum sample size, consider how many PRs your repo merges per day. For example, if your team merges 10 PRs per day, a 12-hour window will accumulate roughly 5 merge queue runs — setting a minimum sample size of 10 would mean the rule never has enough data to evaluate. Match your minimum sample size to a realistic run count within your chosen window.
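The sizing arithmetic above generalizes to a quick estimate. This is a back-of-the-envelope sketch with a hypothetical helper name, assuming one merge queue run per merged PR, spread evenly across the day:

```python
def expected_runs_in_window(merges_per_day: float, window_hours: float) -> float:
    """Rough count of merge queue runs that accumulate within a rolling window."""
    return merges_per_day * window_hours / 24

# 10 merges/day with a 12-hour window yields roughly 5 runs,
# so a minimum sample size of 10 would never be met.
```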

| Setting | Suggested value | Why |
| --- | --- | --- |
| Activation threshold | 10 to 15% | Low threshold, failures here are unexpected |
| Resolution threshold | 5% | Strict recovery |
| Window | 6 to 12 hours | Shorter window for faster detection |
| Min sample size | 5 to 15 | Size to how many merge queue runs accumulate in your window |
| Branches | `trunk-merge/*` or `gh-readonly-queue/*` | Use the pattern for your merge queue provider (see table above) |

Common branch patterns for merge queues:

| Merge queue | Branch pattern |
| --- | --- |
| Trunk Merge Queue | `trunk-merge/*` |
| GitHub Merge Queue | `gh-readonly-queue/*` |

Other Patterns

  • Release branches: A monitor scoped to release/* with strict thresholds catches flakiness before it ships.
  • Nightly or scheduled builds: If you run comprehensive test suites on a schedule, a monitor with a longer window and higher minimum sample size can catch slow-burn flakiness that doesn’t show up in faster CI runs.