March 3, 2026 · 8 min read · loadtest.qa

Automated Load Testing in CI/CD: k6 and GitHub Actions Setup Guide

Complete guide to automated load testing with k6 and GitHub Actions - workflow setup, CI-friendly tests, thresholds as quality gates, and result storage.

Load testing that only runs before major releases is better than no load testing. Load testing that runs automatically on every deployment is dramatically better - it catches performance regressions the moment they are introduced, before they reach production and affect users.

This guide covers the complete setup for automated load testing with k6 and GitHub Actions: workflow configuration, writing CI-friendly test scripts, using thresholds as quality gates, storing and comparing results over time, and advanced patterns for matrix testing and Slack notifications.

Why Load Testing Belongs in CI/CD

The performance regression pattern is predictable: a developer adds a feature that includes a new database query. The query is efficient in isolation. When tested under load with 100 concurrent users, it generates 100 simultaneous database queries that overwhelm the connection pool. p99 latency increases from 200ms to 8 seconds. This regression ships to production undetected because nobody ran a load test on the PR.

With automated load testing in CI/CD:

  • The performance regression is caught in the PR before merge
  • The developer who introduced it can fix it while the context is fresh
  • The fix is verified by re-running the load test before merge

The key insight: performance regression costs increase dramatically over time. Catching it in a PR: 30 minutes of developer time. Catching it in staging: 2 hours. Catching it in production: 2-8 hours of incident response plus customer impact.
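The connection-pool scenario above is easy to reason about with back-of-envelope math. A minimal sketch (the pool size, query time, and concurrency are illustrative assumptions, not measurements):

```python
import math

def worst_case_wait_ms(concurrent_queries: int, pool_size: int, query_ms: float) -> float:
    """Rough worst-case queue wait when all queries arrive at once.

    Queries are served in waves of pool_size; the last wave waits for
    every earlier wave to finish. Ignores scheduling overhead and retries.
    """
    waves = math.ceil(concurrent_queries / pool_size)
    return (waves - 1) * query_ms

# 100 simultaneous queries against a pool of 10 connections, 50ms per query:
# the unluckiest request queues behind 9 full waves before it runs.
print(worst_case_wait_ms(100, 10, 50))  # 450
print(worst_case_wait_ms(10, 10, 50))   # 0 - fits in one wave
```

This is exactly the failure mode a single-user functional test cannot see: at 10 concurrent queries the wait is zero, so the regression only appears under load.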

Basic GitHub Actions Setup

Start with the simplest possible setup and add complexity as needed.

Workflow File

# .github/workflows/load-test.yml
name: Load Test

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]
  workflow_dispatch:  # Allow manual trigger

jobs:
  load-test:
    name: Run API Load Test
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup k6
        uses: grafana/setup-k6-action@v1
        # This installs k6 on the runner

      - name: Run load test
        run: |
          mkdir -p results
          k6 run \
            --vus 50 \
            --duration 5m \
            --out json=results/load-test-results.json \
            tests/load/api-smoke-test.js
        env:
          # k6 reads thresholds from the script's options block, not from
          # environment variables, so they are defined in the test script itself
          BASE_URL: ${{ vars.STAGING_API_URL }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        if: always()  # Upload even if the test fails
        with:
          name: k6-load-test-results
          path: results/
          retention-days: 30

This workflow:

  • Triggers on pushes to main/staging and on PRs to main
  • Runs with 50 virtual users for 5 minutes
  • Fails the CI job if the script's thresholds are violated - here, p95 latency above 500ms or an error rate above 1%
  • Saves results as a workflow artifact for later review

Writing CI-Friendly Test Scripts

Tests written for manual exploration and tests written for CI/CD have different requirements. CI tests need to:

  • Run quickly (5-10 minutes maximum)
  • Produce clear pass/fail signals
  • Generate minimal noise (no flaky results from irrelevant factors)
  • Not require manual cleanup

// tests/load/api-smoke-test.js
// CI-optimized load test - runs in < 10 minutes, clear pass/fail
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics for business-specific thresholds
const authErrors = new Rate('auth_errors');
const apiErrors = new Rate('api_errors');
const checkoutDuration = new Trend('checkout_duration_ms', true);

export const options = {
  // Use scenarios for more precise control in CI
  scenarios: {
    // Smoke test: verify basic functionality
    smoke: {
      executor: 'constant-vus',
      vus: 10,
      duration: '2m',
      tags: { scenario: 'smoke' },
    },
    // Load test: verify performance at target load
    load: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '1m', target: 50 },  // Ramp up
        { duration: '5m', target: 50 },  // Hold
        { duration: '1m', target: 0 },   // Ramp down
      ],
      tags: { scenario: 'load' },
      startTime: '2m',  // Start after smoke test
    },
  },
  thresholds: {
    // These thresholds must pass for CI to succeed
    'http_req_duration': ['p(95)<500', 'p(99)<2000'],
    'http_req_failed': ['rate<0.01'],
    'auth_errors': ['rate<0.001'],
    'api_errors': ['rate<0.01'],
    // Per-scenario thresholds
    'http_req_duration{scenario:smoke}': ['p(99)<1000'],
    'http_req_duration{scenario:load}': ['p(95)<500'],
  },
};

const BASE_URL = __ENV.BASE_URL || 'https://api.staging.example.com';

// Test data - use deterministic selection based on VU number
// to avoid cache effects from always testing the same data
const TEST_EMAILS = Array.from(
  { length: 200 },
  (_, i) => `ci-test-user-${i + 1}@example.com`
);

export default function () {
  const userEmail = TEST_EMAILS[(__VU - 1) % TEST_EMAILS.length];

  group('auth', () => {
    const loginRes = http.post(
      `${BASE_URL}/auth/login`,
      JSON.stringify({ email: userEmail, password: 'CI_Test_Password_123!' }),
      { headers: { 'Content-Type': 'application/json' } }
    );

    const loginOk = check(loginRes, {
      'login status 200': (r) => r.status === 200,
      'login returns token': (r) => r.json('token') !== undefined,
    });

    authErrors.add(!loginOk);

    if (!loginOk) return;

    const token = loginRes.json('token');
    const headers = {
      'Authorization': `Bearer ${token}`,
      'Content-Type': 'application/json',
    };

    sleep(0.5);

    group('core_api', () => {
      // Test key API endpoints
      const endpoints = [
        { url: `${BASE_URL}/dashboard`, name: 'dashboard' },
        { url: `${BASE_URL}/projects`, name: 'projects' },
        { url: `${BASE_URL}/users/me`, name: 'user_profile' },
      ];

      for (const endpoint of endpoints) {
        const res = http.get(endpoint.url, {
          headers,
          tags: { name: endpoint.name },
        });

        const ok = check(res, {
          [`${endpoint.name} status 200`]: (r) => r.status === 200,
          [`${endpoint.name} latency < 500ms`]: (r) => r.timings.duration < 500,
        });

        apiErrors.add(!ok);
        sleep(0.2);
      }
    });
  });

  sleep(1 + Math.random());  // 1-2 second think time
}

Using Thresholds as Quality Gates

Thresholds are k6’s built-in quality gate mechanism. When a threshold fails, k6 exits with a non-zero code, which fails the GitHub Actions job.

Design your thresholds carefully. Too strict and you get false CI failures that block valid PRs. Too loose and regressions slip through.

export const options = {
  thresholds: {
    // Absolute thresholds - must hold at any point in the test
    'http_req_duration': [
      // Primary gate: fail if p95 > 500ms
      { threshold: 'p(95)<500', abortOnFail: true, delayAbortEval: '1m' },
      // Secondary gate: p99 > 2000ms fails the run at the end (non-zero exit),
      // but doesn't abort it mid-test
      'p(99)<2000',
    ],

    // Rate thresholds
    'http_req_failed': [
      { threshold: 'rate<0.01', abortOnFail: true, delayAbortEval: '30s' },
    ],

    // The 'abortOnFail: true' option stops the test immediately if violated
    // 'delayAbortEval' gives the system time to stabilize before aborting
  },
};

Threshold calibration: Pull your production p95 latency from the past 30 days. Set your CI threshold at 120-150% of your measured production p95. This gives headroom for staging performance variation while still catching significant regressions.

Example: Production p95 is 200ms. CI threshold: p(95)<300. This catches any change that makes the API 50% slower but does not fail on normal staging variability.
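That calibration rule is simple enough to automate. A minimal sketch, assuming you can export recent production p95 samples (the numbers below are invented), using the median as a spike-resistant baseline and a 1.5x headroom factor:

```python
def ci_threshold_ms(prod_p95_samples_ms: list[float], headroom: float = 1.5) -> int:
    """Derive a CI latency threshold from recent production p95 samples.

    Uses the median of the samples as the baseline (robust to one-off
    incident days), then applies a headroom multiplier for staging variability.
    """
    samples = sorted(prod_p95_samples_ms)
    median = samples[len(samples) // 2]
    return round(median * headroom)

# A week of production p95 values hovering around 200ms, with one incident-day outlier
daily_p95 = [195, 210, 188, 202, 750, 199, 205]
threshold = ci_threshold_ms(daily_p95)
print(f"'http_req_duration': ['p(95)<{threshold}']")
```

The printed threshold string drops straight into the `thresholds` block of the CI test script.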

Storing and Comparing Results

Storing results enables trend analysis: is performance improving or degrading over time?
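For example, if each run appends one summary line to a history file, a few lines of analysis answer that question. The file format here is an assumption: one JSON object per line with `date` and `p95_ms` keys:

```python
import json

def p95_trend(history_lines: list[str], window: int = 5) -> float:
    """Percent change of the latest run's p95 vs. the average of the
    previous `window` runs. Positive means latency is degrading."""
    runs = [json.loads(line) for line in history_lines if line.strip()]
    latest = runs[-1]['p95_ms']
    prior = [r['p95_ms'] for r in runs[-window - 1:-1]]
    baseline = sum(prior) / len(prior)
    return (latest - baseline) / baseline * 100

history = [
    '{"date": "2026-02-24", "p95_ms": 200}',
    '{"date": "2026-02-25", "p95_ms": 205}',
    '{"date": "2026-02-26", "p95_ms": 198}',
    '{"date": "2026-02-27", "p95_ms": 201}',
    '{"date": "2026-02-28", "p95_ms": 196}',
    '{"date": "2026-03-01", "p95_ms": 260}',
]
print(f"p95 change vs. recent runs: {p95_trend(history):+.1f}%")
```

Comparing against an average of recent runs, rather than only the immediately previous run, smooths out run-to-run noise on shared CI runners.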

Option 1: Artifact Storage (Simple)

The basic setup (uploading JSON results as artifacts) works for small teams. Add a comparison step:

      - name: Download previous results
        # Note: by default actions/download-artifact@v4 only sees artifacts
        # from the current workflow run; to pull a baseline from an earlier
        # run, pass the run-id and github-token inputs as well
        uses: actions/download-artifact@v4
        continue-on-error: true  # OK if no previous results exist
        with:
          name: k6-baseline-results
          path: baseline/

      - name: Compare with baseline
        if: hashFiles('baseline/load-test-results.json') != ''
        run: |
          python3 scripts/compare-load-results.py \
            --baseline baseline/load-test-results.json \
            --current results/load-test-results.json \
            --threshold 20  # Fail if any metric degrades more than 20%

Create scripts/compare-load-results.py:

#!/usr/bin/env python3
import json
import sys
import argparse

def load_results(filepath):
    """Parse k6 JSON output and extract key metrics."""
    metrics = {}
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines in the NDJSON output
            data = json.loads(line)
            if data.get('type') == 'Point':
                metric_name = data['metric']
                metrics.setdefault(metric_name, []).append(data['data']['value'])
    return metrics

def percentile(values, pct):
    sorted_vals = sorted(values)
    index = int(len(sorted_vals) * pct / 100)
    return sorted_vals[min(index, len(sorted_vals) - 1)]

def compare(baseline_path, current_path, threshold_pct):
    baseline = load_results(baseline_path)
    current = load_results(current_path)

    key_metric = 'http_req_duration'
    if key_metric not in baseline or key_metric not in current:
        print("Could not find http_req_duration in results")
        return True  # Don't fail if metric not found

    baseline_p95 = percentile(baseline[key_metric], 95)
    current_p95 = percentile(current[key_metric], 95)

    change_pct = ((current_p95 - baseline_p95) / baseline_p95) * 100

    print(f"Baseline p95: {baseline_p95:.1f}ms")
    print(f"Current p95:  {current_p95:.1f}ms")
    print(f"Change:       {change_pct:+.1f}%")

    if change_pct > threshold_pct:
        print(f"FAIL: p95 latency increased by {change_pct:.1f}%, threshold is {threshold_pct}%")
        return False

    print(f"PASS: Performance within threshold ({threshold_pct}% allowed)")
    return True

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--baseline', required=True)
    parser.add_argument('--current', required=True)
    parser.add_argument('--threshold', type=float, default=20)
    args = parser.parse_args()

    if not compare(args.baseline, args.current, args.threshold):
        sys.exit(1)
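To sanity-check compare-load-results.py without waiting for a real k6 run, you can generate synthetic input in the same NDJSON shape k6's JSON output uses (Point records carrying a metric name and a data.value):

```python
import json
import random

def write_fake_results(path: str, mean_ms: float, n: int = 1000, seed: int = 42) -> None:
    """Write k6-style NDJSON Points for http_req_duration centred on mean_ms."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    with open(path, 'w') as f:
        for _ in range(n):
            point = {
                'type': 'Point',
                'metric': 'http_req_duration',
                'data': {'value': max(1.0, rng.gauss(mean_ms, mean_ms * 0.1))},
            }
            f.write(json.dumps(point) + '\n')

write_fake_results('baseline.json', mean_ms=200)   # healthy baseline
write_fake_results('current.json', mean_ms=260)    # ~30% slower
# python3 scripts/compare-load-results.py \
#   --baseline baseline.json --current current.json --threshold 20
```

With one file centred on 200ms and the other on 260ms, the comparison script should flag the regression at the default 20% threshold.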

Option 2: k6 Cloud (Managed)

k6 Cloud stores results automatically and provides built-in comparison between test runs:

      - name: Run load test (k6 Cloud)
        run: k6 cloud tests/load/api-smoke-test.js
        env:
          K6_CLOUD_TOKEN: ${{ secrets.K6_CLOUD_TOKEN }}
          K6_CLOUD_PROJECT_ID: ${{ vars.K6_CLOUD_PROJECT_ID }}

Advanced Patterns

Matrix Testing (Multiple Environments)

jobs:
  load-test:
    strategy:
      matrix:
        # GitHub variable names may contain only alphanumerics and
        # underscores, so avoid hyphens in environment names here
        environment: [staging, production_canary]
        test-script: [api-smoke-test, checkout-flow-test]

    steps:
      - name: Run load test
        run: |
          k6 run tests/load/${{ matrix.test-script }}.js
        env:
          BASE_URL: ${{ vars[format('{0}_API_URL', matrix.environment)] }}

Slack Notification on Failure

      - name: Notify Slack on failure
        if: failure()
        uses: slackapi/slack-github-action@v1.27.0
        with:
          payload: |
            {
              "text": "Load test failed on ${{ github.ref_name }}",
              "attachments": [{
                "color": "danger",
                "fields": [
                  {
                    "title": "Repository",
                    "value": "${{ github.repository }}",
                    "short": true
                  },
                  {
                    "title": "Branch",
                    "value": "${{ github.ref_name }}",
                    "short": true
                  },
                  {
                    "title": "Triggered by",
                    "value": "${{ github.actor }}",
                    "short": true
                  },
                  {
                    "title": "Run",
                    "value": "<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>",
                    "short": true
                  }
                ]
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_PERFORMANCE }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

Scheduled Long-Running Tests

CI load tests run on every push and must be short. Longer soak tests and breakpoint tests should run on a schedule:

# .github/workflows/weekly-load-test.yml
name: Weekly Full Load Test

on:
  schedule:
    - cron: '0 2 * * 1'  # Every Monday at 2am UTC
  workflow_dispatch:

jobs:
  soak-test:
    runs-on: ubuntu-latest
    timeout-minutes: 120

    steps:
      - uses: actions/checkout@v4

      - name: Setup k6
        uses: grafana/setup-k6-action@v1

      - name: Run 1-hour soak test
        run: |
          mkdir -p results
          k6 run \
            --vus 50 \
            --duration 1h \
            --out json=results/soak-results.json \
            tests/load/soak-test.js
        env:
          BASE_URL: ${{ vars.STAGING_API_URL }}

      - name: Run breakpoint test
        if: success()
        run: k6 run tests/load/breakpoint-test.js
        env:
          BASE_URL: ${{ vars.STAGING_API_URL }}

GitLab CI Equivalent

For teams using GitLab:

# .gitlab-ci.yml
load-test:
  stage: performance
  image:
    name: grafana/k6:latest
    entrypoint: ['']
  script:
    - mkdir -p results
    - k6 run
        --vus 50
        --duration 5m
        --out json=results/load-test-results.json
        tests/load/api-smoke-test.js
  variables:
    BASE_URL: $STAGING_API_URL
  artifacts:
    paths:
      - results/
    expire_in: 30 days
    when: always
  only:
    - main
    - merge_requests

Automated load testing in CI/CD is the most reliable way to maintain performance standards as your codebase evolves. The setup investment is 2-4 hours. The ongoing return is catching performance regressions before they reach production. Our load testing automation service implements this setup for your specific CI/CD platform and test requirements within one week.

Know Your Scaling Ceiling

Book a free 30-minute capacity scope call with our load testing engineers. We review your architecture, traffic expectations, and upcoming scaling events — and scope the load test that will give you the data you need.

Talk to an Expert