Load Testing: The Complete Guide for Engineering Teams Shipping to Production
Everything you need to know about load testing - methodology, tool selection, metrics, common mistakes, and CI/CD integration for modern engineering teams.
Load testing is the practice of applying simulated traffic to a system to understand how it behaves under expected and peak conditions. It is the difference between discovering that your checkout flow breaks at 500 concurrent users in the middle of a marketing campaign and discovering it in a controlled test on a routine Tuesday afternoon.
Most engineering teams understand that load testing is important. Far fewer actually do it systematically. The reasons are predictable: it requires dedicated time, the tooling has a learning curve, and the results are only valuable if someone acts on them. This guide covers everything needed to build a load testing practice that produces reliable results and drives engineering improvements.
Understanding the Types of Performance Tests
Teams often use “load testing” as a catch-all term, but there are distinct test types for distinct questions.
| Test Type | Question Answered | Duration | Load Pattern |
|---|---|---|---|
| Load test | How does the system behave under expected traffic? | 30-60 min | Steady at expected peak |
| Stress test | At what point does the system fail? | 1-2 hours | Ramp up until failure |
| Soak test | Does the system degrade over extended periods? | 4-24 hours | Steady at moderate load |
| Spike test | How does the system handle sudden traffic surges? | 15-30 min | Instant jump to high load |
| Breakpoint test | What is the exact capacity limit? | 2-4 hours | Step increases with holds |
| Volume test | How does the system perform with large data? | Variable | Normal load with large datasets |
Most teams should run load tests before every significant release, stress tests quarterly or before major events, and soak tests monthly to catch memory leaks and resource exhaustion.
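In k6 (the tool this guide recommends below), these load patterns are expressed as `stages`: a sequence of ramp targets for virtual users over time. A sketch of the table above as stage profiles, with illustrative durations and VU counts rather than prescriptions:

```javascript
// Illustrative k6-style stage profiles for the test types above.
// Each entry ramps the virtual-user count to `target` over `duration`.
// All numbers are examples; calibrate them to your own traffic.
const profiles = {
  // Load test: ramp to expected peak, hold, ramp down.
  load: [
    { duration: '5m', target: 200 },
    { duration: '45m', target: 200 },
    { duration: '5m', target: 0 },
  ],
  // Stress test: keep stepping up until the system fails.
  stress: [
    { duration: '10m', target: 200 },
    { duration: '10m', target: 400 },
    { duration: '10m', target: 800 },
    { duration: '10m', target: 1600 },
  ],
  // Soak test: moderate load held for hours to surface leaks.
  soak: [
    { duration: '5m', target: 100 },
    { duration: '8h', target: 100 },
  ],
  // Spike test: near-instant jump, short hold, instant drop.
  spike: [
    { duration: '30s', target: 1000 },
    { duration: '10m', target: 1000 },
    { duration: '30s', target: 0 },
  ],
};
```

In a k6 script, one of these arrays would be assigned to `options.stages`; the shape of the ramp is what distinguishes the test types, not the tooling.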
Why Engineering Teams Skip Load Testing
Understanding why teams skip load testing helps address the real barriers.
“We don’t have time.” Load testing often gets cut when release pressure increases. The irony: a 3-hour production incident consuming five engineers’ time costs more than running a 30-minute load test that would have prevented it.
“Our staging environment is too small.” Staging environments are typically 20-50% of production capacity. Rather than using this as an excuse not to test, treat it as a calibration factor: drive staging at the same fraction of expected production load. A staging environment at 50% of production capacity, tested at 50% of expected peak traffic, gives a usable (if conservative) prediction of production behavior.
“We don’t have realistic traffic patterns.” You don’t need exact replay of production traffic for useful load testing. A representative mix of your primary user journeys at realistic proportions is sufficient.
“We ran one test and it passed.” A single load test at a point in time is a snapshot. Systems change with every deployment. Load testing produces value only when done regularly.
The Seven-Step Load Testing Process
Step 1: Define Objectives
Before writing a single line of test code, write down the questions you are trying to answer. Examples:
- “Will the checkout flow handle 200 concurrent users during the Black Friday campaign?”
- “What is our current maximum throughput for the API before p99 latency exceeds 1 second?”
- “Does the system recover within 2 minutes after a 5x traffic spike?”
Objectives determine what you test, what load profile you apply, and what metrics indicate success or failure.
Step 2: Identify User Journeys
Do not test individual endpoints in isolation. Real users follow journeys: browse products, search, view product detail, add to cart, checkout. Identify the 3-5 most common user journeys and test them as sequences.
For a SaaS application, typical journeys:
- Authentication: login, get user profile, update preferences
- Core workflow: create resource, read resource, update resource, delete resource
- Search and browse: search query, filter results, view detail
- Reporting: load dashboard, query data, export results
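One way to encode journeys is as ordered step lists that a load script replays in sequence, one journey per virtual-user iteration. A minimal sketch; the endpoint paths are hypothetical placeholders, not a real API:

```javascript
// Each journey is an ordered sequence of requests that a virtual
// user replays in order. Paths and placeholders are illustrative.
const journeys = {
  authentication: [
    { method: 'POST', path: '/api/login' },
    { method: 'GET', path: '/api/me' },
    { method: 'PUT', path: '/api/me/preferences' },
  ],
  coreWorkflow: [
    { method: 'POST', path: '/api/resources' },
    { method: 'GET', path: '/api/resources/{id}' },
    { method: 'PUT', path: '/api/resources/{id}' },
    { method: 'DELETE', path: '/api/resources/{id}' },
  ],
  searchAndBrowse: [
    { method: 'GET', path: '/api/search?q={term}' },
    { method: 'GET', path: '/api/search?q={term}&filter={f}' },
    { method: 'GET', path: '/api/items/{id}' },
  ],
};
```

Keeping journeys as data like this makes it easy to review the traffic model with the team before anyone argues about results.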
Step 3: Understand Your Production Traffic
Pull traffic data from your analytics or APM tool:
- Peak concurrent users or requests per second
- Ratio of read to write operations
- Most frequently accessed endpoints
- Geographic distribution of users
- Session duration and page depth
This data shapes your test scenarios. If 70% of your production traffic is reading data and 30% is writing, your test should match that ratio.
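Holding a script to a measured ratio is a one-line weighted choice per iteration. A sketch, with the 70/30 split as the example ratio from above (`rand` is passed in rather than drawn inside the function so the choice is deterministic and testable):

```javascript
// Pick 'read' or 'write' so the long-run mix matches production.
// rand is a number in [0, 1), e.g. Math.random() in a real script.
function pickOperation(readRatio, rand) {
  return rand < readRatio ? 'read' : 'write';
}

// Example: a 70/30 read/write mix, decided once per iteration.
const op = pickOperation(0.7, Math.random());
```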
Step 4: Set Acceptance Criteria
Write down what “pass” looks like before running the test. Without pre-defined criteria, teams argue about results after the fact.
Example criteria:
- p95 response time < 500ms at 200 concurrent users
- p99 response time < 2000ms at 200 concurrent users
- Error rate < 0.5% throughout the test
- Throughput > 1,000 requests/minute
- System recovers to baseline within 5 minutes of load removal
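In k6, criteria like these can be encoded as `thresholds` so the run fails automatically when one is breached, instead of relying on someone reading the report. A sketch mapping the example criteria above (the metric names are k6's built-ins; the throughput bound is 1,000 requests/minute expressed as roughly 16.7 requests/second):

```javascript
// k6-style thresholds mirroring the example acceptance criteria.
// A breached threshold makes the run exit non-zero, which fails CI.
const thresholds = {
  http_req_duration: ['p(95)<500', 'p(99)<2000'], // latency criteria (ms)
  http_req_failed: ['rate<0.005'],                // error rate < 0.5%
  http_reqs: ['rate>16.7'],                       // > 1,000 requests/minute
};
```

The recovery-to-baseline criterion does not map to a single threshold; it is checked by observing metrics after the load is removed.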
Step 5: Build the Test Script
Choose your tool (see Tool Decision Matrix below) and build a realistic test script. The most common mistake is testing only the happy path with valid, cached data. Test with diverse data: multiple user accounts, different product IDs, varying search terms.
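A common pattern for diverse data is to cycle each iteration through pools of test records instead of reusing one. A minimal sketch; the pool contents are hypothetical:

```javascript
// Rotate through data pools so consecutive iterations hit different
// users, products, and search terms. Pool contents are illustrative.
const users = ['alice', 'bob', 'carol', 'dave'];
const productIds = [101, 202, 303, 404, 505];
const searchTerms = ['laptop', 'desk', 'monitor'];

// Returns the data set for a given iteration number.
function testDataFor(iteration) {
  return {
    user: users[iteration % users.length],
    productId: productIds[iteration % productIds.length],
    term: searchTerms[iteration % searchTerms.length],
  };
}
```

Because the pools have different lengths, the combinations do not repeat in lockstep, which keeps caches and database access patterns closer to reality.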
Step 6: Run and Observe
Run the test and observe actively. Do not walk away and return to read results. Watch your metrics during the test:
- Is latency stable or climbing throughout the test?
- At what user count did latency first degrade?
- Are errors occurring, and what type?
- Are any resources (CPU, memory, connections) approaching limits?
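The first question, "is latency stable or climbing," can be reduced to a simple heuristic while watching a test: compare the mean of the first and second halves of the samples collected so far. A sketch (the 20% tolerance is an assumed default, not a standard):

```javascript
// Returns true when mean latency in the second half of the samples
// exceeds the first half by more than `tolerance` (a ratio).
function isLatencyClimbing(samplesMs, tolerance = 0.2) {
  const mid = Math.floor(samplesMs.length / 2);
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const firstHalf = mean(samplesMs.slice(0, mid));
  const secondHalf = mean(samplesMs.slice(mid));
  return secondHalf > firstHalf * (1 + tolerance);
}
```

Steadily climbing latency under steady load usually points at a queue building up somewhere: connections, threads, or garbage collection.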
Step 7: Analyze and Act
A load test that produces no action items is wasted effort. Analyze results against your acceptance criteria and create engineering tasks for every failure:
- Failed p95 latency: identify the slow queries or operations and optimize
- Failed error rate: identify error types and fix the root cause
- Failed throughput: identify the bottleneck (CPU, connections, database)
Tool Decision Matrix
| Tool | Language | Learning Curve | Distributed | Free | Best For |
|---|---|---|---|---|---|
| k6 | JavaScript | Low | Yes (k6 Cloud) | Open source | Modern teams, CI/CD integration |
| Locust | Python | Low | Yes (workers) | Open source | Python teams, complex scenarios |
| Gatling | Scala/Java | Medium | Yes | Open source | High-throughput, JVM teams |
| Artillery | JavaScript/YAML | Low | Limited | Open source | Simple APIs, quick tests |
| JMeter | XML/GUI | High | Yes | Open source | Legacy, enterprise, existing expertise |
| k6 Cloud | JavaScript | Low | Native | Commercial | Managed, advanced reporting |
Recommended default: k6. It uses JavaScript (most teams are familiar), has excellent documentation, integrates cleanly with GitHub Actions, supports distributed load generation, and has no commercial lock-in. The scripting model is clean and the output metrics are clear.
Choose Locust if: Your team is primarily Python and you need complex scenario logic that is easier to express in Python than JavaScript.
Avoid JMeter for new projects. Its XML-based test format is difficult to version-control, the GUI-first workflow is awkward in CI/CD, and the JavaScript/Python alternatives are strictly superior for new implementations.
Metrics That Matter
The most common load testing mistake is focusing on average response time. Averages obscure the user experience of tail-end users.
The right latency metrics:
- p50 (median): Half of users experience latency below this value. A good baseline.
- p95: 95% of users experience latency below this value. Use as your primary SLO metric.
- p99: 99% of users experience latency below this value. Represents the worst 1% of experiences.
- p99.9: 99.9% of users experience latency below this value. Important for high-volume services, where the remaining 0.1% can still be a large number of users.
Example: Average = 50ms, p95 = 500ms, p99 = 3000ms. The average looks great. But 1% of users are waiting 3 seconds - at 10 million requests/day, that’s 100,000 terrible experiences per day.
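Percentiles come straight from the sorted latency samples. A sketch using the nearest-rank method (other interpolation methods exist and differ slightly at small sample sizes):

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p% of the sorted samples are at or below it.
function percentile(latenciesMs, p) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// 100 synthetic samples: 1ms, 2ms, ..., 100ms.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);
```

On this synthetic data the mean and median coincide, but on real latency distributions (which are heavily right-skewed) the tail percentiles sit far above the average, which is exactly why the average misleads.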
Other essential metrics:
- Throughput: Requests per second. The volume your system can handle.
- Error rate: Percentage of requests returning errors. Should be under 1% at any load level.
- Concurrent virtual users (VUs): How many simulated users are active simultaneously.
- Connection errors: Timeouts and connection failures, which indicate resource exhaustion.
Seven Load Testing Mistakes
Mistake 1: Testing from a single machine. A single test machine hits its own network, CPU, or connection limits before the system under test does. Use distributed load generation for tests above 500 VUs.
Mistake 2: No think time between requests. Real users pause between page loads. Scripts that fire requests as fast as possible generate artificial load patterns that do not represent reality.
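In k6, think time is a call to `sleep(seconds)` between steps; the randomized pause itself is plain arithmetic. A sketch (`rand` can be injected so the value is testable; in a real script it defaults to `Math.random()`):

```javascript
// Seconds to pause between journey steps: uniform in [minSec, maxSec).
// In a k6 script, the result would be passed to sleep().
function thinkTime(minSec, maxSec, rand = Math.random()) {
  return minSec + rand * (maxSec - minSec);
}

// Example: pause somewhere between 1 and 3 seconds.
const pauseSec = thinkTime(1, 3);
```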
Mistake 3: Testing only happy paths. Production traffic includes invalid requests, authentication failures, and edge cases. Test the full mix.
Mistake 4: Using unrealistic data. Testing with the same user ID, product ID, or search term repeatedly does not represent real traffic (caching, database behavior, and contention patterns all differ with repeated data).
Mistake 5: Ignoring the ramp-up period. Starting a test at full load does not give the system time to warm up (JVM JIT, database plan cache, application caches). Use a realistic ramp-up that mirrors how traffic builds in reality.
Mistake 6: Testing in the wrong environment. Staging environments that are misconfigured relative to production produce results that do not translate. Verify that staging matches production configuration (even if smaller).
Mistake 7: Running tests nobody acts on. The entire value of load testing is the engineering improvements it drives. A team that runs load tests and files “interesting” findings without creating and closing action items will not see performance improvements.
CI/CD Integration
Running load tests in CI/CD catches performance regressions the moment they are introduced, not weeks later when a customer complains.
A basic GitHub Actions setup with k6:
```yaml
name: Load Test
on:
  push:
    branches: [main]
  workflow_dispatch:
jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run k6 load test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/load/api-load-test.js
        env:
          BASE_URL: ${{ vars.STAGING_URL }}
          K6_THRESHOLDS_HTTP_REQ_DURATION: "p(95)<500"
          K6_THRESHOLDS_HTTP_REQ_FAILED: "rate<0.01"
```
For CI/CD load tests, keep the test duration short (5-10 minutes) and focused. The goal is catching regressions, not full capacity testing. Full capacity tests run on a schedule or before major releases.
Load testing is most valuable as a continuous practice, not a pre-launch ritual. Our load testing setup service gets your team from zero to automated load testing in your CI/CD pipeline within one week.
Know Your Scaling Ceiling
Book a free 30-minute capacity scope call with our load testing engineers. We review your architecture, traffic expectations, and upcoming scaling events — and scope the load test that will give you the data you need.
Talk to an Expert