Migrating to Visual Regression Testing with Playwright and S3

1. Introduction

We run 15+ front-end apps of varying complexity, and keeping them visually consistent matters to us, on top of the other quality checks we perform. After letting go of our previous SaaS visual regression tool, we needed something simple, cost-effective, and easy to integrate.

My brilliant colleague Arash came up with the idea to switch to Playwright for visual regression, backed by AWS S3 for storing screenshots.

While there’s plenty of material about using Playwright for visual testing, much less is written on combining it with cloud storage. This post documents our PoC and the reasoning behind it.

2. Why Store Screenshots in the Cloud?

Storing screenshots in Git might seem straightforward, but it’s messy:

Bloated Repo: Constantly updated binary files quickly bloat the repository.
Frequent Conflicts: Multiple devs updating screenshots leads to merge conflicts.
Hard-to-Track Changes: Figuring out which screenshot belongs to which branch or run becomes a chore.

By using cloud storage (like S3), we avoid these headaches:

Scalability: No repo bloat, and large screenshot sets are no problem.
Centralized Access: A single location for all snapshots, easily accessed by any branch.
Clean Repo: Your code stays lean, and you don’t mix binary junk with it.
Streamlined CI: CI jobs can push, pull, and compare screenshots directly from the cloud.

3. Requirements and Limitations

Requirements:

Detect Issues Early: Catch visual regressions in MRs and after merging to main.
Central Baseline: Store authoritative baseline screenshots in S3.
Consistent Test Environments: Run tests in a standardised setup to avoid inconsistencies. Playwright allows us to do just that.
Automated Diffs and Reports, locally and in the CI pipeline: Compare screenshots automatically and generate clear, actionable reports.
Long-Lived Branch Support: Update baselines on feature branches without losing the main reference.
Secure Storage: Access controls are already enforced in our S3 setup.
Scalability: Eventually parallelise tests for bigger projects.

Accepted Limitations:

No Per-Story Approvals: You approve changes as a batch, not story-by-story.
Re-Acceptance on Main: After merging, you must re-accept the changes on the main branch. This could be overcome by tracking Git parents for accepted stories, but would require an additional database.
Flaky Tests Happen: There might be occasional rendering quirks.
Versioning Later: While possible, we initially skipped versioned baselines. We opted to set up a relatively short retention period in the S3 bucket configuration to avoid bloating.

4. Setting Up a "Universal" Test

What we need to do:

Install Playwright (if not already present). Use a separate Playwright config for visual regression, as it needs different browsers and resolutions.
Add a Playwright test for visual regression: (a) Use component tests to load stories individually (more overhead). (b) Use one test to compare all stories (our choice).
Make Storybook accessible to tests (locally or in CI), e.g., by running it alongside the test or in a sidecar container.
Run Playwright tests against the deployed Storybook instance.
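
The separate config mentioned in the first step might look roughly like the sketch below. The file name, the `./visual` test directory, the project list, and the `maxDiffPixelRatio` tolerance are assumptions made for illustration, not our actual settings.

```typescript
// playwright.visual.config.ts (hypothetical file name)
import { defineConfig, devices } from '@playwright/test'

export default defineConfig({
  // Assumed location of the visual regression test.
  testDir: './visual',
  // Drop the platform from snapshot paths so one baseline set serves every run.
  snapshotPathTemplate: '{testDir}/__screenshots__/{projectName}/{arg}{ext}',
  expect: {
    // Assumed tolerance; tune it to how much rendering noise you see.
    toHaveScreenshot: { maxDiffPixelRatio: 0.01 },
  },
  // Browsers chosen for illustration only.
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
  ],
})
```

With a file like this in place, the visual suite can be run on its own with `npx playwright test --config=playwright.visual.config.ts`.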

NB: The approach below is “hacky”, as befits a PoC, but it worked well enough to prove the concept. See this post for how to use Storybook metadata to get a list of stories.

import { test, expect } from '@playwright/test'

const URL = process.env.STORYBOOK_URL || 'http://localhost:6006'

test('has desktop screenshots', async ({ page }) => {
  await page.setViewportSize({ width: 1200, height: 700 })
  await page.goto(URL)
  await page
    .locator('css=a[data-nodetype=story]')
    .nth(0)
    .waitFor({ state: 'visible', timeout: 10000 })
  await page.keyboard.press('ControlOrMeta+Shift+ArrowDown') // expand all stories
  
  // locator() is synchronous, so no await is needed here
  const storyLinks = page.locator('css=a[data-nodetype=story]')
  const ids = await storyLinks.evaluateAll((links) =>
    links.map((link) => link.getAttribute('data-item-id')),
  )

  const errors = [];
  for (const id of ids) {
    await page.goto(`${URL}/iframe.html?args=&id=${id}&viewMode=story`)
    try {
      await expect(page).toHaveScreenshot({ fullPage: true })
    } catch (err) {
      errors.push(err.message);
    }
  }

  if (errors.length > 0) {
    throw new Error(errors.join('\n'));
  }
})

Note that we wrap each assertion in try...catch so the test does not fail on the very first mismatch. Instead, it collects every error and throws once at the end, which gives us a full report.
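
As an alternative to scraping the sidebar, the story IDs can be read from Storybook's index file. The sketch below assumes Storybook 7 or later, which serves an `index.json` containing an `entries` map whose items have a `type` of `story` or `docs`. The type names and the `storyIdsFromIndex` helper are made up for this example.

```typescript
// Minimal shape of Storybook's index.json (only the fields used here).
type IndexEntry = { id: string; type: 'story' | 'docs'; title: string; name: string }
type StoryIndex = { v: number; entries: Record<string, IndexEntry> }

// Hypothetical helper: keep stories only (docs pages have no canvas to
// screenshot) and sort the IDs so the run order is stable.
function storyIdsFromIndex(index: StoryIndex): string[] {
  return Object.values(index.entries)
    .filter((entry) => entry.type === 'story')
    .map((entry) => entry.id)
    .sort()
}

// Hand-written sample; in a real test the object would be fetched from
// <storybook-url>/index.json instead.
const sampleIndex: StoryIndex = {
  v: 4,
  entries: {
    'card--default': { id: 'card--default', type: 'story', title: 'Card', name: 'Default' },
    'button--docs': { id: 'button--docs', type: 'docs', title: 'Button', name: 'Docs' },
    'button--secondary': { id: 'button--secondary', type: 'story', title: 'Button', name: 'Secondary' },
    'button--primary': { id: 'button--primary', type: 'story', title: 'Button', name: 'Primary' },
  },
}

console.log(storyIdsFromIndex(sampleIndex))
// → [ 'button--primary', 'button--secondary', 'card--default' ]
```

The resulting array could replace `ids` in the loop above, which removes the dependency on the sidebar markup and the keyboard shortcut.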

We also add the JSON reporter to the Playwright config:

...
reporter: [['html'], ['json', { outputFile: 'test-results.json' }]]
...
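
One way a later step could consume that JSON file is to list the specs that failed. The sketch below is an illustration rather than part of our pipeline: it assumes the report has nested `suites`, each with a `specs` array whose items carry a `title` and an `ok` flag, and it models only those fields. The `failedSpecTitles` helper and the sample data are hypothetical.

```typescript
// Only the parts of the JSON report this sketch relies on (assumed shape).
type ReportSpec = { title: string; ok: boolean }
type ReportSuite = { title: string; specs: ReportSpec[]; suites?: ReportSuite[] }
type Report = { suites: ReportSuite[] }

// Hypothetical helper: walk the suite tree and return "suite > spec"
// paths for every spec that did not pass.
function failedSpecTitles(report: Report): string[] {
  const failed: string[] = []
  const visit = (suite: ReportSuite, path: string[]): void => {
    const here = [...path, suite.title]
    for (const spec of suite.specs) {
      if (!spec.ok) failed.push([...here, spec.title].join(' > '))
    }
    for (const child of suite.suites ?? []) visit(child, here)
  }
  for (const suite of report.suites) visit(suite, [])
  return failed
}

// Hand-written sample; a real run would parse test-results.json instead.
const sampleReport: Report = {
  suites: [
    {
      title: 'visual.spec.ts',
      specs: [{ title: 'has desktop screenshots', ok: false }],
      suites: [{ title: 'smoke', specs: [{ title: 'loads storybook', ok: true }] }],
    },
  ],
}

console.log(failedSpecTitles(sampleReport))
// → [ 'visual.spec.ts > has desktop screenshots' ]
```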

5. CI Pipeline and S3 Integration

Our CI flow involves three steps:

Pull the screenshots that will be used as a baseline. We use the Git branch name slug as a folder name for the branch screenshots, and if it doesn't exist, fall back to the main branch screenshots.
Perform the diff. This job uploads the Playwright-generated test report as an artifact and stores the test results in the cache, which is how they reach the "Accept" job.
Accept the changes by overwriting the screenshots with the new ones (pulled from the cache entry created by the "Diff" job), and uploading them to the branch folder on S3.

We use GitLab’s pipeline cache to share snapshots between jobs, and AWS CLI for the S3 operations. The reports are uploaded as pipeline artifacts, so they can be accessed and viewed right in the browser.
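
The baseline fallback and the two S3 transfers can be sketched as follows. The helper names, the bucket name, and the local `./snapshots` directory are assumptions for illustration. The list of existing prefixes would have to come from a prior bucket listing, which is not shown here. In GitLab, the branch slug is available as `CI_COMMIT_REF_SLUG`.

```typescript
// Hypothetical helper: use the branch's own baseline folder when it exists,
// otherwise fall back to the main branch's folder.
function resolveBaselinePrefix(
  branchSlug: string,
  existingPrefixes: string[],
  mainSlug = 'main',
): string {
  return existingPrefixes.includes(branchSlug) ? branchSlug : mainSlug
}

// Build the AWS CLI command that pulls baselines into the working directory.
function pullCommand(bucket: string, prefix: string, localDir: string): string {
  return `aws s3 sync s3://${bucket}/${prefix}/ ${localDir}`
}

// Build the AWS CLI command that uploads accepted screenshots to the
// branch's own folder, whichever folder the baselines came from.
function acceptCommand(bucket: string, branchSlug: string, localDir: string): string {
  return `aws s3 sync ${localDir} s3://${bucket}/${branchSlug}/`
}

// Example: a new feature branch with no baseline folder of its own yet.
const bucket = 'example-visual-baselines' // hypothetical bucket name
const branch = 'feature-new-header'
const prefix = resolveBaselinePrefix(branch, ['main', 'feature-old-footer'])

console.log(pullCommand(bucket, prefix, './snapshots'))
// → aws s3 sync s3://example-visual-baselines/main/ ./snapshots
console.log(acceptCommand(bucket, branch, './snapshots'))
// → aws s3 sync ./snapshots s3://example-visual-baselines/feature-new-header/
```

Note that the pull may read from the main folder while the accept step always writes to the branch folder, which is what keeps the main reference intact on long-lived branches.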

6. General Workflow

Initialize on Main: Accept all changes once to set the initial baseline in S3.
Merge Request Runs: Fix any unintended regressions first, then accept the remaining, intended changes.
Back to Main: After merging, verify no unexpected changes and accept again to keep main authoritative.

7. Conclusion

Switching to Playwright and S3 streamlined our visual regression testing: it scales better and costs less. We had to accept some trade-offs, like no granular story-level approvals and the need to re-accept on main.

Wins:

Scalability: No worries about screenshot volume.
Cleaner Repo: Code and tests remain tidy.
Customisable: Playwright adapts to our environment easily.

Next Steps:

Speed up tests with parallelisation.
Improve reporting for easier debugging.
Consider versioned baselines for better traceability down the road.
Find a way to review and accept stories one by one.