Something I realised far too late when building Tanda was the difficulty of having a good test dataset. In both the development and test environments, the data we worked against didn’t reflect real customer usage well. Having unrealistic test and development data led to unrealistic expectations about how features should work. This led to bugs and poor design decisions.
A design mistake we made a lot was building a feature that would work really well for a business with 15 employees. Not many of our customers have 15 employees, but lots of our test accounts were small, because they loaded quickly and were easy to navigate around. This resulted in us building features that looked & performed great with a small number of staff, but totally fell apart with 150 or 1,500 employees.
We tried to solve this a bunch of ways, including building bigger and more realistic demo accounts and testing with anonymised customer data. But there was no silver bullet: every approach had two big problems.
Problem 1: quality data
As new features are added and existing features are expanded, test data needs to be kept up to date.
We built a really powerful demo account generator (the sales team almost liked it!), which could fully populate an account with a large amount of realistic data for every feature, catered to specific industries. It worked by importing data from a library of Google Sheets each of which basically represented a database table. The idea was that anyone could keep the sheets updated, add new test data, etc.
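The core mechanic of the generator can be sketched in plain Ruby: each sheet exports as CSV, the sheet maps to a database table, and each row becomes a record’s attributes. (The sheet contents and method name below are made up for illustration; the real generator pulled from the Google Sheets API rather than inline strings.)

```ruby
require "csv"

# Each sheet ≈ one table: parse the CSV export into an array of attribute
# hashes, ready to be bulk-inserted into the matching table.
def rows_for_table(csv_text)
  CSV.parse(csv_text, headers: true).map(&:to_h)
end

# A stand-in for one exported sheet (hypothetical columns).
employees_sheet = <<~CSV
  name,role,hourly_rate
  Alice,Manager,32.50
  Bob,Barista,24.00
CSV

records = rows_for_table(employees_sheet)
# records.first => {"name"=>"Alice", "role"=>"Manager", "hourly_rate"=>"32.50"}
```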
6 months later, the generator was abandoned because we’d built some new features, refactored some others, and nobody had updated the Sheets. You could ship a PR without updating the test data, so nobody updated the test data when shipping a PR.
Problem 2: fresh data
Most of what we build at Tanda relates to time, and to events happening in the near past or near future. So when using data for development, we needed data that was generated recently. Otherwise we’d be testing against timesheets from a year ago, our “what’s happening now” dashboard would be empty, etc.
(Incidentally, the sales team had a similar challenge with demo accounts. They haven’t found a solution they’re happy with either.)
Since we weren’t able to architect a solution to keeping data up to date, the next best solution we found was to test against an anonymised production dataset. This updated daily - automatically, as an RDS backup - so we’d just point to a regularly refreshed test database.
This created new headaches. For example, test data would be cleared daily when the database refreshed, which was annoying if you had things you wanted to test or demo to people for more than 24 hours. Often I saw people make test accounts on production just so they could have data in them persist for over 24 hours.
Ensuring data was properly anonymised was also important. The scripts for anonymising data were actually version controlled, so unlike with the demo account generator we did keep that process fresh as new fields were added. We also used Active Record Encryption on key fields, and had different encryption keys on production and development. This meant that if the anonymiser missed a field, it would still be unreadable in development because it could not be decrypted.
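The per-environment key setup is standard Rails (7+) Active Record Encryption; a sketch of the shape it takes is below. The model and field names are hypothetical, but `encrypts` and the credentials layout are the real Rails API.

```ruby
# Keys are generated per environment with `bin/rails db:encryption:init`,
# and the point is that production and development credentials hold
# *different* values:
#
#   config/credentials/production.yml.enc:
#     active_record_encryption:
#       primary_key: <random key A>
#       deterministic_key: ...
#       key_derivation_salt: ...
#
#   config/credentials/development.yml.enc:
#     active_record_encryption:
#       primary_key: <random key B>
#       ...
#
# In the model, sensitive fields are declared encrypted (field name is
# hypothetical). If the anonymiser misses this field, production ciphertext
# in a dev database simply fails to decrypt with the dev key.
class Employee < ApplicationRecord
  encrypts :tax_file_number
end
```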
A new approach: seed_fixtures
As I write this, we haven’t fully solved the problem at Tanda. And with a 10+ year old codebase and a big team, it just gets a bit harder each day.
While working on a new greenfields codebase, I wanted to try and embed a solution in place early, to avoid these problems becoming so painful years later. This led to the `seed_fixtures` gem.
The insight was that while development data was a mess, test environment data at Tanda was not-so-bad. I wouldn’t say it was great, but it was comprehensive (problem 1) and it was fresh by virtue of being loaded whenever tests ran (problem 2).
What if we used our test data, for development? 🤔
I ended up doing the inverse: putting all my test data in `seeds.rb`, and tweaking some Rails internals to execute that file in the test environment in place of Rails’ fixtures feature. Now, the data I use in development from my seeds file is the same data that’s available in tests.
When adding new features, the workflow I’ve developed is to figure out the schema, make the models, and then update `seeds.rb`. Instantly, the data is available for testing in the browser and in the test suite. And if I get a detail wrong, I update things in one place. Because tests need to pass for a PR to merge, it’s impossible to ship a feature without relevant, realistic development and demo data.
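As an illustration, a `seeds.rb` in this style might look like the sketch below. The model names and attributes are hypothetical; the one deliberate choice is that timestamps are relative to the current time, so the data is fresh on every load.

```ruby
# db/seeds.rb - a sketch; Organisation/Employee/Shift are hypothetical models.
org = Organisation.create!(name: "Acme Hospitality")

# Seed at realistic scale, not the 15-employee accounts that bit us before.
150.times do |i|
  employee = org.employees.create!(name: "Employee #{i + 1}")

  # Relative timestamps keep the data fresh (problem 2): every load produces
  # shifts in the recent past, never a year-old snapshot.
  employee.shifts.create!(starts_at: 9.hours.ago, ends_at: 1.hour.ago)
end
```

Because the same file backs the test suite, a test can reference this data by re-querying it, e.g. `Organisation.find_by!(name: "Acme Hospitality")`.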
I haven’t done this yet, but this solves the sales demo challenge too. You could use the content in `seeds.rb` to build a complete demo account. With a bit more tweaking and safety checks, you could have the file executed in production at runtime when someone requests a new demo account.
Using seed_fixtures
The gem is easy to use on new projects: just install it in your `test_helper`, and turn off Rails fixtures.
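I haven’t reproduced the gem’s literal API here, but conceptually the setup reduces to two moves in `test_helper.rb`: load `seeds.rb` once before the suite, and stop loading fixtures. A hand-rolled sketch of the same idea:

```ruby
# test/test_helper.rb (conceptual sketch, not the gem's exact API)
ENV["RAILS_ENV"] ||= "test"
require_relative "../config/environment"
require "rails/test_help"

# Load db/seeds.rb once, before any test runs. Because Rails wraps each test
# in a transaction that rolls back, every test starts from this seeded state.
Rails.application.load_seed

class ActiveSupport::TestCase
  # Note what's missing: no `fixtures :all` - Rails' fixture loading is unused.
end
```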
For existing projects, it’s basically the same, but you’ll also need to rewrite a lot of tests that depend on your current fixture set. This isn’t difficult work, but it is dull work. The easy way to do it would be to translate each existing fixture into a line in `seeds.rb`. The best way to do it would be to write `seeds.rb` from scratch to represent a realistic set of customers, and then update tests to reference it. This is hard, it takes a long time, and it requires a lot of domain context.
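For the easy way, a throwaway script can mechanically turn each fixture record into a `create!` line for `seeds.rb`. This sketch (fixture contents and names are hypothetical) handles the simple case of flat attributes; fixture associations and ERB would still need hand-editing.

```ruby
require "yaml"

# Turn one fixture file's records into create! lines for seeds.rb.
# The fixture label (e.g. "alice:") is dropped; only attributes carry over.
def fixture_to_seeds(model_name, yaml_text)
  YAML.safe_load(yaml_text).map do |_label, attrs|
    args = attrs.map { |key, value| "#{key}: #{value.inspect}" }.join(", ")
    "#{model_name}.create!(#{args})"
  end
end

users_yml = <<~YAML
  alice:
    name: Alice
    admin: true
YAML

fixture_to_seeds("User", users_yml)
# => ["User.create!(name: \"Alice\", admin: true)"]
```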
Is this the right way to do fixtures?
The jury is out on this. The initial idea came from this discussion, where a bunch of Rails core people said that you should not do this. I’ve seen lots of posts discuss “good” ways to deal with fixtures, but to me they felt like cries for help because the whole feature didn’t feel quite right. Here’s an example.
So far, it’s worked really well for me. Try `seed_fixtures` out, and let me know what you think!