From time to time, we hear from potential customers that they’re planning to (or have already tried to) “just roll their own sync solution”.

Excellent. It’s challenging, exciting, and — for a certain type of nerd — fun. Let’s think through some of the challenges you’ll face along the way. It’s difficult to backfill gaps in your sync strategy once you have live apps out in the world.

All sorts of things can go wrong during sync, of course: network dropouts, client crashes, server crashes… but let’s save those for another day. Today, we’re working in an ideal universe, with completely reliable communications.

The Problem

Back at the office (or warehouse, or data center, or virtual cloud-based thing that you know exists somewhere), you have your data. You have applications that read, update, analyze, tend to, and generally use that data.

You want some of that data to escape the office, and live on the phones (or scanners, or tablets, or laptops) of your colleagues (customers, users) out in the field (on the road, in the aisles of Warehouse 57, up in the sky, at the wind farm or oil rig).

And you don’t want to require them to connect to your servers when they’re out in the wild.

You need to sync that data to those devices.

What do we mean by “sync”, anyway?

For our purposes, let’s define it like so: We want to maintain two copies of one or more tables: one copy on a server, one on a mobile device.

Q: So, is this a one-way sync?

Let’s start with the simplest case, and say “yes”. One way. The server owns all of the data, the client gets a copy and changes nothing. No deletes, no inserts, no updates from the client.

A single ideal-world client, making one reliable, successful request at a time.

Excellent. This one’s easy! We just send the whole dataset down from the server to the client whenever anything’s changed. That warehouse-management database fits nicely on your average tablet.1

Q: How do we know something has changed?

Easy! The row count is larger on the server.

Or smaller. Or the same, but some of the fields have changed. Or rows have been replaced.

OK, not exactly “easy”.

☐ TODO: Devise a way to know that changes have happened at all

Should we track server changes as they occur? Or examine the entire dataset on every sync request? One’s more intrusive, one’s slower.2

Examining the entire dataset could work. We could hash everything on both sides. That’s assuming the data are represented identically on both sides.3

Oh, wait: we can include the server’s idea of the hash when we send the data. The client stores it and hands it back on the next request, and we compare it against the server’s current hash.
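
A rough sketch of that check, assuming both sides can produce the same canonical serialization (the table and its “id” primary key are invented for illustration):

    import hashlib
    import json

    def dataset_hash(rows):
        """Hash an entire table: sort rows by primary key, serialize each one
        canonically, and digest the lot. Both sides must serialize values
        identically for this to work (see footnote 3 re: floats)."""
        h = hashlib.sha256()
        for row in sorted(rows, key=lambda r: r["id"]):  # 'id' is hypothetical
            h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        return h.hexdigest()

    # Server: send the rows plus their hash.
    #   payload = {"rows": rows, "hash": dataset_hash(rows)}
    # Client: store payload["hash"] and present it on the next request, so the
    # server can compare it against its current hash and skip the send.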

☐ Definitely need some housekeeping data on the server side. Where should that go?

Of course, we’re still sending the whole dataset. That seems… insane. Maybe don’t do that.

Q: So what do we send?

Changes, of course. Maybe row-by-row deltas. Maybe the full contents of changed rows. And a list of rows to delete. Oh, and “changed” rows also include “new” rows.

And be sure to apply the updates in order.

  1. Delete the row with primary key ‘foo’
  2. Insert a new row with primary key ‘foo’

…does not want to be replayed out-of-order.
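
Here’s a sketch of what an ordered change feed and its replay might look like; the shapes, table, and column names are all made up for illustration:

    import sqlite3

    # One entry per change, in the order the server recorded them.
    # (Hypothetical shape: "op" is 'insert', 'update', or 'delete'.)
    changes = [
        {"seq": 41, "op": "delete", "pk": "foo"},
        {"seq": 42, "op": "insert", "pk": "foo", "row": {"id": "foo", "qty": 12}},
    ]

    def apply_changes(client_db, changes):
        # Replay strictly in sequence order: swap the two entries above and
        # the client ends up with no 'foo' row at all.
        for c in sorted(changes, key=lambda c: c["seq"]):
            if c["op"] == "delete":
                client_db.execute("DELETE FROM items WHERE id = ?", (c["pk"],))
            else:  # insert or update: ship the full row and upsert it
                client_db.execute(
                    "INSERT INTO items (id, qty) VALUES (:id, :qty) "
                    "ON CONFLICT(id) DO UPDATE SET qty = excluded.qty",
                    c["row"],
                )
        client_db.commit()

    client_db = sqlite3.connect("client.db")
    client_db.execute("CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, qty INTEGER)")
    apply_changes(client_db, changes)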

☐ more housekeeping data
☐ we’re almost certainly tracking database activity now

Q: How does that fit in with our “just keep a hash” strategy?

Awkwardly. We know we have different data than the client, but that’s a binary answer. Which updates does the client need?

Thankfully we just have the one client, so we could add some per-row housekeeping, or per-change housekeeping, and mark each change as sent after it’s been synched.

That’s gonna get larger than we want, faster than we’d like.

What if we keep a list of unsent changes, and delete them as they’re sent?

That’s better, as long as we have just the one client. Multiple clients (don’t worry, we’re not going there today) would require a client->updates linking table, with some sort of delete-as-we-go reference counting.
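
For illustration, that housekeeping might look roughly like this (SQLite flavor, invented names):

    import sqlite3

    conn = sqlite3.connect("server.db")
    conn.executescript("""
    -- Single-client version: one row per pending change, deleted once it's
    -- been sent successfully.
    CREATE TABLE IF NOT EXISTS unsent_changes (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        tbl TEXT NOT NULL,
        pk  TEXT NOT NULL,
        op  TEXT NOT NULL            -- 'insert' / 'update' / 'delete'
    );

    -- Multi-client version (not today): a change stays until every client
    -- that needs it has received it (delete-as-we-go reference counting).
    CREATE TABLE IF NOT EXISTS client_pending (
        client_id TEXT NOT NULL,
        seq       INTEGER NOT NULL REFERENCES unsent_changes(seq),
        PRIMARY KEY (client_id, seq)
    );
    """)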

And (of course) all of our clients will check in at reasonable intervals, and none of them will ever be decommissioned, and we’ll eventually clean out that whole “unsent” table. It’s a perfect world, remember?

I’m starting to dislike this “server keeps track of clients” strategy.

Q: What if the clients helped with the housekeeping?

That might work. The client knows when it last synched. The server knows when a change occurred. We could send recent updates based on that.

☐ include dates in the housekeeping data

This will work perfectly, since there will never be two syncs going on from different clients at the same time. No datestamp-overlap issues at all. Ideal worlds are so helpful.
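
A sketch of the server side of that, assuming the unsent_changes table from the earlier sketch grows a changed_at datestamp (per the checkbox above):

    def changes_since(conn, last_synced_at):
        """Return pending changes recorded after the client's last successful
        sync. 'conn' is the server database; ordering by seq keeps the
        replay-in-order rule from earlier intact."""
        return conn.execute(
            "SELECT seq, tbl, pk, op FROM unsent_changes "
            "WHERE changed_at > ? ORDER BY seq",
            (last_synced_at,),
        ).fetchall()

    # The client remembers the server's clock, not its own:
    #   last_synced_at = response["server_time"]   # hypothetical response field
    # Two clocks that disagree is exactly the kind of thing our perfect world
    # has promised not to have.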

Q: How do we track these changes, anyway?

Triggers, obviously. And let’s assume there are no other triggers to contend with, here in our perfect world.

Every time there’s any change, we drop all the rows-in-question into our (ideal) contention-free “unsent” table.
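
In SQLite flavor, those triggers might look something like this (the items table is still hypothetical, and your database’s trigger syntax will differ):

    import sqlite3

    conn = sqlite3.connect("server.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE IF NOT EXISTS unsent_changes (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        tbl TEXT, pk TEXT, op TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- One trigger per operation, per tracked table: every write drops the
    -- row-in-question into unsent_changes.
    CREATE TRIGGER IF NOT EXISTS items_ai AFTER INSERT ON items BEGIN
        INSERT INTO unsent_changes (tbl, pk, op) VALUES ('items', NEW.id, 'insert');
    END;
    CREATE TRIGGER IF NOT EXISTS items_au AFTER UPDATE ON items BEGIN
        INSERT INTO unsent_changes (tbl, pk, op) VALUES ('items', NEW.id, 'update');
    END;
    CREATE TRIGGER IF NOT EXISTS items_ad AFTER DELETE ON items BEGIN
        INSERT INTO unsent_changes (tbl, pk, op) VALUES ('items', OLD.id, 'delete');
    END;
    """)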

Q: How does the client know the database schema?

Simple: it’s hard-coded. Schemas don’t change in perfect worlds. Just ship it, pre-created.
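
So the client ships with something like this baked in (hypothetical tables, SQLite again):

    import sqlite3

    # Runs on first launch; the schema is part of the app, not part of the sync.
    client = sqlite3.connect("client.db")
    client.executescript("""
    CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE IF NOT EXISTS sync_state (
        last_synced_at TEXT,   -- the server's clock, from the last sync
        last_hash      TEXT    -- the server's dataset hash, if we kept that idea
    );
    """)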

Our progress so far

We have a very workable concept, given a few minor assumptions:

  • One client…
  • With server-sized storage…
  • That only receives updates…
  • Into a single, predefined schema…
  • From a faultless server…
  • Over a faultless connection

We’re golden, and this barely cost us any person-days at all! Break for lunch!

  1. No? Hmm. We’ll have to revisit that. 

  2. Oh, yes, speed. Remember that perfectly-reliable network? It still has a finite speed. You’re not really going to send the whole dataset down every time, right? More on that in a moment… 

  3. Because floating point numbers are perfectly consistent across all platforms.