Wednesday 21 January 2015

Derived data: to store or not to store...

In database terminology, derived data is any result that can be obtained by processing other data that is already in the database. Storing it in the database is technically redundant, and it creates a race condition: the source data can have changed while the derived data has not yet been updated.

So the case seems simple: don't store derived data.

But of course it's not as simple as that.
Calculating the derived data takes time. It may not take much, but it doesn't take much to be slower than reading a single record. That slowdown is often compensated for by caching, which is effectively just storing the data anyway, only in a volatile location instead of the database. Even then, quite a lot of recalculation remains to be done.
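To make the point about caching concrete, here is a minimal sketch in Python. The class `TotalsCache` and its method names are made up for this illustration; the idea is only that a cache is a stored copy of the derived value, and that it has to be thrown away whenever the source data changes.

```python
class TotalsCache:
    """Source values per key, plus a volatile cache of their sums."""

    def __init__(self):
        self._values = {}          # key -> list of numbers (the source data)
        self._cache = {}           # key -> cached sum (the derived data)
        self.recalculations = 0    # counts how often the sum was recomputed

    def add(self, key, value):
        self._values.setdefault(key, []).append(value)
        # The cached sum is now stale, so drop it.
        self._cache.pop(key, None)

    def total(self, key):
        if key not in self._cache:
            self.recalculations += 1
            self._cache[key] = sum(self._values.get(key, []))
        return self._cache[key]
```

Reading the same total twice recalculates once; every write forces a new calculation on the next read, and everything is recalculated again after a restart because the cache is gone.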

Recalculation of stored derived data can also be enforced using triggers, so in practice race conditions are a rare occurrence. If you design a model for a darts championship, do you store the name of the winner of each leg, or do you only store the score per dart and sum those every time you want to know the name of the winner of the tournament?
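As a rough sketch of the trigger approach, the following uses Python's `sqlite3` module with an in-memory database. The tables `dart` and `leg_total`, the trigger name and the player names are all invented for this example, and the scoring is simplified to a plain sum per player per leg.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dart (
    leg_id INTEGER NOT NULL,
    player TEXT    NOT NULL,
    score  INTEGER NOT NULL
);

-- Derived data: one running total per player per leg.
CREATE TABLE leg_total (
    leg_id INTEGER NOT NULL,
    player TEXT    NOT NULL,
    total  INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (leg_id, player)
);

-- Keep leg_total in step with dart inside the same statement.
CREATE TRIGGER dart_after_insert AFTER INSERT ON dart
BEGIN
    INSERT OR IGNORE INTO leg_total (leg_id, player, total)
    VALUES (NEW.leg_id, NEW.player, 0);
    UPDATE leg_total
       SET total = total + NEW.score
     WHERE leg_id = NEW.leg_id AND player = NEW.player;
END;
""")

throws = [(1, "alice", 60), (1, "bob", 45), (1, "alice", 57), (1, "bob", 20)]
conn.executemany("INSERT INTO dart VALUES (?, ?, ?)", throws)


def stored_totals(leg_id):
    """Read the derived data that the trigger maintains."""
    rows = conn.execute(
        "SELECT player, total FROM leg_total WHERE leg_id = ?", (leg_id,)
    )
    return dict(rows.fetchall())


def recomputed_totals(leg_id):
    """Recalculate the same totals from the individual darts."""
    rows = conn.execute(
        "SELECT player, SUM(score) FROM dart WHERE leg_id = ? GROUP BY player",
        (leg_id,),
    )
    return dict(rows.fetchall())
```

Both functions return the same totals here, because the trigger updates the stored value as part of the insert. A complete version would also need triggers for updates and deletes on `dart`.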

Recalculating can actually create its own variation on a race condition, where a bug fix in the calculation changes the outcome. Why can that be a problem? Think of orders in a shop: the total value can always be calculated from the order lines, but it must never, ever change, because that would give you a serious problem with the customer and later on with the tax people. So if you find a bug in the calculation you should definitely fix it for new orders, but you would have to keep a copy of the old calculation around for the older orders. That does not sound like a good plan.
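A small Python sketch of how that can go wrong. Everything in it is assumed for the sake of the example: the 21% VAT rate, the `OrderLine` class, and the two rounding rules `total_v1` (round per line) and `total_v2` (round once at the end), standing in for "the calculation before and after the bug fix".

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
VAT_RATE = Decimal("0.21")  # assumed rate, for illustration only


@dataclass(frozen=True)
class OrderLine:
    unit_price: Decimal
    quantity: int


def total_v1(lines):
    """Hypothetical original rule: add VAT and round per line, then sum."""
    return sum(
        (
            (line.unit_price * line.quantity * (1 + VAT_RATE)).quantize(
                CENT, rounding=ROUND_HALF_UP
            )
            for line in lines
        ),
        Decimal("0"),
    )


def total_v2(lines):
    """Hypothetical 'fixed' rule: sum the net amounts, then add VAT and round once."""
    net = sum((line.unit_price * line.quantity for line in lines), Decimal("0"))
    return (net * (1 + VAT_RATE)).quantize(CENT, rounding=ROUND_HALF_UP)


lines = [OrderLine(Decimal("0.99"), 1) for _ in range(3)]

# The total is calculated once, when the order is placed, and then stored.
stored_total = total_v1(lines)

# Later the calculation is changed; recalculating the old order now
# disagrees with what the customer was actually charged.
recalculated_total = total_v2(lines)
```

With these three lines the stored total is 3.60 and the recalculated total is 3.59. One cent is enough to make the invoice and the database disagree, which is why the total is stored with the order rather than recalculated.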

So in reality, a lot of so-called derived data is not actually derived data; it is just data that happens to have been calculated from other data. You also simply cannot always afford to calculate things in real time, for reasons of performance or consistency.

You really have to think about how the data is used and what the consequences would be of storing the data versus recalculating it.