Wednesday 28 January 2015

Why I like PostgreSQL more than MySQL: conditional NOT NULL constraints

Sometimes you want certain fields of a record to be filled in, but only if some condition is met. For example, if a user registers as a company, you'll want their company name.

In MySQL you can do this using a trigger, which works just fine but in PostgreSQL you can do it in a much more compact way, using CHECK constraints.

In this example, every record added to foo must meet the requirement that if is_admin is true, then firstname and lastname may not be null; if is_admin is false, the names may be left empty.

CREATE TABLE foo (id INTEGER, firstname VARCHAR(100), lastname VARCHAR(100), is_admin BOOLEAN NOT NULL DEFAULT 'f');
ALTER TABLE foo ADD CONSTRAINT adminfields CHECK (is_admin = 'f' OR (is_admin='t' and firstname IS NOT NULL AND lastname IS NOT NULL));

These inserts meet the requirements:

INSERT INTO foo (id) VALUES (1);
INSERT INTO foo (id, firstname, lastname, is_admin) VALUES (1,'kees','pekinees','t');

And this one will fail:
INSERT INTO foo (id, is_admin) VALUES (2, 't');

You can expand on this by making sure the name is not just NOT NULL, but actually contains sane data, using a regexp.
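For example, a hypothetical extra constraint that requires admin names to start with a letter could look like this (the constraint name and the exact regexp are my own; what counts as "sane" is up to you):

```sql
-- Hypothetical: admins must have names starting with a letter.
ALTER TABLE foo ADD CONSTRAINT sane_adminnames CHECK (
  is_admin = 'f'
  OR (firstname ~ '^[[:alpha:]]' AND lastname ~ '^[[:alpha:]]')
);
```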

The major advantage of CHECK over triggers is that the CHECK shows up in the CREATE statement of the table, giving you a single overview of the table's behaviour.

Again: there is nothing really wrong with triggers, this is just more convenient.

Saturday 24 January 2015

Why I like PostgreSQL (more than MySQL): memcache

Memcache is a fast in-memory key/value store. It is often used as a cache for data that is slow to generate but needs to be served often, and since the database is usually the "slowest" part of the system, database results are a prime candidate for caching. (I use quotes because the database is usually also the part of the system that does most of the work; people tend to forget that in their hunt for "performance", a word that is very relative and includes much more than getting a reply quickly.)

But, saving database results anywhere other than the database creates an instant race condition. The data in memcached must be considered outdated the moment it is stored, because from that moment on it will not change when the source does.

Enter PgMemcached.

PgMemcached is a loadable extension for PostgreSQL that enables access to a memcached server from within a query. It allows you to do something like:

SELECT memcache_server_add('localhost:11211');  -- host:port of your memcached server
SELECT memcache_set('foo','bar');
SELECT memcache_get('foo');

Which will return 'bar'.

But because it is a regular memcached server, you can now do this:

> telnet localhost 11211
Connected to localhost.
Escape character is '^]'.
> get foo
VALUE foo 0 3
bar
END

So, PostgreSQL can now read and write directly to your memcached server and share the data that your application is storing there. That opens up the next step:

CREATE TEMPORARY TABLE tmp_news (id INTEGER, title VARCHAR(100), body TEXT);
INSERT INTO tmp_news (id, title) VALUES (1, 'Webber wins'), (2, 'Alonso loses'), (3, 'Gingers have souls');

WITH json_rows AS (SELECT ROW_TO_JSON(tmp_news) AS r FROM tmp_news ORDER BY id DESC LIMIT 10)
SELECT memcache_set('news_cache', JSON_AGG(r)::TEXT) FROM json_rows;
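The trigger function used below is not shown here; a minimal sketch of what update_news_cache() could look like (my assumption: it simply rebuilds the cached JSON array on every insert, and memcache_set's boolean result is captured into a throwaway variable):

```sql
CREATE OR REPLACE FUNCTION update_news_cache() RETURNS TRIGGER AS $$
DECLARE
  junk BOOLEAN;
BEGIN
  -- Rebuild the cached JSON array of the 10 newest items.
  SELECT memcache_set('news_cache', JSON_AGG(r)::TEXT) INTO junk
  FROM (SELECT ROW_TO_JSON(tmp_news) AS r
        FROM tmp_news ORDER BY id DESC LIMIT 10) AS json_rows;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```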

CREATE TRIGGER trg_update_news_cache AFTER INSERT ON tmp_news FOR EACH ROW EXECUTE PROCEDURE update_news_cache();

INSERT INTO tmp_news VALUES (4,'UFO sighted at Malibu');

SELECT memcache_get('news_cache');

So, adding a news item to tmp_news automatically updates a memcached value with a JSON array of the news items, ready for use in your application.
Critically, with this in place your news cache is updated regardless of how the news items were changed. It can be done by your application or by a manual query: whatever changes the news will force an update of the cache. This in turn means that you no longer have to check the validity of the cache; if the value exists, it is up-to-date by definition (well, unless you've changed it by other means...).

Wednesday 21 January 2015

Derived data: to store or not to store...

In database terminology, derived data are results that can be obtained by processing other data that is already in the database. Storing that data in the database is technically redundant and creates a race condition where the source data can have changed but the derived data has not yet been updated.

So the case seems simple: don't store derived data.

But of course it's not as simple as that.
Calculating the derived data takes time. It may not be much, but it doesn't take much to be slower than reading a record, and that slowdown is often compensated by caching, which is effectively just saving the data anyway, only in a volatile location instead of the database. And that still leaves quite a lot of recalculation to be done.

Recalculation can also be enforced using triggers, so race conditions are in fact a rare occurrence. But consider designing a model for a darts championship: do you store the name of the winner of each leg, or do you only store the score per dart and sum the scores every time you want to know the name of the winner of the tournament?

Recalculating could also create a variation on a race condition, where a bugfix in the calculation changes the outcome. Why can that be a problem? Think of orders in a shop: the total value can always be calculated from the orderlines, but it must never ever change, because that would give you a serious problem with the customer, and later on with the tax people. So if you find a bug in the calculation you should definitely fix it for new orders, but you must keep a copy of the old calculation around for the older orders. That does not sound like a good plan.
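As a sketch of that orders example (the schema and id here are hypothetical, for illustration only): the total is derived from the orderlines, but frozen onto the order once, so later bugfixes to the calculation cannot silently rewrite old invoices.

```sql
-- Hypothetical schema.
CREATE TABLE orders (id INTEGER PRIMARY KEY, total NUMERIC(10,2));
CREATE TABLE orderlines (
  order_id INTEGER REFERENCES orders(id),
  price NUMERIC(10,2),
  quantity INTEGER
);

-- Freeze the derived value at confirmation time, once.
UPDATE orders o
SET total = (SELECT SUM(price * quantity)
             FROM orderlines WHERE order_id = o.id)
WHERE o.id = 1;
```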

So in reality, a lot of the derived data is not actually derived data, it is just data that happens to have been calculated from other data. You also simply cannot always afford to calculate things in realtime because of performance or consistency reasons.

You really have to think about how the data is used and what the consequences would be of saving the data versus recalculating it.

Tuesday 6 January 2015

Why I like PostgreSQL more than MySQL: transactions and magic commit.

MySQL's BEGIN command is actually a COMMIT;BEGIN; which means that you can commit a transaction without ever executing a COMMIT. That by itself is... well... not very good, but the problems don't end there.
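To illustrate (a sketch of a MySQL session; it assumes an InnoDB table t already exists, and relies on MySQL's documented behaviour that starting a new transaction implicitly commits any pending one):

```sql
START TRANSACTION;
INSERT INTO t VALUES (1);
BEGIN;       -- implicitly commits the transaction above
ROLLBACK;    -- rolls back nothing: the INSERT is already committed
```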

The only reason why you use transactions is so that you can revert the data when you notice that you cannot complete the work. So, you'd expect the database to do a rollback whenever it encounters a problem through which it cannot complete the transaction.

MySQL does not support DDL inside transactions; you cannot create or drop tables inside one, for example. That by itself is not the problem: transactional DDL is a complex piece of MVCC that is not easily implemented. But instead of aborting the transaction and doing a rollback whenever the user does something that breaks the transaction, MySQL will actually commit the transaction. Yes, commit. So, you begin work, you do some things that work, you do something that cannot be handled, and suddenly your half-completed work is committed.

Tell me, in which universe is that a good idea?

And don't give me the "users wanted it this way" excuse. Nobody wants a commit when there is an error, and if somebody did, people with database knowledge, which I thought MySQL's developers would have, should tell them no and implement a rollback instead.

PostgreSQL will actually force an abort on error; you cannot even choose to continue after an error. PostgreSQL also supports DDL inside transactions, which is extremely handy when doing migrations and unit tests.
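In PostgreSQL the same pattern behaves sanely, because DDL participates in the transaction:

```sql
BEGIN;
CREATE TABLE migration_test (id INTEGER);
INSERT INTO migration_test VALUES (1);
ROLLBACK;
-- The table is gone again; selecting from it now fails with
-- ERROR: relation "migration_test" does not exist
```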

Monday 5 January 2015

Your database knows your data. Use it!

Case in point: an application that prints invoices. Some of these invoices have not been paid before the expiration date and must be printed in red.

That's easy enough; just write an IF statement in the application to compare the date the invoice was sent to the date it should have been paid on, and voila.

Except: that requires that the application has a function that can compare the dates that the database puts out. The database returns strings, so you must either SELECT the dates in a format that is purely numeric and in UTC, or in one that requires no effort to convert into the programming language's objects.

Alternatively, you could use the query to *ASK* the database if the invoice has expired. No, not by running a separate query, but by augmenting the data that you already have in the table.

For example; if your table holds an id, price, paid, date_sent, and date_expired, you can do something like:

SELECT id, price,
  CASE WHEN NOW() > date_expired AND paid = FALSE
    THEN 1
    ELSE 0
  END AS expired
FROM ....

Now the data about the invoice contains a virtual column called 'expired', which holds '1' for invoices that have expired and '0' for those that have not.

Stick this in a (materialized) view, and your application does not even need to know the business rules about when and how invoices expire.
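A sketch of such a view (the table name invoices is my assumption; the original query elides it):

```sql
CREATE VIEW invoices_with_status AS
SELECT id, price,
  CASE WHEN NOW() > date_expired AND paid = FALSE
    THEN 1
    ELSE 0
  END AS expired
FROM invoices;  -- hypothetical table name
```

Note that in a materialized view, NOW() is evaluated at refresh time, so the expired flag would only be as fresh as the last REFRESH MATERIALIZED VIEW.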

PS: Yes, of course, you could also run a cronjob at 00:00:00 that does this check and modifies the status of the invoice. That will work just fine if performance is an issue.