History and provenance tracking¶
CATMAID keeps track of all changes to its database tables. If a database row
is changed, all old values will be stored in a so called history table together
with a time range representing the datas validity. This time period is
represented by the half-open interval [start, end)
for which a row is valid
starting from time point start
and is valid until (but not including!)
end
. Keeping track of changes is managed entirely by the database. Besides
disabling or enabling history tracking, the only thing Django can change, is
providing a label for the current transaction, which is useful to give some
semantics to a set of database changes. Currently, all CATMAID tables except
treenode_edge
and a few others are versioned, which can typically be
regenerated. CATMAID also keeps track of changes in most non-CATMAID tables,
that is the tables used by Django and Django applications we use, except for
asynchronous task related Celery and Kombu.
History tables¶
Each versioned table has a so called history table associated, indicated by the
__history
suffix (e.g. project
and project__history
). A convenient
view that includes live and history data, is available with the
__with_history
suffix. This is simply a union between both tables. A double
underscore is used to minimize collisions with existing names. This history
table is populated automatically through database triggers: whenever data in a
live table is updated or deleted, the history table will be be updated. It
contains a complete copy of every historic version of each row and specifies a
time period for its validity. This time period is called “system time” and is
represented through the additional sys_period
column in each history table.
This time range spans typically the time of the last edition (or creation) to
the time of change. If a live table doesn’t store such a start time stamp, a
separate 1:1 tracking table, which keeps track of editions, is created and
managed. Such tracking tables are named like the original table plus the suffix
__tracking
.
CATMAID’s history system has one requirement for tables it keeps track of: a single column primary key has to be used. Extending it to support multi-column primary keys is possible, not needed at the moment.
By default, all tables of CATMAID itself plus the user table (a Django table)
are set up to track history. To enable this for other tables (e.g. if new tables
are added), the database function create_history_table( live_table )
can be used. This will create the history table and sets up all required
triggers. Likewise, there is a drop_history_table( live_table )
function, which makes sure a history table and triggers are removed cleanly if
this is wanted. The table catmaid_history_table
keeps track of all
currently active history tables.
Transaction log¶
Each endpoint of the CATMAID API that changes data is supposed to leave a log
entry in the transaction log. This way, database changes can be associated with
a particular back-end operation. Like explained in the contributor
documentation, data changing endpoints in urls.py
are wrapped in a record_view
decorator, which is parameterized with a label.
This label is used by the back-end to find affected tables of a change.
Disabling history tracking¶
While history tracking is important and in most situations desirable, there are
a few situations where it would beneficial to disable it (e.g. some database
maintenance tasks, potentially more performance). To do this the setting
HISTORY_TRACKING
can be set to False
, i.e. add the following line to the
settings.py
file:
HISTORY_TRACKING = False
With the next restart of CATMAID, history tracking will be disabled. Likewise,
it can be enabled again by setting HISTORY_TRACKING = True
(or removing the
line). If the history system is enabled after it was disabled (i.e. database
triggers have to be created), all tracking tables are updated to match the live
data again.
Schema migration¶
In case there are schema changes to any of the tracked live tables, the history tables have to be changed as well and triggers have to be regenerated. Every column change of a table has to be reflected in the history triggers and tables. Ideally, this would be implemented with DDL triggers in Postgres, which is currentl only possible using a custom C extension. Because this would make CATMAID harder to install, this history table update involves some manual work when creating database migrations that change table columns. Before the actual migration can happen, the history system has to be disabled:
SELECT disable_history_tracking_for_table('<table-name>'::regclass,
get_history_table_name('<tablel-name>'::regclass));
SELECT drop_history_view_for_table('<table-name>'::regclass);
Depending on how the columns are changed, different scenarios are expected to happen:
If a column is added, a new history table column is added.
If a column is removed, the equivalent history table column is removed as well.
If a column is renamed, no copy is performed and the renaming is applied directly to the history table.
If the data type of a column changes, the history table column data should be updated to the new data type as well, if possible. If the original data needs to be preserved, the original history column is renamed (append first free “_n” suffix) and the new column is added. If no information loss is present (e.g. float to double), the original history column should however just be changed without backup to save storage space.
After both the live table and the history table have been updated, history tracking has to be enabled again:
SELECT create_history_view_for_table('<table-name>'::regclass);
SELECT enable_history_tracking_for_table('<table-name>'::regclass,
get_history_table_name('<table-name>'::regclass), FALSE);
In addition to column changes on existing tables, operations on whole tables are handled like this:
If a table is removed, the history table and history triggers need to be removed as well:
SELECT disable_history_tracking_for_table('<table-name>'::regclass, get_history_table_name('<tablel-name>'::regclass)); SELECT drop_history_view_for_table('<table-name>'::regclass); SELECT drop_history_table('<table-name>'::regclass);
If a table is added and its history should be tracked, history tracking has to be enabled for it. To do this, call
SELECT create_history_table( <tablename>::regclass, <timecolumn>, <txidcolumn> );
, with<timecolumn>
being an edit reference time and<txidcolumn>
being a column tracking a row’s transaction ID. For most CATMAID tables those parameters are'edition_time'
and'txid'
, respectively. If both<timecolumn>
and<txid>
areNULL
, a tracking table will be created automatically. Only providing one of the two is currently not supported. To let CATMAID know if you expect this table to have a history table, add the table to the appropriate list in theHistoryTableTest
class. This way you can also mark a table as not versioned.If a live table is renamed, the history table is not renamed automatically, use the function
history_table_name(<tablename>::regclass)
to create the new name.