Tracing data caching

For setups with a large amount of tracing data it can be helpful to set up intersection caching, especially for larger fields of view (FOV). Such caches store precomputed results of spatial queries for the respective container. There are currently two types of node caches (as they are called) available: node-query-cache and node-query-grid-cache. The former stores intersection results for whole sections, while the latter allows specifying a regular grid for which results will be precomputed. Both are generally populated using the Django management command catmaid_update_cache_tables, which allows selecting the type of cache with the --cache option (section or grid).

Caches can be updated automatically on data changes as long as the variable SPATIAL_UPDATE_NOTIFICATIONS is set to True (disabled by default). If caches are not used on a CATMAID instance with a large amount of tracing data, it might make sense to keep this setting disabled to improve the speed of operations like skeleton joins.
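For reference, enabling this in settings.py is a single assignment, e.g.:

# settings.py: enable Postgres notifications for spatial changes,
# which are required for automatic cache updates (see below).
SPATIAL_UPDATE_NOTIFICATIONS = True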

Both cache types support level of detail (LOD) configurations. This makes cached data accessible in terms of an ordered set of buckets. Being able to consistently access only a part of each cache allows quicker browsing of larger data sets while maintaining more detail at lower zoom levels.

All caches can store their data either as simple JSON text, as a JSON database object or as a binary msgpack representation. The msgpack version tends to be the fastest in most situations.

Node Query Cache

For a given orientation this cache stores the intersections with a full section, i.e. all nodes intersecting the plane at a given depth (Z for XY, etc.). To enable this cache for look-up, add an entry like this to your NODE_PROVIDERS array in the settings.py file:

('cached_msgpack', {
      'enabled': True,
      'orientation': 'xy',
      'step': 40
}),

This will make the back-end look for a msgpack-format section cache. If nothing is found, the next available node provider will be checked. Besides cached_msgpack there are also cached_json and cached_json_text. The regular node provider options like min_x, max_z, etc. are supported as well. This entry also indicates that the cache is only defined for the XY orientation and assumes a section thickness of 40 nm. These values are also used as defaults when the cache is refreshed.
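For illustration, a complete NODE_PROVIDERS array might combine the cache entry with a regular fallback provider, roughly like the sketch below. The fallback provider name postgis2d and the bounding-box values are assumptions and need to be adapted to the actual setup:

NODE_PROVIDERS = [
    # Try the msgpack section cache first.
    ('cached_msgpack', {
        'enabled': True,
        'orientation': 'xy',
        'step': 40,
        # Optional bounding box (project coordinates in nm), example values:
        'min_x': 0,
        'max_x': 200000,
    }),
    # Fall back to a regular node provider if the cache has no data
    # for a request (provider name is an assumption, adjust as needed).
    'postgis2d',
]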

This cache is fast for large fields of view that cover most of a section, but it should be limited explicitly to keep the node count in check; otherwise it can cause too many nodes to be loaded.

To populate the cache, the catmaid_update_cache_tables command can be used for instance like this:

./manage.py catmaid_update_cache_tables --type msgpack --orientation xy \
    --step 40 --node-limit 0 --min-z 80000 --max-z 84000 --noclean \
    --n-largest-skeletons-limit 1000

This will ask interactively for the project you want to create the cache for. Once that is done, a whole-section query is created for each Z in XY orientation. The distance between two sections is set to 40 nm. Also, only a range of Z values is computed, which is sometimes useful for testing different configurations. The --noclean option ensures CATMAID doesn't remove existing cached data. Additionally, only the 1000 largest skeletons are included.

There are more options available, which can be listed using:

./manage.py catmaid_update_cache_tables --help

Node Query Grid Cache

The grid cache option of the catmaid_update_cache_tables command populates a grid made out of cells, each with the same height, width and depth. For each of these cells a separate spatial query is cached, which also allows independent updates. In its simplest form, the cache can be enabled for look-up by adding an entry like the following to the NODE_PROVIDERS array in settings.py:

('cached_msgpack_grid', {
      'enabled': True,
      'orientation': 'xy',
}),

This will make node queries look for grid caches for the XY orientation. Like with the section cache, options like min_x, max_x, etc. can be used to limit the volume for which the cache is defined.

To create this cache, the catmaid_update_cache_tables management command can be used like this:

./manage.py catmaid_update_cache_tables --project=1 --cache grid \
    --type msgpack --cell-width 20000 --cell-height 20000 --cell-depth 40

As a result, a uniform msgpack-encoded grid cache is created with cells of the dimensions 20 µm x 20 µm x 40 nm (w x h x d).

The optional settings parameters DEFAULT_CACHE_GRID_CELL_WIDTH, DEFAULT_CACHE_GRID_CELL_HEIGHT and DEFAULT_CACHE_GRID_CELL_DEPTH allow defining defaults for the above management command.
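These could, for instance, be set in settings.py like this (the values below are just examples matching the command above):

# settings.py: default grid cell dimensions (nm) used by
# catmaid_update_cache_tables when no --cell-* options are given.
DEFAULT_CACHE_GRID_CELL_WIDTH = 20000
DEFAULT_CACHE_GRID_CELL_HEIGHT = 20000
DEFAULT_CACHE_GRID_CELL_DEPTH = 40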

To speed up the computation, it is possible to provide the parameters --jobs and --depth-steps. With --jobs, the number of parallel processes can be specified, which allows filling the cache in parallel. By default only a single process is used. With --depth-steps it is possible to re-evaluate the number of cells to look at N times during the run. For instance, using --depth-steps=2 will do a bounding box query when the process is halfway through the depth dimension (Z for XY orientation). By default only a single bounding box query is made. Updating the bounding box every 100 sections or so can lead to large improvements in cache cell update times. By default, 10 cache cells are processed per process in a parallel run. This can be adjusted using the --chunk-size parameter. An example invocation is shown below.
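A parallel grid cache run could, for example, look like this (the parameter values are only placeholders and should be adapted to the data set and hardware):

./manage.py catmaid_update_cache_tables --project=1 --cache grid \
    --type msgpack --cell-width 20000 --cell-height 20000 --cell-depth 40 \
    --jobs 4 --depth-steps 10 --chunk-size 20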

Updating caches

Functionality to update cache cells automatically is available as well. CATMAID uses Postgres' NOTIFY/LISTEN feature, which allows for asynchronous events following the pub/sub model. To lower the impact on regular tracing operations (especially joining), an insert/update/delete trigger on treenode and connector executes a conditional trigger function, which is set on CATMAID startup.

These events (“catmaid.spatial-update” and “catmaid.dirty-cache”) are disabled by default, because they add slightly to the query time, even if not used. To enable these database events and allow automatic cache updates, set the settings.py variable SPATIAL_UPDATE_NOTIFICATIONS = True. Once enabled, these events can be consumed by third party clients as well.

Cache updates work by running two additional worker processes in the form of management commands: catmaid_spatial_update_worker and catmaid_cache_update_worker. The former is responsible for listening to the “catmaid.spatial-update” Postgres event and adds an entry to the table dirty_node_grid_cache_cell for each intersected cache cell in an enabled grid cache. Upon inserts and updates this table issues the “catmaid.dirty-cache” event, which the second management command listens to. It is responsible for updating the respective cache cells and removing their entries from the dirty table. If single worker processes aren't enough, more workers can be started.
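Both workers are regular management commands and are typically run as separate long-running processes, e.g. under a process supervisor:

./manage.py catmaid_spatial_update_worker
./manage.py catmaid_cache_update_worker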

When treenodes are created, moved or deleted, the database emits the event “catmaid.spatial-update” along with the start and end node coordinates. The same happens with changed connectors and connector links. Other processes can use this to asynchronously react to those events without writing to another table or blocking trigger processing in other ways.
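As a sketch of such a third-party consumer, a Python process using psycopg2 could listen for the spatial update events roughly like this (the connection parameters are placeholders; the channel name follows the event name used above):

import select
import psycopg2

# Placeholder connection parameters, adjust to the actual database.
conn = psycopg2.connect('dbname=catmaid user=catmaid_user')
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
# Subscribe to the notification channel named after the event above.
cur.execute('LISTEN "catmaid.spatial-update";')

while True:
    # Wait up to 60 seconds for a notification, then poll again.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print('Spatial change:', notify.payload)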

Alternatively, it is possible to monitor the catmaid_transaction_info table and see which entries caused spatial changes and recompute selectively.

Level of detail

The node query result for either a whole section or a single grid cell is not stored as a single big entry in the cache. Instead it is stored in level of detail (LOD) buckets, each one only allowing a maximum number of nodes, except for the last one, which takes all remaining nodes. This allows requests that make use of this cache to declare that they are only interested in e.g. 5 nodes per grid cell. With small enough grid cell dimensions this allows uniform control of reasonable node distributions for each zoom level in the front-end.

To configure LOD-relevant parameters during cache construction, the options --lod-levels, --lod-bucket-size and --lod-strategy can be used with the catmaid_update_cache_tables management command. These options are optional and have sensible defaults.

The first option defines how many LOD levels there should be. By default only one level is defined, which effectively means there are no levels of detail.

The second option defines how many nodes are allowed in every bucket (except the last one). The default here is a bucket size of 500.

The last option allows selecting between the strategies linear, quadratic and exponential. Each one defines how the size of every bucket is computed from the initial bucket size. In linear mode, each bucket has the same size, the one defined with --lod-bucket-size. In quadratic mode, the first bucket has the passed-in size and the following ones are computed by multiplying the initial bucket size with lod-level ** 2. This mode is also the default. In exponential mode, the initial bucket size is multiplied with 2 ** lod-level, i.e. buckets grow faster.
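As an illustration of how these strategies differ, the following Python sketch computes per-level bucket sizes from the formulas described above (a standalone illustration, not CATMAID's actual implementation):

def lod_bucket_size(level, initial_size, strategy):
    """Return the node capacity of the bucket at a 1-based LOD level."""
    if strategy == 'linear':
        return initial_size
    if strategy == 'quadratic':
        return initial_size * level ** 2
    if strategy == 'exponential':
        return initial_size * 2 ** level
    raise ValueError('Unknown LOD strategy: ' + strategy)

# With --lod-bucket-size 5 and --lod-strategy quadratic:
# level 1 -> 5, level 2 -> 20, level 3 -> 45, ..., level 50 -> 12500
print([lod_bucket_size(n, 5, 'quadratic') for n in (1, 2, 3, 50)])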

To create a usable LOD configuration for a grid cache, the command line could look like this:

./manage.py catmaid_update_cache_tables --project=1 --cache grid \
    --type msgpack --cell-width 20000 --cell-height 20000 --cell-depth 40 \
    --lod-levels 50 --lod-bucket-size 5 --lod-strategy quadratic

This will start the first level of detail with a bucket of size 5, with subsequent bucket sizes growing quadratically with the LOD level (5, 20, 45, and so on), up to 12,500 in bucket 50.

The front-end allows setting a “Level of detail” (LOD) value in the tracing layer settings. By default, this is set to “max”, which causes all LOD levels to be included. Setting this to 1 will include only the first level. The front-end also allows mapping zoom levels to particular LOD levels. This allows flexible zooming behavior with adaptive display limits using cached data.