Controlling Looker caching

We use the business intelligence tool Looker to give users access to our data. Generally, it does a good job of providing a nice front end for the data, but we have had difficulties with the caching of views.

Looker works by having a set of “views”, which are essentially abstractions of SQL queries. Each view draws on data from a combination of SQL tables and other views. In our setup we have a lot of hierarchical links between views, in the spirit of DRY.

To ensure that users get their data quickly, Looker creates a cache for each PDT (persistent derived table). This is done by a single agent which iterates through all of the views and checks whether their cache has been marked as out of date. Only one cache can be rebuilt at a time.

There are two main problems that this causes:

1 - Frivolous rebuilding. We don’t want to be endlessly rebuilding caches if the underlying data hasn’t been updated.

2 - Rebuilding in the wrong order. If view A depends on view B, then we need to ensure that view B is rebuilt before view A, otherwise A will still be out of date.

Solution

My solution is to have a central monitoring script which takes as its input two streams: a stream of information about when data in the SQL database has changed, and a stream of which caches have been rebuilt. The script returns a stream of instructions about which PDT should be rebuilt next.

Note that within its own scope this creates a nice feedback loop: the program issues commands about what should be rebuilt next and receives an input back when this has been done. However, it can also work alongside other caching systems. If you have other caches that you want rebuilt on some fixed schedule, this is immediately accommodated.
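As a minimal sketch, the core loop might look something like this. The DEPENDENCIES graph, the two input streams and the issue_rebuild callback are all hypothetical stand-ins here; in practice the streams are the generators described in the implementation steps below.

```python
from collections import deque

# view -> the views/tables it depends on (a hypothetical example graph)
DEPENDENCIES = {
    "orders_summary": {"orders", "customers_view"},
    "customers_view": {"customers"},
}

def stale_views(changed_tables, dependencies):
    """Return every view whose dependencies (directly or transitively) include a changed table."""
    stale = set()
    frontier = deque(changed_tables)
    while frontier:
        node = frontier.popleft()
        for view, deps in dependencies.items():
            if node in deps and view not in stale:
                stale.add(view)
                frontier.append(view)
    return stale

def rebuild_order(stale, dependencies):
    """Order stale views so a view is only rebuilt after the stale views it depends on."""
    ordered, remaining = [], set(stale)
    while remaining:
        # views whose stale dependencies have all been scheduled already
        ready = {v for v in remaining if not (dependencies.get(v, set()) & remaining)}
        if not ready:                      # defensive guard against cyclic definitions
            ordered.extend(sorted(remaining))
            break
        ordered.extend(sorted(ready))
        remaining -= ready
    return ordered

def monitor(db_updates, rebuild_confirmations, issue_rebuild):
    """db_updates yields sets of changed tables; rebuild_confirmations yields rebuilt view names."""
    for changed in db_updates:
        for view in rebuild_order(stale_views(changed, DEPENDENCIES), DEPENDENCIES):
            issue_rebuild(view)                      # instruct Looker to rebuild this PDT
            while next(rebuild_confirmations) != view:
                pass                                 # wait until this rebuild is confirmed
```

This keeps the ordering logic in one place: a rebuild is only issued once everything it depends on has already been confirmed as rebuilt.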

Implementation

To make this happen we need to be able to do three things:

1 - identify which tables and views each view depends upon

To do this, I used the Python yaml library to parse the LookML files for each view, which greatly simplifies matters. Personally, I created a full domain model of classes for each LookML concept, but then I’m a big fan of type safety. If you just want this tool to work, yaml parsing is enough.

I found the easiest way to identify dependencies was a regex pattern matching ${xxxx.SQL_TABLE_NAME}. I did something similar for SQL table names.
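For illustration, a rough sketch of that extraction. The patterns are assumptions: PDT references are taken to look like ${other_view.SQL_TABLE_NAME}, and the raw-table pattern is only a crude “from <table>” heuristic.

```python
import re

PDT_REF = re.compile(r"\$\{(\w+)\.SQL_TABLE_NAME\}")          # e.g. ${orders.SQL_TABLE_NAME}
TABLE_REF = re.compile(r"\bfrom\s+([\w.]+)", re.IGNORECASE)    # crude raw SQL table heuristic

def view_dependencies(lookml_path):
    """Return (referenced view names, referenced SQL table names) for one view file."""
    with open(lookml_path) as f:
        text = f.read()
    views = set(PDT_REF.findall(text))
    tables = set(TABLE_REF.findall(text))
    return views, tables
```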

2 - identify when the database has been updated

This step was relatively simple in our setup, as we have a set number of imports from the main production database throughout the day. We use a chat-ops system, so I made a Python generator that polls the chat channel for new update notifications.
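The poll-to-generator pattern looks roughly like this. fetch_channel_messages() and the “import finished: …” message format are placeholders for whatever your chat-ops setup actually produces.

```python
import time

def fetch_channel_messages():
    # placeholder for the real chat API call; returns a list of
    # {"id": ..., "text": ...} dicts in this sketch
    return []

def database_updates(poll_interval=60):
    """Yield the set of tables reported as freshly imported, one batch per notification."""
    seen = set()
    while True:
        for msg in fetch_channel_messages():
            if msg["id"] in seen or not msg["text"].startswith("import finished:"):
                continue
            seen.add(msg["id"])
            tables = msg["text"].split(":", 1)[1].split(",")
            yield {t.strip() for t in tables}
        time.sleep(poll_interval)
```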

3 - identify which pdts have been rebuilt

Sadly, Looker doesn’t automatically sync this information with our database. It is stored only in Looker’s database, but is made accessible through a look. Thankfully the Looker API makes this information quite easy to extract programmatically. In a similar way to part 2, I turned repeated polls of the API into a stream.
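A sketch of the same pattern for rebuild events. Here fetch_rebuild_look() stands in for running the look through the Looker API, and the pdt_name / completed_at fields are assumptions about what the look returns.

```python
import time

def fetch_rebuild_look():
    # placeholder for running the look via the Looker API and returning its rows
    return []

def pdt_rebuilds(poll_interval=60):
    """Yield (pdt_name, completed_at) once for each newly observed rebuild."""
    seen = set()
    while True:
        for row in fetch_rebuild_look():
            key = (row["pdt_name"], row["completed_at"])
            if key not in seen:
                seen.add(key)
                yield key
        time.sleep(poll_interval)
```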

Next steps

The system works great for one specific purpose: controlling the freshness of data in Looker versus the freshness of our Redshift database. However, I feel it has a lot of potential to go one step further and control the flow into the database in the first place.

By monitoring how fresh all of the views are and comparing this against criteria for how fresh each one needs to be, the system can ensure that we are pulling from the production database into Redshift optimally. This would allow very fine-grained control over the flow without having to manually specify complex configuration.

It would also allow the data pipeline to react intelligently to changes in usage. For example, if the system monitors which data is used and when, then it can incrementally assign more weight to data sources that are increasingly heavily used, and less weight to those that are declining in use.