Sync PSQL Materialized Views After Debounce Period

Sync PSQL Materialized View After Updates and Debounce Period

Abstract:
You have a PSQL database utilizing a “MV” materialized view and want an automated process to determine if the MV should be refreshed because the underlying tables have changed. It should also wait a period of time between the last known update to a table and then execute the REFRESH MATERIAL VIEW command.

This article will show you how to set up a couple of functions to handle that process on as many MVs as you need. Materialized Views are like actual tables in that their data is stored in a table rather than queried each time it is used, and they are like views in that you can’t modify the data in the MV… and so in order for a materialized view to properly reflect the data in the tables it gets the original data from there has to be an explicit instruction to refresh. Refreshing takes both time and resources, refreshing a materialized view may take long seconds or minutes to complete. In older versions of Postgresql this would even cause data to become temporarily inaccessible (and still will without the use of the CONCURRENTLY parameter circa Postgresql >= 9.4).

Note on Postgresql versions: I am using PostgreSQL 9.5 while writing this article, I’ll assume you’re using Postgresql ≥9.5. The reason for this is that the REFRESH MATERIALIZED VIEW parameter CONCURRENTLY wasn’t added until approx 9.4 I believe and ON CONFLICT wasn’t available until 9.5 as I’ve come to understand. The ON CONFLICT can be worked around super simple-easy enough read my notes at end of article

Note See my notes at the bottom of the page for a JavaScript function that can build out the entire SQL for everything, including setting up triggers on all your tables. All it requires is a JavaScript array of your table names and the material views they effect. (this is optional, you can write it all by hand too)


The Solution (step by step):

We’ll want to take into account:

  • Does the materialized view need to be refreshed?
  • How long since the last event occurred that might cause the materialized view to becoming out of sync with the data it represents?
  • Is that amount of time sufficient to conclude that changes are not likely to be made again? Because more changes mean more refreshes, more resources measurable by units of time/CPU/money. (We’ll label the minimum amount of time to wait after the most recent change was made to initiate a refresh as the “debounce” time)
  • Refresh the materialized view.
  • Whenever changes are made to relevant tables, the timestamp will be recorded as an update to a table we’ll build with the sole purpose of reaching the above conclusions.

The project I used this technique for has two materialized views in our project, but one uses the other and so I don’t need to track them separately. I appreciate that is not the case for everyone and so I wrote this to account for the possibility of multiple independent materialized views.


Step 1. Create a Table To Record Latest Updates

We’ll create a table to keep track of when specific tables have been changed. This table “table_update_records” will be an audit of the latest changes to all tables that contain data our materialized view/s are meant to represent (as well as the update time of the materialized view). We’ll use these times later to determine if the debounce time has passed, which is our buffer to reduce the likelihood that we’ll need to make subsequent refreshes to an MV.

1
2
3
4
5
6
7
8
9
10
11
-- Create the table
CREATE TABLE IF NOT EXISTS table_update_records (
-- COLUMNS
id SERIAL NOT NULL,
table_name VARCHAR(50) NOT NULL,
rel VARCHAR(50) NOT NULL,
updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
-- CONSTRAINTS
PRIMARY KEY (id),
UNIQUE (table_name, rel)
);

Columns in the table_update_records table:

id table_name rel udpated
Type: SERIAL VARCHAR(50) VARCHAR(50) TIMESTAMP
Purpose: Primary Key When was the update Table that was updated MV that would be effected by update
Notes: only required to track multiple MV refresh plans  



If you’re not using Postgresql >= 9.5 then you will need to insert a row for each table that is a dependency of some materialized view (and the MV needs a row as well). Otherwise, you don’t need to do this manually as we’ll be using “upserts”. I feel having to do it manually is more prone to error, however, let’s insert some rows by hand to the table_update_records table in order to check out that it’s what we expect it to be and quickly review the difference between an update and an upsert before continuing. Imagine 2 tables and a materialized view respectively as (employees, profiles, and mv_profiles)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- If your Materialized View depended on the `employees` and `profiles` table then:
INSERT INTO table_update_records (table_name, rel) VALUES ('employees', 'mv_profiles');
INSERT INTO table_update_records (table_name, rel) VALUES ('profiles', 'mv_profiles');
-- Updating the time `employees` was updated is simple since we know the row exists
UPDATE table_update_records SET updated = current_time
WHERE table_name = 'employees' AND rel = 'mv_profiles';
-- If we weren't sure whether the row exists or not then an upsert is our best choice:
INSERT INTO table_update_records (table_name, rel) VALUES ('employees', 'mv_profiles')
ON CONFLICT (table_name, mv_profiles) DO UPDATE
SET updated = current_timestamp
WHERE table_update_records.table_name = 'employees'
AND table_update_records.rel = 'mv_profiles';


Step 2. Create a Function To Trigger Record Updates

The table table_update_records won’t update itself, let’s write a trigger function to do the updating for us. This function will trigger on any tables related to specific materialized views after tables perform some INSERT, UPDATE or DELETE.

I wrote two versions of this function. Both versions take a single argument (the materialized views table name), PSQL Trigger Functions can’t have explicit arguments so we’ll pass the argument when we set the trigger and access it on the TG_ARGV[] arguments array at “runtime”. Both also run after any INSERT, UPDATE, DELETE query and update the appropriate table_update_records.updated value to current_time.

Version A just assumes that if an UPDATE/INSERT/DELETE was made to a table then something was actually changed (no need to compare OLD and NEW row values for each row:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- "VERSION A". ASSUME CHANGE WAS MADE IF TRIGGERED, NO QUESTIONS ASKED
CREATE OR REPLACE FUNCTION t_fn_update_table_change ()
RETURNS TRIGGER AS
$fn$
DECLARE
mat_view TEXT;
BEGIN
mat_view := TG_ARGV[0];
INSERT INTO table_update_records (table_name, rel) values (TG_TABLE_NAME, mat_view)
ON CONFLICT (table_name, rel)
DO UPDATE
SET updated = CURRENT_TIMESTAMP
WHERE table_update_records.table_name = TG_TABLE_NAME
AND table_update_records.rel = mat_view;
RETURN NULL;
END;
$fn$
LANGUAGE plpgsql;


ALTERNATIVE

Version B will run after any INSERT, UPDATE, DELETE query, but it will check each row that was touched by the query and test that something is actually different.

If you want to make sure that a change was actually made, and that an update didn’t occur which simply left the table in the same state it was in before, then you can replace the above function with the following alternative version.

version B doesn’t assume that every UPDATE/INSERT/DELETE that is ran actually changes something, it checks each row involved for changes (I didn’t use this version because the likelihood of it being useful in my apps case is almost zero, if you feel that false positives “ie: refreshing materialized views for no reason” could be a problem then this version is for you):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- B. CHECKS 'FOR EACH ROW' AFTER AN UPDATE TO DETERMINE IF DISTINCT.
CREATE OR REPLACE FUNCTION t_fn_update_confirmed_table_change ()
RETURNS TRIGGER SECURITY DEFINER AS
$fn$
DECLARE
mat_view_name TEXT;
BEGIN
mat_view_name := TG_ARGV[0];
-- Check if a change has occurred or not
IF NEW IS DISTINCT FROM OLD THEN
INSERT INTO table_update_records (table_name, rel) values (TG_TABLE_NAME, mat_view_name)
ON CONFLICT (table_name, rel)
DO UPDATE
SET updated = CURRENT_TIMESTAMP
WHERE table_update_records.table_name = TG_TABLE_NAME
AND table_update_records.rel = mat_view_name;
END IF;
RETURN NULL;
END;
$fn$
LANGUAGE plpgsql;


Step 3. Set the trigger function to run on potential changes

1
2
3
4
5
6
7
8
9
10
11
12
-- Make sure to replace 'profiles' and 'mv_profiles' with tables and view
-- A. If you used version A above, then add the trigger like so:
DROP TRIGGER IF EXISTS t_report_table_update ON profiles;
CREATE TRIGGER t_report_table_update
AFTER INSERT OR UPDATE OR DELETE ON profiles
EXECUTE PROCEDURE t_fn_update_table_change('mv_profiles');
-- B. If you used version B from above, (checks each row), add triggers like:
DROP TRIGGER IF EXISTS t_report_confirmed_table_update ON users;
CREATE TRIGGER t_report_confirmed_table_update
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE PROCEDURE t_fn_update_confirmed_table_change('mv_profiles');


Step 4. Create a function to compare update times and refresh materialized views if needed

Basically, we are going to write a Postgresql function that will check the table_update_records that audits most-recent updates and for some materialized view mat_view:

  • Query the table_update_records table and see if the last updated table with rel = mat_view was a table, not the actual materialized view mat_view.
  • If so… has it has been longer than the debounce time (ie: 1 minute, thirty-minutes, whatever) from the time of the last update and current time now?
  • If so… Refresh the materialized view mat_view, and then update table_update_records.updated value where table_update_records.table_name is mat_view. So essentially after a refresh, we record an updated time of a change in the Materialized View row on table_update_records,



The way we know if the materialized view is out of sync is if the time of its last update is earlier than the update of a table with rel equal to the materialized view. So this means that we’ll need a row in the table_update_records row for the materialized view. I’ve made sure this will be added if not present with an upsert in the below function, you may enter it manually if you need to replace the upsert in the function with a normal insert.

The following function is not a trigger, it’s just a function, you will need to set up some mechanism to run it on a schedule. You may want to use a cron job, or if running on AWS a Lambda function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
create or replace function fn_cron_smart_mv_sync(debounce_time INTERVAL, mat_view_name TEXT)
RETURNS TEXT as $fn$
DECLARE
updates TIMESTAMP[];
last_table TEXT;
last_update TIMESTAMP;
BEGIN
-- Select most recent updated table, if not mat_view_name it may need a refresh.
SELECT table_name, updated INTO last_table, last_update
FROM table_update_records
WHERE rel = mat_view_name
ORDER BY updated DESC;
IF NOT FOUND THEN
RETURN 'The table_update_records table has nothing related to ' || mat_view_name;
END IF;
-- Test if the materialized view is in sync (most recent recorded updated table)
IF (last_table = mat_view_name) THEN
RETURN 'Materialized View ' || mat_view_name || ' is in sync currently';
END IF;
-- The materialized view is out of sync, so we check if debounce period has passed.
IF (age(current_timestamp, last_update) < debounce_time) THEN
-- "debounce" period has not elapsed since the last time, so return.
RETURN 'Materialized View ' || mat_view_name || ' will be updated in '
|| age(last_update + debounce_time, current_timestamp) || '.';
END IF;
-- We do need to do a refresh, set updated to current_timestamp for materialized view
-- This way if a table updates while the materialized view is refreshing, that update
-- won't be swallowed.
INSERT INTO table_update_records (table_name, rel)
VALUES (mat_view_name, mat_view_name)
ON CONFLICT (table_name, rel) DO UPDATE
SET updated = current_timestamp
WHERE table_update_records.table_name = mat_view_name
AND table_update_records.rel = mat_view_name;
-- Do the work
EXECUTE 'REFRESH MATERIALIZED VIEW CONCURRENTLY ' || mat_view_name::regclass;
-- All Done
RETURN 'Materialized View ' || mat_view_name || ' has been refreshed';
END;
$fn$
language plpgsql;



GOTCHAS:

  • Use SECURITY DEFINER in the declaration so that whichever user you use to
    execute the function will have correct permissions to refresh tables. In
    other words, you want a role/user on the cron job with low low priviledges,
    but you need those priviledges to be able to run the commands inside the
    function (duh’), so SECURITY DEFINER runs the function as the user who
    defined it.
  • Make sure to watch the order of the age() function, think of it like subtraction, the left value must be larger than the right or you’ll end up with a negative. Since the current_timestamp will always be greater than a moment from the past, age(current_timestamp, some_old_timestamp) should always be a positive time value. You can compare it directly to a string representation of some time unit, like “> ‘10 minutes’” or “< 1 hour”.
  • Materialized Views cannot REFRESH with the CONCURRENTLY parameter if the materialized view hasn’t been populated yet or if it doesn’t have a unique index. The latter being the more likely point of failure. So if you didn’t include a unique index on your materialized view, then you can add one now. It can be a combination of columns, but it cannot include a WHERE clause. You might even want to alter the view, add a new column that stands as a unique index if the alternative is making a 5 column unique index.
    So something like:
    1
    2
    3
    4
    5
    6
    7
    -- This Unique Index would work:
    create unique index uniq_indx_mv_profiles_code_state ON mv_profiles (code, state);
    -- This Unique Index would work:
    create unique index uniq_indx_mv_profiles_some_col ON mv_profiles (some_uniq_col);
    -- This Unique Index would not work (because of WHERE constraint):
    create unique index uniq_indx_mv_profiles_guid ON mv_profiles (guid) WHERE id > 0;


Step 5. Creating a cron job.

Now all that is left is to schedule the following line of SQL to run every 10 minutes, or 1 hour, or once a day. The two arguments to the function should be the ‘debounce_time’ and the ‘rel’ value (materialized view table name).

1
2
-- Replace '15 minutes' with your debounce time and 'mv_profiles' with your materialized view
select fn_cron_smart_mv_sync('15 minutes', 'mv_profiles');

Remember, the debounce time isn’t the same as the cron job time. You might have this function run every 1 minutes with a debounce time of 15 minutes, which results in a 15 - 16-minute window between when the last change related to the data in our materialized view was made and the view refreshing.

I will leave the “how” up to you, basically, you need to create some script (bash, python, javascript) and a cron job or pgAdmin scheduled procedure. Since my app is on AWS I don’t have pgAdmin setup and so I chose AWS Lambda and a Python function. Easy peasy lemon squeezy.

Tracking multiple tables

You can set up multiple cron jobs or just make multiple SQL queries that change the arguments passed into the fn_cron_smart_mv_sync so it can track multiple tables.

The output from function return

Expect a call like select fn_cron_smart_mv_sync('10 minutes', 'mv_profiles'); will return results such as:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- If no tables have updated recently then something like:
fn_cron_smart_mv_sync
----------------------------------------------------------
Materialized View mv_profiles is in sync currently
(1 row)
-- If tables have updated recently but the debounce time hasn't passed then:
fn_cron_smart_mv_sync
-------------------------------------------------------------------------
Materialized View mv_profiles will be updated in 00:09:19.314126.
(1 row)
-- Otherwise, the table may refresh (which could take some time), then:
fn_cron_smart_mv_sync
--------------------------------------------------------
Materialized View mv_profiles has been refreshed
(1 row)