Query Automation in Snowflake

Marko Sedlak

/ 2023-07-18

Introduction

Query automation is the ability to schedule and orchestrate query execution and result storage. It is an important part of fraud detection systems, especially in moving them towards real-time operation, since many processes require data to be fully processed and complete before it can be used. I will describe and explain one such system, built as part of a client project. The client was a bank team whose daily work included manually running many queries to check whether data was ready, storing results once they were available, and forwarding them downstream to detection/alerting systems. All of these processes require little human interaction and are excellent candidates for automation, and the client was quick to recognize this.

Features

The core feature set of a query automation engine (QAE) is:

  • Queries are evaluated first, validating the query itself, the target table structure and the schedule; on successful evaluation they are copied to the execution portion and scheduled.
  • The engine stores the database role that submitted the query and uses it when executing the query, so role restrictions are maintained. The engine itself runs under a service role with elevated permissions that can perform all the required operations.
  • Along with the main query, a prerequisite query and a prerequisite period are stored. The prerequisite query checks data readiness for the main query and postpones execution until the data is ready or the prerequisite period expires (see the sketch after this list).
  • Notification queries write their output to a special notifications table, which is used as the source for sending notifications over AWS.
  • All processes are logged to corresponding log tables (poller, evaluation and execution logs).
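
To make the stored metadata concrete, here is a minimal sketch of a query registry table. The schema, table and column names are illustrative assumptions, not the engine's actual structure.

    -- Hypothetical registry table; all names and types are assumptions.
    CREATE TABLE IF NOT EXISTS qae.query_registry (
        query_id            NUMBER IDENTITY,
        query_name          VARCHAR,
        main_query          VARCHAR,   -- query whose results are stored
        target_table        VARCHAR,   -- validated against the query output
        submitter_role      VARCHAR,   -- role assumed when executing the query
        prerequisite_query  VARCHAR,   -- readiness check for the main query
        prerequisite_period NUMBER,    -- minutes to keep retrying the prerequisite
        schedule            VARCHAR,   -- cron expression for the scheduling task
        is_active           BOOLEAN DEFAULT TRUE
    );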

Architecture

The whole engine is built in Snowflake, apart from notifications. Snowflake brings many advantages and a few challenges as well. The main advantage is tasks: database objects that execute a SQL statement on a schedule or on demand. There are three levels of task use:

  1. Poller tasks - One task could be enough here, but we decided to use two separate ones for evaluation and execution to avoid blocking. These tasks run every minute, look for candidate queries to evaluate/execute and run them.
  2. Scheduling tasks - These insert queries into the execution_stack table (an intermediate table that allows for prerequisite waiting). Every query gets one corresponding scheduling task that uses the query's schedule to insert it for execution.
  3. Execution tasks - These are used to achieve parallel execution. Every query gets one corresponding execution task that remains in a suspended state and is run using the EXECUTE TASK command (these commands are issued in series, but the resulting executions are parallel). Before that, we alter the task to change the stored procedure parameters, since they are hardcoded in the task's SQL (see the sketch after this list).
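
To make the three task levels concrete, below is a minimal sketch in Snowflake SQL. All object names (the qae schema, the qae_wh warehouse, the stored procedures) are assumptions for illustration, not the engine's actual code.

    -- 1. Poller task: runs every minute and looks for work.
    CREATE OR REPLACE TASK qae.poll_execution
      WAREHOUSE = qae_wh
      SCHEDULE = '1 MINUTE'
    AS
      CALL qae.poll_and_execute();

    ALTER TASK qae.poll_execution RESUME;

    -- 2. Scheduling task: one per query; inserts the query into the
    --    execution_stack table on the query's own schedule.
    CREATE OR REPLACE TASK qae.schedule_query_42
      WAREHOUSE = qae_wh
      SCHEDULE = 'USING CRON 0 6 * * * UTC'
    AS
      INSERT INTO qae.execution_stack (query_id, queued_at)
      VALUES (42, CURRENT_TIMESTAMP());

    ALTER TASK qae.schedule_query_42 RESUME;

    -- 3. Execution task: one per query; stays suspended. The poller first
    --    rewrites its body (the parameters are hardcoded in the task's SQL),
    --    then triggers a single run manually.
    CREATE OR REPLACE TASK qae.execute_query_42
      WAREHOUSE = qae_wh
    AS
      CALL qae.run_query(42);

    ALTER TASK qae.execute_query_42 MODIFY AS CALL qae.run_query(42);
    EXECUTE TASK qae.execute_query_42;  -- issued serially; runs are parallel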

[Figure: High-level engine overview]

Building the QAE this way keeps the design high-level, since tasks solve most, if not all, of the low-level problems. The tasks themselves run stored procedures. The whole QAE is automated, apart from two stored procedures that users interact with: one for activating a query and one for deactivating it. Activation performs the validations and enters the query into the QAE. Deactivation suspends the task, marks the query as inactive in the execution table and deletes it from the execution stack.
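
For illustration, the user-facing interaction might look like the calls below. The procedure names and signatures are hypothetical, not the engine's actual API.

    -- Hypothetical user-facing calls; names and arguments are assumptions.
    CALL qae.activate_query(
        'daily_fraud_totals',           -- query name
        $$ INSERT INTO rpt.fraud_totals
           SELECT event_date, SUM(amount) FROM stg.txn GROUP BY event_date $$,
        $$ SELECT COUNT(*) > 0 FROM stg.txn
           WHERE load_date = CURRENT_DATE() $$,  -- prerequisite query
        'USING CRON 0 6 * * * UTC',     -- schedule
        120                             -- prerequisite period in minutes
    );

    CALL qae.deactivate_query('daily_fraud_totals');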

Challenges

The biggest challenge is something we discovered as the load in production increased. Snowflake is an OLAP system, so we assumed transactions are not its strong point. Snowflake performs transaction queueing, meaning it will queue up to 20 transactions against a single table and try to execute them when conditions allow. The problem is that every transaction after the 20th fails with an error. The first 20 errors get queued to the log table, then the log table itself gets locked by the queued transactions, and subsequent transactions fail with the same error and are recorded only in the information_schema system tables.

This can be partially addressed by increasing the warehouse size, the maximum concurrency level and/or the maximum cluster count (see the sketch below). However, all three methods only raise the limit; there is no way to remove it. Some design considerations can further reduce the probability of this error. An example would be limiting the total number of parallel tasks to a number lower than or equal to 20 (in our case we chose 15). Every task execution logs its work to the log table, updates the execution stack and more. This keeps the probability of more than 20 transactions queueing against any one of these tables reasonably low, and the error stops occurring.
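
As a sketch of the three knobs mentioned above (the warehouse name is an assumption, and MAX_CLUSTER_COUNT requires a multi-cluster warehouse):

    -- Raise, but not remove, the transaction queueing ceiling.
    ALTER WAREHOUSE qae_wh SET
        WAREHOUSE_SIZE = 'MEDIUM'        -- more compute per cluster
        MAX_CLUSTER_COUNT = 3            -- allow scale-out to more clusters
        MAX_CONCURRENCY_LEVEL = 12;      -- more concurrent statements per cluster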

Conclusion

There are already plenty of products like the QAE I described. Some have a graphical UI and different query lifecycles. While most of them are general purpose, the QAE has specific features that make it tailored to the client's needs. Total costs are also significantly lower, as there is no license fee (most of the QAE costs depend on warehouse size and the number of task runs). In retrospect, using an OLAP system introduced the only significant limitation: the 20 queued transactions limit. Currently, another team in the client bank is interested in using the QAE, and the most likely solution will be a separate installation, possibly with some changes to the feature set. On the other hand, OLAP minimizes query execution times, which is possibly the most important metric, so using an OLTP system would have its own downsides as well.

Do you need to automate your processes? Feel free to contact us; our team of experts has years of experience and can help you reduce risk and optimize your time.

Banks deal with large amounts of sensitive data daily. If you would like to know more about our solutions for banking institutions, find out how we reduced risk with bank loan default prediction.

