It’s common to store hierarchical data in SQL and recursive queries are a convenient way to extract information from such graphs. Organizational structure, application menu structure, a set of tasks with sub-tasks in the project, links between web pages, breakdown of an equipment module into parts and sub-parts are examples of the hierarchical data. The post will not go into great details of those many use cases rather look at two toy examples to understand the concept - the simplest possible case of recursion on numbers and querying data from the family tree.
Let's think about queries as a function. In a sense that a function takes an input and produces an output. Queries operate on relations or one could say tables. We will denote those as Rn. Here is a picture of a query. It takes three relations R1, R2, R3 and produces an output R. Simple enough.
SQL example: SELECT <something> FROM R1, R2, R3 WHERE <condition>
Query can take something and produce nothing:
SQL example: SELECT <something> FROM R1 WHERE 1 = 2
Take nothing and produce something:
SQL example: SELECT 1
Or take nothing and produce nothing
SELECT 1 WHERE 1 = 2
Recursion is achieved by WITH statement, in SQL jargon called Common Table Expression (CTE). It allows to name the result and reference it within other queries sometime later.
WITH R A (<query>)
<query involving R>
Here is a sample.
WITH R AS (SELECT 1 AS n)
SELECT n + 1 FROM R
Query (SELECT 1 AS n) now have a name — R. We refer to that name in SELECT n + 1 FROM R. Here R is a single row, single column table containing number 1. The result of the whole expression is number 2.
The recursive version of WITH statement references to itself while computing output.
WITH R AS (<query involving R>)
<query involving R>
For the recursion to work we need to start with something and decide when the recursion should stop. To achieve this, usually recursive with statement has following form.
Important to note that base query doesn’t involve R, but recursive query references R. From the first look it seems like infinite loop, to compute R we need compute R. But here is a catch. R actually don’t reference itself, it just references previous result and when previous result is empty table, recursion stops. Actually it could help to thing of it as an iteration rather then recursion!
Here’s what is happening: base query executed first, taking whatever it needs to compute the result R0. Second recursive query is executed taking R0 as input, that is R references R0 in the recursive query when first executed. Recursive query produces the result R1 and that is what R will reference to at the next invocation. And so on until recursive query returns empty result. At that point all intermediate results are combined together.
Recursive query execution algorithm flow chart
Recursive query execution sequence
Quite abstract now. Lets take a concrete example, count until 3.
Base query returns number 1 , recursive query takes it under the countUp name and produces number 2, which is the input for the next recursive call. When recursive query returns empty table (n >= 3), the results from the calls are stacked together.
Watch out, counting up like that can only go that far. There is a limit for recursion. It defaults to 100, but could be extended with MAXRECURSION option (MS SQL Server specific). Practically, it could be a bad idea to crank recursion limit up. Graphs might have cycles and limited recursion depth can be a good defense mechanism to stop poorly behaving query.
OPTION (MAXRECURSION 200)
Here’s another example, find ancestors of a person:
Base query finds Frank’s parent — Mary, recursive query takes this result under the Ancestor name and finds parents of Mary, which are Dave and Eve and this continues until we can’t find any parents anymore.
Now this tree traversal query could be the basis to augment the query with some other information of interest. For example, having a birth year in the table we can calculate how old the parent was when the child was born. Next query do exactly that, together with showing lineages. To do that it traverses the tree from top to bottom. parentAge is zero in the first row because we don’t know when Alice was born from the data we have.
Take away — recursive query references the result of base query or previous invocation of recursive query. Chain stops when recursive query returns empty table.
Source: Medium - by Denis Lukichev
The Tech Platform
Comments