Next
Topic:Cast Functions |

OLAP (OnLine Analytical Processing) functions are tools for analyzing and presenting multidimensional data. The OLAP functions supported by SAND CDBMS can be used for various forms of data ranking and row numbering. In all cases, the input data can be partitioned by field(s), so that the function applies separately to each partitioned set. If there is no explicit partitioning, the whole set of input rows is considered one single partition.

The following OLAP functions are supported:

CUME_DIST()

Gives the cumulative distribution of rows in a partition, based on the ordering specified in the OVER clause, with the percentile ranking of each row expressed as a floating point value ranging from0to1. In ascending order, this value is calculated as the number of rows ranked the same or lower than the current row (including the current row), divided by the total number of rows in the partition. In descending order, the value is the number of rows ranked higher than or equal to the current row, divided by the total number of rows in the partition.This function takes no parameters, so empty parentheses must be specified immediately after the function name.

DENSE_RANK()

Gives the numerical rank for each row in a partition, based on the ordering specified in the OVER clause. The ranking starts with1for the highest rank, and increments by 1 for each lower ranking. If two or more rows are tied in the ordering, they will all share the same rank value.The only difference between the DENSE_RANK() and RANK() functions is in how ranking ties are handled. With DENSE_RANK() there is no gap in the numerical ranking sequence following tied ranks (for example, 1, 2, 2, 2, 3), whereas there is a sequence gap with RANK() (for example, 1, 2, 2, 2, 5).

This function takes no parameters, so empty parentheses must be specified immediately after the function name.

NTILE(integer)

Gives the tiled rank for each row in a partition, based on the ordering specified in the OVER clause. A positive integer, representing the total number of tiled ranks, must be provided for this function. For instance, NTILE(4) means that the input data will be ranked 1-4, with the top 25% ordered rows in the first rank, the next 25% in the second rank, the next 25% in the third rank, and the bottom 25% in the fourth rank.In cases where the specified NTILE number does not divide evenly into the number of input rows, the number of rows in each tiled rank is calculated as

CEILING( <number of rows in the partition>/<NTILE number> )

except for the lowest rank, which will have one or more less rows than the higher ranks.

If the specified NTILE value exceeds the number of elements in the partition, each row will be ranked with a unique number, resembling the behavior of the ROW_NUMBER() function.

PERCENT_RANK()

Gives a floating point ranking value between0and1for each row in a partition, based on the ordering specified in the OVER clause. The ranking value for a row is calculated as <the row's rank in the partition> – 1, divided by <the total number of rows in the partition> – 1. The first row is always ranked0, and the last row is always ranked1. If there is a single row in the partition, it has a rank of0.This function takes no parameters, so empty parentheses must be specified immediately after the function name.

RANK()

Gives the numerical rank for each row in a partition, based on the ordering specified in the OVER clause. The ranking starts with1for the highest rank, and increments by 1 for each lower ranking, except when there are ties in rank. If two or more rows are tied in the ordering, they will all share the same rank value, and the next rank number will be the previous rank number + the number of rows tied for the previous rank.The only difference between the DENSE_RANK() and RANK() functions is in how ranking ties are handled. With DENSE_RANK() there is no gap in the numerical ranking sequence following tied ranks (for example, 1, 2, 2, 2, 3), whereas there is a sequence gap with RANK() (for example, 1, 2, 2, 2, 5).

ROW_NUMBER()

Gives a number for each row in the partition, based on the ordering specified in the OVER clause. The rows start with1and increment by 1 for each subsequent row, regardless of ties in the ordering.

NTILE (

integer)

The NTILE function requires a positive integer as a parameter. This value sets the total number of tiled ranks that will be included in the results. So, for example, NTILE (100) will rank the results 1-100 for each partition, with approximatelyn/100 rows in each rank (wherenis the total number of rows in the partition).

PARTITION BYcolumn name

The optional PARTITION BY column list specifies one or more fields on which to split the input data. Each row that shares a common value for the partition field will be considered part of the same partition, and all OLAP functions included in the statement will be applied separately for each partition. Multiple partition fields can be specified, separated by commas. When there are multiple partition fields, rows must have the same value in each field to be part of the same partition.If the PARTITION BY clause is omitted, the entire set of input rows is considered a single partition.

ORDER BYexpressionThe ORDER BY clause defines the way the input data is ordered within each partition for the OLAP function. The value expression that sets the order can consist of any combination of column names, functions, constants, and/or scalar subqueries. Multiple ordering expressions can be specified, separated by commas.

Note that this ORDER BY clause is independent of any other ORDER BY clause appearing in the SQL statement, and does not necessarily reflect the way the result set will be ordered.

ASC | DESC

The results of the preceding ORDER BY can be sorted in ascending order (the default) or descending order by specifying the ASC or DESC keywords, respectively.

NULLS FIRST | NULLS LAST

The results of the preceding ORDER BY can have null values sortedbeforenon-null values orafter(the default) by specifying the NULLS FIRST or NULLS LAST option, respectively.Note that the sorting of nulls is independent of the ASC/DESC option, so NULLS LAST will put the null values at the bottom, and NULLS FIRST will put the null values at the beginning, regardless of whether the non-null values are in ascending or descending order.

Consider this table called

Sales:

TDATEPARTQTYTSALESCREDIT2008-01-04 nail 610 68.95 Y2008-01-04 rivet 143 40.56 N2008-01-07 nail 756 85.34 Y2008-01-07 screw 94 12.79 N2008-01-07 nut 478 27.02 N2008-01-09 nut 350 19.80 N2008-01-09 rivet 85 **** Y2008-01-10 nail 917 103.55 Y2008-01-10 bolt 412 65.12 Y2008-01-10 screw 169 22.93 N2008-01-14 rivet 184 52.00 Ywhere

****represents a null value.

The following SELECT statement on theSalestable returns the cumulative distribution of unique values in thePartfield:SELECT Part AS part,

SUM(TSales) AS total_sales,

CUME_DIST() OVER ( ORDER BY SUM(TSales) ) AS cume_dist

FROM Sales

GROUP BY Part

ORDER BY SUM(TSales) ;Output:

part total_sales cume_dist

------- ----------- ----------------------

screw 35.72 2.00000000000000e-01

nut 46.82 4.00000000000000e-01

bolt 65.12 6.00000000000000e-01

rivet 92.56 8.00000000000000e-01

nail 257.84 1.00000000000000e+00Ordering by the sum of

TSales(ascending, which is the default), the CUME_DIST() function shows where each part lies in the distribution of sales by part. In this case, screws have the lowest total sales, such that this part is in the bottom 20% of the cumulative distribution, which also means that 80% of the other parts have higher total sales. Next is the nut, which is in the bottom 40% of total sales by part, followed by the bolt at 60% and the rivet at 80% of the cumulative distribution. Finally, the nail is at 100% of the cumulative distribution, meaning 0% (that is, none) of the other parts have more total sales than the nail.

The next example includes both the NTILE() and PERCENT_RANK() functions:SELECT TDate AS date,

SUM(TSales) AS total_sales,

NTILE(3) OVER ( ORDER BY SUM(TSales) DESC ) AS ntile,

PERCENT_RANK() OVER ( ORDER BY SUM(TSales) ) AS percent_rank

FROM Sales

GROUP BY TDate

ORDER BY SUM(TSales) ;Output:

date total_sales ntile percent_rank

---------- ------------ ----- ----------------------

2008-01-09 19.80 3 0.00000000000000e+00

2008-01-14 52.00 2 2.50000000000000e-01

2008-01-04 109.51 2 5.00000000000000e-01

2008-01-07 125.15 1 7.50000000000000e-01

2008-01-10 191.60 1 1.00000000000000e+00Here, we are grouping total sales by date, and presenting the information in ascending order of sales per day. The NTILE() function serves to rank the rows in descending order of sales per day, according to a "tertile" (1 to 3) ranking because

3was specified as the parameter for NTILE(). So in thentilefield, the lowest total sales is given the lowest rank (3), the next two total sales are given the middle rank (2), and the two highest sales totals are given the top rank (1).Also in the above example, the PERCENT_RANK() function is based on the ascending order of total sales, so it displays for each row in floating point form what percentage of the other rows (excluding the current one) have lower total sales. The first row has the lowest total sales (19.80), so 0% of the other rows are ranked below it. For the next row (with 52.00 in total sales), one out of the four other rows is below it in total sales, so the percent rank is 25%. The percent rank for the next rows are calculated in the same manner. The row with the highest total sales (191.60) has all four of the other rows below it, so the percent rank for that last row is 100%.

The following example illustrates the difference between the RANK() and DENSE_RANK() functions:SELECT Part AS part,

SUM(Qty) AS qty,

RANK() OVER ( ORDER BY SUM(qty) DESC ) AS rank,

DENSE_RANK() OVER ( ORDER BY SUM(qty) DESC ) AS dense_rank

FROM Sales

GROUP BY part

ORDER BY rank ;Output:

part qty rank dense_rank

------ ----- ---- ----------

nail 2283 1 1

nut 828 2 2

rivet 412 3 3

bolt 412 3 3

screw 263 5 4Both of the OLAP functions in the above example rank the summed

qtyvalues per part in descending order. So the highestqtyvalue is given a rank of 1, the next highest a rank of 2, and so on. The difference arises after we have a tie in theqtyvalues. Both of the functions assign a rank of 3 to the tied rows, but the subsequent row is given a rank of 5 by RANK() and a rank of 4 by DENSE_RANK(). No matter how many ties in the ranking there are, DENSE_RANK() will never skip a number in the numerical sequence; whereas RANK(), after ties in the ranking, will assign to the next row a number equal to the tied rankplusthe number of rows tied for that rank (in this case, a tied rank of3+2tied rows =5).

The final example shows the use of the ROW_NUMBER() function with partitioning and ordering that places null values at the beginning:SELECT ROW_NUMBER() OVER (PARTITION BY Credit ORDER BY TSales DESC NULLS FIRST) AS row,

TDate AS date,

Part AS part,

TSales AS sales,

Credit AS credit

FROM Sales

ORDER BY credit, row ;Output:

row date part sales credit

--- ---------- ------- ------- ------

1 2008-01-04 rivet 40.56 N

2 2008-01-07 nut 27.02 N

3 2008-01-10 screw 22.93 N

4 2008-01-09 nut 19.80 N

5 2008-01-07 screw 12.79 N

1 2008-01-09 rivet **** Y

2 2008-01-10 nail 103.55 Y

3 2008-01-07 nail 85.34 Y

4 2008-01-04 nail 68.95 Y

5 2008-01-10 bolt 65.12 Y

6 2008-01-14 rivet 52.00 YIn the above example, the ROW_NUMBER() function operates separately on the data sets created by partitioning the

Creditfield. Since there are two unique values forCredit('N' and 'Y'), there are two partitions. For each partition, the row numbers (starting at 1 and incrementing by 1) are assigned to the rows in descending order of sales.In addition, the ROW_NUMBER() ordering specifies the NULLS FIRST option, which means that any row with a null value in the

TSalescolumn will be placed at the start of the partition, rather than at the end, which is the default. Hence, the first row in the second partition is assigned a row number of 1 because itssales(TSales) value is null (represented by****).

Next
Topic:Cast Functions |