© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Stored procedures in Amazon Redshift
ANT337-R
Naresh Chainani, Senior Development Manager, Amazon Web Services
Joe Harris, Senior Database Engineer, Amazon Web Services
Agenda
Why stored procedures?
Feature adoption
Frequently asked questions
Best practices
Demos
Why stored procedures?
Wrap complex ETL logic in reusable stored procedures
Eliminate network round trips
Security: reduce the number of admin users
Simpler “lift & shift” legacy data warehouse (DW) migrations
Simplified migrations
Stored procedures enable simple “lift and shift” migrations
Customers have thousands of existing stored procedures that they do not want to rewrite in Java, Python, or an ETL tool
PL/pgSQL provides loops, conditionals, case statements, and IN/OUT/INOUT argument passing
Use AWS Schema Conversion Tool to easily migrate from SQL Server, Oracle, Netezza, Teradata, and Greenplum
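As a minimal sketch of these PL/pgSQL constructs (the procedure, tables, and columns here are hypothetical, invented purely for illustration):
-- Hypothetical: IN and OUT arguments with a conditional
CREATE OR REPLACE PROCEDURE merge_if_ready(run_date IN DATE, rows_in OUT INT)
AS $$
BEGIN
  SELECT COUNT(*) INTO rows_in FROM stg_sales WHERE sale_date = run_date;
  IF rows_in > 0 THEN
    INSERT INTO sales SELECT * FROM stg_sales WHERE sale_date = run_date;
  ELSE
    RAISE INFO 'No staged rows for %', run_date;
  END IF;
END;
$$ LANGUAGE plpgsql;
-- OUT arguments are not supplied in the CALL
CALL merge_if_ready('2019-12-02');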
Security
GRANT and REVOKE
Only the owner of a procedure can execute it
Permission must be explicitly granted for others to execute a procedure
SECURITY INVOKER
The caller executes the procedure as themselves and must have appropriate privileges to access/modify database objects
This is the default setting
SECURITY DEFINER
The caller executes the procedure with the owner’s privileges
Allows security admins to avoid granting broad access that a user would otherwise need
Security definer use cases
Destructive actions
ETL users can run DELETE or TRUNCATE on important tables without being granted table owner or superuser privileges
Access control
Provide fine-grained access to sensitive data
Revoke access to the underlying table and allow authorized users to access the data via a stored procedure
Use caution
Grant EXECUTE on such stored procedures to specific users, not to the public. This avoids potential misuse of a powerful feature.
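A minimal sketch of the pattern (procedure, table, and user names are hypothetical):
-- Owner (e.g., an admin) creates the procedure with SECURITY DEFINER
CREATE OR REPLACE PROCEDURE sp_truncate_staging()
AS $$
BEGIN
  TRUNCATE stg_sales;  -- runs with the owner's privileges
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;
-- Grant EXECUTE to a specific ETL user only, never to PUBLIC
GRANT EXECUTE ON PROCEDURE sp_truncate_staging() TO etl_user;
-- etl_user can now clear the staging table without owning it:
-- CALL sp_truncate_staging();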
Modular code, clean abstractions
Maintaining table relationships and ensuring data integrity can lead to complex ETL logic
Stored procedures provide code encapsulation for SQL, making it easier to debug and understand when things do not work as expected
CREATE OR REPLACE PROCEDURE daily_ingest()
AS $$
BEGIN
CALL load_main_tables();
CALL fix_violations();
END;
$$ LANGUAGE plpgsql;
BEGIN; CALL daily_ingest(); COMMIT;
-- `psql` output
BEGIN
INFO: - Load Main -
INFO: `sales` table - 200000 rows loaded
WARN: `sales` table - 7 duplicates found
INFO: Details in `violations` table
INFO: - Fix PK Violations -
INFO: `sales` table - 7 duplicates found
INFO: `sales` table - 7 duplicates deleted
INFO: `violations` table updated
CALL
COMMIT
Performance
You can now apply complex procedural logic inside your Amazon Redshift cluster
A stored procedure’s loops and conditional logic run on the leader node of the Amazon Redshift cluster; any SQL within it is distributed to the compute nodes
Previously, you needed to extract the data, apply the procedural logic in Java, Python, or another tool on a separate server, and then reload the results
You may be able to speed up your processing significantly by using procedures to eliminate the extract > process > reload round trip
For maximum performance, avoid large loops that update millions of rows one at a time. Instead, use set-based SQL logic.
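As a rough illustration (the tables and columns are hypothetical), the same correction written as a per-row loop versus a single set-based statement:
-- Slow: one UPDATE per row, driven by a loop on the leader node
CREATE OR REPLACE PROCEDURE fix_prices_slow()
AS $$
DECLARE
  r RECORD;
BEGIN
  FOR r IN SELECT item_id, correct_price FROM price_corrections
  LOOP
    UPDATE sales SET price = r.correct_price WHERE item_id = r.item_id;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
-- Fast: one set-based UPDATE processed in parallel by the compute nodes
CREATE OR REPLACE PROCEDURE fix_prices_fast()
AS $$
BEGIN
  UPDATE sales
  SET price = pc.correct_price
  FROM price_corrections pc
  WHERE sales.item_id = pc.item_id;
END;
$$ LANGUAGE plpgsql;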
Feature adoption
Customers have enthusiastically adopted stored procedures
[Chart: Stored procedures created (tens of thousands), May–October]
[Chart: Daily call count (hundreds of thousands), May–October]
Frequently asked questions
How do I view a list of all stored procedures?
SELECT proname FROM pg_proc_info WHERE prolang = 100125;
How do I retrieve the DDL of a stored procedure?
SHOW PROCEDURE my_proc( arg1 INT, arg2 VARCHAR);
How do I return results from a stored procedure?
How do I use transactions within a stored procedure?
How do I use dynamic SQL in a stored procedure?
How do I debug a stored procedure?
Returning results
Results are returned with a REFCURSOR or a temp table
REFCURSOR
Limited to 1 REFCURSOR
Results are kept in memory on the leader node
Maximum cursor size is limited based on node type
TEMP TABLE
Many temp tables can be used
Results are distributed across compute node disks
Maximum temp table size is limited by available storage
CREATE PROCEDURE sp_result
( result_set INOUT REFCURSOR )
AS $$
BEGIN
OPEN result_set FOR
SELECT *
FROM my_table
LIMIT 100;
END;
$$ LANGUAGE plpgsql;
BEGIN;
CALL sp_result('result');
FETCH ALL FROM result;
END;
Returning results
CREATE PROCEDURE sp_result()
AS $$
BEGIN
DROP TABLE IF EXISTS my_temp;
CREATE TEMP TABLE my_temp AS
SELECT *
FROM my_table
LIMIT 100;
END;
$$ LANGUAGE plpgsql;
CALL sp_result();
-- Not bound to transaction scope
SELECT * FROM my_temp;
Using transactions
COMMIT, ROLLBACK, or TRUNCATE can be issued within a procedure
A call to a stored procedure is atomic; if any error is encountered, the current transaction context is rolled back
Procedures use the transaction context of the caller by default, so a procedure called from another runs in the same transaction. Transactions are not nested.
When a stored procedure is called within an explicit transaction block, your application’s logic determines when a procedure’s modifications are committed or rolled back
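A small sketch of the atomic behavior (the audit_log table and procedure are hypothetical):
CREATE TEMP TABLE audit_log (note VARCHAR, logged_at TIMESTAMP);
CREATE OR REPLACE PROCEDURE sp_atomic_demo()
AS $$
BEGIN
  INSERT INTO audit_log VALUES ('step 1 done', SYSDATE);
  RAISE EXCEPTION 'simulated failure';  -- any unhandled error aborts the CALL
END;
$$ LANGUAGE plpgsql;
CALL sp_atomic_demo();
-- The CALL fails and its transaction context is rolled back,
-- so the 'step 1 done' row is not kept:
SELECT COUNT(*) FROM audit_log;  -- returns 0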
Using transactions
COMMIT ends the current transaction, and a new transaction starts
CREATE PROCEDURE sp_commit()
AS $$
DECLARE
tlist RECORD;
BEGIN
FOR tlist IN EXECUTE 'SELECT tname FROM tbls;'
LOOP
EXECUTE 'INSERT INTO '|| tlist.tname
||' SELECT * FROM stg_'|| tlist.tname
||' ;';
COMMIT; --<<<<<
END LOOP;
END;
$$ LANGUAGE plpgsql;
BEGIN; CALL sp_commit(); END;
-- Effect of sp_commit()
BEGIN TRANSACTION
INSERT INTO tbl_1
COMMIT;
BEGIN TRANSACTION
INSERT INTO tbl_2
COMMIT;
BEGIN TRANSACTION
INSERT INTO tbl_3
COMMIT;
BEGIN TRANSACTION
INSERT INTO tbl_4
COMMIT;
…etc…
Dynamic SQL
Use EXECUTE and quote_ident/quote_literal to run composed queries
CREATE PROCEDURE sp_dynamic(loops INT, tbl_name VARCHAR)
AS $$
DECLARE
i_loop INT:=0;
BEGIN
EXECUTE 'CREATE TEMP TABLE '||quote_ident(tbl_name)
||' (loop INT, loop_ts TIMESTAMP);';
WHILE i_loop < loops
LOOP
i_loop = i_loop + 1;
RAISE INFO 'Loop pass: %', i_loop;
EXECUTE 'INSERT INTO '||quote_ident(tbl_name)
||' SELECT '||i_loop||', '''||SYSDATE||''';';
END LOOP;
END;
$$ LANGUAGE plpgsql;
BEGIN;
CALL sp_dynamic(15,'tmp_dynamic');
SELECT * FROM tmp_dynamic;
END;
-- `psql` output
loop | loop_ts
------+----------------------------
11 | 2019-10-17 16:27:07.365599
5 | 2019-10-17 16:27:07.365599
13 | 2019-10-17 16:27:07.365599
15 | 2019-10-17 16:27:07.365599
3 | 2019-10-17 16:27:07.365599
1 | 2019-10-17 16:27:07.365599
4 | 2019-10-17 16:27:07.365599
8 | 2019-10-17 16:27:07.365599
14 | 2019-10-17 16:27:07.365599
6 | 2019-10-17 16:27:07.365599
10 | 2019-10-17 16:27:07.365599
7 | 2019-10-17 16:27:07.365599
2 | 2019-10-17 16:27:07.365599
9 | 2019-10-17 16:27:07.365599
12 | 2019-10-17 16:27:07.365599
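The example above quotes the identifier; string values spliced into dynamic SQL should go through quote_literal. A small hypothetical sketch (table and column names are invented):
CREATE OR REPLACE PROCEDURE sp_archive_region(tbl_name VARCHAR, region VARCHAR)
AS $$
BEGIN
  EXECUTE 'INSERT INTO ' || quote_ident(tbl_name || '_archive')
       || ' SELECT * FROM ' || quote_ident(tbl_name)
       || ' WHERE region = ' || quote_literal(region) || ';';
END;
$$ LANGUAGE plpgsql;
-- CALL sp_archive_region('sales', 'us-east-1'); builds and runs:
-- INSERT INTO sales_archive SELECT * FROM sales WHERE region = 'us-east-1';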
Debugging
Use RAISE INFO statements to print variables for simple debugging
CREATE PROCEDURE sp_debug(loops INT)
AS $$
DECLARE
i_loop INT:=0;
BEGIN
RAISE INFO 'Starting - % loops',loops;
WHILE i_loop < loops
LOOP
i_loop = i_loop + 1;
RAISE INFO 'Loop pass - %', i_loop;
END LOOP;
RAISE INFO 'Finished - % loops', i_loop;
END;
$$ LANGUAGE plpgsql;
BEGIN; CALL sp_debug(10); END;
-- `psql` output
BEGIN
INFO: Starting - 10 loops
INFO: Loop pass - 1
INFO: Loop pass - 2
INFO: Loop pass - 3
INFO: Loop pass - 4
INFO: Loop pass - 5
INFO: Loop pass - 6
INFO: Loop pass - 7
INFO: Loop pass - 8
INFO: Loop pass - 9
INFO: Loop pass - 10
INFO: Finished - 10 loops
CALL
COMMIT
Debugging
Use a tool that displays info messages when debugging
[Screenshots: TablePlus and DataGrip query consoles showing the INFO messages emitted by sp_debug()]
Best practices summary
1. Use source code control to track changes and provide versioning
2. SECURITY DEFINER is powerful; use it with caution
3. Use loops and REFCURSOR for iterating over small result sets
4. Use temp tables and set logic when working with millions of rows
5. Add INFO messages; talkative procedures are easier to understand
6. KISS: Keep procedures small and nest them to do complex things
Demo stored procedures on GitHub
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/StoredProcedures/
sp_controlled_access.sql
sp_check_primary_key.sql
sp_connect_by_prior.sql
sp_pivot_for.sql
Control queries on sensitive data
-- [3] Create User With No Privileges
CREATE USER user_no_priv WITH PASSWORD 'Password01';
GRANT ALL ON SCHEMA public TO user_no_priv;
GRANT EXECUTE ON PROCEDURE sp_controlled_access(
VARCHAR,VARCHAR,VARCHAR,INTEGER,REFCURSOR
) TO user_no_priv;
-- [4] Change session to the new user
SET SESSION AUTHORIZATION user_no_priv;
SELECT current_user;
-- user_no_priv
SELECT * FROM hr.employee;
-- ERROR: permission denied
-- [5] Call the stored procedure
BEGIN;
CALL sp_controlled_access (
'title,emp,mgr,dept'::VARCHAR
-- select_cols
, 'hr.employee'::VARCHAR
-- query_from
, 'dept = ''Board'''::VARCHAR
-- where_clause
, 10
-- max_rows
, 'employee_data' );
-- result_set
-- INFO: AUTHORIZED: Query on `hr.employee` completed
-- [6] Fetch the results
FETCH ALL FROM employee_data;
-- title | emp | mgr | dept
-- Chairman | 100 | | Board
-- CEO | 101 | 100 | Board
END;
-- [1] Create Example Data
CREATE SCHEMA hr;
CREATE TABLE hr.employee (
title VARCHAR, emp INT, mgr INT, dept VARCHAR);
INSERT INTO hr.employee VALUES
('Chairman', 100, NULL, 'Board' )
, ('CEO' , 101, 100 , 'Board' )
, ('CTO' , 102, 101 , 'IT' );
-- [2] Create Example Auth Table
CREATE TABLE hr.access_authority (
query_from VARCHAR, user_name VARCHAR
, valid_until TIMESTAMP, max_rows INT );
INSERT INTO hr.access_authority
VALUES ('hr.employee', 'user_no_priv', '2019-12-31', 99);
amazon-redshift-utils/src/StoredProcedures/sp_controlled_access.sql
Allow a user to perform limited queries on tables where they have no access privileges
sp_controlled_access() looks up the query permission and maximum row count in an authorization table (hr.access_authority in this example)
Uses SECURITY DEFINER to run the user’s query
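This is not the implementation in the repo, but a rough sketch of the pattern it follows, using the hr.access_authority table created above. Whether the calling user should be read via session_user or current_user under SECURITY DEFINER is an assumption to verify against the actual script.
-- Sketch only: verify access in hr.access_authority, then open a cursor
CREATE OR REPLACE PROCEDURE sp_controlled_access_sketch(
  p_query_from IN VARCHAR
, result_set INOUT REFCURSOR)
AS $$
DECLARE
  allowed_rows INT;
BEGIN
  SELECT max_rows INTO allowed_rows
  FROM hr.access_authority
  WHERE query_from = p_query_from
  AND user_name = session_user   -- assumed: the invoking session's user
  AND valid_until >= SYSDATE;
  IF allowed_rows IS NULL THEN
    RAISE EXCEPTION 'NOT AUTHORIZED: %', p_query_from;
  END IF;
  OPEN result_set FOR EXECUTE
    'SELECT * FROM ' || p_query_from || ' LIMIT ' || allowed_rows || ';';
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;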
Check PRIMARY KEY constraint
-- [4] Attempt to DELETE will error
SET SESSION AUTHORIZATION user_low_priv;
DELETE FROM tmp_pk_log WHERE pk_col = 100;
-- ERROR: permission denied for relation tmp_pk_log
-- [5] Call the SP with LOG action
CALL sp_check_primary_key(
SYSDATE,'tmp_pk_test','LOG','tmp_pk_log',0);
SELECT * FROM tmp_pk_log;
-- check_table | tmp_pk_test
-- check_time | 2019-10-31 00:01:33.849821
-- check_status | ERROR (LOG) - Duplicates found
-- error_count | 4
-- [6] Call the SP with FIX action and 3 rows max fix
CALL sp_check_primary_key(
SYSDATE,'tmp_pk_test','FIX','tmp_pk_log',3);
SELECT * FROM tmp_pk_log;
-- check_table | tmp_pk_test
-- check_time | 2019-10-31 00:01:34.479395
-- check_status | ERROR (FIX) - Duplicate count exceeds…
-- error_count | 4
-- [7] Call the SP with FIX action and 99 rows max fix
CALL sp_check_primary_key(
SYSDATE,'tmp_pk_test','FIX','tmp_pk_log',99);
SELECT * FROM tmp_pk_log;
-- check_table | tmp_pk_test
-- check_time | 2019-10-31 00:01:34.892679
-- check_status | SUCCESS (FIX) - Duplicates corrected.
-- error_count | 4
-- [2] Create Log Table
CREATE TABLE tmp_pk_log(
check_table VARCHAR
, check_time TIMESTAMP
, check_status VARCHAR
, error_count INT);
-- [3] Create A User
CREATE USER user_low_priv ;
-- [4] Grant Execute
GRANT EXECUTE ON PROCEDURE
sp_check_primary_key()
TO user_low_priv;
amazon-redshift-utils/src/StoredProcedures/sp_check_primary_key.sql
Checks that the declared PRIMARY KEY does not contain any duplicates
sp_check_primary_key() attempts to remove duplicate rows, up to the max_fix_rows value
Uses SECURITY DEFINER to allow an unprivileged user to safely DELETE from a table where they only have read privileges
-- [1] Create Example Table
CREATE TEMP TABLE tmp_pk_test (
pk_col INTEGER PRIMARY KEY
, entries INTEGER);
INSERT INTO tmp_pk_test VALUES
(100, 7001)
, (100, 7001) -- Duplicate
, (100, 7001) -- Duplicate
, (100, 7001) -- Duplicate
, (101, 20225)
, (102, 22772)
, (103, 4577);
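Not the repo’s implementation, but one common set-based way to correct the duplicates described above, shown against the example table:
-- Sketch: collect each duplicated key once, delete its rows, re-insert one copy
CREATE TEMP TABLE tmp_pk_dups AS
SELECT pk_col, MIN(entries) AS entries
FROM tmp_pk_test
GROUP BY pk_col
HAVING COUNT(*) > 1;

DELETE FROM tmp_pk_test
USING tmp_pk_dups
WHERE tmp_pk_test.pk_col = tmp_pk_dups.pk_col;

INSERT INTO tmp_pk_test
SELECT pk_col, entries FROM tmp_pk_dups;

DROP TABLE tmp_pk_dups;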
Emulate CONNECT BY PRIOR
-- [2] Example of CONNECT BY PRIOR syntax
SELECT dept
, emp_title
, LEVEL
, emp_id
, mgr_id
FROM tmp_employee
WHERE LEVEL <= 3
START WITH mgr_id IS NULL
CONNECT BY PRIOR emp_id = mgr_id;
-- [3] Syntax for Stored Procedure call
CALL sp_connect_by_prior (
'dept,emp_title'
-- select_cols
,'tmp_employee'
-- table_name
,'emp_id'
-- child_col
,'mgr_id'
-- parent_col
,'mgr_id IS NULL'
-- start_with
,''
-- where_clause
,3
-- max_level
,'tmp_result'
-- temp_tbl_n
);
-- [4] Stored Procedure output
SELECT * FROM tmp_result
ORDER BY level, mgr_id, emp_id;
-- dept | emp_title | level | emp_id | mgr_id
-- Board | Chairman | 1 | 100 |
-- Board | CEO | 2 | 101 | 100
-- IT | CTO | 3 | 102 | 101
-- Sales/Mkt | CMO | 3 | 103 | 101
-- [1] Create Example Table
CREATE TEMP TABLE tmp_employee (
emp_title VARCHAR
, emp_id INT
, mgr_id INT
, dept VARCHAR)
;
INSERT INTO tmp_employee VALUES
('Chairman' , 100, NULL,'Board' )
, ('CEO' , 101, 100 ,'Board' )
, ('CTO' , 102, 101 ,'IT' )
, ('CMO' , 103, 101 ,'Sales/Mkt')
, ('VP Analytics' , 104, 102 ,'IT' )
, ('VP Engineering' , 105, 102 ,'IT' )
, ('Sales Director' , 106, 103 ,'Sales/Mkt')
, ('Sales Mgr West' , 107, 106 ,'Sales/Mkt')
;
amazon-redshift-utils/src/StoredProcedures/sp_connect_by_prior.sql
Emulates the output of queries using CONNECT BY PRIOR syntax on parent-child hierarchies
sp_connect_by_prior() recursively queries the table to retrieve all levels of the hierarchy
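Not the repo implementation, but a sketch of the level-by-level loop such a procedure can use against the tmp_employee table above (the procedure and temp table names here are invented):
CREATE OR REPLACE PROCEDURE sp_connect_by_sketch(max_level INT)
AS $$
DECLARE
  lvl INT := 1;
  added INT;
BEGIN
  DROP TABLE IF EXISTS tmp_result_sketch;
  -- Level 1: the START WITH rows
  CREATE TEMP TABLE tmp_result_sketch AS
  SELECT dept, emp_title, 1 AS level, emp_id, mgr_id
  FROM tmp_employee
  WHERE mgr_id IS NULL;
  LOOP
    EXIT WHEN lvl >= max_level;
    -- Next level: children of the rows added at the previous level
    INSERT INTO tmp_result_sketch
    SELECT e.dept, e.emp_title, lvl + 1, e.emp_id, e.mgr_id
    FROM tmp_employee e
    JOIN tmp_result_sketch r ON e.mgr_id = r.emp_id  -- CONNECT BY PRIOR emp_id = mgr_id
    WHERE r.level = lvl;
    GET DIAGNOSTICS added := ROW_COUNT;
    EXIT WHEN added = 0;
    lvl := lvl + 1;
  END LOOP;
END;
$$ LANGUAGE plpgsql;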
Emulate PIVOT FOR
-- [2] Example of PIVOT FOR syntax
SELECT country
, ak, al, ar, az, ca, co
FROM tmp_state_metrics
PIVOT (SUM( entries )
FOR state IN ('ak','al','ar','az','ca','co')) pvt
WHERE state IS NOT NULL
GROUP BY country
;
-- [3] Syntax for Stored Procedure call
BEGIN;
CALL sp_pivot_for (
'country'
-- select_cols
,'tmp_state_metrics'
-- table_name
,'state'
-- pivot_src
,'SUM'
-- agg_func
,'entries'
-- metric_col
,'state IS NOT NULL'
-- where_clause
,'pivot_result'
-- result_set (cursor)
);
-- [4] Retrieve Stored Procedure output from cursor
FETCH ALL FROM pivot_result
;
-- country | ak | al | ar | az | ca | co
-- USA | 7002 | 20226 | 22773 | 4578 | 17878 | 19504
-- [5] Close transaction. Will release the cursor.
END;
-- [1] Create Example Table
CREATE TEMP TABLE tmp_state_metrics (
country VARCHAR
, state VARCHAR
, entries INTEGER
);
INSERT INTO tmp_state_metrics VALUES
('USA', NULL, 15177), ('USA', NULL, 1)
, ('USA', 'AK', 7001), ('USA', 'AK', 1)
, ('USA', 'AL', 20225), ('USA', 'AL', 1)
, ('USA', 'AR', 22772), ('USA', 'AR', 1)
, ('USA', 'AZ', 4577), ('USA', 'AZ', 1)
, ('USA', 'CA', 17877), ('USA', 'CA', 1)
, ('USA', 'CO', 19503), ('USA', 'CO', 1)
;
amazon-redshift-utils/src/StoredProcedures/sp_pivot_for.sql
Emulates the output of queries using PIVOT FOR syntax to transpose rows to columns
sp_pivot_for() creates an aggregated CASE statement for each pivot column value
The stored procedure version does not require the pivot values to be known in advance
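Roughly, the generated SQL is ordinary conditional aggregation; for the example data above it would look something like this (a sketch, not the procedure’s exact output):
SELECT country
, SUM(CASE WHEN state = 'AK' THEN entries END) AS ak
, SUM(CASE WHEN state = 'AL' THEN entries END) AS al
, SUM(CASE WHEN state = 'AR' THEN entries END) AS ar
, SUM(CASE WHEN state = 'AZ' THEN entries END) AS az
, SUM(CASE WHEN state = 'CA' THEN entries END) AS ca
, SUM(CASE WHEN state = 'CO' THEN entries END) AS co
FROM tmp_state_metrics
WHERE state IS NOT NULL
GROUP BY country;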
Learn big data with AWS Training and Certification
Visit aws.amazon.com/training/paths-specialty/
New free digital course, Data Analytics Fundamentals, introduces Amazon S3, Amazon Kinesis, Amazon EMR, AWS Glue, and Amazon Redshift
Validate expertise with the AWS Certified Big Data - Specialty exam or the new AWS Certified Data Analytics - Specialty beta exam
Resources created by the experts at AWS to help you build and validate data analytics skills
Classroom offerings, including Big Data on AWS, feature AWS expert instructors and hands-on labs
Thank you!
Naresh Chainani
nareshkc@amazon.com
Joe Harris
harrjose@amazon.com