
Redshift Hints


  • no empty lines in the query
  • no dots in table names
  • no "-" in table names
  • no backticks (`) > use double quotes (")
  • IF > CASE (see the short example after this list)
  • DATE(xx) > CAST(xx AS DATE)
  • can't CAST '' to DATE > CAST(CASE WHEN m.AdvertisedTo = '' THEN NULL ELSE m.AdvertisedTo END AS DATE)
  • dayssincelastorder: can't use NOW() (because of MPP) > DATEDIFF(DAYS, MAX(o.dateTime), CAST(GETDATE() AS timestamp))
  • comments are "--", not "#"
  • GROUP BY must contain all non-aggregated columns
  • identifiers are not case sensitive (id = Id)
  • dedupe by ROW_NUMBER() (see below)
  • use BOOLEAN whenever you can

  • Timezone conversion: CONVERT_TIMEZONE('UTC', 'US/Pacific', CAST(Modified AS timestamp)) AS Date
    • take care of data types
  • CREATE TABLE with auto increment: CREATE TABLE teams(id int8 identity(0, 1), name varchar(100), primary key(id));
  • Date format: TO_CHAR(CONVERT_TIMEZONE('UTC', 'US/Pacific', CAST(fs.Modified AS date)), 'YYYYMMDD') AS DateFact
  • (SELECT IFNULL(MAX(snapshotDate), DATE_ADD(DATE(NOW()), INTERVAL -1 DAY)) FROM tmp.LeadSnapshot) => (SELECT CASE WHEN MAX(CAST(snapshotDate AS timestamp)) IS NULL THEN DATEADD(DAY, -1, GETDATE()) ELSE MAX(CAST(snapshotDate AS timestamp)) END FROM "tmp.LeadSnapshot")
  • a backslash ("\") as the last character of a string value will shift columns. For example, if you have a text comment field and a value is "thanks for the great service :-\", the trailing backslash will escape the closing double quote and mess up the rest of the row. Where a column may end with a backslash, wrap it like this:
    • CASE WHEN RIGHT("field", 1) = '\\' THEN "field"::VARCHAR(255) || ' ' ELSE "field" END
    • this checks the rightmost character of "field"; if it is a backslash, a space is appended to the end, otherwise the field is taken as is
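
A minimal sketch pulling several of these hints together (the orders table and its columns are made-up placeholders):

-- "--" comments, not "#"; GETDATE() instead of NOW()
SELECT
    o.customerId,
    CASE WHEN MAX(o.canceled) = 1 THEN TRUE ELSE FALSE END AS hascanceledorder,  -- CASE instead of IF, BOOLEAN where possible
    DATEDIFF(DAYS, MAX(o.dateTime), CAST(GETDATE() AS timestamp)) AS dayssincelastorder
FROM orders o
GROUP BY o.customerId;  -- every non-aggregated column must appear here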



Primary Key


"Uniqueness" in a primary key column is not enforced in Redshift.  Example, you have primary key column "id", and have a insert query putting rows in the "id" column, you can rerun the query 100x and it will give you 100x rows, and 100x duplicates into the table.  
Reference: http://stackoverflow.com/questions/15170701/amazon-redshift-keys-are-not-enforced-how-to-prevent-duplicate-data

Possible solution if you need to update rows in a table (which you would normally use a primary key for) is the staging-table approach described here:
http://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
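
A minimal sketch of that staging-table merge pattern (target_table, staging_table and the id column are placeholder names):

BEGIN TRANSACTION;

-- remove the rows that will be replaced by the incoming data
DELETE FROM target_table
USING staging_table
WHERE target_table.id = staging_table.id;

-- insert the fresh rows
INSERT INTO target_table
SELECT * FROM staging_table;

END TRANSACTION;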



DBeaver client





Show load errors


SELECT * FROM stl_load_errors;

Show table definition:
SELECT * FROM pg_table_def WHERE tablename = 'tablename';

HANDY COPY OPTION:
TRUNCATECOLUMNS (truncates values that are too long for the target column instead of failing the load)
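
A minimal COPY sketch using it (the table, S3 path and IAM role are placeholders):

COPY my_table
FROM 's3://my-bucket/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
TRUNCATECOLUMNS;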


How to dedupe table

(table "orig", deduplicated by the "Id" column; col2, col3, col4 stand for the remaining columns)

SELECT
     a.Id, a.col2, a.col3, a.col4
FROM (
     SELECT
          Id,
          col2,
          col3,
          col4,
          ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Id) AS "row_number"
     FROM orig
     ) AS a
WHERE a."row_number" = 1;


SHOW CHANGES FROM THE LAST 5 DAYS - FLOATING INTERVAL



SELECT
    *
FROM "out.c-main".data
WHERE transaction_date::timestamp > getdate() - interval '5 days'
;


COMMON PROBLEM WITH DATA TYPES:


Redshift is more datatype-specific than MySQL.
Note that the default data type in a Redshift bucket is VARCHAR(65535).
This causes problems when attempting a CONCAT in a query, because it leads
Redshift to try to CONCAT two VARCHAR(65535) strings.

Example

SELECT
a."author"::VARCHAR(255) || '_' || b."transaction"::VARCHAR(255) AS "columna"
FROM "asdf" a JOIN "xyz" b ON ....

There are two ways to deal with this:
1. Set a lower VARCHAR length in the input mapping.
2. Define the data type on the fly via a CAST.

Defining data types is strongly recommended! If all your columns have the default type VARCHAR(65535), every full-table operation will take far longer than the same operation on a table with appropriately defined data types.

Another common problem is trying to cast a VARCHAR column to INT
or DECIMAL. Redshift does not know how to deal with empty values in that case.
A sanitizing CASE expression is needed before you can cast the data type.
Be sure to convert all empty and NULL values to actual zeroes.
When casting a VARCHAR column to DECIMAL, also replace all "," decimal separators with "." (see the sketch below).
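
A minimal sketch of such a sanitizing cast ("amount" and "source_table" are placeholder names):

SELECT
    CAST(
        CASE
            WHEN "amount" IS NULL OR "amount" = '' THEN '0'
            ELSE REPLACE("amount", ',', '.')
        END
    AS DECIMAL(12,2)) AS "amount"
FROM "source_table";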

Performance Hints


Let's say you need to create a new table with some aggregation columns, grouped by another column (e.g. usersegmentid). Due to the nature of GROUP BY, the query will fail every time the result contains a column without an aggregation function that is not in the GROUP BY clause. So you might end up with this query:

CREATE TABLE test AS
SELECT
    u.usersegmentid,
    u.anonymous,
    DATE_TRUNC('month',u.registrationdatetime) AS registrationdate,
    u.dayssincelastvisitcat,
    u.dayssincelastvisitcatsort,
    u.dayssincelastclickcat,
    u.dayssincelastclickcatsort,
    u.refsource,
    u.refcampaign,
    u.firtsclickutmsource,
    u.firtsclickutmmedium,
    u.clickscat,
    u.clickscatsort,
    u.visitscat,
    u.visitscatsort,
    u.lastregrefsource,
    u.lastregrefcampaign,
    COUNT(u.id) AS usercount,
    COUNT(rev30.userid) AS usercount30,
    SUM(rev30.revenue) AS revenue30
FROM proc_users u
LEFT JOIN rev30 ON rev30.userid = u.id
GROUP BY
    u.usersegmentid,
    u.anonymous,
    registrationdate,
    u.dayssincelastvisitcat,
    u.dayssincelastvisitcatsort,
    u.dayssincelastclickcat,
    u.dayssincelastclickcatsort,
    u.refsource,
    u.refcampaign,
    u.firtsclickutmsource,
    u.firtsclickutmmedium,
    u.clickscat,
    u.clickscatsort,
    u.visitscat,
    u.visitscatsort,
    u.lastregrefsource,
    u.lastregrefcampaign;

Due to the large number of GROUP BY columns, the query performs poorly (around 6 minutes on a 9M-row table). The solution is to use only one column in GROUP BY:

CREATE TABLE test_5 AS
SELECT
    u.usersegmentid,
    COUNT(u.id) AS usercount,
    COUNT(rev30.userid) AS usercount30,
    SUM(rev30.revenue) AS revenue30
FROM proc_users u
LEFT JOIN rev30 ON rev30.userid = u.id
GROUP BY
    u.usersegmentid;

This query executes in a matter of seconds. If you need the columns used in the first query, join them back after computing the aggregations (see the sketch below).
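
A rough sketch of joining attribute columns back onto the per-segment aggregates (note this keeps the aggregates at the segment level, unlike the per-attribute-combination counts of the first query; the DISTINCT column list is illustrative):

SELECT
    s.*,
    t.usercount,
    t.usercount30,
    t.revenue30
FROM (SELECT DISTINCT usersegmentid, anonymous, refsource, refcampaign FROM proc_users) s
JOIN test_5 t ON t.usersegmentid = s.usersegmentid;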

How to find duplicates
--by tomas.trnka@keboola.com
The traditional way of searching for duplicates performs very poorly
(e.g. SELECT id, COUNT(*) FROM table GROUP BY id ORDER BY COUNT(*) DESC;)

I found that something like this is circa 30x faster:

SELECT
    *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER
            (PARTITION BY id
             ORDER BY id) AS "row_number"
    FROM table
    ) AS a
WHERE a."row_number" > 1;


Views vs. Tables

When running a query using TABLE:
Input mapping: Read data from source, write data to table.
Transformation A: Read data, process data, write data to table.
Transformation B: Read data, process data, write data to table.
Transformation C: Read data, process data, write data to table.
Output mapping: Read data, write data to destination.

When running a query using VIEW:
Input mapping: Nothing (create view).
Transformation A: Nothing (create view).
Transformation B: Nothing (create view).
Transformation C: Nothing (create view).
Output mapping: Read data, process data, write data to destination.

The whole workload is shifted to the end: it happens in one large bulk and does not have the I/O overhead of a CREATE TABLE at every step. But the bulk may hit cluster limits (memory, disk, etc.) because the work is not executed in smaller steps as in the previous case.
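
A minimal sketch of the two alternatives for a single step (table and column names are placeholders; you would use one or the other, not both):

-- TABLE: rows are materialized and written to disk at this step
CREATE TABLE step_a AS
SELECT id, amount FROM "in.c-main".orders WHERE amount > 0;

-- VIEW: only the definition is stored; the query runs when step_a is read later
CREATE VIEW step_a AS
SELECT id, amount FROM "in.c-main".orders WHERE amount > 0;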


Converting Unix Epoch stamp to Timestamp
If the converted Unix time shows a weird year, the value is probably stored as an epoch in milliseconds (hence the division by 1000 below).

SELECT TIMESTAMP 'epoch' + "myEpochStamp"::BIGINT /1000 * INTERVAL '1 second' from myTable;

Converting Date to Good Data Timestamp
This example is for dates in MM/DD/YYYY format. Adjust the format string to your case.

SELECT (extract(epoch from TO_DATE(date, 'MM/DD/YYYY'))/86400+25568);
