- no empty lines in query
- no dots in table name
- no "-" in table name
- no backtick (`) > (")
- IF > CASE (see the sketch after this list)
- DATE(xx) > CAST(xx AS DATE)
- can't CAST '' to DATE > CAST(CASE WHEN m.AdvertisedTo = '' THEN NULL ELSE m.AdvertisedTo END AS DATE)
- daysincelastorder: can't use NOW() (because of MPP) > DATEDIFF(DAYS, MAX(o.dateTime), CAST(GETDATE() AS timestamp))
- comment is "--" not "#"
- GROUP BY must contain all non-aggregated columns
- no case sensitivity (id = Id)
- dedupe by row_number (see below)
- use BOOLEAN whenever you can
- Timezone conversion: CONVERT_TIMEZONE('UTC', 'US/Pacific',CAST(Modified AS timestamp)) AS Date
- CREATE TABLE w/ AUTO INCREMENT: CREATE TABLE teams(id int8 identity(0, 1),name varchar(100),primary key(id));
- DATE FORMAT: TO_CHAR(CONVERT_TIMEZONE('UTC', 'US/Pacific',CAST(fs.Modified AS date)),'YYYYMMDD') AS DateFact
- (SELECT IFNULL(MAX(snapshotDate), DATE_ADD(DATE(NOW()), INTERVAL -1 DAY)) FROM tmp.LeadSnapshot) => (SELECT CASE WHEN MAX(CAST(snapshotDate AS timestamp)) IS NULL THEN DATEADD(DAY,-1,GETDATE()) ELSE MAX(CAST(snapshotDate AS timestamp)) END FROM "tmp.LeadSnapshot")
- "/" as the last character of a string will shift columns. For example, if you have a text comment field and a value is "thanks for the great service :-/", the "/" will escape the double quotes, and will mess up the rest of the row. Where you have a column with a possibility of a backslash at the end of a value, you have to do a:
- CASE WHEN (RIGHT("field",1) = '//' THEN "field"::VARCHAR(255) || ' ' ELSE "field" END CASE
- this checks the right most character of the "field" and if it's a backslash, then it adds a space to the end, and if not it will just take the field as is
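A minimal sketch of the IF > CASE rewrite from the list above (the "orders" table and "amount" column are made up):
-- MySQL: SELECT IF(amount > 0, 'paid', 'unpaid') AS status FROM orders;
-- Redshift equivalent:
SELECT CASE WHEN "amount" > 0 THEN 'paid' ELSE 'unpaid' END AS "status"
FROM "orders";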
"Uniqueness" in a primary key column is not enforced in Redshift. Example, you have primary key column "id", and have a insert query putting rows in the "id" column, you can rerun the query 100x and it will give you 100x rows, and 100x duplicates into the table.
Reference: http://stackoverflow.com/questions/15170701/amazon-redshift-keys-are-not-enforced-how-to-prevent-duplicate-data
Possible solution if you need to update rows in a table (which you would normally use a primary key for):
http://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
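A hedged sketch of that staging-table merge (the "target" and "staging" table names and the "id" column are hypothetical):
-- load fresh data into "staging" first, then merge it into "target"
BEGIN TRANSACTION;
DELETE FROM "target"
USING "staging"
WHERE "target"."id" = "staging"."id";
INSERT INTO "target"
SELECT * FROM "staging";
END TRANSACTION;
DROP TABLE "staging";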
Show load errors:
SELECT * FROM stl_load_errors;
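The table can get long; a handy variant is to look only at the most recent errors:
SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;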
Show table definition:
SELECT * FROM pg_table_def WHERE tablename = 'tablename';
HANDY COPY COMMAND OPTION:
TRUNCATECOLUMNS (truncates VARCHAR/CHAR values that are too long for the target column instead of failing the load)
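For example (the table name, S3 path, and IAM role below are placeholders):
COPY "mytable"
FROM 's3://my-bucket/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
TRUNCATECOLUMNS;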
DEDUPE BY ROW_NUMBER (table "orig" by "Id" column):
SELECT
    a."Id", a."2", a."3", a."4"
FROM (
    SELECT
        "Id",
        "2", -- "2", "3", "4" stand for the remaining columns of the table
        "3",
        "4",
        ROW_NUMBER() OVER (PARTITION BY "Id" ORDER BY "Id") AS "row_number"
    FROM "orig"
) AS a
WHERE a."row_number" = 1;
SHOW CHANGES SINCE LAST 5 DAYS - FLOATING INTERVAL
SELECT
*
FROM "out.c-main".data
WHERE transaction_date::timestamp > getdate() - interval '5 days'
;
COMMON PROBLEM WITH DATA TYPES:
Redshift is more datatype-specific than MySQL.
Note that the default data type in a Redshift bucket is set to VARCHAR(65535).
This causes problems when attempting a CONCAT in a query, because it leads
Redshift to try to CONCAT two VARCHAR(65535) strings:
SELECT
a."author"::VARCHAR(255) || '_' || b."transaction"::VARCHAR(255) AS "columna"
FROM "asdf" a JOIN "xyz" b ON ....
There are two ways to deal with this:
1. Set a lower VARCHAR length in the input mapping.
2. Define the datatype on the fly via CAST
Defining data types is strongly recommended! If all your columns have the default VARCHAR(65535) type, every full-table operation will take far longer than the same operation over a table with appropriately defined data types.
Another common problem is trying to cast a VARCHAR column as INT
or DECIMAL: Redshift does not know how to deal with empty values in that case.
A sanitization CASE expression is needed before you can cast the data type.
Be sure to convert all empty and NULL values to actual zeroes.
If you are casting a VARCHAR column as DECIMAL, also replace all "," decimal separators with ".", as in the sketch below.
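A minimal sketch of such sanitization (the "amount" column and "source_table" are made up):
SELECT
    CAST(
        CASE
            WHEN "amount" IS NULL OR "amount" = '' THEN '0'
            ELSE REPLACE("amount", ',', '.')
        END
    AS DECIMAL(12,2)) AS "amount_num"
FROM "source_table";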
Performance Hints
Let's say you need to create a new table with some aggregation columns and group it by another column (e.g. usersegmentid). Due to the nature of GROUP BY, the query will fail whenever the result contains a non-aggregated column that is not in the GROUP BY clause. So you might end up with this query:
CREATE TABLE test AS
SELECT
u.usersegmentid,
u.anonymous,
DATE_TRUNC('month',u.registrationdatetime) AS registrationdate,
u.dayssincelastvisitcat,
u.dayssincelastvisitcatsort,
u.dayssincelastclickcat,
u.dayssincelastclickcatsort,
u.refsource,
u.refcampaign,
u.firtsclickutmsource,
u.firtsclickutmmedium,
u.clickscat,
u.clickscatsort,
u.visitscat,
u.visitscatsort,
u.lastregrefsource,
u.lastregrefcampaign,
COUNT(u.id) AS usercount,
COUNT(rev30.userid) AS usercount30,
SUM(rev30.revenue) AS revenue30
FROM proc_users u
LEFT JOIN rev30 ON rev30.userid = u.id
GROUP BY
u.usersegmentid,
u.anonymous,
registrationdate,
u.dayssincelastvisitcat,
u.dayssincelastvisitcatsort,
u.dayssincelastclickcat,
u.dayssincelastclickcatsort,
u.refsource,
u.refcampaign,
u.firtsclickutmsource,
u.firtsclickutmmedium,
u.clickscat,
u.clickscatsort,
u.visitscat,
u.visitscatsort,
u.lastregrefsource,
u.lastregrefcampaign;
Due to the large number of columns in GROUP BY, the query performs poorly (around 6 minutes on a 9M-row table). The solution is to use only one column in GROUP BY:
CREATE TABLE test_5 AS
SELECT
u.usersegmentid,
COUNT(u.id) AS usercount,
COUNT(rev30.userid) AS usercount30,
SUM(rev30.revenue) AS revenue30
FROM proc_users u
LEFT JOIN rev30 ON rev30.userid = u.id
GROUP BY
u.usersegmentid;
This query executes in a matter of seconds. If you need the other columns from the first query, join them back after computing the aggregations, as in the sketch below.
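A possible follow-up sketch (it assumes a "usersegments" dimension table with one row per usersegmentid, which is not shown in the original queries):
CREATE TABLE test_full AS
SELECT
    s.*,
    t.usercount,
    t.usercount30,
    t.revenue30
FROM usersegments s
JOIN test_5 t ON t.usersegmentid = s.usersegmentid;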
How to find duplicates
--by tomas.trnka@keboola.com
The traditional way of searching for duplicates performs very poorly
(e.g. SELECT id, COUNT(*) FROM "table" GROUP BY id ORDER BY COUNT(*) DESC;).
I found that something like this is circa 30x faster:
SELECT
    *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS "row_number"
    FROM "table"
) AS a
WHERE a."row_number" > 1;
Views vs. Tables
When running a query using TABLE:
Input mapping: Read data from source, write data to table.
Transformation A: Read data, process data, write data to table.
Transformation B: Read data, process data, write data to table.
Transformation C: Read data, process data, write data to table.
Output mapping: Read data, write data to destination.
When running a query using VIEW:
Input mapping: Nothing (create view).
Transformation A: Nothing (create view).
Transformation B: Nothing (create view).
Transformation C: Nothing (create view).
Output mapping: Read data, process data, write data to destination.
The whole workload is shifted to the end, happens in one large bulk, and has no I/O overhead, unlike the CREATE TABLE approach. But that bulk may hit cluster limits (memory, disk, etc.) because the work is not executed in smaller steps as in the previous case.
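A minimal sketch of the same step written both ways (table and column names are made up):
-- Table variant: the result is materialized, so data is written to disk at this step.
CREATE TABLE "stage_orders" AS
SELECT "id", "amount" FROM "orders" WHERE "amount" > 0;
-- View variant: only the definition is stored here; the work happens when the
-- output mapping (or a later step) finally reads from the view.
CREATE VIEW "stage_orders_v" AS
SELECT "id", "amount" FROM "orders" WHERE "amount" > 0;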
Converting Unix Epoch stamp to Timestamp
If a converted Unix time shows a weird year, the value is probably stored as epoch milliseconds rather than seconds:
SELECT TIMESTAMP 'epoch' + "myEpochStamp"::BIGINT /1000 * INTERVAL '1 second' from myTable;
Converting Date to GoodData Timestamp
This example is for dates in MM/DD/YYYY format. Adjust to format needed in your case.
SELECT (extract(epoch from TO_DATE(date, 'MM/DD/YYYY'))/86400+25568);