Easier than you might have thought: in-depth tips for designing a much smarter data setup
If you are reading this article, chances are that you are already using Amazon’s (deep breath) fully-managed, petabyte-scale, blazing-fast, massively-parallel-processing and cost-effective cloud Data Warehouse solution. Or perhaps you have heard about it and are now considering it instead of your good ol’ on-premise RDBMS that has served you well for some time, but is struggling to get you into the brave new world of Big Data, a term mentioned with extensive enthusiasm two or more years ago.
Since then, standards have changed and adjusted. Nobody knew, and nobody knows, the exact definition of the term, so let’s simply call it Data, shall we, and get to the point.
Understand what it is you want to achieve with Redshift
As with any tool, perhaps the most important question you should answer is: what do you want to achieve with it? On many occasions, people who start using Redshift perceive it as a scalable alternative to their MySQL/Postgres/Oracle/.. RDBMS. In that case, they simply migrate their data to Redshift and use it as their all-in-one solution for data storage.
The main problem is that this might even work well on limited amounts of data. But as the data accumulates, they find out that Redshift was NOT built for:
1. On demand, real-time updates/deletes
2. Answering tens of thousands of concurrent queries in milliseconds
3. Searching specific values by a pre-defined index
4. Other non-OLAP use cases
Redshift was built for BI-style OLAP. Concurrency in Redshift is low and limited, and more importantly – it is independent of the cluster size. Data ingestion should be done through batch ETLs with a fixed frequency. There are no indexes in Redshift and, therefore, search-type query performance is usually slower than in a common RDBMS that incorporates indexes as auxiliary data structures. Understanding this at an early stage is crucial to making the right decision.
Invest in learning Redshift
Yes, Redshift is a cluster, and yes, you will generally get better performance if you just add more nodes to it. But it is also quite easy to reach the point where you are wasting money on additional SSD nodes for a cluster that otherwise utilizes only 5% of its available storage, just so that your dashboards load in less than an hour.
Ideally, a cost-effective Redshift cluster will grow as the data grows while providing linear or better performance, depending on the query type. To achieve this, you will need to design a scalable schema that takes the following concepts into account:
1. Data distribution and co-allocation
2. Sort Order and Column Encoding
3. The Columnar nature of Redshift
4. Table Management and ETL overhead
5. Work-Load Management
And most importantly – Scalability
Design a scalable Data Model – Distribution vs Co-allocation
Data distribution and co-allocation are core concepts in Redshift. On the one hand, you want to evenly distribute the I/O, CPU and memory load created by each query across all nodes in the cluster; on the other, you want to reduce the network traffic created by shuffling data that is not co-allocated. It usually takes some effort to fulfill both, but it is necessary to ensure good and scalable performance.
Let’s review an example in which we have 3 entities in our Data Model: User, Session and Transaction. A user can appear in many sessions, and there can be many transactions within a given session. We could choose three independent distribution keys for our three tables: user_id, session_id and transaction_id, and have the data evenly distributed across our nodes. However, joining the data in this case can be very expensive network-wise. To avoid this, we can sacrifice some of the data distribution and co-allocate the data to get faster joins.
For example, given that session_id has high cardinality, we could choose to propagate it to the Transaction table and distribute both Session and Transaction by session_id. The User table, being the smallest, can be distributed to all nodes. In this case we would still have good data distribution (given a relatively small and constant number of transactions per session) while gaining the co-allocation of sessions and transactions for efficient joins. Note that you cannot change the distribution of an existing table in Redshift. The only option is to create a new table and copy the data from the old one to the new. Moreover, in many cases redistribution alone is not enough: the existing Data Model and the SQL queries running against the Data Warehouse need to be changed to utilize the new distribution. Many work hours can be spared if you get most of the design right in the first place.
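The scheme above can be sketched in DDL. This is only an illustrative sketch; the table and column names are hypothetical, not taken from a real schema:

```sql
-- Sessions and transactions share a distribution key, so rows belonging
-- to the same session land on the same node and join without shuffling.
CREATE TABLE sessions (
    session_id BIGINT NOT NULL,
    user_id    BIGINT NOT NULL,
    started_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (session_id);

CREATE TABLE transactions (
    transaction_id BIGINT NOT NULL,
    session_id     BIGINT NOT NULL,   -- propagated from sessions
    amount         DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (session_id);

-- The small users table is replicated in full to every node,
-- so joins against it never require data movement.
CREATE TABLE users (
    user_id    BIGINT NOT NULL,
    created_at TIMESTAMP
)
DISTSTYLE ALL;
```

DISTSTYLE ALL trades storage (one copy per node) for join locality, which is usually a good deal for small dimension tables.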
Design a Scalable Data Model – Sort Order
So we have mentioned that there are no indexes in Redshift. Nevertheless, there are Sort Keys that can be defined for each table. An inefficient Sort Order can cause large or full table scans that might not be noticed while the data volume is low, but will become a major issue later on. To choose the best Sort Keys, you need to understand how data is usually filtered in your queries. Sort Order is also important for efficient data maintenance in the cluster.
An inefficient Sort Order in large tables can cause the ETLs that VACUUM those tables to run for hours, and in extreme cases for days. As with distribution, there is no way to change the Sort Order of an existing table.
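As an illustration, a large event-style table that is almost always filtered by time could declare a compound sort key leading with the timestamp (the names here are hypothetical):

```sql
CREATE TABLE page_views (
    view_time  TIMESTAMP NOT NULL,
    session_id BIGINT    NOT NULL,
    url        VARCHAR(512)
)
DISTKEY (session_id)
COMPOUND SORTKEY (view_time, session_id);

-- A range filter on the leading sort key column lets Redshift skip
-- whole blocks instead of scanning the entire table:
SELECT COUNT(*)
FROM page_views
WHERE view_time >= '2018-01-01' AND view_time < '2018-02-01';
```

Because the rows are physically ordered by view_time, new data appended in time order also keeps VACUUM cheap.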
Design a Scalable Data Model – Column Encoding
Column Encoding is another important tool that, if used wisely, can both reduce storage costs by compressing the data and improve query performance by reducing I/O.
Redshift lets you choose an encoding method for each column of each table, and it also exposes automated tools that can decide on the encoding for you. In any case, and like with all the cases described above, incorrect encoding, or the lack of it, will only be felt once the data volume is large enough. By that point there will be multiple clients connected to the Data Warehouse, and changing the tables will have a negative impact on all of them.
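Encodings are declared per column in the DDL, and Redshift can also sample a table and suggest them. A sketch, with hypothetical names and encoding choices that should be validated against your own data:

```sql
CREATE TABLE transactions (
    transaction_id BIGINT        ENCODE delta,    -- monotonically growing ids compress well as deltas
    session_id     BIGINT        ENCODE raw,      -- leading sort key columns are often left RAW
    status         VARCHAR(16)   ENCODE bytedict, -- low-cardinality strings suit dictionary encoding
    amount         DECIMAL(12,2) ENCODE zstd
)
DISTKEY (session_id)
SORTKEY (session_id);

-- Ask Redshift to sample the data and recommend an encoding per column:
ANALYZE COMPRESSION transactions;
```

ANALYZE COMPRESSION only reports recommendations; applying them still means rebuilding the table, which is why it pays to get this right early.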
Design a Scalable Data Model – Table Management and ETL overhead
Having a schema that ensures low query execution times is not enough.
You will also need to make sure that you can effectively load and transform data within the schema, especially when the data volumes grow and the frequency of the ETLs is high.
Redshift, unlike InnoDB in MySQL for instance, leaves the maintenance of each table to you. It is up to you to VACUUM and ANALYZE the tables in time.
As mentioned before, an inefficient sort order combined with an inefficient Data Model (for example, storing long VARCHAR columns in fact tables) can make VACUUM run forever.
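The maintenance itself is plain SQL and is typically scheduled from the ETL. A minimal sketch (the table name is illustrative):

```sql
-- Re-sort rows and reclaim the space left by deletes and updates;
-- the threshold skips the work if the table is already 99% sorted:
VACUUM FULL transactions TO 99 PERCENT;

-- Refresh table statistics so the query planner makes good decisions:
ANALYZE transactions;
```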
Redshift’s native COPY and UNLOAD utilities should be used to load massive amounts of data from S3 and to unload it back. While in S3, the data should be partitioned and stored in a compressed format supported by Redshift. Although Redshift supports data provided in JSON format, processing CSV is usually much faster.
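A typical load of gzipped CSV from a partitioned S3 prefix looks roughly like this; the bucket, prefix and IAM role below are placeholders:

```sql
COPY transactions
FROM 's3://example-bucket/transactions/2018-06-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
CSV
GZIP
TIMEFORMAT 'auto';
```

COPY reads all files under the prefix in parallel across the slices of the cluster, which is why splitting the data into multiple compressed files is faster than loading one large file.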
Design a Scalable Data Model – Columnar DB
Redshift is column-oriented. This means that every column of every table is eventually stored as a separate file. Upon query execution, Redshift does not retrieve whole rows from a table; it accesses only the columns defined in the query. Therefore, query performance is not affected by table width as long as the query operates on a fixed and pre-defined set of columns. Trying to implement a normalized Data Model in Redshift is wrong; de-normalization is the path to follow.
A Star or Snowflake design, with wide fact tables surrounded by a number of dimension tables, is the best practice with Redshift. Having said that, you should still limit table width, since wider tables have higher maintenance overheads (longer VACUUM times), especially when data in those tables is updated and not only inserted.
Design Scalable Queries
As with everything else, poorly designed queries might show bad performance only at a later stage, when data volumes grow. Your query design should not be separated from the dashboard design itself. If you intend to build dashboards that allow drill-down with BI-style slice-and-dice functionality, and you have implemented your Data Model using the steps above, then Redshift is the right tool for you. In this case, your queries will already be optimized for Redshift out of the box. If your dashboards look more like wide grids that present a long set of metrics per grid, then Redshift is not the tool for you, especially if you intend to allow real-time updates to the data via the dashboard.
The typical query distribution by execution time in Feedvisor, when we just implemented Redshift for our Revenue Intelligence system, looked like this:
However, as the time went by it started looking more like this:
To clarify, we changed nothing in our queries or data model. It was only the data that gradually grew. Can you notice the small bulge between 15 and 30 seconds? Those are the inefficiently written queries that started to spill from memory to disk as the data volume increased. You might think that it is still OK if the majority of queries return in under 5 seconds, but the problem is that the “bulge” consists of queries executed by the customers that generate the largest amounts of data, which, in many cases, are also the customers that generate the largest part of the company’s revenue.
To be able to write scalable queries, you will need to examine both the Execution Plan and Redshift log tables such as SVL_QUERY_SUMMARY to understand how the query is executed and what happens behind the scenes. The general guidelines to follow are:
1. Query the smallest number of columns you can
2. Use the distribution key in joins
3. Use the sort keys to filter all tables used in the query.
4. Try to reduce the amount of query steps by moving JOIN logic into a single GROUP BY statement.
5. Do not use your test environment for performance evaluation of the query. In many cases, it will not have the amount of data or the concurrent query loads that you have in production.
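Guidelines 1–3 can be combined in a single query; the schema here matches the hypothetical sessions/transactions sketch from earlier, and SVL_QUERY_SUMMARY then shows whether any step spilled to disk:

```sql
-- Join on the distribution key, filter on a sorted column,
-- and select only the columns that are actually needed.
SELECT t.session_id, SUM(t.amount) AS revenue
FROM sessions s
JOIN transactions t ON t.session_id = s.session_id
WHERE s.started_at >= '2018-06-01'
GROUP BY t.session_id;

-- Check the last query of this session for disk-based (spilled) steps:
SELECT query, seg, step, rows, is_diskbased
FROM svl_query_summary
WHERE query = pg_last_query_id()
ORDER BY seg, step;
```

Any row with is_diskbased = 't' marks a step that did not fit in its WLM slot’s memory, which is exactly the “bulge” from the charts above.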
Workload Management in Redshift
WLM configuration is crucial for optimal query performance. However, it can be changed at any point in time, and usually you will not need to invest much time and effort in tuning it. Still, there are exceptions.
Each query in Redshift executes in a queue, in a separate virtual space called a “slot”. The current maximum number of slots in Redshift is 50. This means that there can be at most 50 concurrent queries running in the cluster, no matter how many nodes it has. The number of slots in a queue defines the amount of memory each slot gets. In short: more slots = less memory per slot. If your queries do not get enough memory, they will spill to disk, and we have already seen what that looks like. From my experience, it is better to set the number of slots per queue as low as possible. It is sometimes better to make new queries wait in a WLM queue until a slot is available than to let them all execute concurrently.
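For the occasional heavy statement, a session can temporarily claim several slots of its queue instead of the whole WLM configuration being changed. A sketch (the table name is illustrative):

```sql
-- Borrow the memory of 3 slots for this session's next statements:
SET wlm_query_slot_count TO 3;

VACUUM FULL transactions;

-- Return to the default single slot:
SET wlm_query_slot_count TO 1;
```

While the session holds 3 slots, only the remaining slots of that queue are available to other queries, so this is best reserved for off-peak maintenance windows.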
Carefully choose the tools you connect to Redshift
Do you intend to use Redshift with existing or future BI/Reporting/Monitoring/ETL tools? Many solutions claim that they support Redshift, but there is quite a difference between being able to run queries via JDBC/ODBC and actually supporting it. And again, at a small scale things might perform well, but as the data grows, problems start to emerge. For example, let’s have a look at the chart below:
The chart shows CPU load on a 3-node Redshift cluster in the midst of a large Tableau extract.
Notice that steady load on the leader node alone? That is something you do not want to see in any type of cluster with a master-slave (or, in the case of Redshift, leader-compute) architecture, because when the leader node is under load, all queries underperform. Note that this is not something you can solve merely by adding more nodes to the cluster.
The problem here is that Tableau extracts data from Redshift using cursors rather than the way it was meant to be done (e.g. using native Redshift methods like UNLOAD). Cursors in Redshift are materialized on the leader node, hence the load on it. To solve this issue at Feedvisor, we created an EMR cluster with Apache Drill, which allows Tableau to connect directly to S3. Instead of letting Tableau extract data directly from Redshift, we unload the data to S3 with an ETL (using Redshift’s native, efficient and parallel UNLOAD) and let Tableau ingest it via Drill. In the future, we intend to move the majority of heavy-load Tableau extracts to run against our Data Lake in S3.
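The export side of such a pipeline is a plain UNLOAD, in which every compute slice writes its own compressed part file to S3 in parallel. A sketch; the query, bucket and IAM role are placeholders:

```sql
UNLOAD ('SELECT session_id, amount FROM transactions')
TO 's3://example-bucket/exports/transactions_part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
GZIP
PARALLEL ON
ALLOWOVERWRITE;
```

Because the work is done by the compute nodes, the leader node stays idle, which is precisely what the cursor-based extract above fails to achieve.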