By Minh Nguyen

Amazon Redshift: Data Warehousing for the Masses

Amazon Redshift has been in the news a lot over the past few months. Larry Ellison slammed Redshift at Oracle OpenWorld in October 2017, the second time he had thrown darts at it after already doing so in 2016. You know you’re doing something right when Larry calls you out on stage.


We haven’t been able to discover why Amazon chose the name Redshift for its data warehousing solution, but we suspect it had something to do with hyper-expansion. Like the universe, which has been expanding rapidly since the Big Bang, enterprise data has grown at a tremendous rate. And, as with the universe, there’s no end to that growth in sight!


In this post, we’ll cover the basic concepts behind Amazon Redshift, and why it’s seen so much adoption.


Why Redshift?

Let’s take the expanding universe analogy a bit further. Much of the universe lies beyond our current observational horizon. In a similar way, lots of enterprise data also lies in the dark — untouched and unanalyzed. One motivation for Amazon building Redshift was to ease the path to analyzing this “dark data,” and to do it in a way that allows anybody to deploy a data warehouse without large setup costs or deep technical expertise. Amazon offers a free 60-day trial for data analysts to get their feet wet with Redshift, and time-to-first-report is around 15 minutes. That’s quick enough for analysts to gain some initial business insights into their data and see whether Redshift is worthwhile before committing further resources. Should they decide to go with it, there’s no upfront expense. The entry price point for the smallest configuration is $0.25/node/hour. That’s less than $1,000/TB/year, an order of magnitude cheaper than traditional warehouse solutions.



With cost no longer a barrier to the democratization of data warehousing, let’s probe what Redshift is, how it works, and what the future holds.


What is Amazon Redshift?

Amazon Redshift is a relational database system built on PostgreSQL principles. It’s optimized for performing online analytical processing (“OLAP”) queries efficiently over petabytes of data. The query handling efficiency is achieved through the combination of:

  • highly parallel processing

  • a columnar database design

  • data compression of columns

  • a query optimizer

  • compiled query code

Each of these merits its own post. In what follows, we’ll provide the highlights of the design.
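To make the columnar and compression ideas concrete, here’s a minimal sketch of Redshift DDL that declares a per-column compression encoding. The table, columns, and encoding choices are hypothetical examples; the right encodings depend on your data, and Redshift can also select them automatically during loading.

```python
# Hypothetical table definition; run it through any PostgreSQL-compatible
# client (a connection sketch appears later in this post).
CREATE_PAGE_VIEWS = """
CREATE TABLE page_views (
    view_id    BIGINT       ENCODE delta,    -- suits steadily increasing values
    viewed_at  TIMESTAMP    ENCODE lzo,
    country    CHAR(2)      ENCODE bytedict, -- suits low-cardinality columns
    user_agent VARCHAR(256) ENCODE lzo       -- suits long, repetitive strings
);
"""
```

Because each column is stored contiguously on disk and compressed with an encoding suited to its contents, analytical queries that touch only a few columns read far less data than a row-oriented database would.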


What’s inside Amazon Redshift?

The heart of each Redshift deployment is a cluster. Each cluster has a leader node and one or more compute nodes, all connected by high speed links. There are two types of compute nodes, each with different CPU, RAM and storage characteristics.


Dense Compute (DC) nodes are built for query speed: they offer less storage but use fast SSD drives, making them best for high-performance workloads. Dense Storage (DS) nodes are meant for storing and querying big data, using conventional hard disk drives.


You choose the node type depending on what you wish to achieve: for instance, storing large amounts of data and running complex queries against it, or speeding up execution for near-real-time analytics. The number of nodes you select depends on the size of your data and the query performance you seek. There’s a difference in pricing for these options, of course, and you can adjust them as your needs change.
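As a minimal sketch of how those choices are expressed, the boto3 (AWS SDK for Python) call below creates a cluster with a chosen node type and node count. The cluster name, credentials, and region are hypothetical placeholders; the node types shown are examples of the DC and DS families.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",   # hypothetical name
    NodeType="dc2.large",        # a Dense Compute type; a DS type (e.g. ds2.xlarge) favors capacity
    ClusterType="multi-node",
    NumberOfNodes=4,             # scale this with data size and performance needs
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME",
    DBName="analytics",
)
```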


A leader node is the interface to your business intelligence application. It offers standard PostgreSQL JDBC/ODBC interfaces for queries and responses. It serves as the traffic cop directing queries from customer applications to the appropriate compute nodes, and manages the results returned. It also distributes ingested data into the compute nodes to build your databases.
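Because the leader node speaks the PostgreSQL wire protocol, any standard PostgreSQL driver can talk to it. Here’s a minimal sketch using Python’s psycopg2; the endpoint, database, and credentials are hypothetical placeholders.

```python
import psycopg2  # standard PostgreSQL driver; Redshift's leader node is wire-compatible

conn = psycopg2.connect(
    host="analytics-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,                   # Redshift's default port
    dbname="analytics",
    user="admin",
    password="REPLACE_ME",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM page_views;")  # hypothetical table from earlier
    print(cur.fetchone()[0])
conn.close()
```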


Each compute node has its storage divided into at least two slices, with each slice performing a portion of the node’s workload in parallel. A user-specified data distribution plan places ingested data across the various compute nodes and their slices. Choose this plan with care, so that nodes are not under-utilized and data movement between nodes during query execution is kept to a minimum. There is also an art to selecting an appropriate sort key to make query handling more efficient. We’ll write more on best practices around these topics in future blog posts.
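To illustrate, here’s a minimal sketch of the DDL involved; the table and key choices are hypothetical. A common rule of thumb is to distribute on a frequently joined column and sort on a frequently filtered one.

```python
# Hypothetical DDL declaring a distribution key and sort key.
CREATE_SALES = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locates rows that join on customer_id on the same node
SORTKEY (sale_date);    -- lets date-range queries skip irrelevant disk blocks
"""
```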


Redshift clusters run on the Amazon Elastic Compute Cloud (EC2) platform. These clusters can either share space with other AWS users (the so-called “classic” platform) or be cordoned off in your own Virtual Private Cloud (VPC). In either case, you assign IP addresses to allow external applications to access your clusters. You can also decide whether your external data sources should traverse the VPC to populate your clusters. Finally, you set up your clusters in what’s called an availability zone, an isolated “location” within an Amazon EC2 region.
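As a hedged sketch of those placement options, the network-related arguments below extend the create_cluster call from the earlier example; the subnet group, security group, and availability zone names are all hypothetical.

```python
import boto3

placement = dict(
    ClusterSubnetGroupName="analytics-subnets",    # launches the cluster inside your VPC
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],  # controls which addresses may connect
    AvailabilityZone="us-east-1a",                 # the isolated "location" within the region
    PubliclyAccessible=True,                       # expose an endpoint for external applications
)
boto3.client("redshift", region_name="us-east-1").create_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="dc2.large", ClusterType="multi-node", NumberOfNodes=4,
    MasterUsername="admin", MasterUserPassword="REPLACE_ME",
    **placement,
)
```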


Redshift data is also replicated to the Amazon Simple Storage Service (S3), and customers can choose to keep an additional S3 backup in a different region.
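The cross-region copy is a one-call configuration. Here’s a minimal sketch via boto3; the cluster name and regions are hypothetical placeholders.

```python
import boto3

boto3.client("redshift", region_name="us-east-1").enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",  # keep snapshot copies outside the home region
    RetentionPeriod=7,              # days to retain the copied snapshots
)
```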


You can ingest data into Redshift from multiple sources, using Amazon S3, Amazon DynamoDB, or Amazon EMR as staging entities. Data can also come directly from any other source, such as an EC2 instance or a host that supports the Secure Shell (SSH) protocol.
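The workhorse for S3 ingestion is the COPY command, which loads files in parallel across the compute nodes. Below is a minimal sketch; the bucket, table, and IAM role ARN are hypothetical placeholders.

```python
# Hypothetical COPY statement; run through the same PostgreSQL connection
# shown earlier.
COPY_PAGE_VIEWS = """
COPY page_views
FROM 's3://my-ingest-bucket/page_views/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP;
"""
```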


Redshift’s major advantage is that the other Amazon cloud services (S3 for backup, EC2 for instances, VPC for network isolation, and many other services in the AWS ecosystem) are built into the warehouse deployment; you don’t pay for these separately or need to manage their setup. It’s all done under the hood for you, through a simple configuration console when setting up your database and loading your clusters. Thus, you get the benefits of resilience, load balancing, monitoring, logging and all other such goodness in one managed package.


A key advantage of Redshift that we think a lot of people are not aware of is simplicity. It used to take months if not quarters to get a data warehouse up and running. And you’d need the help of an Accenture or IBM. None of that anymore. You can spin up a Redshift cluster in less than 15 minutes, and build a whole business intelligence stack in a weekend.


Who uses Amazon Redshift?

A report from June 2017 by Forrester mentions that Redshift has over 5,000 deployments, with plenty of success stories. That’s quite a remarkable adoption, considering that Redshift has only been available since November 2012.


Deployed data warehouses range in size from a few terabytes to petabytes. Looking at the success stories, there are four basic types of use cases for Redshift.


  • Traditional data warehousing — where enterprise data, such as Amazon’s own shopping or Alexa interactions, is fed into Redshift to analyze consumer behavior and engagement; or where enterprises migrate from on-premises solutions to a cloud-based one, as NTT DoCoMo and Nasdaq have done.

  • Log analysis — as used by customers such as Lyft, where large volumes of machine generated data are analyzed for fast decision making (e.g., price for a ride) based on time-sensitive trends.

  • Business Applications — where customers such as Accenture run Redshift “under the hood” to offer an analytics platform to their customers.

  • Mission-critical workloads — a use that has emerged in the past few years, where data sitting in Redshift feeds time-sensitive apps. It’s key that the database stays up, because otherwise the business goes down (quite literally).


We cover more details on these use cases in our Quora answer, “What are some good real world examples of using Amazon Redshift?”

DB-Engines provides a popularity ranking for different databases. Comparing Redshift against legacy vendor Teradata, it becomes clear how much traction Redshift has seen in just five years of existence.



What’s new with Amazon Redshift?

The latest addition to Amazon’s data warehouse toolkit is Redshift Spectrum, a pay-as-you-use way to run queries against data you’ve stored in your S3 “data lake.” You can use the same business intelligence applications and the same queries that you already use with Redshift.


Spectrum is useful when analyzing older or legacy data kept in S3 (incurring only normal S3 storage charges), while newer data lives in Redshift. Redshift mediates the query against the data in S3, taking care of the hard stuff such as normalizing data formats. Such queries will be a bit slower, but older data is not “dark” any longer. (Some enterprises may well have exabytes of such dark data.) You pay only $5 per terabyte of data scanned in S3 during query processing.
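As a minimal sketch of the workflow under stated assumptions (an external data catalog, Parquet files in S3, and hypothetical names throughout): you register an external schema once, map an external table onto the S3 files, and then query it with ordinary SQL, even joining it against local Redshift tables.

```python
# Hypothetical Spectrum setup and query; run through the same PostgreSQL
# connection shown earlier.
SPECTRUM_EXAMPLE = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'legacy_logs'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.clickstream (
    event_time TIMESTAMP,
    url        VARCHAR(2048)
)
STORED AS PARQUET
LOCATION 's3://my-archive-bucket/clickstream/';

-- Billed on data scanned in S3; ordinary SQL otherwise.
SELECT DATE_TRUNC('day', event_time) AS day, COUNT(*) AS events
FROM spectrum.clickstream
GROUP BY 1
ORDER BY 1;
"""
```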




There’s more to Redshift Spectrum, of course, and we will write more about it in future posts.

Redshift Adoption

For now, it’s good to keep the following points about Amazon Redshift in mind:

  • Redshift is a fully managed, petabyte-scale data warehouse optimized for analytical queries (“OLAP”), though we’re seeing emerging use cases for business applications.

  • Redshift is inexpensive compared to its peers.

  • The Redshift engine does all the heavy lifting of managing many AWS features under the hood to provide storage, resilience, network isolation, security etc., at no extra cost.

  • Redshift is simple in that it can be configured easily, without deep technical expertise.

  • Redshift has a large ecosystem of partners, giving customers access to world-class tools for loading, transforming and visualizing data.

AWS re:Invent 2017 is coming up, and we can expect more announcements on features and use cases that should further drive up adoption.
