Mastering the SUPER Data Type in Amazon Redshift: A Developer’s Journey

We recently started using Amazon Redshift as our data lakehouse solution. While there are numerous tutorials online showcasing Redshift’s fundamentals — such as how it works, its standard implementations, and basic operations — real-world use cases often present unique challenges. As we began actively using Redshift, we encountered several practical issues with minimal or ambiguous documentation. Resolving these challenges involved extensive experimentation, research, and reading between the lines.

One significant challenge we faced was how to efficiently load JSON data from Amazon S3 into a Redshift table and then query it. According to the Redshift documentation, this should be straightforward using the COPY command. In practice, however, it revealed hidden caveats. Here’s how we navigated and resolved these challenges.

Scenario: Loading JSON Data into Redshift

Imagine a straightforward scenario: you have a collection of JSON files with varying structures in your S3 bucket, and your task is to load this data into a raw landing-zone table in Redshift. Afterward, you’ll query this data by keys to normalize and process it further. Simple enough, right?

Most modern databases offer native support for JSON data. For instance:

  • PostgreSQL: JSON or JSONB data types.
  • MySQL: JSON data type.
  • MongoDB: Native BSON (Binary JSON) storage.

However, Amazon Redshift — though based on PostgreSQL — does not offer a native JSON column type. Instead, Redshift provides a specialized wrapper known as the SUPER data type.

What Exactly is the SUPER Data Type?

The SUPER data type in Amazon Redshift is designed specifically to handle semi-structured or nested JSON data. It enables storage without a predefined schema and simplifies complex querying operations.

Key benefits of the SUPER type include:

  • Flexible Schema: Avoid upfront normalization.
  • Intuitive SQL Queries: Query JSON data using familiar SQL syntax.
  • Large Storage Capacity: Supports JSON documents up to 16 MB per row. While 16 MB is generous, most other modern data lakehouse solutions offer significantly higher limits.

Initial Attempts and Challenges

Initially, our instinct was to use Redshift’s VARCHAR data type to store JSON data:

CREATE TABLE json_data (
    filename VARCHAR(256),
    text_json_data VARCHAR(65535)
);

But we quickly discovered that Redshift’s VARCHAR columns are capped at 65,535 bytes (64 KB), so larger documents failed with errors like:

ERROR: value too long for type character varying(65535)

Transitioning to SUPER: Unexpected Issues

Turning to Redshift’s SUPER type, our new table schema became:

CREATE TABLE json_data_super (
    filename VARCHAR(256),
    json_content SUPER
);

We were confident, given the documented 16 MB storage limit. Yet again, we faced load errors:

ERROR: JSON parse error - data too long

Despite our JSON data being smaller than 16 MB, we continued encountering this frustrating limitation.

Discovering a Hidden Limitation

After extensive troubleshooting, we discovered an undocumented yet crucial limitation:

Redshift’s SUPER data type indeed allows up to 16 MB of JSON per row — but each individual JSON key-value pair cannot exceed 64 KB.

This caught us off guard because the overall 16 MB per-row limit seemed more than sufficient for our use case. Only after closer inspection and repeated testing did we recognize the subtle distinction: each key-value pair must individually stay within the 64 KB limit. A single lengthy string value, even inside otherwise well-formed JSON, will trigger errors if it crosses that threshold.
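
To catch this before a COPY fails, we found it useful to scan documents ahead of time. Below is a minimal Python sketch of such a pre-flight check; the input file name and the 65,535-byte threshold are our own assumptions based on the behavior described above, not an official AWS API:

import json

# Assumed ceiling for an individual value, mirroring the VARCHAR limit.
MAX_VALUE_BYTES = 65535

def find_oversized_values(obj, path="$"):
    """Recursively yield JSON paths whose serialized value exceeds the limit."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_oversized_values(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_oversized_values(value, f"{path}[{index}]")
    else:
        size = len(json.dumps(obj).encode("utf-8"))
        if size > MAX_VALUE_BYTES:
            yield path, size

with open("document.json", encoding="utf-8") as f:  # hypothetical input file
    for path, size in find_oversized_values(json.load(f)):
        print(f"{path} is {size} bytes and exceeds the 64 KB per-value limit")

Running a check like this over a sample of files quickly shows which fields need restructuring.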

Why Might This Limitation Exist?

The exact internal workings of Redshift’s SUPER data type aren’t explicitly documented by AWS. However, an intuitive way to think about this limitation is that Redshift might internally handle the SUPER type similarly to a dictionary or map structure, something like:

VARCHAR(65535) : VARCHAR(65535)

This mental model helps explain why each individual JSON field must remain under the 64 KB threshold. While not officially confirmed by AWS, considering this representation can significantly clarify why the limitation exists and assist in troubleshooting and structuring your JSON data appropriately.

Practical Workaround: Splitting Large JSON Fields

To address this, we developed a practical approach to modify our JSON structures. Instead of storing large data strings in single fields, we split them into arrays of smaller chunks, each safely under the 64 KB limit:

Original problematic JSON:

{
    "id": "123",
    "large_description": "<string exceeding 64 KB>"
}

Revised compliant JSON:

{
    "id": "123",
    "large_description_chunks": [
        "Chunk 1 (under 64 KB)",
        "Chunk 2 (under 64 KB)",
        "Chunk 3 (under 64 KB)"
    ]
}

This structural adjustment enabled successful data loading into Redshift and significantly simplified subsequent queries and data manipulation tasks.
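
The chunking itself is easy to automate as a preprocessing step. The sketch below is one way to do it in Python; the field names mirror the example above, and the 60,000-byte chunk size is simply our own safety margin under the 64 KB ceiling, not an AWS recommendation:

import json

CHUNK_BYTES = 60000  # stay comfortably below the 64 KB per-value limit

def chunk_string(text, chunk_bytes=CHUNK_BYTES):
    """Split a string into pieces whose UTF-8 encoding stays under chunk_bytes."""
    encoded = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(encoded):
        end = start + chunk_bytes
        # Step back so we never cut a multi-byte UTF-8 character in half.
        while end < len(encoded) and (encoded[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(encoded[start:end].decode("utf-8"))
        start = end
    return chunks

def rewrite_document(doc):
    """Replace the oversized field with an array of compliant chunks."""
    description = doc.pop("large_description", "")
    doc["large_description_chunks"] = chunk_string(description)
    return doc

original = {"id": "123", "large_description": "x" * 200_000}
print(json.dumps(rewrite_document(original))[:120])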

Step-by-Step Loading Example

Here’s how to practically load JSON data into Redshift from S3:

COPY json_data_super (filename, json_content)
FROM 's3://your-bucket/path/'
IAM_ROLE 'arn:aws:iam::your-account-id:role/your-redshift-role'
FORMAT AS JSON 'auto';
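
To produce the files this COPY reads, we upload the preprocessed documents to the same S3 prefix beforehand. Here is a minimal boto3 sketch; it assumes each object carries top-level filename and json_content fields so that JSON 'auto' can map them to the table columns, and the bucket, prefix, and file names are placeholders to adapt to your own layout:

import json
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; keep them in sync with the COPY command above.
BUCKET = "your-bucket"
PREFIX = "path/"

def upload_document(name, content):
    """Write one chunked JSON document to the prefix that COPY reads from."""
    body = json.dumps({"filename": name, "json_content": content})
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{PREFIX}{name}",
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )

upload_document("doc-001.json", {"id": "123", "large_description_chunks": ["..."]})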

Querying Your SUPER Data

Querying stored JSON data becomes straightforward with dot notation:

SELECT
    filename,
    json_content.id,
    json_content.large_description_chunks[0] AS first_chunk
FROM json_data_super;
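
If you select json_content.large_description_chunks as a whole array, the original text can be stitched back together in application code. Here is a minimal sketch that assumes your client returns the SUPER array as a JSON-encoded string, which is a common representation, though it varies by driver:

import json

def reassemble_description(super_value):
    """Rebuild the original string from the chunk array returned by Redshift."""
    # Many drivers hand SUPER values back as JSON text; parse if needed.
    chunks = json.loads(super_value) if isinstance(super_value, str) else super_value
    return "".join(chunks)

# Hypothetical row, as fetched with your preferred Redshift client.
row = {
    "filename": "doc-001.json",
    "large_description_chunks": '["Chunk 1 ", "Chunk 2 ", "Chunk 3"]',
}
print(reassemble_description(row["large_description_chunks"]))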

Conclusion

Amazon Redshift’s SUPER data type significantly simplifies handling complex, semi-structured JSON data. However, understanding its limitations — particularly the 64 KB per-field constraint — is essential for successful implementation. By carefully structuring your JSON data and following best practices, you can fully harness Redshift’s powerful capabilities and streamline your data workflows.
