
They store their data on Amazon S3, have no need for an index, and cannot perform joins. Redshift Spectrum and Athena are both serverless applications. There is only one major difference between Athena and Spectrum: Athena stores query results on S3, which can be loaded into Redshift from there while Spectrum can join tables directly on Redshift. FunctionalityĮssentially, both Athena and Redshift Spectrum do the same thing: query S3 using standard SQL, and store the results. For each Glue Data Catalog schema, external tables must be configured when using Redshift Spectrum. In Athena, table metadata is stored directly in the Glue Data Catalog. These tables are managed using Glue Data Catalog. When querying data stored on Amazon S3, Spectrum and Athena both use virtual tables. Athena, on the other hand, uses the resources allocated automatically by AWS, which might differ during peak usage periods. In cases where you need a query to return extra-fast, you can allocate additional compute resources (unfortunately, this can get costly over time). Redshift Spectrum, therefore, gives you greater control over performance.

Performanceīoth Spectrum and Athena are serverless but differ in that Athena uses pooled resources from Amazon Web Services (AWS) for queries, whereas Spectrum allocates resources depending upon the number of nodes within an RDS instance. While these costs are all-inclusive in Athena, they are also all-inclusive for Spectrum – as we will cover later, you will have to allocate these costs based on your cluster of Redshift servers.

Since these services are decoupled so that storage and computation are separated, you can make use of inexpensive S3 to handle petabyte or exabyte-scale data without racking up massive cloud fees. S3 storage would be another cost to consider since it is relatively inexpensive compared to databases.
AWS REDSHIFT SPECTRUM TRIAL
If your 10 MB free trial expires without any charges applied to your account, Athena will charge you based on how much data was scanned. AWS rounds up to the nearest megabyte, so you’ll always pay at least $5 per query. When running a query in Spectrum, the amount of data scanned is billed according to how much data is scanned. (I also use Presto, Hive, and Spark, neither of those has a problem with this.We’ll take a close look at Athena and Spectrum here, with the aim of helping you understand when to use them for different types of analytics tasks.Ĭonsidering their various use cases, Athena and Redshift Spectrum make excellent choices. It would be really nice if Spectrum supported displaying string representations of nested fields, if only for these simple selects, they are crucial for exploratory work. These two issues require me to know the schema of all my tables before writing my queries, which is quite user unfriendly. There is a related issue that I can't do "select nested_field from table" either, presumably because Spectrum can't serialize the field in any way. "Query 1 ERROR: ERROR: Nested tables do not support '*' in the SELECT clause."īut there are few alternatives for me - say I have dozens of columns in my table, do I need to enumerate them while somehow deducing which ones are nested and thus to be avoided? This is impossible if the table has nested fields, because Spectrum doesn't allow "*" queries on these. By that I mean something along the lines of "select * from table limit 100"

I'm enjoying working with Redshift Spectrum, but I find it difficult to quickly scan a table with nested fields.
