An ETL is nothing more than a DAG of jobs, each reading datasets as input, running computations, and producing new datasets (or updating existing ones).
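To make that concrete, here is a minimal sketch of one such job: read a dataset, compute, write a new dataset. The file paths, column names, and the aggregation itself are illustrative, not taken from the original text.

```python
import polars as pl

def job(input_path: str, output_path: str) -> None:
    # Read the input dataset produced by an upstream job.
    df = pl.read_parquet(input_path)
    # Run some computation (here: an illustrative aggregation by key).
    out = df.group_by("key").agg(pl.col("value").sum())
    # Produce a new dataset for downstream jobs.
    out.write_parquet(output_path)

job("raw/events.parquet", "staging/events_by_key.parquet")
```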
To give an idea of the scale a single node can handle with DuckDB:

- `SELECT COUNT(*), SUM(column_1), AVG(column_2) FROM my_table GROUP BY key`, with 600M entries in my_table, requires less than 24GB of memory.
- `SELECT * FROM table_a JOIN table_b ORDER BY key`, with table_a having 300M rows and table_b 75M rows, requires 24GB of memory.
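As a rough sketch, the first query can be run with DuckDB's Python API like this. The Parquet file path is a placeholder; the table and column names come from the example query above.

```python
import duckdb

con = duckdb.connect()
rows = con.sql(
    """
    SELECT COUNT(*), SUM(column_1), AVG(column_2)
    FROM read_parquet('my_table.parquet')
    GROUP BY key
    """
).fetchall()
print(rows[:5])
```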
If one of the script's parameters has the type `s3object`, you will see in the input form on the right a button helping you choose the file directly from the bucket.
The same goes for the result of the script: if you return an `s3object` containing a key `s3` pointing to a file inside your bucket, the result panel will show a button to open the bucket explorer and visualize the file.
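Here is a hypothetical sketch of such a script. Only the `{"s3": ...}` shape of the argument and result is grounded in the text above; the parameter name, the keys used, and the processing step are illustrative, and the exact type/import in getOperate's SDK is an assumption not shown here.

```python
def main(input_file: dict) -> dict:
    # The s3object argument looks like {"s3": "inputs/orders.parquet"}.
    source_key = input_file["s3"]

    # ... read the file with your S3 client or with DuckDB/Polars,
    # transform it, and write the result back to the workspace bucket ...
    output_key = "outputs/orders_aggregated.parquet"

    # Returning an s3object makes the result panel show a button that
    # opens the bucket explorer on this file.
    return {"s3": output_key}
```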
Clicking the button leads directly to a bucket explorer, where you can browse the bucket content and even visualize file content without leaving getOperate.
When you click one of those buttons, a drawer opens displaying the content of the workspace bucket. You can select any file to see its metadata and, if the format is a common one, a preview. In the picture above, for example, we're showing a Parquet file, which makes it very convenient to quickly validate the result of a script.
From there, you can always use the S3 client library of your choice to read from and write to S3. That being said, Polars and DuckDB can read and write files stored in S3 directly, and getOperate now ships with helpers to make the entire data-processing mechanics very cohesive.
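For illustration, here is one way to read a Parquet file straight from S3 with Polars and DuckDB, independent of getOperate's own helpers (whose API isn't shown here). The bucket name, key, and region are placeholders, and credentials are assumed to come from the environment.

```python
import duckdb
import polars as pl

# Polars reads s3:// URLs directly; storage_options can carry region/credentials.
df = pl.read_parquet(
    "s3://my-workspace-bucket/inputs/orders.parquet",
    storage_options={"aws_region": "us-east-1"},
)

# DuckDB gains S3 support through the httpfs extension.
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")
# Credentials can also be set with SET s3_access_key_id / s3_secret_access_key.
count = con.sql(
    "SELECT COUNT(*) FROM read_parquet('s3://my-workspace-bucket/inputs/orders.parquet')"
).fetchall()
```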
For reference, the memory figures above were measured on an m4.xlarge AWS server (8 vCPUs, 32GB of memory). It's not a small server, but also not terribly large. Keep in mind you can get up to 24TB of memory on a single server on AWS (yes, it's not cheap, but it's possible!).