Week 26 · Space GIS Architect

Cloud-native: COG, Zarr, STAC catalogs

The old way: download a multi-gigabyte GeoTIFF. The new way: range-request just the bytes you need from S3. COG, Zarr, and STAC make this possible.

Learning objectives

Primer

The traditional way to use satellite imagery is to download a 5 GB scene, unzip it, and load it into a desktop GIS. The cloud-native way is to range-request just the bytes you need from a file living on S3 and never download the whole thing. This week covers the two formats (COG and Zarr) and one spec (STAC) that make that possible.

Cloud-Optimized GeoTIFF (COG)

COG isn't a new file format. It's a particular way of writing a regular GeoTIFF so that HTTP-range-request access is efficient. Three requirements:

  1. Internal tiling — the image is divided into ~256×256 or 512×512 pixel tiles, stored in row-major order. The file header has an index of where each tile begins.
  2. Internal overviews — the file also contains downsampled copies of the image (typically reduced by factors of 2, 4, 8, ...) for fast low-zoom rendering.
  3. Header at the beginning — TIFF allows the IFD (image file directory) to live anywhere; COG mandates the beginning so a small range request can read the structure first.
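
To make requirement 3 concrete, here is a minimal sketch of what a client learns from the first 8 bytes of a classic TIFF: the byte order, the magic number 42, and the offset of the first IFD. The function name parse_tiff_header and the sample bytes are illustrative assumptions, and BigTIFF (magic number 43, 16-byte header) is deliberately not handled.

```python
import struct

def parse_tiff_header(first_bytes: bytes) -> dict:
    """Parse the 8-byte header of a classic (non-BigTIFF) TIFF file."""
    if len(first_bytes) < 8:
        raise ValueError("need at least 8 bytes")
    order = first_bytes[:2]
    if order == b"II":
        endian = "<"  # little-endian ("Intel")
    elif order == b"MM":
        endian = ">"  # big-endian ("Motorola")
    else:
        raise ValueError("not a TIFF: bad byte-order mark")
    # 2-byte magic number followed by the 4-byte offset of the first IFD
    magic, ifd_offset = struct.unpack(endian + "HI", first_bytes[2:8])
    if magic != 42:
        raise ValueError(f"unsupported magic number {magic} (BigTIFF uses 43)")
    return {
        "byte_order": "little" if endian == "<" else "big",
        "first_ifd_offset": ifd_offset,
    }

# Hypothetical header: little-endian, magic 42, first IFD at byte 16
sample = b"II" + struct.pack("<HI", 42, 16)
header = parse_tiff_header(sample)
print(header)
```

In a COG the first IFD offset is small, so the same initial range request that fetched these 8 bytes usually contains the IFD and its tile index as well.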

With those properties, a client can: (1) range-request the first ~64 KB to read the header and tile index, (2) compute which tiles cover the area of interest at the right zoom level, (3) range-request only those tiles. Total bytes transferred: kilobytes, not gigabytes.
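
Steps (2) and (3) can be sketched in plain Python. This is a simplified illustration, not how any particular library implements it: it assumes a north-up image with square pixels and a single resolution level, and the function names, the georeferencing values, and the fixed 100 000-byte tile size are all made up for the example. Real tiles vary in size because of compression, and their offsets come from the TIFF TileOffsets and TileByteCounts tags.

```python
import math

def tiles_for_bbox(bbox, origin_x, origin_y, pixel_size, width, height, tile_size=512):
    """Return (row, col) indices of the internal tiles that intersect bbox.

    bbox is (minx, miny, maxx, maxy) in the image CRS; origin is the
    top-left corner of the image.
    """
    minx, miny, maxx, maxy = bbox
    # Map coordinates -> pixel coordinates (rows grow downward from the origin)
    col_min = max(0, math.floor((minx - origin_x) / pixel_size))
    col_max = min(width - 1, math.ceil((maxx - origin_x) / pixel_size) - 1)
    row_min = max(0, math.floor((origin_y - maxy) / pixel_size))
    row_max = min(height - 1, math.ceil((origin_y - miny) / pixel_size) - 1)
    if col_min > col_max or row_min > row_max:
        return []  # bbox lies outside the image
    return [
        (r, c)
        for r in range(row_min // tile_size, row_max // tile_size + 1)
        for c in range(col_min // tile_size, col_max // tile_size + 1)
    ]

def range_header(offset, byte_count):
    """Build an HTTP Range header value; the end position is inclusive."""
    return f"bytes={offset}-{offset + byte_count - 1}"

# Hypothetical 2048 x 2048 image, 10 m pixels, top-left corner at (500000, 4000000)
tiles = tiles_for_bbox((502000, 3985000, 507000, 3990000),
                       500000, 4000000, 10, 2048, 2048)
tiles_across = math.ceil(2048 / 512)
# Made-up tile index: fixed-size tiles in row-major order after a 64 KiB header
offsets = {(r, c): 65536 + (r * tiles_across + c) * 100_000 for r, c in tiles}
headers = [range_header(offsets[t], 100_000) for t in tiles]
print(tiles)
print(headers)
```

Only 4 of the 16 tiles are requested here. Adjacent tiles in the same row are contiguous in the file, so a client can merge their ranges into a single request.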

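from rio_tiler.io import Reader  # named COGReader in rio-tiler releases before 4.0

# Read just a small window from a COG on S3, with no full download
with Reader('https://noaa-goes18.s3.amazonaws.com/.../foo.tif') as cog:
    # bbox is (west, south, east, north) in WGS84; max_size caps the output dimensions
    img = cog.part(bbox=(-100, 30, -80, 40), max_size=512)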

Zarr

Zarr is a format for chunked, compressed, multi-dimensional arrays. Where COG is for 2D rasters, Zarr is for the (time × lat × lon × band × ...) hypercubes that modern Earth-observation analysis often needs. The data is stored as a directory tree on S3, with each chunk a separate object — so parallel reads of different chunks can fan out across many concurrent workers.

import xarray as xr

# Anonymous S3 access needs s3fs installed; bucket and variable names are placeholders
ds = xr.open_zarr('s3://my-bucket/era5-temperature.zarr',
                  storage_options={'anon': True})
# Now ds is a lazy xarray Dataset; reading a slice triggers parallel chunk fetches
# (if lat is stored north-to-south, use slice(40, 30) instead)
subset = ds.air_temperature.sel(time='2024-01-15', lat=slice(30, 40), lon=slice(-100, -80))
subset.load()  # actually fetches the chunks
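
To see why a small selection turns into only a few object reads, the sketch below lists which chunks an index range touches. It is a simplified illustration with hypothetical names and sizes, not the Zarr library's own code. It assumes the dot-separated chunk keys that Zarr version 2 uses by default (for example 14.2.3); Zarr version 3 lays keys out differently.

```python
from itertools import product

def chunk_keys(selection, chunk_shape, separator="."):
    """List the chunk keys touched by a selection.

    selection: one (start, stop) index pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        if stop <= start:
            return []  # empty selection touches no chunks
        # First and last chunk index covered along this dimension
        per_dim.append(range(start // size, (stop - 1) // size + 1))
    return [separator.join(map(str, idx)) for idx in product(*per_dim)]

# Hypothetical hourly (time, lat, lon) array chunked as (24, 100, 100):
# one day of data (time indices 336..359) over a 40 x 120 cell window
keys = chunk_keys([(336, 360), (200, 240), (280, 400)], (24, 100, 100))
print(keys)
```

Each key maps to one object under the array's directory, so the two chunks here can be fetched by two concurrent requests.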

Zarr is the standard for cloud-native climate, reanalysis, and time-series gridded data. The Pangeo community runs a free public collection of huge Zarr datasets at catalog.pangeo.io.

STAC: SpatioTemporal Asset Catalog

You have a COG or a Zarr. How do you tell users about it? How do they discover that you have a frame over Florida on January 15? Enter STAC, the SpatioTemporal Asset Catalog spec.

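STAC defines a small set of JSON schemas: Item (a GeoJSON Feature describing one scene, with its footprint, timestamp, and links to its assets), Collection (a group of related Items with shared metadata), and Catalog (a container that links Collections and Items together). The separate STAC API spec adds search endpoints on top of these.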

Major STAC catalogs (all free to query) include the Microsoft Planetary Computer, which the example below searches:

from pystac_client import Client

cat = Client.open('https://planetarycomputer.microsoft.com/api/stac/v1/')
search = cat.search(collections=['sentinel-2-l2a'],
                    bbox=[-81, 28, -80, 29],
                    datetime='2024-01-01/2024-02-01')
items = list(search.items())
print(f"{len(items)} matching scenes")
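
Each result of a search like the one above is a STAC Item: a GeoJSON Feature with a bbox, a datetime, and a dictionary of assets. The sketch below shows a heavily trimmed, made-up Item and a simplified version of the bbox-and-datetime test that a search performs. The id, href, and property values are hypothetical, and a real API matches against the full geometry rather than just the bbox.

```python
from datetime import datetime, timezone

# Hypothetical, heavily trimmed STAC Item (all values invented for illustration)
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20240115",
    "bbox": [-81.2, 28.1, -80.1, 29.0],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-81.2, 28.1], [-80.1, 28.1], [-80.1, 29.0],
                         [-81.2, 29.0], [-81.2, 28.1]]],
    },
    "properties": {"datetime": "2024-01-15T16:02:11Z", "eo:cloud_cover": 4.2},
    "assets": {
        "visual": {
            "href": "https://example.com/example-scene/visual.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
    "links": [],
}

def bboxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def matches(item, bbox, start, end):
    """Simplified search test: bbox overlap plus datetime within [start, end]."""
    stamp = item["properties"]["datetime"].replace("Z", "+00:00")
    when = datetime.fromisoformat(stamp)
    return bboxes_intersect(item["bbox"], bbox) and start <= when <= end

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 2, 1, tzinfo=timezone.utc)
hit = matches(item, [-81, 28, -80, 29], start, end)
print(hit, item["assets"]["visual"]["href"])
```

The asset's media type is how a client can tell that the href points at a COG it can range-request rather than a file it must download whole.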

The lab

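You'll identify a COG-formatted GOES product on AWS Open Data, use rio-tiler to fetch just a single map tile from it via HTTP range request, and time it against downloading the whole file. The speedup is typically 50–500×, depending on file size and network conditions. Then you'll query the Microsoft Planetary Computer STAC API for Sentinel-2 scenes over Cape Canaveral in 2024. A single search call returns the matching items; add a filter on the eo:cloud_cover property to keep only low-cloud scenes. Each item lists COG asset URLs that you can range-request once they are signed (Planetary Computer serves its assets through short-lived signed URLs, which the planetary_computer Python package can generate).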

This is the architecture that most modern production geospatial pipelines use, including LaunchDetect's. You no longer download data; you query catalogs and range-request the bytes you need.

Hands-on lab: Pull a single tile from a COG without downloading the file

Identify a COG-formatted GOES product on AWS Open Data. Use rio-tiler to fetch just a single tile via HTTP range request. Time it vs downloading the whole file.
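
For the timing step, a small helper like the one below is enough. It is a sketch only: the two fake_* functions are hypothetical stand-ins that sleep instead of touching the network, so swap in your real full download and your real rio-tiler tile read.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast path was."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return slow_seconds / fast_seconds

# Hypothetical stand-ins for the two lab steps
def fake_full_download():
    time.sleep(0.05)   # pretend to download the whole file
    return "whole file"

def fake_tile_read():
    time.sleep(0.005)  # pretend to range-request one tile
    return "one tile"

_, t_full = timed(fake_full_download)
_, t_tile = timed(fake_tile_read)
print(f"full: {t_full:.3f}s  tile: {t_tile:.3f}s  speedup: {speedup(t_full, t_tile):.1f}x")
```

A single run is noisy, so repeat each measurement a few times and compare medians before quoting a speedup.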

Quiz

Test yourself. Answer key on the certificate-track page.

Q1. COG is:
  1. A GeoTIFF with internal tiling + overviews + a header-first byte layout for HTTP range reads
  2. A new format separate from GeoTIFF
  3. A vector format
  4. A compression scheme only
Q2. Zarr is best for:
  1. Multi-dimensional gridded data (e.g. time × lat × lon × band), chunkable, parallelizable
  2. Vector data
  3. 1D time series only
  4. Single static rasters
Q3. STAC is:
  1. SpatioTemporal Asset Catalog — a spec for cataloging geospatial assets
  2. A file format
  3. A query language
  4. A database
Q4. HTTP range request lets you:
  1. Fetch a byte range of a file rather than the whole file
  2. Run faster
  3. Authenticate
  4. Compress
Q5. STAC API standard endpoints include:
  1. /search, /collections, /collections/{collectionId}/items
  2. /users, /posts only
  3. /login, /logout
  4. GraphQL only