The old way: download a 10 GB GeoTIFF. The new way: range-request just the bytes you need from S3. COG, Zarr, and STAC make this possible.
The traditional way to use satellite imagery: download a 5 GB scene, unzip it, load it into desktop GIS. The cloud-native way: range-request just the bytes you need from a file living on S3, and never download the whole thing. This week covers the two formats and one spec that make that possible: COG, Zarr, and STAC.
COG isn't a new file format. It's a particular way of writing a regular GeoTIFF so that HTTP range-request access is efficient. Three requirements: (1) pixels are stored as internal tiles (typically 256×256 or 512×512 blocks) rather than strip by strip, (2) the file contains overviews, i.e. progressively downsampled copies of the image for low-zoom reads, and (3) the header and tile offsets sit at the start of the file, so a single small read tells a client where every tile's bytes live.
With those properties, a client can: (1) range-request the first ~64 KB to read the header and tile index, (2) compute which tiles cover the area of interest at the right zoom level, (3) range-request only those tiles. Total bytes transferred: kilobytes, not gigabytes.
from rio_tiler.io import COGReader

# Read just a small window from a COG on S3, no full download
with COGReader('https://noaa-goes18.s3.amazonaws.com/.../foo.tif') as cog:
    img = cog.part(bbox=(-100, 30, -80, 40), max_size=512)
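Under the hood, step (2) of that procedure is plain tile-grid arithmetic. A minimal sketch, where every parameter is a hypothetical stand-in for a value a real client would parse out of the COG header:

```python
import math

def tiles_for_bbox(bbox, image_bounds, image_size, tile_size=256):
    """Step (2): which internal tiles of a tiled GeoTIFF intersect a bbox.

    bbox, image_bounds: (minx, miny, maxx, maxy) in map coordinates
    image_size: (width, height) in pixels
    """
    minx, miny, maxx, maxy = bbox
    ix0, iy0, ix1, iy1 = image_bounds
    width, height = image_size
    # Pixel resolution in map units per pixel
    xres = (ix1 - ix0) / width
    yres = (iy1 - iy0) / height
    # Convert map coordinates to pixel coordinates (row 0 at the top edge)
    px0 = (minx - ix0) / xres
    px1 = (maxx - ix0) / xres
    py0 = (iy1 - maxy) / yres
    py1 = (iy1 - miny) / yres
    # Tile indices covering that pixel window
    tx0, tx1 = int(px0 // tile_size), int(math.ceil(px1 / tile_size))
    ty0, ty1 = int(py0 // tile_size), int(math.ceil(py1 / tile_size))
    return [(tx, ty) for ty in range(ty0, ty1) for tx in range(tx0, tx1)]

# Western half of a global 1024x512 image, 256-px tiles:
tiles_for_bbox((-180, -90, 0, 90), (-180, -90, 180, 90), (1024, 512))
# -> [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Each returned (tx, ty) pair maps, via the tile-offset table in the header, to one byte range to request. Libraries like rio-tiler do exactly this bookkeeping for you.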
Zarr is a format for chunked, compressed, multi-dimensional arrays. Where COG is for 2D rasters, Zarr is for the (time × lat × lon × band × ...) hypercubes that modern Earth-observation analysis often needs. The data is stored as a directory tree on S3, with each chunk a separate object — so parallel reads of different chunks can fan out across many concurrent workers.
import xarray as xr

ds = xr.open_zarr('s3://my-bucket/era5-temperature.zarr',
                  storage_options={'anon': True})
# Now ds is a lazy xarray Dataset; reading a slice triggers parallel chunk fetches
subset = ds.air_temperature.sel(time='2024-01-15',
                                lat=slice(30, 40), lon=slice(-100, -80))
subset.load()  # actually fetches the chunks
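The chunk-per-object layout is what makes those parallel fetches possible: a selection like the one above resolves to a specific set of chunk keys, each a separate S3 object. A minimal sketch of that mapping, using Zarr v2's default dot-separated chunk naming and a hypothetical chunk shape (real values come from the array's .zarray metadata):

```python
import itertools

def chunk_keys(index, chunk_shape):
    """Map a per-axis (start, stop) selection to the Zarr chunk keys storing it."""
    ranges = []
    for (start, stop), c in zip(index, chunk_shape):
        # First and last chunk index touched along this axis
        ranges.append(range(start // c, (stop - 1) // c + 1))
    # Zarr v2 names chunks by dot-joined grid coordinates, e.g. "5.3.0"
    return ['.'.join(map(str, coords)) for coords in itertools.product(*ranges)]

# A (time, lat, lon) selection spanning two lat chunks:
chunk_keys([(5, 6), (300, 401), (0, 100)], chunk_shape=(1, 100, 100))
# -> ['5.3.0', '5.4.0']
```

Each key is one object under the store prefix, so the two chunks here can be fetched by two workers concurrently.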
Zarr is the de facto standard for cloud-native climate, reanalysis, and gridded time-series data. The Pangeo community maintains a free public collection of large Zarr datasets at catalog.pangeo.io.
You have a COG or a Zarr. How do you tell users about it? How do they discover that you have a scene over Florida on January 15? Enter STAC, the SpatioTemporal Asset Catalog spec.
STAC defines a small set of JSON schemas: an Item (one scene, as a GeoJSON Feature, plus links to its assets), a Collection (a group of related Items with shared metadata), and a Catalog (a tree that links everything together). The companion STAC API spec adds standard HTTP endpoints on top: /search, /collections, /items. Major STAC catalogs (all free to query) include Microsoft's Planetary Computer and Element 84's Earth Search over the Sentinel-2 archive on AWS.
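The core unit is the Item. A minimal hand-written example of the JSON an Item schema describes (illustrative IDs and URLs, not a real scene):

```python
import json

# A minimal, hand-written STAC Item (illustrative values only)
item = {
    "type": "Feature",                # every Item is a GeoJSON Feature
    "stac_version": "1.0.0",
    "id": "example-scene-20240115",
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-81, 28], [-80, 28], [-80, 29],
                                  [-81, 29], [-81, 28]]]},
    "bbox": [-81, 28, -80, 29],
    "properties": {"datetime": "2024-01-15T16:00:00Z"},
    "links": [],
    "assets": {
        # each asset points at a file, often a COG, via its href
        "visual": {
            "href": "https://example.com/scene.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}
json.dumps(item)  # serializes cleanly; this JSON is what lands in a catalog
```

That's the whole trick: a search API returns Items like this, and the asset hrefs are URLs you can range-request directly.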
from pystac_client import Client

cat = Client.open('https://planetarycomputer.microsoft.com/api/stac/v1/')
search = cat.search(collections=['sentinel-2-l2a'],
                    bbox=[-81, 28, -80, 29],
                    datetime='2024-01-01/2024-02-01')
items = list(search.items())
print(f"{len(items)} matching scenes")
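Once you have items, a common refinement is filtering on the eo:cloud_cover property that optical collections like sentinel-2-l2a carry. A sketch using plain dicts to stand in for the returned Item objects (real pystac Items expose the same data via item.properties):

```python
# Plain dicts standing in for STAC Items from a search (made-up values)
items = [
    {"id": "S2A_0115", "properties": {"eo:cloud_cover": 3.2}},
    {"id": "S2B_0120", "properties": {"eo:cloud_cover": 67.0}},
    {"id": "S2A_0125", "properties": {"eo:cloud_cover": 11.5}},
]

# Keep scenes under 20% cloud cover, clearest first
clear = sorted(
    (it for it in items if it["properties"]["eo:cloud_cover"] < 20),
    key=lambda it: it["properties"]["eo:cloud_cover"],
)
print([it["id"] for it in clear])  # -> ['S2A_0115', 'S2A_0125']
```

Many STAC APIs can also do this server-side via a query or filter parameter, but client-side filtering on properties always works.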
You'll identify a COG-formatted GOES product on AWS Open Data, use rio-tiler to fetch just a single map tile from it via HTTP range request, and time it against downloading the whole file. The speedup is typically 50–500×. Then you'll query the Microsoft Planetary Computer STAC API for Sentinel-2 scenes over Cape Canaveral in 2024 — a one-line search that returns dozens of cloud-free items, each with COG asset URLs you can immediately range-request.
This is the architecture every modern production geospatial pipeline uses, including LaunchDetect's. You no longer download data; you query catalogs and range-request the bytes you need.
Identify a COG-formatted GOES product on AWS Open Data. Use rio-tiler to fetch just a single tile via HTTP range request. Time it vs downloading the whole file.
Test yourself. Answer key on the certificate-track page (Gold-tier feature: progress tracking and auto-grading).