Week 26 · Space GIS Architect

Cloud-native: COG, Zarr, STAC catalogs

The old way: download a 10 GB GeoTIFF. The new way: range-request just the bytes you need from S3. COG, Zarr, and STAC make this possible.

Learning objectives

Understand COG (Cloud-Optimized GeoTIFF) structure
Use Zarr for multi-dimensional gridded data
Build and query a STAC catalog
Range-request a tile out of a COG without downloading the file

Primer

The traditional way to use satellite imagery: download a 5 GB scene, unzip it, load it into desktop GIS. The cloud-native way: range-request just the bytes you need from a file living on S3 and never download the whole thing. This week covers the two formats (COG and Zarr) and the one spec (STAC) that make that possible.

Cloud-Optimized GeoTIFF (COG)

COG isn't a new file format. It's a particular way of writing a regular GeoTIFF so that HTTP-range-request access is efficient. Three requirements:

Internal tiling — the image is divided into ~256×256 or 512×512 pixel tiles, stored in row-major order. The file header has an index of where each tile begins.
Internal overviews — the file also contains downsampled copies of the image (typically reduced by factors of 2, 4, 8, ...) for fast low-zoom rendering.
Header at the beginning — TIFF allows the IFD (image file directory) to live anywhere; COG mandates the beginning so a small range request can read the structure first.

With those properties, a client can: (1) range-request the first ~64 KB to read the header and tile index, (2) compute which tiles cover the area of interest at the right zoom level, (3) range-request only those tiles. Total bytes transferred: kilobytes, not gigabytes.

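from rio_tiler.io import Reader  # rio-tiler >= 4.0; older releases name this class COGReader

# Read just a small window from a COG on S3, with no full download.
# bbox is (west, south, east, north) in WGS84 degrees; max_size caps the longer output side in pixels.
with Reader('https://noaa-goes18.s3.amazonaws.com/.../foo.tif') as cog:
    img = cog.part(bbox=(-100, 30, -80, 40), max_size=512)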
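Step (2) of that access pattern is plain integer arithmetic, and it can help to see it without a library in the way. The sketch below is a simplified illustration, not how rio-tiler or GDAL actually implement it: the function names, the image size, and the tile offsets are all made up for the example, and a real reader would take the offsets from the TIFF TileOffsets and TileByteCounts tags and usually merge adjacent ranges.

```python
def pick_overview(window_px, target_px, factors=(2, 4, 8, 16)):
    """Largest decimation factor that still yields at least target_px pixels.

    Returns 1 (full resolution) when no overview is coarse enough to help.
    """
    best = 1
    for f in sorted(factors):
        if window_px // f >= target_px:
            best = f
    return best


def tiles_for_window(col_off, row_off, width, height, tile_size, image_width):
    """Row-major indices of the tiles that overlap a pixel window."""
    tiles_across = -(-image_width // tile_size)  # ceiling division
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [r * tiles_across + c
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def range_header(offset, byte_count):
    """HTTP Range header for byte_count bytes starting at offset (end is inclusive)."""
    return {"Range": f"bytes={offset}-{offset + byte_count - 1}"}


# Hypothetical 8192 x 8192 image; we want a 2000 x 2000 pixel window at about 512 px.
factor = pick_overview(window_px=2000, target_px=512)
tiles = tiles_for_window(1200 // factor, 200 // factor, 2000 // factor, 2000 // factor,
                         tile_size=512, image_width=8192 // factor)

# Invented tile index for the chosen overview: 64 tiles of 40,000 bytes each.
tile_offsets = [16_384 + i * 40_000 for i in range(64)]
tile_byte_counts = [40_000] * 64
headers = [range_header(tile_offsets[t], tile_byte_counts[t]) for t in tiles]

print(factor)      # 2
print(tiles)       # [1, 2, 3, 9, 10, 11, 17, 18, 19]
print(headers[0])  # {'Range': 'bytes=56384-96383'}
```

Nine tiles of about 40 KB each is roughly 360 KB on the wire, which is where the "kilobytes, not gigabytes" figure comes from.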

Zarr

Zarr is a format for chunked, compressed, multi-dimensional arrays. Where COG is for 2D rasters, Zarr is for the (time × lat × lon × band × ...) hypercubes that modern Earth-observation analysis often needs. The data is stored as a directory tree on S3, with each chunk a separate object — so parallel reads of different chunks can fan out across many concurrent workers.

import xarray as xr  # reading s3:// URLs also requires the s3fs package

ds = xr.open_zarr('s3://my-bucket/era5-temperature.zarr',
                  storage_options={'anon': True})
# ds is a lazy xarray Dataset; reading a slice triggers parallel chunk fetches.
# The result is named "subset" so the built-in slice() is not shadowed.
subset = ds.air_temperature.sel(time='2024-01-15', lat=slice(30, 40), lon=slice(-100, -80))
subset.load()  # actually fetches the chunks

Zarr is the standard for cloud-native climate, reanalysis, and time-series gridded data. The Pangeo community runs a free public collection of huge Zarr datasets at catalog.pangeo.io.
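Because each chunk is its own object, working out which objects a selection touches is again integer division. The following is a rough sketch that assumes the Zarr v2 convention of naming chunks by dot-separated grid indices (for example 14.2.3); the function name, chunk shape, and index ranges are invented for illustration, and real code would let the zarr library do this.

```python
from itertools import product


def chunk_keys(selection, chunk_shape, sep="."):
    """Keys of the chunks touched by a selection.

    selection:   one (start, stop) index pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_dim = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return [sep.join(map(str, idx)) for idx in product(*per_dim)]


# Hypothetical (time, lat, lon) array chunked as (1, 100, 100):
# one time step, lat indices 200..239, lon indices 280..399.
keys = chunk_keys([(14, 15), (200, 240), (280, 400)], chunk_shape=(1, 100, 100))
print(keys)  # ['14.2.2', '14.2.3']
```

Each key maps to one object under the array's prefix, so the two fetches here are independent and can run concurrently.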

STAC: SpatioTemporal Asset Catalog

You have a COG or a Zarr. How do you tell users about it? How do they discover that you have a frame over Florida on January 15? Enter STAC, the SpatioTemporal Asset Catalog spec.

STAC defines a small set of JSON schemas:

Item — one scene or granule (e.g. one Landsat scene), stored as a GeoJSON Feature. Has geometry, a datetime or time range, properties, and one or more asset URLs (the actual COG / Zarr / etc.).
Collection — a homogeneous group of items (e.g. "Landsat 9 Level-2 surface reflectance").
Catalog — a hierarchy of collections.
STAC API — a standardized REST interface for searching across catalogs. Endpoints: /search, /collections, and per-collection /items (at /collections/{collectionId}/items).
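To make the Item schema concrete, here is a minimal sketch of one Item as a Python dict, plus the kind of bounding-box and time test a search performs against it. The id, coordinates, and asset URL are placeholders invented for the example, and the matching logic is deliberately naive (it ignores the antimeridian and items that use a start/end time range instead of a single datetime).

```python
from datetime import datetime, timezone

# A hypothetical Item: a GeoJSON Feature with STAC's extra required fields.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20240115",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-81, 28], [-80, 28], [-80, 29], [-81, 29], [-81, 28]]],
    },
    "bbox": [-81.0, 28.0, -80.0, 29.0],
    "properties": {"datetime": "2024-01-15T16:00:00Z"},
    "links": [],
    "assets": {
        "visual": {
            "href": "https://example.com/scene.tif",  # placeholder URL
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}


def bbox_intersects(a, b):
    """True when two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def matches(item, bbox, start, end):
    """True when the item overlaps bbox and its datetime falls in [start, end)."""
    stamp = item["properties"]["datetime"].replace("Z", "+00:00")
    when = datetime.fromisoformat(stamp)
    return bbox_intersects(item["bbox"], bbox) and start <= when < end


jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
print(matches(item, [-82, 27, -80.5, 28.5], jan, feb))  # True
print(matches(item, [0, 0, 1, 1], jan, feb))            # False
```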

Major STAC catalogs (all free to query):

Microsoft Planetary Computer — Landsat, Sentinel-1, Sentinel-2, NAIP, ESA WorldCover, and dozens more.
AWS Earth Search — Sentinel-2, Landsat, NAIP via Element 84.
Radiant Earth MLHub — labeled training datasets for ML.

from pystac_client import Client

cat = Client.open('https://planetarycomputer.microsoft.com/api/stac/v1/')
search = cat.search(collections=['sentinel-2-l2a'],
                    bbox=[-81, 28, -80, 29],
                    datetime='2024-01-01/2024-02-01')
items = list(search.items())
print(f"{len(items)} matching scenes")
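Under the client library, a search like this is a JSON document sent to the API's /search endpoint. The sketch below builds such a body by hand to show its shape; the helper name is invented, and the query field belongs to the optional Query extension, so whether a given server honors it is an assumption to check against that server's conformance list.

```python
import json


def search_body(collections, bbox, datetime_range, max_cloud=None, limit=100):
    """Build a request body for POST /search (hypothetical helper)."""
    body = {
        "collections": list(collections),
        "bbox": list(bbox),            # [west, south, east, north]
        "datetime": datetime_range,    # "start/end" interval
        "limit": limit,                # page size, not a cap on total results
    }
    if max_cloud is not None:
        # Query extension: keep items whose cloud cover is below max_cloud percent.
        body["query"] = {"eo:cloud_cover": {"lt": max_cloud}}
    return body


body = search_body(["sentinel-2-l2a"], [-81, 28, -80, 29],
                   "2024-01-01/2024-02-01", max_cloud=10)
print(json.dumps(body, indent=2))
```

pystac-client accepts the same filter through the query argument of search(), so the hand-built body is only needed when calling the API without the library.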

The lab

You'll identify a COG-formatted GOES product on AWS Open Data, use rio-tiler to fetch just a single map tile from it via HTTP range request, and time it against downloading the whole file. The speedup is typically 50–500×. Then you'll query the Microsoft Planetary Computer STAC API for Sentinel-2 scenes over Cape Canaveral in 2024: a short search that returns dozens of items, which you can narrow to low-cloud scenes by filtering on the eo:cloud_cover property. Each item carries COG asset URLs you can range-request once they are signed (Planetary Computer requires a short-lived access token, which its planetary-computer Python package adds to the URLs).

This is the architecture most modern production geospatial pipelines use, including LaunchDetect's. You no longer download data; you query catalogs and range-request the bytes you need.

Hands-on lab: Pull a single tile from a COG without downloading the file

Identify a COG-formatted GOES product on AWS Open Data. Use rio-tiler to fetch just a single tile via HTTP range request. Time it vs downloading the whole file.
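For the timing comparison, a small wrapper around time.perf_counter is enough. The sketch below uses two stand-in functions that only sleep, so the printed speedup is meaningless; in the lab you would swap in your own rio-tiler read and full-file download. All names here are placeholders.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_tile_read():
    """Stand-in for a range-request tile read."""
    time.sleep(0.01)
    return b"tile"


def fake_full_download():
    """Stand-in for downloading the whole file."""
    time.sleep(0.05)
    return b"whole file"


_, t_tile = timed(fake_tile_read)
_, t_full = timed(fake_full_download)
print(f"tile: {t_tile:.3f} s, full: {t_full:.3f} s, speedup: {t_full / t_tile:.1f}x")
```

Run each measurement a few times and compare medians, since the first request to a bucket often pays extra for DNS and TLS setup.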

Quiz

Test yourself. Answer key on the certificate-track page (Gold-tier feature: progress tracking and auto-grading).

Q1. COG is:

A GeoTIFF with internal tiling + overviews + correct byte ordering for HTTP range reads
A new format separate from GeoTIFF
A vector format
A compression scheme only

Q2. Zarr is best for:

Multi-dimensional gridded data (e.g. time × lat × lon × band), chunkable, parallelizable
Vector data
1D time series only
Single static rasters

Q3. STAC is:

SpatioTemporal Asset Catalog — a spec for cataloging geospatial assets
A file format
A query language
A database

Q4. HTTP range request lets you:

Fetch a byte range of a file rather than the whole file
Run faster
Authenticate
Compress

Q5. STAC API standard endpoints include:

/search, /collections, /items
/users, /posts only
/login, /logout
GraphQL only