summarize_data.py - Summarize data by directory

Description

Summarizes data file metadata by directory. The script reads generated/data-metadata.csv and aggregates file sizes per directory, giving a high-level overview of how the data is organized and how much storage each directory uses.

Usage

python tools/summarize_data.py

No command-line arguments are required; the script reads from a fixed input path.

Input File

The script expects a CSV file at:

./generated/data-metadata.csv

This file should contain columns:

Example Output

Summary of data by highest directory level:
raw_data: 1458.32 MB
processed: 892.15 MB
output: 245.67 MB

Requirements

Use Cases

Technical Details

The script:

  1. Reads generated/data-metadata.csv

  2. Extracts the second-to-last directory level from each file path

  3. Aggregates file sizes (converted to MB)

  4. Prints a summary sorted by directory name
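
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual source. The column names `path` and `size_bytes`, the helper names, and the use of binary megabytes (1024 * 1024 bytes) are all assumptions. "Second-to-last directory level" is read here as the second-to-last component of the path, which is only one plausible interpretation. An in-memory sample stands in for generated/data-metadata.csv so the sketch runs on its own.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def directory_key(file_path, level=-2):
    """Return the path component at `level` (default: second-to-last)."""
    parts = PurePosixPath(file_path).parts
    if len(parts) < abs(level):
        return "."  # file sits at the top level; no such component
    return parts[level]


def summarize(rows, path_col="path", size_col="size_bytes"):
    """Aggregate sizes in MB per directory key, sorted by directory name.

    The column names are hypothetical; adjust them to the real CSV header.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[directory_key(row[path_col])] += int(row[size_col]) / BYTES_PER_MB
    return dict(sorted(totals.items()))


def format_summary(totals):
    """Render the totals in the same shape as the example output."""
    lines = ["Summary of data by highest directory level:"]
    lines += [f"{name}: {size_mb:.2f} MB" for name, size_mb in totals.items()]
    return "\n".join(lines)


# Hypothetical sample standing in for generated/data-metadata.csv
SAMPLE_CSV = """path,size_bytes
raw_data/a.bin,1048576
raw_data/b.bin,524288
processed/c.bin,2097152
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
summary = summarize(rows)
print(format_summary(summary))
```

To run the same logic against the real file, the `io.StringIO` sample could be replaced with `open("generated/data-metadata.csv", newline="")`, once the column names have been matched to the actual header.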

Notes

Run this script after the data file listing scripts have generated the metadata CSV file. See 02_list_data_files.sh for metadata generation.