Description
Summarizes file sizes by directory level from a metadata CSV file. The script reads generated/data-metadata.csv and aggregates file sizes by directory, giving a high-level overview of how the data is organized and how much storage each directory uses.
Usage
python tools/summarize_data.py
No command-line arguments are required. The script reads from a fixed input path.
Input File
The script expects a CSV file at:
./generated/data-metadata.csv
This file should contain the columns:
filename - File path with directory structure
bytes - File size in bytes
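The input format can be illustrated with a minimal sketch. It assumes a comma-delimited file with a header row naming the two columns; the helper read_metadata, the constant REQUIRED_COLUMNS, and the sample paths are hypothetical and not taken from the script itself.

```python
import csv
import io

# Columns the documentation says the metadata file should contain.
REQUIRED_COLUMNS = {"filename", "bytes"}


def read_metadata(handle):
    """Read (filename, size_in_bytes) pairs from an open CSV handle.

    Hypothetical helper: assumes a header row and comma delimiters.
    """
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [(row["filename"], int(row["bytes"])) for row in reader]


# Made-up sample content standing in for generated/data-metadata.csv.
sample = io.StringIO(
    "filename,bytes\n"
    "raw_data/2020/a.csv,1048576\n"
    "processed/b.csv,2048\n"
)
records = read_metadata(sample)
```

To check a real file, the same function could be called on open("generated/data-metadata.csv", newline="").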
Example Output
Summary of data by highest directory level:
raw_data: 1458.32 MB
processed: 892.15 MB
output: 245.67 MB
Requirements
Python >= 3.6
Standard library only (csv, collections)
Use Cases
Quickly assess data distribution across directories
Identify large data folders requiring special handling
Generate size summaries for documentation
Analyze data organization structure
Technical Details
The script:
Reads generated/data-metadata.csv
Extracts the second-to-last directory level from each file path
Aggregates file sizes (converted to MB)
Prints a summary sorted by directory name
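The steps above can be sketched roughly as follows. This is not the script's actual source. It assumes that "second-to-last directory level" means the second-to-last component of a "/"-separated path, that 1 MB is 1024 * 1024 bytes, and that files with no directory are grouped under "."; the names directory_key, summarize, and BYTES_PER_MB are hypothetical.

```python
import csv
import io
from collections import defaultdict

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def directory_key(filename):
    """Return the second-to-last path component (assumed interpretation)."""
    parts = filename.split("/")
    # A bare file name has no directory; "." is a hypothetical fallback.
    return parts[-2] if len(parts) >= 2 else "."


def summarize(rows):
    """Sum file sizes in MB per directory key, sorted by directory name."""
    totals = defaultdict(float)
    for row in rows:
        totals[directory_key(row["filename"])] += int(row["bytes"]) / BYTES_PER_MB
    return dict(sorted(totals.items()))


# Made-up sample rows standing in for generated/data-metadata.csv.
sample = io.StringIO(
    "filename,bytes\n"
    "raw_data/a.csv,1048576\n"
    "raw_data/b.csv,524288\n"
    "processed/c.csv,2097152\n"
)
summary = summarize(csv.DictReader(sample))

for name, megabytes in summary.items():
    print(f"{name}: {megabytes:.2f} MB")
```

Grouping with a defaultdict keeps the aggregation to a single pass over the file, and sorting once at the end matches the "sorted by directory name" behavior described above.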
Notes
This script is typically used after data file listing scripts have generated the metadata CSV file. See 02