File: bitbucket
Purpose¶
This file defines the CI/CD pipelines for automated replication package analysis and verification. It orchestrates downloading deposits from various repositories, analyzing code and data, scanning for dependencies and PII, and generating comprehensive reports.
Custom Pipelines¶
The configuration defines several custom pipelines that can be manually triggered via Bitbucket’s web interface.
1. 1-populate-from-icpsr¶
Purpose: Full analysis pipeline with parallel processing for optimal performance.
Parameters:
openICPSRID - openICPSR project ID (or read from config.yml)
jiraticket - JIRA ticket identifier
ZenodoID - Zenodo deposit ID (alternative to openICPSR)
ProcessStata - Enable/disable Stata scanning (default: yes)
ProcessR - Enable/disable R scanning (default: yes)
ProcessPython - Enable/disable Python scanning (default: yes)
ProcessJulia - Enable/disable Julia scanning (default: no)
ProcessPii - Enable/disable PII scanning (default: yes)
SkipProcessing - Skip all processing steps (default: no)
Pipeline Steps:
Step 1: Download¶
Image: python:3.12
Downloads deposit from openICPSR or Zenodo
Unpacks ZIP archives
Lists data and program files
Creates manifests with checksums
Checks for ZIP files, duplicates, zero-byte files
Validates file paths
Compares manifests
Artifacts: generated/**, cache/**
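The manifest and file checks listed for this step can be approximated with standard tools. The following is a minimal sketch, not the pipeline's own scripts: the function names are hypothetical, and it assumes a Linux image where `sha256sum` (GNU coreutils) is available.

```shell
# Sketch only: function names are illustrative, not taken from the pipeline.

# Write "<sha256>  ./relative/path" for every file under $1 into $2.
# Keep $2 outside $1 so the manifest does not list itself.
make_manifest() (
    case $2 in
        /*) out=$2 ;;
        *)  out=$PWD/$2 ;;
    esac
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 > "$out"
)

# List zero-byte files under $1 (paths relative to $1).
list_zero_byte() (
    cd "$1" || exit 1
    find . -type f -size 0
)

# Print each checksum that occurs more than once in manifest $1,
# which indicates files with identical content.
duplicate_hashes() {
    awk '{ print $1 }' "$1" | LC_ALL=C sort | uniq -d
}

# List ZIP archives that remain under $1 (for example, nested archives).
list_zip_files() (
    cd "$1" || exit 1
    find . -type f -name '*.zip'
)
```

Two manifests produced this way can be compared with `cmp -s` or `diff`, since the entries are sorted by path.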
Step 2: Check downloads¶
Verifies that deposit ZIP file exists in cache
Fails pipeline if download was unsuccessful
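A check of this kind could look like the sketch below. The `cache/<id>.zip` layout and the function name are assumptions made for illustration; the actual file naming used by the pipeline is not stated here.

```shell
# Sketch: return non-zero when the expected deposit archive is missing
# or empty. $1 is the deposit ID, $2 an optional cache directory.
# The "<cache>/<id>.zip" naming is an assumption.
check_download() {
    archive="${2:-cache}/$1.zip"
    if [ ! -s "$archive" ]; then
        echo "Deposit archive not found or empty: $archive" >&2
        return 1
    fi
    echo "Found $archive"
}
```

In a step script, `check_download "$openICPSRID" || exit 1` would stop the pipeline, because Bitbucket fails a step when any script command exits non-zero.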
Step 3: Parallel Processing¶
Runs the following scanners concurrently to shorten total run time:
3a. Run Stata parser
Image: larsvilhuber/bitbucket-stata:latest
Scans Stata code for package dependencies
Generates candidatepackages.csv
Artifacts: generated/**
3b. Run Stata PII scanner
Image: larsvilhuber/bitbucket-stata:latest
Scans for Personally Identifiable Information
Generates PII reports
Artifacts: generated/**
3c. Run R parser
Image: aeadataeditor/verification-r:latest
Checks R dependencies and data files
Generates r-deps.csv and r-data-checks.csv
3d. Run Python parser
Image: python:3.12
Scans Python code for package dependencies
Generates python-deps.csv
Artifacts: generated/**
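To give a sense of what a dependency scan involves, the sketch below lists the top-level modules named in `import` and `from ... import` statements. It is a rough approximation, not the scanner the pipeline runs: the function name is hypothetical, only the first module of `import a, b` is captured, and standard-library modules are not filtered out.

```shell
# Sketch: print the unique top-level module names imported by any
# .py file under directory $1. Not the pipeline's actual scanner.
list_python_imports() (
    cd "$1" || exit 1
    find . -type f -name '*.py' -exec \
        sed -n -E 's/^[[:space:]]*(import|from)[[:space:]]+([A-Za-z_][A-Za-z0-9_]*).*/\2/p' {} + |
        LC_ALL=C sort -u
)
```

The output is one module name per line, which could be redirected into a CSV-style file for later review.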
3e. Run Julia parser
Image: julia:latest
Scans Julia code for package dependencies
Artifacts: generated/**
3f. Count lines and comments
Image: aldanial/cloc
Counts lines of code by language
Generates code statistics
Artifacts: generated/**
Step 4: Add info to REPLICATION.md¶
Image: python:3.12
Consolidates all findings into report
Runs 24_amend_report.sh
Artifacts: generated/**
Step 5: Commit everything back¶
Image: python:3.12
Re-extracts deposit
Commits analyzed code to repository
Cleans up deposit directory
Replaces report sections
Updates config.yml
Pushes to Git with tags
Use Case: Standard replication package analysis with optimal performance.
2. w-big-populate-from-icpsr¶
Purpose: Single-step pipeline for large deposits requiring more resources.
Parameters:
openICPSRID - openICPSR project ID
ZenodoID - Zenodo deposit ID
jiraticket - JIRA ticket identifier
Pipeline Steps:
Step: Download and commit¶
Image: python:3.12
Size: 2x (double resources)
Installs cloc for line counting
Downloads and analyzes deposit sequentially
All processing in single step (no parallelization)
Commits and pushes results
Key Differences from Pipeline 1:
All processing is sequential (slower but more reliable)
Double resources (2x memory and CPU)
No artifact passing between steps
Includes cloc installation
Use Case: Large deposits that may timeout or fail with standard resources.
3. z-run-stata¶
Purpose: Execute Stata replication code.
Parameters:
openICPSRID - Project ID
MainFile - Main script to execute (default: main.do)
RunCommand - Command to run (default: run.sh)
Pipeline Steps:
Step: Run Stata code¶
Image: larsvilhuber/bitbucket-stata:latest
Size: 2x
Downloads deposit
Executes replication code
Commits outputs
Pushes results
Use Case: Running actual Stata replication code (not just analysis).
4. z-run-any-big¶
Purpose: Execute replication code with maximum resources.
Parameters:
openICPSRID - Project ID
MainFile - Main script to execute
RunCommand - Command to run (default: run.sh)
Pipeline Steps:
Step: Run R or Stata code¶
Image: larsvilhuber/bitbucket-stata:latest
Size: 8x (8x resources - maximum available)
Downloads deposit
Executes replication code
Commits outputs
Pushes results
Use Case: Large, resource-intensive replications requiring maximum compute.
5. 2-merge-report¶
Purpose: Combine Part A and Part B of a split report.
Parameters:
jiraticket- JIRA ticket identifier
Pipeline Steps:
Runs 50_merge-parts.sh
Pushes merged report
Use Case: After separate completion of report sections.
6. 3-split-report¶
Purpose: Split REPLICATION.md into Part A and Part B.
Parameters:
jiraticket- JIRA ticket identifier
Pipeline Steps:
Runs 51_split-parts.sh
Pushes split reports
Use Case: When report needs to be worked on in sections.
7. 4-refresh-tools¶
Purpose: Update pipeline tools from master template.
Pipeline Steps:
Downloads latest update_tools.sh
Executes update
Pushes updated tools
Use Case: Keeping tools synchronized with template repository.
8. 5-rename-directory¶
Purpose: Rename a deposit directory in the repository.
Parameters:
oldName - Current directory name
newName - New directory name
jiraticket - JIRA ticket identifier
Pipeline Steps:
Runs git mv $oldName $newName
Commits with [skip ci] to avoid triggering pipelines
Pushes changes
Use Case: Correcting directory names or reorganizing deposits.
9. 6-convert-eps-pdf¶
Purpose: Convert EPS and PDF graphics to PNG format.
Parameters:
path - Directory containing graphics
jiraticket - JIRA ticket identifier
ProcessEPS - Convert EPS files (default: yes)
ProcessPDF - Convert PDF files (default: no)
DockerImg - Docker image to use (default: dpokidov/imagemagick)
Pipeline Steps:
Uses Docker-in-Docker (services: docker)
Mounts current directory into container
Runs 52_convert_eps_pdf.sh
Commits converted files
Pushes with [skip ci]
Use Case: Converting graphics for better diff visualization or compatibility.
10. 7-download-box-manifest¶
Purpose: Download restricted data from Box and generate manifest files.
Parameters:
jiraticket- JIRA ticket identifier
Pipeline Steps:
Step: Download Box and create manifests¶
Image: python:3.12
Caches: pip packages
Installs Python requirements
Runs download_box_private.py to download restricted data from Box
Executes 04_create_manifest.sh restricted twice to generate checksums
Force-adds all files in generated/ directory
Commits with [skip ci] to avoid triggering pipelines
Pushes changes
Use Case: Downloading and documenting restricted data stored on Box for replication packages that include confidential data.
Note: Requires Box API credentials to be configured in the environment.
11. x-run-python¶
Purpose: Execute custom Python scripts.
Parameters:
Script- Script to run (default: run-python.sh)
Pipeline Steps:
Image: python:3.11
Executes specified script
Runs post-run script if present
Pushes results
Use Case: Custom Python processing tasks.
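The "run the script, then an optional post-run script" pattern described for this pipeline might be sketched as follows. The file name `post-run.sh` and the function name are hypothetical; the source only states that a post-run script is executed when present.

```shell
# Sketch: run the script named in $1 (default: run-python.sh) inside
# directory $2 (default: current directory), then run a follow-up
# script if one exists. "post-run.sh" is an assumed name.
run_with_post() (
    cd "${2:-.}" || exit 1
    script=${1:-run-python.sh}
    sh "./$script" || exit 1
    if [ -f ./post-run.sh ]; then
        sh ./post-run.sh
    fi
)
```

A call such as `run_with_post "$Script"` would fall back to run-python.sh when the `Script` parameter is left empty.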
Docker Images¶
The pipeline uses several specialized Docker images:
| Image | Purpose | Used In |
|---|---|---|
| python:3.12 | Python analysis, downloads | Download, Python scanner |
| larsvilhuber/bitbucket-stata:latest | Stata scanning/execution | Stata scanner, PII scanner, execution |
| aeadataeditor/verification-r:latest | R dependency checking | R scanner |
| julia:latest | Julia dependency checking | Julia scanner |
| aldanial/cloc | Line counting | Line counter |
| dpokidov/imagemagick | Image conversion | Graphics conversion |
Artifact Management¶
Artifacts are files passed between pipeline steps:

    artifacts:
      - generated/**   # Analysis outputs
      - cache/**       # Downloaded deposits

Only specified steps preserve artifacts. This reduces storage and transfer time.
Caching¶
The pipeline uses Bitbucket’s caching for pip packages:

    caches:
      - pip

This speeds up subsequent runs by reusing downloaded Python packages.
Configuration Integration¶
All pipelines read from config.yml:

    . ./tools/parse_yaml.sh
    eval $(parse_yaml config.yml)

This allows parameters to be stored in the repository rather than entered manually.
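Since a value such as the openICPSR project ID can come either from a pipeline parameter or from config.yml, a script needs a precedence rule. The sketch below shows one way to express "parameter first, config second". The helper name and the variable `config_openicpsr` are hypothetical; the names that `parse_yaml` actually produces depend on the keys in config.yml.

```shell
# Sketch: print the first non-empty argument, or return non-zero if
# every argument is empty.
first_nonempty() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (variable names are assumptions):
#   openICPSRID=$(first_nonempty "$openICPSRID" "$config_openicpsr") || exit 1
```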
Conditional Processing¶
Scripts check environment variables to skip processing:

    [[ "$SkipProcessing" == "yes" ]] && exit 0
    [[ "$ProcessStata" == "no" ]] && exit 0

This provides fine-grained control over which analyses run.
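The two guards can be folded into a single helper that each scanner script calls with its own language name. This is a sketch under assumptions: the function name is hypothetical, and an unset `Process<Name>` variable is treated as "yes", whereas in the pipelines the defaults come from the parameter definitions.

```shell
# Sketch: succeed when processing for language $1 should run.
# $1 is a parameter suffix such as Stata, R, Python, Julia, or Pii.
# It must be a trusted literal, because it is expanded through eval.
should_process() (
    [ "${SkipProcessing:-no}" = "yes" ] && exit 1
    eval "flag=\${Process$1:-yes}"
    [ "$flag" != "no" ]
)
```

A scanner script could then start with `should_process Stata || exit 0`, which exits successfully without doing any work when scanning is disabled.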
Git Integration¶
Most pipelines end with:

    git status
    git push
    git push --tags   # Some pipelines

Some use [skip ci] in commit messages to prevent recursive pipeline triggers:

    git commit -m "[skip ci] Rename $oldName to $newName"

Resource Sizing¶
Bitbucket provides different resource tiers:
Default: 4GB RAM, 2 vCPU
2x: 8GB RAM, 4 vCPU (size: 2x)
8x: 32GB RAM, 16 vCPU (size: 8x)
Larger sizes consume more build minutes but reduce the risk of timeouts and out-of-memory failures on big deposits.
Parallel vs Sequential¶
Pipeline 1 (1-populate-from-icpsr):
Parallel processing of language scanners
Faster completion
More efficient use of build minutes
Requires artifact passing
Pipeline w (w-big-populate-from-icpsr):
Sequential processing
Simpler (no artifact coordination)
Better for large files
Higher resource allocation
YAML Anchors¶
The configuration uses YAML anchors for reusability:

    - step: &z-run-any-anchor
        name: Run R or Stata code
        script: [...]

Referenced later:

    - step:
        <<: *z-run-any-anchor
        name: Run Stata code
        size: 2x

Environment Variables¶
Available in pipelines:
$CI - Set in CI environment
$openICPSRID - From parameters or config
$ZenodoID - From parameters or config
$W_DOCKER_USERNAME - Docker Hub credentials (secured)
$W_DOCKER_PAT - Docker Hub PAT (secured)
Best Practices¶
Use Pipeline 1 for standard deposits
Use Pipeline w for deposits >1GB or with many files
Use Pipeline z-run-any-big for compute-intensive replications
Set ProcessX="no" for languages not in deposit (faster)
Use SkipProcessing="yes" to only download without analysis
Use [skip ci] commits to avoid recursive triggers
Check artifacts in Bitbucket UI if debugging
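The size-based choice between the two populate pipelines can be written down as a small rule. The sketch below is illustrative only: the function name is hypothetical, ">1GB" is read as more than 1 GiB, and the "many files" criterion is not modelled.

```shell
# Sketch: suggest a pipeline name from a deposit size given in bytes.
suggest_pipeline() {
    threshold=$((1024 * 1024 * 1024))   # 1 GiB; assumed reading of ">1GB"
    if [ "$1" -gt "$threshold" ]; then
        echo "w-big-populate-from-icpsr"
    else
        echo "1-populate-from-icpsr"
    fi
}
```

For a local archive, the size could be supplied with `wc -c < deposit.zip`, where `deposit.zip` stands in for whatever file was downloaded.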
Troubleshooting¶
| Issue | Solution |
|---|---|
| Timeout during download | Use w-big-populate-from-icpsr |
| Out of memory during analysis | Use w-big-populate-from-icpsr or z-run-any-big |
| Parallel steps fail | Check individual step logs in Bitbucket |
| Artifacts not found | Verify previous step completed successfully |
| Git push fails | Check repository permissions and credentials |