Bitbucket Pipelines Configuration

Purpose

This file defines the CI/CD pipelines for automated replication package analysis and verification. It orchestrates downloading deposits from various repositories, analyzing code and data, scanning for dependencies and PII, and generating comprehensive reports.

Custom Pipelines

The configuration defines several custom pipelines that can be manually triggered via Bitbucket’s web interface.

1. `1-populate-from-icpsr`

Purpose: Full analysis pipeline that runs the language scanners in parallel to shorten overall run time.

Parameters:

openICPSRID - openICPSR project ID (or read from config.yml)
jiraticket - JIRA ticket identifier
ZenodoID - Zenodo deposit ID (alternative to openICPSR)
ProcessStata - Enable/disable Stata scanning (default: yes)
ProcessR - Enable/disable R scanning (default: yes)
ProcessPython - Enable/disable Python scanning (default: yes)
ProcessJulia - Enable/disable Julia scanning (default: no)
ProcessPii - Enable/disable PII scanning (default: yes)
SkipProcessing - Skip all processing steps (default: no)

Pipeline Steps:

Step 1: Download

Image: python:3.12
Downloads deposit from openICPSR or Zenodo
Unpacks ZIP archives
Lists data and program files
Creates manifests with checksums
Checks for ZIP files, duplicates, zero-byte files
Validates file paths
Compares manifests
Artifacts: generated/**, cache/**
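The manifest and sanity checks listed for this step could look roughly like the sketch below. It is illustrative only: the function names and the sha256sum-based manifest format are assumptions, not the contents of the actual download scripts.

```shell
# Hypothetical helpers; the real pipeline scripts may work differently.

# Print "checksum  ./relative/path" for every file under DIR, sorted by path.
make_manifest() {
  (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

# Print the path of every zero-byte file under DIR.
list_zero_byte() {
  find "$1" -type f -size 0
}

# Succeed only if the two manifest files are identical.
compare_manifests() {
  diff -q "$1" "$2" >/dev/null
}
```

A step could write `make_manifest <deposit-dir>` to a file under generated/ before and after unpacking, then call `compare_manifests` on the two files to detect changes.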

Step 2: Check downloads

Verifies that deposit ZIP file exists in cache
Fails pipeline if download was unsuccessful
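A check of this kind can be sketched as follows. The function name and the assumed cache layout (`cache/<id>.zip`) are hypothetical; the real step may look for a different file name.

```shell
# Hypothetical check; the cache layout (<cache-dir>/<id>.zip) is an assumption.
check_download() {
  zipfile="$1/$2.zip"
  # -s is true only if the file exists and is not empty
  if [ ! -s "$zipfile" ]; then
    echo "ERROR: $zipfile is missing or empty" >&2
    return 1
  fi
  echo "Found $zipfile"
}
```

Called as `check_download cache "$openICPSRID" || exit 1`, a missing archive produces a non-zero exit code, which is what makes Bitbucket mark the step (and therefore the pipeline) as failed.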

Step 3: Parallel Processing

Runs the following scanners concurrently:

3a. Run Stata parser

Image: larsvilhuber/bitbucket-stata:latest
Scans Stata code for package dependencies
Generates candidatepackages.csv
Artifacts: generated/**

3b. Run Stata PII scanner

Image: larsvilhuber/bitbucket-stata:latest
Scans for Personally Identifiable Information
Generates PII reports
Artifacts: generated/**

3c. Run R parser

Image: aeadataeditor/verification-r:latest
Checks R dependencies and data files
Generates r-deps.csv and r-data-checks.csv

3d. Run Python parser

Image: python:3.12
Scans Python code for package dependencies
Generates python-deps.csv
Artifacts: generated/**

3e. Run Julia parser

Image: julia:latest
Scans Julia code for package dependencies
Artifacts: generated/**

3f. Count lines and comments

Image: aldanial/cloc
Counts lines of code by language
Generates code statistics
Artifacts: generated/**

Step 4: Add info to REPLICATION.md

Image: python:3.12
Consolidates all findings into report
Runs 24_amend_report.sh
Artifacts: generated/**

Step 5: Commit everything back

Image: python:3.12
Re-extracts deposit
Commits analyzed code to repository
Cleans up deposit directory
Replaces report sections
Updates config.yml
Pushes to Git with tags

Use Case: Standard replication package analysis.

2. `w-big-populate-from-icpsr`

Purpose: Single-step pipeline for large deposits requiring more resources.

Parameters:

openICPSRID - openICPSR project ID
ZenodoID - Zenodo deposit ID
jiraticket - JIRA ticket identifier

Pipeline Steps:

Step: Download and commit

Image: python:3.12
Size: 2x (double resources)
Installs cloc for line counting
Downloads and analyzes deposit sequentially
All processing in single step (no parallelization)
Commits and pushes results

Key Differences from Pipeline 1:

All processing is sequential (slower but more reliable)
Double resources (2x memory and CPU)
No artifact passing between steps
Includes cloc installation

Use Case: Large deposits that may time out or fail with standard resources.

3. `z-run-stata`

Purpose: Execute Stata replication code.

Parameters:

openICPSRID - Project ID
MainFile - Main script to execute (default: main.do)
RunCommand - Command to run (default: run.sh)

Pipeline Steps:

Step: Run Stata code

Image: larsvilhuber/bitbucket-stata:latest
Size: 2x
Downloads deposit
Executes replication code
Commits outputs
Pushes results

Use Case: Running actual Stata replication code (not just analysis).

4. `z-run-any-big`

Purpose: Execute replication code with maximum resources.

Parameters:

openICPSRID - Project ID
MainFile - Main script to execute
RunCommand - Command to run (default: run.sh)

Pipeline Steps:

Step: Run R or Stata code

Image: larsvilhuber/bitbucket-stata:latest
Size: 8x (8x resources - maximum available)
Downloads deposit
Executes replication code
Commits outputs
Pushes results

Use Case: Large, resource-intensive replications requiring maximum compute.

5. `2-merge-report`

Purpose: Combine Part A and Part B of a split report.

Parameters:

jiraticket - JIRA ticket identifier

Pipeline Steps:

Runs 50_merge-parts.sh
Pushes merged report

Use Case: After separate completion of report sections.

6. `3-split-report`

Purpose: Split REPLICATION.md into Part A and Part B.

Parameters:

jiraticket - JIRA ticket identifier

Pipeline Steps:

Runs 51_split-parts.sh
Pushes split reports

Use Case: When report needs to be worked on in sections.

7. `4-refresh-tools`

Purpose: Update pipeline tools from master template.

Pipeline Steps:

Downloads latest update_tools.sh
Executes update
Pushes updated tools

Use Case: Keeping tools synchronized with template repository.

8. `5-rename-directory`

Purpose: Rename a deposit directory in the repository.

Parameters:

oldName - Current directory name
newName - New directory name
jiraticket - JIRA ticket identifier

Pipeline Steps:

Runs git mv $oldName $newName
Commits with [skip ci] to avoid triggering pipelines
Pushes changes

Use Case: Correcting directory names or reorganizing deposits.

9. `6-convert-eps-pdf`

Purpose: Convert EPS and PDF graphics to PNG format.

Parameters:

path - Directory containing graphics
jiraticket - JIRA ticket identifier
ProcessEPS - Convert EPS files (default: yes)
ProcessPDF - Convert PDF files (default: no)
DockerImg - Docker image to use (default: dpokidov/imagemagick)

Pipeline Steps:

Uses Docker-in-Docker (services: docker)
Mounts current directory into container
Runs 52_convert_eps_pdf.sh
Commits converted files
Pushes with [skip ci]

Use Case: Converting graphics for better diff visualization or compatibility.
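How the ProcessEPS and ProcessPDF parameters might select files is sketched below. The helper name is hypothetical and this is not the logic of 52_convert_eps_pdf.sh itself; the conversion, which runs inside the Docker image, is not shown.

```shell
# Hypothetical helper: list the graphics files under DIR that would be
# converted, using the documented defaults (EPS: yes, PDF: no).
list_graphics() {
  dir=$1
  if [ "${ProcessEPS:-yes}" = "yes" ]; then
    find "$dir" -type f -name '*.eps'
  fi
  if [ "${ProcessPDF:-no}" = "yes" ]; then
    find "$dir" -type f -name '*.pdf'
  fi
}
```

With neither variable set, `list_graphics "$path"` prints only the EPS files, matching the defaults in the parameter list.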

10. `7-download-box-manifest`

Purpose: Download restricted data from Box and generate manifest files.

Parameters:

jiraticket - JIRA ticket identifier

Pipeline Steps:

Step: Download Box and create manifests

Image: python:3.12
Caches: pip packages
Installs Python requirements
Runs download_box_private.py to download restricted data from Box
Runs 04_create_manifest.sh restricted twice to generate checksums
Force-adds all files in generated/ directory
Commits with [skip ci] to avoid triggering pipelines
Pushes changes

Use Case: Downloading and documenting restricted data stored on Box for replication packages that include confidential data.

Note: Requires Box API credentials to be configured in the environment.

11. `x-run-python`

Purpose: Execute custom Python scripts.

Parameters:

Script - Script to run (default: run-python.sh)

Pipeline Steps:

Image: python:3.11
Executes specified script
Runs post-run script if present
Pushes results

Use Case: Custom Python processing tasks.

12. `s-sync-issue-fields`

Purpose: Sync Jira fields from an original issue to its associated revision issue.

Parameters:

jiraticket - Jira ticket key for the revision issue (falls back to jira from config.yml if not provided)

Pipeline Steps:

Image: python:3.12
Caches: pip packages
Installs Python requirements
Reads jiraticket from the pipeline parameter or config.yml
Runs jira_sync_fields.py <jiraticket> --yes --comment to copy empty fields from the original issue to the revision issue and post a comment listing all synced fields

Use Case: When a revision case is created, many metadata fields (DOI, openICPSR URL, manuscript ID, etc.) need to be carried over from the original Jira issue. This pipeline automates that transfer, filling only fields that are blank on the revision issue and leaving already-populated fields untouched.

Requirements: JIRA_USERNAME and JIRA_API_KEY environment variables must be configured in the Bitbucket repository settings. The revision issue must have an "is a revision of" link pointing to the original issue.
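A step can fail fast when these variables are missing instead of failing later inside the Python script. The guard below is a minimal sketch; `require_env` is a hypothetical helper, not part of the existing tools.

```shell
# Hypothetical guard: report every required variable that is unset or empty.
# Variable names are passed as arguments and must be trusted constants,
# because eval is used for the indirect lookup.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "ERROR: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_env JIRA_USERNAME JIRA_API_KEY || exit 1` at the top of the step script stops the pipeline with a clear message.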

See Also: jira_sync_fields.py

Docker Images

The pipeline uses several specialized Docker images:

Image	Purpose	Used In
`python:3.12`	Python analysis, downloads	Download, Python scanner
`larsvilhuber/bitbucket-stata:latest`	Stata scanning/execution	Stata scanner, PII scanner, execution
`aeadataeditor/verification-r:latest`	R dependency checking	R scanner
`julia:latest`	Julia dependency checking	Julia scanner
`aldanial/cloc`	Line counting	Line counter
`dpokidov/imagemagick`	Image conversion	Graphics conversion

Artifact Management

Artifacts are files passed between pipeline steps:

artifacts:
  - generated/**   # Analysis outputs
  - cache/**       # Downloaded deposits

Only steps that declare artifacts pass files on to later steps, which limits storage use and transfer time.

Caching

The pipeline uses Bitbucket’s caching for pip packages:

caches:
  - pip

This speeds up subsequent runs by reusing downloaded Python packages.

Configuration Integration

All pipelines read from config.yml:

# Load the parse_yaml function, then turn config.yml keys into shell variables
. ./tools/parse_yaml.sh
eval "$(parse_yaml config.yml)"

This allows parameters to be stored in the repository rather than entered manually.
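To make the mechanism concrete, here is a deliberately simplified stand-in for parse_yaml. It is not the script in tools/: it assumes config.yml contains only top-level `key: value` lines with unquoted scalar values, and `parse_flat_yaml` is a hypothetical name.

```shell
# Simplified stand-in for parse_yaml (assumption: flat "key: value" lines only).
# Prints one shell assignment per entry, e.g. openICPSRID="123456".
# Comments, indented (nested) lines, and keys without a value are ignored.
parse_flat_yaml() {
  sed -n \
    -e 's/[[:space:]]*$//' \
    -e 's/^\([A-Za-z_][A-Za-z0-9_]*\):[[:space:]]*\(..*\)$/\1="\2"/p' \
    "$1"
}
```

Running `eval "$(parse_flat_yaml config.yml)"` then defines variables such as `$openICPSRID` and `$jira` in the current shell. Because eval executes whatever the file produces, this pattern is only appropriate for a config file the repository owners control.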

Conditional Processing

Scripts check environment variables to skip processing:

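# Skip every scanner when SkipProcessing is set
[[ "$SkipProcessing" == "yes" ]] && exit 0
# Skip this scanner when its language flag is turned off
[[ "$ProcessStata" == "no" ]] && exit 0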

This provides fine-grained control over which analyses run.
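The same decision can be wrapped in a small function that also applies each flag's documented default. This is a sketch under assumptions: `should_run` is a hypothetical helper, and the actual scripts may handle unset flags differently.

```shell
# Hypothetical helper: decide whether a scanner should run.
# $1 = current value of the scanner's flag (may be empty)
# $2 = the flag's documented default ("yes" or "no")
should_run() {
  if [ "${SkipProcessing:-no}" = "yes" ]; then
    return 1
  fi
  [ "${1:-$2}" = "yes" ]
}
```

A Julia scanner script could start with `should_run "${ProcessJulia:-}" no || exit 0`, so it runs only when ProcessJulia is explicitly set to yes and SkipProcessing is not.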

Git Integration

Most pipelines end with:

git status
git push
git push --tags  # Some pipelines

Some use [skip ci] in commit messages to prevent recursive pipeline triggers:

git commit -m "[skip ci] Rename $oldName to $newName"

Resource Sizing

Bitbucket provides different resource tiers:

Default: 4GB RAM, 2 vCPU
2x: 8GB RAM, 4 vCPU (size: 2x)
8x: 32GB RAM, 16 vCPU (size: 8x)

Larger sizes consume build minutes faster but reduce the risk of timeouts and out-of-memory failures on big deposits.
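As a rough illustration of the cost trade-off, the sketch below estimates the build minutes charged for a step. It assumes that charged minutes scale linearly with the size multiplier; check the current Bitbucket billing documentation before relying on the numbers. `billed_minutes` is a hypothetical helper.

```shell
# Estimate charged build minutes, assuming cost = size multiplier x wall-clock minutes.
# $1 = step size such as 1x, 2x, or 8x; $2 = wall-clock minutes the step ran.
billed_minutes() {
  echo $(( ${1%x} * $2 ))
}
```

Under that assumption, `billed_minutes 8x 10` prints 80, so a ten-minute run at 8x costs as much as an eighty-minute run at the default size.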

Parallel vs Sequential

Pipeline 1 (1-populate-from-icpsr):

Parallel processing of language scanners
Faster completion
More efficient use of build minutes
Requires artifact passing

Pipeline w (w-big-populate-from-icpsr):

Sequential processing
Simpler (no artifact coordination)
Better for large files
Higher resource allocation

YAML Anchors

The configuration uses YAML anchors for reusability:

- step: &z-run-any-anchor
    name: Run R or Stata code
    script: [...]

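Referenced later with the YAML merge key (<<), which copies the anchored step; keys listed alongside it, such as name and size, override the anchored values: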

- step:
    <<: *z-run-any-anchor
    name: Run Stata code
    size: 2x

Environment Variables

Available in pipelines:

$CI - Set to true by Bitbucket Pipelines
$openICPSRID - From parameters or config
$ZenodoID - From parameters or config
$W_DOCKER_USERNAME - Docker Hub credentials (secured)
$W_DOCKER_PAT - Docker Hub PAT (secured)

Best Practices

Use Pipeline 1 for standard deposits
Use Pipeline w for deposits larger than 1 GB or with many files
Use Pipeline z-run-any-big for compute-intensive replications
Set ProcessX="no" for languages not in the deposit (faster)
Use SkipProcessing="yes" to download only, without analysis
Use [skip ci] commits to avoid recursive triggers
Check artifacts in Bitbucket UI if debugging

Troubleshooting

Issue	Solution
Timeout during download	Use `w-big-populate-from-icpsr`
Out of memory during analysis	Use `w-big-populate-from-icpsr` or `z-run-any-big`
Parallel steps fail	Check individual step logs in Bitbucket
Artifacts not found	Verify previous step completed successfully
Git push fails	Check repository permissions and credentials
