
File: bitbucket-pipelines.yml

Purpose

This file defines the CI/CD pipelines for automated replication package analysis and verification. It orchestrates downloading deposits from various repositories, analyzing code and data, scanning for dependencies and PII, and generating comprehensive reports.

Custom Pipelines

The configuration defines several custom pipelines that can be manually triggered via Bitbucket’s web interface.

1. 1-populate-from-icpsr

Purpose: Full analysis pipeline with parallel processing for optimal performance.

Parameters:

Pipeline Steps:

Step 1: Download

Step 2: Check downloads

Step 3: Parallel Processing

Runs multiple scanners concurrently for maximum efficiency:

3a. Run Stata parser

3b. Run Stata PII scanner

3c. Run R parser

3d. Run Python parser

3e. Run Julia parser

3f. Count lines and comments

Step 4: Add info to REPLICATION.md

Step 5: Commit everything back

Use Case: Standard replication package analysis with optimal performance.
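The overall shape of this pipeline can be sketched in Bitbucket's YAML syntax. This is an illustrative outline only: the step names mirror the list above, but the script contents and tool paths shown here are hypothetical, not the actual file contents.

```yaml
pipelines:
  custom:
    1-populate-from-icpsr:
      - step:
          name: Download
          image: python:3.12
          script:
            - python tools/download_deposit.py   # hypothetical script name
          artifacts:
            - cache/**
      - parallel:                                # scanners run concurrently
          - step:
              name: Run Stata parser
              image: larsvilhuber/bitbucket-stata:latest
              script:
                - bash tools/scan_stata.sh       # hypothetical
          - step:
              name: Count lines and comments
              image: aldanial/cloc
              script:
                - cloc --report-file=generated/cloc.txt .
      - step:
          name: Commit everything back
          script:
            - git add generated && git commit -m "[skip ci] Update analysis" && git push
```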


2. w-big-populate-from-icpsr

Purpose: Single-step pipeline for large deposits requiring more resources.

Parameters:

Pipeline Steps:

Step: Download and commit

Key Differences from Pipeline 1:

Use Case: Large deposits that may time out or fail with standard resources.


3. z-run-stata

Purpose: Execute Stata replication code.

Parameters:

Pipeline Steps:

Step: Run Stata code

Use Case: Running actual Stata replication code (not just analysis).


4. z-run-any-big

Purpose: Execute replication code with maximum resources.

Parameters:

Pipeline Steps:

Step: Run R or Stata code

Use Case: Large, resource-intensive replications requiring maximum compute.


5. 2-merge-report

Purpose: Combine Part A and Part B of a split report.

Parameters:

Pipeline Steps:

Use Case: Recombining a report after its sections have been completed separately.


6. 3-split-report

Purpose: Split REPLICATION.md into Part A and Part B.

Parameters:

Pipeline Steps:

Use Case: When a report needs to be worked on in sections.


7. 4-refresh-tools

Purpose: Update pipeline tools from master template.

Pipeline Steps:

Use Case: Keeping tools synchronized with the template repository.


8. 5-rename-directory

Purpose: Rename a deposit directory in the repository.

Parameters:

Pipeline Steps:

Use Case: Correcting directory names or reorganizing deposits.


9. 6-convert-eps-pdf

Purpose: Convert EPS and PDF graphics to PNG format.

Parameters:

Pipeline Steps:

Use Case: Converting graphics for better diff visualization or compatibility.


10. 7-download-box-manifest

Purpose: Download restricted data from Box and generate manifest files.

Parameters:

Pipeline Steps:

Step: Download Box and create manifests

Use Case: Downloading and documenting restricted data stored on Box for replication packages that include confidential data.

Note: Requires Box API credentials to be configured in the environment.


11. x-run-python

Purpose: Execute custom Python scripts.

Parameters:

Pipeline Steps:

Use Case: Custom Python processing tasks.


Docker Images

The pipeline uses several specialized Docker images:

Image                                 Purpose                     Used In
python:3.12                           Python analysis, downloads  Download, Python scanner
larsvilhuber/bitbucket-stata:latest   Stata scanning/execution    Stata scanner, PII scanner, execution
aeadataeditor/verification-r:latest   R dependency checking       R scanner
julia:latest                          Julia dependency checking   Julia scanner
aldanial/cloc                         Line counting               Line counter
dpokidov/imagemagick                  Image conversion            Graphics conversion

Artifact Management

Artifacts are files passed between pipeline steps:

artifacts:
  - generated/**   # Analysis outputs
  - cache/**       # Downloaded deposits

Only steps that explicitly declare artifacts preserve them for later steps. This reduces storage and transfer time.
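As a sketch (step names follow the pipeline above; script contents are hypothetical), a producing step declares artifacts and a later step finds those paths restored into its workspace:

```yaml
- step:
    name: Run Python parser
    script:
      - python tools/python_parser.py > generated/python-report.txt  # hypothetical
    artifacts:
      - generated/**
- step:
    name: Add info to REPLICATION.md
    script:
      # generated/** from the previous step is available in this step's clone
      - cat generated/python-report.txt >> REPLICATION.md
```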

Caching

The pipeline uses Bitbucket’s caching for pip packages:

caches:
  - pip

This speeds up subsequent runs by reusing downloaded Python packages.
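In context, a step opts into the pip cache like this (a sketch; the actual steps and scripts differ):

```yaml
- step:
    name: Download
    image: python:3.12
    caches:
      - pip                # reuses pip's download cache across pipeline runs
    script:
      - pip install -r requirements.txt
```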

Configuration Integration

All pipelines read from config.yml:

. ./tools/parse_yaml.sh
eval $(parse_yaml config.yml)

This allows parameters to be stored in the repository rather than entered manually.
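To illustrate the idea, here is a minimal stand-in for `parse_yaml` that handles only flat `key: value` pairs (the real `tools/parse_yaml.sh` also handles nesting); the config keys shown are hypothetical, not the actual contents of `config.yml`:

```shell
#!/bin/sh
# Hypothetical flat config file.
cat > config.yml <<'EOF'
ProjectID: aearep-1234
ProcessStata: yes
EOF

# Minimal stand-in for parse_yaml: emit key="value" shell assignments.
parse_yaml() {
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\): *\(.*\)$/\1="\2"/p' "$1"
}

# Evaluate the assignments so the YAML keys become shell variables.
eval "$(parse_yaml config.yml)"
echo "$ProjectID"
echo "$ProcessStata"
```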

Conditional Processing

Scripts check environment variables to skip processing:

[[ "$SkipProcessing" == "yes" ]] && exit 0
[[ "$ProcessStata" == "no" ]] && exit 0

This provides fine-grained control over which analyses run.
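A complete (hypothetical) scanner entry point using this guard pattern; in the real scripts, `ProcessStata` would come from the `config.yml` evaluation above:

```shell
#!/bin/sh
# Guard pattern: exit early when the flag disables this analysis.
# Default to "yes" when the variable is unset.
ProcessStata="${ProcessStata:-yes}"

if [ "$ProcessStata" = "no" ]; then
  echo "Stata processing disabled; skipping."
  exit 0
fi

echo "Running Stata scanner..."
```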

Git Integration

Most pipelines end with:

git status
git push
git push --tags  # Some pipelines

Some use [skip ci] in commit messages to prevent recursive pipeline triggers:

git commit -m "[skip ci] Rename $oldName to $newName"

Resource Sizing

Bitbucket provides different resource tiers:

Larger sizes consume build minutes at a higher rate but prevent timeouts on big deposits.
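Sizing is set per step with the size keyword (2x roughly doubles the default 1x memory allocation); a sketch, with an illustrative script:

```yaml
- step:
    name: Download and commit
    size: 2x                                  # more memory than the default 1x
    script:
      - echo "long-running download here"     # placeholder
```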

Parallel vs Sequential

Pipeline 1 (1-populate-from-icpsr):

Pipeline w (w-big-populate-from-icpsr):
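The structural difference between the two, sketched in YAML (contents illustrative):

```yaml
# Pipeline 1: after the download step, the scanners fan out concurrently.
- parallel:
    - step:
        name: Run Stata parser
        script:
          - echo "scan Stata"        # placeholder
    - step:
        name: Run R parser
        script:
          - echo "scan R"            # placeholder

# Pipeline w: one sequential step does everything, with more resources.
- step:
    name: Download and commit
    size: 2x
    script:
      - echo "download, scan, and commit in a single step"   # placeholder
```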

YAML Anchors

The configuration uses YAML anchors for reusability:

- step: &z-run-any-anchor
    name: Run R or Stata code
    script: [...]

Referenced later:

- step:
    <<: *z-run-any-anchor
    name: Run Stata code
    size: 2x

Environment Variables

Available in pipelines:

Best Practices

  1. Use Pipeline 1 for standard deposits

  2. Use Pipeline w for deposits >1GB or with many files

  3. Use Pipeline z-run-any-big for compute-intensive replications

  4. Set ProcessX="no" for languages not present in the deposit (faster)

  5. Use SkipProcessing="yes" to download only, without running analysis

  6. Use [skip ci] commits to avoid recursive triggers

  7. Check artifacts in Bitbucket UI if debugging

Troubleshooting

Issue                            Solution
Timeout during download          Use w-big-populate-from-icpsr
Out of memory during analysis    Use w-big-populate-from-icpsr or z-run-any-big
Parallel steps fail              Check individual step logs in Bitbucket
Artifacts not found              Verify previous step completed successfully
Git push fails                   Check repository permissions and credentials