How to Mirror WordPress SVN Repositories

Creating a mirror of the SVN (Subversion) repositories that host the source code of WordPress core, plugins, and themes is the first step to replicating the install and update APIs used by WordPress itself.

There are several SVN repositories that need to be mirrored:

  • https://core.svn.wordpress.org (WordPress core)
  • https://plugins.svn.wordpress.org (plugins)
  • https://themes.svn.wordpress.org (themes)

After cloning the repositories, you can keep them in sync with the origin using svnsync, which periodically pulls in just the latest changes.

Download SVN Dumps

I’ve gone through the steps outlined in this guide, created SVN dumps of all three repositories, and published them on the Internet Archive, so you can skip straight to the local import and sync.

Be sure to use a download manager that can resume interrupted downloads, such as wget:

wget --continue https://archive.org/download/wp-org-svn-themes-dump/wp-org-svn-themes-dump.gz

or curl:

curl --location --remote-name --continue-at - https://archive.org/download/wp-org-svn-themes-dump/wp-org-svn-themes-dump.gz

Extract the Dumps

All dumps are compressed with gzip, and the themes and plugins dumps are additionally packed as tar bundles since they contain multiple dumpstream files.
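Before extracting, it’s worth verifying that the large downloads completed intact; gzip can test archive integrity without writing any decompressed output:

```shell
# Test integrity without decompressing to disk;
# gzip exits non-zero if the file is truncated or corrupt
gzip --test wp-org-core-svn-dump.gz && echo "archive OK"
```

Run the same check on the themes and plugins archives before spending time on extraction.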

Use the following commands to extract the single core dump (remove the --keep flag if you don’t want to preserve the compressed original):

gunzip --keep wp-org-core-svn-dump.gz

and the following commands for the tar.gz archives of the theme and plugin dumps:

tar --extract --gunzip --file wp-org-svn-themes-dump.gz
tar --extract --gunzip --file wp-org-svn-plugins-dumps.gz

Below are steps for how to recreate the above dumps from the origin SVN repositories.


How about a Checkout?

The first idea might be to use svn checkout ... for each of the repositories, but that doesn’t work for the following reasons:

  1. Network errors and disconnects leave the checkout in a broken state and require svn cleanup before it can resume, and the cleanup itself often fails.
  2. A checkout allows only a single process and can’t be parallelized, since concurrent processes would contend for locks on the same working-copy files.

Alternatives to Checkout

Ideally, the process would:

  1. download the whole SVN revision history as a single file,
  2. or allow specifying a range of revisions to enable parallel downloads.

SVN mirror workflow using svnrdump and svnadmin

svnadmin dump is a tool that creates a dump stream for a specific range of revisions. Since it only works with local repositories, there is also svnrdump dump, which supports remote repositories.

So the final workflow is this:

  1. Create a file with ranges of revisions like XXX:YYY to download in each process.
  2. Use parallel to call svnrdump dump --incremental --revision {} > repo-{}.dump where {} is replaced with one of the revision ranges.
  3. Import the individual dumps into a new local repository using svnadmin load --file repo-XXX:YYY.dump repo-directory.

Run this in a screen session so the processes keep running even if you log out of the machine.

Requirements

Use Homebrew to install the required tooling:

brew install subversion parallel pv

while tools like bash, grep, cut and gzip are already included in macOS by default.

Step 1: Define Revision Ranges

The examples below are for the WordPress core SVN repository. Replace https://core.svn.wordpress.org with other repository URLs as needed.

Save this bash script as rev-ranges.sh:

#!/usr/bin/env bash

if [ $# -ne 1 ]; then
	echo "Usage: $0 <repository-url>"
	exit 1
fi

# Set the number of revisions included in a single dump.
REV_STEP=10000

LATEST_REV=$(svn info "$1" | grep "Revision:" | cut -c11-)

for (( rev_start = 0; rev_start < $LATEST_REV; rev_start += $REV_STEP )); do
	if (( $rev_start + $REV_STEP < $LATEST_REV )); then
		echo "$rev_start:$(($rev_start + $REV_STEP - 1))"
	else
		echo "$rev_start:$LATEST_REV"
	fi
done
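The cut -c11- in the script strips the 10-character "Revision: " prefix from the svn info output, leaving just the revision number. A quick check:

```shell
# "Revision: " is 10 characters, so columns 11 onward hold the number
echo "Revision: 58547" | cut -c11-
# → 58547
```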

Customise the REV_STEP as necessary, and mark it as executable:

chmod +x rev-ranges.sh

Finally, run it to generate core-revs.txt with a revision range per line:

./rev-ranges.sh https://core.svn.wordpress.org > core-revs.txt

which produces the following contents:

0:9999
10000:19999
20000:29999
30000:39999
40000:49999
50000:58547

Step 2: Dump Revision Ranges

Create a new directory to store the revision dumps:

mkdir core-dumps

Then, to start parallel downloads of the revision range dumps, pass the contents of core-revs.txt to parallel using cat along with the command above:

cat core-revs.txt | parallel "svnrdump dump --revision {} --incremental https://core.svn.wordpress.org > core-dumps/core-{}.dump"

where:

  • --revision {} specifies the revision range piped from the file,
  • --incremental makes each dump contain only the changes within its range, so the dumps can later be loaded one after another.

This starts one svnrdump dump process per CPU core. Pass a -j NN flag to parallel to specify a custom job count.

Note that there is no terminal output while the commands are running, as all of the stdout is sent to the dump files. Use watch "ls -lh core-dumps" in another window to monitor the size of the individual dumps:

Every 2.0s: ls -lh core-dumps

total 734M
-rw-r--r-- 1 root root  28M Sep 28 15:24 core-0:9999.dump
-rw-r--r-- 1 root root  34M Sep 28 15:24 core-10000:19999.dump
-rw-r--r-- 1 root root  67M Sep 28 15:24 core-20000:29999.dump
-rw-r--r-- 1 root root  45M Sep 28 15:24 core-30000:39999.dump
-rw-r--r-- 1 root root 284M Sep 28 15:24 core-40000:49999.dump
-rw-r--r-- 1 root root 279M Sep 29 05:29 core-50000:58547.dump

For reference, here is the source code of svnrdump: the dump_cmd function invokes replay_revisions, which in turn calls svn_ra_replay_range.

Each dump file should be anywhere from 20MB to 1.5GB depending on the repository. The combined size of all dumps for WP core is 730MB.

To save disk space, you can compress the dump stream with gzip before writing it to a file:

cat core-revs.txt | parallel "svnrdump dump --revision {} --incremental https://core.svn.wordpress.org | gzip > core-dumps/core-{}.dump.gz"

Remember to decompress the files when importing!
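For example, to decompress everything in place before running the import:

```shell
# Decompress all gzipped dumps in place; each .gz file is replaced
# by a plain .dump file that svnadmin load can read
gunzip core-dumps/*.dump.gz
```

Alternatively, you can stream without decompressing first by piping gunzip --stdout core-dumps/core-XXX:YYY.dump.gz into svnadmin load, trading some import speed for disk space.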

Step 3: Import Dumps Locally

Unfortunately, the import process can’t be parallelized because revisions refer to previous revisions, which must already exist in the SVN database before the later ones can be inserted.

Therefore, we must ensure that svnadmin load is called sequentially with dump ranges from the lowest revisions to the highest. We use sort --version-sort to list the dump file names in natural order.
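The distinction matters because plain lexicographic sort would place core-10000:… before core-2000:…, while --version-sort compares the embedded numbers numerically (illustrative file names, not the actual ranges):

```shell
# Version sort orders the embedded revision numbers numerically
printf 'core-10000:19999.dump\ncore-0:9999.dump\ncore-2000:2999.dump\n' \
	| sort --version-sort
# → core-0:9999.dump
# → core-2000:2999.dump
# → core-10000:19999.dump
```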

Save this bash script as load-dumps.sh (adjust this if using gzipped dumps):

#!/usr/bin/env bash

if [ $# -ne 1 ]; then
	echo "Usage: $0 <dumps-source-dir>"
	exit 1
fi

DUMPS_DIR="$1"
SVN_DIR="$DUMPS_DIR-svn"

if [ -d "$SVN_DIR" ]; then
	echo "SVN directory $SVN_DIR already exists. Not sure how to merge dumps."
	exit 1
fi

svnadmin create "$SVN_DIR"

for dumpfile in $(ls "$DUMPS_DIR" | sort --version-sort); do
	pv "$DUMPS_DIR/$dumpfile" | svnadmin load --quiet --no-flush-to-disk --bypass-prop-validation --force-uuid --memory-cache-size 2048 "$SVN_DIR"
done

where you must tune the --memory-cache-size argument (in megabytes) based on the available RAM.

Make it executable:

chmod +x load-dumps.sh

and run it by specifying the source directory containing the dump files as the first argument:

./load-dumps.sh core-dumps

which produces the following output for each file:

Loading core-0:9999.dump:
3.12MiB 0:00:05 [ 422KiB/s] [>  ]  28% ETA 0:00:07

If one of the imports fails, you can restart that specific import manually, passing --revision AAA:BBB with the remaining revision range. You might also need to run svnadmin recover repo-directory if the repository was left in a corrupt state.
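To find where to resume, you can ask the local repository for the highest revision it already contains. A sketch, assuming the import died somewhere inside the 30000:39999 dump (adjust the file name and numbers to your case):

```shell
# Print the highest revision already loaded into the local repository
svnlook youngest core-dumps-svn

# Resume the failed dump starting at the next revision;
# svnadmin load accepts --revision to restrict the loaded range
pv core-dumps/core-30000:39999.dump | svnadmin load --revision 37216:39999 core-dumps-svn
```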

Import Performance

The best load performance I’ve seen is 20MB/s on average which leads to the following import times for each repository:

Repository          Dump Size (Uncompressed)   Repository Size   Import Time
WordPress Core      779 MB                     935 MB            5 minutes
WordPress Themes    353 GB                     TBD               TBD
WordPress Plugins   1258 GB                    TBD               TBD

Syncing with Origin

After creating the repositories locally, use svnsync to keep them in sync with the remotes. It is important to note that this tool works with repository URLs, so we must specify the local repository as file:///Users/yourname/wp-svn/core-dumps-svn. Run pwd to print the full path to the current directory and append it to file://.
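One way to build that URL directly from the shell (assuming the repository directory sits in the current directory):

```shell
# Build the file:// URL for the local repository from the current directory;
# file:// plus the absolute path yields file:///... with three slashes
REPO_URL="file://$(pwd)/core-dumps-svn"
echo "$REPO_URL"
```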

First, set the contents of hooks/pre-revprop-change in the repository directory to do nothing:

#!/bin/sh
exit 0

and make it executable:

chmod +x core-dumps-svn/hooks/pre-revprop-change

Then associate the local repository with a remote origin:

svnsync init --allow-non-empty file:///Users/yourname/wp-svn/core-dumps-svn https://core.svn.wordpress.org

which returns:

Copied properties for revision 58547.

And finally, run the actual sync:

svnsync sync file:///Users/yourname/wp-svn/core-dumps-svn

You can set up a cronjob to run this at regular intervals.
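For example, a crontab entry that syncs every hour (the svnsync path and log location here are assumptions; check yours with which svnsync):

```shell
# crontab -e entry: run svnsync at minute 0 of every hour,
# appending output to a log file for troubleshooting
0 * * * * /opt/homebrew/bin/svnsync sync file:///Users/yourname/wp-svn/core-dumps-svn >> /tmp/svnsync-core.log 2>&1
```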

Checkout Working Copy

The repository we’ve created only contains the revision history and the associated metadata. To get the actual files as a working copy, we run:

svn checkout file:///Users/yourname/wp-svn/core-dumps-svn core-svn-checkout

where the last argument is the directory path for the working copy. If you skip it, svn uses the last component of the URL as the directory name, which here would place the working files inside the SVN repository directory itself.
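For the huge themes and plugins repositories, a full working copy may be impractical. SVN supports sparse working copies via the --depth flag; a sketch (the core URL and directory names are illustrative):

```shell
# Check out only the top-level entries instead of the full tree
svn checkout --depth immediates file:///Users/yourname/wp-svn/core-dumps-svn core-svn-shallow

# Later, pull down a specific subtree on demand
svn update --set-depth infinity core-svn-shallow/trunk
```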

