In this blog post, I will show how to completely remove large files from a GIT repository and its GIT history. I have found many examples on the net, but it was still a challenge to reduce the size of the GIT repository.

This has been tested on a real-world repository with almost 8000 commits and 120 branches. The goals and constraints were:

  • to keep the full (rewritten) history
  • to be able to keep any branch you want
  • not to use external software like BFG, which was forbidden due to security restrictions. However, I have tested BFG privately in a separate GIT repo and can recommend it; that repo provides a simple example of how BFG can be used.

In our case, the GIT history had grown to 223 MB because a binary of ~40 MB had been updated several times:

du -h -d 1
223M    ./.git
...
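
To find out which files are responsible for the size before rewriting anything, it helps to list the largest blobs in the whole history. The following is a minimal sketch of one way to do that; the function name largest_blobs is made up for this illustration and is not a Git command.

```shell
# List the N largest blobs reachable from any ref (default: 10), largest first.
# Output columns: size in bytes, object id, path.
# "largest_blobs" is a hypothetical helper name, not a Git command.
largest_blobs() (
    count="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        sed -n 's/^blob //p' |
        sort -n -r |
        head -n "$count"
)
```

Called as largest_blobs 5 inside the cloned repo, it prints one line per blob, so a binary that was committed several times shows up once per version.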

Step 0: Inform your Developers

First of all, you need to inform everybody what you are going to do and when it will be done. The changes are destructive, and even if you fork the repository, there is no easy way to revert all of them. For example, existing pull requests will be gone.

Step 1: Disallow any changes to the Repo

Hosting services like BitBucket allow fine-grained control over who is allowed to do what. I recommend that you disallow any changes to the repo apart from the ones you are performing yourself.

Step 2: Move and Fork the Repo

As a kind of backup, I recommend moving the original repository and working on a forked copy of it.
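
In addition to the fork, a local mirror clone can serve as a further safety net. This is only a sketch under the assumption that you have enough local disk space; the function name backup_mirror and the target path in the usage note are placeholders of mine. A mirror clone contains all branches and tags, but not hosting-side data such as pull requests.

```shell
# Create a bare mirror clone as an additional local backup.
# $1: URL or path of the repository, $2: target directory (both placeholders).
# "backup_mirror" is a hypothetical helper name, not a Git command.
backup_mirror() (
    git clone --quiet --mirror "$1" "$2"
)
```

A call could look like backup_mirror <URL-of-your-large-repo> ../large-repo-backup.git.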

Step 3 (optional): Replace remote Branches by Tags

In our case, we had about 120 branches. On the one hand, cleaning the repo of binaries was a good opportunity to substantially reduce the number of branches. On the other hand, I wanted to give everybody the possibility to re-create any branch in question. For that, I have written a little script that replaces branches by tags. In a later step, I sent instructions to the developers on how to use the tags to restore their branches.

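#!/bin/sh

# Start an SSH agent and load the key (only relevant if your remote uses SSH)
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa

# Reading packed-refs might not always work, because the refs are not necessarily packed:
#grep -R origin .git/packed-refs | awk '{print $2}' | sed 's/refs\/remotes\/origin\///' | while read BRANCH
# Using git branch -a instead; the symbolic "origin/HEAD -> ..." entry is skipped:
git branch -a | grep 'remotes/origin/' | grep -v ' -> ' | sed 's/remotes\/origin\///' | while read -r BRANCH
   do
      # Check out the branch locally, then tag its head as branch/<name>
      echo git checkout "$BRANCH"
      git checkout "$BRANCH"
      echo git tag "branch/$BRANCH" "refs/heads/$BRANCH"
      git tag "branch/$BRANCH" "refs/heads/$BRANCH"
   done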

With that, all branches can be re-created from the tags named branch/<branchname>, as seen below.
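
Before any branch is deleted, it is worth checking that every remote branch really has its tag. The following sketch prints the branches on origin that have no branch/<name> tag yet; it assumes the remote is called origin, and missing_branch_tags is a name made up for this illustration.

```shell
# Print every branch on origin that has no matching tag "branch/<name>".
# Empty output means all remote branches are covered by a tag.
# "missing_branch_tags" is a hypothetical helper name, not a Git command.
missing_branch_tags() (
    git for-each-ref --format='%(refname)' refs/remotes/origin/ |
        sed 's|^refs/remotes/origin/||' |
        while read -r branch; do
            # origin/HEAD is a symbolic pointer, not a real branch.
            [ "$branch" = "HEAD" ] && continue
            git rev-parse --quiet --verify "refs/tags/branch/$branch" >/dev/null ||
                echo "$branch"
        done
)
```

The check only tests whether the tag exists, not whether it points to the same commit as the remote branch.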

Step 4: Clean the local GIT History

Step 4.1: Clone the GIT Repo

First of all, we need to clone the forked repository to the local machine. As a reference for later, we also print the number of objects found.

git clone <URL-of-your-large-repo>
cd <your-large-repo>
git count-objects -v

Step 4.2: Remove Binaries from current Commit

Let us assume that all binaries you need to remove are located in a directory named binarydir. In that case, remove the binaries from the working tree first:

cd binarydir
rm *.ear *.war *.jar *.zip *.exe
cd ..

Step 4.3: Filter Binaries from GIT History

At this point, the binaries are still contained in the GIT history stored in .git. To remove them from there, run the following:

git filter-branch --tag-name-filter 'cat' -f --tree-filter '
    find . -type d -name binarydir | while read -r dir
      do
        # The parentheses make -type f apply to all name patterns, not only the first
        find "$dir" -type f \( -name "*.ear" -o -name "*.war" -o -name "*.jar" -o -name "*.zip" -o -name "*.exe" \) | while read -r file
          do
             git rm -r -f --ignore-unmatch "$file"
          done
      done
' -- --all

This will take quite long if your repo has many commits (in our case, about 2 to 3 hours for more than 7000 commits).

Note: If you have the possibility to use BFG, you can do so to speed up the process. See my GIT repo git-repo-cleaner for an example.

Step 4.4: Adapt References and perform a Garbage Collection

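Now the backup references that git filter-branch keeps under refs/original/ need to be deleted and the reflog needs to be expired, so that a garbage collection can actually drop the old objects: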

git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now --aggressive
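
Once the refs/original/ references are gone, you can check whether any of the filtered files are still reachable from a branch or tag. This is a minimal sketch that assumes the same directory name (binarydir) and file extensions as in Step 4.3; remaining_binaries is a name made up for this illustration.

```shell
# Print the paths of all files below a "binarydir" directory that are still
# reachable from any ref and match one of the removed extensions.
# Empty output means none are left.
# "remaining_binaries" is a hypothetical helper name, not a Git command.
remaining_binaries() (
    git rev-list --objects --all |
        sed -n 's/^[0-9a-f]* //p' |
        grep -E '(^|/)binarydir/.*\.(ear|war|jar|zip|exe)$' |
        sort -u
)
```

If this prints anything after the rewrite, some ref still points to the old history.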

Step 5: Verify that the Size is Reduced

After this procedure, all specified binaries should be removed from the repository. In our case, the size of the GIT history was reduced from 223 MB to 21 MB.

$ du -h -d 1
21M     ./.git
...

If the size is not reduced, you might want to troubleshoot with the following command

git count-objects -v

and repeat Step 4.4. If this does not help, you can review some more troubleshooting commands in this blog post.
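
In the output of git count-objects -v, count and size refer to loose objects, while in-pack and size-pack refer to packed objects; the sizes are given in KiB. To compare the numbers before and after the cleanup, the pack size can be extracted with a small helper. This is only a sketch, and pack_size_kib is a name made up for this illustration.

```shell
# Print the size of all pack files in KiB, as reported by "git count-objects -v".
# "pack_size_kib" is a hypothetical helper name, not a Git command.
pack_size_kib() (
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
)

# Possible usage:
#   before=$(pack_size_kib)    # right after cloning (Step 4.1)
#   ...                        # Steps 4.2 to 4.4
#   after=$(pack_size_kib)
#   echo "pack size: $before KiB -> $after KiB"
```

If size-pack has not gone down although the binaries are no longer reachable, the old objects have probably not been pruned yet.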

Step 6: Save the Repository to Overwrite the old Repository

Note: We experienced a lot of problems because we first tried to retain the old repo name and URL (as seen with 'git remote -v'): around 30 developers and testers had local clones of the repo. Two or three times, a developer or tester pushed a local clone to the remote repo, and all the work was in vain. In the end, we decided to use a different repo name (and thus a different repo URL) for the cleaned repo. This helped to prevent any further unintended spoiling of the repo.

On GitHub or BitBucket, create a new repo for the cleaned repository. Then change the remote URL of the local cleaned repo to point to the new repo:

git remote -v
git remote remove origin
git remote add origin <new cleaned repo URL>
git remote -v

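# After re-adding origin, the current branch has no upstream yet, so set it explicitly
git push --force --set-upstream origin HEAD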

Step 7 (optional): Re-Create Branches

Step 7.1: Re-create Branches from Tags

If you have followed Step 3, your developers and testers can re-create their branches from the tags as follows:

BRANCH=<your branch name>
git checkout "branch/$BRANCH"
git checkout -b "$BRANCH"
git push origin "$BRANCH"

If the last command does not work, the error message usually tells you what to do instead (set the remote branch name…).
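
To see which branches can be re-created at all, the tag names can be listed with the branch/ prefix removed. A minimal sketch, assuming the tags were created with the naming scheme from Step 3; recreatable_branches is a name made up for this illustration.

```shell
# List the branch names that can be re-created from "branch/<name>" tags.
# "recreatable_branches" is a hypothetical helper name, not a Git command.
recreatable_branches() (
    git tag --list 'branch/*' | sed 's|^branch/||'
)
```

Each printed name can then be used as the value of BRANCH in the commands above.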

Step 7.2: Alternatively, push Branches directly

Since Step 3 has also created all branches locally, you can re-create remote branches from them more easily than the other developers and testers:

git checkout $BRANCH
git push --set-upstream origin $BRANCH

Step 7.3: Alternatively, create all Branches

If you want to re-create all branches, you can do that as follows:

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa

# List all local branch names, one per line, and push each of them
git for-each-ref --format='%(refname:short)' refs/heads/ | while read -r BRANCH
   do
      echo git checkout "$BRANCH"
      git checkout "$BRANCH"
      echo "git push --set-upstream origin $BRANCH"
      git push --set-upstream origin "$BRANCH"
   done

We have chosen not to do so, since only a few of the branches were really needed.

Step 8 (optional): Clean Tags

Tags that are not needed anymore can be deleted as follows:

TAG=branch/$BRANCH
git tag -d $TAG  # delete tag locally
git push origin :refs/tags/$TAG  # delete tag in repo

Here, I have assumed that the tag you want to remove was created in Step 3 with the name branch/$BRANCH.

Step 9: Push Tags to remote Repo

Tags that are still needed can be pushed to the remote repo:

git push origin <tag_name>
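
If many branch/<name> tags are still needed, pushing them one by one gets tedious. As a sketch, a wildcard refspec can push all of them at once while leaving other tags alone; it assumes the remote is called origin, and push_branch_tags is a name made up for this illustration.

```shell
# Push every local "branch/<name>" tag to origin in one go.
# Tags outside the branch/ namespace are not pushed.
# "push_branch_tags" is a hypothetical helper name, not a Git command.
push_branch_tags() (
    git push --quiet origin 'refs/tags/branch/*:refs/tags/branch/*'
)
```

Run it after Step 8, so that tags you have already deleted locally are not part of the push.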

Last Steps

  • The new repo needs to be adapted with respect to all settings, so it matches the original repo, e.g.
    • default branch (=develop in our case)
    • pull request reviewer list
    • whether or not to allow changes without pull request (not allowed for "develop" and "release/vx.x" branches in our case)
    • disallow forced updates

Summary

In a real-world example, we have cleaned a GIT repo from a set of large binary files.
