In this blog post, I will show how to completely remove large files from a GIT repository and its GIT history. I have found many examples on the net, but it was still a challenge to reduce the size of the GIT repository.
This has been tested on a real-world repository with almost 8000 commits and 120 branches. The goals and constraints were:
- to keep the full (rewritten) history
- to be able to keep any branch you want
- usage of external software like BFG was forbidden due to security restrictions. However, I have tested BFG privately in the git-repo-cleaner GIT repo (see References), which shows a simple example of how BFG can be used, and I can recommend its usage.
In our case, the GIT history has grown to 223 MB, because a binary of ~40 MB was updated several times:
du -h -d 1
223M ./.git
...
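To see which files are responsible for the size, the blobs in the history can be listed by size. The following is a minimal sketch and not part of the procedure we ran; `largest_blobs` is a hypothetical helper name, and it assumes it is called inside a clone of the repository.

```shell
#!/bin/sh
# largest_blobs [N]: print the N (default 10) largest blobs reachable from
# any ref as "<size-in-bytes> <path>", largest first.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort -rn |
    head -n "${1:-10}"
}

# Usage inside a repository (not run here):
#   largest_blobs 20
```

A blob that was updated several times shows up once per version, so a binary of ~40 MB committed five times appears as five large entries with the same path.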
Step 0: Inform your Developers
First of all, you need to tell everybody what you are going to do and when it will be done. The changes are destructive, and even if you fork the repository, there is no easy way to revert all the changes. For example, existing pull requests will be gone.
Step 1: Disallow any changes to the Repo
Hosting services like BitBucket allow fine-grained control over who is allowed to do what. I recommend that you disallow any changes to the repo apart from the ones you are performing yourself.
Step 2: Move and Fork the Repo
As a kind of backup, I recommend moving the original repository and working on a forked copy of it.
Step 3 (optional): Replace remote Branches by Tags
In our case, we had about 120 branches. On the one hand, cleaning the repo of binaries was a good opportunity to substantially reduce the number of branches. On the other hand, I wanted to give everybody the possibility to re-create the branch in question. For that, I have written a little script that replaces branches by tags. In a later step, I told the developers how they can use the tags to restore their branches.
#!/bin/sh
eval $(ssh-agent -s)
ssh-add ~/.ssh/id_rsa
# packed-refs might not always work, if the refs are not packed:
#grep -R origin .git/packed-refs | awk '{print $2}' | sed 's/refs\/remotes\/origin\///' | while read BRANCH
# using git branch -a instead (the symbolic origin/HEAD entry is skipped):
git branch -a | grep remotes/origin | grep -v 'origin/HEAD' | sed 's/remotes\/origin\///' | while read BRANCH
do
  echo git checkout $BRANCH
  git checkout "$BRANCH"
  echo git tag branch/$BRANCH refs/heads/$BRANCH
  git tag "branch/$BRANCH" "refs/heads/$BRANCH"
done
With that, all branches can be re-created from the tags named branch/<branchname>, as described in Step 7.
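The list of branch names could also be taken from `git for-each-ref`, which prints one full ref name per line without the decoration that `git branch -a` adds. This is a hedged sketch of the name mapping only, assuming the remote is called origin; `tag_names_for_remote_refs` is a hypothetical helper and was not part of our script.

```shell
#!/bin/sh
# tag_names_for_remote_refs: read full ref names (one per line) on stdin and
# print "branch/<name>" for every branch of the remote "origin".
# The symbolic refs/remotes/origin/HEAD entry and all other refs are skipped.
tag_names_for_remote_refs() {
  sed -n -e '/^refs\/remotes\/origin\/HEAD$/d' \
         -e 's|^refs/remotes/origin/|branch/|p'
}

# Usage inside a clone (not run here):
#   git for-each-ref --format='%(refname)' refs/remotes/origin | tag_names_for_remote_refs
```

Branch names that contain slashes, such as feature/x, are kept as they are, so the resulting tag is branch/feature/x.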
Step 4: Clean the local GIT History
Step 4.1: Clone the GIT Repo
First of all, we need to clone the forked repository to the local machine. As a reference for later, we also print the number of objects found.
git clone <URL-of-your-large-repo>
cd <your-large-repo>
git count-objects -v
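To compare the numbers before and after the cleanup, the two size fields of `git count-objects -v` can be added up. The following is a small sketch under the assumption that the output has the usual "key: value" form, with "size" and "size-pack" given in KiB; `repo_size_kib` is a hypothetical helper name.

```shell
#!/bin/sh
# repo_size_kib: read the output of `git count-objects -v` on stdin and print
# the sum of loose ("size") and packed ("size-pack") object storage in KiB.
repo_size_kib() {
  awk -F': ' '$1 == "size" || $1 == "size-pack" { total += $2 }
              END { print total + 0 }'
}

# Usage inside a repository (not run here):
#   git count-objects -v | repo_size_kib
```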
Step 4.2: Remove Binaries from current Commit
Let us assume that all binaries you need to remove are located in a directory named binarydir. In that case, let us remove the binaries from the working tree first:
cd binarydir
rm *.ear *.war *.jar *.zip *.exe
cd ..
Step 4.3: Filter Binaries from GIT History
Now the binaries are still found in the GIT history stored in .git. To remove them from there, let us perform the following:
git filter-branch --tag-name-filter 'cat' -f --tree-filter '
  find . -type d -name binarydir | while read dir
  do
    # the parentheses make -type f apply to all of the name patterns
    find "$dir" -type f \( -name "*.ear" -o -name "*.war" -o -name "*.jar" -o -name "*.zip" -o -name "*.exe" \) | while read file
    do
      git rm -r -f --ignore-unmatch "$file"
    done
  done
' -- --all
This will take quite long if your repo has many commits (in our case, about 2 to 3 hours for more than 7000 commits).
Note: If you have the possibility to use BFG, you can do so to speed up the process. See my GIT repo git-repo-cleaner for an example.
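If BFG is not an option, `git filter-branch` itself offers `--index-filter`, which only edits the index and does not check out every commit, so it is usually faster than `--tree-filter`. The following is a minimal sketch, not the command we ran: it assumes a single binarydir at the top level of the repository, and `strip_binaries_from_history` is a hypothetical function name. The FILTER_BRANCH_SQUELCH_WARNING variable only suppresses the warning that newer GIT versions print.

```shell
#!/bin/sh
# strip_binaries_from_history: rewrite all refs and drop the listed file types
# below a top-level binarydir/ from every commit. Must be run inside the
# repository with a clean working tree.
strip_binaries_from_history() {
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --tag-name-filter cat \
    --index-filter '
      git rm -r -q --cached --ignore-unmatch -- \
        "binarydir/*.ear" "binarydir/*.war" "binarydir/*.jar" \
        "binarydir/*.zip" "binarydir/*.exe"
    ' -- --all
}

# Usage inside the cloned repository (not run here):
#   strip_binaries_from_history
```

As with the tree filter, the old commits stay reachable through refs/original/ until the references are cleaned up in the next step.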
Step 4.4: Adapt References and perform a Garbage Collection
Now the references need to be updated and we need to perform a garbage collection:
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now --aggressive
Step 5: Verify that the Size is Reduced
After this procedure, all specified binaries should be removed from the repository. In our case the git history size has been reduced to 21 MB.
$ du -h -d 1
21M ./.git
...
If the size is not reduced, you might want to troubleshoot with the following command
git count-objects -v
and repeat step 4.4. If this does not help, you can review some more troubleshooting commands on this blog post.
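Another check is to ask GIT directly whether any of the removed file types is still reachable from a branch or tag. This is a sketch under the assumption that the extensions are the ones used in step 4.3; `binaries_left_in_history` is a hypothetical helper name. It looks at the whole repository, not only at binarydir, so it may also list files you kept on purpose.

```shell
#!/bin/sh
# binaries_left_in_history: list every path with one of the removed extensions
# that is still reachable from a branch or tag. Prints nothing if none is left.
binaries_left_in_history() {
  git rev-list --objects --branches --tags |
    cut -d' ' -f2- |
    grep -E '\.(ear|war|jar|zip|exe)$' |
    sort -u
}

# Usage inside the cleaned repository (not run here):
#   binaries_left_in_history
```

Only branches and tags are inspected here. Backup references under refs/original/ still point to the old commits until they have been deleted in step 4.4.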
Step 6: Save the Repository to Overwrite the old Repository
Note: We experienced a lot of problems because we first tried to retain the old repo name and URL (as seen with 'git remote -v'): around 30 developers and testers had local clones of the repo. Two or three times, a developer or tester pushed an old local clone to the remote repo, and all the work was in vain. In the end, we decided to use a different repo name (and thus a different repo URL) for the cleaned repo. This prevented any further unintended spoiling of the repo.
On GitHub or BitBucket, create a new repo for the cleaned history. Then change the GIT URL of the local cleaned repo to match the new repo:
git remote -v
git remote remove origin
git remote add origin <new cleaned repo URL>
git remote -v
git push --force
Step 7 (optional): Re-Create Branches
Step 7.1: Re-create Branches from Tags
If you have followed step 3, your developers and testers can re-create their branches from the tags as follows:
BRANCH=<your branch name>
git checkout branch/$BRANCH
git checkout -b $BRANCH
git push origin $BRANCH
If the last command does not work, the error message usually tells you what to do instead (set the remote branch name…).
Step 7.2: Alternatively, push Branches directly
Since Step 3 will also have created all branches locally, you also can re-create remote branches from them more easily than other developers and testers:
git checkout $BRANCH
git push --set-upstream origin $BRANCH
Step 7.3: Alternatively, create all Branches
If you want to re-create all branches, you can do that as follows:
eval $(ssh-agent -s)
ssh-add ~/.ssh/id_rsa
git branch | awk -F '[ *]' '{print $3}' | while read BRANCH
do
  echo git checkout $BRANCH
  git checkout $BRANCH
  echo "git push --set-upstream origin $BRANCH"
  git push --set-upstream origin $BRANCH
done
We have chosen not to do so, since only a few of the branches were really needed.
Step 8 (optional): Clean Tags
Tags that are not needed anymore can be deleted as follows:
TAG=branch/$BRANCH
git tag -d $TAG # delete tag locally
git push origin :refs/tags/$TAG # delete tag in repo
Here, I have assumed that the tag you want to remove has been created in Step 3 before with the name branch/$BRANCH.
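Once nobody needs the branch tags anymore, they could also be removed in one go. The following is a minimal sketch and not something we ran in this form; `delete_branch_tags` is a hypothetical helper, and it assumes that all tags named branch/... were created in Step 3.

```shell
#!/bin/sh
# delete_branch_tags [REMOTE]: delete every local tag named branch/<something>.
# If a remote name is given, the tag is deleted on that remote as well.
delete_branch_tags() {
  remote=${1:-}
  git tag -l 'branch/*' | while read -r TAG
  do
    git tag -d "$TAG"                        # delete tag locally
    if [ -n "$remote" ]; then
      git push "$remote" ":refs/tags/$TAG"   # delete tag in the remote repo
    fi
  done
}

# Usage inside the cleaned repository (not run here):
#   delete_branch_tags          # local tags only
#   delete_branch_tags origin   # local tags and tags on origin
```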
Step 9: Push Tags to remote Repo
Tags that are still needed can be pushed to the remote repo:
git push origin <tag_name>
Last Steps
- The new repo needs to be adapted with respect to all settings, so that it matches the original repo, e.g.
  - default branch (develop in our case)
  - pull request reviewer list
  - whether or not to allow changes without pull request (not allowed for the "develop" and "release/vx.x" branches in our case)
  - disallow forced updates
Summary
In a real-world example, we have cleaned a GIT repo from a set of large binary files.
References
- GIT repo example: https://github.com/oveits/git-repo-cleaner demonstrates with a simple example how to add a large file and remove it from the history.
- This blog post gives you quite a few additional commands for troubleshooting if the size fails to be reduced.