by tux3 on 10/27/2024, 7:44:59 AM
by yunusabd on 10/27/2024, 6:31:46 AM
> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.
What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone, until the change is implemented on the server side?
> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo
The sentence seems to be cut off.
Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.
by eviks on 10/27/2024, 6:30:48 AM
upd: silly mistake - file name does not include its full path
The explanation probably got lost among all the gifs, but the last 16 chars here are different:
> was actually only checking the last 16 characters of a filename
> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!
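To make the "last 16 characters" idea concrete, here is a minimal Python sketch of a hash with that property. The arithmetic is my recollection of the name hash git uses when grouping delta candidates, so treat it as an assumption rather than the actual implementation; `name_hash` and the example paths are made up for illustration.

```python
def name_hash(path: str) -> int:
    """Illustrative 32-bit hash dominated by the trailing bytes of a name.

    Assumption: this follows the general shape of git's pack name hash
    (shift the running value right by 2 bits, add the new byte at the top).
    """
    h = 0
    for byte in path.encode("utf-8"):
        if byte in b" \t\n\v\f\r":  # whitespace is skipped
            continue
        # Every step pushes earlier bytes 2 bits further down the 32-bit
        # value, so bytes more than 16 positions from the end can change
        # the result by at most 1.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


# Hypothetical paths that share more than their last 16 characters.
a = "packages/alpha/docs/CHANGELOG.md"
b = "packages/beta/docs/CHANGELOG.md"
print(abs(name_hash(a) - name_hash(b)))  # at most 1: the shared tail dominates

# Names that differ in their final byte land far apart.
print(abs(name_hash("src/main.c") - name_hash("src/main.h")))
```

As far as I understand, candidates are sorted by this value before deltas are tried, so files with equal or adjacent hashes end up being compared with each other even when they live in unrelated directories.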
by tazjin on 10/27/2024, 2:59:54 PM
I just tried this on nixpkgs (~5GB when cloned straight from GitHub).
The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.
Both of these are less than half of the initial size. Would be great if there was a way to get GitHub to run these, and even greater if people started hosting stuff in a way that gives them control over this ...
by jakub_g on 10/27/2024, 9:19:52 AM
The article mentions Derrick Stolee, who did the digging and shipped the necessary changes. If you're interested in git internals, shrinking git clone sizes locally and in CI etc., Derrick wrote some amazing posts on the GitHub blog:
https://github.blog/author/dstolee/
See also his website:
Kudos to Derrick, I learnt so much from those!
by fragmede on 10/27/2024, 6:20:39 AM
> Large blobs happens when someone accidentally checks in some binary, so, not much you can do
> Retroactively, once the file is there though, it's semi stuck in history.
Arguably, the fix for that is to run filter-branch, remove the offending binary, teach everyone to use git-lfs for binaries and get them set up with it, force push, and help everyone get their workstation to a good place.
Far from ideal, but better than having a large not-even-used file in git.
by develatio on 10/27/2024, 9:29:36 AM
Hacking Git sounds fun, but isn't there a way to just not have 2.500 packages in a monorepo?
by snthpy on 10/27/2024, 7:41:26 AM
Thanks for this post. Really interesting and a great win for OSS!
I've been watching all the recent GitMerge talks put up by GitButler and following the monorepo / scaling developments - lots of great things being put out there by Microsoft, GitHub, and GitLab.
I'd like to understand this last 16 char vs full path check issue better. How does this fit in with delta compression, pack indexes, multi-pack indexes etc ... ?
by wodenokoto on 10/27/2024, 8:27:45 AM
Nice to see that Microsoft is dog-fooding Azure DevOps. It seems that more and more Azure services only have native connectors to GitHub so I actually thought it was moving towards abandonware.
by issung on 10/27/2024, 7:04:28 AM
Having someone within arm's reach who knows the inner workings of Git so well must be a lovely perk of working on such projects at companies of this scale.
by nkmnz on 10/27/2024, 8:24:23 AM
> we have folks in Europe that can't even clone the repo due to it's size
Officer, I'd like to report a murder committed in a side note!
by dizhn on 10/27/2024, 4:27:38 PM
They call him Linux Torvalds over there?
by bubblesnort on 10/27/2024, 6:50:43 AM
> We work in a very large Javascript monorepo at Microsoft we colloquially call 1JS.
I used to call it office.com.. Teams is the worst offender there. Even a website with a cryptominer on it runs faster than that junk.
by triyambakam on 10/27/2024, 6:31:13 AM
> we have folks in Europe that can't even clone the repo due to it's size.
What is it about Europe that makes it more difficult? That internet in Europe isn't as good? Actually, I have heard that some primary schools in Europe lack internet. My grandson's elementary school in rural California (population <10k) had internet as far back as 1998.
by rettichschnidi on 10/27/2024, 7:48:28 AM
I'm surprised they are actually using Azure DevOps internally. Creating your own hell I guess.
by nixosbestos on 10/27/2024, 9:34:20 PM
Oh hey I know that name, Stolee. Fellow JSR grad here.
by jbverschoor on 10/27/2024, 8:14:06 AM
> those branches that only change CHANGELOG.md and CHANGELOG.json, we were fetching 125GB of extra git data?! HOW THO??
Unrecognized 100x programmer somewhere lol
by mattlondon on 10/27/2024, 10:45:41 AM
I recently had a similar moment of WTF for git in a JavaScript repo.
Much much smaller of course though. A raspberry pi had died and I was trying to recover some projects that had not been pushed to GitHub for a while.
Holy crap. A few small JavaScript projects with perhaps 20 or 30 code files, a few thousand lines of code for a couple of 10s of KBs of actual code at most had 10s of gigabytes of data in the .git/ folder. Insane.
In the end I killed the recovery of the entire home dir and manually selected folders instead, so that I would not accidentally recover a .git/ dir. It was taking forever on an SD card that was already in a bad way, and I did not want to kill it for good by trying to salvage countless gigabytes of git trash.
by Vilian on 10/28/2024, 2:34:53 AM
People who use git in monorepos don't understand git
by nsonha on 10/28/2024, 7:16:16 AM
I think the title misses the "Honey, " part
by EDEdDNEdDYFaN on 10/27/2024, 11:48:55 AM
better question - does the changelog need to be checked in the first place?
by jakub_g on 10/27/2024, 9:10:05 AM
Paraphrasing the meat of the article:
- When multiple files in the repo share the same trailing 16 characters of their path, git may calculate deltas wrongly, mixing those files up. Here they had multiple CHANGELOG.md files mixed up.
- So if those files are big and change often, you end up with massive deltas and inflated repo size.
- There's a new git option (in Microsoft's git fork for now) and a config setting to use the full file path to calculate those deltas, which fixes the issue both when pushing and when repacking the repo locally.
```
git repack -adf --path-walk
git config --global pack.usePathWalk true
```
- According to a screenshot, Chromium repacked in this way shrinks from 100GB to 22GB.
- However AFAIU until GitHub enables it by default, GitHub clones from such repos will still be inflated.
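To see why picking the wrong delta base inflates things, here is a rough Python sketch. It uses zlib with a preset dictionary as a stand-in for git's delta encoder (an assumption for illustration, not how git actually stores deltas), and the changelog contents are randomly generated placeholders.

```python
import random
import zlib


def fake_changelog(seed: int, entries: int = 200) -> bytes:
    """Generate a made-up changelog whose entries are unique to one package."""
    rng = random.Random(seed)
    lines = [f"- change {rng.getrandbits(64):016x}" for _ in range(entries)]
    return "\n".join(lines).encode("ascii")


def delta_cost(base: bytes, target: bytes) -> int:
    """Bytes needed to store `target` when `base` is available as a reference."""
    comp = zlib.compressobj(level=9, zdict=base)
    return len(comp.compress(target) + comp.flush())


old_foo = fake_changelog(seed=1)  # previous version of package foo's changelog
old_bar = fake_changelog(seed=2)  # changelog of an unrelated package
new_foo = b"- change: one new entry\n" + old_foo  # foo's changelog after one commit

print("against its own history:", delta_cost(old_foo, new_foo))
print("against another package:", delta_cost(old_bar, new_foo))
```

In this toy setup the first number should come out far smaller than the second, since almost all of the new file can be copied from its own previous version. Multiply that gap by many packages and many commits and you get the kind of inflation the article describes.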
by jimjimjim on 10/27/2024, 7:16:05 AM
Did anybody else shudder at "Shrunked"?
by killingtime74 on 10/27/2024, 6:23:04 AM
Shrank
by AbuAssar on 10/27/2024, 7:55:22 AM
the gif memes were very distracting...
For those wondering where this new git-survey command is, it's actually not in git.git yet!
The author is using Microsoft's git fork, where this new command was added just this summer: https://github.com/microsoft/git/pull/667