Is Git worth it for managing many files bigger than 500MB?

I would like to put a large amount of data under version control, i.e. a directory structure (with depth <= 5) containing hundreds of files of about 500 MB each.

What I need is a system that helps me:

  • detect whether a file has been changed
  • detect whether files were added or removed
  • clone the entire repository to another location
  • store a "checkpoint" and restore it later

I don't need SHA-1 for change detection; something faster is acceptable.

Is Git worth using for this? Is there a better alternative?

Best Answer

As I mentioned in "What are the Git limits", Git is not made to manage big files (or big binary files for that matter).

Git would be needed if you needed to:

  • know what has actually changed within a file. But at the directory level, the other answers (Unison or rsync) are better.
  • keep your development data and those large resources in close proximity (i.e. in the "same referential"). Having only one referential would help, but then you would need a fork of Git, like git-bigfiles, to manage them efficiently.

Note: if you still want to use Git, you can try this approach:

Unfortunately, rsync isn't really perfect for our purposes either.

  • First of all, it isn't really a version control system. If you want to store multiple revisions of the file, you have to make multiple copies, which is wasteful, or xdelta them, which is tedious (and potentially slow to reassemble, and makes it hard to prune intermediate versions; a sketch of this follows after this list), or check them into git, which will still melt down because your files are too big.
  • Plus, rsync really can't handle file renames properly - at all.
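
As a hedged illustration of the "xdelta them" option above (the file names are hypothetical), each revision would be kept as a binary delta against the previous one using a tool such as xdelta3; this is compact, but getting a later revision back means replaying the chain of deltas, which is what makes it tedious:

    # Encode revision 2 as a delta against revision 1
    xdelta3 -e -s bigfile.v1 bigfile.v2 bigfile.v1-to-v2.xd

    # Reassembling revision 2 later requires the base file plus the delta
    xdelta3 -d -s bigfile.v1 bigfile.v1-to-v2.xd bigfile.v2.restored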

Okay, what about another idea: let's split the file into chunks, and check each of those blocks into git separately. Then git's delta compression won't have too much to chew on at a time, and we only have to send modified blocks...

This is based on gzip --rsyncable, with a proof of concept available in this Git repo.
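
To make the chunking idea concrete, here is a minimal sketch assuming a hypothetical file bigfile.bin and a fixed 8 MB chunk size; note that the real tools (bup, the gzip --rsyncable trick) use content-defined chunking rather than fixed offsets, so an insertion near the start of the file does not shift every later chunk:

    # Split the large file into fixed-size pieces and track the pieces
    mkdir -p chunks
    split -b 8M bigfile.bin chunks/bigfile.part.
    git add chunks/
    git commit -m "Store bigfile.bin as 8 MB chunks"

    # After the file changes, re-split and commit again; only the chunks
    # whose bytes actually differ produce new blobs in the repository
    split -b 8M bigfile.bin chunks/bigfile.part.
    git add chunks/
    git commit -m "Update changed chunks of bigfile.bin"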

Other Answers

git-annex is a solution to this problem. Rather than storing the large file data directly in git, it stores it in a key/value store. Symlinks to the keys are then checked into git as a proxy for the actual large files.

http://git-annex.branchable.com
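
A minimal sketch of that workflow (the repository path and file name are hypothetical):

    cd /path/to/repo
    git init
    git annex init "my laptop"

    # The large file goes into the annex; git itself only tracks a symlink
    # pointing at the content-addressed key under .git/annex/objects
    git annex add bigdata/scan-001.raw
    git commit -m "Add scan-001.raw via git-annex"

    # In a clone, the symlinks arrive immediately and the content is
    # fetched on demand from a remote that has it
    git annex get bigdata/scan-001.raw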

Unison File Synchroniser is an excellent tool for maintaining multiple copies of large binary files. It will do everything you ask for apart from storing a checkpoint, but that you could do with an rsync hard-link copy.
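
For the checkpoint part, a hedged sketch of the rsync hard-link approach (the directory names are hypothetical): files unchanged since the previous snapshot are hard-linked rather than copied, so each checkpoint costs very little disk space, and restoring one is just an ordinary copy back:

    # Take today's checkpoint, hard-linking unchanged files to yesterday's
    rsync -a --link-dest=/backups/2010-05-01/ /data/ /backups/2010-05-02/

    # Restore a checkpoint by copying it back over the working copy
    rsync -a /backups/2010-05-01/ /data/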

If you're on a Unix system (you probably are, since you're using Git):

  • Use a git repo for all the small stuff.
  • Symlink large files from a single "large_files" folder to the appropriate locations within your repository.
  • Back up the large_files folder using a more traditional, non-versioning backup system; bundle them all up into a zip file from time to time if you need to pass them to others.

That way, you get the benefits of Git, you keep whatever tree structure you want, and the large files are backed up elsewhere, despite appearing to still be inside the normal folder hierarchy.
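
A minimal sketch of that layout, with hypothetical paths; Git stores only the tiny symlinks, while the actual data lives in large_files and is backed up separately:

    # Move the big file out of the repo and symlink it back into place
    mkdir -p ~/large_files
    mv repo/assets/video-master.mov ~/large_files/
    ln -s ~/large_files/video-master.mov repo/assets/video-master.mov

    cd repo
    git add assets/video-master.mov    # commits the symlink, not the 500 MB file
    git commit -m "Track video-master.mov as a symlink into large_files"

    # Back up large_files with a non-versioning tool, e.g. rsync or a zip archive
    rsync -a ~/large_files/ /backups/large_files/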

Maybe something like rsync is better suited to your needs (if you just want some backups, and no concurrency, merging, branching, etc.).
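
A hedged sketch of that rsync-only setup (the paths are hypothetical): a dry run with --itemize-changes lists changed, added and removed files without touching anything, comparing size and modification time rather than checksums (add -c to force checksums), and dropping -n performs the actual mirror:

    # Report what changed, was added, or was removed, without copying anything
    rsync -avn --delete --itemize-changes /data/ /mirror/

    # Actually mirror /data/ into /mirror/
    rsync -av --delete /data/ /mirror/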




