A very BIG ML dataset un-TAR GZIP command

I learned that none of my GUI Mac programs could expand the 13 GB dataset; the command line, however, had no problem with it.


$ tar xvzf BIG_DATASET_MANY_THOUSANDS_FOLDERS.tar.gz

It would be great if it were this simple!

The command failed because I ran out of disk space, all 41 GB of it, before the archive was fully expanded.
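
Before retrying, it helps to compare the expanded size of the archive with the free space on the target disk. The following is a minimal sketch, assuming a POSIX shell; the helper names `uncompressed_bytes` and `free_kib` are hypothetical. It decompresses the stream and counts the bytes without writing anything to disk, rather than trusting the size field in the gzip trailer, which is stored modulo 4 GiB.

```shell
#!/bin/sh
# Hypothetical helper: print the uncompressed size of a .tar.gz in bytes.
# The data is streamed through wc, so nothing is written to disk.
uncompressed_bytes() {
    gzip -dc "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: print the free space, in KiB, on the volume
# that holds the given directory.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Usage (uncomment to run against the real archive):
# need_kib=$(( $(uncompressed_bytes BIG_DATASET_MANY_THOUSANDS_FOLDERS.tar.gz) / 1024 ))
# have_kib=$(free_kib .)
# [ "$have_kib" -gt "$need_kib" ] || echo "need ${need_kib} KiB, have ${have_kib} KiB"
```

The tar stream size is only an approximation: many small files can occupy more space on disk than their combined byte count suggests.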

Alternatively, I considered going one directory at a time,

$ tar xvzf BIG_DATASET_MANY_THOUSANDS_FOLDERS.tar.gz directory_path


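with a script that traverses the directories. The path has to match the member name as stored in the archive (as printed by `tar tzf`), which normally has no leading slash. This way I can keep track of which directories were expanded correctly.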
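
Such a script could look roughly like the sketch below. It is not the script I actually ran; the helper names `list_top_level` and `extract_by_directory` and the log file are hypothetical, and it assumes the folders sit at the top level of the archive and are stored without a leading `./`.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct top-level entries in a .tar.gz.
list_top_level() {
    tar tzf "$1" | cut -d/ -f1 | sort -u
}

# Hypothetical helper: extract the archive one top-level directory at a time.
# Each success is appended to a log, so a rerun skips finished directories.
# Usage: extract_by_directory ARCHIVE DESTINATION LOGFILE
extract_by_directory() {
    archive=$1
    dest=$2
    log=$3
    mkdir -p "$dest"
    touch "$log"
    list_top_level "$archive" | while IFS= read -r dir; do
        # Skip directories already recorded as done.
        if grep -qxF -- "$dir" "$log"; then
            continue
        fi
        if tar xzf "$archive" -C "$dest" "$dir" < /dev/null; then
            printf '%s\n' "$dir" >> "$log"
        else
            echo "failed: $dir" >&2
        fi
    done
}

# Usage (uncomment to run against the real archive):
# extract_by_directory BIG_DATASET_MANY_THOUSANDS_FOLDERS.tar.gz ./expanded done.log
```

Every `tar` call reads the compressed stream again from the start, so this trades speed for the ability to stop, free up or switch disks, and resume.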

At this point I had ended up with multiple directories on various disks, so a directory merging tool was very useful:

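# parameters (the command below uses only -a and -W; the rest are optional):
# -a, --archive; recurse into directories and preserve symlinks, permissions, times, owner and group
# -i, --itemize-changes; print a change summary for each file
# -h, --human-readable; print sizes in human-readable units
# -W, --whole-file; copy whole files, skipping the delta-transfer algorithm
# --progress; show progress in the terminal
# --log-file=XYZ.log; log the progress to a file, which might be useful when resuming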
$ rsync -aW source_directory/ destination_directory/
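
With several partial directories to merge, the same command can be wrapped in a loop. This is a minimal sketch under the assumption that all sources should end up in one destination; the helper name `merge_into` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: merge every given source directory into one destination.
# Usage: merge_into DESTINATION SOURCE1 [SOURCE2 ...]
merge_into() {
    dest=$1
    shift
    mkdir -p "$dest"
    for src in "$@"; do
        # The trailing slash copies the contents of $src, not $src itself.
        rsync -aW "${src%/}/" "${dest%/}/" || return 1
    done
}

# Usage (uncomment and adjust the paths, which are placeholders):
# merge_into destination_directory source_directory_1 source_directory_2
```

If two sources contain a file at the same relative path, a later source can overwrite the copy from an earlier one. Adding `-n` (`--dry-run`) to the rsync call shows what would be transferred without copying anything.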


References:

  • https://www.thegeekstuff.com/2010/04/unix-tar-command-examples/
  • https://medium.com/@sethgoldin/a-gentle-introduction-to-rsync-a-free-powerful-tool-for-media-ingest-86761ca29c34