Moving Mountains — Large File Upload Performance
Last week it was papercuts, and this week it’s elephants… or something…
In this GCS Bytes post, we’re talking about large file uploads and how you can move mountains of materials in mere moments!
Large upload, low speed
I’ve got a dataset with files ranging in size from a few hundred MB up to 2 GB, and when I was uploading it to GCS, I noticed it was giving me some trouble. The upload speed was pretty low, and since that isn’t what I want, I decided to chart my upload performance to see what was up.
Taking that peek under the hood, I noticed a clear upper limit on upload speeds. It seems to hit the larger uploads in particular: at a certain point, everything just starts to crawl.
While unfortunate, this wasn’t entirely unexpected. While I was uploading, gsutil had warned me that the files were large. If I’d taken a closer look at that warning, I could have mended my ways with a game-changing suggestion. I’ll save you the heartache:
While giving you grief, it’s telling you to use parallel composite uploads for the best performance when dealing with one or more large files. But what does this mean, and does it have anything to do with our previous conversations about small file upload performance?
Let’s take a look!
Parallel Composite Uploads
During a parallel composite upload, gsutil splits your larger file into smaller, equal-sized chunks on your local disk. It then uploads those chunks to your GCS bucket in parallel and composes them back into a single object once they’re all uploaded.
The result is a faster upload for you, without having to worry about slicing it up yourself.
**File splitting not to scale**
gsutil even cleans up after itself, removing all of the temporary components, which is critical if you want to avoid any unintentional storage costs from composite objects (more on that in the article).
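To make that mechanism concrete, here’s a minimal sketch of roughly what gsutil is doing on your behalf, written against the google-cloud-storage Python client. The bucket name, file name, and chunk size are placeholders, and a real parallel composite upload handles retries, integrity checks, and component naming far more carefully than this:

```python
# Rough sketch of a parallel composite upload: split, upload in parallel,
# compose, then clean up the temporary components. Names are placeholders.
import concurrent.futures
from google.cloud import storage

BUCKET_NAME = "your-bucket"        # placeholder
SOURCE_FILE = "./localbigfile"     # placeholder
FINAL_NAME = "localbigfile"
CHUNK_SIZE = 150 * 1024 * 1024     # ~150 MiB, mirroring the gsutil threshold

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

def upload_chunk(index, data):
    """Upload one slice of the file as a temporary component object."""
    part = bucket.blob(f"{FINAL_NAME}.part-{index}")
    part.upload_from_string(data)
    return part

# 1. Split the local file into chunks and upload them in parallel.
futures = []
with open(SOURCE_FILE, "rb") as f, concurrent.futures.ThreadPoolExecutor() as pool:
    index = 0
    while True:
        data = f.read(CHUNK_SIZE)
        if not data:
            break
        futures.append(pool.submit(upload_chunk, index, data))
        index += 1
parts = [fut.result() for fut in futures]

# 2. Recombine the components into a single composite object
#    (compose accepts at most 32 source objects per call).
final = bucket.blob(FINAL_NAME)
final.compose(parts)

# 3. Delete the temporary components so you don't pay to store them twice.
for part in parts:
    part.delete()
```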
And just like the parallelism we discussed in our post on small file upload performance, parallel composite uploads dramatically improve upload performance for larger files.
Using it in practice
To test what parallel composite upload performance looks like, we’ll grab a 500 MB file, upload it ~100 times, and chart the performance of each run.
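The test harness itself is nothing fancy. Here’s a hypothetical sketch that shells out to gsutil and records the wall-clock time of each run; the file and bucket names are placeholders, and you’d run it once with and once without the parallel composite flag described below:

```python
# Hypothetical benchmark harness: upload the same file repeatedly with gsutil
# and record wall-clock time for each run. Names are placeholders.
import subprocess
import time

RUNS = 100
SOURCE_FILE = "./testfile-500mb"   # placeholder
DESTINATION = "gs://your-bucket/"  # placeholder

timings = []
for run in range(RUNS):
    start = time.monotonic()
    subprocess.run(["gsutil", "cp", SOURCE_FILE, DESTINATION], check=True)
    timings.append(time.monotonic() - start)
    print(f"run {run}: {timings[-1]:.1f}s")

print(f"mean upload time: {sum(timings) / len(timings):.1f}s")
```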
You can enable this by setting the `parallel_composite_upload_threshold` option on gsutil (or by updating your .boto file, as the console output suggests):
`gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./localbigfile gs://your-bucket`
Here `localbigfile` is a file larger than 150 MiB. This divides your data into chunks of roughly 150 MiB and uploads them in parallel, increasing upload performance. (Note that there are some restrictions on the number of chunks that can be used; refer to the docs for more information.)
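If you’d rather not pass the `-o` flag on every invocation, the same threshold can live in the `[GSUtil]` section of your .boto file, along these lines:

```ini
[GSUtil]
parallel_composite_upload_threshold = 150M
```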
If we run the test again to check out those upper limits and upload times, we’ve got a refreshing visual:
On some tests, there’s as much as a 2x improvement in upload time, meaning you’ve successfully avoided that upper limit, and are well on your way to moving that mountain.
Make sure you check out the other blogs on large file optimization, so you can really make the most of Cloud Storage!