Asked By: Anonymous
I have some very large files (100GB) in GCS that need to be processed to remove invalid characters.
Downloading, processing, and re-uploading them takes forever.
Does anyone know if it is possible to process them within Google Cloud Platform, eliminating the need for download/upload?
I am familiar with Python and Cloud functions if those are an option.
Answered By: Anonymous
As John Hanley said in the comments section, there are no compute features on Cloud Storage, so to process a file you need to download it.
That said, instead of downloading the huge file locally, you can start a Compute Engine VM, download the file there, process it with a Python script (since you have stated that you’re familiar with Python), and upload the processed file back to Cloud Storage.
It will probably be quicker to download the file to a Compute Engine VM (it depends on the machine type, though) than to your own computer.
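On the VM, the cleaning step can be done in a streaming fashion so the 100GB file is never loaded into memory at once. A minimal sketch, assuming "invalid characters" means ASCII control characters (adjust the pattern to whatever "invalid" means for your data; the function names here are just for illustration):

```python
import re

# Bytes considered "invalid" here: ASCII control characters, except
# tab (\x09), newline (\x0a), and carriage return (\x0d).
# Adjust this pattern to match your definition of "invalid".
INVALID = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def clean_chunk(chunk: bytes) -> bytes:
    """Remove invalid bytes from one chunk of the file."""
    return INVALID.sub(b'', chunk)

def clean_file(src_path: str, dst_path: str,
               chunk_size: int = 64 * 1024 * 1024) -> None:
    """Stream src_path to dst_path, cleaning chunk by chunk.

    Because the pattern matches single bytes only, it is safe to
    apply it per chunk without worrying about chunk boundaries.
    """
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(clean_chunk(chunk))
```

If the invalid characters were multi-byte sequences instead, you would need to carry a small overlap between chunks so a match split across a boundary isn't missed.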
Also, for faster downloads of huge files, you can use sliced object downloads via these gsutil options:
gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=16' cp gs://my-bucket/my-huge-file .
And for faster uploads of huge files, you can use parallel composite uploads:
gsutil -o 'GSUtil:parallel_composite_upload_threshold=150M' cp my-huge-file gs://my-bucket