Brahma's Personal Blog

Writing about Data Engineering problems and SQL

28 September 2024

Efficient Listing of Root Folders in Google Cloud Storage Buckets

by Brahmanand Singh

In the world of cloud storage, efficiently navigating and managing your data is crucial. When working with Google Cloud Storage (GCS), you might often need to list the root folders in a bucket. While GCS doesn’t have a true folder structure, it uses a flat namespace with object names that can contain slashes to simulate folders. In this post, we’ll explore an efficient method to list these “root folders” in a GCS bucket using Python.

The Challenge

GCS doesn’t have a native “list folders” operation because it doesn’t actually have folders - it just has objects whose names may include slashes. This means we need to get creative about how we identify and list what we consider “root folders.” For example, consider the structure below:

bucket/
├── dir1/
│   └── file.txt
├── dir2/
│   └── subdir/
└── dir3/  (empty)

The output we want is:
dir1/
dir2/
dir3/
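
In flat-namespace terms, the tree above is really just object names such as dir1/file.txt, dir2/subdir/, and dir3/. The trailing-slash names are the zero-byte placeholder objects that tools like the Cloud Console create to simulate folders; note that an empty folder like dir3/ is only visible at all if such a placeholder object exists.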

One way is to list every object in the bucket, split each object name on “/”, and keep the first segment. This works, but it is inefficient: a bucket may hold thousands or tens of thousands of objects, and we would be fetching metadata for all of them just to recover a handful of folder names.
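
For comparison, that naive approach looks roughly like the sketch below (<bucket-name> is a placeholder):

from google.cloud import storage

client = storage.Client()
root_folders = set()
# Fetch metadata for every object and keep the first path segment.
for blob in client.list_blobs("<bucket-name>"):
    if "/" in blob.name:
        root_folders.add(blob.name.split("/", 1)[0] + "/")
for folder in sorted(root_folders):
    print(folder)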

The Solution

We’ll use the Google Cloud Storage Python client library, but instead of listing every object and post-processing the names, we’ll ask the API to do the filtering and grouping for us. Here’s an efficient implementation:

from google.cloud import storage

client = storage.Client()
bucket_name = "<bucket-name>"
bucket = client.bucket(bucket_name)

# Ask the API to return only "folder" entries: match_glob="**/" matches
# object names ending in "/", and delimiter="/" rolls everything below
# the first slash up into prefixes.
blobs = bucket.list_blobs(
    match_glob="**/",
    delimiter="/",
)

# blobs.prefixes is only populated once the iterator has been consumed.
list(blobs)

for folder in sorted(blobs.prefixes):
    print(folder)
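
Run against the example bucket from earlier (assuming the placeholder objects described above exist), this prints each top-level prefix:

dir1/
dir2/
dir3/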

How It Works

  1. Initialization: We create a GCS client and get a reference to our bucket.

  2. Listing Blobs: We pass match_glob="**/" to list_blobs. This glob matches object names that end in a forward slash, i.e. the placeholder objects that represent folders, while delimiter="/" tells the API to roll everything below the first slash up into prefixes. The result is that blobs.prefixes contains exactly the top-level “folders”.

  3. Simplified Processing: Because match_glob filters on the server, we do not need to loop over every object and check which names end with “/”. We simply consume the iterator once (the list(blobs) call) and read blobs.prefixes; a small helper that packages this up follows below.
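
As a minimal reusable sketch (the helper name list_root_folders is my own, not part of the client library):

from google.cloud import storage

def list_root_folders(bucket_name: str) -> list[str]:
    """Return the sorted top-level prefixes ("root folders") of a bucket."""
    client = storage.Client()
    blobs = client.bucket(bucket_name).list_blobs(
        match_glob="**/",
        delimiter="/",
    )
    list(blobs)  # consume the pages so blobs.prefixes gets populated
    return sorted(blobs.prefixes)

print(list_root_folders("<bucket-name>"))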

Efficiency Considerations

Because the filtering and grouping happen server-side, the client receives a short list of prefix strings rather than metadata for every object in the bucket.

Potential Optimizations

For very large buckets, you might additionally consider:

  1. Pagination: Process the results page by page to bound memory usage (see the sketch after this list).
  2. Parallel Processing: If you do fall back to parsing full object listings, multiple threads or processes can split the work of parsing blob names.
  3. Early Termination: Pass max_results and timeout to list_blobs to cap how many results are fetched and how long each request may take.
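
Here is a rough sketch of pagination and early termination combined, assuming the same bucket handle as in the main example (page_size, max_results, and timeout are standard list_blobs parameters, and each fetched page exposes its own prefixes):

# Bound memory by handling one page at a time, and cap the total work.
blobs = bucket.list_blobs(
    match_glob="**/",
    delimiter="/",
    page_size=1000,    # results fetched per API call
    max_results=5000,  # stop listing after this many results
    timeout=30,        # per-request timeout in seconds
)
for page in blobs.pages:
    for prefix in page.prefixes:
        print(prefix)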

Conclusion

This method provides an efficient way to list root folders in a GCS bucket, balancing simplicity with performance. It’s particularly useful for buckets with a moderate number of objects. For extremely large buckets or more complex scenarios, you may need to implement additional optimizations.

Remember that while this approach simulates a folder structure, GCS itself uses a flat namespace. Always design your storage layout with that in mind for optimal performance and ease of management.

Happy coding, and may your cloud storage always be well-organized!

tags: GCS - Bucket - Object Storage - namespace