Writing about Data Engineering problems and SQL
by Brahmanand Singh
In the world of cloud storage, efficiently navigating and managing your data is crucial. When working with Google Cloud Storage (GCS), you might often need to list the root folders in a bucket. While GCS doesn’t have a true folder structure, it uses a flat namespace with object names that can contain slashes to simulate folders. In this post, we’ll explore an efficient method to list these “root folders” in a GCS bucket using Python.
GCS doesn’t have a native “list folders” functionality because it doesn’t actually have folders: it just has objects with names that may include slashes. This means we need to get creative with how we identify and list what we consider “root folders.” For example, consider the structure below:
bucket/
├── dir1/
│ └── file.txt
├── dir2/
│ └── subdir/
└── dir3/ (empty)
You want to get this output:
dir1/
dir2/
dir3/
One way is to list every object in the bucket, split each object name on “/”, and keep the first element. This is not efficient, because a bucket may contain thousands or tens of thousands of objects.
We’ll use the Google Cloud Storage Python client library and let the service do the filtering: a glob pattern restricts the listing to folder-like objects, and a delimiter groups the results into root-level prefixes. Here’s an efficient implementation:
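To make the naive approach concrete, here is a minimal sketch of it. The function name root_folders_naive and the sample object names are hypothetical; in a real run the names would come from iterating over every blob in the bucket, which is exactly the cost this post tries to avoid.

```python
def root_folders_naive(object_names):
    """Collect the first path segment of every object name that contains a slash."""
    roots = set()
    for name in object_names:
        # Names without a slash sit at the bucket root and belong to no folder.
        if "/" in name:
            roots.add(name.split("/", 1)[0] + "/")
    return sorted(roots)


# Hypothetical object names mirroring the example structure above.
names = ["dir1/file.txt", "dir2/subdir/", "dir3/", "readme.md"]
print(root_folders_naive(names))  # ['dir1/', 'dir2/', 'dir3/']
```

Every object name has to be transferred and inspected, so the work grows with the total number of objects rather than with the number of root folders.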
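from google.cloud import storage

# Create a client and get a reference to the bucket.
client = storage.Client()
bucket_name = '<bucket-name>'
bucket = client.bucket(bucket_name)

# Keep only object names ending in "/" and group results on the "/" delimiter.
blobs = bucket.list_blobs(
    match_glob="**/",
    delimiter="/",
)

# blobs.prefixes is populated only as result pages are fetched,
# so consume the iterator before reading it.
list(blobs)

for folder in sorted(blobs.prefixes):
    print(folder)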
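Initialization: We create a GCS client and get a reference to our bucket.
Listing Blobs: We pass match_glob="**/" and delimiter="/" to the list_blobs call. The glob matches object names that end with a forward slash, which is how folder placeholders appear in a flat namespace. The delimiter makes the service group results by their first path segment and report those segments as prefixes.
Simplified Processing: Since match_glob filters the objects as part of the list call, we do not need a for loop that keeps only the objects ending with “/”. We still consume the iterator with list(blobs), because blobs.prefixes is populated only as result pages are fetched.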
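The steps above can be packaged into a reusable function. This is only a sketch: the name list_root_folders and the optional client argument are assumptions added here so that the function can be exercised with a stub client, and it relies on the same match_glob and delimiter behavior described above.

```python
def list_root_folders(bucket_name, client=None):
    """Return the sorted root-level prefixes ("root folders") of a bucket."""
    if client is None:
        # Imported lazily so a stub client can be injected without the library.
        from google.cloud import storage
        client = storage.Client()

    blobs = client.bucket(bucket_name).list_blobs(
        match_glob="**/",
        delimiter="/",
    )
    # The prefixes set is filled in page by page, so drain the iterator first.
    for _ in blobs:
        pass
    return sorted(blobs.prefixes)
```

Calling list_root_folders("<bucket-name>") with default credentials should print-ready return the same folder names as the script above, in sorted order.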
For very large buckets, you might consider the max_results and timeout parameters of the list_blobs method.
This method provides an efficient way to list root folders in a GCS bucket, balancing simplicity with performance. It’s particularly useful for buckets with a moderate number of objects. For extremely large buckets or more complex scenarios, you may need to implement additional optimizations.
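As a sketch of how the max_results and timeout parameters could be applied, the function below passes both to list_blobs. The function name list_root_folders_bounded and the default values are assumptions chosen for illustration, not recommendations. Note that capping the number of results may cut the listing short, in which case some root folders could be missing from the output.

```python
def list_root_folders_bounded(bucket, max_results=1000, timeout=30.0):
    """List root-level prefixes with a cap on results and a request timeout."""
    blobs = bucket.list_blobs(
        match_glob="**/",
        delimiter="/",
        max_results=max_results,  # upper bound on the number of blobs returned
        timeout=timeout,          # seconds to wait for the server response
    )
    # Prefixes are collected as pages are fetched, so drain the iterator first.
    for _ in blobs:
        pass
    return sorted(blobs.prefixes)
```

Here bucket is expected to be the object returned by client.bucket('<bucket-name>'), as in the script earlier in this post.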
Remember, while this approach simulates a folder structure, it’s important to keep in mind that GCS uses a flat namespace. Always design your storage structure with this in mind for optimal performance and ease of management.
Happy coding, and may your cloud storage always be well-organized!
tags: GCS - Bucket - Object Storage - namespace