Enhancing File Uploads to Amazon S3 Using Multipart Upload with Pre-signed URLs and Threaded Parallelism
Efficiently handling file uploads to Amazon S3 is crucial for today’s applications, especially those dealing with large amounts of data. In this guide, we’ll delve into methods to improve file upload processes using pre-signed URLs and threaded parallelism. By the end of this blog, you’ll understand how to enhance your file upload workflows to Amazon S3 using these techniques, making your application more efficient and responsive.
Let’s Start
We won't cover AWS or Python fundamentals in this blog; instead, we explore ways of uploading a large file (>5 GB) to an AWS S3 bucket. This blog will give you an idea of how to design your upload functionality effectively, touching on the basic concepts of multipart upload and pre-signed URLs.
Pre-signed URLs let clients upload directly to S3 without ever seeing your credentials, while threaded parallelism speeds up the transfer by uploading multiple parts at once. Combined, the two techniques let developers streamline file uploads and keep applications responsive.
Do you know what multipart upload is?
Multipart upload in Amazon S3 is a feature that allows you to upload large objects in parts, which can improve the reliability, speed, and efficiency of uploads.
Multipart upload offers a robust way to move large files efficiently. It enables resumable uploads, parallel uploading of parts, and better use of network bandwidth, making it well suited to scenarios such as large file transfers, streaming data, and backup operations. Because an object is broken into smaller parts, a failure only affects the part being transferred: you retry that part rather than the whole file. This makes multipart upload an indispensable feature for managing large datasets in cloud storage environments.
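To make this concrete, here is a minimal sketch of a plain multipart upload with Boto3 alone, before pre-signed URLs come into the picture; the bucket name, object key, and local file path are placeholders:

import math
import os

import boto3

# Placeholder bucket, key, and local file
s3 = boto3.client('s3')
bucket, key, path = 'my-bucket', 'uploads/big.bin', 'big.bin'
part_size = 8 * 1024 * 1024  # every part except the last must be at least 5 MiB

# 1. Start the multipart upload and remember its UploadId
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)['UploadId']

# 2. Upload each part; S3 returns an ETag identifying it
parts = []
num_parts = math.ceil(os.path.getsize(path) / part_size)
with open(path, 'rb') as f:
    for part_number in range(1, num_parts + 1):
        resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                              PartNumber=part_number, Body=f.read(part_size))
        parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

# 3. Stitch the parts together into a single object
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                             MultipartUpload={'Parts': parts})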
Let’s explore the solution approach in some basic steps:
- Understand the importance of file upload optimization.
- Leverage pre-signed URLs for secure and efficient uploads.
- Implement pre-signed URL generation in Python using Boto3.
- Upload file parts to Amazon S3 using pre-signed URLs.
- Complete the multipart upload process to consolidate file parts.
Understanding the Importance of File Upload Optimization:
Efficient file uploads are crucial for user experience and system performance, especially when dealing with large files. Optimizing file uploads ensures faster transfer times, reduces server load, and enhances overall application responsiveness.
Leveraging Pre-signed URLs for Secure and Efficient Uploads:
Pre-signed URLs offer a secure method for uploading files directly to Amazon S3 without exposing sensitive credentials. They provide temporary access to specific S3 resources, enhancing security and minimizing potential risks.
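As a quick illustration of the idea (kept separate from the multipart code that follows), a minimal sketch of a pre-signed PUT for a single small object might look like this; the bucket name, key, and file name are placeholders:

import boto3
import requests

# Server-side code signs a URL once for a placeholder bucket and key
s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-bucket', 'Key': 'uploads/report.pdf'},
    ExpiresIn=900  # the URL stops working after 15 minutes
)

# Any client holding the URL can upload without ever seeing AWS credentials
with open('report.pdf', 'rb') as f:
    requests.put(url, data=f).raise_for_status()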
Implementing Pre-signed URL Generation in Python:
The following Python function generates a pre-signed URL for a single part using the Boto3 library; note that it targets the S3 Transfer Acceleration endpoint.
import boto3
from botocore.config import Config


def generate_presigned_url_transfer_accel(bucket_name, object_key, part_number, upload_id, expiration=3600):
    try:
        # Use the accelerate endpoint; Transfer Acceleration must be enabled on the bucket
        s3_client = boto3.client('s3', config=Config(
            s3={'use_accelerate_endpoint': True}
        ))
        # Sign an upload_part request for this specific part number and UploadId
        response = s3_client.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': bucket_name,
                'Key': object_key,
                'PartNumber': part_number,
                'UploadId': upload_id
            },
            ExpiresIn=expiration
        )
        return response
    except Exception as e:
        print(f"Error generating presigned URL for part {part_number}: {str(e)}")
        return None
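Note that the accelerate endpoint only works if S3 Transfer Acceleration has been enabled on the bucket. A one-time setup sketch, with a placeholder bucket name, could look like this:

import boto3

s3 = boto3.client('s3')
# Enable Transfer Acceleration on the bucket (one-time operation)
s3.put_bucket_accelerate_configuration(
    Bucket='bucket_name',
    AccelerateConfiguration={'Status': 'Enabled'}
)
# Confirm the current status ('Enabled' or 'Suspended')
print(s3.get_bucket_accelerate_configuration(Bucket='bucket_name'))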
Uploading File Parts with Pre-signed URLs:
The following function uploads individual file parts to Amazon S3 using pre-signed URLs.
import requests
import traceback


def upload_part_with_presigned_url(url, part_data):
    try:
        # Upload the part using the pre-signed URL
        # (Content-Encoding: gzip assumes the payload is gzip-compressed, e.g. a .tar.gz)
        response = requests.put(url, data=part_data, headers={'Content-Encoding': 'gzip'})
        response.raise_for_status()
        # Extract the ETag for the part from the response headers
        etag = response.headers.get('ETag')
        if etag:
            print(f"Part uploaded successfully! ETag: {etag}")
            return etag
        else:
            print("Failed to retrieve ETag from response headers")
            return None
    except Exception as e:
        print(f"An error occurred during upload: {str(e)}")
        traceback.print_exc()
        return None
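Transient network errors are common on large uploads, so you may want to retry a part a few times before giving up. A hypothetical retry wrapper around the function above (not part of the original code) might look like this:

import time


def upload_part_with_retries(url, part_data, max_attempts=3, backoff_seconds=2):
    # Hypothetical helper: retry the part upload a few times before giving up
    for attempt in range(1, max_attempts + 1):
        etag = upload_part_with_presigned_url(url, part_data)
        if etag:
            return etag
        if attempt < max_attempts:
            print(f"Attempt {attempt} failed, retrying in {backoff_seconds}s...")
            time.sleep(backoff_seconds)
    return None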
Completing the Multipart Upload Process:
Once all file parts are uploaded, the multipart upload process must be completed to consolidate the parts into a single object within the S3 bucket.
def complete_multipart_upload(bucket_name, object_key, upload_id, parts):
    try:
        s3_client = boto3.client('s3')
        # Stitch the uploaded parts together into a single S3 object
        response = s3_client.complete_multipart_upload(
            Bucket=bucket_name,
            Key=object_key,
            UploadId=upload_id,
            MultipartUpload={'Parts': parts}
        )
        print("Multipart upload completed successfully!")
        return response
    except Exception as e:
        print(f"Error completing multipart upload: {str(e)}")
Where is the main program that chunks the file into parts and uploads them?
The function upload_large_file_to_s3_with_presigned_urls takes a file path, bucket name, and object key as input parameters. It uploads a large file to Amazon S3 by breaking it into smaller parts and uploading each part through a pre-signed URL, aborting the multipart upload if any part fails.
import math
import os
import time


def upload_large_file_to_s3_with_presigned_urls(file_path, bucket_name, object_key, part_size=10 * 1024 * 1024):
    st = time.time()
    try:
        s3_client = boto3.client('s3')
        # Start the multipart upload and keep its UploadId
        response = s3_client.create_multipart_upload(
            Bucket=bucket_name,
            Key=object_key
        )
        upload_id = response['UploadId']
        print(f'upload_id: {upload_id}')
        parts = []
        file_size = os.path.getsize(file_path)
        num_parts = math.ceil(file_size / part_size)
        with open(file_path, 'rb') as file:
            for part_number in range(1, num_parts + 1):
                # Read the part data with the specified part size
                part_data = file.read(part_size)
                # Generate a pre-signed URL for the current part
                url = generate_presigned_url_transfer_accel(bucket_name, object_key, part_number, upload_id)
                if url:
                    print(f'Pre-signed URL for part {part_number}: {url}')
                    etag = upload_part_with_presigned_url(url, part_data)
                    if etag:
                        parts.append({'PartNumber': part_number, 'ETag': etag})
                    else:
                        print(f"Failed to upload part {part_number}")
                        # Abort the multipart upload if any part fails
                        s3_client.abort_multipart_upload(
                            Bucket=bucket_name,
                            Key=object_key,
                            UploadId=upload_id
                        )
                        return
                else:
                    print(f"Failed to generate presigned URL for part {part_number}")
                    # Abort the multipart upload if the URL cannot be generated
                    s3_client.abort_multipart_upload(
                        Bucket=bucket_name,
                        Key=object_key,
                        UploadId=upload_id
                    )
                    return
        # Complete the multipart upload once every part has been uploaded
        complete_multipart_upload(bucket_name, object_key, upload_id, parts)
        print(f'Total time taken {time.time() - st}')
    except Exception as e:
        print(f"An error occurred during upload: {str(e)}")
Who calls the above function?
Your main program. The assumption is that you have configured AWS credentials as environment variables or a named profile in your local AWS setup.
For example, one way of setting these is:
os.environ['AWS_DEFAULT_PROFILE'] = 'your-profile-name'
os.environ['AWS_DEFAULT_REGION'] = 'your-region-deployment'
if __name__ == "__main__":
    bucket_name = 'bucket_name'
    file_name = 'my_file.tar.gz'
    file_to_upload = f'C:/test/{file_name}'
    print(f'Original file size bytes {os.path.getsize(file_to_upload)}')
    object_key = f'multi-part-folder/{file_name}'
    # Part size must be at least 5 MiB (except for the last part) for an S3 multipart upload
    part_size = 5 * 1024 * 1024
    st = time.time()
    upload_large_file_to_s3_with_presigned_urls(file_to_upload, bucket_name, object_key, part_size)
    print(f'Upload time {time.time() - st}')
Can we increase its performance further?
Let us add some threads and parallelism:
- Add some threaded parallelism for enhanced performance.
- Implement threaded parallelism in file uploads using Python’s ThreadPoolExecutor.
We change the function implementation a little to add parallelism; see the code snippet below.
import concurrent.futures


def upload_large_file_to_s3_with_presigned_urls(file_path, bucket_name, object_key, part_size=10 * 1024 * 1024):
    st = time.time()
    try:
        s3_client = boto3.client('s3')
        response = s3_client.create_multipart_upload(
            Bucket=bucket_name,
            Key=object_key
        )
        upload_id = response['UploadId']
        print(f'upload_id: {upload_id}')
        parts = []
        file_size = os.path.getsize(file_path)
        num_parts = math.ceil(file_size / part_size)

        def upload_part(part_number):
            # Each worker opens its own file handle and seeks to its part's offset,
            # so threads never share a file position
            with open(file_path, 'rb') as file:
                file.seek((part_number - 1) * part_size)
                part_data = file.read(part_size)
            url = generate_presigned_url_transfer_accel(bucket_name, object_key, part_number, upload_id)
            if url:
                print(f'Pre-signed URL for part {part_number}: {url}')
                etag = upload_part_with_presigned_url(url, part_data)
                if etag:
                    return {'PartNumber': part_number, 'ETag': etag}
                print(f"Failed to upload part {part_number}")
                return None
            print(f"Failed to generate presigned URL for part {part_number}")
            return None

        # Upload parts in parallel
        with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
            future_to_part = {executor.submit(upload_part, part_number): part_number
                              for part_number in range(1, num_parts + 1)}
            for future in concurrent.futures.as_completed(future_to_part):
                part_number = future_to_part[future]
                try:
                    data = future.result()
                    if data:
                        parts.append(data)
                except Exception as exc:
                    print(f'Part {part_number} generated an exception: {exc}')

        # Failed parts are simply missing from the list, so compare counts
        if len(parts) != num_parts:
            print("Error: Some parts failed to upload. Aborting multipart upload.")
            s3_client.abort_multipart_upload(
                Bucket=bucket_name,
                Key=object_key,
                UploadId=upload_id
            )
            return
        # CompleteMultipartUpload requires parts in ascending part-number order
        parts.sort(key=lambda x: x['PartNumber'])
        complete_multipart_upload(bucket_name, object_key, upload_id, parts)
        print(f'Total time taken {time.time() - st}')
    except Exception as e:
        print(f"An error occurred during upload: {str(e)}")
The above function is self-explanatory, but let us look at one important instruction:
parts.sort(key=lambda x: x['PartNumber'])
This is very important: because parts complete in arbitrary order, you must sort them by part number before making the final complete_multipart_upload call.
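For example, with as_completed the parts list may come back out of order (the ETag values below are dummies), and CompleteMultipartUpload expects ascending part numbers:

# Parts finish in arbitrary order under as_completed, for example:
parts = [{'PartNumber': 3, 'ETag': '"c3"'}, {'PartNumber': 1, 'ETag': '"a1"'},
         {'PartNumber': 2, 'ETag': '"b2"'}]
# complete_multipart_upload expects ascending part numbers, so sort first:
parts.sort(key=lambda x: x['PartNumber'])
print([p['PartNumber'] for p in parts])  # [1, 2, 3]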
Any challenges with this parallel execution of parts?
While threaded parallelism offers significant performance gains, it also introduces challenges such as memory overhead (each in-flight worker holds a full part in memory), thread-safety considerations, and context switching. Understanding these trade-offs is essential for using threaded parallelism effectively in file uploads.
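One concrete example: peak memory usage grows with both max_workers and part_size. A rough back-of-the-envelope check (an illustration, not from the original code):

# Each in-flight worker holds one full part in memory, so peak memory for
# part buffers is roughly max_workers * part_size
part_size = 10 * 1024 * 1024   # 10 MiB per part
max_workers = 8
peak_bytes = max_workers * part_size
print(f"~{peak_bytes / (1024 * 1024):.0f} MiB of part data in flight at once")  # ~80 MiB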
Conclusion and Next Steps:
In conclusion, optimizing file uploads to Amazon S3 benefits greatly from combining multipart upload, pre-signed URLs, and threaded parallelism. By implementing these techniques effectively, developers can streamline file upload workflows, improve system efficiency, and deliver a better user experience. As you continue to explore file upload optimization, consider experimenting with additional strategies and staying up to date on best practices in cloud storage management.