Process files serverlessly using Lambda
File processing workflows often start with files that arrive on an NFS or SMB file share — scanned documents from branch offices, images uploaded by field teams, audio captured from contact centers, or data files delivered by partners.
With an Amazon S3 access point attached to the FSx for ONTAP volume, AWS Lambda functions read and write the files directly using the Amazon S3 API. File-level operations can be processed serverlessly against the same data your users and applications access over NFS and SMB.
This tutorial shows three common file processing patterns. Each example reads a file from the volume through the access point, processes it with an AWS service or library, and writes the result back to the volume.
| Example | Input | Processing | Output |
|---|---|---|---|
| Example 1: Generate image thumbnails | JPEG image | Pillow (image library) | Resized thumbnail |
| Example 2: Extract text from documents | PDF document | Amazon Textract | Extracted text (JSON) |
| Example 3: Transcribe audio files | MP3 audio | Amazon Transcribe | Transcript (JSON) |
Note
This tutorial takes approximately 40 to 60 minutes to complete. The AWS services used incur charges for the resources you create. If you complete all the steps and promptly finish the Clean up section, the expected cost is less than $1 in the US East (N. Virginia) AWS Region. This estimate does not include ongoing charges for the FSx for ONTAP volume itself.
Prerequisites
Before you begin, make sure you have the following:
An FSx for ONTAP volume with an Amazon S3 access point attached. For instructions on creating an access point, see Creating an access point.
The access point alias for your access point. You can find this in the Amazon FSx console or by running aws fsx describe-s3-access-point-attachments.
AWS CLI version 1 or version 2 installed and configured. The aws lambda invoke commands in this tutorial include the --cli-binary-format raw-in-base64-out option, which is required in AWS CLI version 2 so that raw JSON payloads are not interpreted as base64. If you use AWS CLI version 1, omit that option.
IAM permissions for the caller (the user or role running this tutorial) to create and invoke Lambda functions (lambda:CreateFunction, lambda:InvokeFunction), access the Amazon S3 access point (s3:GetObject, s3:PutObject), and pass the Lambda execution role (iam:PassRole).
Note
This tutorial uses the default Lambda configuration, where functions run in a managed network outside your VPC. In that case, the access point must have an internet network origin so the function can reach it. If you attach your Lambda function to a VPC, you can instead use a VPC network origin on the access point; the VPC must have an Amazon S3 Gateway or Interface endpoint. For more information, see Configuring network access for Amazon S3 access points.
Step 1: Upload sample files
Download the following sample files and upload them to your FSx for ONTAP volume through the access point. Replace my-ap-alias-ext-s3alias with your access point alias throughout this tutorial.
Sample image: Download the NASA Blue Marble image (public domain, 2.4 MB) and save it as sample-image.jpg.
Sample audio: Download the sample audio file from the Amazon Transcribe getting started tutorial (410 KB) and save it as sample-audio.mp3.
Upload the sample files to your FSx for ONTAP volume through the access point.
$ aws s3 cp sample-image.jpg s3://my-ap-alias-ext-s3alias/samples/images/sample-image.jpg
$ aws s3 cp sample-audio.mp3 s3://my-ap-alias-ext-s3alias/samples/audio/sample-audio.mp3
Note
The sample image is a NASA Blue Marble photograph (public domain, 2.4 MB). The sample audio is from the Amazon Transcribe getting started tutorial (410 KB). The sample PDF is generated in Example 2: Extract text from documents.
Step 2: Create the Lambda execution role
Lambda functions assume an execution role to interact with other AWS services. For this tutorial, attach the AWS managed AWSLambdaBasicExecutionRole policy for logging to CloudWatch Logs, then add an inline policy that grants access to the Amazon S3 access point and to the Textract and Transcribe APIs that the examples use.
Replace region, account-id, and access-point-name with your values.
Save the following trust policy as trust-policy.json.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
Save the following inline permissions policy as permissions-policy.json. It grants access to the access point and to the additional services the examples use.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:region:account-id:accesspoint/access-point-name",
        "arn:aws:s3:region:account-id:accesspoint/access-point-name/object/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["textract:DetectDocumentText"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "transcribe:StartTranscriptionJob",
        "transcribe:GetTranscriptionJob"
      ],
      "Resource": "*"
    }
  ]
}
Create the role, attach the managed logging policy, and attach the inline policy.
$ aws iam create-role \
    --role-name fsxn-lambda-file-processor \
    --assume-role-policy-document file://trust-policy.json
$ aws iam attach-role-policy \
    --role-name fsxn-lambda-file-processor \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
$ aws iam put-role-policy \
    --role-name fsxn-lambda-file-processor \
    --policy-name fsxn-access-point-policy \
    --policy-document file://permissions-policy.json
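If you prefer to script the placeholder substitution rather than edit permissions-policy.json by hand, the following is a minimal sketch of one way to do it. The function names build_permissions_policy and write_permissions_policy are hypothetical helpers, not part of any AWS SDK; the policy content mirrors the inline policy in this step.

```python
import json


def build_permissions_policy(region, account_id, access_point_name):
    """Return the tutorial's inline policy as a dict for one access point."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object and list access through the access point
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [ap_arn, f"{ap_arn}/object/*"],
            },
            {
                # Used by Example 2
                "Effect": "Allow",
                "Action": ["textract:DetectDocumentText"],
                "Resource": "*",
            },
            {
                # Used by Example 3
                "Effect": "Allow",
                "Action": [
                    "transcribe:StartTranscriptionJob",
                    "transcribe:GetTranscriptionJob",
                ],
                "Resource": "*",
            },
        ],
    }


def write_permissions_policy(path, region, account_id, access_point_name):
    """Write the policy to a JSON file that aws iam put-role-policy can read."""
    policy = build_permissions_policy(region, account_id, access_point_name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(policy, f, indent=2)
    return policy
```

For example, write_permissions_policy("permissions-policy.json", "us-east-1", "111122223333", "my-access-point") would produce a file with both resource ARNs filled in; the Region, account ID, and access point name shown here are placeholder values.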
Integrating into your workflow
The examples in this tutorial use manual invocation with a test event. In production, you can trigger these functions automatically using the following approaches:
Amazon EventBridge schedule. Run the function on a recurring schedule (for example, every hour or daily) to process new files. The function can list files through the access point and process any that have not been processed yet. For more information, see Schedule Lambda functions using EventBridge in the Amazon EventBridge User Guide.
Amazon API Gateway. Expose the function as an HTTP API so that users or applications can request processing of a specific file on demand. For more information, see Build an API Gateway REST API with Lambda integration in the Amazon API Gateway Developer Guide.
Step Functions. Orchestrate multi-step file processing pipelines that combine multiple Lambda functions, such as a workflow that extracts text from a document, translates it, and writes the result back to the volume. For more information, see Call Lambda with Step Functions in the AWS Step Functions Developer Guide.
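For the scheduled approach, the function needs a way to decide which files are still unprocessed. The following is a minimal sketch, assuming outputs are written next to their inputs with a known suffix (as the thumbnail example in this tutorial does). The helper names list_keys and find_unprocessed are hypothetical, and s3_client is expected to be a client created with boto3.client('s3').

```python
def list_keys(s3_client, bucket, prefix):
    """Return every object key under prefix, following pagination."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix have no "Contents" entry
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def find_unprocessed(keys, input_ext=".jpg", output_suffix="_thumbnail.jpg"):
    """Return input keys that do not yet have a matching output key."""
    existing = set(keys)
    pending = []
    for key in keys:
        if not key.lower().endswith(input_ext):
            continue  # not an input type this function handles
        if key.endswith(output_suffix):
            continue  # this key is itself an output; never reprocess it
        expected_output = key.rsplit(".", 1)[0] + output_suffix
        if expected_output not in existing:
            pending.append(key)
    return pending
```

A scheduled handler could call find_unprocessed(list_keys(s3, alias, "samples/images/")) and process each returned key. Comparing key names is only one possible design; tracking processed files in a separate store is an alternative when outputs are written elsewhere.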
Example 1: Generate image thumbnails
This example reads a JPEG image from your FSx for ONTAP volume, resizes it to a 200-pixel thumbnail using the Pillow image library, and writes the thumbnail back to the volume.
Lambda function code
Save the following code as lambda_function.py.
import boto3
from io import BytesIO
from PIL import Image

s3 = boto3.client('s3')


def lambda_handler(event, context):
    bucket = event['access_point_alias']
    key = event['key']

    # Read the image from FSx through the access point
    response = s3.get_object(Bucket=bucket, Key=key)
    image_data = response['Body'].read()

    # Resize to thumbnail
    img = Image.open(BytesIO(image_data))
    img.thumbnail((200, 200))

    # Write the thumbnail back to FSx
    buffer = BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    buffer.seek(0)

    thumbnail_key = key.rsplit('.', 1)[0] + '_thumbnail.jpg'
    s3.put_object(
        Bucket=bucket,
        Key=thumbnail_key,
        Body=buffer.getvalue(),
        ContentType='image/jpeg'
    )

    return {
        'original_size': len(image_data),
        'thumbnail_size': len(buffer.getvalue()),
        'thumbnail_key': thumbnail_key
    }
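The output key in this function is derived with key.rsplit('.', 1)[0], which works for the sample file but splits at the last dot anywhere in the key. For a key such as scans.v2/page1, which has a dot in a directory name and no file extension, that expression yields scans rather than scans.v2/page1. The following sketch shows one possible alternative that only looks at the file name; derive_output_key is a hypothetical helper name.

```python
import posixpath


def derive_output_key(key, suffix):
    """Replace the file extension (if any) with suffix, leaving directories untouched."""
    # Object keys use "/" separators on every platform, so use posixpath
    directory, filename = posixpath.split(key)
    stem, _ext = posixpath.splitext(filename)
    return posixpath.join(directory, stem + suffix)
```

With this helper, the handler could compute thumbnail_key = derive_output_key(key, '_thumbnail.jpg') instead of using rsplit.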
Create and invoke the function
This function requires the Pillow library. Create a deployment package that includes Pillow built for the Lambda Linux runtime.
# Create a deployment package with Pillow built for the Lambda (Linux) runtime.
# --python-version and --implementation select wheels that match the python3.12
# runtime even if your local Python version is different.
$ mkdir package
$ pip install Pillow -t package/ \
    --platform manylinux2014_x86_64 \
    --python-version 3.12 \
    --implementation cp \
    --only-binary=:all:
$ cd package && zip -r ../thumbnail-function.zip .
$ cd .. && zip thumbnail-function.zip lambda_function.py

# Create the function
$ aws lambda create-function \
    --function-name fsxn-thumbnail-generator \
    --runtime python3.12 \
    --handler lambda_function.lambda_handler \
    --role arn:aws:iam::account-id:role/fsxn-lambda-file-processor \
    --zip-file fileb://thumbnail-function.zip \
    --timeout 30 \
    --memory-size 256

# Invoke with a test event
$ aws lambda invoke \
    --function-name fsxn-thumbnail-generator \
    --cli-binary-format raw-in-base64-out \
    --payload '{"access_point_alias": "my-ap-alias-ext-s3alias", "key": "samples/images/sample-image.jpg"}' \
    response.json
$ cat response.json
Verify the result
$ aws s3 ls s3://my-ap-alias-ext-s3alias/samples/images/
2024-01-23 12:19:32    2566770 sample-image.jpg
2024-01-23 12:25:49       7065 sample-image_thumbnail.jpg
The original 2.4 MB image (5400 × 2700 pixels) was resized to a 7 KB thumbnail (200 × 100 pixels).
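The 200 × 100 result follows from the fact that thumbnail((200, 200)) fits the image inside a 200 × 200 box while preserving its aspect ratio, and never enlarges it. The following sketch reproduces that arithmetic in plain Python. It is an approximation for illustration only: fit_within is a hypothetical function, and Pillow's own rounding may differ by a pixel for some sizes.

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit inside the box, keeping the aspect ratio."""
    # Use the tighter of the two constraints; cap at 1.0 so images are never enlarged
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For the sample image, fit_within(5400, 2700, 200, 200) returns (200, 100): the width is the limiting dimension, so the scale factor is 200 / 5400, and the height becomes 2700 × 200 / 5400 = 100.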
Example 2: Extract text from documents
This example reads a PDF document from your FSx for ONTAP volume, sends it to Amazon Textract to extract the text, and writes the extracted text as a JSON file back to the volume.
Create and upload a sample PDF
For this example, you need a PDF document on your FSx for ONTAP volume. The following Python script generates a simple invoice PDF and uploads it through the access point. Run this script on your local machine (not in Lambda).
$ pip install fpdf2 boto3
# create_invoice.py: run locally to generate and upload a sample PDF
from fpdf import FPDF
import boto3

pdf = FPDF()
pdf.add_page()

# Title
pdf.set_font("Helvetica", "B", 24)
pdf.cell(0, 15, "INVOICE", new_x="LMARGIN", new_y="NEXT", align="C")

# Invoice details
pdf.set_font("Helvetica", "", 12)
pdf.cell(0, 8, "Invoice Number: INV-2024-00142", new_x="LMARGIN", new_y="NEXT")
pdf.cell(0, 8, "Date: January 15, 2024", new_x="LMARGIN", new_y="NEXT")
pdf.cell(0, 8, "Customer: Example Corp", new_x="LMARGIN", new_y="NEXT")
pdf.ln(5)

# Table header
pdf.set_font("Helvetica", "B", 12)
pdf.cell(80, 8, "Description", border=1)
pdf.cell(30, 8, "Qty", border=1, align="C")
pdf.cell(40, 8, "Unit Price", border=1, align="R")
pdf.cell(40, 8, "Amount", border=1, align="R")
pdf.ln()

# Table rows
pdf.set_font("Helvetica", "", 12)
for desc, qty, price, amt in [
    ("Cloud Storage Service", "1", "$2,400.00", "$2,400.00"),
    ("Data Transfer (TB)", "5", "$90.00", "$450.00"),
    ("Technical Support", "1", "$500.00", "$500.00"),
]:
    pdf.cell(80, 8, desc, border=1)
    pdf.cell(30, 8, qty, border=1, align="C")
    pdf.cell(40, 8, price, border=1, align="R")
    pdf.cell(40, 8, amt, border=1, align="R")
    pdf.ln()

# Upload the PDF to the volume through the access point
s3 = boto3.client('s3')
s3.put_object(
    Bucket='my-ap-alias-ext-s3alias',
    Key='samples/documents/invoice.pdf',
    Body=pdf.output(),
    ContentType='application/pdf'
)
print("Uploaded invoice.pdf")
$ python3 create_invoice.py
Lambda function code
Save the following code as lambda_function.py.
import boto3
import json

s3 = boto3.client('s3')
textract = boto3.client('textract')


def lambda_handler(event, context):
    bucket = event['access_point_alias']
    key = event['key']

    # Read the PDF from FSx through the access point
    response = s3.get_object(Bucket=bucket, Key=key)
    document_bytes = response['Body'].read()

    # Extract text with Textract
    textract_response = textract.detect_document_text(
        Document={'Bytes': document_bytes}
    )
    lines = [
        block['Text']
        for block in textract_response['Blocks']
        if block['BlockType'] == 'LINE'
    ]

    # Write extracted text as JSON back to FSx
    result = {
        'source_file': key,
        'total_lines': len(lines),
        'extracted_text': lines
    }
    output_key = key.rsplit('.', 1)[0] + '_extracted.json'
    s3.put_object(
        Bucket=bucket,
        Key=output_key,
        Body=json.dumps(result, indent=2),
        ContentType='application/json'
    )

    return {
        'lines_extracted': len(lines),
        'output_key': output_key
    }
Create and invoke the function
$ zip textract-function.zip lambda_function.py
$ aws lambda create-function \
    --function-name fsxn-text-extractor \
    --runtime python3.12 \
    --handler lambda_function.lambda_handler \
    --role arn:aws:iam::account-id:role/fsxn-lambda-file-processor \
    --zip-file fileb://textract-function.zip \
    --timeout 30 \
    --memory-size 256
$ aws lambda invoke \
    --function-name fsxn-text-extractor \
    --cli-binary-format raw-in-base64-out \
    --payload '{"access_point_alias": "my-ap-alias-ext-s3alias", "key": "samples/documents/invoice.pdf"}' \
    response.json
$ cat response.json
Example output:
{"lines_extracted": 22, "output_key": "samples/documents/invoice_extracted.json"}
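The lines_extracted value counts only the LINE blocks in the Textract response, which also contains PAGE and WORD blocks. The following sketch isolates that filter and runs it on a small hand-written response. The sample dictionary is illustrative only and is not actual Textract output; real responses carry additional fields such as geometry and confidence.

```python
def extract_lines(textract_response):
    """Return the text of every LINE block, in the order Textract reported them."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample shaped like a DetectDocumentText response (illustrative only)
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1"},
        {"BlockType": "LINE", "Id": "line-1", "Text": "INVOICE"},
        {"BlockType": "WORD", "Id": "word-1", "Text": "INVOICE"},
        {"BlockType": "LINE", "Id": "line-2", "Text": "Customer: Example Corp"},
        {"BlockType": "WORD", "Id": "word-2", "Text": "Customer:"},
        {"BlockType": "WORD", "Id": "word-3", "Text": "Example"},
        {"BlockType": "WORD", "Id": "word-4", "Text": "Corp"},
    ]
}

lines = extract_lines(sample_response)
print(lines)
```

Here the seven blocks reduce to two lines, because PAGE blocks have no text of their own and WORD blocks repeat text that the LINE blocks already cover.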
Example 3: Transcribe audio files
This example starts an Amazon Transcribe job for an audio file stored on your FSx for ONTAP volume. Amazon Transcribe reads the audio file directly from the access point using the access point alias in the media file URI. When the job completes, the function writes the transcript back to the volume.
Lambda function code
Save the following code as lambda_function.py.
import boto3
import json
import time
import urllib.request

s3 = boto3.client('s3')
transcribe = boto3.client('transcribe')


def lambda_handler(event, context):
    bucket = event['access_point_alias']
    key = event['key']
    media_format = key.rsplit('.', 1)[-1]  # mp3, wav, etc.

    # Start a Transcribe job pointing to the file on FSx
    job_name = f"fsxn-{int(time.time())}"
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': f's3://{bucket}/{key}'},
        MediaFormat=media_format,
        LanguageCode='en-US'
    )

    # Wait for the job to complete
    while True:
        status = transcribe.get_transcription_job(
            TranscriptionJobName=job_name
        )
        state = status['TranscriptionJob']['TranscriptionJobStatus']
        if state in ('COMPLETED', 'FAILED'):
            break
        time.sleep(5)

    if state == 'FAILED':
        raise Exception(
            status['TranscriptionJob'].get('FailureReason', 'Unknown error')
        )

    # Download the transcript
    transcript_uri = status['TranscriptionJob']['Transcript']['TranscriptFileUri']
    with urllib.request.urlopen(transcript_uri) as resp:
        transcript_data = json.loads(resp.read())
    transcript_text = transcript_data['results']['transcripts'][0]['transcript']

    # Write the transcript back to FSx
    result = {
        'source_file': key,
        'job_name': job_name,
        'transcript': transcript_text
    }
    output_key = key.rsplit('.', 1)[0] + '_transcript.json'
    s3.put_object(
        Bucket=bucket,
        Key=output_key,
        Body=json.dumps(result, indent=2),
        ContentType='application/json'
    )

    return {
        'transcript_length': len(transcript_text),
        'output_key': output_key
    }
Create and invoke the function
$ zip transcribe-function.zip lambda_function.py
$ aws lambda create-function \
    --function-name fsxn-audio-transcriber \
    --runtime python3.12 \
    --handler lambda_function.lambda_handler \
    --role arn:aws:iam::account-id:role/fsxn-lambda-file-processor \
    --zip-file fileb://transcribe-function.zip \
    --timeout 120
$ aws lambda invoke \
    --function-name fsxn-audio-transcriber \
    --cli-binary-format raw-in-base64-out \
    --payload '{"access_point_alias": "my-ap-alias-ext-s3alias", "key": "samples/audio/sample-audio.mp3"}' \
    --cli-read-timeout 180 \
    response.json
$ cat response.json
Note
The Transcribe job typically takes 15 to 45 seconds to complete. The function's timeout is set to 120 seconds to allow for this.
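Because the function polls until the job finishes, a job that runs longer than expected would end with Lambda stopping the function at its timeout, with no explanation in the response. The following sketch shows one way to stop polling shortly before that point and raise a descriptive error instead. wait_for_transcription and its parameters are hypothetical; in a handler, remaining_ms would be context.get_remaining_time_in_millis, the method the Lambda Python context object provides.

```python
import time


def wait_for_transcription(transcribe_client, job_name, remaining_ms,
                           poll_seconds=5, reserve_ms=10_000, sleep=time.sleep):
    """Poll a Transcribe job until it finishes or the time budget runs low.

    remaining_ms is a zero-argument callable returning the milliseconds left.
    reserve_ms is kept back so the caller can still write results or report errors.
    """
    while True:
        job = transcribe_client.get_transcription_job(
            TranscriptionJobName=job_name
        )["TranscriptionJob"]
        state = job["TranscriptionJobStatus"]
        if state == "COMPLETED":
            return job
        if state == "FAILED":
            raise RuntimeError(job.get("FailureReason", "Unknown error"))
        # Stop if another poll interval would eat into the reserved time
        if remaining_ms() < reserve_ms + poll_seconds * 1000:
            raise TimeoutError(
                f"Transcription job {job_name} is still {state}; "
                "stopping before the Lambda timeout"
            )
        sleep(poll_seconds)
```

In the handler, the polling loop could then be replaced by job = wait_for_transcription(transcribe, job_name, context.get_remaining_time_in_millis). For audio that regularly takes longer than a few minutes, starting the job in one invocation and collecting the result in another is likely a better fit than polling.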
Considerations
Internet origin required for default configuration. By default, Lambda accesses Amazon S3 from managed infrastructure outside your VPC, which requires an internet-origin access point. If you attach your Lambda function to a VPC, you can use a VPC-origin access point instead. See the prerequisites for details.
File size limits. Lambda functions have a maximum memory of 10 GB and a maximum execution time of 15 minutes. For large files, consider using range reads (GetObject with the Range header) or streaming the response.
Textract limits. The synchronous DetectDocumentText API accepts documents up to 10 MB and 1 page. For multi-page documents, use the asynchronous StartDocumentTextDetection API.
Transcribe reads directly from the access point. Amazon Transcribe accepts the access point alias in the MediaFileUri parameter (s3://ap-alias/key). The Lambda function does not need to download and re-upload the audio file.
File system user permissions. The file system user associated with the access point must have read permission on input files and write permission on output directories.
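As an illustration of the range-read suggestion above, the following sketch reads an object in fixed-size pieces so that only one piece is held in memory at a time. iter_object_ranges is a hypothetical helper, s3_client is expected to be a client created with boto3.client('s3'), and the 8 MiB default chunk size is an arbitrary choice. The sketch assumes a non-empty object, because a range request against a zero-byte object returns an error.

```python
def iter_object_ranges(s3_client, bucket, key, chunk_size=8 * 1024 * 1024):
    """Yield the object's bytes in chunks using GetObject with a Range header."""
    start = 0
    total = None  # learned from the first response
    while total is None or start < total:
        end = start + chunk_size - 1  # Range end positions are inclusive
        response = s3_client.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
        )
        # ContentRange looks like "bytes 0-8388607/2566770"; the number after
        # the slash is the full object size.
        total = int(response["ContentRange"].rsplit("/", 1)[1])
        data = response["Body"].read()
        if not data:
            break  # defensive: avoid looping forever on an empty body
        yield data
        start += len(data)
```

A handler could, for example, feed each chunk to a hash or a parser: for chunk in iter_object_ranges(s3, bucket, key): digest.update(chunk). When the whole file is needed sequentially, reading response['Body'] incrementally from a single GetObject call is a simpler alternative.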
Clean up
To avoid ongoing charges, delete the resources you created in this tutorial.
# Delete Lambda functions
$ aws lambda delete-function --function-name fsxn-thumbnail-generator
$ aws lambda delete-function --function-name fsxn-text-extractor
$ aws lambda delete-function --function-name fsxn-audio-transcriber

# Delete the IAM role and policies
$ aws iam delete-role-policy \
    --role-name fsxn-lambda-file-processor \
    --policy-name fsxn-access-point-policy
$ aws iam detach-role-policy \
    --role-name fsxn-lambda-file-processor \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
$ aws iam delete-role --role-name fsxn-lambda-file-processor

# Delete sample files from your FSx volume
$ aws s3 rm s3://my-ap-alias-ext-s3alias/samples/ --recursive