Migrate data to the AWS Cloud by using Starburst
Created by Antony Prasad Thevaraj (AWS), Shaun Van Staden (Starburst), and Suresh Veeragoni (AWS)
Environment: Production | Technologies: Analytics; Data lakes; Databases | Workload: All other workloads | AWS services: Amazon EKS
Summary
Starburst helps accelerate your data migration journey to Amazon Web Services (AWS) by providing an enterprise query engine that brings existing data sources together in a single access point. You can run analytics across multiple data sources to get valuable insights, before finalizing any migration plans. Without disrupting business-as-usual analytics, you can migrate the data by using the Starburst engine or a dedicated extract, transform, and load (ETL) application.
Prerequisites and limitations
Prerequisites
An active AWS account
A virtual private cloud (VPC)
An Amazon Elastic Kubernetes Service (Amazon EKS) cluster
An Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling group
A list of current system workloads that need to be migrated
Network connectivity from AWS to your on-premises environment
Architecture
Reference architecture
The following high-level architecture diagram shows the typical deployment of Starburst Enterprise in the AWS Cloud:
The Starburst Enterprise cluster runs inside your AWS account.
A user authenticates by using Lightweight Directory Access Protocol (LDAP) or Open Authorization (OAuth) and interacts directly with the Starburst cluster.
Starburst can connect to several AWS data sources, such as AWS Glue, Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and Amazon Redshift. Starburst provides federated query capabilities across data sources in the AWS Cloud, on premises, or in other cloud environments.
You launch Starburst Enterprise in an Amazon EKS cluster by using Helm charts.
Starburst Enterprise uses Amazon EC2 Auto Scaling groups and Amazon EC2 Spot Instances to optimize infrastructure.
Starburst Enterprise connects directly to your existing on-premises data sources to read data in real time. In addition, if you have an existing Starburst Enterprise deployment in this environment, you can directly connect your new Starburst cluster in the AWS Cloud to this existing cluster.
Please note the following:
Starburst is not a data virtualization platform. It is a SQL-based massively parallel processing (MPP) query engine that forms the basis of an overall data mesh strategy for analytics.
When Starburst is deployed as part of a migration, it has direct connectivity to the existing on-premises infrastructure.
Starburst provides several built-in enterprise and open-source connectors that facilitate connectivity to a variety of legacy systems. For a full list of connectors and their capabilities, see Connectors in the Starburst Enterprise user guide. Starburst can query data in real time from on-premises data sources. This prevents interruptions of regular business operations while data is being migrated.
If you are migrating from an existing on-premises Starburst Enterprise deployment, you can use a special connector, Starburst Stargate, to connect your Starburst Enterprise cluster in AWS directly to your on-premises cluster. This provides additional performance benefits when business users and data analysts are federating queries from the AWS Cloud to your on-premises environment.
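
As an illustration of the federated query capability described above, the following minimal sketch builds a single SQL statement that joins a table in an on-premises catalog with a table in an AWS catalog. The catalog, schema, table, and column names (`onprem_oracle`, `s3_lake`, `sales`, `orders`, `customers`, `customer_id`) and both helper functions are hypothetical and are not part of the original pattern. The sketch only assembles the statement text; you would submit it through whichever SQL client your deployment uses.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a quoted, fully qualified catalog.schema.table name."""
    parts = (catalog, schema, table)
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def federated_join_sql(left: str, right: str, key: str) -> str:
    """Build one statement that joins tables from two different catalogs."""
    quoted_key = '"' + key.replace('"', '""') + '"'
    return (
        "SELECT l.*, r.*\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{quoted_key} = r.{quoted_key}"
    )


# Hypothetical catalogs: "onprem_oracle" is a legacy source and
# "s3_lake" is a data lake in the AWS Cloud.
sql = federated_join_sql(
    qualified("onprem_oracle", "sales", "orders"),
    qualified("s3_lake", "sales", "customers"),
    "customer_id",
)
print(sql)
```

Because each table is addressed by its catalog name, the same statement form can span on-premises and cloud sources without copying data first.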
High-level process overview
You can accelerate data migration projects by using Starburst because Starburst enables insights across all of your data, prior to migrating it. The following image shows the typical process for migrating data by using Starburst.
Roles
The following roles are typically required to complete a migration using Starburst:
Cloud administrator – Responsible for making cloud resources available to run the Starburst Enterprise application
Starburst administrator – Responsible for installing, configuring, managing, and supporting the Starburst application
Data engineer – Responsible for:
Migrating the legacy data to the cloud
Building semantic views to support analytics
Solution or system owner – Responsible for the overall solution implementation
Tools
AWS services
Amazon EC2 – Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS Cloud.
Amazon EKS – Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service for running Kubernetes on AWS without needing to stand up or maintain your own Kubernetes control plane. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
Other tools
Helm – Helm is a package manager for Kubernetes that helps you install and manage applications on your Kubernetes cluster.
Starburst Enterprise – Starburst Enterprise is a SQL-based massively parallel processing (MPP) query engine that forms the basis of an overall data mesh strategy for analytics.
Starburst Stargate – Starburst Stargate links catalogs and data sources in one Starburst Enterprise environment, such as a cluster in an on-premises data center, to the catalogs and data sources in another Starburst Enterprise environment, such as a cluster in the AWS Cloud.
Epics
Task | Description | Skills required |
---|---|---|
Identify and prioritize your data. | Identify the data you want to move. Large, on-premises legacy systems can include core data that you want to migrate alongside data that you don’t want to move or can’t be moved because of compliance reasons. Starting with a data inventory helps you prioritize which data you should target first. For more information, see Get started with automated portfolio discovery. | Data engineer, DBA |
Explore, inventory, and back up your data. | Validate the quality, quantity, and relevance of the data for your use case. Back up or create a snapshot of the data as needed, and finalize the target environment for the data. | Data engineer, DBA |
Task | Description | Skills required |
---|---|---|
Configure Starburst Enterprise in the AWS Cloud. | While data is being cataloged, set up Starburst Enterprise in a managed Amazon EKS cluster. For more information, see Deploying with Kubernetes. | AWS administrator, App developer |
Connect Starburst to the data sources. | After you have identified the data and set up Starburst Enterprise, connect Starburst to the data sources. Starburst reads data directly from each data source by using SQL queries. For more information, see the Starburst Enterprise reference documentation. | AWS administrator, App developer |
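
Connecting a data source generally comes down to defining a catalog as a set of key-value properties. The following sketch renders such a properties file for a PostgreSQL source. The choice of PostgreSQL, the host name, the database name, the user name, and the environment variable name are all assumptions for illustration. Confirm the exact property names for your connector in the Starburst Enterprise reference documentation before you use them.

```python
def render_catalog_properties(props: dict[str, str]) -> str:
    """Render key=value lines in the Java properties style used by catalog files."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Hypothetical on-premises PostgreSQL source. The host, database, and user
# are placeholders; the password is read from an environment variable
# instead of being stored in the file.
onprem_postgres = {
    "connector.name": "postgresql",
    "connection-url": "jdbc:postgresql://onprem-db.example.internal:5432/sales",
    "connection-user": "starburst_reader",
    "connection-password": "${ENV:ONPREM_PG_PASSWORD}",
}

text = render_catalog_properties(onprem_postgres)
print(text)
```

Keeping one properties definition per source makes the later cutover a matter of changing connection details, not rewriting queries.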
Task | Description | Skills required |
---|---|---|
Build and run the ETL pipelines. | Begin the data migration process. This activity can occur at the same time as business-as-usual analytics. For the migration, you can use a third-party product or Starburst. Starburst can both read and write data across different sources. For more information, see the Starburst Enterprise reference documentation. | Data engineer |
Validate the data. | After the data has been migrated, validate the data to ensure all required data has been moved and is intact. | Data engineer, DevOps engineer |
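
If you use Starburst itself for the migration, one possible approach is a `CREATE TABLE ... AS SELECT` statement that reads from the on-premises catalog and writes to the cloud catalog, followed by a row-count comparison as a basic validation step. The sketch below assumes hypothetical table names and uses a dictionary lookup as a stand-in for a real query client. A row-count match is only a first check and does not prove that the data is intact.

```python
from typing import Callable


def ctas_sql(source: str, target: str) -> str:
    """Build a statement that copies a source table into a new target table."""
    return f"CREATE TABLE {target} AS SELECT * FROM {source}"


def count_sql(table: str) -> str:
    """Build a row-count query for one table."""
    return f"SELECT count(*) FROM {table}"


def row_counts_match(run_scalar: Callable[[str], int], source: str, target: str) -> bool:
    """Compare row counts; run_scalar executes a query and returns one integer."""
    return run_scalar(count_sql(source)) == run_scalar(count_sql(target))


# Hypothetical table names for illustration only.
SOURCE = "onprem_oracle.sales.orders"
TARGET = "s3_lake.sales.orders"

# Stand-in for a real query client: maps query text to a canned result.
fake_results = {count_sql(SOURCE): 1000, count_sql(TARGET): 1000}

migrate_statement = ctas_sql(SOURCE, TARGET)
counts_ok = row_counts_match(fake_results.__getitem__, SOURCE, TARGET)
print(migrate_statement)
print(counts_ok)
```

In practice, you would replace the stand-in with a function that submits each query to the Starburst cluster, and you might add column-level checks such as sums or distinct counts.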
Task | Description | Skills required |
---|---|---|
Cut over the data. | After data migration and validation are complete, you can cut over the data. This involves changing the data connection links in Starburst. Instead of pointing at the on-premises sources, you point to the new cloud sources and update the semantic views. For more information, see Connectors. | Data engineer, Cutover lead |
Roll out to users. | Data consumers begin working off the migrated data sources. This process is invisible to the analytics end users. | Cutover lead, Data engineer |
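
The cutover can be invisible to end users because they query semantic views, and only the view definitions change. As a minimal sketch, the following code builds a `CREATE OR REPLACE VIEW` statement that repoints a view from an on-premises table to its migrated copy. The view name, the catalog names, and the helper function are hypothetical, and the sketch assumes that the catalog holding the view supports views.

```python
def repoint_view_sql(view: str, new_source: str) -> str:
    """Build a statement that redefines a view over a different source table."""
    return f"CREATE OR REPLACE VIEW {view} AS SELECT * FROM {new_source}"


# Hypothetical names: the view keeps its name, only its source changes.
VIEW = "analytics.semantic.orders"

before_cutover = repoint_view_sql(VIEW, "onprem_oracle.sales.orders")
after_cutover = repoint_view_sql(VIEW, "s3_lake.sales.orders")

print(before_cutover)
print(after_cutover)
```

Queries that select from `analytics.semantic.orders` stay the same before and after the cutover, which is what keeps the change transparent to data consumers.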
Related resources
AWS Marketplace
Starburst documentation
Other AWS documentation
Get started with automated portfolio discovery (AWS Prescriptive Guidance)
Optimizing Cloud Infrastructure Cost and Performance with Starburst on AWS (blog post)