AWS Glue 使用 SDK for JavaScript (v3) 的範例 - AWS SDK 程式碼範例

文件範例儲存庫中有更多 AWS SDK可用的AWS SDK範例 GitHub 。

本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。

AWS Glue 使用 SDK for JavaScript (v3) 的範例

下列程式碼範例示範如何使用 AWS SDK for JavaScript (v3) 搭配 來執行動作和實作常見案例 AWS Glue。

基本概念是程式碼範例,這些範例說明如何在服務內執行基本操作。

Actions 是大型程式的程式碼摘錄,必須在內容中執行。雖然動作會告訴您如何呼叫個別服務函數,但您可以在其相關情境中查看內容中的動作。

每個範例都包含完整原始程式碼的連結,您可以在其中找到如何在內容中設定和執行程式碼的指示。

開始使用

下列程式碼範例示範如何開始使用 AWS Glue。

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

import { ListJobsCommand, GlueClient } from "@aws-sdk/client-glue"; const client = new GlueClient({}); export const main = async () => { const command = new ListJobsCommand({}); const { JobNames } = await client.send(command); const formattedJobNames = JobNames.join("\n"); console.log("Job names: "); console.log(formattedJobNames); return JobNames; };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考ListJobs中的 。

基本概念

以下程式碼範例顯示做法:

  • 建立爬蟲程式,以爬取公有 Amazon S3 儲存貯體並產生 CSV格式化中繼資料的資料庫。

  • 列出 中資料庫和資料表的相關資訊 AWS Glue Data Catalog。

  • 建立任務以從 S3 儲存貯體擷取CSV資料、轉換資料,並將JSON格式化的輸出載入至另一個 S3 儲存貯體。

  • 列出任務執行的相關資訊、檢視已轉換的資料以及清除資源。

如需詳細資訊,請參閱教學: AWS Glue Studio 入門

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

建立並執行爬蟲程式,該爬蟲程式會爬取公有 Amazon Simple Storage Service (Amazon S3) 儲存貯體,並產生中繼資料資料庫,描述其找到的 CSV格式化資料。

const createCrawler = (name, role, dbName, tablePrefix, s3TargetPath) => { const client = new GlueClient({}); const command = new CreateCrawlerCommand({ Name: name, Role: role, DatabaseName: dbName, TablePrefix: tablePrefix, Targets: { S3Targets: [{ Path: s3TargetPath }], }, }); return client.send(command); }; const getCrawler = (name) => { const client = new GlueClient({}); const command = new GetCrawlerCommand({ Name: name, }); return client.send(command); }; const startCrawler = (name) => { const client = new GlueClient({}); const command = new StartCrawlerCommand({ Name: name, }); return client.send(command); }; const crawlerExists = async ({ getCrawler }, crawlerName) => { try { await getCrawler(crawlerName); return true; } catch { return false; } }; /** * @param {{ createCrawler: import('../../../actions/create-crawler.js').createCrawler}} actions */ const makeCreateCrawlerStep = (actions) => async (context) => { if (await crawlerExists(actions, process.env.CRAWLER_NAME)) { log("Crawler already exists. Skipping creation."); } else { await actions.createCrawler( process.env.CRAWLER_NAME, process.env.ROLE_NAME, process.env.DATABASE_NAME, process.env.TABLE_PREFIX, process.env.S3_TARGET_PATH, ); log("Crawler created successfully.", { type: "success" }); } return { ...context }; }; /** * @param {(name: string) => Promise<import('@aws-sdk/client-glue').GetCrawlerCommandOutput>} getCrawler * @param {string} crawlerName */ const waitForCrawler = async (getCrawler, crawlerName) => { const waitTimeInSeconds = 30; const { Crawler } = await getCrawler(crawlerName); if (!Crawler) { throw new Error(`Crawler with name ${crawlerName} not found.`); } if (Crawler.State === "READY") { return; } log(`Crawler is ${Crawler.State}. Waiting ${waitTimeInSeconds} seconds...`); await wait(waitTimeInSeconds); return waitForCrawler(getCrawler, crawlerName); }; const makeStartCrawlerStep = ({ startCrawler, getCrawler }) => async (context) => { log("Starting crawler."); await startCrawler(process.env.CRAWLER_NAME); log("Crawler started.", { type: "success" }); log("Waiting for crawler to finish running. This can take a while."); await waitForCrawler(getCrawler, process.env.CRAWLER_NAME); log("Crawler ready.", { type: "success" }); return { ...context }; };

列出 中資料庫和資料表的相關資訊 AWS Glue Data Catalog。

const getDatabase = (name) => { const client = new GlueClient({}); const command = new GetDatabaseCommand({ Name: name, }); return client.send(command); }; const getTables = (databaseName) => { const client = new GlueClient({}); const command = new GetTablesCommand({ DatabaseName: databaseName, }); return client.send(command); }; const makeGetDatabaseStep = ({ getDatabase }) => async (context) => { const { Database: { Name }, } = await getDatabase(process.env.DATABASE_NAME); log(`Database: ${Name}`); return { ...context }; }; /** * @param {{ getTables: () => Promise<import('@aws-sdk/client-glue').GetTablesCommandOutput}} config */ const makeGetTablesStep = ({ getTables }) => async (context) => { const { TableList } = await getTables(process.env.DATABASE_NAME); log("Tables:"); log(TableList.map((table) => ` • ${table.Name}\n`)); return { ...context }; };

建立並執行任務,從來源 Amazon S3 儲存貯體擷取CSV資料、移除並重新命名欄位來轉換資料,並將JSON格式化的輸出載入至另一個 Amazon S3 儲存貯體。

const createJob = (name, role, scriptBucketName, scriptKey) => { const client = new GlueClient({}); const command = new CreateJobCommand({ Name: name, Role: role, Command: { Name: "glueetl", PythonVersion: "3", ScriptLocation: `s3://${scriptBucketName}/${scriptKey}`, }, GlueVersion: "3.0", }); return client.send(command); }; const startJobRun = (jobName, dbName, tableName, bucketName) => { const client = new GlueClient({}); const command = new StartJobRunCommand({ JobName: jobName, Arguments: { "--input_database": dbName, "--input_table": tableName, "--output_bucket_url": `s3://${bucketName}/`, }, }); return client.send(command); }; const makeCreateJobStep = ({ createJob }) => async (context) => { log("Creating Job."); await createJob( process.env.JOB_NAME, process.env.ROLE_NAME, process.env.BUCKET_NAME, process.env.PYTHON_SCRIPT_KEY, ); log("Job created.", { type: "success" }); return { ...context }; }; /** * @param {(name: string, runId: string) => Promise<import('@aws-sdk/client-glue').GetJobRunCommandOutput> } getJobRun * @param {string} jobName * @param {string} jobRunId */ const waitForJobRun = async (getJobRun, jobName, jobRunId) => { const waitTimeInSeconds = 30; const { JobRun } = await getJobRun(jobName, jobRunId); if (!JobRun) { throw new Error(`Job run with id ${jobRunId} not found.`); } switch (JobRun.JobRunState) { case "FAILED": case "TIMEOUT": case "STOPPED": case "ERROR": throw new Error( `Job ${JobRun.JobRunState}. Error: ${JobRun.ErrorMessage}`, ); case "SUCCEEDED": return; default: break; } log( `Job ${JobRun.JobRunState}. Waiting ${waitTimeInSeconds} more seconds...`, ); await wait(waitTimeInSeconds); return waitForJobRun(getJobRun, jobName, jobRunId); }; /** * @param {{ prompter: { prompt: () => Promise<{ shouldOpen: boolean }>} }} context */ const promptToOpen = async (context) => { const { shouldOpen } = await context.prompter.prompt({ name: "shouldOpen", type: "confirm", message: "Open the output bucket in your browser?", }); if (shouldOpen) { return open( `https://s3.console.aws.amazon.com/s3/buckets/${process.env.BUCKET_NAME} to view the output.`, ); } }; const makeStartJobRunStep = ({ startJobRun, getJobRun }) => async (context) => { log("Starting job."); const { JobRunId } = await startJobRun( process.env.JOB_NAME, process.env.DATABASE_NAME, process.env.TABLE_NAME, process.env.BUCKET_NAME, ); log("Job started.", { type: "success" }); log("Waiting for job to finish running. This can take a while."); await waitForJobRun(getJobRun, process.env.JOB_NAME, JobRunId); log("Job run succeeded.", { type: "success" }); await promptToOpen(context); return { ...context }; };

列出任務執行的相關資訊,並檢視部分轉換的資料。

const getJobRuns = (jobName) => { const client = new GlueClient({}); const command = new GetJobRunsCommand({ JobName: jobName, }); return client.send(command); }; const getJobRun = (jobName, jobRunId) => { const client = new GlueClient({}); const command = new GetJobRunCommand({ JobName: jobName, RunId: jobRunId, }); return client.send(command); }; /** * @typedef {{ prompter: { prompt: () => Promise<{jobName: string}> } }} Context */ /** * @typedef {() => Promise<import('@aws-sdk/client-glue').GetJobRunCommandOutput>} getJobRun */ /** * @typedef {() => Promise<import('@aws-sdk/client-glue').GetJobRunsCommandOutput} getJobRuns */ /** * * @param {getJobRun} getJobRun * @param {string} jobName * @param {string} jobRunId */ const logJobRunDetails = async (getJobRun, jobName, jobRunId) => { const { JobRun } = await getJobRun(jobName, jobRunId); log(JobRun, { type: "object" }); }; /** * * @param {{getJobRuns: getJobRuns, getJobRun: getJobRun }} funcs */ const makePickJobRunStep = ({ getJobRuns, getJobRun }) => async (/** @type { Context } */ context) => { if (context.selectedJobName) { const { JobRuns } = await getJobRuns(context.selectedJobName); const { jobRunId } = await context.prompter.prompt({ name: "jobRunId", type: "list", message: "Select a job run to see details.", choices: JobRuns.map((run) => run.Id), }); logJobRunDetails(getJobRun, context.selectedJobName, jobRunId); } return { ...context }; };

刪除透過示範建立的所有資源。

const deleteJob = (jobName) => { const client = new GlueClient({}); const command = new DeleteJobCommand({ JobName: jobName, }); return client.send(command); }; const deleteTable = (databaseName, tableName) => { const client = new GlueClient({}); const command = new DeleteTableCommand({ DatabaseName: databaseName, Name: tableName, }); return client.send(command); }; const deleteDatabase = (databaseName) => { const client = new GlueClient({}); const command = new DeleteDatabaseCommand({ Name: databaseName, }); return client.send(command); }; const deleteCrawler = (crawlerName) => { const client = new GlueClient({}); const command = new DeleteCrawlerCommand({ Name: crawlerName, }); return client.send(command); }; /** * * @param {import('../../../actions/delete-job.js').deleteJob} deleteJobFn * @param {string[]} jobNames * @param {{ prompter: { prompt: () => Promise<any> }}} context */ const handleDeleteJobs = async (deleteJobFn, jobNames, context) => { /** * @type {{ selectedJobNames: string[] }} */ const { selectedJobNames } = await context.prompter.prompt({ name: "selectedJobNames", type: "checkbox", message: "Let's clean up jobs. Select jobs to delete.", choices: jobNames, }); if (selectedJobNames.length === 0) { log("No jobs selected."); } else { log("Deleting jobs."); await Promise.all( selectedJobNames.map((n) => deleteJobFn(n).catch(console.error)), ); log("Jobs deleted.", { type: "success" }); } }; /** * @param {{ * listJobs: import('../../../actions/list-jobs.js').listJobs, * deleteJob: import('../../../actions/delete-job.js').deleteJob * }} config */ const makeCleanUpJobsStep = ({ listJobs, deleteJob }) => async (context) => { const { JobNames } = await listJobs(); if (JobNames.length > 0) { await handleDeleteJobs(deleteJob, JobNames, context); } return { ...context }; }; /** * @param {import('../../../actions/delete-table.js').deleteTable} deleteTable * @param {string} databaseName * @param {string[]} tableNames */ const deleteTables = (deleteTable, databaseName, tableNames) => Promise.all( tableNames.map((tableName) => deleteTable(databaseName, tableName).catch(console.error), ), ); /** * @param {{ * getTables: import('../../../actions/get-tables.js').getTables, * deleteTable: import('../../../actions/delete-table.js').deleteTable * }} config */ const makeCleanUpTablesStep = ({ getTables, deleteTable }) => /** * @param {{ prompter: { prompt: () => Promise<any>}}} context */ async (context) => { const { TableList } = await getTables(process.env.DATABASE_NAME).catch( () => ({ TableList: null }), ); if (TableList && TableList.length > 0) { /** * @type {{ tableNames: string[] }} */ const { tableNames } = await context.prompter.prompt({ name: "tableNames", type: "checkbox", message: "Let's clean up tables. Select tables to delete.", choices: TableList.map((t) => t.Name), }); if (tableNames.length === 0) { log("No tables selected."); } else { log("Deleting tables."); await deleteTables(deleteTable, process.env.DATABASE_NAME, tableNames); log("Tables deleted.", { type: "success" }); } } return { ...context }; }; /** * @param {import('../../../actions/delete-database.js').deleteDatabase} deleteDatabase * @param {string[]} databaseNames */ const deleteDatabases = (deleteDatabase, databaseNames) => Promise.all( databaseNames.map((dbName) => deleteDatabase(dbName).catch(console.error)), ); /** * @param {{ * getDatabases: import('../../../actions/get-databases.js').getDatabases * deleteDatabase: import('../../../actions/delete-database.js').deleteDatabase * }} config */ const makeCleanUpDatabasesStep = ({ getDatabases, deleteDatabase }) => /** * @param {{ prompter: { prompt: () => Promise<any>}} context */ async (context) => { const { DatabaseList } = await getDatabases(); if (DatabaseList.length > 0) { /** @type {{ dbNames: string[] }} */ const { dbNames } = await context.prompter.prompt({ name: "dbNames", type: "checkbox", message: "Let's clean up databases. Select databases to delete.", choices: DatabaseList.map((db) => db.Name), }); if (dbNames.length === 0) { log("No databases selected."); } else { log("Deleting databases."); await deleteDatabases(deleteDatabase, dbNames); log("Databases deleted.", { type: "success" }); } } return { ...context }; }; const cleanUpCrawlerStep = async (context) => { log("Deleting crawler."); try { await deleteCrawler(process.env.CRAWLER_NAME); log("Crawler deleted.", { type: "success" }); } catch (err) { if (err.name === "EntityNotFoundException") { log("Crawler is already deleted."); } else { throw err; } } return { ...context }; };

動作

下列程式碼範例示範如何使用 CreateCrawler

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const createCrawler = (name, role, dbName, tablePrefix, s3TargetPath) => { const client = new GlueClient({}); const command = new CreateCrawlerCommand({ Name: name, Role: role, DatabaseName: dbName, TablePrefix: tablePrefix, Targets: { S3Targets: [{ Path: s3TargetPath }], }, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考CreateCrawler中的 。

下列程式碼範例示範如何使用 CreateJob

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const createJob = (name, role, scriptBucketName, scriptKey) => { const client = new GlueClient({}); const command = new CreateJobCommand({ Name: name, Role: role, Command: { Name: "glueetl", PythonVersion: "3", ScriptLocation: `s3://${scriptBucketName}/${scriptKey}`, }, GlueVersion: "3.0", }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考CreateJob中的 。

下列程式碼範例示範如何使用 DeleteCrawler

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const deleteCrawler = (crawlerName) => { const client = new GlueClient({}); const command = new DeleteCrawlerCommand({ Name: crawlerName, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考DeleteCrawler中的 。

下列程式碼範例示範如何使用 DeleteDatabase

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const deleteDatabase = (databaseName) => { const client = new GlueClient({}); const command = new DeleteDatabaseCommand({ Name: databaseName, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考DeleteDatabase中的 。

下列程式碼範例示範如何使用 DeleteJob

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const deleteJob = (jobName) => { const client = new GlueClient({}); const command = new DeleteJobCommand({ JobName: jobName, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考DeleteJob中的 。

下列程式碼範例示範如何使用 DeleteTable

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const deleteTable = (databaseName, tableName) => { const client = new GlueClient({}); const command = new DeleteTableCommand({ DatabaseName: databaseName, Name: tableName, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考DeleteTable中的 。

下列程式碼範例示範如何使用 GetCrawler

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const getCrawler = (name) => { const client = new GlueClient({}); const command = new GetCrawlerCommand({ Name: name, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考GetCrawler中的 。

下列程式碼範例示範如何使用 GetDatabase

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const getDatabase = (name) => { const client = new GlueClient({}); const command = new GetDatabaseCommand({ Name: name, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考GetDatabase中的 。

下列程式碼範例示範如何使用 GetDatabases

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const getDatabases = () => { const client = new GlueClient({}); const command = new GetDatabasesCommand({}); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考GetDatabases中的 。

下列程式碼範例示範如何使用 GetJob

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const getJob = (jobName) => { const client = new GlueClient({}); const command = new GetJobCommand({ JobName: jobName, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考GetJob中的 。

下列程式碼範例示範如何使用 GetJobRun

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const getJobRun = (jobName, jobRunId) => { const client = new GlueClient({}); const command = new GetJobRunCommand({ JobName: jobName, RunId: jobRunId, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考GetJobRun中的 。

下列程式碼範例示範如何使用 GetJobRuns

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const getJobRuns = (jobName) => { const client = new GlueClient({}); const command = new GetJobRunsCommand({ JobName: jobName, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考GetJobRuns中的 。

下列程式碼範例示範如何使用 GetTables

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const getTables = (databaseName) => { const client = new GlueClient({}); const command = new GetTablesCommand({ DatabaseName: databaseName, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考GetTables中的 。

下列程式碼範例示範如何使用 ListJobs

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const listJobs = () => { const client = new GlueClient({}); const command = new ListJobsCommand({}); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考ListJobs中的 。

下列程式碼範例示範如何使用 StartCrawler

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const startCrawler = (name) => { const client = new GlueClient({}); const command = new StartCrawlerCommand({ Name: name, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考StartCrawler中的 。

下列程式碼範例示範如何使用 StartJobRun

SDK for JavaScript (v3)
注意

還有更多功能 GitHub。尋找完整範例,並了解如何在 AWS 程式碼範例儲存庫中設定和執行。

const startJobRun = (jobName, dbName, tableName, bucketName) => { const client = new GlueClient({}); const command = new StartJobRunCommand({ JobName: jobName, Arguments: { "--input_database": dbName, "--input_table": tableName, "--output_bucket_url": `s3://${bucketName}/`, }, }); return client.send(command); };
  • 如需API詳細資訊,請參閱 AWS SDK for JavaScript API 參考StartJobRun中的 。