Wählen Sie Ihre Cookie-Einstellungen aus

Wir verwenden essentielle Cookies und ähnliche Tools, die für die Bereitstellung unserer Website und Services erforderlich sind. Wir verwenden Performance-Cookies, um anonyme Statistiken zu sammeln, damit wir verstehen können, wie Kunden unsere Website nutzen, und Verbesserungen vornehmen können. Essentielle Cookies können nicht deaktiviert werden, aber Sie können auf „Anpassen“ oder „Ablehnen“ klicken, um Performance-Cookies abzulehnen.

Wenn Sie damit einverstanden sind, verwenden AWS und zugelassene Drittanbieter auch Cookies, um nützliche Features der Website bereitzustellen, Ihre Präferenzen zu speichern und relevante Inhalte, einschließlich relevanter Werbung, anzuzeigen. Um alle nicht notwendigen Cookies zu akzeptieren oder abzulehnen, klicken Sie auf „Akzeptieren“ oder „Ablehnen“. Um detailliertere Entscheidungen zu treffen, klicken Sie auf „Anpassen“.

Data collection - Machine Learning Lens
Diese Seite wurde nicht in Ihre Sprache übersetzt. Übersetzung anfragen

Data collection

Important steps in the ML lifecycle are to identify the data needed, followed by the evaluation of the various means available for collecting that data to train your model.

Figure 8 includes the key components of data collection phase. These components include: labeling, ingesting as streaming or batch, and aggregating.

Figure 8: The main components of data collection

The main components of the data collection phase (shown in Figure 8) are as follows:

  • Label - Labeled data is a group of samples that have been tagged with one or more labels. If labels are missing, then some effort is required to label it (either manual or automated).

  • Ingest and aggregate - Data collection includes ingesting and aggregating data from multiple data sources.

Figure 9 includes the data sources (like IoT, sensor, timeseries), data ingestion systems (like data streaming and analytics steaming), and data technologies (like data pipelines, data catalog, data lake, databases, and data warehouses).

Figure 9: Data sources, data ingestion, and data technologies

The sub-components of the ingest and aggregate component (shown in Figure 9) are as follows:

  • Data sources - Data sources include time-series, events, sensors, IoT devices, and social networks, depending on the nature of the use case. You can enrich your data sources by using the geospatial capability of Amazon SageMaker AI to access a range of geospatial data sources from AWS (for example, Amazon Location Service), open-source datasets (for example, Open Data on AWS), or your own proprietary data including from third-party providers (such as Planet Labs). To learn more about the geospatial capability in Amazon SageMaker AI, visit Geospatial ML with Amazon SageMaker AI (Preview).

  • Data ingestion- Data ingestion processes and technologies capture and store data on storage media. Data ingestion can occur in real-time using streaming technologies or historical mode using batch technologies.

  • Data technologies - Data storage technologies vary from transactional (SQL) databases, to data lakes and data warehouses. Extract, transform, and load (ETL) pipeline technology automates and orchestrates the data movement and transformations across cloud services and resources. A data lake technology enables storing and analyzing structured and unstructured data.

DatenschutzNutzungsbedingungen für die WebsiteCookie-Einstellungen
© 2025, Amazon Web Services, Inc. oder Tochtergesellschaften. Alle Rechte vorbehalten.