AWS Glue

  • AWS Glue is a fully managed extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics.
  • AWS Glue is serverless and supports a pay-as-you-go model. There is no infrastructure to provision or manage; AWS Glue handles provisioning, configuration, and scaling of the resources required to run ETL jobs on a fully managed, scale-out Apache Spark environment (a minimal job-creation sketch follows this list).
  • AWS Glue makes it simple and cost-effective to categorize, clean, and enrich data and to move it reliably between various data stores and streams.
  • AWS Glue automatically discovers and profiles the data via the Glue Data Catalog, recommends and generates ETL code to transform the source data into target schemas, and runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load the data into its destination.
  • AWS Glue also helps set up, orchestrate, and monitor complex data flows.
  • AWS Glue consists of
    • a Data Catalog, which is a central metadata repository,
    • an ETL engine that can automatically generate Scala or Python code, and
    • a flexible scheduler that handles dependency resolution, job monitoring, and retries.
  • AWS Glue components help automate much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so more time can be spent on analyzing the data.
  • Glue can automatically discover both structured and semi-structured data stored in the data lake on S3, data warehouse in Redshift, and various databases running on AWS.
  • Glue provides a unified view of the data via the Glue Data Catalog that is available for ETL, querying and reporting using services like Athena, EMR, and Redshift Spectrum.
  • AWS Glue natively supports data stored in
    • RDS (Aurora, MySQL, Oracle, PostgreSQL, SQL Server)
    • Redshift
    • DynamoDB
    • S3
    • MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in the Virtual Private Cloud (VPC) running on EC2.
    • data streams from Amazon MSK, Kinesis Data Streams, and self-managed Apache Kafka.
  • Glue also supports custom Scala or Python code and allows importing custom libraries and JAR files into AWS Glue ETL jobs to access data sources not natively supported by AWS Glue.
  • Glue supports server-side encryption for data at rest and SSL for data in motion.
  • AWS Glue provides development endpoints to edit, debug, and test the code it generates.
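
As a rough illustration of the job model described above, the following Python sketch registers and starts a Glue ETL job via boto3; the job name, IAM role, script location, and worker settings are hypothetical placeholders, not values from this article.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Register an ETL job; Glue stores the definition, not a running cluster.
    glue.create_job(
        Name="sales-etl",                        # placeholder job name
        Role="GlueServiceRole",                  # IAM role Glue assumes
        Command={
            "Name": "glueetl",                   # Spark ETL job type
            "ScriptLocation": "s3://my-bucket/scripts/sales_etl.py",
            "PythonVersion": "3",
        },
        GlueVersion="4.0",
        NumberOfWorkers=2,
        WorkerType="G.1X",
    )

    # Start a run; Glue provisions the Spark environment on demand and
    # bills only for the run, matching the pay-as-you-go model.
    run = glue.start_job_run(JobName="sales-etl")
    print(run["JobRunId"])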

AWS Glue Data Catalog

  • AWS Glue Data Catalog is a central repository and persistent metadata store to store structural and operational metadata for all the data assets.
  • AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data.
  • For a given data set, the AWS Glue Data Catalog can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time.
  • AWS Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for big data applications running on Amazon EMR.
  • AWS Glue Data Catalog also provides out-of-the-box integration with Athena, EMR, and Redshift Spectrum.
  • Table definitions, once added to the Glue Data Catalog, are available for ETL and readily available for querying in Athena, EMR, and Redshift Spectrum, providing a common view of the data across these services (see the sketch after this list).
  • AWS Glue Data Catalog supports bulk import of metadata from an existing persistent Apache Hive Metastore using an import script.
  • Data Catalog provides comprehensive audit and governance capabilities, with schema change tracking and data access controls, which help ensure that data is not inappropriately modified or inadvertently shared.
  • Each AWS account has one AWS Glue Data Catalog per region.
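
A minimal boto3 sketch of browsing the Data Catalog, assuming default credentials and a catalog already populated (for example, by a crawler); it simply walks the databases and prints each table's columns.

    import boto3

    glue = boto3.client("glue")

    # Walk every database and table in this account/region's Data Catalog.
    # (Both APIs are paginated; a real script would use get_paginator.)
    for db in glue.get_databases()["DatabaseList"]:
        print(db["Name"])
        for table in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            print("  " + table["Name"], [c["Name"] for c in columns])

Because Athena, EMR, and Redshift Spectrum read the same catalog entries, the table definitions printed here are exactly what those services see.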

AWS Glue Crawlers

  • AWS Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of the data and other statistics, and then populates the Glue Data Catalog with this metadata.
  • Glue crawlers scan various data stores to automatically infer schemas and partition structure to populate the Glue Data Catalog with corresponding table definitions and statistics.
  • Glue crawlers can be scheduled to run periodically so that the metadata stays up to date and in sync with the underlying data (a crawler-creation sketch follows this list).
  • Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.
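
A hedged sketch of creating and scheduling a crawler with boto3; the crawler name, role, target database, S3 path, and cron expression are hypothetical placeholders.

    import boto3

    glue = boto3.client("glue")

    # Crawl an S3 prefix hourly and keep the catalog in sync with the data.
    glue.create_crawler(
        Name="sales-crawler",
        Role="GlueServiceRole",
        DatabaseName="sales_db",                 # catalog database to populate
        Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/sales/"}]},
        Schedule="cron(0 * * * ? *)",            # top of every hour
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",      # pick up schema changes
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",   # flag removed data
        },
    )

    # A crawler can also be triggered on demand between scheduled runs.
    glue.start_crawler(Name="sales-crawler")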

Dynamic Frames

  • AWS Glue is designed to work with semi-structured data and introduces a component called a dynamic frame, which can be used in ETL scripts.
  • A dynamic frame is a distributed table that supports nested data such as structures and arrays.
  • Each record is self-describing, designed for schema flexibility with semi-structured data: each record contains both the data and the schema that describes it.
  • A dynamic frame is similar to an Apache Spark DataFrame, a data abstraction used to organize data into rows and columns, except that each record is self-describing, so no schema is required initially.
  • Dynamic frames provide schema flexibility and a set of advanced transformations specifically designed for dynamic frames.
  • Dynamic frames can be converted to and from Spark DataFrames to take advantage of both AWS Glue and Spark transformations for the kinds of analysis needed, as shown in the sketch below.
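
A minimal PySpark sketch of the dynamic frame workflow, assuming it runs inside a Glue job and that a hypothetical sales_db/orders catalog table exists; it reads a dynamic frame, resolves an ambiguous column type, round-trips through a Spark DataFrame, and writes Parquet back to S3.

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read a catalog table as a dynamic frame; no schema needs to be declared.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders"
    )

    # Resolve an ambiguous column type (e.g. a field seen as both int and string),
    # one of the transformations designed specifically for dynamic frames.
    orders = orders.resolveChoice(specs=[("order_id", "cast:long")])

    # Drop to a Spark DataFrame for transformations Glue doesn't provide...
    df = orders.toDF().filter("total > 0")

    # ...then convert back to a dynamic frame to use the Glue writers.
    cleaned = DynamicFrame.fromDF(df, glue_context, "cleaned_orders")

    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/orders/"},
        format="parquet",
    )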

AWS Glue Streaming ETL

  • AWS Glue enables performing ETL operations on streaming data using continuously running jobs.
  • AWS Glue streaming ETL is built on the Apache Spark Structured Streaming engine and can ingest streams from Kinesis Data Streams and from Apache Kafka, including Amazon Managed Streaming for Apache Kafka (MSK).
  • Streaming ETL can clean and transform streaming data and load it into S3 or JDBC data stores.
  • Use streaming ETL in AWS Glue to process event data like IoT streams, clickstreams, and network logs (a minimal streaming job sketch follows this list).
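
A minimal streaming ETL sketch, assuming a hypothetical Kinesis-backed catalog table named streaming_db.clickstream and illustrative S3 paths; Glue's forEachBatch helper hands each micro-batch to a function as a Spark DataFrame.

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the stream via its catalog table definition.
    stream = glue_context.create_data_frame.from_catalog(
        database="streaming_db",
        table_name="clickstream",
        additional_options={"startingPosition": "TRIM_HORIZON"},
    )

    def process_batch(data_frame, batch_id):
        # Each micro-batch arrives as a Spark DataFrame; convert it to a
        # dynamic frame and land it on S3 as Parquet.
        if data_frame.count() > 0:
            batch = DynamicFrame.fromDF(data_frame, glue_context, "batch")
            glue_context.write_dynamic_frame.from_options(
                frame=batch,
                connection_type="s3",
                connection_options={"path": "s3://my-bucket/clickstream/"},
                format="parquet",
            )

    # Glue drives Spark Structured Streaming in fixed windows, checkpointing
    # progress so a restarted job resumes where it left off.
    glue_context.forEachBatch(
        frame=stream,
        batch_function=process_batch,
        options={
            "windowSize": "100 seconds",
            "checkpointLocation": "s3://my-bucket/checkpoints/clickstream/",
        },
    )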

Glue Job Bookmark

  • AWS Glue Job Bookmark tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run.
  • Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.
  • Job bookmarks help process new data when rerunning a job on a scheduled interval.
  • A job bookmark is composed of the states of various job elements, such as sources, transformations, and targets. For example, an ETL job might read new partitions in an S3 source; AWS Glue tracks which partitions the job has processed successfully to prevent duplicate processing and duplicate data in the job's target data store (a bookmark-aware script sketch follows this list).
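
A sketch of a bookmark-aware job, assuming bookmarks are enabled on the job via the --job-bookmark-enable job argument and using hypothetical table and path names; the key pieces are the transformation_ctx names on the source and sink, which key the persisted state, and the final job.commit(), which saves it.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # transformation_ctx names this source so Glue can persist its bookmark
    # state (e.g. which S3 objects have already been read) between runs.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="orders",
        transformation_ctx="orders_source",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=orders,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/processed/orders/"},
        format="parquet",
        transformation_ctx="orders_sink",
    )

    # Committing the job saves the bookmark; without it, the next run
    # reprocesses the same data.
    job.commit()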

AWS Certification Exam Practice Questions

  • Questions are collected from the Internet and the answers are marked as per my knowledge and understanding (which might differ from yours).
  • AWS services are updated every day, and both the questions and answers might soon be outdated, so research accordingly.
  • AWS exam questions are not updated to keep pace with AWS updates, so even if the underlying feature has changed, the question might not be updated.
  • Open to further feedback, discussion and correction.
  1. An organization is setting up a data catalog and metadata management environment for their numerous data stores currently running on AWS. The data catalog will be used to determine the structure and other attributes of data in the data stores. The data stores are composed of Amazon RDS databases, Amazon Redshift, and CSV files residing on Amazon S3. The catalog should be populated on a scheduled basis, and minimal administration is required to manage the catalog. How can this be accomplished?
    1. Set up Amazon DynamoDB as the data catalog and run a scheduled AWS Lambda function that connects to data sources to populate the database.
    2. Use an Amazon database as the data catalog and run a scheduled AWS Lambda function that connects to data sources to populate the database.
    3. Use AWS Glue Data Catalog as the data catalog and schedule crawlers that connect to data sources to populate the database.
    4. Set up Apache Hive metastore on an Amazon EC2 instance and run a scheduled bash script that connects to data sources to populate the metastore.