Framework comparison:

- Apache Hadoop: A distributed framework for storing and processing large datasets using the Hadoop Distributed File System (HDFS) and the MapReduce programming model. Use cases: batch processing, ETL, data storage. Strengths: scalable and reliable storage (HDFS); proven technology.
- Apache Spark: A unified analytics engine for large-scale data processing, offering in-memory computing and…
Amazon EMR
Amazon EMR is a cloud-based big data platform that allows users to process and analyze large datasets quickly and cost-effectively. It provides a managed environment for running open-source tools like Apache Hadoop, Apache Spark, Hive, Presto, and more. With EMR, you can build scalable data pipelines, run large-scale analytics, and process data for machine learning…
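As an illustration of what "launching a managed cluster" looks like in practice, the sketch below builds the request body for boto3's `run_job_flow` call. Every concrete value here (cluster name, release label, instance types, bucket, IAM role names) is a placeholder assumption, not a recommendation:

```python
# Hypothetical EMR cluster spec: names, release label, instance types,
# and S3 paths below are illustrative placeholders.
cluster_spec = {
    "Name": "example-analytics-cluster",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster once all submitted steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "LogUri": "s3://example-bucket/emr-logs/",
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With credentials configured, the cluster would be started via:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**cluster_spec)
```

Transient clusters like this one (terminated when steps finish) are a common cost-control pattern, since you pay only while the job runs.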
AWS Glue
AWS Glue is a fully managed ETL (Extract, Transform, and Load) service provided by Amazon Web Services. It is designed to prepare and transform data for analytics and machine learning workflows by automating the processes of data discovery, cataloging, and preparation.

Key Features of AWS Glue
ETL (Extract, Transform, Load): build, manage, and run ETL…
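To make the "fully managed" part concrete, the sketch below assembles the parameters for boto3's `create_job` call, which registers an ETL script stored in S3 as a Glue job. The role ARN, script location, and worker sizing are hypothetical placeholders:

```python
# Hypothetical Glue job definition; the IAM role, script path, and
# worker settings are illustrative placeholders.
glue_job_spec = {
    "Name": "example-etl-job",
    "Role": "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "Command": {
        "Name": "glueetl",  # Spark-based ETL job type
        "ScriptLocation": "s3://example-bucket/scripts/transform.py",
        "PythonVersion": "3",
    },
    "GlueVersion": "4.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 2,
}

# With credentials configured, the job would be registered via:
# import boto3
# glue = boto3.client("glue")
# glue.create_job(**glue_job_spec)
```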
Analytics services
AWS Data Lake – a centralized repository that allows you to store structured, semi-structured, and unstructured data at any scale. Amazon Redshift – a cloud-based data warehousing service from AWS designed to handle large-scale data analytics and queries efficiently.
AWS Lake Formation
AWS Lake Formation is a managed service that simplifies the process of creating, managing, and securing a data lake on AWS. It streamlines the tasks of ingesting, cataloging, securing, and preparing data, allowing you to focus on gaining insights from your data instead of managing the infrastructure.

Simplified Data Ingestion: easily ingest data from various…
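The "securing" part of Lake Formation largely means granting fine-grained table permissions to principals. As a sketch, the request below is shaped for boto3's `grant_permissions` call; the role ARN, database, and table names are hypothetical:

```python
# Hypothetical Lake Formation grant: give an analyst role read-only
# access to one catalog table. All names are illustrative placeholders.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    "Resource": {
        "Table": {"DatabaseName": "sales_db", "Name": "orders"}
    },
    "Permissions": ["SELECT", "DESCRIBE"],
}

# With credentials configured, the grant would be applied via:
# import boto3
# lakeformation = boto3.client("lakeformation")
# lakeformation.grant_permissions(**grant_request)
```

Centralizing grants like this in Lake Formation, rather than in per-bucket S3 policies, is the main governance benefit the service advertises.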
AWS Data Lake
AWS Data Lake is a centralized repository that allows you to store and manage structured, semi-structured, and unstructured data at any scale. It enables you to store raw data as-is and process it later for analytics, machine learning, or other use cases. AWS provides a suite of services to build and manage data lakes efficiently,…
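"Store raw data as-is and process it later" usually translates into a partitioned S3 key layout, so downstream engines can prune by date. A minimal sketch, assuming a hypothetical `raw/` prefix convention and Hive-style partition names:

```python
from datetime import date

def raw_object_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key for raw data in a data lake.

    The raw/<source>/<dataset>/year=/month=/day= layout is an assumed
    convention, not an AWS requirement.
    """
    return (
        f"raw/{source}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}"
    )

key = raw_object_key("crm", "orders", date(2024, 3, 7), "orders.json")
# key == "raw/crm/orders/year=2024/month=03/day=07/orders.json"

# The object itself would then be written with, e.g.:
# import boto3
# s3 = boto3.client("s3")
# s3.put_object(Bucket="example-lake", Key=key, Body=payload)
```

Engines like Athena, Glue, and Redshift Spectrum can all use the `year=/month=/day=` partitions to scan only the relevant slice of the lake.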
Amazon Redshift
Amazon Redshift is a cloud-based data warehousing service from AWS that is designed to handle large-scale data analytics and queries efficiently. It enables organizations to perform complex analytical queries on massive datasets quickly and cost-effectively.

Scalable Data Warehousing: Redshift can scale to petabytes of data, making it suitable for big data use cases. You can…
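Much of Redshift's query efficiency comes from table design: choosing a distribution key to co-locate joined rows and a sort key to speed range filters. A sketch of such DDL (table and column names are hypothetical):

```python
# Illustrative Redshift DDL held as a string. DISTKEY co-locates rows that
# join on customer_id; SORTKEY accelerates range filters on order_date.
create_orders = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
"""

# A driver such as redshift_connector or psycopg2 would execute it:
# cursor.execute(create_orders)
```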
Redshift Architecture
Components
1. Leader Node
2. Compute Nodes
3. Node Slices
4. Cluster
5. Network Layer

Data Distribution and Processing
Data Distribution: data is distributed across compute nodes and slices based on the distribution style. Massively Parallel Processing (MPP): queries are split into smaller tasks and distributed to compute nodes. Each node processes its portion of…
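The distribution idea can be illustrated with a toy simulation: under KEY distribution, each row's distribution key is hashed to pick a slice, so the same key always lands on the same slice and each slice can scan its portion in parallel. This is only a conceptual sketch; Redshift's actual hash function and slice assignment are internal:

```python
import hashlib

def slice_for(dist_key: str, num_slices: int) -> int:
    """Map a distribution-key value to a slice deterministically.

    MD5 stands in for Redshift's internal hash; the point is only that the
    mapping is deterministic and roughly uniform.
    """
    digest = hashlib.md5(dist_key.encode()).hexdigest()
    return int(digest, 16) % num_slices

# Distribute 10 toy rows across 4 slices by customer_id.
rows = [{"customer_id": str(i)} for i in range(10)]
slices = {}
for row in rows:
    slices.setdefault(slice_for(row["customer_id"], 4), []).append(row)

# Every row lands on exactly one slice, and repeated lookups of the same
# key always return the same slice, which is what makes co-located joins work.
```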
DynamoDB Comparisons
DynamoDB DAX vs Global Tables

Feature | DynamoDB DAX | DynamoDB Global Tables
Purpose | In-memory caching for low-latency reads | Multi-region replication for low-latency access globally
Performance Focus | Speeds up read-heavy workloads | Ensures low-latency access across regions
Latency | Microseconds for cached reads | Milliseconds (based on network latency and consistency)
Data Replication | No replication (caches only in-memory near application)…
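To make the DAX side of the comparison concrete, here is a toy read-through cache in plain Python: hits are served from memory, misses fall through to the backing table and are cached with a TTL. DAX provides this pattern as a managed, API-compatible layer in front of DynamoDB; everything below is a simulation, not the DAX client:

```python
import time

class ReadThroughCache:
    """Toy read-through cache illustrating the DAX access pattern."""

    def __init__(self, backing_table: dict, ttl_seconds: float = 300.0):
        self.table = backing_table      # stands in for the DynamoDB table
        self.ttl = ttl_seconds
        self.store = {}                 # in-memory cache: key -> (value, stored_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and time.monotonic() - entry[1] < self.ttl:
            return entry[0], "hit"      # served from memory, no table read
        value = self.table.get(key)     # cache miss: read the backing table
        self.store[key] = (value, time.monotonic())
        return value, "miss"

table = {"user#1": {"name": "Ada"}}
cache = ReadThroughCache(table)
first = cache.get("user#1")   # ({'name': 'Ada'}, 'miss') - cold cache
second = cache.get("user#1")  # ({'name': 'Ada'}, 'hit')  - served from memory
```

Global Tables solve a different problem entirely: the data itself is replicated across regions, so there is no cache to miss, at the cost of eventual consistency between replicas.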
Amazon QLDB
Amazon QLDB is a fully managed, serverless ledger database service offered by AWS. It provides a transparent, immutable, and cryptographically verifiable transaction log. QLDB is designed for use cases where there is a need to maintain a reliable and trusted record of all changes to data over time, such as in financial transactions, supply chains,…
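The "cryptographically verifiable" property rests on hash chaining: each journal entry is hashed together with the previous digest, so altering any historical entry changes every digest after it. A minimal sketch of the idea (this illustrates the principle, not QLDB's actual journal format, which uses Merkle trees over Amazon Ion documents):

```python
import hashlib
import json

def chain_hash(prev_hash: bytes, entry: dict) -> bytes:
    """Hash an entry together with the previous digest.

    Because each digest folds in the one before it, changing any earlier
    entry invalidates every subsequent digest.
    """
    payload = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(prev_hash + payload).digest()

entries = [{"txn": 1, "amount": 100}, {"txn": 2, "amount": -40}]
digest = b"\x00" * 32
for e in entries:
    digest = chain_hash(digest, e)

# Tamper with the first entry and recompute: the final digest changes,
# which is how a verifier detects any rewrite of history.
tampered = [{"txn": 1, "amount": 999}, {"txn": 2, "amount": -40}]
digest2 = b"\x00" * 32
for e in tampered:
    digest2 = chain_hash(digest2, e)

assert digest != digest2
```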