Big Data Frameworks – AWS Technologies Blog

Framework	Description	Use Cases	Strengths
Apache Hadoop	A distributed framework for storing and processing large datasets using the Hadoop Distributed File System (HDFS) and the MapReduce programming model.	Batch processing, ETL, data storage.	– Scalable and reliable storage (HDFS). – Proven technology.
Apache Spark	A unified analytics engine for large-scale data processing, offering in-memory computing and support for batch, streaming, and ML workloads.	Streaming analytics, machine learning, graph processing, batch ETL.	– In-memory processing for speed. – Broad use case support.
Apache HBase	A distributed NoSQL database built on HDFS for real-time, random read/write access to large datasets.	Real-time applications, time-series data, IoT data storage.	– Low-latency reads/writes. – Scales horizontally.
Apache Flink	A framework for real-time stream processing and distributed batch processing, with low latency and high throughput.	Real-time analytics, event processing, streaming ETL.	– Native per-event stream processing. – Stateful stream processing.
Presto	A distributed SQL query engine designed for fast, interactive queries across large datasets, optimized for analytics over heterogeneous data sources.	Interactive SQL querying, federated queries, analytics on data lakes.	– Low-latency interactive SQL. – Query federation across data sources.
Apache Hive	A data warehouse tool built on Hadoop, providing SQL-like query capabilities (HiveQL) for processing and analyzing structured datasets.	Data warehousing, batch analytics, schema-on-read processing.	– Familiar SQL-like interface. – Integrates with Hadoop.
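
To make the MapReduce programming model mentioned in the Hadoop row more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are hypothetical, chosen only to show the three stages a MapReduce job goes through when counting words.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (illustrative only)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read splits from HDFS.
    sample = ["spark flink spark", "hive presto hive spark"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data over the network; the sketch only mirrors the logical flow of the model.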