Kinesis Firehose vs Streams – AWS Technologies Blog

Amazon Kinesis Data Firehose and Amazon Kinesis Data Streams are both part of the Kinesis family of services for handling real-time data streams, but they are designed for different use cases and have distinct characteristics. Here’s a comparison to help clarify their differences:

1. Purpose and Use Cases:

Kinesis Data Streams:
- Use case: Best for real-time streaming data that needs to be processed and analyzed by custom applications.
- Ideal for: Use cases like real-time analytics, log processing, and event-driven architectures where you need to read, process, and analyze data with fine-grained control.
- Flow: You manually manage the consumption of data using Kinesis consumers like Kinesis Client Library (KCL) or AWS Lambda.
Kinesis Data Firehose:
- Use case: Best for real-time data ingestion with automatic delivery to destinations like S3, Redshift, Elasticsearch, and Splunk.
- Ideal for: Use cases where you just need to collect, buffer, and automatically deliver streaming data to specific AWS destinations without needing to build a custom data processing pipeline.
- Flow: Data is automatically delivered to specified destinations without requiring manual consumption.

2. Data Processing:

Kinesis Data Streams:
- You process data in real time using custom applications, frameworks, or services like AWS Lambda.
- You can implement complex processing logic, filter, aggregate, or transform the data before delivery.
Kinesis Data Firehose:
- Provides basic data transformation via AWS Lambda before delivering the data to the destination. But it’s generally less customizable than Data Streams when it comes to processing logic.
- The focus is on delivery to destinations, not complex processing.

3. Scalability:

Kinesis Data Streams:
- You must manage the shards (partitions) in your stream. Each shard has a fixed throughput (1 MB/sec input, 2 MB/sec output). To scale, you need to increase the number of shards or manage them programmatically.
- More fine-grained control over scaling, but requires active management.
Kinesis Data Firehose:
- Fully managed and auto-scaling. It automatically scales to accommodate the data throughput without needing manual intervention. You don’t need to worry about shards or scaling since Firehose handles it for you.

4. Data Retention:

Kinesis Data Streams:
- Data can be stored in the stream for up to 7 days (default retention), but this is configurable.
- After the retention period, data is deleted unless it’s consumed by a processor.
Kinesis Data Firehose:
- No data retention in Firehose itself. The data is buffered temporarily (based on size and time limits), and once it’s delivered to the destination (like S3 or Redshift), it is no longer in Firehose.

5. Destinations:

Kinesis Data Streams:
- No built-in delivery mechanism to destinations like S3, Redshift, etc. You must implement your own consumer to process and store the data wherever needed (e.g., write it to S3 using Lambda or EC2).
Kinesis Data Firehose:
- Provides direct integration with destinations like:
  - Amazon S3 (for storage).
  - Amazon Redshift (for data warehousing).
  - Amazon Elasticsearch Service (for search and analytics).
  - Splunk (for monitoring and analysis).

6. Processing Complexity:

Kinesis Data Streams:
- More complex to set up and manage. Requires building and managing your own consumers and possibly complex processing logic.
- Suitable for custom processing and advanced use cases like data transformation, aggregation, or filtering before delivery.
Kinesis Data Firehose:
- Simpler to set up and use. You just configure the stream and specify where to deliver the data.
- Minimal processing (optional Lambda transformations), focusing mainly on data ingestion and delivery.

7. Cost Structure:

Kinesis Data Streams:
- Charged based on the number of shards you provision, the volume of data ingested, and the data retrieval from the stream.
- Scaling and shard management can make cost unpredictable if you are not monitoring throughput effectively.
Kinesis Data Firehose:
- Charged based on the volume of data ingested and the destination to which the data is delivered.
- Generally simpler and more predictable in terms of cost, especially for users who only need to deliver data to destinations.