Operational Excellence
AWS Direct Connect
Use a physical, private, and dedicated connection from your on-premises location to your AWS environment. You can monitor AWS Direct Connect connections using Amazon CloudWatch, which collects raw data from AWS Direct Connect and processes it into readable, near real-time metrics. You can consolidate these metrics in CloudWatch and build dashboards and alarms that notify your operations team when defined conditions are met.
Amazon notifies you about scheduled maintenance for AWS Direct Connect. To help you manage these events, AWS recommends using the AWS Personal Health Dashboard, which displays relevant information and provides proactive notifications for scheduled maintenance or events that will affect Direct Connect, so that you can plan for scheduled activities.
Your operations team should be prepared for unplanned network outages between your on-premises environment and AWS. For example, to prepare for an unplanned outage such as an AWS Direct Connect connection failure, establish a second Direct Connect connection. Traffic fails over to the second link automatically if the same Border Gateway Protocol (BGP) prefixes are advertised over both connections. We recommend enabling Bidirectional Forwarding Detection (BFD) when configuring your connections to ensure fast detection and failover. Test your high-availability design and configuration periodically using AWS Direct Connect Resiliency Toolkit failover testing. Additionally, you can configure a backup IPsec VPN connection, in which case all VPC traffic fails over to the VPN connection automatically when the Direct Connect connection fails. Traffic to and from public resources, such as Amazon Simple Storage Service (Amazon S3), can be routed over the internet if it was previously routed over a Direct Connect public virtual interface.
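As an illustration of the alerting described above, the following sketch builds the parameters for a CloudWatch alarm on the Direct Connect `ConnectionState` metric (1 when the connection is up, 0 when it is down). The connection ID, alarm name, SNS topic ARN, and the period and evaluation settings are all placeholder assumptions, not recommended values.

```python
# Sketch only: builds alarm parameters, does not call AWS.
# The connection ID, alarm name, and SNS topic ARN below are placeholders.

def connection_down_alarm(connection_id: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"dx-{connection_id}-down",
        "Namespace": "AWS/DX",
        "MetricName": "ConnectionState",  # 1 = up, 0 = down
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": "Minimum",
        "Period": 60,                      # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,            # consecutive datapoints (assumed)
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",   # treat missing data as an outage
        "AlarmActions": [sns_topic_arn],
    }


params = connection_down_alarm(
    "dxcon-EXAMPLE", "arn:aws:sns:us-east-1:111122223333:ops-alerts"
)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it possible to review or unit test the alarm definition before anything is created in an account.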
AWS Site-to-Site VPN
Establish a dedicated IPsec connection from your location to your AWS environment. You can monitor AWS Site-to-Site VPN connections using CloudWatch with near real-time metrics. You can monitor the state of your VPN tunnels and the data transferred in and out of the tunnels. These metrics are recorded for 15 months, so you can access historical information and gain a better perspective on how your hybrid setup performed. VPN metric data is automatically sent to CloudWatch as it becomes available.
To ensure operational stability in case of failures, AWS VPN has built-in high availability. Each AWS Site-to-Site VPN connection has two tunnels, with each tunnel using a unique virtual private gateway public IP address and connecting to a single customer gateway. It is important to configure both tunnels for redundancy. Then, if one tunnel becomes unavailable (for example, if it is down for maintenance), network traffic is automatically routed to the available tunnel for that specific connection.
However, to protect against a loss of connectivity if your single customer gateway becomes unavailable, you can set up a second Site-to-Site VPN connection to your VPC and virtual private gateway by using a second customer gateway. By using redundant Site-to-Site VPN connections and customer gateways, you can perform maintenance on one of your customer gateways while traffic continues to flow over the second customer gateway's Site-to-Site VPN connection. Redundancy can be achieved in two ways: with static routing or with dynamic routing. If you implement dynamic routing, you will need to configure BGP to advertise your dynamic routes.
Transit Gateway
Using a transit gateway as a central hub enables access between your VPC resources and your on-premises environment using AWS Direct Connect or AWS VPN. Transit Gateway provides statistics and logs that can be used by services such as CloudWatch and Amazon VPC Flow Logs.
You can start tracking health data and managing operations by building dashboards and alarms on the transit gateway's attachment-level CloudWatch metrics. You can use CloudWatch to retrieve bandwidth usage between Amazon VPCs and a VPN connection, packet flow count, and packet drop count. Additionally, you can use VPC Flow Logs on the transit gateway to capture information on the IP traffic routed through it.
AWS Transit Gateway Network Manager provides a single global view of your private network, including hybrid connectivity. You can view network activity in many locations from a single dashboard. It also includes event data to help you monitor and troubleshoot the quality of your global network. Events describe changes in your global network. Transit Gateway Network Manager sends the following types of events to CloudWatch Events:
- Topology changes: For example, an AWS Direct Connect gateway was attached to a transit gateway
- Routing updates: For example, a VPN attachment’s route table association changed
- Status updates: For example, a VPN tunnel’s BGP session went up (after being down)
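The three event types above could be selected with an event pattern along these lines. This is a minimal sketch: the `source` and `detail-type` strings are assumptions that should be checked against the Network Manager documentation, and the `matches` helper is a simplified local stand-in for pattern matching, not the service's own matching logic.

```python
import json

# Assumed source and detail-type strings; verify them before use.
EVENT_PATTERN = {
    "source": ["aws.networkmanager"],
    "detail-type": [
        "Network Manager Topology Change",
        "Network Manager Routing Update",
        "Network Manager Status Update",
    ],
}


def matches(pattern: dict, event: dict) -> bool:
    """Return True if, for every pattern key, the event's value is listed."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# Hypothetical event used only to exercise the pattern locally.
sample_event = {
    "source": "aws.networkmanager",
    "detail-type": "Network Manager Status Update",
    "detail": {"changeDescription": "hypothetical: VPN tunnel BGP session up"},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

The printed JSON is the form in which such a pattern would typically be attached to an event rule, with a notification target chosen by your operations team.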
Route Analyzer
Route Analyzer helps you analyze the routes in the transit gateway route tables in your global network. It analyzes the routing path between a specified source and destination, and returns information about the connectivity between components.
Using the Route Analyzer, you can:
- Verify that the transit gateway route table configuration will work as expected before you start sending traffic.
- Validate your existing route configuration.
- Diagnose route-related issues that are causing traffic disruption in your global network.
Reliability
AWS sets service quotas or service limits to protect you from accidentally over-provisioning resources. A service quota is an upper limit on the number of each resource your team can request.
You will need to have governance and processes in place to monitor and change these quotas to meet your business needs and goals.
As you migrate to AWS, design and plan how your AWS and on-premises resources will interact and integrate as a network topology. If you are using:
- AWS Direct Connect, there are quotas on the amount of data that can be transferred on each connection. Currently, you can have a dedicated connection of 1 Gbps, 10 Gbps, or 100 Gbps bandwidth. If you need more bandwidth, you can pool bandwidth across circuits using Link Aggregation Groups (LAGs) of up to four 1 Gbps, four 10 Gbps, or two 100 Gbps connections.
- An AWS Site-to-Site VPN connection to access resources in an Amazon VPC, you are cumulatively bound by the virtual private gateway throughput of 1.25 Gbps.
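The LAG limits above translate into simple capacity arithmetic. The sketch below encodes only the figures stated in the text (four 1 Gbps, four 10 Gbps, or two 100 Gbps members) and picks the option with the least aggregate capacity that still meets a demand. The function names are illustrative, and real sizing would also need to account for redundancy and growth.

```python
import math

# Maximum LAG members per port speed (Gbps), as stated in the text above.
MAX_LAG_MEMBERS = {1: 4, 10: 4, 100: 2}


def lag_capacity_gbps(port_speed_gbps: int, members: int) -> int:
    """Aggregate bandwidth of a LAG, validating the member count."""
    limit = MAX_LAG_MEMBERS.get(port_speed_gbps)
    if limit is None:
        raise ValueError(f"unsupported port speed: {port_speed_gbps} Gbps")
    if not 1 <= members <= limit:
        raise ValueError(f"a {port_speed_gbps} Gbps LAG allows 1 to {limit} members")
    return port_speed_gbps * members


def smallest_option(required_gbps: float):
    """Return (port_speed_gbps, members) with the least aggregate capacity
    that meets the demand, or None if no single LAG is large enough."""
    candidates = []
    for speed, limit in MAX_LAG_MEMBERS.items():
        members = max(1, math.ceil(required_gbps / speed))
        if members <= limit:
            candidates.append((speed * members, speed, members))
    if not candidates:
        return None
    _, speed, members = min(candidates)
    return speed, members


print(lag_capacity_gbps(10, 4))  # four 10 Gbps connections pooled
print(smallest_option(25))       # demand of 25 Gbps
print(smallest_option(250))      # exceeds the largest single LAG
```

A result of `None` indicates that the demand cannot be met by one LAG under these limits and would need more than one LAG or a different design.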
Service quotas can be increased from the default values to handle the requirements of a large deployment according to your business needs. You can also proactively raise quotas if you anticipate exceeding them in your workloads. When raising these quotas, ensure that there is a sufficient gap between your service quota and your maximum usage to accommodate scale.
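One way to make "a sufficient gap" concrete is to track headroom as a fraction of the quota. In the sketch below, the 20% minimum headroom and the example numbers are assumptions chosen for illustration; the right margin depends on how quickly your usage grows and how long a quota increase takes.

```python
def quota_headroom(quota: int, peak_usage: int) -> float:
    """Fraction of the quota that is still unused at peak usage."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    return (quota - peak_usage) / quota


def needs_increase(quota: int, peak_usage: int, min_headroom: float = 0.2) -> bool:
    """True when peak usage leaves less than the desired headroom.

    min_headroom=0.2 (20%) is an assumed policy value, not an AWS default.
    """
    return quota_headroom(quota, peak_usage) < min_headroom


# Hypothetical example: a quota of 50 with a peak of 45 leaves 10% headroom.
print(needs_increase(quota=50, peak_usage=45))
print(needs_increase(quota=50, peak_usage=30))
```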
—
- How do you regulate bandwidth usage for Direct Connect connections and implement changes?
Logs and metrics are a powerful tool to gain insight into the health of your workloads. Configure your workload to monitor CloudWatch logs and metrics and send notifications when thresholds are crossed or significant events occur. For example, with AWS Direct Connect, log these metrics to track the connection capacity utilization:
- ConnectionBpsIngress
- ConnectionBpsEgress
- ConnectionPpsEgress
- ConnectionPpsIngress
2. How do you prepare for AWS Direct Connect scheduled maintenance or events?
When an AWS Direct Connect connection is down for maintenance, it can be unavailable for anywhere from a few minutes to a few hours, depending on the level of maintenance required. To prepare for this downtime:
- Request a redundant Direct Connect connection.
- Configure a virtual private network (VPN) connection as a backup.
—
- How does your network withstand component failures?
- How are you testing for resiliency?
- How are you planning for disaster recovery?
All systems are expected to fail. It is best practice to know how you will become aware of these failures and to respond to them automatically. This helps to ensure that your network can withstand failures without affecting the workloads that run over it.
Highly resilient network connections are key to a Well-Architected system.
Best practices are to:
- Connect from multiple data centers for physical location redundancy.
- Use dynamically routed, active/active connections for automatic load balancing and failover across redundant network connections.
- Provision sufficient network capacity to ensure that the failure of one network connection does not overwhelm and degrade redundant connections.
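The last practice above can be expressed as a simple capacity check: after losing the largest link, the remaining links should still carry peak traffic. This is a sketch with hypothetical capacities and an illustrative function name, and it ignores factors such as uneven load distribution across links.

```python
def survives_single_failure(link_capacities_gbps, peak_traffic_gbps) -> bool:
    """True if the remaining links can carry peak traffic after the
    largest single link fails (worst-case single failure)."""
    if len(link_capacities_gbps) < 2:
        return False  # no redundant link to fail over to
    remaining = sum(link_capacities_gbps) - max(link_capacities_gbps)
    return remaining >= peak_traffic_gbps


# Hypothetical example: two 10 Gbps connections in active/active.
print(survives_single_failure([10, 10], peak_traffic_gbps=8))
print(survives_single_failure([10, 10], peak_traffic_gbps=12))
```

In the second case both links together can carry 12 Gbps, but a single surviving 10 Gbps link cannot, which is the degraded state the practice is meant to avoid.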
Each Site-to-Site VPN connection should be designed with two tunnels, with each tunnel using a unique virtual private gateway public IP address. Configure both tunnels for redundancy, preferably using dynamic routing in an active/active setup. When one tunnel becomes unavailable (for example, because of maintenance or an unplanned outage), network traffic is automatically routed to the available tunnel for that specific Site-to-Site VPN connection.
Test your Direct Connect failover to help find any issues that could surface in production. Exercise these tests regularly to ensure that your configurations are appropriate for failovers, and verify the impact on workload during these tests. These tests help in validating your recovery procedures.
—
3. How do you select the best performing network VPN architecture?
Start by selecting the right VPN termination endpoint at the AWS end. There are four termination options at the AWS end. The VPN performance and scalability will vary based on which option you choose. For each option it’s important to understand the bandwidth and scalability characteristics.
- Termination at the virtual private gateway
- Termination on Transit Gateway
- Termination on a customer-managed EC2 instance running virtual VPN appliance
- Termination on a Direct Connect gateway
—
How do you monitor and scale your hybrid connectivity post launch to ensure that it is performing as expected?
Continual monitoring and tracking the performance of your network is important.
In the original design, you deploy your network and connectivity for an initial set of requirements. However, as your architecture grows and evolves, so must your design and configurations. As more applications start using your existing connectivity, it might no longer meet the new requirements, which can lead to poorly performing connectivity.
Amazon CloudWatch metrics, together with metrics from your on-premises devices and routers, can track the performance of your VPN and Direct Connect connections. You can then use these metrics to find the root cause of any issues and remediate them.
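As a sketch of how such metrics can be turned into a capacity signal, the following code converts bits-per-second samples (for example, datapoints of the Direct Connect `ConnectionBpsEgress` metric) into port utilization and flags sustained high usage. The sample values, the 80% threshold, and the three-datapoint window are assumptions for illustration only.

```python
def utilization(bps_samples, port_speed_gbps):
    """Convert bits-per-second samples into fractions of port capacity."""
    capacity_bps = port_speed_gbps * 1_000_000_000
    return [bps / capacity_bps for bps in bps_samples]


def sustained_high(bps_samples, port_speed_gbps, threshold=0.8, min_consecutive=3):
    """True if utilization stays at or above the threshold for at least
    min_consecutive datapoints in a row (both values are assumed policy)."""
    run = 0
    for value in utilization(bps_samples, port_speed_gbps):
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical egress datapoints in bits per second.
egress_bps = [4.0e8, 8.5e8, 9.0e8, 8.2e8, 3.0e8]

print(sustained_high(egress_bps, port_speed_gbps=1))
print(sustained_high(egress_bps, port_speed_gbps=10))
```

Requiring several consecutive datapoints above the threshold avoids reacting to short bursts, at the cost of detecting genuine saturation slightly later.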
—
Monitoring is imperative to the security of any network. Enable AWS CloudTrail and VPC Flow Logs to monitor all activities and traffic movement. CloudTrail records activities such as provisioning, configuring, and modifying Amazon VPC components. VPC Flow Logs records metadata about the traffic flowing in and out of your Amazon VPC for all resources. Additionally, you can set up AWS Config rules for any Amazon VPC resources whose configuration should not change.
—
Security
Identity and access management (IAM)–
How do you control access to your resources and workloads across your network?
Deploying your network requires the creation of constructs that must be controlled to prevent unauthorized access to your Amazon VPCs and services. To secure and control access, consider the roles and responsibilities of your teams managing and operating your workloads using the principle of least privilege. Isolate your networking services and implement separation of duties between the network specialists and application owners to allow the different teams to have required access to network services based on their roles.
AWS recommends implementing a landing zone, which is a preconfigured, secure, scalable, and multi-account AWS environment based on best practices. AWS Control Tower can automate the provisioning of your landing zone and accounts, help to manage the level of separation, and provide an initial set of guardrails to help enhance the security of your overall AWS environment.
Detective controls–
How are you capturing and analyzing metrics on your network?
Detective controls can be used to identify a potential security threat or incident. You can get detailed insights into your network performance and use that information to detect misconfigurations or potential malicious activity, and further optimize your deployment.
It is best practice to monitor and implement an immediate response process that detects and reacts to any suspicious or malicious activity. Monitoring workloads is important, especially when investigating a security incident.
Logs should be captured for network connections and private connections such as AWS Direct Connect. You can also use Amazon GuardDuty, a threat detection service that analyzes VPC Flow Logs and continuously monitors your workloads for malicious activity.
Create a central logging and analytics setup for your environment and implement detective controls using Amazon CloudWatch Logs, Amazon CloudWatch metrics, and Transit Gateway Route Analyzer.
Infrastructure protection–
How do you protect network resources?
Enforce boundary protection, monitor points of ingress and egress, and add comprehensive logging, monitoring, and alerting. Use multiple layers of defense:
- Security groups
- Network access control lists (network ACLs)
- AWS Transit Gateway route tables
- Gateway Load Balancer
- AWS Network Firewall
- Amazon Route 53 Resolver DNS Firewall
Data protection–
How will you provide support for encryption of customer data?
Encrypting sensitive data traffic for your hybrid networking workloads, whether you connect to AWS over the internet or over a private network connection, is important to ensure that an unauthorized person or entity is unable to gain access to your data.
Incident response–
How do you isolate a networking incident from a security incident that originates from your on-premises network?
Automate incident response rather than relying on manual processes to monitor your security posture and react to events. Automated responses improve on manual processes, reduce containment time, and prevent alert fatigue in incident response teams, leaving your human processes to handle the sensitive and unique incidents.
AWS Security Hub can help to detect security incidents and automate the response. Security Hub continuously monitors your environment using automated checks, and you can take action on the security findings with event-based automation tools such as AWS Lambda, AWS Step Functions, and AWS Config rules.
Follow the principle of least privilege.
For every resource you provision or configure in your Amazon VPC, follow the principle of least privilege. If a subnet has resources that do not need to access the internet, it should be a private subnet and should have routing based on this requirement. Security groups and network ACLs should have rules based on this principle. They should allow access for required traffic only.
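A least-privilege review of security group or network ACL rules can start with a check for sources that are open to any address. The sketch below uses a hypothetical, simplified rule shape (a `port` and a `cidr` per rule) rather than the structure returned by any AWS API.

```python
import ipaddress


def overly_permissive(rules):
    """Return the rules whose source CIDR covers every IPv4 or IPv6 address."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        if network.prefixlen == 0:  # 0.0.0.0/0 or ::/0
            flagged.append(rule)
    return flagged


# Hypothetical ingress rules for a subnet that should be private.
ingress_rules = [
    {"port": 443, "cidr": "10.0.0.0/16"},
    {"port": 22, "cidr": "0.0.0.0/0"},
    {"port": 3306, "cidr": "::/0"},
]

for rule in overly_permissive(ingress_rules):
    print(f"port {rule['port']} is open to {rule['cidr']}")
```

A check like this only finds the most obvious violations; rules that are narrower than "any address" but still wider than required need to be reviewed against the traffic the resource actually needs.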
Create one Amazon VPC for each environment: development, testing, and production. This adds more security to your resources and also reduces your blast radius, which is the impact on your environment if one of your Amazon VPCs goes down.
Use detective controls to identify potential security threats or incidents. In AWS, you can implement detective controls by processing logs, events, and monitoring that allows for auditing, automated analysis, and alarming.
CloudTrail logs AWS API calls, and CloudWatch provides monitoring of metrics with alarming. AWS Config provides configuration history. Amazon GuardDuty is a managed threat detection service that continuously monitors for malicious or unauthorized behavior to help you protect your AWS accounts and workloads. Service-level logs are also available; for example, you can use Amazon Simple Storage Service (Amazon S3) to log access requests.
—
Your network, and especially your critical workloads, require multiple layers of security.
You can secure an Amazon VPC much as you would your on-premises data center by:
- Using a web application firewall, a firewall virtual appliance, AWS Network Firewall, or other tools to secure your Amazon VPC.
- Securing your protocols from unauthorized use or intrusion by configuring intrusion detection systems and intrusion prevention virtual appliances.
- Creating accounts with granular IAM permissions, and auditing and monitoring administrator access to your Amazon VPC.
- Transferring information securely between Amazon VPCs in different Regions, or between an Amazon VPC and an on-premises data center, by configuring a Site-to-Site VPN. Another option to transfer information securely is to use AWS Transfer for Secure File Transfer Protocol (AWS SFTP). With AWS SFTP, you use VPC endpoints and avoid using public IP addresses or going through the internet. In addition, VPC endpoints for AWS SFTP leverage security functionality via AWS PrivateLink, which provides private connections between your VPCs and AWS services.
Cost Optimization
Practice Cloud Financial Management–
As a best practice, use Cloud Financial Management for cost optimization in networking. AWS offers multiple services and tools to help you optimize the cost of your environment.
- AWS Cost Explorer
- AWS Budgets
- AWS Cost and Usage Reports
- Reserved Instance Reporting
- Savings Plans
- AWS Application Cost Profiler
—
The second step towards understanding and reducing your AWS bill is to measure everything, discover your historical cost patterns, and visualize your spending trends. In the case of data transfer costs across your network, find out how much you are spending.
It is best practice to build a tagging strategy and use cost allocation tags to tag your resources inside your Amazon VPC and further group your spending by data transfer, Region, Availability Zone, NAT device, CloudFront, Direct Connect, VPN, specific services, and so on. This tagging strategy should be part of your planning phase. A good practice is to tag a resource immediately after it is created. After you have isolated these costs and observed the trend, you can establish a baseline for optimal network costs for your environment.
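The grouping step can be sketched as follows. The tag keys, resource names, and cost figures are hypothetical, and the line-item shape is a simplification for illustration rather than the format of any AWS billing report.

```python
from collections import defaultdict

# Hypothetical required cost allocation tag keys.
REQUIRED_TAGS = {"CostCenter", "Environment", "Owner"}


def missing_tags(resource_tags: dict) -> set:
    """Return the required tag keys that a resource does not carry."""
    return REQUIRED_TAGS - set(resource_tags)


def cost_by_tag(line_items, tag_key):
    """Sum cost per value of one tag key; untagged items are grouped together."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item["tags"].get(tag_key, "untagged")] += item["cost_usd"]
    return dict(totals)


# Hypothetical data transfer line items.
line_items = [
    {"resource": "nat-gateway-a", "tags": {"Environment": "prod"}, "cost_usd": 12.5},
    {"resource": "vpn-1", "tags": {"Environment": "dev"}, "cost_usd": 3.0},
    {"resource": "dx-vif-1", "tags": {}, "cost_usd": 1.5},
    {"resource": "nat-gateway-b", "tags": {"Environment": "prod"}, "cost_usd": 2.5},
]

print(cost_by_tag(line_items, "Environment"))
print(missing_tags({"Environment": "prod"}))
```

The "untagged" bucket is a useful signal in its own right: a growing share of untagged spend suggests that resources are not being tagged at creation time.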
—
Data transfer fees can be a hidden cost if not properly tracked. It is important to understand what drives your data transfer costs in order to optimize the cost of running your environment. Understand how data is transferred within AWS and how data is transferred over the internet.
If you transfer data within AWS or over the internet:
- Having a service in a single Availability Zone increases the risk of a service outage. Part of optimizing data transfers is weighing the cost of hosting your service in multiple Availability Zones, which adds high availability and reliability, against the risk of a service outage.
- Inter-Region data transfer fees are charged at the source Region rate. These rates include all the aggregate data transferred by services. For cost optimization, if you need a separate Region for disaster recovery, you can choose a less expensive Region to save costs on the data transfers.
- There are no Direct Connect data transfer costs into an AWS Region (data ingress). However, there is a charge for data transferred from a Region (data egress). Rates for data egress from a Region to an on-premises location over an AWS Direct Connect connection vary based on the source Region and the location of the destination. For example, comparing the US East (Ohio) and EU (Frankfurt) Regions, there is a $0.01 USD per GB difference in the cost of transferring data over a Direct Connect connection to a data center in Montreal. Data egress from Ohio to Montreal over Direct Connect is $0.02 USD per GB, but is $0.03 USD per GB from Frankfurt.
- When you transfer data between edge locations and Regions, there is no additional cost. However, edge locations are data centers that are part of Amazon CloudFront, Amazon’s content delivery network, and CloudFront itself does have a cost. CloudFront caches data such as videos, APIs, or applications for low-latency access.
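To see what the Ohio and Frankfurt egress rates quoted in the list above mean at volume, the following sketch applies them to a hypothetical 10,000 GB of monthly egress to Montreal. Only the two per-GB rates come from the text; the volume is an assumption, and current rates should be confirmed on the AWS pricing pages.

```python
from decimal import Decimal

# Per-GB Direct Connect egress rates to Montreal, as quoted in the text.
EGRESS_RATE_USD_PER_GB = {
    "US East (Ohio)": Decimal("0.02"),
    "EU (Frankfurt)": Decimal("0.03"),
}


def egress_cost(region: str, gigabytes: int) -> Decimal:
    """Egress charge in USD for the given volume from the given Region."""
    return EGRESS_RATE_USD_PER_GB[region] * gigabytes


monthly_gb = 10_000  # hypothetical monthly egress volume

ohio = egress_cost("US East (Ohio)", monthly_gb)
frankfurt = egress_cost("EU (Frankfurt)", monthly_gb)
print(f"Ohio: ${ohio}, Frankfurt: ${frankfurt}, difference: ${frankfurt - ohio}")
```

`Decimal` is used instead of floating point so that currency amounts are not affected by binary rounding.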
A best practice is to use:
- Private IP addresses for data transfers when possible instead of using a public IP address or an elastic IP address.
- CloudFront for an additional 20% to 40% savings over the standard data transfer rate.
Data transfer charges in AWS apply and are based on the source, destination, and amount of traffic.
Avoid routing traffic over the internet when connecting to AWS services from within AWS by using VPC endpoints:
- VPC gateway endpoints allow communication to Amazon S3 and Amazon DynamoDB without incurring data transfer charges.
- VPC interface endpoints are available for some AWS services. This type of endpoint incurs hourly service charges and data transfer charges.
Use Direct Connect instead of a VPN connection over the internet for sending data to on-premises networks. VPN charges can be higher. For Direct Connect, there are two billing elements: port hours and outbound data transfer. Port hour pricing is determined by connection type, Dedicated Connection or Hosted Connection, along with capacity. Data transfer out over AWS Direct Connect is charged per gigabyte.
Traffic that crosses an Availability Zone boundary typically incurs a data transfer charge. Use resources from the local Availability Zone whenever possible.
Traffic that crosses a Regional boundary will typically incur an inter-Region data transfer charge. Avoid cross-Region data transfer unless your business case requires it.
Use the AWS Pricing Calculator to help estimate the data transfer costs for your solution.
Use a dashboard to better visualize data transfer charges.
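As a rough complement to the AWS Pricing Calculator, the two Direct Connect billing elements described above (port hours and outbound data transfer) can be combined into a simple monthly estimate. Every rate and volume in this sketch is a hypothetical placeholder; actual port-hour pricing depends on connection type and capacity, and egress pricing depends on the source Region and destination.

```python
from decimal import Decimal


def direct_connect_monthly_cost(
    port_hour_rate_usd: Decimal,
    hours: int,
    egress_rate_usd_per_gb: Decimal,
    egress_gb: int,
) -> Decimal:
    """Estimate = port hours + outbound data transfer (ingress is not charged)."""
    port_charge = port_hour_rate_usd * hours
    transfer_charge = egress_rate_usd_per_gb * egress_gb
    return port_charge + transfer_charge


# Hypothetical inputs: placeholder rates, a 730-hour month, 10,000 GB out.
estimate = direct_connect_monthly_cost(
    port_hour_rate_usd=Decimal("0.30"),
    hours=730,
    egress_rate_usd_per_gb=Decimal("0.02"),
    egress_gb=10_000,
)
print(f"Estimated monthly cost: ${estimate}")
```

Because the port charge is fixed per hour, the per-GB cost of a connection falls as its utilization rises, which is worth weighing when comparing Direct Connect against a VPN for a given traffic volume.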
—