Deep Dive into Container Security on AWS Using the CAF and the Well-Architected Framework
The Amazon Web Services (AWS) Cloud Adoption Framework (CAF) provides guidance for coordinating the different parts of organizations migrating to cloud computing. The CAF guidance is broken into six focus areas relevant to implementing cloud-based IT systems. These focus areas are called perspectives, and each perspective is further separated into components. There is a whitepaper for each of the six CAF perspectives.
The components of the Security Perspective are:
· Directive controls establish the governance, risk, and compliance models the environment will operate within.
· Preventive controls protect your workloads and mitigate threats and vulnerabilities.
· Detective controls provide full visibility and transparency over the operation of your deployments in AWS.
· Responsive controls drive remediation of potential deviations from your security baselines.
The process of designing and implementing how different capabilities will work together represents an opportunity to quickly gain familiarity and learn how to iterate your designs to best meet your requirements. Learn from actual implementation early, then adapt and evolve using small changes as you learn. To help you with your implementation, you can use the CAF Security Epics. The Security Epics consist of groups of user stories (use cases and abuse cases) that you can work on during sprints.
The core five epics are the core control and capability categories that you should consider early on, because they are fundamental to getting your cloud adoption journey started.
Augmenting the core five epics are themes that drive continued operational excellence through availability, automation, and audit. You will want to judiciously integrate these epics into each sprint.
After this structure is defined, an implementation plan can be crafted. Capabilities change over time and opportunities for improvement will be continually identified. Multiple sprints will lead to increased maturity while retaining flexibility to adapt to business pace and demand.
An example sprint series could be a set of two-week sprints:
· Sprint 0 — Security cartography: compliance mapping, policy mapping, initial threat model review, establishing a risk registry; building a backlog of use and abuse cases; planning the security epics
· Sprint 1 — IAM; logging and monitoring
· Sprint 2 — IAM; logging and monitoring; infrastructure protection
· Sprint 3 — IAM; logging and monitoring; infrastructure protection
· Sprint 4 — IAM; logging and monitoring; infrastructure protection; data protection
· Sprint 5 — Data protection; automating security operations; incident response planning/tooling; resilience
· Sprint 6 — Automating security operations, incident response; resilience
The overall approach aims to define the MVP or baseline, then map the first sprint in each area. In the initial stages the end goal can be less defined, but a clear roadmap of the initial sprints is created. A key element is incorporating compliance validation into each sprint through security and compliance unit test cases, and then going through the promotion-to-production process. This approach can be more effective and more cost-efficient than a big-bang approach based on long timelines and high capital outlays.
The AWS Well-Architected Framework covers key concepts, design principles for architecting in the cloud, and five pillars. The Framework helps you understand the pros and cons of decisions you make while building systems on AWS. By using the Framework you will learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way for you to consistently measure your architectures against best practices and identify areas for improvement.
The pillars of the AWS Well-Architected Framework are:
· Operational Excellence: the ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.
· Security: the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.
· Reliability: the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
· Performance Efficiency: the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
· Cost Optimization: the ability to run systems to deliver business value at the lowest price point.
Security in the cloud is composed of five areas:
1. Identity and access management are key parts of an information security program, ensuring that only authorized and authenticated users are able to access your resources, and only in a manner that you intend.
2. Detective controls identify potential security threats or incidents. They are an essential part of governance frameworks and can be used to support a quality process, meet a legal or compliance obligation, and aid threat identification and response efforts.
3. Infrastructure protection encompasses control methodologies, such as defense in depth, necessary to meet best practices and organizational or regulatory obligations. Use of these methodologies is critical for successful, ongoing operations either in the cloud or on premises. Infrastructure protection is a key part of an information security program. It ensures that systems and services within your workload are protected against unintended and unauthorized access and potential vulnerabilities.
4. Data protection: before designing any system, foundational practices that influence security should be in place. For example, data classification provides a way to categorize organizational data based on levels of sensitivity, and encryption protects data by rendering it unintelligible to unauthorized access. These methods are important because they support objectives such as preventing financial loss or complying with regulatory obligations.
5. Incident response: even with extremely mature preventive and detective controls, your organization should still put processes in place to respond to and mitigate the potential impact of security incidents. The architecture of your workload strongly affects the ability of your teams to operate effectively during an incident, to isolate or contain systems, and to restore operations to a known good state. Putting the tools and access in place ahead of a security incident, then routinely practicing incident response through game days, will help ensure that your architecture can accommodate timely investigation and recovery.
Let us look at how we should approach container security: with defense in depth, a multi-layered approach.
1. Start with the first layer, the most valuable one: user data. This includes core business data and any kind of personally identifiable information (PII), subject to regulations such as the GDPR in Europe. Amazon Macie is a security service that uses machine learning to automatically discover, classify, and protect sensitive data in AWS.
2. The next layer is configuration data, especially sensitive configuration data such as passwords and API keys. A common mistake is pushing clear-text passwords into source control; AWS git-secrets prevents you from committing secrets and credentials into Git repositories.
3. Code is the next layer: sanitizing user input, static code analysis, and so on. Tools here include cfn-nag, cfn-lint, and TaskCat.
4. Dependencies are the next layer. Applications pull in many libraries, packages, frameworks, and runtimes, so security analysis and vulnerability checks of these dependencies are essential. Tools such as Snyk (which integrates with AWS) and the open source OWASP Dependency-Check perform software composition analysis that identifies project dependencies and checks whether there are any known, publicly disclosed vulnerabilities.
5. The next layer is the container layer. Containers have a runtime layer and standards for both the runtime and the image (the OCI standards); the principal idea is that container images are immutable and that containers share the host kernel. AWS integrates with partners such as Aqua Security and Qualys here.
6. Last but not least, containers need to run on a host. Is it a full-blown distribution or a container-optimized distribution, and what are the multi-tenancy requirements?
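For the configuration-data layer above, git-secrets can be wired into a repository roughly like this. This is a sketch following the git-secrets README, run from inside the repository you want to protect:

```shell
# Install the git-secrets hooks into the current repository
git secrets --install

# Register the built-in AWS patterns (access key IDs, secret access keys)
git secrets --register-aws

# Scan the working tree and the full history for anything that matches
git secrets --scan
git secrets --scan-history

# From now on, any commit matching a registered pattern is rejected
# by the pre-commit hook before it ever reaches the repository.
```

Because the check runs as a local Git hook, secrets are caught before they leave the developer's machine rather than after they land in a shared repository.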
Let's see what we have in the AWS toolbox so that, used correctly, it helps us securely build and operate these containerized microservices and applications.
Let us not forget that this is a shared responsibility model: security and compliance are shared between AWS and the customer. This shared model can help relieve the customer's operational burden, as AWS operates, manages, and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the service operates. The customer assumes responsibility for and management of the guest operating system (including updates and security patches), other associated application software, and the configuration of the AWS-provided security group firewall. Customers should carefully consider the services they choose, as their responsibilities vary depending on the services used, the integration of those services into their IT environment, and applicable laws and regulations. The nature of this shared responsibility also provides the flexibility and customer control that permits flexible deployments.
There are two parts. Security of the cloud is what AWS looks after: the foundational services such as compute, storage, databases, and networking. Then there are the security measures that we, as users of the cloud, have to take care of, known as security in the cloud. For example, as in the defense-in-depth layers above, you are responsible for your customer data, such as the core business data handled by an application running in a container.
Let us look at the managed container service offerings on AWS. There are four layers:
· Image registry: Amazon ECR is a container image repository, similar to Docker Hub.
· Compute engine: with Amazon ECS, containers run on both Linux and Windows, on EC2 instances or clusters you manage. There is also the AWS Fargate launch type, a compute engine for Amazon ECS that allows you to run containers without having to manage servers or clusters. Amazon EKS is a managed service that makes it easy to run Kubernetes on AWS without needing to install and operate your own Kubernetes clusters. Currently EKS supports Linux workloads; Windows workload support is in public preview.
· Orchestration: deployment, scheduling, scaling, and management of containerized applications, typically done by Amazon ECS and Amazon EKS.
· Higher-level services: the first in this category is the service mesh, an infrastructure layer for microservices architectures. It handles communication concerns between services, making that communication more visible (or "observable") and manageable. More specifically, it can handle things like service discovery, routing and load balancing, security (e.g., encryption, TLS, authentication, authorization), and error handling such as retries and circuit breaking. AWS App Mesh is a service mesh based on the Envoy proxy that makes it easy to monitor and control microservices. App Mesh standardizes how your microservices communicate, giving you end-to-end visibility and helping to ensure high availability for your applications. It gives you consistent visibility and network traffic controls for every microservice in an application, and it supports microservices applications that use service discovery naming for their components. To use App Mesh, you must have an existing application running on AWS Fargate, Amazon ECS, Amazon EKS, Kubernetes on AWS, or Amazon EC2.
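For the image registry layer, pushing a locally built image to Amazon ECR might look like the following sketch; the account ID, region, and repository name are placeholders you would replace with your own:

```shell
# Placeholders: substitute your own account ID and region
REGISTRY=<account-id>.dkr.ecr.<region>.amazonaws.com

# Authenticate the Docker client against ECR (the token is valid for 12 hours)
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin "$REGISTRY"

# Create the repository with scan-on-push enabled
aws ecr create-repository --repository-name my-app \
  --image-scanning-configuration scanOnPush=true

# Tag the locally built image and push it
docker tag my-app:latest "$REGISTRY/my-app:latest"
docker push "$REGISTRY/my-app:latest"
```

Enabling `scanOnPush` means every image version is checked for known CVEs as it enters the registry, before anything downstream can deploy it.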
A modern microservice-based application typically runs in virtualized or containerized environments where the number of instances of a service and their locations change dynamically. Consequently, you must implement a mechanism that enables the clients of a service to make requests to a dynamically changing set of ephemeral service instances.
Service discovery is the automatic detection of devices and offered services over a network. Many modern microservice-based applications are built using various types of cloud resources and deployed on dynamically changing infrastructure. AWS Cloud Map is a cloud resource discovery service. Cloud Map enables you to name your application resources with custom names, and it automatically updates the locations of these dynamically changing resources. This increases your application availability because your applications always discover the most up-to-date locations of their resources. You can use AWS Cloud Map service registry selectors with AWS App Mesh. This option allows you to define a subset of endpoints (through matching on key and value selectors) that were defined in Cloud Map. You simply change your App Mesh VirtualNode configuration to use Cloud Map, add the selectors for the subset of service endpoints you want the VirtualNode to represent, and register your running service (IP addresses and metadata) with the corresponding Cloud Map service name, namespace, and key and value metadata. When routing traffic to that virtual node, App Mesh will route to the endpoints that match the Cloud Map key and value selectors you configured.
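As a rough sketch of that registration flow with the AWS CLI (the namespace and service names are illustrative, and the IDs in angle brackets come back from the preceding calls):

```shell
# Create a private DNS namespace for service discovery inside a VPC
aws servicediscovery create-private-dns-namespace \
  --name demo.local --vpc <vpc-id>

# Create a service under that namespace
aws servicediscovery create-service \
  --name backend \
  --namespace-id <namespace-id> \
  --dns-config 'DnsRecords=[{Type=A,TTL=10}]'

# Register a running task's IP address, including key/value metadata
# that an App Mesh virtual node selector can later match on
aws servicediscovery register-instance \
  --service-id <service-id> \
  --instance-id task-1 \
  --attributes AWS_INSTANCE_IPV4=10.0.0.12,stage=canary
```

A virtual node configured with the selector `stage=canary` would then receive traffic only for the instances registered with that metadata.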
AWS best practices for building modern applications are:
· Create a culture of innovation by organizing into small DevOps teams
· Continually evaluate your security posture by automating security
· Componentize applications using microservices
· Update applications & infrastructure quickly by automating CI/CD
· Standardize and automate operations by modeling infrastructure as code
· Simplify infrastructure management with serverless technologies
· Improve application performance by increasing observability
Let us have a closer look at certain topics. We will not do a deep dive now, but will scratch the surface of these topics.
AWS IAM vs. Kubernetes RBAC
· For ECS: access control is fully managed through AWS IAM
· For EKS: you need to understand and configure both AWS IAM and Kubernetes RBAC
Role-based access control essentially defines which user or service account is allowed to carry out which operations on which resources. Although there are a number of tools available, there is a learning curve to creating the right recipe of roles and role bindings.
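As a small illustration of that recipe, a read-only role and its binding can be created imperatively with kubectl (the namespace, role, and service account names are made up for the example):

```shell
# A Role that can only read pods in the "web" namespace
kubectl create role pod-reader \
  --verb=get,list,watch \
  --resource=pods \
  --namespace=web

# Bind it to a service account so a CI job can watch pods but not change them
kubectl create rolebinding ci-pod-reader \
  --role=pod-reader \
  --serviceaccount=web:ci-bot \
  --namespace=web
```

On EKS, this Kubernetes-side RBAC sits alongside AWS IAM: IAM decides who can reach the cluster API at all, and RBAC decides what they may do once inside.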
AWS Security Groups vs. Kubernetes Network Policies
· For ECS: you need to understand and configure AWS VPC and Security Groups.
· For EKS: you need to understand and configure both AWS VPCs/Security Groups and Kubernetes Network Policies
Network policies can be supplied through third parties such as Cilium or Calico, which implement and enforce these policies.
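As a sketch of what such a policy looks like (the names and namespace are illustrative), this NetworkPolicy only admits traffic to the backend pods from frontend pods on port 8080; apply it with `kubectl apply -f backend-policy.yaml`:

```shell
cat > backend-policy.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF
```

Note that the policy only takes effect if the cluster runs a network plugin (such as Calico or Cilium) that enforces NetworkPolicy objects.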
Security Best Practices For Container Images
· Less is more (secure)
· No secrets in images
· One service per container
  - use sidecars within a task/pod
· Minimize the container footprint
  - include only what is needed at runtime
· Use known and trusted base images
  - standardize on official images from Docker Hub
· Scan the image for CVEs
· Specify USER in the Dockerfile (otherwise it runs as root)
· Assign unique and informative image tags
· Docker RBAC
Let us move on to the very beginning: how to deal with container images. Container images are essentially how the actual application is packaged up and distributed through the container registry to the actual runtime environment. Container images contain the DNA that makes up a container. If that DNA is contaminated, it affects all containers created from the same image. This is true for Docker images you create yourself, but the problem is much more severe when you use open source images from public repositories. In general, less is more secure: try to put as little as possible into the image. Oftentimes you might start projects with a generic Docker container image, such as writing a Dockerfile with FROM node as your "default". However, when specifying the node image, you should take into consideration that a fully installed Debian Stretch distribution is the underlying image used to build it. If your project doesn't require general system libraries or system utilities, it is better to avoid using a full-blown operating system (OS) as a base image. There is a trade-off: if you need to troubleshoot a minimal image, you may not have basics such as binaries you can use to debug what is going on. For example, when you use a generic and popularly downloaded node image such as docker pull node, you are actually introducing an OS into your application that is known to have 580 vulnerabilities in its system libraries.
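A minimal sketch of the difference, assuming a Node.js service (the version tag is illustrative); build it with `docker build -t my-app .`:

```shell
cat > Dockerfile <<'EOF'
# Instead of the full Debian-based default:
#   FROM node
# prefer a minimal variant that omits general system libraries and utilities:
FROM node:18-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
CMD ["node", "server.js"]
EOF
```

The alpine variant ships far fewer system packages than the Debian-based default, which shrinks both the image size and the number of libraries that can carry vulnerabilities.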
In order to keep container images reusable and secure, keep them clean of sensitive information. Storing secrets such as tokens, passwords, and API keys in container images can grant access to unauthorized personnel. Storing this data in your application code can result in secrets being pushed to Git repositories and exposed to the public.
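One way to keep build-time credentials out of image layers is a BuildKit secret mount; the secret id and token file path below are assumptions for the example. Build with `DOCKER_BUILDKIT=1 docker build -f Dockerfile.build-secret --secret id=npm_token,src=$HOME/.npm_token -t my-app .`:

```shell
cat > Dockerfile.build-secret <<'EOF'
# syntax=docker/dockerfile:1
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
# The token is mounted at /run/secrets/npm_token for this RUN step only;
# it is never written into any image layer.
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN=$(cat /run/secrets/npm_token) npm ci
EOF
```

Unlike a `COPY`ed credential file or a build argument, the mounted secret leaves no trace in the image history.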
Try to follow the pattern of one service per container. If you have dependencies that need to run together, split them: for example, one container that takes care of pulling some data from an S3 bucket, and a second container, such as a web server or application server, that reads the data and does something with it.
In the diagram (from Kubernetes docs), one container is a web server for files kept in a shared volume. A sidecar container updates the files from a remote source. The two processes are tightly coupled and share both network and storage and are therefore suited to being placed within a single Pod.
The sidecar container extends and works with the primary container. This pattern is best used when there is a clear difference between a primary container and any secondary tasks that need to be done for it. For example, a web server container (a primary application) that needs its logs parsed and forwarded to log storage (a secondary task) may use a sidecar container that takes care of the log forwarding. The same sidecar container can also be used elsewhere in the stack to forward logs for other web servers or even other applications.
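The pattern from the diagram can be sketched as a two-container pod sharing an emptyDir volume (the images and the remote URL are placeholders); apply it with `kubectl apply -f web-with-sync.yaml`:

```shell
cat > web-with-sync.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sync
spec:
  volumes:
    - name: content
      emptyDir: {}
  containers:
    # Primary container: serves whatever is in the shared volume
    - name: web
      image: nginx:1.25
      volumeMounts:
        - name: content
          mountPath: /usr/share/nginx/html
    # Sidecar: periodically refreshes the files from a remote source
    - name: sync
      image: alpine:3.19
      command:
        - sh
        - -c
        - while true; do wget -qO /data/index.html https://example.com; sleep 60; done
      volumeMounts:
        - name: content
          mountPath: /data
EOF
```

Because both containers live in one pod, they share the volume and network namespace, which is exactly why tightly coupled primary/sidecar pairs belong together.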
Because Docker requires root privileges, anyone with access to the Docker host and Docker daemon automatically gains full control of all related Docker containers and images. Attackers with root privileges can create and stop containers, remove or pull images, inject commands into running containers, and expose sensitive information. Docker RBAC offers access control with roles for users, teams, organizations, and service accounts, but this setup doesn't allow for much complexity. In DevOps organizations, developers, testers, and IT staff need access to the same containers at different points in the development pipeline. Some users need restricted access, while others need the ability to modify and manage containers. It can be complex to set up this type of variable access. Docker access management solutions help reduce Docker security issues by enabling granular RBAC management. Authorized access management solutions like Active Directory let you operate containers with minimal privileges and manage access across teams and development lifecycle stages.
Try to minimize the container footprint to include only what is needed at runtime; this is easier with multistage builds. If you can, use only known and trusted base images. For example, you can have a policy of only using official container images from Docker Hub. Make sure your base images are set up properly and patched regularly. You can maintain one base image per language, like Java, Python, and Node.js; in that case you have three base images to maintain. The authenticity of Docker images is a challenge. We put a lot of trust into these images, as we are literally using them as the container that runs our code in production. Therefore, it is critical to make sure the image we pull is the one pushed by the publisher, and that no party has modified it. Tampering may occur over the wire, between the Docker client and the registry, or by compromising the owner's registry account in order to push a malicious image. Sign and verify images to mitigate MITM attacks. Docker allows signing images and, by this, provides another layer of protection. To sign images, use Docker Notary. Notary verifies the image signature for you, and blocks you from running an image if the signature is invalid.
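A multistage build keeps the toolchain out of the runtime image; the sketch below assumes a Go service and a distroless base image. Build with `docker build -f Dockerfile.multistage -t my-app .`, and set `DOCKER_CONTENT_TRUST=1` in your environment so pulls and pushes are signature-verified via Notary:

```shell
cat > Dockerfile.multistage <<'EOF'
# Build stage: full toolchain, never shipped
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# Runtime stage: only the static binary ends up in the final image
FROM gcr.io/distroless/static
COPY --from=build /out/app /app
USER nonroot
ENTRYPOINT ["/app"]
EOF
```

The final image contains the compiled binary and almost nothing else: no shell, no package manager, no compiler, which drastically narrows what an attacker can do inside the container.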
Also, make sure these images are scanned for CVEs. Tools like Snyk can help you monitor, manage, and analyze every aspect of the container infrastructure. By scanning for vulnerabilities during the delivery lifecycle, you can prevent deployment of contaminated containers. Implementing complete lifecycle management ensures containers remain secure throughout all stages of development and deployment.
Unfortunately, many Dockerfiles do not define a user, which means the container unnecessarily runs as root at runtime. When a Dockerfile doesn't specify a USER, it defaults to executing the container as the root user, and in practice there are very few reasons why the container should have root privileges. When that namespace is mapped to the root user in the running container, the container potentially has root access on the Docker host. Having the application in the container run as root further broadens the attack surface and enables an easy path to privilege escalation if the application itself is vulnerable to exploitation. To minimize exposure, create a dedicated user and a dedicated group in the Docker image for the application, and use the USER directive in the Dockerfile to ensure the container runs the application with the least privileged access possible. If that specific user does not exist in the image, create it with instructions in the Dockerfile.
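A sketch of the dedicated-user approach on an Alpine-based image (the user and group names are arbitrary); build with `docker build -f Dockerfile.nonroot -t my-app .`:

```shell
cat > Dockerfile.nonroot <<'EOF'
FROM node:18-alpine
# Create a dedicated system group and user for the application
RUN addgroup -S app && adduser -S -G app app
WORKDIR /app
COPY --chown=app:app . .
# Everything from here on runs as the unprivileged user, not root
USER app
CMD ["node", "server.js"]
EOF
```

With `USER app` in place, even a compromised application process lacks root inside the container, closing off the easiest privilege escalation path.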
Make sure you assign unique and informative image tags. Each Docker image can have multiple tags, which are variants of the same image. The most common tag is latest, which represents the latest version of the image. Image tags are not immutable, and the author of an image can publish the same tag multiple times.
This means that the base image for your Docker file might change between builds. This could result in inconsistent behavior because of changes made to the base image.
There are multiple ways to mitigate this issue:
- Prefer the most specific tag available. If the image has multiple tags, such as :8, :8.0.1, or even :8.0.1-alpine, prefer the latter, as it is the most specific image reference. Avoid the most generic tags, such as latest. Keep in mind that even a pinned specific tag might eventually be deleted.
- To mitigate the issue of a specific image tag becoming unavailable and becoming a show-stopper for teams that rely on it, consider running a local mirror of this image in a registry or account that is under your own control. It’s important to take into account the maintenance overhead required for this approach — because it means you need to maintain a registry. Replicating the image you want to use in a registry that you own is good practice to make sure that the image you use does not change.
- Be very specific! Instead of pulling a tag, pull an image using its SHA256 digest, which guarantees you get the same image on every pull. Note, however, that pinning a digest carries its own risk: if the image is updated, the old digest may no longer be available.
You can also have a dual strategy: for example, a GitHub-based model in which the Git commit hash is the default tag, and then, for important major releases that you plan to promote to production, you manually add a tag such as v0.1.
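The options above, sketched with the node image (the digest shown is a placeholder for one you have verified yourself):

```shell
# Most specific tag available
docker pull node:18.19.1-alpine

# Stricter still: pin by digest, which is immutable for as long as it exists
docker pull node@sha256:<digest-you-verified>

# Recover the digest of an image you already pulled
docker inspect --format '{{index .RepoDigests 0}}' node:18.19.1-alpine
```

Digest pins belong in deployment manifests and lock files; human-readable tags remain useful for communicating intent between teams.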
Let us move on to runtime container security. Containers can be monitored using tools like Amazon CloudWatch, AWS X-Ray, Scout, Datadog, and Prometheus. Monitoring systems can help you identify attacks, send alerts, and even automatically implement fixes. Periodically review log data generated by containers and use it to generate preventive security insights. You can also protect against zero-day vulnerabilities not yet in a CVE database via partner products such as Aqua Security and Twistlock. The main points in runtime container security are:
· Limit what can execute within container(s) via rules engine
· Ensure only trusted images can be deployed/run in your cluster
· Get visibility into the runtime behavior of the entire environment
· Detect vulnerable running containers as soon as a CVE is made public
Then there are advanced policies and auditing. There can be different motivations for these, such as compliance reasons, regulatory requirements, and industries like financial services or health care where you simply have to follow certain rules; business workflows may also require you to enforce advanced policies. There are pod security policies that essentially allow enforcing a security context throughout all of the pods in a cluster: for example, not allowing anyone to run pods as root, or applying certain mandatory access control policies. Then there is the class of general-purpose advanced policies; the best example here is OPA. Open Policy Agent (OPA) is a general-purpose policy engine with uses ranging from authorization and admission control to data filtering. The AWS Blog covers deploying OPA into an Amazon Elastic Container Service for Kubernetes (EKS) cluster and implementing a check that only allows images from your own Amazon Elastic Container Registry (ECR) or the EKS ECR repository. Finally, there is the class of so-called software supply chain policy and policy enforcement controls; two examples are in-toto and Grafeas. However, software supply chain policy is still in its early days.
The CIS Docker Benchmark provides prescriptive guidance for establishing a secure configuration posture for Docker Engine — Community version 18.09 and Docker Enterprise 2.1. Host configuration recommendations include:
- Create a separate partition for containers
- Harden the container host
- Update your Docker software on a regular basis
- Manage Docker daemon access authorization wisely
- Configure your Docker files and directories, and audit all Docker daemon activity.
Docker Daemon Configuration
- Restrict network traffic between default bridge containers and access to new privileges from containers.
- Enable user namespace support, authorization for Docker client commands, and live restore, and confirm default cgroup usage
- Disable legacy registry operations and Userland Proxy
- Avoid networking misconfiguration by allowing Docker to make changes to iptables, and avoid experimental features in production.
- Configure TLS authentication for Docker daemon and centralized and remote logging.
- Set the logging level to ‘info’, and set an appropriate default ulimit
- Don't use insecure registries or the aufs storage driver
- Apply a base device size for containers and a daemon-wide custom seccomp profile to limit syscalls.
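Several of the recommendations above land in the daemon configuration file. A sketch of such a file follows (place it at /etc/docker/daemon.json and restart the daemon; the ulimit values are illustrative):

```shell
cat > daemon.json <<'EOF'
{
  "icc": false,
  "userns-remap": "default",
  "no-new-privileges": true,
  "live-restore": true,
  "userland-proxy": false,
  "log-level": "info",
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 1024, "Hard": 4096 }
  }
}
EOF
# Then: sudo cp daemon.json /etc/docker/daemon.json && sudo systemctl restart docker
```

Here `icc: false` restricts traffic between containers on the default bridge, `userns-remap` enables user namespace support, `no-new-privileges` blocks privilege escalation inside containers, and `live-restore` keeps containers running across daemon restarts.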
Container Images and Build File
- Create a user for the container
- Ensure containers use only trusted images
- Ensure unnecessary packages are not installed in the container
- Include security patches during scans and rebuilding processes
- Enable content trust for Docker
- Add HEALTHCHECK instructions to the container image
- Remove setuid and setgid permissions from the images
- Use COPY instead of ADD in Dockerfiles
- Install only verified packages
- Don't use update instructions alone in the Dockerfile; combine update and install in a single RUN instruction
- Don’t store secrets in Dockerfiles
Container Runtime
- Restrict containers from acquiring additional privileges and restrict Linux kernel capabilities.
- Enable AppArmor Profile.
- Avoid privileged containers at runtime, running ssh within containers, and mapping privileged ports within containers.
- Ensure sensitive host system directories aren’t mounted on containers, the container’s root file system is mounted as read-only, the Docker socket is not mounted inside any containers.
- Set appropriate CPU priority for the container, set the 'on-failure' container restart policy with a maximum of 5 retries, and open only necessary ports on the container.
- Apply per need SELinux security options, and overwrite the default ulimit at runtime.
- Don't share the host's network namespace, process namespace, IPC namespace, mount propagation mode, UTS namespace, or user namespaces.
- Limit memory usage per container and bind incoming container traffic to a specific host interface.
- Don’t expose host devices directly to containers, don’t disable the default SECCOMP profile, don’t use docker exec commands with privileged and user option, and don’t use Docker’s default bridge docker0.
- Confirm cgroup usage, use the PIDs cgroup limit, check container health at runtime, and always use the latest version of the image in docker commands.
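Many of the runtime items above map directly onto docker run flags. A hedged example invocation (the image and ports are illustrative):

```shell
docker run -d \
  --name web \
  --read-only \                          # root filesystem mounted read-only
  --tmpfs /tmp \                         # writable scratch space only where needed
  --cap-drop ALL \                       # drop every Linux capability ...
  --cap-add NET_BIND_SERVICE \           # ... then add back only what is required
  --security-opt no-new-privileges \     # block privilege escalation
  --restart on-failure:5 \               # 'on-failure' restart policy, max 5 retries
  --memory 256m \                        # limit memory usage
  --pids-limit 100 \                     # PIDs cgroup limit against fork bombs
  -p 127.0.0.1:8080:80 \                 # bind traffic to a specific host interface
  nginx:1.25-alpine
```

Each flag corresponds to one of the CIS runtime recommendations, so the list can serve as a checklist when reviewing existing `docker run` invocations or task definitions.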
Docker Security Operations
· Ensure image sprawl is avoided
· Ensure container sprawl is avoided