How ExpressVPN keeps its web servers patched and secure

ExpressVPN server rises from the ashes.

This article explains ExpressVPN’s approach to security patch management for the infrastructure running the ExpressVPN website (not the VPN servers). In general, our approach to security is:

  1. Make systems very difficult to hack.
  2. Minimize the potential damage if a system does get hacked, acknowledging that some systems cannot be made perfectly secure. Typically, this starts in the architectural design phase, where we minimize an application’s access.
  3. Minimize the amount of time that a system can remain compromised.
  4. Validate these points with regular pentests, both internal and external.

Security is ingrained in our culture and is the primary concern guiding all our work. There are many other topics, such as our secure software development practices, application security, and employee processes and training, but those are out of scope for this post.

Here we explain how we achieve the following:

  1. Ensure that all servers are fully patched and never more than 24 hours behind publicly disclosed CVEs.
  2. Ensure that no server is ever used for more than 24 hours, thus putting an upper limit on the amount of time that an attacker can have persistence.

We accomplish both goals with an automated system that destroys and rebuilds every server at least once every 24 hours, starting each rebuild from a fresh OS with all the latest patches.
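As an illustration, the trigger itself can be as simple as a daily scheduler entry; the path and playbook name below are hypothetical:

```
# Hypothetical crontab entry: run the rebuild playbook once a day.
0 4 * * * cd /opt/web-infrastructure && ansible-playbook rebuild.yml >> /var/log/rebuild.log 2>&1
```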

Our intent for this article is to be useful for other developers facing similar challenges and to give transparency into ExpressVPN’s operations to our customers and the media.

How we use Ansible playbooks and Cloudformation

ExpressVPN’s web infrastructure is hosted on AWS (as opposed to our VPN servers, which run on dedicated hardware), and we make heavy use of its features to make rebuilding possible.

Our entire web infrastructure is provisioned with Cloudformation, and we try to automate as many processes as we can. However, we find working with raw Cloudformation templates to be quite unpleasant due to the need for repetition, overall poor readability, and the constraints of JSON or YAML syntax.

To mitigate this, we use a DSL called cloudformation-ruby-dsl that enables us to write template definitions in Ruby and export Cloudformation templates in JSON.

In particular, the DSL allows us to write user data scripts as regular scripts which are converted to JSON automatically (and not go through the painful process of making each line of the script into a valid JSON string).

A generic Ansible role called cloudformation-infrastructure takes care of rendering the actual template to a temporary file, which is then used by the cloudformation Ansible module:


- name: 'render {{ component }} stack cloudformation json'
  shell: 'ruby "{{ template_name | default(component) }}.rb" expand --stack-name {{ stack }} --region {{ aws_region }} > {{ tempfile_path }}'
  args:
    chdir: ../cloudformation/templates
  changed_when: false

- name: 'create/update {{ component }} stack'
  cloudformation:
    stack_name: '{{ stack }}-{{ xv_env_name }}-{{ component }}'
    state: present
    region: '{{ aws_region }}'
    template: '{{ tempfile_path }}'
    template_parameters: '{{ template_parameters | default({}) }}'
    stack_policy: '{{ stack_policy }}'
  register: cf_result

In the playbook, we call the cloudformation-infrastructure role several times with different component variables to create several Cloudformation stacks. For example, we have a network stack that defines the VPC and related resources and an app stack that defines the Auto Scaling group, launch configuration, lifecycle hooks, etc.
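A sketch of what those invocations look like in the playbook (the include_role style and component values here are illustrative, not our exact playbook):

```yaml
- include_role:
    name: cloudformation-infrastructure
  vars:
    component: network

- include_role:
    name: cloudformation-infrastructure
  vars:
    component: app
```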

We then use a somewhat ugly but useful trick to turn the output of the cloudformation module into Ansible variables for subsequent roles. We have to use this approach since Ansible does not allow the creation of variables with dynamic names:


- include: _tempfile.yml

- copy:
    content: '{{ component | regex_replace("-", "_") }}_stack: {{ cf_result.stack_outputs | to_json }}'
    dest: '{{ tempfile_path }}.json'
  no_log: true
  changed_when: false

- include_vars: '{{ tempfile_path }}.json'
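Once include_vars has run, subsequent roles can read the stack outputs as ordinary variables. For example, assuming the network stack exports a VpcId output (an illustrative name), a later task could reference it like this:

```yaml
- debug:
    msg: 'the VPC is {{ network_stack.VpcId }}'
```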

Updating the EC2 Auto Scaling group

The ExpressVPN website is hosted on multiple EC2 instances in an Auto Scaling group behind an Application Load Balancer, which enables us to destroy servers without any downtime: the load balancer drains existing connections before an instance terminates.

Cloudformation orchestrates the entire rebuild, and we trigger the Ansible playbook described above every 24 hours to rebuild all instances, making use of the AutoScalingRollingUpdate UpdatePolicy attribute of the AWS::AutoScaling::AutoScalingGroup resource.

When the playbook is simply re-run without any changes, the UpdatePolicy attribute is not used; it is only invoked under the special circumstances described in the documentation. One of those circumstances is an update to the Auto Scaling launch configuration, a template that an Auto Scaling group uses to launch EC2 instances, which includes the EC2 user data script that runs on the creation of a new instance:


resource 'AppLaunchConfiguration', Type: 'AWS::AutoScaling::LaunchConfiguration',
  Properties: {
    KeyName: param('AppServerKey'),
    ImageId: param('AppServerAMI'),
    InstanceType: param('AppServerInstanceType'),
    SecurityGroups: [
      param('SecurityGroupApp'),
    ],
    IamInstanceProfile: param('RebuildIamInstanceProfile'),
    InstanceMonitoring: true,
    BlockDeviceMappings: [
      {
        DeviceName: '/dev/sda1', # root volume
        Ebs: {
          VolumeSize: param('AppServerStorageSize'),
          VolumeType: param('AppServerStorageType'),
          DeleteOnTermination: true,
        },
      },
    ],
    UserData: base64(interpolate(file('scripts/app_user_data.sh'))),
  }

If we make any change to the user data script, even to a comment, the launch configuration is considered changed, and Cloudformation updates all instances in the Auto Scaling group to comply with the new launch configuration.

Thanks to cloudformation-ruby-dsl and its interpolate utility function, we can use Cloudformation references in the app_user_data.sh script:

readonly rebuild_timestamp="{{ param('RebuildTimestamp') }}"

This procedure ensures our launch configuration is new every time the rebuild is triggered.
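A sketch of how the playbook might feed that timestamp in, using the template_parameters variable of the role shown earlier (the fact sources and values here are assumptions):

```yaml
- include_role:
    name: cloudformation-infrastructure
  vars:
    component: app
    template_parameters:
      RebuildTimestamp: '{{ ansible_date_time.iso8601 }}'
```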

Lifecycle hooks

We use Auto Scaling lifecycle hooks to make sure our instances are fully provisioned and pass the required health checks before they go live.

Using lifecycle hooks allows us to have the same instance lifecycle both when we trigger the update with Cloudformation and when an auto-scaling event occurs (for example, when an instance fails an EC2 health check and gets terminated). We don’t use cfn-signal and the WaitOnResourceSignals auto-scaling update policy because they are only applied when Cloudformation triggers an update.

When the Auto Scaling group launches a new instance, the EC2_INSTANCE_LAUNCHING lifecycle hook is triggered, and it automatically puts the instance into a Pending:Wait state.

After the instance is fully configured, it starts hitting its own health check endpoints with curl from the user data script. Once the health checks report the application to be healthy, we issue a CONTINUE action for this lifecycle hook, so the instance gets attached to the load balancer and starts serving traffic.

If the health checks fail, we issue an ABANDON action which terminates the faulty instance, and the auto scaling group launches another one.
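A minimal, self-contained sketch of that loop. The endpoint, hook, and group names are assumptions, and complete_lifecycle() just prints where the real script would call "aws autoscaling complete-lifecycle-action":

```shell
#!/usr/bin/env bash
# Sketch of the health-check loop from the user data script.

complete_lifecycle() {  # $1 is CONTINUE or ABANDON
  # The real script would run:
  #   aws autoscaling complete-lifecycle-action \
  #     --lifecycle-hook-name app-launch-hook \
  #     --auto-scaling-group-name app-asg \
  #     --instance-id "$instance_id" \
  #     --lifecycle-action-result "$1"
  echo "lifecycle action: $1"
}

check_health() {
  # Hit the instance's own health endpoint (hypothetical port/path).
  curl -fsS 'http://localhost:8080/health' > /dev/null
}

wait_until_healthy() {
  local attempt
  for attempt in $(seq "$1"); do
    if check_health; then
      complete_lifecycle CONTINUE
      return 0
    fi
    sleep 1
  done
  complete_lifecycle ABANDON
  return 1
}

# Demo: stub the health check as passing so the sketch runs anywhere.
check_health() { true; }
result="$(wait_until_healthy 3)"
echo "$result"
```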

Besides failing to pass health checks, our user data script may fail at other points—for example, if temporary connectivity issues prevent software installation.

We want the creation of a new instance to fail as soon as we realize that it will never become healthy. To achieve that, we set an ERR trap in the user data script together with set -o errtrace to call a function that sends an ABANDON lifecycle action so a faulty instance can terminate as soon as possible.
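The pattern looks roughly like this. It is a self-contained sketch: abandon() just prints where the real script would send the ABANDON lifecycle action via the AWS CLI:

```shell
#!/usr/bin/env bash
# Sketch of the ERR-trap failure path in the provisioning script.

provision_script='
set -o errexit -o errtrace
abandon() { echo "sending ABANDON lifecycle action"; exit 1; }
trap abandon ERR
echo "installing packages"
false                 # simulate a provisioning step that fails
echo "never reached"
'

status=0
output="$(bash -c "$provision_script")" || status=$?
printf '%s\n' "$output"
echo "provisioning exited with status $status"
```

Because errtrace is set, the ERR trap fires for failures inside functions and subshells too, so any failing step terminates the instance quickly instead of leaving it half-provisioned.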

User data scripts

The user data script is responsible for installing all the required software on the instance. We’ve successfully used Ansible to provision instances and Capistrano to deploy applications for a long time, so we’re also using them here, allowing for the minimal difference between regular deploys and rebuilds.

The user data script checks out our application repository from GitHub, which includes the Ansible provisioning scripts, then runs Ansible and Capistrano pointed at localhost.

When checking out code, we need to be sure that the currently deployed version of the application is deployed during the rebuild. The Capistrano deployment script includes a task that updates a file in S3 that stores the currently deployed commit SHA. When the rebuild happens, the system picks up the commit that is supposed to be deployed from that file.
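A sketch of that pinning step. Here a local file stands in for the S3 object so the sketch runs anywhere; the real script would fetch it with something like "aws s3 cp", and the bucket, key, and SHA below are assumptions:

```shell
#!/usr/bin/env bash
# Sketch: pin the rebuild to the commit SHA recorded at deploy time.

sha_file="$(mktemp)"
echo "3f9c2ab" > "$sha_file"            # pretend this came from S3

deploy_sha="$(cat "$sha_file")"
echo "rebuilding at commit $deploy_sha"
# git checkout "$deploy_sha"            # what the real user data script does
rm -f "$sha_file"
```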

Software updates are applied by running unattended-upgrade in the foreground with the unattended-upgrade -d command. Once complete, the instance reboots and starts the health checks.

Dealing with secrets

The server needs temporary access to secrets (such as the Ansible vault password), which are fetched from the EC2 Systems Manager Parameter Store. The server can only access secrets for a short duration during the rebuild. After they are fetched, we immediately replace the initial instance profile with a different one that only has access to the resources required for the application to run.
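A sketch of that flow. The parameter and profile names are assumptions, and a stub aws() stands in for the real AWS CLI so the sketch runs anywhere:

```shell
#!/usr/bin/env bash
# Sketch: fetch secrets, then drop to a restricted instance profile.

aws() {  # stub for the AWS CLI
  case "$*" in
    'ssm get-parameter'*) echo 'example-vault-pass' ;;
    'ec2 replace-iam-instance-profile-association'*) echo 'profile swapped' ;;
  esac
}

# 1. Fetch the vault password; it lives only in a shell variable.
vault_password="$(aws ssm get-parameter --name /app/ansible-vault-password \
  --with-decryption --query Parameter.Value --output text)"

# 2. Drop privileges: swap to a profile with runtime-only access.
swap_result="$(aws ec2 replace-iam-instance-profile-association \
  --association-id assoc-0123456789 \
  --iam-instance-profile Name=app-runtime-profile)"

echo "$swap_result"
```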

We want to avoid storing any secrets on the instance’s disk. The only secret we save to disk is the GitHub SSH key, but not its passphrase. We don’t save the Ansible vault password, either.

However, we still need to pass these passphrases to SSH and Ansible respectively, and both tools only accept them in interactive mode (i.e., the utility prompts the user to type the passphrase manually). This restriction exists for a good reason: if a passphrase is part of a command line, it is saved in the shell history and is visible to all users on the system via ps. We use the expect utility to automate the interaction with those tools:


expect << EOF
cd ${repo_dir}
spawn make ansible_local env=${deploy_env} stack=${stack} hostname=${server_hostname}
set timeout 2
expect "Vault password"
send "${vault_password}\r"
set timeout 900
expect {
  "unreachable=0 failed=0" {
    exit 0
  }
  eof {
    exit 1
  }
  timeout {
    exit 1
  }
}
EOF

Triggering the rebuild

Since we trigger the rebuild by running the same Cloudformation script that is used to create/update our infrastructure, we need to make sure that we don’t accidentally update some part of the infrastructure that is not supposed to be updated during the rebuild.

We achieve this by setting a restrictive stack policy on our Cloudformation stacks so only the resources necessary for the rebuild are updated:


{
  "Statement" : [
    {
      "Effect" : "Allow",
      "Action" : "Update:Modify",
      "Principal": "*",
      "Resource" : [
        "LogicalResourceId/*AutoScalingGroup"
      ]
    },
    {
      "Effect" : "Allow",
      "Action" : "Update:Replace",
      "Principal": "*",
      "Resource" : [
        "LogicalResourceId/*LaunchConfiguration"
      ]
    }
  ]
}

When we need to do actual infrastructure updates, we have to manually update the stack policy to allow updates to those resources explicitly.

Because our server hostnames and IPs change every day, we have a script that updates our local Ansible inventories and SSH configs. It discovers the instances via the AWS API by tags, renders the inventory and config files from ERB templates, and adds the new IPs to SSH known_hosts.
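A sketch of the discovery half of that script. The tag names and IPs are assumptions, and a stub aws() stands in for "aws ec2 describe-instances" so the sketch runs anywhere:

```shell
#!/usr/bin/env bash
# Sketch: discover rebuilt instances by tag and emit inventory lines.

aws() {  # stub: the real script queries the AWS API
  printf '203.0.113.10 203.0.113.11\n'
}

ips="$(aws ec2 describe-instances \
  --filters 'Name=tag:role,Values=web' 'Name=instance-state-name,Values=running' \
  --query 'Reservations[].Instances[].PublicIpAddress' --output text)"

inventory=""
for ip in $ips; do
  inventory="${inventory}web ansible_host=${ip}\n"
  # ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts   # record the new host key
done

printf '%b' "$inventory"
```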

ExpressVPN follows the highest security standards

Rebuilding servers protects us from a specific threat: attackers gaining access to our servers via a kernel/software vulnerability.

However, this is only one of many measures we take to keep our infrastructure secure; others include undergoing regular security audits and making critical systems inaccessible from the internet.

Additionally, we make sure that all of our code and internal processes follow the highest security standards.
