How our engineers improved VPN connection times
The success of a company isn't measured solely in dollars.
Sometimes, it's measured in seconds—or even fractions of a second.
This is the story of how ExpressVPN engineers worked together to dramatically reduce the time it takes for our customers to connect to the VPN—often to less than a second.
If you're like us—the kind of person who isn't satisfied with "good enough," and likes applying knowledge systematically to identify, analyze, and solve a real problem—this is for you! We’re hiring across Europe and Asia at both offices, and for some remote positions...
Why connection time matters
We recently improved how quickly the ExpressVPN app for iOS gets connected to the VPN, especially on networks that apply some traffic restrictions. This is particularly common at schools, workplaces, cafes, hotels, hospitals, airports, and also in places with censorship.
How an app performs in such situations highlights the differences in quality across VPN providers.
Here are the results in numbers:
The green bars, most prominent on the left, represent the new version of our app; the beige bars represent the old. As you can see, connection times have improved across the board. In fact, almost three-quarters of all connections now take 2 seconds or less, and nearly half take less than a second. Even more remarkably, the share of connections that take more than 15 seconds (most likely to occur in high-censorship environments) has plunged from 10% to almost nothing.
It starts with a problem...
Let’s start by framing the problem: We knew from anonymous analytics and by speaking with customers that there were many times when the VPN took a long time to get connected. This was unacceptable to us. It was time to get solving.
Ideally all connection times would be less than a second, yet in some situations it took as long as 30 seconds. Imagine walking out of an elevator where you lost a signal, pulling out your phone to try to message a friend, yet needing to wait 30 seconds before the message went out.
We want to make it easy to keep ExpressVPN “always on,” enjoy all the benefits of protection, and not experience any downsides. If we add 30 seconds of offline-time, clearly that’s a problem in need of fixing.
Even with 30 seconds, we were already doing better than many competitors, who fail to connect at all in several such situations. For example, try using a Wi-Fi network that blocks UDP from reaching the internet, as is common at universities and with many public Wi-Fi networks. Many other VPNs would fail to connect entirely.
Step 1: Form a team and brainstorm
We decided to create a team with the goal of speeding up connection times. At ExpressVPN, we organize ourselves in cross-functional teams aligned to objectives. We think about what skills are needed to accomplish the objective, then assemble the right people to join that team.
The team started with a brainstorm to identify metrics that could help us track our progress, root causes of slow connection times, and solutions we might want to split test.
Through several spikes (i.e., time-boxed research efforts), we realized that a fundamental reason for slow VPN connection times was the fact that there are many different ways of setting up a VPN, yet sometimes only a few specific ways can actually work. VPN apps need to be fast at finding the appropriate method for each situation. That especially matters in networks that are configured with some restrictions.
Previously, ExpressVPN’s app would iteratively try different ways of connecting to the VPN until finally finding one that worked. When a network is permissive, the first attempt will work and thus the connection is set up quickly, but on restricted networks this may take many attempts.
We realized that one fast way to find the winning combination was simply to try them all in parallel, then pick the one that gets connected first. In fact, our open-source Lightway VPN protocol makes such approaches relatively easy to engineer, as we confirmed through a prototype. We refer to this method as parallel connections.
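The idea of racing connection attempts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the app's or Lightway's actual code: the `attempt` coroutine is a hypothetical stand-in for a real handshake, and the candidate labels and delays are made-up values that mimic a network where UDP is dropped and TCP gets through.

```python
import asyncio

# Hypothetical candidates: (label, seconds until the attempt resolves, succeeds?).
# Delays are scaled down so the example runs quickly.
CANDIDATES = [
    ("udp/port-a", 0.05, False),  # blocked: fails only after a timeout
    ("udp/port-b", 0.05, False),  # blocked: fails only after a timeout
    ("tcp/port-c", 0.01, True),   # allowed: handshake completes quickly
]


async def attempt(label, delay, succeeds):
    """Simulated connection attempt standing in for a real handshake."""
    await asyncio.sleep(delay)
    if not succeeds:
        raise ConnectionError(f"{label} timed out")
    return label


async def connect_parallel(candidates):
    """Start every attempt at once and return the first one that succeeds."""
    tasks = [asyncio.create_task(attempt(*c)) for c in candidates]
    try:
        pending = set(tasks)
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise ConnectionError("every attempt failed")
    finally:
        for task in tasks:
            task.cancel()  # stop the attempts that lost the race


winner = asyncio.run(connect_parallel(CANDIDATES))
print(winner)
```

Failed attempts are skipped rather than treated as fatal, so the race only ends in an error when every candidate has failed. A pure race like this has no preference between transports; a production implementation would presumably need some way to favor UDP where it works, which this sketch does not attempt.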
Step 2: Run a test case
We generally practice test-driven development. We first write the test to highlight the problem, then write the feature to make the test pass.
In this instance, our main test case created a network that blocked UDP packets from reaching the internet. Schools, workplaces, hotels, and other networks are sometimes configured this way: UDP is allowed on the local network (such as for DNS) but cannot reach the internet. That means a VPN app needs to quickly either determine which UDP ports it can use to reach the internet or decide to fall back to TCP.
Running a VPN over TCP can have downsides when the network experiences high rates of packet loss (quite common on mobile networks), so we generally try to avoid TCP unless it’s really necessary. Lightway supports both UDP and TCP, whereas many competing protocols, including WireGuard, may only use UDP.
We were happy to see that our test case failed: On the network created by the test case, it took our app about 30 seconds to get connected—far too long! (And when we tested other VPNs using non-Lightway protocols they failed to connect entirely.) Then we ran the test on our parallel connections prototype, and it reached success in less than a second—a remarkable outcome.
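To make the shape of such a test concrete, here is a small model in Python. It is a sketch under assumptions, not our real test harness: `UdpBlockedNetwork`, the 10-second timeout, the number of attempts, and the handshake times are all illustrative values, and time is computed arithmetically rather than measured on a real network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    transport: str      # "udp" or "tcp"
    handshake_s: float  # seconds to connect when the network lets traffic through


class UdpBlockedNetwork:
    """Simulated network that drops UDP to the internet but lets TCP through."""

    timeout_s = 10.0  # illustrative per-attempt timeout, not an actual app setting

    def run(self, attempt):
        """Return (succeeded, seconds spent) for a single attempt."""
        if attempt.transport == "udp":
            return False, self.timeout_s
        return True, attempt.handshake_s


def sequential_connect_time(network, attempts):
    """Try attempts one after another; time adds up until one succeeds."""
    elapsed = 0.0
    for attempt in attempts:
        succeeded, spent = network.run(attempt)
        elapsed += spent
        if succeeded:
            return elapsed
    return None  # nothing worked


def parallel_connect_time(network, attempts):
    """Try all attempts at once; the fastest success decides the time."""
    successes = [spent for ok, spent in map(network.run, attempts) if ok]
    return min(successes) if successes else None


def connects_within(budget_s, connect_time):
    """The check the test makes: did we connect, and within the budget?"""
    return connect_time is not None and connect_time <= budget_s


ATTEMPTS = [
    Attempt("udp", 0.3),
    Attempt("udp", 0.3),
    Attempt("udp", 0.3),
    Attempt("tcp", 0.5),
]
network = UdpBlockedNetwork()

old = sequential_connect_time(network, ATTEMPTS)  # three timeouts, then TCP
new = parallel_connect_time(network, ATTEMPTS)    # only the TCP handshake counts

print(connects_within(1.0, old))  # the failing ("red") case
print(connects_within(1.0, new))  # the passing ("green") case
```

In this model the sequential strategy pays for every timeout before it reaches a transport that works, while the parallel strategy pays only for the fastest successful handshake.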
Step 3: Define the goals
Confident that we knew how to solve this problem for our customers, we gave ourselves a specific measurable goal. We like to use the OKR (Objectives and Key Results) framework and track our goals and progress in a tool called 7Geese. It looked like this:
Step 4: Story mapping
We then ran a story-mapping exercise to determine how we could deliver value to our users quickly in small weekly increments. This entailed drafting stories from various points of view:
A product manager would like to control how many users can see a preview-version of the feature, determine who participates in a split test, and judge winners of split tests.
An operations analyst needs to decide how many VPN servers and network resources to buy (which parallel connections might influence).
Engineers need to know that when they change code in this feature it still meets important requirements about response time, RAM usage, and stability.
For the end user, scenarios include connecting for the first time, reconnecting, experiencing failure, and being based in a country with censorship where our app needs to employ extra techniques. We also explored how to avoid using TCP unnecessarily when UDP is still possible.
The team drafted the stories and prioritized them. They then discussed details and reached a shared understanding of what we were really trying to build and how we’d judge it to be “done.” We find team-based story refinements a great way to “shift left” on quality: Compared with having someone write a story by themselves in isolation and then passing it to others to implement, collaborative refinement is not only far more fun—it also leads to fewer bugs.
Step 5: Delivery
This team chose to follow Kanban principles for its delivery mechanics. Teams at ExpressVPN typically choose either Scrum or Kanban, depending on their preferences. The team agreed on a cycle-time SLA of five days for any ticket, and tracked metrics for throughput and cycle time.
Here’s an example of a Tableau dashboard the team used to visualize metrics about its Jira tickets:
Notice the inverse relationship between cycle time and throughput, and how the team’s throughput increased over time as it found ways to execute more efficiently. In bi-weekly retros, the team looked through its OKRs and delivery metrics, which triggered productive discussions about how to improve.
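For readers unfamiliar with these two metrics, the following Python sketch shows one way they could be computed from ticket dates. The ticket keys and dates are hypothetical, and the choice to count calendar days (rather than working days) and to group throughput by ISO week is an assumption made for the example, not a description of the team's dashboard.

```python
from collections import Counter
from datetime import date

SLA_DAYS = 5  # the cycle-time SLA the team agreed on

# Hypothetical tickets: (key, work started, work finished).
TICKETS = [
    ("T-1", date(2021, 3, 1), date(2021, 3, 3)),
    ("T-2", date(2021, 3, 1), date(2021, 3, 8)),
    ("T-3", date(2021, 3, 4), date(2021, 3, 9)),
    ("T-4", date(2021, 3, 8), date(2021, 3, 10)),
]

# Cycle time: calendar days from start to finish for each ticket.
cycle_days = {key: (done - started).days for key, started, done in TICKETS}

# Tickets that took longer than the SLA allows.
sla_breaches = [key for key, days in cycle_days.items() if days > SLA_DAYS]

# Throughput: number of tickets finished in each ISO week.
throughput = Counter(done.isocalendar()[1] for _, _, done in TICKETS)

average_cycle = sum(cycle_days.values()) / len(cycle_days)

print(cycle_days)
print(sla_breaches)
print(dict(throughput))
print(average_cycle)
```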
Feature flagging really helped the team deliver quickly, as we could merge code to master whenever any specific story was done, even if we didn’t yet want many users to see that feature. Plus, if we discovered a bug after merging, we’d still go ahead with the release and simply turn off the feature flag, which ensured we’d be less likely to slow down a release train of our client-side apps and thus impact other teams.
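A feature flag of this kind can be very small. The Python sketch below is an assumption about how such a flag might look, not our actual flagging system: the flag fields (`enabled`, `rollout_percent`), the user IDs, and the hash-based bucketing are all hypothetical choices made for illustration.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket in the range 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag, user_id):
    """Decide whether this user should see the flagged feature."""
    if not flag["enabled"]:  # kill switch: turns the feature off for everyone
        return False
    return bucket(user_id, flag["name"]) < flag["rollout_percent"]


def connect_strategy(flag, user_id):
    """Merged code can ship dark: users outside the flag keep the old behavior."""
    return "parallel" if is_enabled(flag, user_id) else "sequential"


flag = {"name": "parallel_connections", "enabled": True, "rollout_percent": 1}

# Roughly 1% of a hypothetical user population gets the new behavior.
exposed = sum(is_enabled(flag, f"user-{i}") for i in range(10_000))
print(exposed)

# Flipping the kill switch disables the feature without a new release.
flag_off = dict(flag, enabled=False)
print(connect_strategy(flag_off, "user-1"))
```

Hashing the flag name together with the user ID keeps each user's assignment stable between app launches, while letting different flags select different groups of users.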
Once we had delivered enough stories to be worth putting parallel connections to a split test with customers, we started at a small scale with willing customers who had signed up for our beta program.
Very promising results quickly followed: Parallel connections shifted the 95th percentile of connection times from 33 to 8.5 seconds. We made some more improvements, then turned on the feature for more customers, though still only 1% and chosen randomly. Again, very promising results, and we repeated this cycle until we turned it on confidently for all customers.
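For reference, a 95th percentile is the connection time that 95% of connections meet or beat. The Python sketch below computes it with the nearest-rank method; the method is one common definition chosen for the example, and the sample values are invented for illustration and are not our measurements.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 0 and 100.

    Returns the smallest sample such that at least p% of all samples
    are less than or equal to it.
    """
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceiling of p% of n, in integers
    return ordered[max(rank, 1) - 1]


# Invented connection times in seconds, 20 samples per variant.
before = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5, 2.0, 2.5,
          3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 15.0, 20.0, 30.0, 40.0]
after = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9,
         0.9, 1.0, 1.2, 1.5, 1.8, 2.0, 3.0, 5.0, 8.0, 12.0]

p95_before = percentile(before, 95)
p95_after = percentile(after, 95)
print(p95_before, p95_after)
```

Tracking a high percentile rather than the average keeps attention on the slowest connections, which are the ones that happen on restricted networks.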
Like how we work?
We hope you enjoyed these insights on how a team at ExpressVPN operates. If this sounds like the type of place you’d like to work, please get in touch!
We’re hiring across all functions for all levels of experience, across Europe and Asia at offices in London, Poznan, Hong Kong, and Singapore, as well as some remote positions in European or Asian time zones.