At the start of April, I attended JAX DevOps in London with a view to better understanding how we can improve our infrastructure and processes, ensuring the quality of our products whilst cutting the time it takes to go from concept to release. I spent the first day in a workshop presented by Gianluca Arbezzano, which focused on Docker in production.
Docker is relatively new to us. It’s been hard to ignore the overwhelming drive towards containers, so when the limitations of our previous platform became apparent I decided to give Docker a try. We have now been using Docker for a few months, but some limitations of Docker Cloud were starting to show.
This is where Docker Swarm comes in. Whilst there are various options for hosting Docker clusters, none seemed both affordable and able to fit in with our existing infrastructure. Utilising Amazon OpsWorks and Docker Swarm, I was able to create a new platform for My School Portal.
What stands out most about this, though, is just how little time it took. Less than a week passed between the first ideas being sketched out and all of our My School Portal infrastructure being moved over to the new Docker Swarm. Such a move is typically highly risky; however, a few key factors allowed us to make such a major change with minimal risk.
Firstly, we make extensive use of automated testing on every code change, no matter how small. This is not something that was put together overnight; in fact, it is the product of a couple of years' hard work building up our testing suite. Whilst it took a lot of effort to get ready, now that we have it we are able to make major changes whilst still being confident that the system works as intended. In turn, this allows us to carry out major rewrites of core code on a regular basis to improve performance. Without automated testing, such changes would typically require weeks if not months of testing and scrutiny before deployment.
Secondly, because of how our infrastructure is configured, rolling back to the old known-good infrastructure takes seconds. This does cost a little more in the short term, as we are currently running multiple clusters. However, the benefit of being able to quickly restore functionality should something unexpected happen far outweighs the temporary cost.
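To illustrate why keeping the old cluster running makes a rollback so quick, here is a minimal sketch in Python. It is a toy model rather than our actual configuration: the `Router` class, the cluster names and the endpoints are all hypothetical, and the router simply stands in for whatever directs traffic to a cluster (a DNS record or a load balancer, for example).

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy stand-in for whatever directs traffic to a cluster (hypothetical)."""

    clusters: dict  # cluster name -> endpoint (illustrative values only)
    active: str  # name of the cluster currently receiving traffic
    history: list = field(default_factory=list)  # previously active clusters

    def switch_to(self, name: str) -> None:
        """Send traffic to another cluster, remembering the current one."""
        if name not in self.clusters:
            raise KeyError(f"unknown cluster: {name}")
        self.history.append(self.active)
        self.active = name

    def roll_back(self) -> str:
        """Return traffic to the previously active cluster."""
        if not self.history:
            raise RuntimeError("no earlier cluster to roll back to")
        self.active = self.history.pop()
        return self.active

    def endpoint(self) -> str:
        """Endpoint that traffic is currently sent to."""
        return self.clusters[self.active]


# Both clusters stay running; only the pointer changes.
router = Router(
    clusters={
        "old": "old.example.internal",
        "swarm": "swarm.example.internal",
    },
    active="old",
)
router.switch_to("swarm")  # cut over to the new platform
router.roll_back()  # something went wrong: point back at the old one
```

Because nothing has to be rebuilt or redeployed, the rollback is only as slow as changing the pointer. The price is running both clusters side by side for a while.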
Thirdly, because we use Docker, there is a separation between the underlying servers and the functional code within the Docker containers. We test the containers and so can be confident that they function as intended. If the underlying infrastructure should fail, we are able to move that same container elsewhere and so restore functionality with a minimum of downtime. In a more traditional environment this would be impossible, as even provisioning a new server can take days. These servers would then require manual configuration to bring them into a state that allows them to replace the failed machines.
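The idea of moving the same tested container to another server can be sketched in a few lines of Python. This is a toy model of the principle only, not of how Docker Swarm actually schedules containers: the `reschedule` function, the node names and the image tags are all made up for illustration.

```python
def reschedule(placement, failed_node, healthy_nodes):
    """Return a new placement with containers moved off the failed node.

    placement maps a node name to the list of container images running on it.
    Each displaced image goes, unchanged, to the least-loaded healthy node.
    (Hypothetical helper for illustration; not a Docker API.)
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes available")
    if failed_node in healthy_nodes:
        raise ValueError("failed node cannot also be listed as healthy")

    # Start from what the healthy nodes are already running.
    new_placement = {node: list(placement.get(node, [])) for node in healthy_nodes}

    for image in placement.get(failed_node, []):
        # The image itself is untouched: the same tested artefact is reused.
        target = min(healthy_nodes, key=lambda node: len(new_placement[node]))
        new_placement[target].append(image)

    return new_placement


# Illustrative data only.
placement = {
    "node-a": ["web:1.4", "worker:1.4"],
    "node-b": ["web:1.4"],
    "node-c": [],
}
recovered = reschedule(placement, "node-a", ["node-b", "node-c"])
print(recovered)
```

The point of the sketch is that recovery never involves rebuilding or reconfiguring the application. The image that passed testing is the image that runs on the replacement node.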
In summary, thanks to our investment in how we test My School Portal, we are able to confidently make major changes without risking downtime. Additionally, by spending time investigating new technologies, we are able to embrace them where suitable to improve the service we give to customers.