SDC Final Conversation
Since the Midpoint conversation, the second half of this project consisted of stress testing and further optimizing the server infrastructure. Here are the two critical choices I made for my service.
Decision 1: Load balancing & scalability
Stress testing the service EC2 instance, which connects to a separate database EC2 instance, topped out at 550 RPS with a 0% error rate. Although decent, I knew that adding multiple service instances would help handle more incoming traffic. I implemented horizontal scaling using Nginx as a reverse proxy, PM2 to keep the Node processes alive, and AWS ELB for load balancing. Initially, I created additional service instances manually, using CPU utilization and error rate as my metrics: if an instance was overly stressed, error rates would jump and CPU utilization would spike. I found that keeping CPU usage under 60-70% per instance kept my error rate below 1% while letting me handle more RPS.
Ex.
2,500 RPS with 3 instances
6,500 RPS with 6 instances
10,000 RPS with 14 instances
I hit the milestones I set out for, but the tradeoff was time and cost. Spinning up instances manually took time, and keeping them all alive was expensive, but both were necessary to reach that level of performance.
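For context, each service instance runs the same per-instance stack: PM2 keeps the Node processes alive (something like pm2 start server.js -i max to run one worker per core), Nginx sits in front as the reverse proxy, and ELB spreads traffic across the instances. A minimal sketch of that per-instance Nginx config, with the port and names as placeholders rather than my exact file:

    # upstream points at the local PM2-managed Node process(es)
    upstream node_service {
        server 127.0.0.1:3000;   # assumed port for the Node service
    }

    server {
        listen 80;

        location / {
            proxy_pass http://node_service;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }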
To offset the potential cost, I implemented autoscaling by creating an AMI image of my service instance and setting a CPU utilization target between 30-45%. If incoming traffic stresses the service, the configuration spins up more instances as needed. Although this did not add any performance gains, it did prevent unnecessary billing costs and automated the monitoring of instances. If the service is idle or traffic is low, autoscaling shuts down the extra service instances.
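One way to sketch that autoscaling setup with the AWS CLI is a target-tracking policy that holds average CPU near the middle of the 30-45% band; the names, IDs, and sizes below are placeholders, not my literal configuration:

    # Auto Scaling group built from a launch template that references the service AMI
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name service-asg \
        --launch-template LaunchTemplateName=service-template,Version='$Latest' \
        --min-size 1 --max-size 14 \
        --vpc-zone-identifier "subnet-xxxxxxxx"

    # keep average CPU utilization around 40% across the group
    aws autoscaling put-scaling-policy \
        --auto-scaling-group-name service-asg \
        --policy-name keep-cpu-near-40 \
        --policy-type TargetTrackingScaling \
        --target-tracking-configuration '{
            "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
            "TargetValue": 40.0
        }'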
Decision 2: Caching layer & optimizations
With or without autoscaling, having up to 14 instances seemed excessive to me. I decided to take a step back and see if there were any additional areas for improvement. At this juncture, my database instance crapped out and I lost access to my 10 million records. I had to rebuild the database instance and reseed the data. A few hours later, the same thing happened to the new instance. Although incredibly frustrating and time consuming, something clicked in my head: from a user-experience standpoint, clients would be staring at a 500-504 status code right now.

I decided to add another level of fault tolerance by introducing a caching layer. Nginx has caching capabilities and was the practical choice since I was already using it as a reverse proxy. I configured the cache to serve stale content when the server is busy or the origin server is temporarily down. The performance gains were amazing because subsequent clients were getting the cached content (static and dynamic) directly from the cache rather than waiting for the server to return a response. The cache validity was set to 10 minutes. The result was 10k RPS on a single instance with 3ms latency and a 0% error rate. Incredible speeds and cost efficient.
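The relevant Nginx pieces look roughly like the sketch below; the cache path, zone name, and upstream are placeholders, and the key parts are the 10-minute validity and the use_stale directive that keeps serving cached responses while the origin is busy or down:

    # shared on-disk cache zone (path and sizes are illustrative)
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=service_cache:10m
                     max_size=1g inactive=60m use_temp_path=off;

    server {
        listen 80;

        location / {
            proxy_cache service_cache;
            proxy_cache_valid 200 10m;          # cached responses stay fresh for 10 minutes
            proxy_cache_use_stale error timeout updating
                                  http_500 http_502 http_503 http_504;   # serve stale if the origin is busy or down
            proxy_cache_lock on;                # collapse concurrent misses into one upstream request
            add_header X-Cache-Status $upstream_cache_status;
            proxy_pass http://node_service;
        }
    }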
I took it a level deeper with micro-caching: the idea is to keep the cache valid for only 1 second, so any updated or newly added database records are returned to clients without any significant delay. The worst case would be serving outdated stale content for too long, so I made sure to enable background cache updates, which send a GET request to the server and retrieve the updated data; that response becomes the new cached (and eventually stale) content, and the cycle continues. My final configuration was autoscaling with a minimum of 1 service instance and a maximum of 2. This will easily handle up to 25k RPS, with the option to add as many service instances as necessary depending on traffic volume metrics.
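In Nginx terms, the micro-caching change is small: drop the validity to 1 second and turn on background updates so expired entries are refreshed behind the scenes while clients keep getting the (at most a second or so old) cached copy. A rough sketch, reusing the same assumed cache zone and upstream as above:

    location / {
        proxy_cache service_cache;
        proxy_cache_valid 200 1s;            # micro-cache: entries are fresh for only 1 second
        proxy_cache_use_stale updating error timeout
                              http_500 http_502 http_503 http_504;
        proxy_cache_background_update on;    # refresh expired entries with a background subrequest
        proxy_cache_lock on;
        proxy_pass http://node_service;
    }

The combination of use_stale updating and background updates is what keeps clients from ever waiting on the refresh itself.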