This feature is only available on the Production and Enterprise plans.
Key Benefits of Horizontal Autoscaling
- Cost efficiency & performance: Ensure your App Services always use the optimal number of containers, and efficiently scale workloads that have periods of high and low usage and can be parallelized.
- Greater reliability: Reduce the likelihood of an expensive computation (e.g., a web request) consuming all of your fixed-size processing capacity
- Reduced engineering time: Save the time spent manually scaling your App Services by automating the process
What App Services are good candidates for HAS?
First, let’s consider what sort of process is NOT a candidate:
- Job workers, unless your jobs are idempotent and/or your queue follows exactly-once semantics
- Services that cannot be easily parallelized
  - If your workload is not easily parallelized, you could end up in a scenario where all the load is on one container and the others do near-zero work.
So what’s a good candidate?
- Services that have predictable and well-understood load patterns
  - We talk about this more in How to set thresholds and container minimums for App Services
- Services that have a workload that can be easily parallelized
  - Web workers are a good example, since each web request is completely independent of the others
- Services that experience periods of high/low load
- However, there’s no real risk to setting up HAS on any service just in case it ever experiences higher load than expected, as long as having multiple processes running at the same time is not a problem (see above for processes that are not candidates).
How to set thresholds and container minimums for App Services
Horizontal Autoscaling is configured per App Service. Guidelines to keep in mind for configuration:
- Minimum number of containers - Should be set to at least 2 if you want High Availability
- Max number of containers - This one depends on how many requests you want to be able to handle under load, and will differ based on the specifics of how your app behaves. If you’ve done load testing with your app and understand how many requests your app can handle with the container size you’re currently using, it is a matter of calculating how many more containers you’d want (see the sketch after this list).
- Min CPU threshold - Set this slightly above the CPU usage your app exhibits when there is no or minimal usage, to ensure scale-downs happen; any lower and your app will never scale down. If you want scale-downs to happen faster, you can set this threshold higher.
- Max CPU threshold - A good rule of thumb is 80-90%. There is some lead time before a scale-up occurs, since a minimum amount of metrics must be gathered before the next scale-up event, so setting this close to 100% can lead to bottlenecks. If you want scale-ups to happen faster, you can set this threshold lower.
- Scale Up and Scale Down Steps - These are set to 1 by default, but you can modify them if you want autoscaling events to add or remove more than 1 container at a time.
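To make the max-container guideline concrete, here is a minimal sketch of the back-of-the-envelope math involved. The per-container throughput, target peak load, and headroom factor below are hypothetical numbers you would replace with your own load-testing results:

```python
import math

# Hypothetical load-testing results for one container at the current size
requests_per_container = 50   # sustained requests/second a single container can handle
target_peak_rps = 220         # peak requests/second we want to absorb
headroom = 1.2                # safety margin so containers aren't running flat out

# Containers needed to serve the target peak with some headroom
max_containers = math.ceil(target_peak_rps * headroom / requests_per_container)
print(max_containers)  # -> 6
```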
CPU thresholds are expressed as a decimal between 0 and 1, representing the fraction of your container’s allocated CPU that is actively used by your app. For instance, if a container with a 25% CPU limit is using 12% CPU, this would be expressed as 0.48 (48% of its allocation).
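As a quick illustration of that conversion, reusing the 25% limit and 12% usage figures from the sentence above:

```python
cpu_limit = 0.25   # the container is allocated 25% of a CPU
cpu_usage = 0.12   # the app is currently using 12% of a CPU

# Thresholds compare usage to the container's allocation, not to a whole CPU
threshold_value = cpu_usage / cpu_limit
print(threshold_value)  # -> 0.48, i.e. 48% of the allocation
```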
Let’s go through an example:
We have a service that exhibits periods of load and periods of near-zero use. It is a production service that is critical to us, so we want a high-availability setup, which means our minimum containers will be 2. Metrics for this service are as follows:

| Container Size | CPU Limit | Low Load CPU Usage | High Load CPU Usage |
|---|---|---|---|
| 1GB | 25% | 3% (12% of allocation) | 22% (88% of allocation) |
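Putting these metrics together with the guidelines above, here is a minimal sketch of how we might derive candidate settings for this service. The margin above idle usage (0.15) and the 0.80 maximum are illustrative choices, not prescribed values:

```python
cpu_limit = 0.25        # the container's CPU allocation (25%)
low_load_usage = 0.03   # 3% CPU when the service is near idle
high_load_usage = 0.22  # 22% CPU under load

# Express measured usage as a fraction of the allocation, since that's what thresholds use
low_load_fraction = low_load_usage / cpu_limit    # 0.12
high_load_fraction = high_load_usage / cpu_limit  # 0.88

# Candidate settings based on the guidelines above (illustrative, not prescribed)
min_containers = 2        # two containers for high availability
min_cpu_threshold = 0.15  # just above the 0.12 idle fraction, so scale-downs still occur
max_cpu_threshold = 0.80  # within the 80-90% rule of thumb, below the 0.88 seen under load

print(min_containers, min_cpu_threshold, max_cpu_threshold)
```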
Autoscaling Worker/Job processes
You can use horizontal autoscaling to scale your worker/job processes. However, you should consider some additional configurations:
- Restart Free Scaling: When enabled, scale operations that modify the number of running containers will not restart the other containers in the service. This is particularly useful for worker/job processes, as it allows you to scale up without interrupting work on the containers already processing jobs.
- Service Stop Timeout: When scaling down, the service stop timeout is respected. This is particularly useful for worker/job processes, since it allows time to either finish processing the current job or put the job back on the queue for processing by another container. Note that containers are selected for removal based on the APTIBLE_PROCESS_INDEX metadata variable, with higher indices selected first, so if possible you should prefer to process long-running jobs on containers with a lower index.
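For instance, a worker could use that index to keep long-running jobs on the lowest-index container, which is the last to be removed on scale-down. This is a minimal sketch: the queue names are hypothetical, and it assumes APTIBLE_PROCESS_INDEX is available to your process as an environment variable.

```python
import os

# Assumes APTIBLE_PROCESS_INDEX is exposed to the process as an environment variable
process_index = int(os.environ.get("APTIBLE_PROCESS_INDEX", "0"))

# Hypothetical queue split: long-running jobs only go to the lowest-index container,
# since higher-index containers are the first to be removed on scale-down
if process_index == 0:
    queues = ["long_running", "default"]
else:
    queues = ["default"]

print(f"Container {process_index} will consume from: {queues}")
```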