Sometimes, the most effective optimization is understanding and working with the existing system's constraints instead of adding more resources. Scaling software infrastructure doesn’t mean finding the most complex solution; rather, it requires finding the most pragmatic one.
One of our projects has a component that is essentially a link aggregator. For every link being tracked, we need to periodically check whether it is still active. As an initial approach, one we knew would need to scale, we settled on updating each link's status once a week.
That's not to say that we run a single, giant process to refresh every link at, say, midnight on Saturday. Given the number of links being tracked, that would be prohibitive. Instead, we spread the work throughout the week, grabbing a set of links every few minutes as the next block to process.
Scale must happen, whether in the size of the set or the frequency of the action. In this case, we wanted to shift our weekly check to a daily process, meaning our links get checked seven times more often.
Given the number of links being tracked and the queue burn-down rate in the weekly check, this would require a more robust approach.
We have a queue for stacking up links that need to be checked. We're already doing some parallel action by having multiple processes grab tasks from that queue. So, we have a few options for the approach to scale this up.
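A minimal sketch of that existing setup, in Python: several worker processes each pull one task at a time from a shared queue. Here a thread-safe `queue.Queue` stands in for the real task queue, and the `worker` and `handle` names are illustrative, not our actual code.

```python
import queue

def worker(tasks, handle):
    """Drain tasks from a shared queue until it is empty.

    In production, several processes run a loop like this
    concurrently, each claiming one task at a time.
    """
    done = 0
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return done
        handle(task)
        done += 1
```

Each worker only ever sees one task at a time, which is exactly the granularity we are about to widen.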
Three Options
First, we could simply increase the number of processes.
But server resources are finite, so that means either provisioning a server with more memory to support the required number of app instances or provisioning a fleet of smaller instances. There is a more cost-effective way to do this.
Second, we could pivot to a different queuing mechanism and move the link checking to a functional component, like a lambda. Then, we could scale up as many as needed.
As an approach, this is valid and would solve a long-term projected concern. But the scale at which this would be necessary is far off.
This means the multidimensional effort (infrastructure, developing/testing the lambda, etc.) would be a hefty investment at this phase. This certainly deserves future consideration, but there's a more cost-effective and less labor-intensive way to accomplish what we need now.
So, I settled on option three: parallelize the parallel processes.
Parallelization Considerations
The fact is, every time we check a link, we lose a lot of cycles to network latency. The key bottleneck in checking links is the network. That makes this an I/O-bound problem, not a CPU-bound one that would require scaling up the hardware.
Because the work is I/O-bound, one of our processes can take a batch as its task rather than a single link: spin up threads to check several links at once, then simply wait for all of them to complete.
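In Python, for example, the batched check can be a thread pool over the links. The names `check_link` and `check_batch` are illustrative, and the HEAD-request check is a simplification of whatever status logic the real task uses; because the threads spend nearly all their time blocked on the network, the GIL is not a concern here.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Return True if the link still responds successfully."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        return False

def check_batch(urls, checker=check_link, max_workers=20):
    """Check a whole batch concurrently and map each URL to its result.

    The threads mostly sit waiting on network I/O, so this is
    cheap in CPU terms even with a fairly wide pool.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(checker, urls)))
```

The `checker` parameter is injected mainly so the batch logic can be exercised without live network calls.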
When our queuing process runs every few minutes, we can create one or more batches of links to be run in parallel in one of our background tasks. Give each batch an identifier, like a UUID, record that on the links to be checked in the batch, and then queue a task with that identifier.
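The batching step itself might look like the following sketch, using an in-memory list as a stand-in for our real link store and task queue; the field names (`batch_id`, `batched_at`) and batch size are assumptions for illustration.

```python
import time
import uuid

def enqueue_batches(links, tasks, batch_size=50):
    """Stamp due links with a batch ID and queue one task per batch."""
    due = [link for link in links if link.get("batch_id") is None]
    for i in range(0, len(due), batch_size):
        batch = due[i:i + batch_size]
        batch_id = str(uuid.uuid4())
        for link in batch:
            link["batch_id"] = batch_id       # claim the link for this batch
            link["batched_at"] = time.time()  # recorded for stale-batch recovery
        tasks.append({"batch_id": batch_id})  # the task needs only the ID
```

Because the task carries only the batch ID, the checking process looks up its links by that ID and clears it on each link as it finishes.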
We do need to plan for the potential of something breaking. In other words, if something goes sideways in the link-checking task and our links don't get their batch ID cleared, how do they get requeued?
We'll also need to track when a link has been batched. A second task can simply clear any stale batches after a few hours, freeing the links to be queued for review once again.
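The recovery task can then be a simple sweep: clear any batch ID older than a cutoff so orphaned links re-enter the queue on the next pass. Again a sketch with assumed field names (`batch_id`, `batched_at`) and an assumed three-hour cutoff.

```python
import time

STALE_AFTER = 3 * 60 * 60  # three hours, in seconds

def clear_stale_batches(links, now=None):
    """Free links whose batch was claimed too long ago; return the count."""
    now = time.time() if now is None else now
    freed = 0
    for link in links:
        if link.get("batch_id") and now - link["batched_at"] > STALE_AFTER:
            link["batch_id"] = None  # link becomes eligible for requeue
            freed += 1
    return freed
```

Running this every so often bounds how long a crashed checking task can hold links out of rotation.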
Results
In my testing, I saw that running a batch of links in parallel on a single process eliminated 80% or more of the runtime.
That's 80% per process, across several processes, which gives us well more than the 7x increase in queue throughput we needed.
Now, our process can happily run a daily refresh on the full set while having significant room to scale before more complex approaches are necessary.
Scaling as a Solution
By implementing batch processing with parallel threads, we transformed a potential scaling challenge into an elegant solution.
We achieved a significant performance improvement—reducing runtime by 80% per process—without resorting to expensive infrastructure changes or complex architectural redesigns.
This method embodies a core principle of efficient software engineering: solve the immediate problem with the least friction while keeping future scalability in mind. Our solution provides immediate benefits and leaves ample runway for future enhancements.
Sometimes, the smartest optimization is the one that works now, scales incrementally, and doesn't prematurely complicate your infrastructure.