I see a lot of this idea in software development: "it's always been this way", or "we spent a lot of time building that, so we'll have to live with it". This is a dangerous mindset, and one that, had I followed it, would have caused serious problems for logbreeze.
Doesn’t Work In The Real World
Let’s make this about the real world and not about software for a moment.
In the real world, do we say "oh no, we can't fix that, this building has always had a gaping hole in the roof"? Or do we look at it and see that it needs ripping down and rebuilding?
We rebuild. But we rebuild it better. Software should be the same.
Making It About Me
So, what got me to write about the subject?
Well, over the last few weeks I've been working evenings and weekends trying to solve problems of scalability and simplicity in the ingestion and processing engine behind logbreeze. The platform is already coping with roughly 50 million logged messages per month, but it was starting to show serious signs of strain, and something had to be done.
These performance problems were caused primarily by two things:
- Messages being received were being sent to the API for processing instead of being queued first
- The search engine behind the scenes was heavily optimised within Laravel, but it was ultimately still using a Postgres backend for storing and searching data, which became slow when dealing with millions of records
At this stage I'd invested a ton of time into both of these parts, built against a smaller data set with far fewer messages being ingested. Now that I was putting them under real strain, the cracks were showing.
I had two options: keep the current setup and just try to optimise the hell out of it, or scrap it and rebuild.
I instantly opted for the second, this time massively changing both parts.
Let There Be Queues
Just as an update before I get into this section: the decision has since been made to move to Redis, purely because, hot damn, SQS gets expensive. The rest of this still stands, though.
First, all incoming messages were to be queued in SQS for processing. This meant that all the Node.js ingest service was doing was receiving a message and sending it to SQS (an effectively infinitely scalable platform), and that's it. Previously, it would fire off an HTTP request to an API which would then process the incoming message. At small scale this is fine, but at a couple of million messages per day it was already starting to fall over.
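To make the shape of that hot path concrete, here's a minimal sketch of a receive-and-enqueue handler. The names (`makeIngest`, `queueClient`) are illustrative, not logbreeze's actual code, and the stub client stands in for a real SQS client such as `@aws-sdk/client-sqs` so the sketch is self-contained:

```javascript
// Sketch of the ingest hot path: receive a message, hand it to the queue,
// and return immediately. No parsing or processing happens here -- the only
// job is to enqueue, which keeps throughput bounded by the queue, not the API.
function makeIngest(queueClient, queueUrl) {
  return async function ingest(rawMessage) {
    await queueClient.send({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify(rawMessage),
    });
    return { accepted: true };
  };
}

// Stub standing in for the real SQS client, so the example runs anywhere.
// In production this would be an SQSClient sending a SendMessage request.
const sent = [];
const stubClient = { send: async (params) => { sent.push(params); } };

const ingest = makeIngest(stubClient, "https://sqs.example/queue");
```

The key design choice is that the ingest service does the bare minimum and defers everything else to the workers behind the queue.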
This actually had instant side benefits I hadn’t thought of before.
Because an incoming message is queued and only picked up by a worker later, I can perform maintenance whenever required on the application, database, or Elasticsearch cluster without losing any messages, just by pausing the workers first. Then, when it's done, I hit resume on the workers and they catch up without missing anything!
The other added bonus is that if there is a sudden influx of messages, the queue fills up and acts as a buffer while the workers catch up. If needed, more workers can be added at any time, but in general this should work very nicely without making the API fall over!
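The pause/resume and buffering behaviour described above can be sketched with an in-memory queue. In production the queue is SQS (now Redis) and the workers are separate processes, but the essential property is the same, and the class and method names here are purely illustrative:

```javascript
// Minimal in-memory sketch of the queue-as-buffer pattern: producers enqueue
// instantly, workers drain at their own pace, and pausing the workers simply
// lets messages pile up -- nothing is lost.
class BufferedQueue {
  constructor() {
    this.messages = [];   // the buffer (SQS/Redis in the real setup)
    this.paused = false;  // maintenance mode: workers stop consuming
    this.processed = [];  // stand-in for "handled by a worker"
  }

  enqueue(message) {
    // Producers never block, even during a spike or a maintenance window.
    this.messages.push(message);
  }

  pause()  { this.paused = true; }
  resume() { this.paused = false; }

  // One worker pass: drain everything currently queued, unless paused.
  drain() {
    if (this.paused) return;
    while (this.messages.length > 0) {
      this.processed.push(this.messages.shift());
    }
  }
}

const q = new BufferedQueue();
q.pause();               // maintenance starts
q.enqueue("msg-1");      // messages keep arriving...
q.enqueue("msg-2");
q.drain();               // ...but workers process nothing while paused
q.resume();              // maintenance ends
q.drain();               // workers catch up on the backlog
```

The same mechanism handles both cases in the post: a deliberate pause for maintenance and an involuntary backlog during a traffic spike.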
Looking For Some Logs
Second, the search engine I had built within the application was scrapped. I did consider Algolia, but the cost, and the fact that it's a remote service, put me off that idea. Instead, I've gone with Elasticsearch, which is working beautifully.
This is also where the big performance gain came from. Elasticsearch can search through millions of rows incredibly fast (I was genuinely surprised at just how fast).
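For a rough idea of the kind of query this involves, here's a sketch of an Elasticsearch search body for "messages under a device matching a term". The index layout and field names (`device_id`, `message`, `@timestamp`) are my assumptions for illustration, not logbreeze's real schema:

```javascript
// Builds an Elasticsearch query body: exact-match filter on the device,
// full-text match on the log message, newest first. A real app would send
// this body to the cluster via an Elasticsearch client library.
function buildSearchBody(deviceId, term, size = 50) {
  return {
    size,
    sort: [{ "@timestamp": "desc" }],
    query: {
      bool: {
        // filter clauses don't affect scoring and are cacheable,
        // which suits an exact device-ID restriction.
        filter: [{ term: { device_id: deviceId } }],
        // must clauses do full-text matching against the analysed field.
        must: [{ match: { message: term } }],
      },
    },
  };
}
```

Putting the device restriction in `filter` rather than `must` is a small but standard Elasticsearch optimisation: filters skip relevance scoring and can be served from the filter cache.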
In fact, in current testing the Livewire component for searching and showing messages under a device manages to reload, search, and update the UI in roughly 50–60ms (this hasn't been properly benchmarked, just quickly checked via the request timing in the network tab of Chrome's dev tools).
So, What's The Conclusion?
Never be afraid to rebuild something just because "that's how it has always been". When you rebuild, you build upon the foundations of the past.