Special events like Black Friday, Cyber Monday and Christmas are very important times for all online merchants, even more so during this pandemic year (2020). The pressure this year is much higher because of the market's shift from offline to online.
We go through this experience every year, and objectsource understands how important it is to support our clients at this time. We like to see our clients focus on their business and not worry about technical issues.
An important case study
One of our Magento Commerce clients is www.My1stYears.com. As with all merchants at this time of year, Black Friday represents a big part of their annual revenue, and Magento needs to handle this traffic with no blips. This is crucial. The good old saying “Each millisecond matters!” still holds.
Back to www.My1stYears.com: we knew the website and Magento were ready, and we had measured some good performance metrics. Page load was fast, Varnish as a full page cache solution worked like a charm in front of Magento, and so on… It seemed we were ready for Black Friday.
However, they asked us to carry out a more in-depth review of Magento performance, just to be sure. Magento and the ecosystem around it are complex, so conducting a review never hurts. You can always find ways to increase speed, stability and reliability.
And we found something…
This Magento setting can help you too
During my Magento performance review I looked at each key page systematically and checked the page load. I checked the load time separately for the DOM and for the full content load. It is important to understand when the page becomes visible to the visitor. There are great tools like Lighthouse that make this much easier.
In general, key pages are the pages where your visitors land and get a first impression. There are typical pages, but you always need to consider the upcoming campaign and where its visitors will land (landing pages).
- Home page, Category page, Product page, Search result page
- Cart page, Checkout
- Login, Register and Customer area
- Add to cart, amend cart item
- Remove item, Update on QTY
- CSS, JS merge and compression
- HTML minification and compression
- Full page cache life time and cache storage size (good to understand the logic behind these settings, what is the best for the business flow)
- Image optimisation and loads
You can have different priorities based on the business model. I just want to give you a baseline guide. You can’t go wrong with these but, of course, there is more you can do. You need to find the right balance and find the areas where you may need to dig to get an improvement.
In my case, I found pretty much everything in good condition at the first level. It is a very good sign when page load is 600-700 ms across the site. That is a good indicator that there is no major bad guy in the background (or back end) processes.
This is where things get a bit more tricky. I see a few issues again and again in how developers run load tests.
What are the base numbers?
These numbers describe the current traffic, such as how many visitors there are on your key pages. Of course you can test on a staging environment, but you always have to consider the difference between the production and staging environments. It is not always possible, but it is best if you can get a time window, say 30 minutes or an hour, when you can run the load test on the production environment. Never do it without an agreement with the client!
Try to figure out the expected numbers.
How many visitors will be there simultaneously, and what are your client's expectations? What were the numbers last time, for example? Also try to get a high-level picture of the current effort on the marketing side: which campaigns will run and for how long. That can give you a picture of the key points and the timeframe.
Once you know the numbers
You will know which pages you need to test and how hard you need to test them. Let’s not kill the production site.
I recommend using a service for load testing, but there are awesome tools out there, so you can do it yourself as well. You will need to calculate the bandwidth: the number of visitors you need to emulate determines the bandwidth you need. A typical issue is needing to run a heavier test without enough bandwidth to make it happen (from home). Your PC/Mac can be strong enough, but always check your bandwidth.
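As a rough illustration of that bandwidth calculation, here is a minimal Python sketch. The function name and the example figures for page weight and time between page views are assumptions for illustration only, not measurements from this project; plug in your own numbers.

```python
def required_mbps(concurrent_visitors, page_weight_kb, seconds_between_page_views):
    """Estimate the bandwidth (in Mbit/s) a load generator needs.

    Assumes every emulated visitor requests one full page of
    `page_weight_kb` kilobytes every `seconds_between_page_views` seconds.
    """
    pages_per_second = concurrent_visitors / seconds_between_page_views
    kilobits_per_second = pages_per_second * page_weight_kb * 8  # bytes -> bits
    return kilobits_per_second / 1000  # kbit/s -> Mbit/s


# Hypothetical example: 2000 visitors, 1500 KB full page load,
# one page view every 30 seconds per visitor.
print(required_mbps(2000, 1500, 30))
```

With these assumed figures the result is in the hundreds of Mbit/s, which is why a full content load test at this scale is rarely feasible from a home connection. A DOM-only test needs far less, because the page weight drops to the size of the HTML document.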
I have seen load test reports without even the target URL. It is nothing extra, but be sure the report states the target URL (the URL that was tested) and also what was tested. In a nutshell, you can test the DOM load (no static content is loaded) or a full content load. Both have their purposes, but the chart will be useless if there is no information about what was tested.
I hope you will find the sweet spot of your Magento instance and its weaknesses as well.
I also checked the major points in the Magento admin area to make sure there was no critical issue. It is always good to keep in mind (!) that there will most likely be admin area activity during peak time as well, and that can cause a performance issue on the store front. The reason is that the Magento admin works on the very same database instance as the store front. There are also real heavy-duty actions, “Save Product” for example. Especially if your Magento is set to keep indexes up to date on the save action, that can be heavy and cause issues on the store front during a Black Friday peak.
It is not a bad idea to recommend that the “support team” avoid certain actions in the admin during these hours! It can be a life saver when your Magento is on the edge during a peak.
- Product list page (grid list)
- Customer list page
- Order list page and Order view page
- Product save action
- Order actions like create invoice, credit memo, shipment… basically the major submit actions here.
These are the key points/actions that you want to be sure work as expected and don’t hurt the store front too much. That is the point, not the admin area itself.
In my case, I found everything in good shape there. I almost started to get the feeling that this was “too good”, if I can say that.
Checking the back end
This is my favourite part. Check all the logs systematically, and look for and collect critical issues. This is not about finding stuff to fix; once you have a list of errors and exceptions, you can go through them and see what impact they have on performance.
Errors and exceptions are not good, for sure. But what you want to see here is what can hold back a process, regardless of whether it is on the front end or the back end. Timeouts, for example, can cause long page loads and a bad UX.
Also try to profile the pages where you found issues or just experienced slow page load. Profiling is a classic method but can still be very effective. New Relic, Blackfire.io and other tools can help a lot.
My favourite is Elastic APM (with Elasticsearch and Kibana), but for PHP and Magento it is not (yet) as straightforward to install as a simple New Relic agent. There are many tools that can give you a very detailed picture in minutes. That saves you much more money than digging through raw reports and logs.
What was the catch in our case?
This is very interesting. Even though all the charts and indicators looked right, a few of them regarding Redis session reads looked weird to me, so I dug into this as far as possible to find the bad guy.
I reached out to our hosting support and asked them to confirm that Redis was doing well. I wanted to know if there was anything we could do, because sometimes the Redis session read was super slow. But they said Redis was healthy and doing well. Hmm, then what could it be?
I just didn’t like the massive purple (Redis hget) in New Relic, even compared to MySQL, which has a hard time under Magento, as you know, with its massive joins and heavy requests. Compared to that, Redis is a key-value server with no joins, nothing. Redis storage is supposed to be in memory, and it should be super fast.
Long story short, I finally found this one, and that was the last straw.
Have a closer look. The blue “Redis:hget” is FAST. But the Redis session handler is where the request spends 92% of the time. Boom!
And the memory came back immediately from the past. Of course.
If you have been in the Magento space for a long time and have experience with Magento 1, you may remember that there was an issue with Redis sessions in the early days. Specifically, Magento's session locking mechanism could cause performance issues, actually very serious ones. But that was a long time ago in a galaxy far, far away… so who cares? Right?
Surprisingly, the issue is still with us.
Please keep in mind that there is NO wrong code or solution here. This is only the result of a “safe mode” approach. Let me explain it a bit.
By default, Magento's Redis session handler uses a “soft lock” approach: it locks the session in Redis on each touch for safety reasons. It is similar to why MySQL takes locks during writes and reads (SELECT). But Redis DOES NOT have this locking feature natively, so Magento implements it at the software level (client side) and writes a “variable” into the session to lock it each time.
When you look at the Redis monitor, you can see that the number of writes on a session is big, just from setting the LOCK and incrementing a “variable” each time Magento tries to write there. So the locking mechanism alone is already a lot of work for Redis.
I don’t think this is an issue in itself, but it is very interesting to see that this soft lock mechanism has a cost of its own.
When you try to google it, you will find that related content is not easy to come by. When you check the Magento docs (official documentation, recommendations…) you won’t find this. Maybe there is something I didn’t find, but this is certainly not clearly communicated.
Later one of my colleagues found a link where the Magento support team shares some thoughts about this:
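To make the extra work visible, here is a toy Python model of a client-side soft lock. It is only a sketch: the in-memory `CountingStore` stands in for Redis, the field names `lock` and `data` are hypothetical, and the logic is a simplification for illustration, not Magento's actual session handler.

```python
class CountingStore:
    """In-memory stand-in for a key-value server that counts every command."""

    def __init__(self):
        self.data = {}
        self.commands = 0

    def hincrby(self, key, field, amount):
        self.commands += 1
        fields = self.data.setdefault(key, {})
        fields[field] = fields.get(field, 0) + amount
        return fields[field]

    def hget(self, key, field):
        self.commands += 1
        return self.data.get(key, {}).get(field)

    def hset(self, key, field, value):
        self.commands += 1
        self.data.setdefault(key, {})[field] = value


def read_session(store, session_id, use_lock):
    if use_lock:
        # Soft lock: bump a counter; only the caller that sees 1 holds the lock.
        if store.hincrby(session_id, "lock", 1) != 1:
            raise RuntimeError("session is locked by another request")
    return store.hget(session_id, "data")


def write_session(store, session_id, payload, use_lock):
    store.hset(session_id, "data", payload)
    if use_lock:
        store.hset(session_id, "lock", 0)  # release the soft lock


def simulate(cycles, use_lock):
    """Run sequential read/write cycles on one session; return the command count."""
    store = CountingStore()
    for i in range(cycles):
        read_session(store, "sess_demo", use_lock)
        write_session(store, "sess_demo", f"payload {i}", use_lock)
    return store.commands


print(simulate(10, use_lock=True))   # 40 commands
print(simulate(10, use_lock=False))  # 20 commands
```

Even in this simplified model, locking doubles the number of commands per session touch, and that is before any request has to wait and retry because another request holds the lock.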
January, 2018: “This article provides a fix for the issue when logging in to Magento Admin or opening the checkout page causes lag or timeout (over 30 seconds)….”
This is a good hint from Magento Support. You can ignore the “fix” part of it; please pay attention to the “workaround” section.
Workaround: disable session locking
To disable session locking, set disable_locking to 1 in the Redis configuration section of the env.php file:
'session' => [
    'save' => 'redis',
    'redis' => [
        'host' => 'redis.internal',
        'port' => 6379,
        'database' => '0',
        'disable_locking' => '1', // 1 turns the soft lock off
    ],
],
Also, as far as I am aware, Magento Cloud sets disable_locking to 1 to avoid serious performance issues.
So the solution is to disable this soft locking in the app/etc/env.php file, as you can see above.
Theoretically, two concurrent requests can now write to the very same session at the very same time, but the likelihood of a conflict is very small. After the change you can use the Redis monitor to check that all is fine.
The change was noticeable on the website right away. Because it affects all threads and all processes/requests, including all the Ajax requests and, much more importantly, most of the ESI requests from Varnish to Magento(!), it has a measurable impact on the store front. And on stability too!
Keep in mind that you may not have an issue when there are only a few visitors on the site. But on Black Friday, at the peak, when in our case we had 2000+ customers simultaneously on the website, this improvement can save your Magento.
This is the change on the customer load (session load) action only. This request happens maybe 10 times or more on a single page load, through Ajax calls and all the back end calls from Varnish to Magento that fetch uncached blocks for the requested page.
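A back-of-the-envelope model shows why roughly 10 session loads per page make the lock so expensive. The Python sketch below assumes the worst case, where all requests for one page arrive at the same moment and each holds the session for the same time; the 20 ms hold time is a made-up figure, and the function names are illustrative.

```python
def last_request_finish_ms(requests_per_page, session_hold_ms, locking):
    """Time until the last of the simultaneous requests is done with the session."""
    if locking:
        # Requests queue up behind the lock and run one after another.
        return requests_per_page * session_hold_ms
    # Without the lock they all touch the session in parallel.
    return session_hold_ms


def total_wait_ms(requests_per_page, session_hold_ms, locking):
    """Total time the requests spend waiting for the lock to be released."""
    if not locking:
        return 0
    # The k-th request (counting from 0) waits for the k requests ahead of it.
    return sum(k * session_hold_ms for k in range(requests_per_page))


# Hypothetical figures: 10 session loads per page, 20 ms each.
print(last_request_finish_ms(10, 20, locking=True))   # 200
print(last_request_finish_ms(10, 20, locking=False))  # 20
print(total_wait_ms(10, 20, locking=True))            # 900
```

Under these assumptions the lock turns a 20 ms step into a 200 ms one for the slowest request on the page, and the waiting time grows quadratically with the number of requests sharing a session.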
Also, smooth is good in this case. You can see where the change happened on the chart.
Certified Magento developer