Way back in July 2023, we wrote about KumoMTA's performance testing and made some pretty big claims. Since then, we have published ten stable releases and dozens of dev releases that included significant tuning changes and observability additions. Oh, yeah, we also now support ARM processors and several more Linux distros. It sounds like it is time for an update to see if we are STILL the world's most performant MTA.
This performance update is based on the latest public release, Release 2025.05.06-b29689af. To be as consistent as possible with the previous report, the test bed has the following characteristics:
- The initial x86 tests used the AWS c5n family of instances with 8, 16, and 36 cores
- RAM was scaled with the core count at 21, 42, and 96 GB
- Storage was 100 GB of gp3 at 7,200 IOPS and 225 MB/s
- All tests were performed using Ubuntu 24
It is also worth noting that the spool storage engine was RocksDB, with write-ahead logging used to prevent data loss. Other systems may post impressive numbers based on RAM spooling, but that approach is dangerous and should not be used in production because of the potential for message loss. RocksDB with write-ahead logging is extremely fast while remaining safe in the event of an unexpected system failure.
This time we also tested similar ARM deployments using the AWS c7g family of instances with 8, 16, and 32 cores. RAM was scaled with the core count, and storage was the same as for the x86 instances above.
To show realistic numbers, we used a traffic generator that injected 100,000 messages with 100 KB payloads as rapidly as possible. This was intended to simulate a typical marketing scenario. The test was repeated four times, and the results were averaged. Users sending small transactional alerts will see much higher numbers if they test their own scenarios.
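For those who want to reproduce something similar, the sketch below shows the general shape of such an injector. It is not KumoMTA's bundled traffic generator, and the listener address, sender, and recipients are placeholders you would swap for your own test environment.

```python
# Minimal SMTP injector sketch: pushes N messages with a ~100 KB body
# at a single listener. Host/port and addresses are placeholders, not
# the settings used for the numbers in this report.
import smtplib
from email.message import EmailMessage

HOST, PORT = "127.0.0.1", 2025   # assumed injection listener
TOTAL_MESSAGES = 100_000
PAYLOAD = "X" * 100_000          # ~100 KB body, similar to a marketing template

def build_message(i: int) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "test@example.com"
    msg["To"] = f"user{i}@example.com"
    msg["Subject"] = f"perf-test message {i}"
    msg.set_content(PAYLOAD)
    return msg

def inject():
    # Reuse one connection; a real load generator would run many of these
    # loops in parallel to saturate the server.
    with smtplib.SMTP(HOST, PORT) as smtp:
        for i in range(TOTAL_MESSAGES):
            smtp.send_message(build_message(i))

if __name__ == "__main__":
    inject()
```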
As you can see in the results below, an x86 instance with 8 cores and 21 GB of RAM (AWS c5n.2xlarge) can consistently deliver nearly 3.3 million 100 KB messages per hour. Doubling the instance size to 16 cores and 42 GB of RAM increases the volume to approximately 4.6 million per hour, and a c5n.9xlarge with 36 cores and 96 GB of RAM reaches a transport rate of 6.3 million per hour.
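To put those hourly figures in more familiar terms, here is a quick back-of-the-envelope conversion to per-second rates and raw payload bandwidth, using only the numbers quoted above:

```python
# Convert the reported hourly rates into per-second and payload-bandwidth
# terms, assuming 100 KB per message as in the test.
RESULTS = {  # messages per hour, from the report
    "8-core x86": 3_300_000,
    "16-core x86": 4_600_000,
    "36-core x86": 6_300_000,
}
PAYLOAD_KB = 100

for name, per_hour in RESULTS.items():
    per_second = per_hour / 3600
    mb_per_second = per_second * PAYLOAD_KB / 1000
    print(f"{name}: {per_second:,.0f} msgs/sec, ~{mb_per_second:,.0f} MB/s of payload")
```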
A bonus of the work in this latest release is a speed improvement of about 17%. While running the performance tests, we also ran the previous stable release from March 19th on the same hardware to show the performance gain between versions.
Results:
Since the last performance report, we have added the ability to deploy on ARM systems. For the second part of the performance testing, we repeated all of the tests on similarly equipped ARM servers, with similar results. You will notice a 7%-10% reduction in performance, mostly due to DKIM signing performance on the ARM processors. However, the lower cost of the ARM systems may make this trade-off acceptable. The AWS on-demand price for an 8-core, 16 GB instance is 43 cents per hour for x86 and 29 cents per hour for ARM. Based on those rates, using ARM instances could save you 32% on server costs in exchange for a potential 10% performance loss. Of course, you can get far better pricing with an Enterprise account and reserved instances.
In both cases, the x86 and ARM deployments showed similar scaling factors. A 16-core server cost double the price of an 8-core server but yielded a 40% increase in volume. Moving from 16 cores to 32 doubled the cost again and provided a 38% improvement in volume.
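Here is a rough sketch of that cost math in code, using the on-demand prices and throughput figures quoted above. The ARM rate is derived by applying the full 10% penalty to the x86 result, so treat it as an approximation rather than a separate measurement:

```python
# Rough cost-efficiency comparison from the on-demand prices and
# throughput figures quoted above.
X86_PRICE_PER_HOUR = 0.43          # 8-core x86, USD
ARM_PRICE_PER_HOUR = 0.29          # 8-core ARM, USD
X86_MSGS_PER_HOUR = 3_300_000      # 8-core x86 result from the report
ARM_MSGS_PER_HOUR = X86_MSGS_PER_HOUR * 0.90   # assume the full 10% penalty

def cost_per_million(price_per_hour: float, msgs_per_hour: float) -> float:
    return price_per_hour / (msgs_per_hour / 1_000_000)

x86 = cost_per_million(X86_PRICE_PER_HOUR, X86_MSGS_PER_HOUR)
arm = cost_per_million(ARM_PRICE_PER_HOUR, ARM_MSGS_PER_HOUR)
print(f"x86: ${x86:.3f} per million messages")   # ~$0.130
print(f"ARM: ${arm:.3f} per million messages")   # ~$0.098
print(f"ARM saves ~{(1 - arm / x86) * 100:.0f}% per message delivered")  # ~25%

# The same arithmetic shows why scaling up a single node raises the
# per-message cost: doubling the hourly price buys only a 40% volume gain.
x86_16core = cost_per_million(X86_PRICE_PER_HOUR * 2, X86_MSGS_PER_HOUR * 1.4)
print(f"16-core x86: ${x86_16core:.3f} per million messages")  # ~$0.186
```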
Ludicrous Speed
The chart above covers only a portion of the performance testing, but it touches on the most common use case: sending high volumes of marketing mail, typically with a payload of about 100 KB, over the Internet on a realistically budgeted server footprint. If we want to look at the extreme cases (they call me "edge-case" for a reason), let's see what happens when we switch to "Ludicrous Speed".
When we take that real-world testing above and remove the Internet lag by sending mail to a local sink, the numbers are dramatically improved. This is not as unrealistic as it might sound, as there are many cases where MTAs are used for local pre-processing before actual network delivery. This also removes the relatively slow network pipe I have in my development environment. Users with access to 100 Gbps trunk speeds on the Internet backbone will see better results. Those same servers can process more than 2.5x the mail without that pesky Internet lag.
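If you want to run a similar test, any SMTP server that accepts and discards mail will do as a sink. Below is a minimal sketch using the aiosmtpd Python package; it is an illustration, not the sink used for the numbers in this report:

```python
# Minimal "black hole" SMTP sink: accepts every message and discards it,
# which removes Internet latency from the measurement.
# Requires: pip install aiosmtpd
from aiosmtpd.controller import Controller

class DiscardHandler:
    async def handle_DATA(self, server, session, envelope):
        # Accept the message and throw the content away.
        return "250 Message accepted for delivery"

if __name__ == "__main__":
    controller = Controller(DiscardHandler(), hostname="0.0.0.0", port=2525)
    controller.start()
    print("Sink listening on port 2525; press Enter to stop")
    input()
    controller.stop()
```

Point the MTA's delivery at the sink's port (2525 in this sketch) and the only remaining limits are local: CPU, spool I/O, and loopback bandwidth.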
But this was also still using the 100 KB marketing payload test. If we reduce that payload to 10 KB, which is more typical of alerts and one-time-password notices, the numbers are considerably higher. Also, just for kicks, I tested the largest instance I could manage in my dev account: an r6a.32xlarge with 128 cores and a terabyte of RAM. Go big or go home, as they say.
The results above are pretty crazy, and you might say they are fringe data, but they do show the raw processing power of KumoMTA when the gloves are off. If your business is sending small OTP or alert notifications and you have a 100 Gbps NIC, you could potentially send 66 million messages PER HOUR with a 128-core instance. Note that it will cost you about $14 USD per hour to run that monster, too. It is worth noting that no special effort was made to re-tune for the larger instances, so there is definitely room for a skilled engineer to get better performance than the numbers posted here.
As impressive as these results are, we still don't recommend building one massive instance to handle all the load. That would abandon important engineering principles, including redundancy. Instead, we recommend a cluster of smaller nodes, potentially orchestrated with Kubernetes or some other management tool. You will likely get much better performance and reliability from a cluster of four 8-core instances than from one 32-core instance. In addition, a massive instance requires thread tuning to reach peak performance: with the 128-core system in the test above, I needed to limit SMTP server handling threads to 6, and the queue maintainers and DKIM signing needed thread tuning as well, a complication that is unnecessary with smaller instances.
These findings support the idea that you will get better overall performance from a cluster of smaller instances than from one massive server, as shown in the chart below. Whether you choose to deploy x86 or ARM, the cost/production values seem to favour multiple 8- or 16-core nodes in a cluster. Since KumoMTA has no per-node license fee, you are free to deploy as many as you like without affecting your support costs. In email terms, this is perhaps better expressed as CPM, as shown overlaid on that graph here.
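As a rough illustration of that CPM math, here is the cluster-versus-monolith comparison using the 8-core and 36-core x86 figures above. The 36-core hourly price is an assumption based on AWS's roughly linear on-demand pricing within an instance family, not a number from the report:

```python
# CPM (cost per thousand messages) sketch: four 8-core nodes vs one
# 36-core node, using the throughput figures reported above.
def cpm(price_per_hour: float, msgs_per_hour: float) -> float:
    return price_per_hour / (msgs_per_hour / 1000)

cluster_price = 4 * 0.43            # four 8-core x86 nodes at 43 cents/hour each
cluster_volume = 4 * 3_300_000      # each node delivers ~3.3M msgs/hour
single_price = 0.43 * 4.5           # 36-core node, assumed linear family pricing
single_volume = 6_300_000           # 36-core result from the report

print(f"4 x 8-core cluster: CPM ${cpm(cluster_price, cluster_volume):.5f}")  # ~$0.00013
print(f"1 x 36-core node:   CPM ${cpm(single_price, single_volume):.5f}")    # ~$0.00031
```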
The net results?
Nearly 2 years after the last performance post, KumoMTA is still (arguably) the most performant MTA on the planet. This is despite the addition of critical new clustering and monitoring features, cutting-edge security improvements, and the most expansive API in the industry. If you are still not using KumoMTA for your email engine, let us help you change that today.
Let us know if this helped you. We would love to tell your story.
------------------------------------------------------------------
KumoMTA is the first open-source MTA designed from the ground up for the world's largest commercial senders.
We are fueled by Professional Services and Sponsorship revenue.
Join the Discord | Review the Docs | Read the Blog | Grab the Code | Support the team