KumoMTA metrics explained

  • November 18, 2024

One of the things that is incredibly important for any MailOps engineer is system observability. That is a big, fancy phrase for "show me all the metrics," and it can be the difference between effective system management and functioning on prayer. KumoMTA was built from the start on the idea that high visibility is critical and should be real-time and always accessible.

From a business management point of view, observability translates directly into dollars: the more actionable data you can get, the more efficiently your business can run and the more competitive advantage you can leverage. KumoMTA was not just built to be an amazingly powerful Message Transfer Agent; it was built to be an effective business tool for commercial email operations. Here are some of the ways KumoMTA gives you real-time access to business-critical information.

In a previous post we talked about KCLI and its "ktop" function, which shows a live feed of essential system metrics. There are three types of metrics KumoMTA reports in the ktop utility: Rates, Counts, and Averages. The data reported includes the following, though it may look different on your system depending on your configuration.

RATES - measured in count per second as seen in the most recent sampling window. These are the rows that end with `/s`, indicating per-second values:

Delivered - Number of deliveries per second in the most recent time window
Received - Number of receptions per second in the most recent time window
Transfail - Number of transient failures per second in the most recent time window
Permfail - Number of permanent failures per second in the most recent time window
bounce_classify_latency - The rate of bounce classifications per second
context-creation - The rate of Lua context creations per second
dir-/var/log/kumomta/ - The rate of writes to the log file per second (there will be one of these for each defined logger)
get_egress_path_config - Number of times per second the get_egress_path_config event is triggered
get_egress_pool - Number of times per second the get_egress_pool event is triggered
get_egress_source - Number of times per second the get_egress_source event is triggered
get_listener_domain - Number of times per second the get_listener_domain event is triggered
get_queue_config - Number of times per second the get_queue_config event is triggered
hook-{hook-name} - Number of times per second a defined hook is triggered
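The relationship between the rate rows and the underlying counters is simple: a rate is the change in a cumulative counter over the most recent sampling window. A minimal sketch of that calculation (the counter values and window length here are hypothetical, chosen only for illustration):

```python
# Sketch: how a per-second rate is derived from two samples of a cumulative
# counter (e.g. total deliveries), taken one sampling window apart.
def rate_per_second(prev_count, curr_count, window_seconds):
    """Rate over the most recent sampling window, like ktop's `/s` rows."""
    return (curr_count - prev_count) / window_seconds

# e.g. a delivery counter that grew from 12,000 to 12,600 over a
# 10-second window corresponds to a Delivered rate of 60/s:
print(rate_per_second(12_000, 12_600, 10))  # 60.0
```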

COUNTS - these are either counters which accumulate since the last service reload or gauges that represent the current value of a quantity that can increase or decrease.

Scheduled - Total number of messages in all scheduled queues. These are messages that are not immediately due for delivery but will be at some point in the future.
Ready - Total number of messages in the ready queue, which are immediately due for delivery
Messages - Total number of message objects anywhere in the system (ready, scheduled, logging, in-flight, etc.)
Resident - Number of messages whose bodies are currently resident in memory 
Memory - The Resident Set Size (RSS) of the KumoMTA process
Conn Out - Current open outbound connections
Conn In - Current open inbound connections

[Image: ktop sample]

AVERAGES - the average values since the last service reload. Many of these correspond to similar RATES above.

bounce_classify_latency - The average amount of time it takes to process your bounce classification rules
context-creation - The average amount of time it takes to create a Lua context
dir-/var/log/kumomta/ - The average amount of time it takes to submit a log event to the local logger (there will be one of these for each defined logger)
get_egress_path_config - The average amount of time it takes to invoke the get_egress_path_config event
get_egress_pool - The average amount of time it takes to invoke the get_egress_pool event
get_egress_source - The average amount of time it takes to invoke the get_egress_source event
get_listener_domain - The average amount of time it takes to invoke the get_listener_domain event
get_queue_config - The average amount of time it takes to process the get_queue_config event
hook-{hook-name} - The average amount of time it takes to process a particular hook
message_save_latency - The average amount of time it takes to save a message to the spool
queue_insert_latency - The average amount of time it takes to insert a message into a queue
queue_resolve_latency - The average amount of time it takes to resolve the configuration for a queue
should_enqueue_log_record - The average amount of time it takes to invoke the event that is used to filter log events from the logging system
smtp_client - The average amount of time it takes to communicate a message to a receiving MTA (the SMTP conversation from MAIL FROM through to the completion of DATA)
smtp_server_message_received - The average amount of time it takes to process the actions in the smtp_server_message_received event
smtpsrv_process_data_duration - The average amount of time it takes to handle the received DATA for inbound messages
smtpsrv_read_data_duration - The average amount of time it takes to receive the DATA payload for inbound messages
smtpsrv_transaction_duration - The average amount of time it takes to process an incoming SMTP transaction

[Image: Grafana dashboard]

In another post, we discussed visualizing the metrics feed with Prometheus and Grafana. The raw data in that metrics feed can be seen with a simple cURL to the metrics API endpoint: `curl http://localhost:8000/metrics`

Most of the currently exposed metrics are shown in the summary below. These will grow and contract as more data is available. For instance, the scheduled_count value will appear for every queue in the system, so there may be one or thousands of them. We generally also include an aggregate label for these sorts of metrics, so you might see `smtp_client:queue-name` as the service label, representing the count for that specific queue, and `smtp_client` as the service label representing the total count across all of those queues for the SMTP client service.
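To make the aggregate-label convention concrete, here is a small sketch that splits metric samples into per-queue and aggregate buckets based on whether the `service` label contains a `:queue-name` suffix. The sample exposition text and queue names are hypothetical, invented only to illustrate the convention; in practice you would read the real text from the `/metrics` endpoint:

```python
# Sketch: separating per-queue samples (service="smtp_client:queue-name")
# from the aggregate sample (service="smtp_client") in Prometheus-style
# exposition text. The sample lines below are hypothetical.
import re

sample = """\
ready_count{service="smtp_client"} 1500
ready_count{service="smtp_client:gmail.com"} 900
ready_count{service="smtp_client:yahoo.com"} 600
"""

LINE = re.compile(r'^(\w+)\{service="([^"]+)"\}\s+([0-9.]+)$')

aggregate = {}
per_queue = {}
for line in sample.splitlines():
    m = LINE.match(line)
    if not m:
        continue
    name, service, value = m.group(1), m.group(2), float(m.group(3))
    if ":" in service:
        # "smtp_client:queue-name" form: the count for one specific queue
        per_queue[service.split(":", 1)[1]] = value
    else:
        # bare "smtp_client": the aggregate across all of those queues
        aggregate[name] = value

print(aggregate)   # {'ready_count': 1500.0}
print(per_queue)   # {'gmail.com': 900.0, 'yahoo.com': 600.0}
```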

There are many such fields that will expand per domain/provider/queue as needed.

Metrics Summary:
connection_count -- the number of active connections for a defined service. If you have HTTP and/or custom Lua connectors, you will also see entries for those services.
disk_free_bytes -- the number of available bytes in a monitored location. You will see one of these for each storage location (spool, meta, logs, etc.)
disk_free_percent -- the percentage of available bytes in a monitored location. You will see one of these for each storage location (spool, meta, logs, etc.)
disk_free_inodes -- the number of available inodes in a monitored location. You will see one of these for each storage location (spool, meta, logs, etc.)
disk_free_inodes_percent -- the percentage of available inodes in a monitored location. You will see one of these for each storage location (spool, meta, logs, etc.)
log_submit_latency_bucket -- the latency of log event submission operations.
lua_count -- the number of Lua contexts currently alive. Lua contexts are used to evaluate your configuration and event handlers.
lua_event_latency_bucket -- how long a given Lua event callback took. There will be one histogram for each event type.
lua_load_count -- how many times the policy Lua script has been loaded into a new context.
lua_spare_count -- the number of Lua contexts available for reuse in the pool.
memory_limit -- the soft memory limit, measured in bytes.
memory_usage -- the number of bytes of Resident Set Size.
scheduled_count -- the number of scheduled messages in a given queue. Depending on your sending patterns, there may be one or thousands of them.
delayed_due_to_throttle_insert_ready -- the number of times a message was delayed due to a throttle_insert_ready_queue event; this counts the number of times a throttle was triggered.
queued_count_by_provider -- the combined number of messages in the scheduled and ready queues for a "provider".
ready_count -- the number of messages in the ready queue for a particular service. If you have HTTP and/or custom Lua connectors, you will also see entries for those services.
total_messages_delivered -- the number of messages marked as delivered for a particular service. If you have HTTP and/or custom Lua connectors, you will also see entries for those services.
total_messages_transfail -- the number of messages marked as deferred (transient failures) for a particular service. If you have HTTP and/or custom Lua connectors, you will also see entries for those services.
total_messages_fail -- the number of messages marked as bounced (permanently failed) for a particular service. If you have HTTP and/or custom Lua connectors, you will also see entries for those services.
dkim_signer_creation_bucket -- how long it takes to create a DKIM signer on a cache miss.
dkim_signer_sign_bucket -- how long it takes to DKIM sign parsed messages.
thread_pool_parked -- the number of parked (idle) threads in a given thread pool (one entry for each thread pool).
thread_pool_size -- the number of threads in a thread pool (one entry for each thread pool).
total_connections_denied -- the total number of connections rejected due to load shedding or concurrency limits. There will be one entry per service in use.
tokio_budget_forced_yield_count -- the number of times that tasks have been forced to yield back to the scheduler after exhausting their task budgets.
tokio_elapsed -- the total amount of time elapsed since observing runtime metrics.
tokio_injection_queue_depth -- the number of tasks currently scheduled in the runtime’s injection queue.
tokio_io_driver_ready_count -- the number of ready events processed by the runtime’s I/O driver.
tokio_num_remote_schedules -- the number of tasks scheduled from outside of the runtime.
tokio_total_busy_duration -- the amount of time worker threads were busy.
tokio_total_local_queue_depth -- the total number of tasks currently scheduled in workers’ local queues.
tokio_total_local_schedule_count -- the number of tasks scheduled from worker threads.
tokio_total_noop_count -- the number of times worker threads unparked but performed no work before parking again.
tokio_total_overflow_count -- the number of times worker threads saturated their local queues.
tokio_total_park_count -- the number of times worker threads parked.
tokio_total_polls_count -- the number of tasks that have been polled across all worker threads.
tokio_total_steal_count -- the number of tasks worker threads stole from another worker thread.
tokio_total_steal_operations -- the number of times worker threads stole tasks from another worker thread.
tokio_workers_count -- the number of worker threads used by the runtime.

These values can be consumed by observability tools like Prometheus, Grafana, DataDog, New Relic, Elastic, etc., for integration into your own existing systems.
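Even without a full observability stack, the feed is easy to consume directly. Below is a minimal sketch of a threshold check against the `disk_free_percent` metric; the sample exposition text, label names, and values are hypothetical, and in practice you would fetch the real text from `http://localhost:8000/metrics`:

```python
# Sketch: a minimal low-disk check against the metrics feed, as a stand-in
# for a Prometheus alerting rule. The sample lines below are hypothetical.
sample = """\
disk_free_percent{name="data spool"} 82.5
disk_free_percent{name="meta spool"} 9.1
memory_usage 1073741824
"""

def low_disk(metrics_text, threshold=10.0):
    """Return the monitored locations whose free space is below threshold."""
    alerts = []
    for line in metrics_text.splitlines():
        if not line.startswith("disk_free_percent"):
            continue
        # Split off the trailing value, then pull the location out of the label
        labels, value = line.rsplit(" ", 1)
        name = labels.split('name="', 1)[1].rstrip('"}')
        if float(value) < threshold:
            alerts.append((name, float(value)))
    return alerts

print(low_disk(sample))  # [('meta spool', 9.1)]
```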

Finally, the kcli queue-summary command details which queues are handling which messages and what state they are in.

[Image: queue-summary]

Queue-summary is divided into two sections. The top section shows active message flow from the ready queue. The counters to the right are:
D - the total number of delivered messages
T - the total number of transiently failed messages
C - the number of open connections
Q - the number of ready messages in the queue

The lower section lists the scheduled queues and the count of messages in each waiting to be advanced to the ready queue. As you can see from the image, this lists ALL messages, including emails, HTTP deliveries, and webhooks.

We aimed to provide extreme visibility into all message activity while making the data highly accessible. All of this data is available via API, so you can integrate it with your own reporting systems or create actionable feeds based on it.

Let us know if this helped you. We would love to tell your story.

------------------------------------------------------------------

KumoMTA is the first open-source MTA designed from the ground up for the world's largest commercial senders. We are fueled by Professional Services and Sponsorship revenue.

Join the Discord | Review the Docs | Read the Blog | Grab the Code | Support the team