Analytics and Alerting#
Introduction#
The Unryo platform processes millions of metrics and events from the data stream to detect anomalies in real time.
To determine whether alert conditions are met, Unryo uses numerous built-in alert configurations that cover the major technologies and ship with best-practice thresholds. When an anomaly is detected, Unryo performs root cause analysis (to determine the probable cause) and impact analysis (to identify the impacted resources) and, if specified, sends notifications such as an email, an SNMP trap, or a Microsoft Teams message.
Current and past alerts are visible in the Alert Console and are also displayed in context in dashboards and the topology map.
Section Content:
Topics | Description |
---|---|
The Alert Console | The Alert Console displays current and past alerts. |
Alert Life Cycle | Alerts in Unryo have a state (active or inactive) and a set of normalized metadata (to facilitate triage and filtering). Users can take actions on them (acknowledge, take ownership, ...). |
Predefined Alerts | Unryo ships with dozens of default alert rules configured with best-practice thresholds covering major technologies. |
Customizing Alerts | Users can create their own alerts using the Alert Editor. |
Notification Channels | Unryo supports multiple notification channels, such as an email, a Slack message, or a Microsoft Teams notification. |
Root Cause Analysis | The Unryo Correlation Engine determines whether the resource is the root cause or an impact, based on the topology. |
Impact Analysis | Unryo lets you define business elements and logical groups, then calculates the impact based on the propagation. |
ChatGPT Integration | Provides users with insight and potential solutions on their alerts. |
Alert Life Cycle#
Events are processed and regrouped into alerts, which are visible to users in the Alert Console. Events have a set of tags (or properties), some required and some optional. The Alert Engine uses event tags to create alerts, update them, and maintain their state (active or inactive).
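As a rough illustration of how events might be regrouped into alerts, the sketch below keys alerts on the alertID tag and derives the state from the event level. It is not Unryo's implementation: the `Alert` class, the `apply_event` function, and the rule that an OK event acts as the clear are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    alert_id: str
    state: str = "active"  # "active" or "inactive"
    occurrences: list = field(default_factory=list)


def apply_event(alerts: dict, event: dict) -> Alert:
    """Create or update the alert that matches the event's alertID tag."""
    alert = alerts.setdefault(event["alertID"], Alert(alert_id=event["alertID"]))
    alert.occurrences.append(event)
    # Assumption: an event with level OK acts as the clear for its alert.
    alert.state = "inactive" if event["level"] == "OK" else "active"
    return alert


alerts = {}
apply_event(alerts, {"alertID": "cpu:srv1", "level": "CRITICAL"})
apply_event(alerts, {"alertID": "cpu:srv1", "level": "OK"})
apply_event(alerts, {"alertID": "disk:srv1", "level": "WARNING"})
```

Here the two `cpu:srv1` events end up as occurrences of a single alert, while the `disk:srv1` event opens a separate one.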
Event Properties#
Event Property | Type | Description |
---|---|---|
resource | Required Tag | Resource associated with the event. |
resource_type | Required Tag | Resource type of the resource. |
technology | Required Tag | Technology of the resource. |
level | Required Tag | The level (or severity) of the event. Possible values: WARNING, CRITICAL, OK, INFO |
category | Required Tag | The category of the event. Can be any string, or one of the normalized categories: Availability (Down, Not Responding, Cluster at Risk, Cluster Degraded, Unavailable, and more): indicates a resource that is unavailable or at risk of becoming unavailable. Reachability (No Data Received): indicates a connectivity problem with the monitored resources, such as a network communication failure, an unavailable cloud API, or unavailable EMS access. Errors (Failed Status Check, Failed Health Check, Authentication Failed, and more): error-related events (e.g. a request that fails) or increased error rates (e.g. traffic errors). Saturation (High Processor Usage, High Memory Usage, High Queue Length, High Disk Reads, Disk almost full, High Load, and more): a measure of resource utilization that indicates how "full" a resource is. Latency (Degraded Disk IO, Query Slowdown, Long Response Time, and more): indicates slowdowns or increased times for query executions or user transactions. Custom: user-defined event type. Informational: purely informational event, no impact. |
eventname | Required Tag | A string that acts as a title of the event. e.g. "High CPU Utilization" |
eventtext | Required Tag | A string that provides a short description about the event, e.g. "Linux Server is experiencing a high CPU utilization" |
message | Optional Field | A string that provides a longer description about the event, e.g. "CPU utilization is high: 83.4%" or "CPU utilization is now back to normal: 72.0%" |
eventtype | Required Tag | Either durable (Unryo knows the state of the event and is able to send a new event if the state changes) or momentary (a problem occurred at a point in time, for example, an SNMP trap notification). |
value | Required Field | The value associated with the current event. It has to be a float. |
unit | Optional Tag | |
resource_component | Optional Tag | |
resource_component_type | Optional Tag | |
measurement | Required Tag | Timeseries measurement that stores the metric |
alertname | Required Tag | The alert policy that triggered the event, e.g. "Linux CPU" |
alertID | Required Tag | The unique alert identifier. Automatically set. Can be customized to control how events are matched to alerts. |
If you customize your own alert policies, make sure all the required tags and fields are set on the resulting events. Otherwise, the event cannot be interpreted and converted into an alert. You can also add your own custom tags and fields.
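A quick completeness check can catch missing properties before events reach the Alert Engine. The sketch below is only an illustration built from the table above: the `validate_event` function and the split into `tags` and `fields` dictionaries are assumptions, not part of Unryo.

```python
# Required tags as listed in the Event Properties table.
REQUIRED_TAGS = (
    "resource", "resource_type", "technology", "level", "category",
    "eventname", "eventtext", "eventtype", "measurement", "alertname",
    "alertID",  # the table notes this one is normally set automatically
)
LEVELS = {"WARNING", "CRITICAL", "OK", "INFO"}
EVENT_TYPES = {"durable", "momentary"}


def validate_event(tags: dict, fields: dict) -> list:
    """Return a list of problems; an empty list means the event looks complete."""
    problems = [f"missing tag: {name}" for name in REQUIRED_TAGS if not tags.get(name)]
    if tags.get("level") and tags["level"] not in LEVELS:
        problems.append(f"invalid level: {tags['level']}")
    if tags.get("eventtype") and tags["eventtype"] not in EVENT_TYPES:
        problems.append(f"invalid eventtype: {tags['eventtype']}")
    # The value field is required and has to be a float.
    if not isinstance(fields.get("value"), float):
        problems.append("field 'value' must be a float")
    return problems


example_tags = {
    "resource": "srv1", "resource_type": "server", "technology": "linux",
    "level": "WARNING", "category": "Saturation",
    "eventname": "High CPU Utilization",
    "eventtext": "Linux Server is experiencing a high CPU utilization",
    "eventtype": "durable", "measurement": "cpu",
    "alertname": "Linux CPU", "alertID": "linux-cpu:srv1",
}
complete = validate_event(example_tags, {"value": 83.4})
incomplete = validate_event({**example_tags, "level": "FATAL"}, {"value": 83})
```

The resource names and the alertID value in `example_tags` are made up for the example.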
Alert Properties#
Alert Property | Description |
---|---|
Event State | Active: indicates that the problem is active and requires attention. This is the initial state when an event occurs. Inactive: indicates that the problem no longer requires attention because it has been cleared or has expired. Inactive alerts are displayed in white in the Alerts Console. A durable alert becomes inactive when the corresponding clear is received. A momentary alert becomes inactive when it is acknowledged (if its clear-on-ack option is enabled) or when its expiration is reached (default 24 hours). Deleted: inactive alerts are automatically deleted after a period of time (default 3 days). Deleted alerts no longer appear in the Alerts Console. |
Acknowledge | Acknowledging an alert tells other operators that you are aware of the issue and are working on it. Acknowledging an alert assigns the ownership to you. Acknowledging a momentary alert sets its state to inactive (if clearOnAck is enabled). Momentary alerts are automatically acknowledged after a period of time (default 24 hours). Unacknowledging an alert does not release the ownership. |
Ownership | Take Ownership tells other users you are working on it. Release Ownership releases the ownership. |
Root Cause | Boolean enriched by the Correlation Engine. Yes means that the Correlation Engine determined that the alert is a root-cause alert. No means that the alert is an impact. |
Possible actions on alerts:
- Alert Details: Shows information such as alert occurrences, tags, alert policy information, resource information, etc.
- Acknowledge/Unacknowledge
- Take Ownership/Release Ownership
- Force Clear: Forcibly clears an alert. Forcibly cleared alerts differ slightly from naturally cleared alerts: they stay cleared for the remainder of a specific occurrence, whereas naturally cleared alerts become active again on any re-notification.
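The state rules in the Event State row can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the dictionary keys (`state`, `eventtype`, `created`, `inactive_since`, `clear_received`, `acknowledged`, `clear_on_ack`) and the `next_state` function are hypothetical, and only the default durations come from the table.

```python
from datetime import datetime, timedelta

MOMENTARY_EXPIRATION = timedelta(hours=24)  # default expiration of momentary alerts
INACTIVE_RETENTION = timedelta(days=3)      # default retention of inactive alerts


def next_state(alert: dict, now: datetime) -> str:
    """Return 'active', 'inactive' or 'deleted' for an alert at time `now`."""
    if alert["state"] == "inactive":
        # Inactive alerts are deleted once the retention period has passed.
        if now - alert["inactive_since"] >= INACTIVE_RETENTION:
            return "deleted"
        return "inactive"
    if alert["eventtype"] == "durable":
        # Durable alerts wait for the corresponding clear.
        return "inactive" if alert.get("clear_received") else "active"
    # Momentary alerts clear on acknowledgement (if enabled) or on expiration.
    if alert.get("acknowledged") and alert.get("clear_on_ack"):
        return "inactive"
    if now - alert["created"] >= MOMENTARY_EXPIRATION:
        return "inactive"
    return "active"
```

For example, a momentary alert created at midnight stays active for the rest of that day unless it is acknowledged with clear-on-ack enabled, and is reported as inactive from the following midnight.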
View Alerts#
Go to the Alerts section to view all the active alerts in your organization.
Using the Alerts Console, users can:
- Search, Sort, and Filter alerts.
- Acknowledge, Force Clear, or set ownership on alerts.
- View the Details of an alert and see all event occurrences and properties.
- Execute Action Tools such as ping, incident system integration, or any executable.
- Click on a resource to access the dashboard or topology map.
- Customize their alert console by adding panels and columns.
Configure Alerts#
Predefined Alert Configurations#
Unryo ships with out-of-the-box, best-practice alert definitions for common devices and applications. These alerts are enabled by default, so from day one you are informed of the most common problems, for example if your AWS VMs are in trouble, if your Kubernetes pods are running out of memory, or if your users have a degraded experience.
Setting up Alerts#
Alert definitions are managed centrally from the Configuration UI. You can also use the Unryo API to manage alert definitions programmatically.
Go to Configuration Management.
Click on the Alert Definitions panel to list all the alert configurations. Dozens of configurations are available and ready to use. They come with best-practice thresholds and settings to monitor a particular technology.
From there, you can:
- Enable, Disable, Delete, and Duplicate an alert configuration.
- Edit an alert configuration. You can adapt alert settings to your particular requirements. Typically, you may want to change thresholds, monitoring time windows, or formulas, filter the stream of data to analyze (based on devices or any other criteria), or add a notification channel such as email or Slack.
- Add a new configuration. Numerous alert templates are predefined to cover the most common alerting needs.
Create your own Alert Configuration#
You add an alert definition by choosing a template to start from. Templates are designed to work out of the box and cover many analytics cases, such as simple threshold, forecast, deviation, no-data detection, or combo-metrics KPIs. You can use them as-is or adjust the thresholds and other settings.
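To make some of these analytics cases concrete, here is a conceptual sketch of a simple threshold, a deviation check, and a no-data check. It does not reflect how Unryo templates are implemented: the function names, parameters, and the three-sigma default are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def threshold_breached(values, threshold):
    """Simple threshold: every sample in the time window exceeds the threshold."""
    return bool(values) and all(v > threshold for v in values)


def deviates(history, latest, max_sigma=3.0):
    """Deviation: latest sample is more than max_sigma standard deviations from the mean."""
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value counts as a deviation.
        return latest != mean(history)
    return abs(latest - mean(history)) / sigma > max_sigma


def no_data(last_seen, now, max_silence):
    """No-data detection: nothing received for longer than max_silence seconds."""
    return now - last_seen > max_silence
```

Requiring every sample in the window to breach the threshold is one way to avoid alerting on a single spike; a template could just as well use an average or a percentile over the window.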
Click the + button to open the alert editor.
Select:
- the alert template you want to use,
- the analytics engine on which you want this configuration to be deployed,
- and provide a Configuration Name that is meaningful to you.
- The Description is optional.
Define the alert as per your requirements by specifying the stream of data to analyze, the alert conditions, and which notifications to fire, if any.
You can either use the Alert UI or switch to edit mode to display the configuration file.
Once done, click Apply to save, and finally Enable the configuration to start the analysis.
Configure Notifications#
Add a Notification Channel#
Now that you are familiar with how to configure alerts, you may want to receive notifications from them, such as an email or a Slack message. Unryo can automatically notify users about a serious problem or condition that requires a quick resolution.
Channel | Description |
---|---|
Discord | Sends out notifications to Discord |
HipChat | Sends out notifications to your HipChat room |
HTTP Post | Sends out notifications to an HTTP endpoint |
Kafka | Sends out notifications to a Kafka consumer |
Microsoft Teams | Sends out notifications in Microsoft Teams |
MQTT | Sends out notifications to MQTT |
OpsGenie | Sends out notifications in OpsGenie |
PagerDuty | Sends out notifications in PagerDuty |
Pushover | Sends out notifications to Pushover |
ServiceNow | Sends out notifications to ServiceNow |
Slack | Sends out notifications to your Slack channel |
SMTP | Sends out notifications to email recipients |
SNMP Traps | Sends out notifications as SNMP Traps to an SNMP Trap receiver |
Telegram | Sends out notifications to Telegram |
VictorOps | Sends out notifications in VictorOps |
Notification rate limiting & silencing#
Notification Channels can be configured to limit the notification rate and to prevent the same notification from being fired repeatedly in some situations.
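The idea behind rate limiting can be sketched as a small gate placed in front of a channel. This is only an illustration of the concept, not Unryo's configuration or implementation: the `NotificationLimiter` class, its `min_interval` parameter, and the notification key are hypothetical.

```python
import time


class NotificationLimiter:
    """Suppress a notification that repeats within `min_interval` seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self._last_sent = {}  # notification key -> time of last delivery

    def should_send(self, key: str) -> bool:
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False  # same notification fired too recently: silence it
        self._last_sent[key] = now
        return True


limiter = NotificationLimiter(min_interval=300)
if limiter.should_send("linux-cpu:srv1:CRITICAL"):
    pass  # hand the notification over to the channel here
```

Keying on the alert and its level, as in the usage lines, would let a change of severity through while silencing repeats of the same notification.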