About Events and Alerts#
Overview#
Events come from two sources:
- generated by Unryo when it detects anomalies in the data stream
- ingested from your external systems (monitoring tools, SNMP traps, logs, and so on).
Events are filtered, deduplicated, and mapped into alerts, which are visible in the Alert Console and in the topology maps.
Alert Life Cycle#
Events are processed and grouped into alerts, which are visible to users in the Alert Console. Each event has a set of tags (or properties), some required and some optional. The Alert Engine uses event tags to create alerts, update them, and maintain their state (active or inactive).
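
As a rough illustration of the grouping described above, the sketch below folds a stream of events into alerts. It is not Unryo's implementation: the `Alert` class and `apply_event` function are hypothetical, and it assumes that events are matched to alerts by their `alertID` tag and that an event with level `OK` acts as the clear.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Alert:
    """Hypothetical in-memory alert record (not Unryo's data model)."""
    alert_id: str
    level: str
    active: bool = True
    event_count: int = 0


def apply_event(alerts: Dict[str, Alert], event: dict) -> Alert:
    """Create or update the alert that matches the event's alertID tag."""
    alert_id = event["alertID"]
    alert = alerts.get(alert_id)
    if alert is None:
        # First event for this alertID: create a new alert.
        alert = Alert(alert_id=alert_id, level=event["level"])
        alerts[alert_id] = alert
    # Later events with the same alertID update the existing alert
    # instead of creating a duplicate.
    alert.level = event["level"]
    alert.event_count += 1
    # Assumption: an OK event clears the alert; any other level keeps it active.
    alert.active = event["level"] != "OK"
    return alert
```

Feeding three events with the same `alertID` (for example WARNING, CRITICAL, then OK) through `apply_event` leaves a single alert in the dictionary, marked inactive after the last event.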
Event Properties#
All events in Unryo are standardized with a set of common properties.
Event Property | Type | Description |
---|---|---|
resource | Required Tag | Resource associated with the event. |
resource_type | Required Tag | Type of the resource. |
technology | Required Tag | Technology of the resource. |
level | Required Tag | The level (or severity) of the event. Possible values: WARNING, CRITICAL, OK, INFO. |
category | Required Tag | The category of the event. Can be any string or one of the normalized categories: Availability (Down, Not Responding, Cluster at Risk, Cluster Degraded, Unavailable, and more): a resource is unavailable or at risk of becoming unavailable; Reachability (No Data Received): a connectivity problem with the monitored resources, such as a network communication failure, an unavailable cloud API, or unavailable EMS access; Errors (Failed Status Check, Failed Health Check, Authentication Failed, and more): error-related events (e.g. a request that fails) or increased error rates (e.g. traffic errors); Saturation (High Processor Usage, High Memory Usage, High Queue Length, High Disk Reads, Disk almost full, High Load, and more): a measure of resource utilization that indicates how "full" a resource is; Latency (Degraded Disk IO, Query Slowdown, Long Response Time, and more): slowdowns, such as increased times for query executions or user transactions; Custom: a user-defined event type; Informational: a purely informational event with no impact; Notification: events (such as SNMP traps or Kubernetes Events) received from external systems that are unknown or not mapped. |
eventname | Required Tag | A string that acts as a title of the event. e.g. "High CPU Utilization" |
eventtext | Required Tag | A string that provides a short description about the event, e.g. "Linux Server is experiencing a high CPU utilization" |
message | Optional Field | A string that provides a longer description about the event, e.g. "CPU utilization is high: 83.4%" or "CPU utilization is now back to normal: 72.0%" |
eventtype | Required Tag | Either durable (Unryo knows the state of the event and can send a new event if the state changes) or momentary (a problem occurred at a point in time, for example, an SNMP trap notification). |
value | Required Field | The value associated with the current event. Must be a float. |
unit | Optional Tag | |
resource_component | Optional Tag | |
resource_component_type | Optional Tag | |
measurement | Required Tag | Time-series measurement that stores the metric. |
alertname | Required Tag | The alert policy that triggered the event, e.g. "Linux CPU" |
alertID | Required Tag | The unique alert identifier. Set automatically. Can be customized to control how events are matched to alerts. |
If you customize your own alert policies, make sure all the required tags and fields are set on the resulting events. Otherwise, the event cannot be interpreted and converted into an alert. You can also add your own custom tags and fields.
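
When writing a custom alert policy, it can help to check the resulting events against the table above before relying on them. The following is a minimal sketch of such a check, not part of Unryo: the `validate_event` helper and the representation of an event as two dictionaries (`tags` and `fields`) are assumptions made for illustration. The required names and allowed values are taken from the table.

```python
# Required tags and fields, as listed in the Event Properties table.
REQUIRED_TAGS = (
    "resource", "resource_type", "technology", "level", "category",
    "eventname", "eventtext", "eventtype", "measurement",
    "alertname", "alertID",
)
REQUIRED_FIELDS = ("value",)
LEVELS = {"WARNING", "CRITICAL", "OK", "INFO"}
EVENT_TYPES = {"durable", "momentary"}


def validate_event(tags: dict, fields: dict) -> list:
    """Return a list of problems; an empty list means the event looks complete."""
    problems = []
    for name in REQUIRED_TAGS:
        if not tags.get(name):
            problems.append(f"missing required tag: {name}")
    for name in REQUIRED_FIELDS:
        if name not in fields:
            problems.append(f"missing required field: {name}")
    level = tags.get("level")
    if level and level not in LEVELS:
        problems.append(f"unknown level: {level}")
    eventtype = tags.get("eventtype")
    if eventtype and eventtype not in EVENT_TYPES:
        problems.append(f"unknown eventtype: {eventtype}")
    # The table states that value has to be a float.
    if "value" in fields and not isinstance(fields["value"], float):
        problems.append("field value must be a float")
    return problems
```

For example, an event that carries every required tag but has `value` set to the integer `83` would be reported with the single problem `field value must be a float`.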
Alert Properties#
Alert Property | Description |
---|---|
Event State | Active: the problem is active and requires attention. This is the initial state when an event occurs. Inactive: the problem no longer requires attention because it has been cleared or has expired. Inactive alerts are displayed in white in the Alerts Console. A durable alert becomes Inactive when the corresponding clear is received. A momentary alert becomes Inactive when it is acknowledged (if its clear on ack option is enabled) or when its expiration is reached (default 24 hours). Deleted: Inactive alerts are automatically deleted after a period of time (default 3 days). Deleted alerts no longer appear in the Alerts Console. |
Acknowledge | Acknowledging an alert tells other operators that you are aware of the issue and are working on it. It also assigns ownership of the alert to you. Acknowledging a momentary alert sets its state to Inactive if clearOnAck is enabled. Momentary alerts are automatically acknowledged after a period of time (default 24 hours). Unacknowledging an alert does not release the ownership. |
Ownership | Take Ownership tells other users that you are working on the alert. Release Ownership releases the ownership. |
Root Cause | Boolean set by the Correlation Engine. Yes means that the Correlation Engine determined that the alert is a root-cause alert. No means that the alert is an impact. |
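
The state transitions described in the Event State row can be summarized with a small state machine. The sketch below is only an illustration under stated assumptions: the `AlertState` class and the `on_clear`, `on_acknowledge`, and `on_tick` functions are hypothetical names, the expiration of a momentary alert is assumed to be measured from its creation time, and the two durations are the defaults quoted in the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MOMENTARY_EXPIRATION = timedelta(hours=24)  # default expiration from the table
INACTIVE_RETENTION = timedelta(days=3)      # default retention before deletion


@dataclass
class AlertState:
    """Hypothetical record holding what the life cycle needs."""
    eventtype: str                     # "durable" or "momentary"
    created_at: datetime
    clear_on_ack: bool = False
    state: str = "Active"              # initial state when an event occurs
    inactive_since: Optional[datetime] = None


def _deactivate(alert: AlertState, now: datetime) -> None:
    if alert.state == "Active":
        alert.state = "Inactive"
        alert.inactive_since = now


def on_clear(alert: AlertState, now: datetime) -> None:
    """A durable alert becomes Inactive when its clear is received."""
    if alert.eventtype == "durable":
        _deactivate(alert, now)


def on_acknowledge(alert: AlertState, now: datetime) -> None:
    """A momentary alert becomes Inactive on acknowledge if clear on ack is enabled."""
    if alert.eventtype == "momentary" and alert.clear_on_ack:
        _deactivate(alert, now)


def on_tick(alert: AlertState, now: datetime) -> None:
    """Periodic check: expire momentary alerts, then delete old Inactive alerts."""
    if (alert.state == "Active" and alert.eventtype == "momentary"
            and now - alert.created_at >= MOMENTARY_EXPIRATION):
        _deactivate(alert, now)
    if (alert.state == "Inactive"
            and now - alert.inactive_since >= INACTIVE_RETENTION):
        alert.state = "Deleted"
```

In this sketch, acknowledging a durable alert leaves it Active, since only the corresponding clear moves a durable alert to Inactive.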