Deploying AOM to Prevent ELB Alarm Storms

Application Scenario

Application Operations Management (AOM) is Huawei Cloud's one-stop platform for application operations, supporting core functions such as application monitoring, log management, and alarm management. When AOM monitors ELB business-layer metrics, it may generate a large number of duplicate or similar alarms, and the resulting alarm storm reduces operational efficiency. Configuring AOM alarm group rules groups and merges similar alarms, which reduces alarm noise, prevents alarm storms, and makes alarm management more effective.

This best practice describes how to use Terraform to automatically deploy an AOM configuration that prevents ELB alarm storms, including creating an LTS log group and log stream, an SMN topic and log tank, an AOM alarm action rule, an AOM alarm group rule, and an AOM alarm rule.

This best practice involves the following main resources:

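Resources

  • Log Tank Service (LTS) log group (huaweicloud_lts_group)

  • Log Tank Service (LTS) log stream (huaweicloud_lts_stream)

  • Simple Message Notification (SMN) topic (huaweicloud_smn_topic)

  • Simple Message Notification (SMN) log tank

  • AOM alarm action rule (huaweicloud_aom_alarm_action_rule)

  • AOM alarm group rule (huaweicloud_aom_alarm_group_rule)

  • AOM alarm rule

Resource/Data Source Dependencies

  • LTS log stream: depends on the LTS log group

  • SMN log tank: depends on the SMN topic, the LTS log group, and the LTS log stream

  • AOM alarm action rule: depends on the SMN topic

  • AOM alarm group rule: depends on the AOM alarm action rule

  • AOM alarm rule: depends on the AOM alarm group rule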

Operation Steps

1. Script Preparation

Prepare a TF file (e.g., main.tf) in the specified workspace to hold the script for this best practice. Make sure that this file, or another TF file in the same directory, contains the provider version declaration and the Huawei Cloud authentication information required to deploy resources. For configuration details, refer to the document "Preparation Before Deploying Huawei Cloud Resources".
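As a rough illustration of the provider declaration and authentication settings mentioned above, a minimal sketch could look like the following. The variable names region_name, access_key, and secret_key are hypothetical, and the version constraint is only a placeholder; the referenced preparation document remains the authoritative source.

```hcl
terraform {
  required_providers {
    huaweicloud = {
      source  = "huaweicloud/huaweicloud"
      # Placeholder constraint (assumption): pin the version your deployment requires.
      version = ">= 1.0.0"
    }
  }
}

# Hypothetical input variables for authentication; names are illustrative only.
variable "region_name" {
  description = "The region where the resources are deployed"
  type        = string
}

variable "access_key" {
  description = "The access key of the IAM user"
  type        = string
  sensitive   = true
}

variable "secret_key" {
  description = "The secret key of the IAM user"
  type        = string
  sensitive   = true
}

provider "huaweicloud" {
  region     = var.region_name
  access_key = var.access_key
  secret_key = var.secret_key
}
```

Marking the keys as sensitive keeps them out of plan output; supplying them through environment variables instead of a file avoids committing credentials.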

2. Create Log Tank Service Log Group Resource

Add the following script to the TF file (e.g., main.tf) to instruct Terraform to create a Log Tank Service log group resource:

Parameter Description:

  • group_name: The log group name, assigned by referencing the input variable lts_group_name

  • ttl_in_days: The log retention time (unit: days), set to 30 days

  • enterprise_project_id: The enterprise project ID, assigned by referencing the input variable enterprise_project_id, set to null when the value is an empty string
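Based on the parameters described above, the log group script might look like the following minimal sketch. The resource address huaweicloud_lts_group.test comes from this document; the variable declarations (types, descriptions, and the empty-string default) are assumptions.

```hcl
# Assumed variable declarations; adjust to your own variables file.
variable "lts_group_name" {
  description = "The name of the LTS log group"
  type        = string
}

variable "enterprise_project_id" {
  description = "The enterprise project ID; leave empty to omit it"
  type        = string
  default     = ""
}

resource "huaweicloud_lts_group" "test" {
  group_name  = var.lts_group_name
  ttl_in_days = 30

  # An empty string is converted to null so that the argument is left unset.
  enterprise_project_id = var.enterprise_project_id != "" ? var.enterprise_project_id : null
}
```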

3. Create Log Tank Service Log Stream Resource

Add the following script to the TF file (e.g., main.tf) to instruct Terraform to create a Log Tank Service log stream resource:

Parameter Description:

  • group_id: The log group ID, referencing the ID of the previously created Log Tank Service log group resource (huaweicloud_lts_group.test)

  • stream_name: The log stream name, assigned by referencing the input variable lts_stream_name

  • enterprise_project_id: The enterprise project ID, assigned by referencing the input variable enterprise_project_id, set to null when the value is an empty string
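A minimal sketch of the log stream script, following the parameters above. The declaration of lts_stream_name is an assumption, and var.enterprise_project_id is assumed to be declared once elsewhere in the module as a string.

```hcl
# Assumed variable declaration.
variable "lts_stream_name" {
  description = "The name of the LTS log stream"
  type        = string
}

resource "huaweicloud_lts_stream" "test" {
  # Referencing the group ID makes Terraform create the log group first.
  group_id    = huaweicloud_lts_group.test.id
  stream_name = var.lts_stream_name

  # var.enterprise_project_id is assumed to be declared elsewhere in the module.
  enterprise_project_id = var.enterprise_project_id != "" ? var.enterprise_project_id : null
}
```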

4. Create Simple Message Notification Topic Resource

Add the following script to the TF file (e.g., main.tf) to instruct Terraform to create a Simple Message Notification topic resource:

Parameter Description:

  • name: The topic name, assigned by referencing the input variable smn_topic_name

  • enterprise_project_id: The enterprise project ID, assigned by referencing the input variable enterprise_project_id, set to null when the value is an empty string
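The topic parameters above could be expressed as in the following sketch. The declaration of smn_topic_name is an assumption, and var.enterprise_project_id is assumed to be declared once elsewhere in the module.

```hcl
# Assumed variable declaration.
variable "smn_topic_name" {
  description = "The name of the SMN topic"
  type        = string
}

resource "huaweicloud_smn_topic" "test" {
  name = var.smn_topic_name

  # var.enterprise_project_id is assumed to be declared elsewhere in the module.
  enterprise_project_id = var.enterprise_project_id != "" ? var.enterprise_project_id : null
}
```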

5. Create Simple Message Notification Log Tank Resource

Add the following script to the TF file (e.g., main.tf) to instruct Terraform to create a Simple Message Notification log tank resource:

Parameter Description:

  • topic_urn: The topic URN, referencing the topic_urn of the previously created Simple Message Notification topic resource (huaweicloud_smn_topic.test)

  • log_group_id: The log group ID, referencing the ID of the previously created Log Tank Service log group resource (huaweicloud_lts_group.test)

  • log_stream_id: The log stream ID, referencing the ID of the previously created Log Tank Service log stream resource (huaweicloud_lts_stream.test)
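A sketch of the log tank script built from the three references above. The resource type huaweicloud_smn_logtank is not named in this document and is an assumption here; check the provider documentation before relying on it.

```hcl
# Resource type assumed to be huaweicloud_smn_logtank (not confirmed by this document).
resource "huaweicloud_smn_logtank" "test" {
  topic_urn     = huaweicloud_smn_topic.test.topic_urn
  log_group_id  = huaweicloud_lts_group.test.id
  log_stream_id = huaweicloud_lts_stream.test.id
}
```

Because all three arguments are references, Terraform infers the creation order without an explicit depends_on.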

6. Create AOM Alarm Action Rule Resource

Add the following script to the TF file (e.g., main.tf) to instruct Terraform to create an AOM alarm action rule resource:

Parameter Description:

  • name: The alarm action rule name, assigned by referencing the input variable alarm_action_rule_name, default value is "apm"

  • user_name: The user name, assigned by referencing the input variable alarm_action_rule_user_name

  • type: The alarm action rule type, assigned by referencing the input variable alarm_action_rule_type, default value is "1" (indicating notification type)

  • notification_template: The notification template name, using the built-in template "aom.built-in.template.zh"

  • smn_topics.topic_urn: The SMN topic URN, referencing the topic_urn of the previously created Simple Message Notification topic resource (huaweicloud_smn_topic.test)
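The alarm action rule described above might be written as in this sketch. The defaults "apm" and "1" and the template name come from the parameter description; the variable types and descriptions are assumptions.

```hcl
# Assumed variable declarations; defaults follow the parameter description.
variable "alarm_action_rule_name" {
  description = "The name of the AOM alarm action rule"
  type        = string
  default     = "apm"
}

variable "alarm_action_rule_user_name" {
  description = "The user name associated with the alarm action rule"
  type        = string
}

variable "alarm_action_rule_type" {
  description = "The type of the alarm action rule (\"1\" indicates notification)"
  type        = string
  default     = "1"
}

resource "huaweicloud_aom_alarm_action_rule" "test" {
  name                  = var.alarm_action_rule_name
  user_name             = var.alarm_action_rule_user_name
  type                  = var.alarm_action_rule_type
  notification_template = "aom.built-in.template.zh"

  smn_topics {
    topic_urn = huaweicloud_smn_topic.test.topic_urn
  }
}
```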

7. Create AOM Alarm Group Rule Resource

Add the following script to the TF file (e.g., main.tf) to instruct Terraform to create an AOM alarm group rule resource:

Parameter Description:

  • depends_on: Explicit dependency relationship, ensuring the AOM alarm action rule resource is created before the alarm group rule resource

  • name: The alarm group rule name, assigned by referencing the input variable alarm_group_rule_name

  • group_by: The list of grouping fields, set to ["resource_provider"] (indicating grouping by resource provider)

  • group_interval: The group check interval (unit: seconds), assigned by referencing the input variable alarm_group_rule_group_interval, default value is 60 seconds

  • group_repeat_waiting: The group repeat waiting time (unit: seconds), assigned by referencing the input variable alarm_group_rule_group_repeat_waiting, default value is 3600 seconds

  • group_wait: The group wait time (unit: seconds), assigned by referencing the input variable alarm_group_rule_group_wait, default value is 15 seconds

  • description: The alarm group rule description, assigned by referencing the input variable alarm_group_rule_description, set to null when the value is an empty string

  • enterprise_project_id: The enterprise project ID, assigned by referencing the input variable enterprise_project_id, set to null when the value is an empty string

  • detail.bind_notification_rule_ids: The list of bound notification rule IDs, referencing the name of the previously created AOM alarm action rule resource (huaweicloud_aom_alarm_action_rule.test)

  • detail.match: The list of matching conditions, generated by a dynamic block from the input variable alarm_group_rule_condition_matching_rules; by default, it matches alarms of Critical and Major severity and alarms that originate from AOM
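The alarm group rule parameters above can be sketched as follows. The numeric defaults come from the parameter description. The shape of alarm_group_rule_condition_matching_rules, the field names key, operate, and value, and the default keys event_severity and resource_provider are assumptions chosen to reflect the stated default filter. var.enterprise_project_id is assumed to be declared once elsewhere in the module.

```hcl
variable "alarm_group_rule_name" {
  description = "The name of the AOM alarm group rule"
  type        = string
}

variable "alarm_group_rule_group_interval" {
  description = "The group check interval, in seconds"
  type        = number
  default     = 60
}

variable "alarm_group_rule_group_repeat_waiting" {
  description = "The group repeat waiting time, in seconds"
  type        = number
  default     = 3600
}

variable "alarm_group_rule_group_wait" {
  description = "The group wait time, in seconds"
  type        = number
  default     = 15
}

variable "alarm_group_rule_description" {
  description = "The description of the alarm group rule"
  type        = string
  default     = ""
}

# Assumed structure and default values for the matching conditions.
variable "alarm_group_rule_condition_matching_rules" {
  type = list(object({
    key     = string
    operate = string
    value   = list(string)
  }))
  default = [
    { key = "event_severity", operate = "EQUALS", value = ["Critical", "Major"] },
    { key = "resource_provider", operate = "EQUALS", value = ["AOM"] },
  ]
}

resource "huaweicloud_aom_alarm_group_rule" "test" {
  depends_on = [huaweicloud_aom_alarm_action_rule.test]

  name                 = var.alarm_group_rule_name
  group_by             = ["resource_provider"]
  group_interval       = var.alarm_group_rule_group_interval
  group_repeat_waiting = var.alarm_group_rule_group_repeat_waiting
  group_wait           = var.alarm_group_rule_group_wait
  description          = var.alarm_group_rule_description != "" ? var.alarm_group_rule_description : null

  # var.enterprise_project_id is assumed to be declared elsewhere in the module.
  enterprise_project_id = var.enterprise_project_id != "" ? var.enterprise_project_id : null

  detail {
    bind_notification_rule_ids = [huaweicloud_aom_alarm_action_rule.test.name]

    # One match block is generated per entry in the input list.
    dynamic "match" {
      for_each = var.alarm_group_rule_condition_matching_rules

      content {
        key     = match.value.key
        operate = match.value.operate
        value   = match.value.value
      }
    }
  }
}
```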

8. Create AOM Alarm Rule Resource

Add the following script to the TF file (e.g., main.tf) to instruct Terraform to create an AOM alarm rule resource:

Parameter Description:

  • name: The alarm rule name, assigned by referencing the input variable alarm_rule_name

  • type: The alarm rule type, set to "metric" (indicating metric type)

  • enable: Whether to enable the alarm rule, set to true

  • prom_instance_id: The Prometheus instance ID, assigned by referencing the input variable prometheus_instance_id, default value is "0" (indicating the default Prometheus_AOM_Default instance)

  • alarm_notifications.notification_enable: Whether to enable notifications, set to true

  • alarm_notifications.notification_type: The notification type, set to "alarm_policy" (indicating alarm policy type)

  • alarm_notifications.route_group_enable: Whether to enable route grouping, set to true

  • alarm_notifications.route_group_rule: The route group rule name, referencing the name of the previously created AOM alarm group rule resource (huaweicloud_aom_alarm_group_rule.test)

  • alarm_notifications.notify_resolved: Whether to notify on recovery, set to true

  • alarm_notifications.notify_triggered: Whether to notify on trigger, set to true

  • alarm_notifications.notify_frequency: The notification frequency, set to "-1" (indicating that the frequency settings of the alarm group rule are used)

  • metric_alarm_spec.monitor_type: The monitoring type, set to "all_metric" (indicating all metrics)

  • metric_alarm_spec.recovery_conditions.recovery_timeframe: The recovery time frame, set to 1 (unit: minutes)

  • metric_alarm_spec.trigger_conditions: The trigger conditions list, dynamically generated through the dynamic block based on the input variable alarm_rule_trigger_conditions
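The alarm rule parameters above might be assembled as in the following sketch. The resource type huaweicloud_aomv4_alarm_rule is not named in this document and is an assumption. The fields inside trigger_conditions (metric_query_mode, metric_name, promql, operator, thresholds) are an illustrative, unverified subset, since this document does not list them; consult the provider documentation for the actual schema.

```hcl
variable "alarm_rule_name" {
  description = "The name of the AOM alarm rule"
  type        = string
}

variable "prometheus_instance_id" {
  description = "The Prometheus instance ID (\"0\" indicates Prometheus_AOM_Default)"
  type        = string
  default     = "0"
}

# Assumed structure: the field names are illustrative and not confirmed by this document.
variable "alarm_rule_trigger_conditions" {
  type = list(object({
    metric_query_mode = string
    metric_name       = string
    promql            = string
    operator          = string
    thresholds        = map(string)
  }))
}

# Resource type assumed to be huaweicloud_aomv4_alarm_rule.
resource "huaweicloud_aomv4_alarm_rule" "test" {
  name             = var.alarm_rule_name
  type             = "metric"
  enable           = true
  prom_instance_id = var.prometheus_instance_id

  alarm_notifications {
    notification_enable = true
    notification_type   = "alarm_policy"
    route_group_enable  = true
    route_group_rule    = huaweicloud_aom_alarm_group_rule.test.name
    notify_resolved     = true
    notify_triggered    = true
    notify_frequency    = "-1"
  }

  metric_alarm_spec {
    monitor_type = "all_metric"

    recovery_conditions {
      recovery_timeframe = 1
    }

    # One trigger_conditions block is generated per entry in the input list.
    dynamic "trigger_conditions" {
      for_each = var.alarm_rule_trigger_conditions

      content {
        metric_query_mode = trigger_conditions.value.metric_query_mode
        metric_name       = trigger_conditions.value.metric_name
        promql            = trigger_conditions.value.promql
        operator          = trigger_conditions.value.operator
        thresholds        = trigger_conditions.value.thresholds
      }
    }
  }
}
```

Routing notifications through the alarm group rule (route_group_enable and route_group_rule) is what lets similar ELB alarms be merged instead of sent one by one.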

9. Preset Input Parameters Required for Resource Deployment (Optional)

In this practice, some resources take their configuration from input variables. Unless preset, these variables must be entered manually at each deployment. Terraform can preset them through a tfvars file, which avoids repeated input on every run.

Create a terraform.tfvars file in the working directory with the following example content:
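The example content could take a form like the sketch below. The variable names are those referenced throughout this best practice. The values are hypothetical placeholders, apart from test-group and test-rule, which reuse the values from the command-line example later in this section.

```hcl
# terraform.tfvars: hypothetical example values, replace them with your own.
lts_group_name              = "test-group"
lts_stream_name             = "test-stream"
smn_topic_name              = "test-topic"
alarm_action_rule_user_name = "test-user"
alarm_group_rule_name       = "test-group-rule"
alarm_rule_name             = "test-rule"

# Variables without a default value, such as alarm_rule_trigger_conditions,
# also need a value here that matches their declared type.
```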

Usage:

  1. Save the above content as a terraform.tfvars file in the working directory. Terraform loads a file with this exact name automatically when it runs. A file with a different name is loaded automatically only if its name ends in .auto.tfvars, such as variables.auto.tfvars

  2. Modify parameter values according to actual needs

  3. When executing terraform plan or terraform apply, Terraform will automatically read the variable values in this file

In addition to using the terraform.tfvars file, you can also set variable values in the following ways:

  1. Command line parameters: terraform apply -var="lts_group_name=test-group" -var="alarm_rule_name=test-rule"

  2. Environment variables: export TF_VAR_lts_group_name=test-group

  3. Custom-named variable file: terraform apply -var-file="custom.tfvars"

Note: If the same variable is set through multiple methods, Terraform uses the value from the highest-priority source, in the following order: command-line options (-var and -var-file, with later options overriding earlier ones) > *.auto.tfvars files > terraform.tfvars file > environment variables > default values.

10. Initialize and Apply Terraform Configuration

After completing the above script configuration, execute the following steps to create resources:

  1. Run terraform init to initialize the environment

  2. Run terraform plan to view the resource creation plan

  3. After confirming that the resource plan is correct, run terraform apply to create the resources that prevent ELB alarm storms

  4. Run terraform show to view the details of the created resources
