Create a High Availability Setup – HetrixTools, PagerDuty, Python, Cloudflare

Introduction

This tutorial covers three things: building a webhook in Python with Flask, integrating PagerDuty with the HetrixTools monitoring service, and using the Cloudflare DNS API with the cURL command. Python is one of the top programming languages for developers, and we will learn how to use Python scripting to make a call to the Cloudflare API. The webhook will be integrated into PagerDuty, a popular online incident management platform, to limit downtime and reduce the impact on customers. Lastly, we will use the HetrixTools monitoring service to trigger an alert to PagerDuty. The goal is a lower Mean Time To Resolution (MTTR), the average amount of time it takes to fix a failure.

Background

As a full-stack PHP developer, I had zero interest in Python and avoided picking it up because I could complete 90% of my required tasks with PHP. The only time I used Python was to scrape websites with Selenium. But my new job as a Site Reliability Engineer (SRE) requires Python for automation scripts, so I thought it was time to get started. I am also interested in learning more about PagerDuty, since I already use HetrixTools to monitor my VPS. We will connect HetrixTools to PagerDuty because HetrixTools detects the problems and PagerDuty alerts people about them.

Step 1 – Cloudflare

I was searching for something in the LowEndTalk community when I came across a thread titled "How to build a High Availability Nginx using two VPS". One of the replies (below) caught my attention because I was using HetrixTools to monitor the uptime of my server and Cloudflare as my DNS service.

https://lowendtalk.com/discussion/181831/how-to-build-a-high-availability-nginx-using-two-vps#latest

Step 1a – Create API Token to Update Specific Zone Only

You can use the Edit zone DNS template to create an API token. Following the principle of least privilege, we will choose a specific zone from the drop-down list of domains and whitelist our webhook server's IP address (123.123.123.123) under Client IP Address Filtering for more security.

The API to update DNS records is described in the Cloudflare API v4 documentation. Before we turn the cURL command into a webhook with Python scripting, we will test it in our Linux terminal to make sure it works with the correct API tokens and keys on our whitelisted webhook server.

$ curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
     -H "Authorization: Bearer <INSERT API token here>" \
     -H "Content-Type:application/json"
{"result":{"id":"daxxxxx94bf8aacxxxxx6e41ecxxxxx3","status":"active"},"success":true,"errors":[],"messages":[{"code":10000,"message":"This API Token is valid and active","type":null}]}
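The same check can be scripted. The following is a minimal sketch rather than part of the original setup: it uses only Python's standard library (urllib) so it runs before anything is installed, and the helper names token_is_active and verify_token are hypothetical. It assumes the response has the shape shown in the cURL output above.

```python
import json
import urllib.request

VERIFY_URL = "https://api.cloudflare.com/client/v4/user/tokens/verify"


def token_is_active(body: dict) -> bool:
    """Return True when a verify response reports a valid, active token."""
    result = body.get("result") or {}
    return bool(body.get("success")) and result.get("status") == "active"


def verify_token(api_token: str) -> bool:
    """Call the verify endpoint with a Bearer token (performs a network request)."""
    request = urllib.request.Request(
        VERIFY_URL,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=10) as response:
        return token_is_active(json.load(response))
```

Calling verify_token("<INSERT API token here>") from the whitelisted webhook server should return True for a working token. A rejected token will most likely surface as an HTTPError rather than False.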

Step 1b – Find Zone and DNS Record ID

Every domain name hosted by Cloudflare has a Zone ID, shown at the bottom right of the domain's Overview page.

Next, we need to locate the ID of the DNS record that the API will update. The Zone ID (vUeC9tfMVqNyp5s5fHno) can be used in the cURL command below to find the DNS Record ID (QxYR2kE9tJvU0fct4b3N). This request authenticates with the account email and API key (X-Auth-Email and X-Auth-Key) rather than the token from Step 1a.

$ curl -X GET "https://api.cloudflare.com/client/v4/zones/vUeC9tfMVqNyp5s5fHno/dns_records?type=CNAME&name=example.com" \
     -H "X-Auth-Email: youremail@gmail.com" \
     -H "X-Auth-Key: XXXXXebbe0927XXXXX4ca793e575d8f0XXXXX" \
     -H "Content-Type: application/json"
{"result":[{"id":"QxYR2kE9tJvU0fct4b3N","zone_id":"vUeC9tfMVqNyp5s5fHno","zone_name":"example.com","name":"example.com","type":"CNAME","content":"production.example.com","proxiable":true,"proxied":false,"ttl":1,"locked":false,"meta":{"auto_added":false,"managed_by_apps":false,"managed_by_argo_tunnel":false,"source":"primary"},"created_on":"2022-10-26T02:48:47.374981Z","modified_on":"2022-10-26T05:38:37.704961Z"}],"success":true,"errors":[],"messages":[],"result_info":{"page":1,"per_page":100,"count":1,"total_count":1,"total_pages":1}}
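Picking the record ID out of that JSON by eye is error-prone, so here is a small sketch that does it in Python. The function name find_record_id is hypothetical, and the sample dictionary is a trimmed copy of the response above.

```python
from typing import Optional


def find_record_id(body: dict, name: str, record_type: str) -> Optional[str]:
    """Return the ID of the first record matching name and type, or None."""
    for record in body.get("result") or []:
        if record.get("name") == name and record.get("type") == record_type:
            return record.get("id")
    return None


# Trimmed copy of the dns_records response shown above.
sample = {
    "result": [
        {
            "id": "QxYR2kE9tJvU0fct4b3N",
            "zone_id": "vUeC9tfMVqNyp5s5fHno",
            "name": "example.com",
            "type": "CNAME",
            "content": "production.example.com",
        }
    ],
    "success": True,
}

print(find_record_id(sample, "example.com", "CNAME"))  # QxYR2kE9tJvU0fct4b3N
```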

Step 2 – Python

Most Linux systems come with Python already installed. Use python --version or python3 --version to check. Pip is a package management system for Python. You can check if pip is installed by using pip3 --version, or you can use apt install python3-pip to install.

Step 2a – Install Flask as Webhook Listener

Flask is a lightweight Python web framework that works well as a webhook receiver and is easy to deploy on Linux. By default, the Flask development server listens on port 5000 at http://127.0.0.1:5000/. If you get the FileNotFoundError: [Errno 2] error, upgrade pip with pip3 install --upgrade pip before installing Flask with the pip command.

$ pip install flask
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/bin/pip'

$ pip3 install --upgrade pip
    Successfully uninstalled pip-22.2.2
Successfully installed pip-22.3
$ pip install flask

Step 2b – Build a Python Webhook

In this tutorial, we will combine the Flask initialization code and the Cloudflare API code into a single Python script. Flask will listen on the route registered with @app.route("/failover-backup"), for example http://123.123.123.123:5000/failover-backup. When that URL is requested, the function failover_backup() is called. Below is the cURL example provided by the Cloudflare API v4 Documentation, followed by the Python script that makes the equivalent call with requests.put.

curl -X PUT "https://api.cloudflare.com/client/v4/zones/vUeC9tfMVqN---Zone ID---yp5s5fHno/dns_records/QxYR2kE---Record ID---9tJvU0fct4b3N" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: YEGFk5P21JAGUhmloBbI" \
     -H "Content-Type: application/json" \
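     --data '{"type":"A","name":"example.com","content":"127.0.0.1","ttl":3600,"proxied":false}'

import requests
from flask import Flask

# Cloudflare credentials (placeholders; replace with your own)
headers = {
    'X-Auth-Email': 'user@gmail.com',
    'X-Auth-Key': 'YEGFk5P21JAGUhmloBbI',
    'Content-Type': 'application/json',
}

app = Flask(__name__)

# GET allows the browser test in Step 2c; POST is accepted as well
# because PagerDuty delivers webhooks as POST requests.
@app.route("/failover-backup", methods=["GET", "POST"])
def failover_backup():
    # Zone ID and DNS Record ID from Step 1b
    url = 'https://api.cloudflare.com/client/v4/zones/vUeC9tfMVqNyp5s5fHno/dns_records/QxYR2kE9tJvU0fct4b3N'

    # Point the CNAME at the backup server
    payload = {'type': 'CNAME', 'name': 'example.com', 'content': 'backup.example.com', 'ttl': 3600, 'proxied': False}

    # The timeout keeps the webhook from hanging if Cloudflare is unreachable
    resp = requests.put(url, headers=headers, json=payload, timeout=10)

    # Return Cloudflare's JSON response body to the caller
    return resp.content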
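The script above keeps the credentials in the source file. One alternative, sketched below under assumptions, is to read the scoped API token from Step 1a out of an environment variable and send it as a Bearer header, the same way the verify call in Step 1a does. The variable name CLOUDFLARE_API_TOKEN and the function build_headers are hypothetical names chosen for this example.

```python
import os
from typing import Mapping


def build_headers(env: Mapping[str, str] = os.environ) -> dict:
    """Build Cloudflare request headers from an API token in the environment."""
    # CLOUDFLARE_API_TOKEN is a hypothetical variable name for this sketch.
    token = env.get("CLOUDFLARE_API_TOKEN")
    if not token:
        raise RuntimeError("CLOUDFLARE_API_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

With this in place, headers = build_headers() could replace the hard-coded dictionary, and the token would be exported in the shell before starting Flask.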

Step 2c – Test Webhook with Flask

Below are the commands to run Flask on the default localhost address and on the public IP. In this tutorial, we will use --host=<public IP> because PagerDuty needs to reach the webhook URL over the internet. Remember that this public IP is already whitelisted in Step 1a.

$ flask run
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

$ flask run --host=123.123.123.123
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://123.123.123.123:5000/ (Press CTRL+C to quit)

To test, go to http://<webhook server IP>:5000/failover-backup in a browser.

After entering the URL, the browser displays the JSON result from Cloudflare, and the Flask log shows an HTTP 200 OK status response code.
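Note that the route returns resp.content without passing on Cloudflare's status code, so Flask answers 200 OK even when Cloudflare rejects the update. It is worth reading the JSON body itself. The sketch below shows one way to do that. The function name describe_update is hypothetical, and it assumes Cloudflare's usual response envelope (success, result, errors) as seen in the earlier outputs.

```python
import json


def describe_update(raw: bytes) -> str:
    """Summarize the JSON body returned by the /failover-backup webhook."""
    body = json.loads(raw)
    if body.get("success"):
        record = body.get("result") or {}
        return f"{record.get('name')} now points to {record.get('content')}"
    errors = "; ".join(
        error.get("message", "unknown error") for error in body.get("errors") or []
    )
    return f"update failed: {errors}"
```

For example, a body containing "success": true and a result whose content is backup.example.com would be summarized as "example.com now points to backup.example.com".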

Step 3 – PagerDuty

Since the webhook is now working via the browser, we need something to trigger it. PagerDuty is a multi-function platform that notifies its users of outages through an incident management dashboard. However, it is not a monitoring tool like Prometheus or Zabbix. Instead, a monitoring service sends an alert to PagerDuty after a condition or threshold is triggered.

Step 3a – Create a Service

You can sign up for the free plan on the PagerDuty website. There is no need to enter credit card details, so there is one less thing to worry about. To create a new service, go to Services > Services Directory and click the [+ New Service] button.

  1. Name: HetrixTools Monitor; Description: Remote uptime monitoring for the website and ping
  2. Select an existing Escalation Policy > Default
  3. Alert Grouping > Intelligent (Recommended); Transient Alerts > Auto-pause incident notifications (Recommended)
  4. Select integration(s) to send alerts to this service > API Fortress Connector
  5. Note down the Integration Key

For detailed steps, please refer to the HetrixTools documentation on PagerDuty Integration.

Step 3b – Custom Incident Actions (Webhook)

With custom incident actions, incident responders can quickly do something outside the PagerDuty system, such as restarting a server or carrying out a step from a runbook. A custom incident action adds a button to PagerDuty incidents in both the web and mobile versions, and pressing it fires a webhook to our specified URL. Click the [+ New Action] button, enter a URL endpoint, and choose a service from the drop-down list. Alternatively, we could use runbook automation to make our incident response process simpler.

PagerDuty – Create a custom incident action
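Because the action URL is reachable from the internet, anyone who finds it could trigger the failover. One possible safeguard, offered here as an assumption rather than something the original setup includes, is to append a shared secret to the URL (for example /failover-backup?token=...) and reject requests that do not carry it. The helper name is_authorized is hypothetical. The comparison uses hmac.compare_digest so that it runs in constant time.

```python
import hmac


def is_authorized(provided: str, expected: str) -> bool:
    """Compare a caller-supplied secret with the expected one in constant time."""
    # An empty expected secret never authorizes anyone.
    if not expected:
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Inside the Flask route, the provided value could be read with request.args.get("token", "") and the function could return a 403 response when is_authorized is False.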

Step 4 – HetrixTools

I was aware of other free uptime monitoring services, such as UptimeRobot, and had used Pingdom and StatusCake in the past. I learned about HetrixTools in the LowEndTalk community and have since used it as my primary uptime service for all my VPS. It is an easy-to-use tool for checking domain expiration, website status, and server uptime. It also offers free monitoring from various remote locations. The only downside is that you have to keep your HetrixTools account active by logging in to your dashboard at least once every 90 days.

Step 4a – Add PagerDuty Integration Key

Paste the integration key generated from the PagerDuty service in Step 3a into HetrixTools Contact Lists > PagerDuty.

Step 5 – Simulate Alerting Workflow

To send a test notification, log in to HetrixTools > Contact Lists and click the [Send test notification] button under ‘PagerDuty’ in Step 4a. We will notice that a new open incident was triggered in the PagerDuty Incidents dashboard. Click into the incident and look for the webhook button under the More menu. You may hover over to read the description we wrote in Step 3b.
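If you want to open a test incident without going through HetrixTools, PagerDuty's Events API v2 accepts a JSON event over HTTPS. The sketch below is an assumption-heavy illustration: it only applies when the service has an Events API v2 integration (which may not be the integration type chosen in Step 3a), and the function names build_trigger_event and send_event are hypothetical. It uses the standard library only.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 'trigger' event for the given integration key."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def send_event(event: dict) -> int:
    """POST the event to PagerDuty and return the HTTP status (network request)."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A call such as send_event(build_trigger_event("<integration key>", "example.com is down", "manual-test")) should then open an incident on the service, where the webhook button can be tried.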

Step 5a – CNAME Record Before and After

  • Production server IP: 123.123.123.123
  • Backup server IP: 123.123.123.111
  • CNAME: alias to Production server’s A record (before)
  • CNAME: alias to Backup server’s A record (after)

When I performed a ping test from Windows Command Prompt to example.com, it showed that example.com reached the IP 123.123.123.123. However, after I clicked the Failover button at PagerDuty, the ping test showed that example.com reached the IP 123.123.123.111, which is the IP of the backup server.

Cloudflare – After updating the CNAME record, example.com is now being served by the backup server.
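The ping test can also be done from Python. This is a minimal sketch using the two IP addresses listed above. The function names classify_ip and current_server are hypothetical, and the lookup goes through whatever DNS resolver the machine uses, so a cached answer may lag behind the change.

```python
import socket

PRODUCTION_IP = "123.123.123.123"
BACKUP_IP = "123.123.123.111"


def classify_ip(ip: str) -> str:
    """Map a resolved IP address to the server it belongs to."""
    if ip == PRODUCTION_IP:
        return "production"
    if ip == BACKUP_IP:
        return "backup"
    return "unknown"


def current_server(hostname: str) -> str:
    """Resolve the hostname (performs a DNS lookup) and classify the result."""
    return classify_ip(socket.gethostbyname(hostname))
```

Running current_server("example.com") before and after clicking the Failover button should change the result from "production" to "backup" once the new record has propagated.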

Conclusion

The primary goals of this tutorial are to learn Python (scripting, webhooks, Flask), try out the Cloudflare API, and get acquainted with PagerDuty. Python is a good choice for automation scripting and for browser testing and scraping with Selenium or Beautiful Soup. We implemented a Cloudflare API call by coding a Python webhook using Flask as a listener. PagerDuty is a popular incident management platform with many other features and built-in connectors to external monitoring hosts beyond simply creating a webhook. Monitoring and incident alerting, as well as the automation of manual response processes, help operators preserve the high availability of their servers.
