Aurora MySQLのCPUが急に100％に張り付いた

こんばんわ、hisayukiです。

世の中によくある、”何もしていないのに急に◯◯になった！！”シリーズ

今回はAurora MySQLのWRITE側のインスタンスが急に上限に張り付いたので、何が問題だったかを虱潰しに調べていったことを残します。

CloudWatchを見て気づく
いろいろ試してみる
まとめ

CloudWatchを見て気づく

まずはこれ。

時間表示は日本時間で3/11の早朝3時〜5時にかけて。
急激なスピードでCPU使用率を増やしていき、98％以上のところで今日まで張り付いてる

いろいろ試してみる

MySQLのプロセスチェック

こちらのサイトを参考にまずはプロセスチェックするためにWork用EC2インスタンス立てて、内容チェック

[ec2-user@ip-192-168-4-35 ~]$ mysql -u ******* -p******************************** -h *******-0.*******.ap-northeast-1.rds.amazonaws.com
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MySQL connection id is 45681
Server version: 5.7.12 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.       

MySQL [(none)]> show full processlist;
+-------+----------+----------------------+---------+---------+------+----------------------+-----------------------+
| Id    | User     | Host                 | db      | Command | Time | State                | Info                  |
+-------+----------+----------------------+---------+---------+------+----------------------+-----------------------+
| 40383 | rdsadmin | localhost            | NULL    | Sleep   |    0 | delayed send ok done | NULL                  |
| 40384 | rdsadmin | localhost            | NULL    | Sleep   |    0 | delayed send ok done | NULL                  |
| 40398 | rdsadmin | localhost            | NULL    | Sleep   |   85 | delayed send ok done | NULL                  |
| 45437 | *******  | 192.168.4.35:51848   | NULL    | Sleep   | 3814 | cleaned up           | NULL                  |
| 45564 | *******  | 192.168.4.35:51858   | NULL    | Query   |    0 | starting             | show full processlist |
| 45580 | *******  | 192.168.29.254:42102 | ******* | Sleep   |  134 | delayed send ok done | NULL                  |
| 45621 | *******  | 192.168.29.254:42228 | ******* | Sleep   |  763 | delayed send ok done | NULL                  |
| 45631 | *******  | 192.168.29.254:42252 | ******* | Sleep   |  605 | delayed send ok done | NULL                  |
| 45632 | *******  | 192.168.29.254:42256 | ******* | Sleep   |  587 | delayed send ok done | NULL                  |
| 45635 | *******  | 192.168.29.254:42260 | ******* | Sleep   |  577 | delayed send ok done | NULL                  |
| 45643 | *******  | 192.168.29.254:42288 | ******* | Sleep   |  453 | delayed send ok done | NULL                  |
| 45647 | *******  | 192.168.29.254:42298 | ******* | Sleep   |  383 | delayed send ok done | NULL                  |
| 45651 | *******  | 192.168.29.254:42310 | ******* | Sleep   |  304 | delayed send ok done | NULL                  |
| 45652 | *******  | 192.168.29.254:42314 | ******* | Sleep   |  287 | delayed send ok done | NULL                  |
| 45658 | *******  | 192.168.29.254:42328 | ******* | Sleep   |  221 | delayed send ok done | NULL                  |
| 45668 | rdsadmin | localhost            | NULL    | Sleep   |    5 | cleaned up           | NULL                  |
+-------+----------+----------------------+---------+---------+------+----------------------+-----------------------+
16 rows in set (0.01 sec)

うん、わからん

少なくともQueryでCPUが食らい付くされてるわけではないようだ。

拡張モニタリングをON

この設定はAuroraのクラスター側ではなくインスタンス側で設定。

これをやってなかったのが結構痛い・・・
terraformの修正を後回しにして、とりあえずONにする。

こうしておくと最小1秒単位で拡張モニタリングをするのと、CloudWatchLogsにログを投げてくれる。
まぁ・・・とりあえずは1秒単位じゃなくていいので60秒単位で吐き出させる。

CloudWatchLogsへログを投げる

これもやってなかった・・・
こっちはクラスター側で設定すれば、インスタンス側にも適用される。
何が原因なのかわからなかったのでとりあえずすべて吐き出させてみた。

それでも監査ログ、エラーログ、スロークエリログにはなんにも出ておらず・・・

唯一の手がかりが拡張モニタリングをONにしたことで、CloudWatchLogsに飛んでいたログ。
ここでようやく手がかりっぽいものを見つける。

{
    "engine": "Aurora",
    "instanceID": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "instanceResourceID": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "instanceResourceID": "db-N4ALQP6KLVEH6FPKTYJNJ3YYRU",
    "timestamp": "2019-03-14T08:01:55Z",
    "version": 1,
    "uptime": "8 days, 15:03:48",
    "numVCPUs": 1,
    "cpuUtilization": {
        "guest": 0,
        "irq": 0.05,
        "system": 4.15,
        "wait": 0,
        "idle": 0,
        "user": 7.25,
        "total": 100,
        "steal": 1.77,
        "nice": 86.78
    },

・・・

    "tasks": {
        "sleeping": 264,
        "zombie": 0,
        "running": 7,
        "stopped": 0,
        "total": 271,
        "blocked": 0
    },

・・・

    "processList": [
        {
            "vss": 1369080,
            "name": "aurora",
            "tgid": 7842,
            "vmlimit": "unlimited",
            "parentID": 1,
            "memoryUsedPc": 15.26,
            "cpuUsedPc": 0,
            "id": 7842,
            "rss": 313164
        },
        {
            "vss": 719444,
            "name": "OS processes",
            "tgid": 0,
            "vmlimit": "",
            "parentID": 0,
            "memoryUsedPc": 1.6,
            "cpuUsedPc": 0.17,
            "id": 0,
            "rss": 33448
        },
        {
            "vss": 5063628,
            "name": "RDS processes",
            "tgid": 0,
            "vmlimit": "",
            "parentID": 0,
            "memoryUsedPc": 25.5,
            "cpuUsedPc": 90.81,
            "id": 0,
            "rss": 523576
        }
    ]
}

どうやらcpuUtilizationのnice項目がCPUを食いつぶしている模様。
リソースモニタリングの項目はAWSのドキュメントにて