Sidekiq keeps restarting

Hi,

An instance I am running is experiencing recurring issues with Sidekiq: it keeps restarting, and users end up seeing 502 errors.

Jun 15 07:43:36 rage-mast01 systemd[1]: mastodon-sidekiq.service: Main process exited, code=killed, status=9/KILL
Jun 15 07:43:36 rage-mast01 systemd[1]: mastodon-sidekiq.service: Failed with result 'signal'.
Jun 15 07:43:36 rage-mast01 systemd[1]: mastodon-sidekiq.service: Service RestartSec=100ms expired, scheduling restart.
Jun 15 07:43:36 rage-mast01 systemd[1]: mastodon-sidekiq.service: Scheduled restart job, restart counter is at 584.
Jun 15 07:43:36 rage-mast01 systemd[1]: Stopped mastodon-sidekiq.
Jun 15 07:43:36 rage-mast01 systemd[1]: Started mastodon-sidekiq.

It happens way too often, and I can’t find the root cause. The instance is running inside an LXC container with 4GiB of RAM, 1GiB of swap and 3 dedicated CPU cores. Memory usage is as follows:

  • Whole container:
# free -h
              total        used        free      shared  buff/cache   available
Mem:          4.0Gi       847Mi        14Mi       3.1Gi       3.2Gi       3.2Gi
  • Mastodon services:
# systemctl status mastodon-*.service
* mastodon-streaming.service - mastodon-streaming
   Loaded: loaded (/etc/systemd/system/mastodon-streaming.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-06-08 16:23:40 UTC; 6 days ago
 Main PID: 84 (node)
    Tasks: 18 (limit: 4915)
   Memory: 85.4M
   CGroup: /system.slice/mastodon-streaming.service
           |- 84 /usr/bin/node ./streaming
           `-139 /usr/bin/node /home/mastodon/live/streaming

Jun 15 07:49:04 rage-mast01 node[84]: verb Unsubscribe timeline:69651
Jun 15 07:49:09 rage-mast01 node[84]: verb Unsubscribe timeline:access_token:3662
Jun 15 07:49:09 rage-mast01 node[84]: verb Unsubscribe timeline:68010
Jun 15 07:49:10 rage-mast01 node[84]: verb Subscribe timeline:access_token:3662
Jun 15 07:49:10 rage-mast01 node[84]: verb a9cd7159-120d-4b40-ac56-31fe08919ee5 Starting stream from timeline:68010 for 68010
Jun 15 07:49:10 rage-mast01 node[84]: verb Subscribe timeline:68010
Jun 15 07:53:07 rage-mast01 node[84]: verb Subscribe timeline:access_token:1409
Jun 15 07:53:07 rage-mast01 node[84]: verb aaeff1d9-eae2-4c14-b3d9-cdadb88f52f2 Starting stream from timeline:70073 for 70073
Jun 15 07:53:07 rage-mast01 node[84]: verb Subscribe timeline:70073
Jun 15 07:53:07 rage-mast01 node[84]: verb aaeff1d9-eae2-4c14-b3d9-cdadb88f52f2 Starting stream from timeline:public:local for 70073

* mastodon-sidekiq.service - mastodon-sidekiq
   Loaded: loaded (/etc/systemd/system/mastodon-sidekiq.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-06-15 07:47:27 UTC; 6min ago
 Main PID: 15915 (bundle)
    Tasks: 39 (limit: 4915)
   Memory: 269.9M
   CGroup: /system.slice/mastodon-sidekiq.service
           `-15915 sidekiq 6.2.1 live [0 of 30 busy]

Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.204Z pid=15915 tid=2p4n class=ActivityPub::ProcessingWorker jid=07b1d8205c8304577
Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.233Z pid=15915 tid=2i8j class=ThreadResolveWorker jid=ad7658a0a72d7663cd285688 IN
Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.235Z pid=15915 tid=2ogf class=DistributionWorker jid=1a2a36f80c36629df0288816 INF
Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.236Z pid=15915 tid=2p4n class=ActivityPub::ProcessingWorker jid=07b1d8205c8304577
Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.241Z pid=15915 tid=2p7z class=FeedInsertWorker jid=90c1f5eecb5694ed8411a39f INFO:
Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.250Z pid=15915 tid=2p7z class=FeedInsertWorker jid=90c1f5eecb5694ed8411a39f elaps
Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.254Z pid=15915 tid=2ogf class=DistributionWorker jid=1a2a36f80c36629df0288816 ela
Jun 15 07:54:09 rage-mast01 bundle[15915]: 2021-06-15T07:54:09.343Z pid=15915 tid=2i8j class=ThreadResolveWorker jid=ad7658a0a72d7663cd285688 el
Jun 15 07:54:16 rage-mast01 bundle[15915]: 2021-06-15T07:54:16.328Z pid=15915 tid=2pej class=LinkCrawlWorker jid=bb19b35804567970a13b8951 INFO: 
Jun 15 07:54:16 rage-mast01 bundle[15915]: 2021-06-15T07:54:16.330Z pid=15915 tid=2pej class=LinkCrawlWorker jid=bb19b35804567970a13b8951 elapse

* mastodon-web.service - mastodon-web
   Loaded: loaded (/etc/systemd/system/mastodon-web.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-06-08 16:23:40 UTC; 6 days ago
 Main PID: 90 (bundle)
    Tasks: 29 (limit: 4915)
   Memory: 444.3M
   CGroup: /system.slice/mastodon-web.service
           |-   90 puma 5.3.0 (tcp://127.0.0.1:3000) [live]
           |-15983 puma: cluster worker 1: 90 [live]
           `-16062 puma: cluster worker 0: 90 [live]
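
As a side note, nearly all of the buff/cache reported by free above is shared memory (3.1Gi, i.e. shmem, which is typically tmpfs-backed). In case that turns out to be relevant, the tmpfs mounts inside the container can be listed with something like:

# df -h -t tmpfs
# ls -lah /dev/shm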

Looking at these numbers, I don’t think computing resources are the bottleneck, but I have no clue where to look next.

Thanks in advance for your help.
Best,

Anything in the operating system logs? Out-of-memory killer in action?

Also, is there anything more in journalctl -u mastodon-sidekiq that would explain why it got killed?
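
Something along these lines should surface both the kernel's OOM messages and the unit's own log around the kill time (the dates are just placeholders, and kernel messages may only be visible on the LXC host rather than inside the container):

# journalctl -k --since "2021-06-15 07:00" | grep -iE "oom|killed process"
# journalctl -u mastodon-sidekiq --since "2021-06-15 07:00"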

Yes, as mentioned in the #mastodon channel on the Libera.chat IRC server, I do indeed get an OOM error:

Jun 15 11:24:20 host01 kernel: [579664.337316] CPU: 13 PID: 6109 Comm: nginx Tainted: P           O      5.4.114-1-pve #1
Jun 15 11:24:20 host01 kernel: [579664.337316] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4 R2.0, BIOS H4.84 04/08/2021
Jun 15 11:24:20 host01 kernel: [579664.337317] Call Trace:
Jun 15 11:24:20 host01 kernel: [579664.337322]  dump_stack+0x6d/0x8b
Jun 15 11:24:20 host01 kernel: [579664.337324]  dump_header+0x4f/0x1e1
Jun 15 11:24:20 host01 kernel: [579664.337325]  oom_kill_process.cold.33+0xb/0x10
Jun 15 11:24:20 host01 kernel: [579664.337327]  out_of_memory+0x1ad/0x490
Jun 15 11:24:20 host01 kernel: [579664.337329]  mem_cgroup_out_of_memory+0xc4/0xd0
Jun 15 11:24:20 host01 kernel: [579664.337330]  try_charge+0x76b/0x7e0
Jun 15 11:24:20 host01 kernel: [579664.337332]  mem_cgroup_try_charge+0x71/0x190
Jun 15 11:24:20 host01 kernel: [579664.337333]  __add_to_page_cache_locked+0x265/0x340
Jun 15 11:24:20 host01 kernel: [579664.337335]  ? scan_shadow_nodes+0x30/0x30
Jun 15 11:24:20 host01 kernel: [579664.337335]  add_to_page_cache_lru+0x4f/0xd0
Jun 15 11:24:20 host01 kernel: [579664.337336]  pagecache_get_page+0xff/0x2e0
Jun 15 11:24:20 host01 kernel: [579664.337369]  ? zpl_readpage+0x9f/0xe0 [zfs]
Jun 15 11:24:20 host01 kernel: [579664.337370]  filemap_fault+0x887/0xa70
Jun 15 11:24:20 host01 kernel: [579664.337371]  ? page_add_file_rmap+0x131/0x190
Jun 15 11:24:20 host01 kernel: [579664.337372]  ? alloc_set_pte+0x104/0x5c0
Jun 15 11:24:20 host01 kernel: [579664.337373]  ? xas_load+0xc/0x80
Jun 15 11:24:20 host01 kernel: [579664.337374]  ? xas_find+0x17e/0x1b0
Jun 15 11:24:20 host01 kernel: [579664.337374]  ? filemap_map_pages+0x28d/0x3b0
Jun 15 11:24:20 host01 kernel: [579664.337375]  __do_fault+0x3c/0x130
Jun 15 11:24:20 host01 kernel: [579664.337376]  __handle_mm_fault+0xe73/0x1290
Jun 15 11:24:20 host01 kernel: [579664.337378]  ? __release_sock+0x99/0xd0
Jun 15 11:24:20 host01 kernel: [579664.337379]  handle_mm_fault+0xc9/0x1f0
Jun 15 11:24:20 host01 kernel: [579664.337380]  __do_page_fault+0x233/0x4c0
Jun 15 11:24:20 host01 kernel: [579664.337381]  do_page_fault+0x2c/0xe0
Jun 15 11:24:20 host01 kernel: [579664.337383]  page_fault+0x34/0x40
Jun 15 11:24:20 host01 kernel: [579664.337384] RIP: 0033:0x55d80fd27c20
Jun 15 11:24:20 host01 kernel: [579664.337387] Code: Bad RIP value.
Jun 15 11:24:20 host01 kernel: [579664.337387] RSP: 002b:00007fffde46d5b8 EFLAGS: 00010246
Jun 15 11:24:20 host01 kernel: [579664.337388] RAX: 00000000228ddad0 RBX: 00000000228ddad0 RCX: 000055d810634ac0
Jun 15 11:24:20 host01 kernel: [579664.337389] RDX: 000055d8105cdac8 RSI: 000055d810634ae8 RDI: 000055d80fe50580
Jun 15 11:24:20 host01 kernel: [579664.337389] RBP: 000055d810634ac0 R08: 0000000000000004 R09: 0000000000000000
Jun 15 11:24:20 host01 kernel: [579664.337389] R10: 00007fffde46d52c R11: 0000000000000246 R12: 000055d80fe50580
Jun 15 11:24:20 host01 kernel: [579664.337390] R13: 000055d810634ae8 R14: fffffffffffffffe R15: 0000000000000000
Jun 15 11:24:20 host01 kernel: [579664.337391] memory: usage 4194300kB, limit 4194304kB, failcnt 341879857
Jun 15 11:24:20 host01 kernel: [579664.337392] memory+swap: usage 4194300kB, limit 5242880kB, failcnt 0
Jun 15 11:24:20 host01 kernel: [579664.337392] kmem: usage 53752kB, limit 9007199254740988kB, failcnt 0
Jun 15 11:24:20 host01 kernel: [579664.337392] Memory cgroup stats for /lxc/300:
Jun 15 11:24:20 host01 kernel: [579664.337399] anon 951582720
Jun 15 11:24:20 host01 kernel: [579664.337399] file 3290206208
Jun 15 11:24:20 host01 kernel: [579664.337399] kernel_stack 1806336
Jun 15 11:24:20 host01 kernel: [579664.337399] slab 37367808
Jun 15 11:24:20 host01 kernel: [579664.337399] sock 0
Jun 15 11:24:20 host01 kernel: [579664.337399] shmem 3287863296
Jun 15 11:24:20 host01 kernel: [579664.337399] file_mapped 20680704
Jun 15 11:24:20 host01 kernel: [579664.337399] file_dirty 0
Jun 15 11:24:20 host01 kernel: [579664.337399] file_writeback 0
Jun 15 11:24:20 host01 kernel: [579664.337399] anon_thp 0
Jun 15 11:24:20 host01 kernel: [579664.337399] inactive_anon 1248546816
Jun 15 11:24:20 host01 kernel: [579664.337399] active_anon 2988969984
Jun 15 11:24:20 host01 kernel: [579664.337399] inactive_file 884736
Jun 15 11:24:20 host01 kernel: [579664.337399] active_file 0
Jun 15 11:24:20 host01 kernel: [579664.337399] unevictable 0
Jun 15 11:24:20 host01 kernel: [579664.337399] slab_reclaimable 17420288
Jun 15 11:24:20 host01 kernel: [579664.337399] slab_unreclaimable 19947520
Jun 15 11:24:20 host01 kernel: [579664.337399] pgfault 1252834525
Jun 15 11:24:20 host01 kernel: [579664.337399] pgmajfault 466640438
Jun 15 11:24:20 host01 kernel: [579664.337399] workingset_refault 460626177
Jun 15 11:24:20 host01 kernel: [579664.337399] workingset_activate 43318272
Jun 15 11:24:20 host01 kernel: [579664.337399] workingset_nodereclaim 30822
Jun 15 11:24:20 host01 kernel: [579664.337399] pgrefill 628059313
Jun 15 11:24:20 host01 kernel: [579664.337399] pgscan 1597972947
Jun 15 11:24:20 host01 kernel: [579664.337399] pgsteal 465181570
Jun 15 11:24:20 host01 kernel: [579664.337399] pgactivate 474427854
Jun 15 11:24:20 host01 kernel: [579664.337400] Tasks state (memory values in pages):
Jun 15 11:24:20 host01 kernel: [579664.337400] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun 15 11:24:20 host01 kernel: [579664.337401] [   5636] 100000  5636    42353      551    98304        0             0 systemd
Jun 15 11:24:20 host01 kernel: [579664.337403] [   6066] 100000  6066     1730      130    57344        0             0 login
Jun 15 11:24:20 host01 kernel: [579664.337404] [  31999] 100000 31999     1538      582    49152        0             0 bash
Jun 15 11:24:20 host01 kernel: [579664.337405] [   6069] 100000  6069      604       27    40960        0             0 agetty
Jun 15 11:24:20 host01 kernel: [579664.337406] [   6031] 100106  6031     2242      132    53248        0             0 dbus-daemon
Jun 15 11:24:20 host01 kernel: [579664.337407] [   6033] 101000  6033   144330     3329  1298432        0             0 node
Jun 15 11:24:20 host01 kernel: [579664.337408] [   7434] 101000  7434   181595    16524  2768896        0             0 node
Jun 15 11:24:20 host01 kernel: [579664.337409] [   6098] 100100  6098    27792     8810   212992        0             0 redis-server
Jun 15 11:24:20 host01 kernel: [579664.337410] [   6036] 100000  6036    39046      449    77824        0             0 rsyslogd
Jun 15 11:24:20 host01 kernel: [579664.337411] [   6038] 100000  6038     4826      230    77824        0             0 systemd-logind
Jun 15 11:24:20 host01 kernel: [579664.337412] [   6100] 100000  6100     2821      233    57344        0             0 nginx
Jun 15 11:24:20 host01 kernel: [579664.337413] [   6104] 100107  6104     2938      409    57344        0             0 nginx
Jun 15 11:24:20 host01 kernel: [579664.337414] [   6107] 100107  6107     2905      403    57344        0             0 nginx
Jun 15 11:24:20 host01 kernel: [579664.337414] [   6109] 100107  6109     3201      699    57344        0             0 nginx
Jun 15 11:24:20 host01 kernel: [579664.337415] [   6040] 101000  6040   106580    55963   753664        0             0 bundle
Jun 15 11:24:20 host01 kernel: [579664.337416] [  14740] 101000 14740   141461    66347  1011712        0             0 bundle
Jun 15 11:24:20 host01 kernel: [579664.337417] [  16405] 101000 16405   137877    65512   995328        0             0 bundle
Jun 15 11:24:20 host01 kernel: [579664.337418] [   6044] 100000  6044     1377       51    53248        0             0 cron
Jun 15 11:24:20 host01 kernel: [579664.337419] [   6101] 100000  6101     3962      208    65536        0             0 sshd
Jun 15 11:24:20 host01 kernel: [579664.337420] [   6058] 100000  6058      604       27    45056        0             0 agetty
Jun 15 11:24:20 host01 kernel: [579664.337421] [   6415] 100000  6415    15077     7389   180224        0             0 systemd-journal
Jun 15 11:24:20 host01 kernel: [579664.337422] [  15465] 101000 15465   173849    74259  1302528        0             0 bundle
Jun 15 11:24:20 host01 kernel: [579664.337423] [  21593] 101000 21593      596       17    40960        0             0 sh
Jun 15 11:24:20 host01 kernel: [579664.337424] [  21594] 101000 21594    21010     5332   159744        0             0 convert
Jun 15 11:24:20 host01 kernel: [579664.337425] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0,oom_memcg=/lxc/300,task_memcg=/lxc/300/ns/system.slice/mastodon-sidekiq.service,task=bundle,pid=15465,uid=101000
Jun 15 11:24:20 host01 kernel: [579664.346878] oom_reaper: reaped process 15465 (bundle), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Still, it seems very strange to me because the resource allocation looks fine. Maybe it has something to do with bad stats reporting from Linux cgroups…
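
To check that theory, the cgroup's own accounting can be read directly on the host. Assuming the cgroup v1 layout and the container ID from the trace above (/lxc/300; paths and field names differ under cgroup v2), something like:

# cat /sys/fs/cgroup/memory/lxc/300/memory.usage_in_bytes
# head -n 30 /sys/fs/cgroup/memory/lxc/300/memory.stat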

Tough one… One question is whether Sidekiq is guilty of anything at all: the OOM killer does not necessarily kill the process that caused the condition, I think it just uses some heuristics. Any chance of setting up some ongoing monitoring of process memory usage to see which one runs away?
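
For example, something crude like this one-liner (log path and interval are arbitrary) records the biggest memory consumers once a minute and should show which process balloons just before the kill; run it in tmux/screen or background it:

# while true; do date; ps -eo pid,rss,vsz,comm --sort=-rss | head -n 10; sleep 60; done >> /root/memwatch.log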

No experience with LXC here, so I can’t tell - but if cgroups were at fault, there would have been a storm of Docker users seeing things like this…

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.