Server randomly crashes when mastodon-sidekiq service is running


#1

Hi.

I’m the admin of the lou.lt instance and I’m currently moving all my activities to a new server.

My old server was running Ubuntu Server 16.04, and Mastodon was set up on it using an old version of the Production Guide. Everything ran mostly well, apart from some issues with updates (I haven’t been able to update for a little while).

I used the Migration Guide to move everything to the new machine and had no issues at the time.

The new server

  • Ubuntu 18.04
  • Node JS v8.10.0

Also, I’m using VMware ESXi on my new server and everything is running inside VMs. I made something like this:

Almost everything seemed to run smoothly at first; I only had to deal with a small issue with the uws node module. The mastodon-streaming service initially refused to start, and I got “Error: Compilation of µWebSockets has failed and there is no pre-compiled binary available for your system. Please install a supported C++11 compiler and reinstall the module 'uws'.” in the logs.

After installing gcc-8 and running node-gyp rebuild by hand in the uws module directory, the service seems to be able to run correctly.

But this is where the real issues start

When I start the three services mastodon-web, mastodon-sidekiq and mastodon-streaming, the whole VM crashes at random after anywhere from a few minutes to 24 hours (most of the time, in under 5 minutes).

And when I say the whole VM, I mean it. The VM won’t respond to ping and is unresponsive in the ESXi admin panel. Memory and CPU usage show nothing unusual before the crash and stay flat from the moment the VM crashes.

I tried to investigate the issue but the only thing I could see is a bunch of “^@” characters in the syslog file at the moment of the crash.

https://discourse.joinmastodon.org/uploads/joinmastodon/original/1X/2d4b84522c5e9de78a90c1cfc2884ced1df5fd8f.png
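For anyone who wants to locate the crash point in the log without scrolling through it, here is a minimal sketch that finds runs of NUL bytes (shown as “^@” by most editors) and returns the last log line written before each run. The function names and the 16-byte threshold are illustrative assumptions, not part of any Mastodon or Ubuntu tooling.

```python
def find_nul_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for each run of NUL bytes at least min_len long.

    min_len is an arbitrary threshold chosen for this sketch.
    """
    runs = []
    i = 0
    n = len(data)
    while i < n:
        if data[i] == 0:
            j = i
            # Walk to the end of this run of NUL bytes.
            while j < n and data[j] == 0:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


def last_line_before(data: bytes, offset: int) -> str:
    """Return the last text line that ends before the given byte offset."""
    head = data[:offset].rstrip(b"\n")
    return head.rsplit(b"\n", 1)[-1].decode("utf-8", errors="replace")
```

To use it, read the log as bytes (for example `open("/var/log/syslog", "rb").read()`, which usually needs root or membership in the `adm` group on Ubuntu), call `find_nul_runs` on the result, and pass each returned offset to `last_line_before` to see what was logged right before the crash.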

I also sometimes get ERR! c9b13871-2815-40ac-928c-ffe15f26cb8a Error: Missing access token in the mastodon-streaming logs. This may be an issue, but it seems unrelated to the VM crash to me.

I ran some tests and got the following results:

  • When the three mastodon services are down, the VM is stable.
  • When only the mastodon-web service is running, the VM is stable.
  • When mastodon-web and mastodon-sidekiq are running, the VM crashes.
  • When mastodon-web and mastodon-streaming are running, the VM seems to be stable (but this is only based on a 1-hour test that is still running).

I think the issue is with the mastodon-sidekiq service, but nothing in the logs looks wrong to me.

I don’t know what to check or what to do next. I ran a lot of other tests that I did not describe here, and I can provide any log or test result you need to help me.


#2

I think you have a serious issue with how your operating system works on ESX. You didn’t mention what the new OS is. Even if Mastodon is eating up all the memory or whatever, it shouldn’t crash like this.


#3

Are you talking about the host OS or the guest OS?

Answer to both:

The host OS is VMware ESXi, which is not an application installed on top of an OS: it ships its own OS components and is installed directly as the host OS.

The guest OS is Ubuntu 18.04.

I want to add that I have many other apps (webapps, game servers, etc.) installed on other VMs and none of them have issues.


#4

Thanks, I’ve heard of ESX before. I would check the Ubuntu VGA console for any clues. I guess your kernel dies and prints something to the console.


#5

The VGA console just goes unresponsive; no message is displayed.

After rebooting, I got this in the syslog file:

The “^@” garbage marks the moment of the crash. Everything before it seems normal and everything after it is the reboot.


#6

This can have many causes. I am sorry, but you need to get more information from the Linux kernel. Setting up a serial console for the kernel might give you something to work on.


#7

Thanks a lot for your help @saper :) I didn’t know I would get more kernel logs by using the serial console.

Indeed, the serial console gave something to work on. Here it is:

[  109.057554] kernel BUG at /build/linux-5s7Xkn/linux-4.15.0/drivers/net/vmxnet3/vmxnet3_drv.c:1413!
[  109.058615] invalid opcode: 0000 [#1] SMP PTI
[  109.059227] Modules linked in: vmw_balloon coretemp joydev intel_rapl_perf input_leds serio_raw shpchp mac_hid vmw_vsock_vmci_transport vsock vmw_vmci sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul mptspi ghash_clmulni_intel mptscsih pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd vmwgfx mptbase scsi_transport_spi ttm psmouse drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops vmxnet3 ahci drm libahci i2c_piix4 pata_acpi
[  109.066852] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-20-generic #21-Ubuntu
[  109.067752] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[  109.069001] RIP: 0010:vmxnet3_rq_rx_complete+0x7dd/0xe50 [vmxnet3]
[  109.069728] RSP: 0018:ffff9de7ffc83e20 EFLAGS: 00010297
[  109.070397] RAX: 0000000000000001 RBX: ffff9de7f1b95100 RCX: ffff9de7f2526f00
[  109.071296] RDX: 000000000000000b RSI: 0000000000000040 RDI: 0000000000000003
[  109.072134] RBP: ffff9de7ffc83e88 R08: 0000000000000028 R09: 0000000000000000
[  109.072972] R10: 0000000000000000 R11: ffff9de7eb0108c0 R12: 0000000000000031
[  109.073809] R13: ffff9de7eb011600 R14: ffff9de7f2145310 R15: ffff9de7f1bb4498
[  109.074745] FS:  0000000000000000(0000) GS:ffff9de7ffc80000(0000) knlGS:0000000000000000
[  109.075695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  109.076375] CR2: 00007f5a888b1d08 CR3: 000000012d80a001 CR4: 00000000003606e0
[  109.077252] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  109.078173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  109.079053] Call Trace:
[  109.079352]  <IRQ>
[  109.079607]  vmxnet3_poll_rx_only+0x36/0xa0 [vmxnet3]
[  109.080206]  net_rx_action+0x140/0x3a0
[  109.080673]  __do_softirq+0xdf/0x2b2
[  109.081108]  irq_exit+0xb6/0xc0
[  109.081490]  do_IRQ+0x82/0xd0
[  109.081915]  common_interrupt+0x84/0x84
[  109.082776]  </IRQ>
[  109.083381] RIP: 0010:native_safe_halt+0x6/0x10
[  109.084265] RSP: 0018:ffffb03700cbfe80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd2
[  109.085536] RAX: ffffffff8f594f00 RBX: 0000000000000001 RCX: 0000000000000000
[  109.086810] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  109.087994] RBP: ffffb03700cbfe80 R08: 0000000000000002 R09: 0000000000000001
[  109.089152] R10: ffffb03700cbfe20 R11: 0000000000000077 R12: 0000000000000001
[  109.090416] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  109.091561]  ? __cpuidle_text_start+0x8/0x8
[  109.092361]  default_idle+0x20/0x100
[  109.093115]  arch_cpu_idle+0x15/0x20
[  109.093923]  default_idle_call+0x23/0x30
[  109.094661]  do_idle+0x172/0x1f0
[  109.095307]  cpu_startup_entry+0x73/0x80
[  109.096028]  start_secondary+0x1a6/0x200
[  109.096737]  secondary_startup_64+0xa5/0xb0
[  109.097586] Code: a0 44 88 4d a8 4c 89 55 b0 e8 d0 da 32 cf 44 0f b6 4d a8 4c 8b 5d a0 4c 8b 55 b0 49 c7 85 48 01 00 00 00 00 00 00 e9 70 f9 ff ff <0f> 0b 0f 0b 0f 0b 49 8d 7d 20 48 89 ce e8 d1 22 33 cf 4c 8b 5d 
[  109.100515] RIP: vmxnet3_rq_rx_complete+0x7dd/0xe50 [vmxnet3] RSP: ffff9de7ffc83e20
[  109.101804] ---[ end trace 19418819978b2802 ]---
[  109.102728] Kernel panic - not syncing: Fatal exception in interrupt
[  109.103916] Kernel Offset: 0xdc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  109.106587] ---[ end Kernel panic - not syncing: Fatal exception in interrupt 
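Since a trace like this is long, here is a rough sketch of how the useful parts (source file and line of the BUG, faulting function and module, panic reason) could be pulled out of a captured serial console log. The regular expressions are assumptions based only on the line formats visible in the trace above, and `summarize_oops` is a made-up helper name; other kernel versions may format these lines differently.

```python
import re

# Patterns inferred from the trace above; not an exhaustive oops parser.
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
RIP_RE = re.compile(
    r"RIP: (?:\d+:)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.+)")


def summarize_oops(log_text: str) -> dict:
    """Collect the first BUG location, first RIP and first panic reason."""
    summary = {}
    for line in log_text.splitlines():
        bug = BUG_RE.search(line)
        if bug and "source_file" not in summary:
            summary["source_file"] = bug.group("file")
            summary["source_line"] = int(bug.group("line"))
        rip = RIP_RE.search(line)
        # Only the first RIP is kept: later ones describe the interrupted context.
        if rip and "function" not in summary:
            summary["function"] = rip.group("func")
            summary["module"] = rip.group("module")  # None for core kernel code
        panic = PANIC_RE.search(line)
        if panic and "panic" not in summary:
            summary["panic"] = panic.group("reason").strip()
    return summary
```

Run against the trace above, this should point at `vmxnet3_rq_rx_complete` in the `vmxnet3` module, which is what led me to look at the virtual network adapter rather than at Mastodon itself.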

I’ve done some research on vmxnet3 driver errors and found this. I tried the workaround described and it seems to work better (I prefer not to celebrate until I’m sure we won’t get a new kernel panic in a few hours).

What I did:

  1. Power off the virtual machine.
  2. Edit the vmx file of the VM by adding this line at the end:
    vmxnet3.rev.30 = FALSE
  3. Power the virtual machine back on.
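Step 2 can also be scripted. The following is only a sketch, assuming the VM is already powered off and that the .vmx file is reachable from wherever the script runs; the function names and the `.bak` backup suffix are my own choices, not anything VMware provides. It appends the line only if the key is not already set, so running it twice does no harm.

```python
from pathlib import Path

SETTING_KEY = "vmxnet3.rev.30"
SETTING_LINE = "vmxnet3.rev.30 = FALSE"


def with_vmxnet3_workaround(vmx_text: str) -> str:
    """Return vmx_text with the workaround line appended, unless the key is already set."""
    for line in vmx_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key.lower() == SETTING_KEY:
            return vmx_text  # key already present: leave the file untouched
    if vmx_text and not vmx_text.endswith("\n"):
        vmx_text += "\n"
    return vmx_text + SETTING_LINE + "\n"


def apply_to_file(vmx_path: str) -> None:
    """Back up the .vmx file, then write the updated contents if anything changed."""
    path = Path(vmx_path)
    original = path.read_text()
    updated = with_vmxnet3_workaround(original)
    if updated != original:
        # Keep a copy next to the original (hypothetical ".bak" naming).
        path.with_name(path.name + ".bak").write_text(original)
        path.write_text(updated)
```

Calling `apply_to_file` with the path to the VM’s .vmx file (the path depends on your datastore layout) performs the edit; power the VM back on afterwards as in step 3.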

This has been running for half an hour now and is still running well. I will close this topic tomorrow if it is still good.


#8

I wonder if upgrading your ESX helps too. Good you got the instance back!


#9

I tried upgrading, but it says it’s already up to date.