High number of Sidekiq retries, toots not appearing


#1

Hi there,

I’m the admin over at mmorpg.social.

I’m experiencing a problem with toots from other instances not appearing in my federated timeline. I’m getting a high number of retries and failures in Sidekiq (over 80,000 failures in 3 days).

When I look at the detailed Sidekiq logs, I get groups of entries similar to the one below:

sidekiq_1    | 2018-12-16T20:50:26.753Z 6 TID-gnzb91ika ActivityPub::ProcessingWorker JID-3090432b2f9b673dd6e4f0ca INFO: start
sidekiq_1    | 2018-12-16T20:50:27.799Z 6 TID-gnzb91ika ActivityPub::ProcessingWorker JID-3090432b2f9b673dd6e4f0ca INFO: fail: 1.046 sec
sidekiq_1    | 2018-12-16T20:50:27.800Z 6 TID-gnzb91ika WARN: {"context":"Job raised exception","job":{"class":"ActivityPub::ProcessingWorker","args":[9,"{\"type\": \"Announce\", \"to\": [\"https://relay.mastodon.host/actor/followers\"], \"object\": \"https://mastodon.host/users/anexcursion/statuses/101252513258197361\", \"actor\": \"https://relay.mastodon.host/actor\", \"id\": \"https://relay.mastodon.host/activities/283a62e7-75a4-4bca-8bdb-c435b0dfaf20\", \"@context\": \"https://www.w3.org/ns/activitystreams\"}",null],"retry":true,"queue":"default","backtrace":true,"jid":"3090432b2f9b673dd6e4f0ca","created_at":1544990747.2852695,"enqueued_at":1544993426.7524924,"error_message":"failed to connect: No address for mastodon.host on https://mastodon.host/users/anexcursion/statuses/101252513258197361","error_class":"HTTP::ConnectionError","failed_at":1544990748.3137655,"retry_count":6,"error_backtrace":["/mastodon/app/lib/request.rb:183:in `open'"],"retried_at":1544992064.727663},"jobstr":"{\"class\":\"ActivityPub::ProcessingWorker\",\"args\":[9,\"{\\\"type\\\": \\\"Announce\\\", \\\"to\\\": [\\\"https://relay.mastodon.host/actor/followers\\\"], \\\"object\\\": \\\"https://mastodon.host/users/anexcursion/statuses/101252513258197361\\\", \\\"actor\\\": \\\"https://relay.mastodon.host/actor\\\", \\\"id\\\": \\\"https://relay.mastodon.host/activities/283a62e7-75a4-4bca-8bdb-c435b0dfaf20\\\", \\\"@context\\\": \\\"https://www.w3.org/ns/activitystreams\\\"}\",null],\"retry\":true,\"queue\":\"default\",\"backtrace\":true,\"jid\":\"3090432b2f9b673dd6e4f0ca\",\"created_at\":1544990747.2852695,\"enqueued_at\":1544993426.7524924,\"error_message\":\"failed to connect: No address for mastodon.host on https://mastodon.host/users/anexcursion/statuses/101252513258197361\",\"error_class\":\"HTTP::ConnectionError\",\"failed_at\":1544990748.3137655,\"retry_count\":6,\"error_backtrace\":[\"/mastodon/app/lib/request.rb:183:in `open'\"],\"retried_at\":1544992064.727663}"}
sidekiq_1    | 2018-12-16T20:50:27.800Z 6 TID-gnzb91ika WARN: HTTP::ConnectionError: failed to connect: No address for mastodon.host on https://mastodon.host/users/anexcursion/statuses/101252513258197361
sidekiq_1    | 2018-12-16T20:50:27.800Z 6 TID-gnzb91ika WARN: /mastodon/app/lib/request.rb:183:in `open'
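
To see which remote hosts account for most of the failures, entries like the ones above can be tallied with a short script. This is only a sketch in Python (not part of Mastodon or Sidekiq): it assumes each "Job raised exception" entry is a single line whose JSON payload follows `WARN: `, as in the excerpt, and the function name `failing_hosts` is made up for illustration.

```python
import json
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the JSON payload that follows "WARN: " on a "Job raised exception" line.
WARN_JSON = re.compile(r"WARN: (\{.*\})\s*$")
# Pulls the URL out of messages like "... No address for host on https://host/path".
URL_IN_MESSAGE = re.compile(r"https?://\S+")


def failing_hosts(lines):
    """Count (error_class, host) pairs across Sidekiq WARN lines."""
    counts = Counter()
    for line in lines:
        match = WARN_JSON.search(line)
        if not match:
            continue  # INFO lines and plain-text WARN lines carry no JSON payload
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # truncated or wrapped log line
        job = payload.get("job") or {}
        url = URL_IN_MESSAGE.search(job.get("error_message") or "")
        host = urlparse(url.group(0)).hostname if url else "(unknown)"
        counts[(job.get("error_class") or "(unknown)", host)] += 1
    return counts
```

Feeding it the lines of a saved log and calling `.most_common(20)` on the result gives the worst offenders first. If nearly all failures point at a handful of hosts, those instances are probably down; if they are spread across hosts that resolve fine by hand, the problem is more likely local.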

Any clue what might be the issue here? I’ve done some curl checks from the server and the hostnames all resolve over both IPv4 and IPv6, so I’m a little lost.

Thanks in advance!


#2


Tell me about it…

It’s a waste of server CPU time, and it means instances that are actually working are not getting content.


#3

I think I’m getting close to the issue.

Turns out, trying to run a federated Mastodon instance on a single-CPU VPS is a no-no. Even though CPU utilisation barely went above 30%, tasks would time out waiting for DNS resolution. As soon as I resized to a two-CPU VPS, the problem vanished.

This makes me think there might be a timeout issue on some of the task steps (like DNS lookups), which can result in a queue building up as jobs get re-queued. I’ll have to do some digging to find it in the code, but it feels like some tuning might help.
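
A rough way to check the DNS theory from the server itself is to time name lookups against a budget while Sidekiq is busy. The sketch below is in Python and does not reproduce Mastodon's own resolver or timeout values; `timed_lookup`, its `resolver` parameter, and the 1-second default are assumptions made for illustration.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def timed_lookup(host, timeout=1.0, resolver=socket.getaddrinfo):
    """Resolve host; return (addresses, seconds), with addresses None on timeout or failure."""
    started = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=1)
    # Run the lookup in a worker thread so the wait can be bounded.
    future = pool.submit(resolver, host, 443, 0, socket.SOCK_STREAM)
    try:
        infos = future.result(timeout=timeout)
        # Each entry's sockaddr starts with the IP address (IPv4 or IPv6).
        addresses = sorted({info[4][0] for info in infos})
    except (FutureTimeout, OSError):
        # socket.gaierror is an OSError, so failed lookups land here too.
        addresses = None
    finally:
        pool.shutdown(wait=False)  # do not block on a lookup that is still hanging
    return addresses, time.monotonic() - started
```

Calling `timed_lookup("mastodon.host")` in a loop during a busy period and logging the elapsed seconds would show whether lookups slow down or fail only under load, which is what the single-CPU behaviour suggests.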


#4

Hey Gazimoff, I’m having the exact same issue. Did you ever figure out what was going on?


#5

Unfortunately I didn’t get to the bottom of it, but moving to a multi-core virtual machine fixed it for my load of 150 users. I’ve put it on the back burner for now (I still have a few retries, but they’re limited to certain instances as @humblr pointed out.)


#6

150 users online at a time, or just 150 registered users?