503 error on upload with S3 by Scaleway

Hi all,

A few days ago, all image upload and storage started failing. When trying to upload an image from the web interface or from an app, I get a 503 error.

I use S3-compatible storage with Scaleway.

Here’s the log for one of these errors:

May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] [paperclip] Trying to link /tmp/RackMultipart20210527-1536-vpljmu.png to /tmp/eceed3adac545fe86e611973fe25639a20210527-1536-1ongt5r.png
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] [paperclip] Trying to link /tmp/eceed3adac545fe86e611973fe25639a20210527-1536-1ongt5r.png to /tmp/5dae2fea364cdb5014e276d107144e2b20210527-1536-11iazlz.png
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] Command :: file -b --mime '/tmp/5dae2fea364cdb5014e276d107144e2b20210527-1536-11iazlz.png'
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] Command :: identify -format %m '/tmp/eceed3adac545fe86e611973fe25639a20210527-1536-1ongt5r.png[0]'
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] Command :: convert '/tmp/eceed3adac545fe86e611973fe25639a20210527-1536-1ongt5r.png[0]' -auto-orient -resize "552x290>" -quality 90 -strip +set modify-date +set create-date '/tmp/8e751a91c70e4a88f8e71a56a9f26e6820210527-1536-1gi1mmk'
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] Command :: convert '/tmp/8e751a91c70e4a88f8e71a56a9f26e6820210527-1536-1gi1mmk' RGB:-
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] [paperclip] Trying to link /tmp/8e751a91c70e4a88f8e71a56a9f26e6820210527-1536-1gi1mmk to /tmp/acfda3d41a46bab177fdc2fbcc2da30f20210527-1536-1ddo8jn
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] [paperclip] Trying to link /tmp/eceed3adac545fe86e611973fe25639a20210527-1536-1ongt5r.png to /tmp/5dae2fea364cdb5014e276d107144e2b20210527-1536-qhpg8i.png
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] Command :: file -b --mime '/tmp/5dae2fea364cdb5014e276d107144e2b20210527-1536-qhpg8i.png'
May 27 17:05:57 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] [paperclip] saving media_attachments/files/106/308/249/696/852/970/original/8ed3821cd616716f.png
May 27 17:06:02 boitam.eu bundle[1488]: [28ca1949-cba8-460c-a4d8-96afc3640181] method=POST path=/api/v2/media format=html controller=Api::V2::MediaController action=create status=503 duration=5751.91 view=0.81 db=10.03

Here’s my S3 conf

#S3 (optional)
S3_ENABLED=true
S3_BUCKET=boitam-eu-media
AWS_ACCESS_KEY_ID=xx
AWS_SECRET_ACCESS_KEY=xx
S3_PROTOCOL=https
S3_ALIAS_HOST=boitam-eu-media.s3.fr-par.scw.cloud

S3_HOSTNAME=s3.fr-par.scw.cloud
S3_ENDPOINT=https://s3.fr-par.scw.cloud
S3_REGION=fr-par

It was working really well for 6 months, until a few days ago. How can I troubleshoot this, and go to the root of the 503 error?

Thanks!

Assuming you are running on 3.4.0, this is almost surely caused by an error with connecting to S3.
You should be able to get some more information by commenting out the line reading rescue_from Mastodon::RaceConditionError, Seahorse::Client::NetworkingError, Stoplight::Error::RedLight, ActiveRecord::SerializationFailure, with: :service_unavailable in app/controllers/application_controller.rb, restarting mastodon-web and retrying an image upload.

Thank you for your answer!

Sadly it didn’t work. Do you know what changed with 3.4.0 on that front?

Please try to follow advice from @Claire to provide new, fresh logs - this is just to further investigate the problem and not a solution.

I’m getting the same error.

After commenting out Seahorse::Client::NetworkingError, it seems to work now.

Didn’t find anything interesting in log other than a few api/v1/media/ coming back with 206. I upgraded to 3.4.0 a few weeks ago and it was working fine until today. Didn’t do any config/code change or deployment.

This makes me think is it Scaleway changing their API response to be async? I don’t know the detail of Mastodon’s implementation, here’s my theory: If Scaleway was previously returning 200 after content upload complete, and now return 206 and let Mastodon do constant pulling, then if mastodon is treating 206 as network error and stop retrying, it could cause such behavior.

Thanks! I did follow the advice. And I said that sadly, it didn’t work. The logs didn’t have more info, here’s an example from right now:

Jun 02 09:36:51 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] [paperclip] Trying to link /tmp/RackMultipart20210602-15426-cduhbd.jpg to /tmp/ecc80200c61eac137779874e3a5ad41720210602-15426-1ehqhg2.jpg
Jun 02 09:36:51 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] [paperclip] Trying to link /tmp/ecc80200c61eac137779874e3a5ad41720210602-15426-1ehqhg2.jpg to /tmp/9a40e0c6fbb0477eb02f0017afeeb8d520210602-15426-erh90h.jpg
Jun 02 09:36:51 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] Command :: file -b --mime '/tmp/9a40e0c6fbb0477eb02f0017afeeb8d520210602-15426-erh90h.jpg'
Jun 02 09:36:51 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] Command :: identify -format %m '/tmp/ecc80200c61eac137779874e3a5ad41720210602-15426-1ehqhg2.jpg[0]'
Jun 02 09:36:51 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] Command :: convert '/tmp/ecc80200c61eac137779874e3a5ad41720210602-15426-1ehqhg2.jpg[0]' -auto-orient -resize "323x495>" -quality 90 -strip +set modify-date +set create-date '/tmp/4700119c86923337a78357263e3b3ba220210602-15426-19a5q35'
Jun 02 09:36:51 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] Command :: convert '/tmp/4700119c86923337a78357263e3b3ba220210602-15426-19a5q35' RGB:-
Jun 02 09:36:52 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] [paperclip] Trying to link /tmp/4700119c86923337a78357263e3b3ba220210602-15426-19a5q35 to /tmp/c622235fbd8500ca849c40ae8777a32b20210602-15426-i9dnm4
Jun 02 09:36:52 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] [paperclip] Trying to link /tmp/ecc80200c61eac137779874e3a5ad41720210602-15426-1ehqhg2.jpg to /tmp/9a40e0c6fbb0477eb02f0017afeeb8d520210602-15426-1b8tc5m.jpg
Jun 02 09:36:52 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] Command :: file -b --mime '/tmp/9a40e0c6fbb0477eb02f0017afeeb8d520210602-15426-1b8tc5m.jpg'
Jun 02 09:36:52 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] [paperclip] saving media_attachments/files/106/340/457/661/088/746/original/b3e1d03de5b3037e.jpg
Jun 02 09:36:57 boitam.eu bundle[18706]: [727bfbb6-9357-4dcc-9c51-33d81d94fcff] method=POST path=/api/v2/media format=html controller=Api::V2::MediaController action=create status=503 duration=6028.46 view=1.72 db=82.25

This makes me think is it Scaleway changing their API response to be async? I don’t know the detail of Mastodon’s implementation, here’s my theory: If Scaleway was previously returning 200 after content upload complete, and now return 206 and let Mastodon do constant pulling, then if mastodon is treating 206 as network error and stop retrying, it could cause such behavior.

Thanks! It looks like it’s possible… I’ve just opened a support ticket at Scaleway, to know if they have changed anything in the API lately (especially the response codes)

Scaleway Support may need logs that are more complete than what I get (I don’t now why commenting out Seahorse::Client::NetworkingError doesn’t work for me, there might be something in my config preventing more logging), would you mind sharing some of your logs with the 206 code?

There are other places where an exception could be caught and a 503 raised. For instance, the following in app/controllers/api/base_controller.rb:

  rescue_from Mastodon::RaceConditionError, Seahorse::Client::NetworkingError, Stoplight::Error::RedLight do
    render json: { error: 'There was a temporary problem serving your request, please try again' }, status: 503
  end

Please retry with commenting that one out.

Commenting the rescue_from above should not make that work any better, just output more info in some cases. I guess this was likely a transient issue with Scaleway.

1 Like

Checking my log again I noticed after commenting out (or maybe scaleway really recovered, I haven’t change the code back) NetworkingError, the only difference with your log is

Jun 02 18:23:52 mastodon-ubuntu-s-1vcpu-2gb-sfo2-01 bundle[12867]: [8ff1ddc8-ab3d-4f32-9d29-f1b18196ae0d] Command :: file -b --mime '/tmp/8162e0f621a69c056274fc23fda35fb420210602-12976-ebdm2i.png'
Jun 02 18:23:52 mastodon-ubuntu-s-1vcpu-2gb-sfo2-01 bundle[12867]: [8ff1ddc8-ab3d-4f32-9d29-f1b18196ae0d] [paperclip] saving media_attachments/files/106/342/529/903/382/841/original/52637aa8d5fc6b22.png
Jun 02 18:23:54 mastodon-ubuntu-s-1vcpu-2gb-sfo2-01 bundle[12867]: [8ff1ddc8-ab3d-4f32-9d29-f1b18196ae0d] [paperclip] saving media_attachments/files/106/342/529/903/382/841/small/52637aa8d5fc6b22.png
Jun 02 18:23:55 mastodon-ubuntu-s-1vcpu-2gb-sfo2-01 bundle[12867]: [8ff1ddc8-ab3d-4f32-9d29-f1b18196ae0d] method=POST path=/api/v2/media format=html controller=Api::V2::MediaController action=create status=202 duration=4024.51 view=2.72 db=10.52

Notice the second “saving … small”. It was not there when the issue presents. Also media controller returned 202 instead of 503. Seems like whatever the issue is it’s preventing thumbnail being generated correctly. And I do check the original media file even when it says upload failed, has always been uploaded to Scaleway successfully, while under same parent bucket the thumbnail in small didn’t get uploaded.

This makes me think instead of Scaleway changing their API response, maybe it’s their rate limiting policy changed? Like the original file got uploaded successfully, then the thumbnail upload request hit the rate limit and fail, then prevent mastodon from completing the request. Maybe mention this in support ticket would help them locate the cause.

BTW the one I commented out is in base_controller,not sure if you commented out somewhere else’s?

1 Like

@Claire I am looking at other reports here regarding S3-like interfaces. Do you think it would be worthwhile not to hide real exceptions in those cases? At least to log the full exception to the log?

Thank you @Claire!

Commenting out the lines in app/controllers/api/base_controller.rb caused an error 500 when trying to upload a file. Now it seems that there’s a timeout causing the error. I’ll check out my settings.

Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] [paperclip] Trying to link /tmp/RackMultipart20210607-23280-g2m4zf.jpg to /tmp/5b64d9070fe787819d37409b5444814220210607-23280-tk89ox.jpg
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] [paperclip] Trying to link /tmp/5b64d9070fe787819d37409b5444814220210607-23280-tk89ox.jpg to /tmp/fba55f4a3133ac26c24368e778db610d20210607-23280-1im3dhr.jpg
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] Command :: file -b --mime '/tmp/fba55f4a3133ac26c24368e778db610d20210607-23280-1im3dhr.jpg'
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] Command :: identify -format %m '/tmp/5b64d9070fe787819d37409b5444814220210607-23280-tk89ox.jpg[0]'
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] Command :: convert '/tmp/5b64d9070fe787819d37409b5444814220210607-23280-tk89ox.jpg[0]' -auto-orient -resize "321x498>" -quality 90 -strip +set modify-date +set create-date '/tmp/53ecd5d6fe4168bcb1b2c3f39d68163420210607-23280-kolzxt'
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] Command :: convert '/tmp/53ecd5d6fe4168bcb1b2c3f39d68163420210607-23280-kolzxt' RGB:-
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] [paperclip] Trying to link /tmp/53ecd5d6fe4168bcb1b2c3f39d68163420210607-23280-kolzxt to /tmp/da0b79b4b79d4d565a0f94b3ccaec53f20210607-23280-a81gym
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] [paperclip] Trying to link /tmp/5b64d9070fe787819d37409b5444814220210607-23280-tk89ox.jpg to /tmp/fba55f4a3133ac26c24368e778db610d20210607-23280-xifnin.jpg
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] Command :: file -b --mime '/tmp/fba55f4a3133ac26c24368e778db610d20210607-23280-xifnin.jpg'
Jun 07 16:44:37 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] [paperclip] saving media_attachments/files/106/370/451/216/320/086/original/f0e22d3ac0cb9a5e.jpg
Jun 07 16:44:42 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] method=POST path=/api/v2/media format=html controller=Api::V2::MediaController action=create status=500 error='Seahorse::Client::NetworkingError: Net::ReadTimeout with #<TCPSocket:(closed)>' duration=5800.27 view=0.00 db=26.89
Jun 07 16:44:42 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a]
Jun 07 16:44:42 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] Seahorse::Client::NetworkingError (Net::ReadTimeout with #<TCPSocket:(closed)>):
Jun 07 16:44:42 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a]
Jun 07 16:44:42 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] lib/paperclip/attachment_extensions.rb:55:in `block in save'
Jun 07 16:44:42 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] lib/paperclip/attachment_extensions.rb:61:in `save'
Jun 07 16:44:42 boitam.eu bundle[23198]: [e14eca1d-1c35-43f9-80aa-9ef7635f196a] app/controllers/api/v2/media_controller.rb:5:in `create'

Additional info: my settings are good, Mastodon can store some objects on the bucket. 15 days ago there were 469240 objects, now there’s 479030, nearly 10k more. I don’t know how I could check which images are stored (originals, thumbnails, others?)

New info! With the new error log showing the timeout, Scaleway support was able to diagnose the problem: it was apparently on their end, and they re-synchronized the metadata for my bucket. After this re-sync, the bucket now works well and there’s no upload error anymore. I asked them how I could diagnoze such a problem, and if I could troubleshoot it myself, but it’s all on their side, so the best solution is to open a support ticket for them to do their thing.

Many thanks to everyone!

1 Like

Good point. The current behavior has been originally introduced in Add stoplight for object storage failures, return HTTP 503 by Gargron · Pull Request #13043 · tootsuite/mastodon · GitHub

I think we should indeed add some logging there, and I’ll look into that.

1 Like