I'm inheriting this old Django project hosted on an EC2 instance. It used to run on Heroku and used a Proximo proxy in front of gunicorn. Now it just runs a systemd script with the following:
ExecStart=/bin/sh -c "bin/proximo gunicorn myApp.wsgi --config config/gunicorn.cfg"
Now, for the most part, this system seems to be stable, but every now and then, we'll get 503 errors on the site, even when the instance is duplicated and behind a load balancer.
The only related errors I can find are in the Apache error log, around the time the site goes down:
[Wed Dec 27 18:01:20.487596 2023] [proxy:error] [pid 5912:tid 139805448980224] (2)No such file or directory: AH02454: HTTP: attempt to connect to Unix domain socket /var/run/rpc/xmlrpc.sock (localhost) failed
[Wed Dec 27 18:01:20.487636 2023] [proxy_http:error] [pid 5912:tid 139805448980224] [client 172.31.25.252:41564] AH01114: HTTP: failed to make connection to backend: httpd-UDS
So, naturally, I thought this was something with the proxy and simply removed bin/proximo before the gunicorn in the script. However, the site has gone down multiple times since then, and we can see a constantly unhealthy target in the load balancer. I guess it's trying to restart over and over? It eventually resolves itself and the site comes back online, but this is really a head scratcher to diagnose. I've tried logging all requests to CloudWatch to see if anything stands out, but nothing does. It looks like business as usual and there are no 500 errors in the Django layer of the application. It all seems to happen at the load balancer level.
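To check whether the Apache proxy errors actually line up with the windows where the load balancer reports the target as unhealthy, something like the following could pull the timestamp and AH code out of each error-log line. This is only a minimal sketch: the function name parse_apache_error and the regular expressions are my own assumptions, based solely on the two log lines quoted above, and other log formats may not match.

```python
import re
from datetime import datetime

# Assumed layout, taken from the two quoted lines:
# [timestamp] [module:level] [pid N:tid N] free text containing an AHnnnnn code
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<module>[^:\]]+):(?P<level>[^\]]+)\] "
    r"\[pid (?P<pid>\d+):tid (?P<tid>\d+)\] (?P<rest>.*)$"
)
CODE_RE = re.compile(r"\b(AH\d{5})\b")


def parse_apache_error(line):
    """Return time, module, level and AH code for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    code = CODE_RE.search(match.group("rest"))
    return {
        # Timestamp format as it appears in the quoted log lines.
        "time": datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S.%f %Y"),
        "module": match.group("module"),
        "level": match.group("level"),
        "code": code.group(1) if code else None,
    }
```

Running this over the error log and comparing the resulting times against the load balancer's health-check history should at least show whether the xmlrpc.sock errors coincide with the outages or are unrelated noise.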
One idea I did have was to install XMLRPC on the server. This seems to be a PHP library and I'm not sure what the consequences of installing it would be, but it's one idea I have, although it doesn't really explain why the error is occurring.
One side note, which I'm not sure is important: there was an old Procfile used for the Heroku instance that still had the proximo command to run the server. I recently removed it just to rule it out, but I don't understand how a Procfile would cause something like this on our EC2 instance, as I don't think anything is pointing to it.
Anyway, if anyone out there has any ideas and can clue me in on something, please share! I have no concrete answers for the client other than that we're monitoring requests and the server. This IS an old Django instance (1.8, I believe), so things are just holding on, and I can't really migrate to a newer Django + AWS environment without basically rewriting huge parts of the application (which is currently in progress).
Thanks for reading.