My stack is as follows:
- EventBridge fires a Glue job at a regular interval.
- Said Glue job runs Python scripts, which run as Step Functions.
- The output of these scripts is saved to S3.
How can I monitor this? Ideally, I would like AWS (presumably CloudWatch talking to SNS) to email me if the Glue job fails, but my definition of "fails" is so broad that I feel like I'm solving the wrong problem. Below is a list of possible failures, but I'm hoping for a single solution general enough to cover all of them. If I could always be alerted on a Glue error, and also make most failure cases surface as a Glue error, I would probably be happy.
- The Glue job simply doesn't fire.
- The Glue job fires, but errors.
- The Glue job fires, but the Step Functions do not.
- The Step Functions error.
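
To make the alerting half concrete, here is a rough sketch of what I have in mind, using EventBridge rules that forward failure events to an SNS topic rather than CloudWatch alarms. The function names, the rule name, and the ARNs passed in are hypothetical placeholders, and I am assuming the SNS topic already exists with an email subscription and a policy that lets EventBridge publish to it.

```python
import json


def glue_failure_pattern(job_name):
    """Event pattern matching unsuccessful terminal states of one Glue job."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": [job_name],
            "state": ["FAILED", "TIMEOUT", "STOPPED"],
        },
    }


def step_functions_failure_pattern(state_machine_arn):
    """Event pattern matching unsuccessful executions of one state machine."""
    return {
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            "stateMachineArn": [state_machine_arn],
            "status": ["FAILED", "TIMED_OUT", "ABORTED"],
        },
    }


def create_alert_rule(rule_name, pattern, sns_topic_arn):
    """Create an EventBridge rule for `pattern` that publishes to an SNS topic.

    Assumption: the topic's access policy already allows
    events.amazonaws.com to publish to it.
    """
    # Imported here so the pattern builders above work without boto3 installed.
    import boto3

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        # "email-alert" is an arbitrary target ID chosen for this sketch.
        Targets=[{"Id": "email-alert", "Arn": sns_topic_arn}],
    )
```

As far as I can tell, this would only cover the cases where something actually runs and then errors. A Glue job that never fires emits no state-change event at all, so the first case in the list would still go unnoticed.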