
AutoSys Alarms: The 1 Setting That Silences Critical Failures

📍 Part of: AutoSys → Topic 24 of 30
AutoSys alarms and notifications explained with real outage stories.
⚙️ Intermediate — basic DevOps knowledge assumed
In this tutorial, you'll learn
  • Set alarm_if_fail: 1 on all critical jobs — the default is 0 (no alarm). Run annual audits to ensure compliance.
  • Use notification_emailaddress for direct email alerts; include log file paths and exit code (%x) in notification_msg.
  • max_run_alarm and min_run_alarm provide bounds-based alerting for jobs running too long or suspiciously fast.
[Figure] Alarm Trigger Chain: from job failure to team notification. Job fails or exceeds runtime (alarm_if_fail or max_run_alarm) → alarm raised in the Event Server (alarm event recorded with timestamp) → notification email sent (to notification_emailaddress recipients) → WCC alarm list updated (visible in the Monitor dashboard) → team acknowledges the alarm (sendevent ALARM_ACK).
Quick Answer
  • AutoSys alarms raise alerts in Event Server on job failures, long-running jobs, or machine issues — alarm_if_fail: 1 enables failure alarms
  • Key components: alarm_if_fail (job failure), max_run_alarm (long runtime), min_run_alarm (suspiciously fast), alarm_if_terminated (killed jobs), alarm_on_missing (machine offline)
  • Performance impact: Alarms stored in Event Server DB; excessive alarms cause database bloat and UI slowdowns — aggregate, don't alert per instance
  • Production trap: alarm_if_fail: 0 is default — a silent failure overnight means no one knows until customers complain
  • Biggest mistake: Sending failure emails to unmonitored shared mailbox — alarms without response process are no alarms at all
🚨 START HERE

AutoSys Alarm Debug Cheat Sheet

Fast diagnostics for alarm issues in production AutoSys environments.
🟡

Job failed, no alarm — suspected missing alarm_if_fail

Immediate Action: Check the job definition for the alarm_if_fail attribute
Commands
autorep -J job_name -q | grep -i alarm_if_fail
echo 'Default is 0 (no alarm). Must set to 1.'
Fix Now: Update the job in JIL: `update_job: job_name alarm_if_fail: 1`, then load it with `jil < update.jil`.
🟡

Email notification not sent on failure

Immediate Action: Check notification_emailaddress and the SMTP configuration
Commands
autorep -J job_name -q | grep -i notification_email
autorep -M | grep -i mail
Fix Now: Add or fix notification_emailaddress. Test SMTP: `sendevent -E NOTIFY -M "test" -u user@company.com`. Check the event processor logs for mail errors.
🟡

max_run_alarm firing too often — false positives

Immediate Action: Check the job's runtime history and adjust the threshold
Commands
autorep -J job_name -r -t | tail -20 | grep -E 'Start|End|Run Time'
echo 'Set max_run_alarm to p99 runtime + 20% buffer'
Fix Now: Update the job: `update_job: job_name max_run_alarm: new_threshold`. For variable runtimes, consider term_run_time to kill hung jobs instead of only alarming.
🟡

Alarm acknowledged but same alarm reappears next day

Immediate Action: The root cause is not fixed; the alarm will fire again until there is a permanent fix
Commands
autorep -a -J job_name | grep -i alarm
echo 'Fix root cause, not just acknowledge'
Fix Now: Investigate why the job fails every day and implement a permanent fix. Use alarm_action to create a ticket and suppress new alarms until the ticket is closed.
🟡

Machine offline — alarm_on_missing not configured

Immediate Action: Check the machine definition for the alarm_on_missing attribute
Commands
autorep -M prod-server-01 | grep -i alarm
ping prod-server-01 && echo 'Machine up' || echo 'Machine down'
Fix Now: Update the machine definition in JIL: `update_machine: prod-server-01 alarm_on_missing: 1`, then load it with `jil < update_machine.jil`.
Production Incident

The Silent Friday Night Payroll Failure

A critical payroll job failed at 2am on Friday night. alarm_if_fail was not set (default 0). The team discovered the failure on Monday morning when employees asked why they hadn't been paid. The fix took 20 minutes. The outage cost 3 days of reputation damage.
Symptom: The payroll job ran every Friday at 2am. On this Friday, the script failed due to a database connection timeout. AutoSys logged the failure, marked the job status as FAILURE, and stopped. No alarm fired. No email was sent. The team didn't know until Monday morning, when the finance department asked why payroll hadn't run. The job could have been re-run in 20 minutes, but the team lost the entire weekend window.
Assumption: The team assumed that AutoSys automatically alerted on job failures; they didn't know that alarm_if_fail defaults to 0. They also assumed that because they had a dashboard, someone would notice, but no one checked the dashboard over the weekend. And they had no runbook for monitoring, auto-recovery, or fallback alerts.
Root cause: The job definition JIL file lacked alarm_if_fail: 1. It was omitted entirely, and the default is 0 (no alarm). The team had configured email notifications for successful completion but not for failures. The operations team monitored the dashboard only during business hours. The failure alert was never triggered, and the on-call engineer had no way of knowing about the failure. The job was critical but treated as non-critical in the alarm configuration because no one had reviewed the JIL defaults.
Fix:
  1. Updated the job definition with alarm_if_fail: 1.
  2. Added notification_emailaddress: payroll-ops@company.com and notification_msg_on_failure: "Job %s failed on %m at %t with exit code %x. Check log /logs/autosys/payroll_run.err".
  3. Added a separate max_run_alarm: 60 to detect hung jobs.
  4. Added weekend coverage to the on-call rotation with PagerDuty integration.
  5. Added a weekly audit script that lists all jobs with alarm_if_fail: 0 and flags them for review.
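A minimal JIL sketch of the corrected definition, combining fixes 1 through 3 above (the schedule, paths, and addresses are illustrative, following the incident details):

insert_job: payroll_run
job_type: CMD
command: /scripts/payroll.sh
machine: finance-server
owner: finuser
date_conditions: 1
days_of_week: fri
start_times: "02:00"
alarm_if_fail: 1              /* fix 1: alarm on failure */
max_run_alarm: 60             /* fix 3: alarm if still running after 60 minutes */
notification_emailaddress: payroll-ops@company.com
notification_msg_on_failure: "Job %s failed on %m at %t with exit code %x. Check log /logs/autosys/payroll_run.err"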
Key Lesson
  • alarm_if_fail: 0 is the default. You must explicitly set it to 1 on every critical job; do not assume AutoSys alerts on failure.
  • A failure without an alarm is a silent outage. Review all JIL files annually to ensure critical jobs have alarms enabled.
  • Dashboards are not alarms. If no one is looking at the dashboard when the failure occurs, it's not monitoring.
  • Document the on-call escalation process. The failure alert must reach a human, not just a log file.
Production Debug Guide

Symptom → Action mapping for common alarm failures in AutoSys environments.

Job failed but no alarm/email received — outage went undetected → Check whether alarm_if_fail: 1 is set. The default is 0 (no alarm). Modify JIL: update_job: job_name alarm_if_fail: 1. Also check notification_emailaddress and notification_msg.
Too many alarms — team stopped paying attention (alarm fatigue) → Reduce alarm volume. Only alarm on critical jobs. Use max_run_alarm and min_run_alarm for bounds alerts, but set thresholds high enough to avoid frequent firing. Consider aggregating non-critical failures into a daily summary report instead of real-time alerts.
Alarms acknowledged but root cause not fixed — repeating alarms → Implement alarm resolution tracking. After acknowledging, assign a ticket number and require an RCA. AutoSys can invoke a script on alarm (alarm_action) to create a ticket. Block new alarms for the same job until the ticket is resolved, or use different severity levels (see the handler sketch after this guide).
max_run_alarm triggers every night for a long-running job — false positives → The max_run_alarm threshold is too low. Increase it based on historical p99 runtime + 20%. For seasonal jobs, use conditional start conditions or multiple JILs with different thresholds.
Notification email not received — job failed but no alert → The SMTP configuration in AutoSys may be misconfigured. Check autorep -M for mailer status. Also check whether notification_emailaddress contains spaces or invalid characters. Test with sendevent -E ALARM_TEST.
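As referenced above, a sketch of a deduplicating alarm handler. This is not a built-in AutoSys feature: the script, its argument, and create_ticket are hypothetical placeholders for whatever alarm hook and ticketing CLI your site uses.

#!/usr/bin/env bash
# Hypothetical alarm handler: one ticket per failing job, duplicates
# suppressed until the ticket file is cleared on resolution.
set -euo pipefail

JOB="$1"                                  # job name passed in by your alarm hook
TICKET_DIR=/var/run/autosys-alarm-tickets
TICKET_FILE="$TICKET_DIR/$JOB"
mkdir -p "$TICKET_DIR"

if [ -f "$TICKET_FILE" ]; then
    # An open ticket already exists for this job: suppress the duplicate alert.
    echo "Open ticket $(cat "$TICKET_FILE") exists for $JOB; suppressing alert."
    exit 0
fi

# create_ticket is a placeholder for your ticketing CLI/API.
TICKET_ID=$(create_ticket --summary "AutoSys alarm: $JOB failed" --queue batch-ops)
echo "$TICKET_ID" > "$TICKET_FILE"        # delete this file when the ticket closes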

AutoSys has a built-in alarm system that lets you define exactly what events should trigger alerts and who should be notified. Without alarms, your batch jobs could silently fail overnight and nobody would know until users start reporting missing reports in the morning. With well-configured alarms, your team knows within minutes.

But alarms are dangerous. Set them too broadly and your team ignores them (alarm fatigue). Send them to the wrong mailbox and nobody reads them. And the default alarm_if_fail: 0 means your critically important job can fail every night at 2am without anyone ever hearing about it.

By the end you'll know how to set up job failure alarms, email notifications, machine monitoring, and runtime bounds alerts. You'll also know the specific mistakes that cause alarms to be ignored or to never fire at all.

alarm_if_fail — the basic failure alert

The simplest alarm is alarm_if_fail. Set it to 1 on any job, and AutoSys raises an alarm in the Event Server when that job fails. You can then configure alarm actions to send email, page, or invoke a script.

io/thecodeforge/autosys/alarm_basic.jil · BASH
insert_job: critical_eod_job
job_type: CMD
command: /scripts/critical_eod.sh
machine: prod-server-01
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "22:00"
alarm_if_fail: 1          /* raise alarm if job fails */
max_run_alarm: 60         /* also alarm if still running after 60 minutes */
min_run_alarm: 5          /* alarm if completes in less than 5 minutes (suspicious) */
alarm_if_terminated: 1    /* alarm if job is killed/terminated */
📊 Production Insight
alarm_if_fail: 0 is the default. Many legacy jobs were defined without it and have been failing silently for years.
Run an audit over the JIL dump from autorep -J % -q. Note that piping it through grep -B5 alarm_if_fail | grep -v 'alarm_if_fail: 1' only surfaces jobs where the attribute is present but not 1; jobs that omit the attribute entirely never match the first grep, so scan per job (see the sketch below).
Rule: Every production job should have alarm_if_fail: 1 unless explicitly documented as non-critical.
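A sketch of a more thorough audit, assuming autorep -J % -q dumps each job's JIL beginning with an insert_job: line (verify the output format in your environment). Unlike the grep pipeline, it also flags jobs that omit the attribute entirely:

#!/usr/bin/env bash
# List every job whose JIL dump does not contain "alarm_if_fail: 1".
autorep -J % -q | awk '
    /^insert_job:/ { if (job != "" && !ok) print job; job = $2; ok = 0 }
    /alarm_if_fail: 1/ { ok = 1 }
    END { if (job != "" && !ok) print job }
' > jobs_without_failure_alarm.txt

wc -l < jobs_without_failure_alarm.txt    # how many jobs need review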
🎯 Key Takeaway
Set alarm_if_fail: 1 on all critical jobs — the default is 0 (no alarm). Run annual audits to ensure compliance.
Use max_run_alarm and min_run_alarm for runtime bounds alerts; min_run_alarm catches jobs that end too fast (possible logic error).
Rule: A failure without an alarm is a silent outage. Configure alarms before the first production run.
Alarm Configuration Decision Tree
If: Job is critical (financial, customer-facing, compliance)
Use: Set alarm_if_fail: 1, max_run_alarm: p99 runtime + 20%, and notification_emailaddress pointing to the on-call group.
If: Job is non-critical but should be monitored
Use: Set alarm_if_fail: 1 but send email to the team mailbox (not a pager). Review failures daily.
If: Job runtime varies significantly (data-dependent)
Use: Set max_run_alarm high (e.g., 4x median runtime) to avoid false positives. Use term_run_time if you want to kill hung jobs instead.
If: Job is part of a box that handles retries
Use: Alarm only if the box fails, not individual jobs. Set alarm_if_fail: 1 on the box and 0 on child jobs to reduce noise (see the JIL sketch after this list).
If: Job is expected to fail occasionally (e.g., file not found, retried later)
Use: Do NOT set alarm_if_fail on the job. Alarm at the parent workflow level only after retries are exhausted.
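A minimal JIL sketch of the box-level alarm pattern from the decision tree (job and box names are hypothetical):

insert_job: nightly_etl_box
job_type: BOX
owner: batchuser
alarm_if_fail: 1              /* one alarm for the whole workflow */

insert_job: extract_step
job_type: CMD
box_name: nightly_etl_box
command: /scripts/extract.sh
machine: prod-server-01
owner: batchuser
alarm_if_fail: 0              /* child failures roll up to the box */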

Notification attributes — email on failure

For direct email notification without configuring a separate alarm action, use the notification attributes. These send an email when the job fails.

io/thecodeforge/autosys/notifications.jil · BASH
insert_job: payroll_run
job_type: CMD
command: /scripts/payroll.sh
machine: finance-server
owner: finuser
date_conditions: 1
days_of_week: fri
start_times: "18:00"
alarm_if_fail: 1
/* Email notification on failure */
notification_emailaddress: batch-ops@company.com,payroll-lead@company.com
notification_emailaddress_on_success: payroll-mgr@company.com
notification_msg: "ALERT: AutoSys job %s FAILED on %m at %t. Exit code %x. Check log: /logs/autosys/payroll_run.err"
notification_msg_on_success: "INFO: Payroll run %s completed successfully at %t"
🔥 Notification message variables
AutoSys supports variables in notification_msg: %s = job name, %m = machine name, %t = timestamp, %x = exit code. Use these to make your alert emails informative enough that the on-call engineer knows what failed and where to look.
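For example, if payroll_run (from the JIL above) exits with code 1 on finance-server, the notification_msg template would render roughly as follows (the exact timestamp format depends on your AutoSys version):

ALERT: AutoSys job payroll_run FAILED on finance-server at 2025-01-17 18:42:10. Exit code 1. Check log: /logs/autosys/payroll_run.err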
📊 Production Insight
notification_msg_on_failure should always include the log file path and exit code. Without these, the on-call engineer has to log into AutoSys, find the job, find the machine, then find the log.
The %x (exit code) variable is often omitted but it's the single most useful piece of information for triage.
Rule: Include %s, %m, %x, and the full path to the job's log file in every failure notification. The engineer should not have to look anything up.
🎯 Key Takeaway
Use notification_emailaddress for direct email alerts; include log file paths and exit code (%x) in notification_msg.
Sending to a shared mailbox nobody monitors defeats the purpose — alarms need a response process.
Rule: notification_emailaddress should point to a pager or SMS gateway for critical jobs, not just an internal mailbox.

Machine and system alarms

Beyond job-level alarms, AutoSys can alarm on machine events — when an agent goes MISSING or when the Event Processor has issues.

io/thecodeforge/autosys/machine_alarms.jil · BASH
/* Configure machine-level alarms */
update_machine: prod-server-01
max_load: 100
alarm_on_missing: 1    /* alarm when agent goes offline */

/* View active alarms (shell commands, not JIL):
     autorep -a                show all active alarms
     sendevent -E ALARM_ACK    acknowledge an alarm
*/
📊 Production Insight
A machine that goes MISSING is worse than a job failure — it affects all jobs on that machine. Yet many sites don't monitor it.
alarm_on_missing: 1 should be set on every production machine. When a machine goes offline, all running jobs on it fail immediately.
Rule: Configure machine alarms before deploying new agents. Add to standard machine template: update_machine: new_host alarm_on_missing: 1.
🎯 Key Takeaway
Set alarm_on_missing: 1 on all production machines — a missing agent takes down all jobs on that host.
Use autorep -a to view active alarms; ALARM_ACK acknowledges without fixing root cause.
Rule: Alarms without an acknowledgement process are just noise. Assign ownership and track resolution.
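A short shell sketch of a review-and-acknowledge pass using the commands quoted in this article (verify the flags against your AutoSys version), with a simple audit trail so acknowledgements are traceable; the log path is illustrative:

# Review active alarms, then acknowledge one job's alarm with an audit note.
autorep -a                                        # list active alarms
echo "$(date): ALARM_ACK critical_eod_job by $USER" >> /var/log/autosys/alarm_ack_audit.log
sendevent -E ALARM_ACK -J critical_eod_job        # acknowledge that job's alarm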
🗂 AutoSys Alarm Types
Choose alarm based on what event requires attention — failure, runtime bounds, or machine health.
| Alarm Type | Attribute | Triggers When | Default | Best For |
|---|---|---|---|---|
| Job failure alarm | alarm_if_fail: 1 | Job exits with a non-zero code | 0 (disabled) | All production jobs |
| Long run alarm | max_run_alarm: N | Job still running after N minutes | 0 (disabled) | Jobs that can hang (file waits, network calls) |
| Short run alarm | min_run_alarm: N | Job completes in less than N minutes | 0 (disabled) | Jobs with an expected minimum runtime (data loads) |
| Termination alarm | alarm_if_terminated: 1 | Job is killed (KILLJOB or term_run_time) | 0 (disabled) | Jobs that should never be killed manually |
| Machine offline alarm | alarm_on_missing: 1 | Agent machine stops responding | 0 (disabled) | All machines hosting critical jobs |

🎯 Key Takeaways

  • Set alarm_if_fail: 1 on all critical jobs — the default is 0 (no alarm). Run annual audits to ensure compliance.
  • Use notification_emailaddress for direct email alerts; include log file paths and exit code (%x) in notification_msg.
  • max_run_alarm and min_run_alarm provide bounds-based alerting for jobs running too long or suspiciously fast.
  • Set alarm_on_missing: 1 on all production machines — a missing agent takes down all jobs on that host.
  • Alarms need a response process — sending to a shared mailbox nobody monitors defeats the purpose.

⚠ Common Mistakes to Avoid

    Not setting alarm_if_fail: 1 on critical jobs — expecting default to be 1
    Symptom

    Job fails silently over weekend. No alarm, no email. Team discovers failure when users complain Monday morning. Outage goes undetected for 48+ hours.

    Fix

Update all production jobs: update_job: job_name alarm_if_fail: 1. Run a quarterly audit of the JIL dump from autorep -J % -q; note that grepping for alarm_if_fail lines misses jobs that omit the attribute entirely, so scan per job (see the audit sketch in the alarm_if_fail section).

    Not including %x (exit code) and log path in notification_msg
    Symptom

    On-call engineer receives alert 'Job payroll_run failed' but has no idea why. They must log into AutoSys, find the job, find the machine, then grep for the log file. Triage takes 20 minutes instead of 2.

    Fix

    Add variables: notification_msg: "Job %s failed on %m at %t with exit code %x. Log: /logs/autosys/%s.err". Include full absolute path to the log file.

    Sending alarms to unmonitored shared mailbox
    Symptom

    Alarms sent to batch-ops@company.com. The mailbox has 10,000 unread messages. No one notices new alarms. Failures go undetected.

    Fix

    Send critical alarms to pager/SMS gateway or ticketing system. For non-critical, send to team channel with expectation of daily review. Never send to a mailbox that is not actively monitored.

    Setting max_run_alarm too low — false positives every night
    Symptom

    max_run_alarm: 30 minutes. Job normally takes 25 minutes but occasionally takes 35 minutes due to data volume. Alarm fires every night. Team ignores alarm. Real hung job goes unnoticed.

    Fix

    Set max_run_alarm to p99 runtime + 20% based on historical data. Use autorep -J job_name -r -t to see runtime history. For seasonal jobs, use multiple JILs with different thresholds or conditional start times.

    Acknowledging alarm without fixing root cause
    Symptom

The same alarm fires every day for the same job. The team acknowledges it daily but never investigates. It becomes noise, and eventually a real alarm gets missed.

    Fix

    Implement alarm resolution tracking. Require ticket number and root cause analysis for each acknowledged alarm. Block new alarms for same job until ticket is closed. Use alarm_action to create ticket automatically.

Interview Questions on This Topic

  • Q (Mid-level): How do you configure AutoSys to send an email when a job fails?
    Two methods: (1) Set alarm_if_fail: 1 and configure notification_emailaddress and notification_msg. AutoSys sends email to the specified addresses when the job fails. The notification_msg can include variables %s (job name), %m (machine), %t (timestamp), %x (exit code). (2) Use alarm_action to call a custom script that sends email, pages, or creates a ticket. The alarm_action script receives the alarm details as arguments. The notification approach is simpler; alarm_action is more flexible for integration with ticketing systems or PagerDuty.
  • Q (Junior): What does max_run_alarm do?
    max_run_alarm specifies a runtime threshold in minutes. If the job is still running after that many minutes, AutoSys raises an alarm. It does NOT kill the job (that's term_run_time). It just alerts the team that the job is taking longer than expected. This is useful for detecting hung jobs or jobs that have gotten stuck on a file wait or network call. The threshold should be set based on historical runtime p99 plus a buffer (e.g., 20%). Setting it too low causes false positives and alarm fatigue. Setting it too high delays detection of real hung jobs.
  • Q (Junior): What variables can you use in AutoSys notification_msg?
    AutoSys supports: %s = job name, %m = machine name, %t = timestamp (when the event occurred), %x = exit code of the job. Example: notification_msg: "Job %s failed on %m at %t with exit code %x. Log: /logs/autosys/%s.err". The %s variable is especially useful for constructing log file paths. The %x variable is critical for triage — it tells the on-call engineer why the script failed (e.g., exit code 2 = file not found, exit code 3 = database connection error).
  • Q (Junior): What does alarm_if_fail: 0 mean (the default)?
    alarm_if_fail: 0 means that when the job fails, AutoSys does NOT raise an alarm in the Event Server. The job status becomes FAILURE, but no alert is triggered. This is the default setting. Many teams forget to set it to 1, leading to silent failures. Any job that is critical to production must have alarm_if_fail: 1 explicitly configured. Security and compliance audits often require proof that all critical jobs have alarms enabled.
  • Q (Junior): How do you acknowledge an AutoSys alarm?
    Two methods: (1) Through the Workload Control Center (WCC) interface — navigate to the alarm, select 'Acknowledge'. (2) Using command line: sendevent -E ALARM_ACK -J job_name where job_name is the job that caused the alarm. Acknowledging an alarm removes it from the active alarm list but does not fix the underlying issue. The alarm will reappear if the job fails again on the next run unless the root cause is fixed. Many teams also use sendevent -E ALARM_ACK -A to acknowledge all alarms (not recommended — leads to alarm fatigue).

Frequently Asked Questions

How do I get notified when an AutoSys job fails?

Set `alarm_if_fail: 1` and add `notification_emailaddress: your-team@company.com` to the job definition. Include a `notification_msg` with the log file path and the %x exit code so on-call engineers know where to look.

What is max_run_alarm in AutoSys?

max_run_alarm specifies a runtime threshold in minutes. If the job is still running after that many minutes, AutoSys raises an alarm. It doesn't kill the job (that's term_run_time), it just alerts the team that the job is taking longer than expected.
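A minimal JIL sketch pairing the two, with illustrative thresholds (alert at 60 minutes, kill at 120; the job name and paths are hypothetical):

insert_job: nightly_extract
job_type: CMD
command: /scripts/extract.sh
machine: prod-server-01
owner: batchuser
alarm_if_fail: 1
max_run_alarm: 60         /* alarm: still running after 60 minutes */
term_run_time: 120        /* terminate: kill the job after 120 minutes */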

What are the notification message variables in AutoSys?

AutoSys supports: %s (job name), %m (machine name), %t (timestamp), %x (exit code). Use these in notification_msg and notification_msg_on_success to make alert emails immediately informative.

What is the default value of alarm_if_fail?

The default is 0, which means no alarm is raised on failure. You must explicitly set alarm_if_fail: 1 on jobs where you want failure alerts. Many teams make this a required attribute in their job definition standards.

How do I acknowledge an AutoSys alarm?

Use sendevent -E ALARM_ACK or acknowledge through the WCC interface. Unacknowledged alarms accumulate in the alarm list. Establishing an alarm acknowledgement process is important for keeping the alarm list meaningful.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
