Beginner 3 min · March 19, 2026

AutoSys Job Status Codes Explained

AutoSys Status Codes — PEND_MACH Disk Space Trap

Q: What does ACTIVATED mean in AutoSys?

ACTIVATED (AC) means the job's parent BOX is in RUNNING state, but the job itself hasn't started yet because its own conditions (start_times, condition attribute) haven't been met. It's a normal state — the job is queued inside a running box. Do NOT force-start a job in ACTIVATED unless the parent box has been running far longer than expected.

Q: What is the difference between ON_HOLD and ON_ICE?

- ON_HOLD: When you release a job from ON_HOLD (OFF_HOLD), it will run if its starting conditions are currently satisfied.
- ON_ICE: When you take a job off ICE (OFF_ICE), it will NOT run immediately; it waits until its conditions reoccur in the next scheduling cycle. ON_ICE is a stronger suspension.

**Example**: A midnight job put ON_ICE at 2 PM and released at 3 PM runs at midnight, not 3 PM. The same job put ON_HOLD at 2 PM and released at 3 PM would run immediately at 3 PM (if conditions are met).

Q: How do I restart a failed AutoSys job?

First, investigate why it failed:
- `autorep -J jobname -q | grep std_err_file` to find the error log
- `cat /path/to/error/file` to see the error
- `autorep -J jobname -r -1` to see the previous run (`-L` sets the box print level; it does not list past runs)

Fix the underlying issue (code bug, missing file, permission). Then:
- `sendevent -E FORCE_STARTJOB -J jobname`

Do NOT restart without investigating; you'll likely just see it fail again immediately.

Q: What does TERMINATED status mean in AutoSys?

TERMINATED (TE) means the job was killed. Common causes:
- `term_run_time` exceeded (AutoSys killed it after max runtime)
- Manual `KILLJOB` event sent by operator
- Agent machine went down while job was running

Check `autorep -J jobname -q | grep term_run_time` for the configured limit and `autorep -J jobname -d | grep KILLJOB` for a kill event in the detail report. If term_run_time was too low, increase it. If it was a manual kill, understand why the job needed killing. After fixing, use `STARTJOB` or `FORCE_STARTJOB` to rerun.

Q: What does QUE_WAIT mean and how is it different from PEND_MACH?

QUE_WAIT (QU) means the machine is reachable but currently too busy to accept new jobs: the system load exceeds the machine's `max_load` setting. AutoSys queues the job until load drops.

**Difference from PEND_MACH**:
- PEND_MACH: Agent unreachable (machine down, agent crashed, network issue). Requires intervention.
- QUE_WAIT: Agent reachable, machine overloaded. Resolves automatically.

**Diagnosis**: `autorep -M machine-name` shows max_load and current load. If load > max_load, QUE_WAIT is expected.

**Fix**: Wait for load to drop, or increase max_load on the machine definition.

47 jobs stuck in PE status? A full /var partition silently kills the Remote Agent.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

✓ Production tested · Last updated July 27, 2026 · 1,750 articles, all by Naren

Before you start ⏱ 20 min

✓ Basic programming fundamentals
✓ A computer with internet access
✓ Willingness to follow along with examples


⚡Quick Answer

SUCCESS (SU): Job completed with exit code 0. Normal. No action.
FAILURE (FA): Job exited non-zero. Check std_err_file before restarting.
RUNNING (RU): Job executing. Monitor runtime vs expected duration.
PEND_MACH (PE): Agent unreachable. First check: disk space on agent (df -h).
ON_HOLD (OH): Manually held. OFF_HOLD starts immediately if conditions met.
ON_ICE (OI): Suspended. OFF_ICE waits for next scheduling cycle.
ACTIVATED (AC): Job inside RUNNING box, waiting its turn. Normal.
TERMINATED (TE): Job killed (term_run_time or KILLJOB). Investigate why.
Production rule: 20+ jobs in PEND_MACH on same machine = disk full. Clear space, restart agent.

✦ Definition (~90s read)

What is AutoSys Job Status Codes?

AutoSys status codes are the job scheduler's state machine — a finite set of signals that tell you where a job is in its lifecycle, from activation to completion. They exist because distributed batch processing needs a deterministic way to track jobs across thousands of machines without human babysitting.


The codes range from simple success/failure (SUCCESS, FAILURE) to intermediate states (RUNNING, ACTIVATED) and critical stalls (PEND_MACH, PEND_RES). But here's the trap: these codes are not exit codes. An exit code comes from your script's process; an AutoSys status code is the scheduler's opinion about the job's execution state.

They can diverge wildly — a script that exits 0 can still show FAILURE if AutoSys detects a timeout or machine issue after the process ended.

Where this bites you hardest is with PEND_MACH. Most engineers treat PEND_MACH as 'waiting for a machine' and assume it'll resolve. But PEND_MACH with a disk space issue means the job will never start — it's a permanent stall masked as a transient status.

The scheduler won't fail the job; it just sits there, consuming a slot in your queue, while your downstream dependencies silently starve. The real problem is that AutoSys status codes hide the root cause. PEND_MACH doesn't tell you why the machine isn't available — disk full, agent down, network partition, or resource exhaustion all look identical.

You have to dig into the agent logs or the job's output to distinguish a temporary blip from a deadlock.

In the ecosystem, AutoSys competes with tools like Control-M, Tivoli Workload Scheduler, and open-source alternatives like Apache Airflow. The key difference: modern schedulers expose richer state information (Airflow's task instances show 'up_for_retry' vs 'failed' with distinct reasons).

AutoSys's status codes are a legacy design: terse, ambiguous, and built for character-terminal output. When you hit PEND_MACH for disk space, you're fighting a decades-old abstraction that conflates 'not yet ready' with 'will never be ready.' The fix isn't to watch status codes; it's to instrument your jobs with pre-flight checks and alert on PEND_MACH duration exceeding a threshold.

Treat any PEND_MACH longer than 5 minutes as a failure until proven otherwise.
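The 5-minute rule can be sketched as a small shell check. This is only an illustration: `pend_mach_overdue` is a hypothetical helper, and it assumes your monitoring already records the epoch second at which a job entered PEND_MACH (AutoSys does not hand you that value in this form).

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether a job has sat in PEND_MACH too long.
# $1 = epoch second the job entered PEND_MACH (assumed to come from your monitoring)
# $2 = current epoch second
# $3 = threshold in seconds (optional, defaults to 300 = 5 minutes)
pend_mach_overdue() {
  local since="$1" now="$2" threshold="${3:-300}"
  local elapsed=$(( now - since ))
  if [ "$elapsed" -gt "$threshold" ]; then
    echo "ALERT: PEND_MACH for ${elapsed}s (threshold ${threshold}s)"
    return 0
  fi
  echo "ok: PEND_MACH for ${elapsed}s"
  return 1
}
```

A cron wrapper could call it as `pend_mach_overdue "$entered_at" "$(date +%s)"` and page someone when it returns 0.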

Plain-English First

AutoSys job statuses are like package tracking codes — each status tells you exactly where your job is in its lifecycle. RUNNING means it's on the truck. SUCCESS means it was delivered. FAILURE means something went wrong. PEND_MACH means the truck broke down.

You'll stare at autorep output every day. SU, FA, RU, OH, OI, PE, TE, AC, IN, ST, QU, RE.

That's 12 status codes. Some are normal. Some need immediate action. Some are traps that operations teams misinterpret for hours.

This article covers each status — what it means, what causes it, and exactly what to do. The difference between ON_HOLD and ON_ICE appears in every AutoSys interview. Master it here.

What AutoSys Status Codes Actually Tell You — and What They Hide

An AutoSys status is not an integer emitted by your job. The process exit code (0–255) comes from the operating system when the job's command ends; the agent reports it to the scheduler, which maps it to a status such as SUCCESS or FAILURE. Statuses like TERMINATED and PEND_MACH describe what the scheduler did or could not do, and by the time you read a report the raw exit code is often the least informative field.

In practice, the most dangerous status is PEND_MACH (pending machine): the job never started because the scheduler could not reach an agent on the target machine. A full disk on the agent host can stop the agent silently, leaving the job in PEND_MACH with no exit code and no failure event. This is not a job failure; it's an infrastructure failure that looks like a scheduling delay.

Senior engineers treat PEND_MACH as a canary for disk space or agent health, not a job issue. Use status codes to distinguish between job logic failures (FAILURE with a non-zero exit) and execution failures (PEND_MACH, TERMINATED). In real systems, monitoring PEND_MACH trends catches disk-full scenarios before they cascade.

⚠ PEND_MACH ≠ Job Failure

A job stuck in PEND_MACH has no meaningful exit code because the agent never ran it. Check disk space on the agent host, not the job logic.

📊 Production Insight

A payment settlement job entered PEND_MACH during month-end close because the primary agent's /var partition hit 98% usage.

The job never started, no alert fired (nothing had failed, so there was no exit code to alert on), and settlements were delayed 6 hours until manual intervention.

Rule: Monitor PEND_MACH duration — anything over 5 minutes on a production job triggers disk-space check on the agent host.

🎯 Key Takeaway

PEND_MACH is an infrastructure failure, not a job failure: the job never ran, so there is no exit code to inspect.

Exit codes alone miss this state. Always log the scheduler's status string, not just the integer.

Trend PEND_MACH counts per agent host to detect disk pressure before jobs fail.
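One way to get a number you can trend is to count report rows that carry a given two-letter status. The sketch below assumes the report is piped in on stdin and that the status appears as its own whitespace-separated field after the job name, as in this article's sample output; `count_status` is a hypothetical name, and your release's column layout may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count report rows whose status field matches $1.
# Reads an autorep-style report on stdin. Field 1 (the job name) is skipped
# so a job that happens to be named like a status code is not miscounted.
count_status() {
  local status="$1"
  awk -v st="$status" '
    { for (i = 2; i <= NF; i++) if ($i == st) { n++; break } }
    END { print n + 0 }
  '
}
```

Something like `autorep -J % | count_status PE`, sampled every few minutes and written to your metrics system, gives the trend line.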

thecodeforge.io

Autosys Job Status Codes

Normal lifecycle statuses

These are the statuses a job moves through during a healthy execution. You should see them regularly and not be alarmed.

INACTIVE (IN): Job hasn't been triggered yet. Waiting for its schedule or conditions. This is the starting state for most scheduled jobs.

ACTIVATED (AC): The job's parent BOX is in RUNNING state, but the job itself hasn't started yet because its own conditions aren't met. This is normal — it means the box is active and the job is queued.

STARTING (ST): Brief transition state. The Event Processor sent the start command to the agent, and the agent is spinning up the process. Usually lasts under 3 seconds.

RUNNING (RU): Job is executing on the agent machine. Monitor runtime against expected duration. A job that runs more than 2x longer than usual may be hung.

SUCCESS (SU): Job completed with exit code 0. Normal completion. No action needed.

RESTART (RE): Job is being automatically retried after a failure (n_retrys configured). This is normal for jobs with retry logic. Only be concerned if the final attempt also fails.

check_status.sh (BASH)

# Check status of a single job
autorep -J daily_report

# Check status with full details (start/end times, exit code)
autorep -J daily_report -d

# List all jobs and their current status
autorep -J %

# Filter by status with grep on the two-letter code
# (autorep's -s flag selects the summary report; it is not a status filter)
autorep -J % | grep ' SU '
autorep -J % | grep ' FA '
autorep -J % | grep ' RU '

# Check if a job is hung: compare its start time with the expected duration
autorep -J long_job -d

Output

Job Name           ST  Exit  Start Time           End Time
----------------------------------------------------------------------
daily_report       SU  0     03/19/2026 02:00:01  03/19/2026 02:08:43
weekly_reconcile   FA  1     03/19/2026 03:00:00  03/19/2026 03:01:22
monthly_summary    OH  --    --                   --

Mental Model

The Package Tracking Model

PEND_MACH is 'truck broke down'. ON_HOLD is 'hold at depot'. The analogy makes the statuses intuitive.

INACTIVE → ACTIVATED → STARTING → RUNNING → SUCCESS (normal flow)
FAILURE, TERMINATED, PEND_MACH = delivery exceptions
ON_HOLD/ON_ICE = manual holds (customer requested hold)
The Event Processor is the dispatch center. Agents are the delivery trucks.

📊 Production Insight

A team saw a job in ACTIVATED status for 3 hours and assumed it was stuck. They force-started it. The job ran and failed because its dependency hadn't completed yet. The original job would have started 10 minutes later when the dependency finished.

ACTIVATED is not stuck. It means 'waiting for conditions inside a running box'. The job was fine. The team's impatience caused a failure.

Rule: ACTIVATED only becomes a problem if the parent box has been RUNNING for longer than the expected total duration of all jobs in the box. Otherwise, it's normal.

🎯 Key Takeaway

IN → AC → ST → RU → SU is normal job lifecycle.

ACTIVATED means 'waiting inside running box', not stuck.

RUNNING for 2x expected = possibly hung.

SUCCESS doesn't guarantee correct output — validate separately.

Is this status normal or a problem?

If: Job in ACTIVATED for < 1 hour, box still RUNNING
→ Normal. Job is waiting for its conditions inside the box.

If: Job in RUNNING for > 2x expected duration
→ Probably hung. Investigate. May need KILLJOB.

If: Job in PEND_MACH, other jobs on same machine also PEND_MACH
→ Problem. Agent machine likely down or disk full. Check immediately.

If: Job in FAILURE, first time in weeks
→ Problem. Investigate root cause before restarting.
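The decision list above can be written as a small triage function. This is a sketch, not an official rule set: `triage` is a hypothetical helper, times are in minutes, and the thresholds (1 hour for ACTIVATED, 2x expected for RUNNING) are simply the ones this article uses.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper mirroring the decision list.
# $1 = two-letter status, $2 = minutes in that status, $3 = expected runtime in minutes
triage() {
  local status="$1" elapsed="${2:-0}" expected="${3:-0}"
  case "$status" in
    AC)
      if [ "$elapsed" -lt 60 ]; then
        echo "normal: waiting inside running box"
      else
        echo "check: compare parent box runtime with expected total"
      fi ;;
    RU)
      if [ "$expected" -gt 0 ] && [ "$elapsed" -gt $(( expected * 2 )) ]; then
        echo "investigate: possibly hung"
      else
        echo "normal: running"
      fi ;;
    PE) echo "problem: check agent disk space and process" ;;
    FA) echo "problem: find root cause before restarting" ;;
    *)  echo "no rule: review manually" ;;
  esac
}
```

For example, `triage RU 50 20` flags a job that has run 50 minutes against an expected 20.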

All status codes with abbreviations

The autorep command uses two-letter abbreviations for status codes. Here they all are, including the less common ones:

Normal/Interim: AC (ACTIVATED), IN (INACTIVE), ST (STARTING), RU (RUNNING), SU (SUCCESS), RE (RESTART)

Problem/Unusual: FA (FAILURE), PE (PEND_MACH), TE (TERMINATED), QU (QUE_WAIT)

Manual holds: OH (ON_HOLD), OI (ON_ICE)

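Less common: DE (DELAYED, job waiting for start time), OP (OPERATOR, operator held), SP (STOPPED), UP (UP STREAM, waiting on upstream), WD (WAIT_DEP, waiting on file watcher). These fall outside the 12 everyday codes and vary by release, so confirm them against your version's status reference before automating on them.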

The difference between ON_HOLD and ON_ICE is the most misunderstood. ON_HOLD + OFF_HOLD = start immediately if conditions met. ON_ICE + OFF_ICE = wait for next scheduling cycle.

status_abbreviations.txt (TEXT)

AC  ACTIVATED    -- Box is RUNNING but this job hasn't started yet
DE  DELAYED      -- Job waiting to meet its start time criteria
FA  FAILURE      -- Job completed with non-zero exit code
IN  INACTIVE     -- Job hasn't been triggered; waiting for schedule
OH  ON_HOLD      -- Manually held; won't run until released
OI  ON_ICE       -- Suspended; won't run even when conditions reappear
OP  OPERATOR     -- Operator held (verify for your release; the standard hold event is JOB_ON_HOLD)
PE  PEND_MACH    -- Waiting for target machine (agent offline)
QU  QUE_WAIT     -- Waiting for available machine load/resources
RE  RESTART      -- Job is being restarted after failure
RU  RUNNING      -- Job is currently executing
ST  STARTING     -- Job sent to agent; waiting for execution to begin
SU  SUCCESS      -- Job completed with exit code 0
TE  TERMINATED   -- Job was killed (term_run_time, KILLJOB event, etc.)
UP  UP_STREAM    -- Waiting on upstream dependencies
WD  WAIT_DEP     -- Waiting on file watcher condition

⚠ ON_HOLD vs ON_ICE — the most misunderstood difference

ON_HOLD: OFF_HOLD starts job immediately if conditions are currently satisfied. ON_ICE: OFF_ICE does NOT start immediately — it waits for conditions to reoccur in the next scheduling cycle. This difference causes major operational confusion.

📊 Production Insight

A financial firm put a midnight reporting job ON_ICE at 2 PM to prevent it from running during database maintenance. After maintenance finished at 4 PM, they did OFF_ICE expecting it to run immediately. It didn't. The job ran at midnight as originally scheduled. The report was 8 hours late.

The team confused ON_ICE with ON_HOLD. They should have used ON_HOLD for a temporary pause during business hours.

Rule: Use ON_HOLD when you want the job to run as soon as you're ready. Use ON_ICE when you want the job to return to its normal schedule without running out-of-cycle.

🎯 Key Takeaway

ON_HOLD + OFF_HOLD = immediate start (if conditions met).

ON_ICE + OFF_ICE = next scheduled cycle only.

Use ON_HOLD for temporary pauses (maintenance, migrations).

Use ON_ICE for permanent schedule changes or to prevent out-of-cycle runs.
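The release behaviour described here is simple enough to write down as a lookup. The sketch below is a model of the rule for reasoning and runbooks, not something AutoSys provides; `release_outcome` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical model of what happens when a held job is released.
# $1 = ON_HOLD or ON_ICE, $2 = "yes" if starting conditions are met right now
release_outcome() {
  local state="$1" conditions_met="$2"
  case "$state" in
    ON_HOLD)
      if [ "$conditions_met" = "yes" ]; then
        echo "runs now"
      else
        echo "waits for conditions"
      fi ;;
    ON_ICE)
      # Off ice never triggers an immediate run, whatever the conditions are.
      echo "waits for next scheduling cycle" ;;
    *)
      echo "unknown state: $state" >&2
      return 2 ;;
  esac
}
```

For the maintenance scenario above, `release_outcome ON_ICE yes` still reports a wait, which is exactly the surprise the team hit.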


The problem statuses and what to do

These statuses almost always require your attention. Each has a specific diagnosis path.

problem_statuses.sh (BASH)

# ── FAILURE: job failed ──────────────────────────────────────────
# Check the error log first
cat /logs/autosys/daily_report.err
# Then check the event log
autorep -J daily_report -d
# Review the previous run (-r takes a run number; negative values count back)
autorep -J daily_report -r -1
# Retry after fixing the underlying issue
sendevent -E FORCE_STARTJOB -J daily_report

# ── PEND_MACH: agent machine offline ─────────────────────────────
# First: check disk space on agent (MOST COMMON)
ssh prod-server-01 'df -h'
# Then check agent status
autoping -m prod-server-01
autorep -M prod-server-01
# Restart agent after fixing issue (run on the agent host; path depends on your install)
/opt/CA/agent/bin/agent_start

# ── TERMINATED: job was killed ────────────────────────────────────
# Check the configured limit, then look for a manual kill event
autorep -J slow_job -q | grep term_run_time
autorep -J slow_job -d | grep KILLJOB
# If due to term_run_time, increase it (value is in minutes) by feeding JIL to jil:
jil <<'EOF'
update_job: slow_job
term_run_time: 180
EOF

# ── QUE_WAIT: machine overloaded ──────────────────────────────────
# Check machine load limits
autorep -M machine-name
autorep -J jobname -q | grep load

Output

FAILURE: Error in /logs/autosys/daily_report.err:
ERROR: Database connection timeout after 30s

PEND_MACH: prod-server-01: MISSING (agent service down)
Disk usage on prod-server-01: /apps 100% full, agent stopped

TERMINATED: term_run_time: 60 (exceeded). Job ran for 62 minutes.

🔥QUE_WAIT vs PEND_MACH — subtle but important difference

PEND_MACH = agent unreachable (machine down or agent stopped). QUE_WAIT = agent reachable but machine too busy (load exceeds max_load). QUE_WAIT resolves automatically when load drops. PEND_MACH requires intervention.

📊 Production Insight

A team saw a job in QUE_WAIT and assumed the agent was down. They restarted the agent, which cleared the queue, but the underlying high load was caused by a different job consuming all CPU. The high-load job continued to run, and QUE_WAIT returned immediately.

QUE_WAIT doesn't mean 'agent broken'. It means 'machine is too busy right now'. AutoSys is protecting the machine from overload.

Diagnosis: autorep -M machine-name shows the max_load setting and current load. If load > max_load, jobs wait.

Fix: Increase max_load on the machine definition, or reduce load from other jobs. Don't restart the agent — that doesn't solve the load issue.

Rule: QUE_WAIT = load problem. PEND_MACH = agent/network problem. Don't confuse them.

🎯 Key Takeaway

FAILURE: Check error log first. Never restart without investigating.

PEND_MACH: First check disk space (df -h). 90% of cases.

TERMINATED: term_run_time or KILLJOB. Increase runtime or fix hang.

QUE_WAIT: Machine overload. Increase max_load or reduce load.

The Status Transition Matrix: Why Your Job Didn't Fail — It Just Never Started

Most engineers treat AutoSys status codes as final verdicts. They're not. They're snapshots of a sequence that can stall at any point. A job stuck in ACTIVATED for 15 minutes isn't failing; it's waiting on a condition that hasn't evaluated true yet. A job bouncing between QUE_WAIT and STARTING is a dispatcher bottleneck, not a script error.

The real skill is reading the transition path, not the final status. When you see INACTIVE after a failure, that's the system cleaning house — your job already exhausted its retry limit. When you see TERMINATED with exit code 0, someone sent a kill signal mid-execution. The exit code is meaningless then.

Build a mental model of the state machine. Every status has a permitted predecessor and successor. If you see a transition that shouldn't happen — like SUCCESS flipping to INACTIVE without an intervening failure — you're looking at a force-restart or a database corruption. Both require human intervention, not automation.

StatusTransitions.yml (YAML)

# io.thecodeforge: devops tutorial
# Illustrative YAML notation, not JIL. A real AutoSys definition would use
# insert_job: and condition: s(data_ingestion_job) instead.

# Valid transition paths for an AutoSys job
name: payment_reconciliation_job
machine: payment-prod-01
command: /opt/scripts/run_reconciliation.sh
max_run_alarm: 900

# Valid status sequences:
# STARTING -> RUNNING -> SUCCESS
# STARTING -> RUNNING -> FAILURE -> INACTIVE (retries exhausted)
# ACTIVATED -> QUE_WAIT -> STARTING -> RUNNING -> TERMINATED

# The bad one (rarely noticed in logs, but it happens in prod):
# SUCCESS -> INACTIVE without FAILURE
# This = a manual force-restart or Event Server glitch

dependencies:
  - job: data_ingestion_job
    condition: success

# if data_ingestion_job never starts, payment_reconciliation_job stays ACTIVATED
# It didn't fail. It never got the signal.

Output

Actual transition seen in production logs:

2024-03-15 14:30:02 ACTIVATED -> QUE_WAIT
2024-03-15 14:30:05 QUE_WAIT -> STARTING
2024-03-15 14:30:08 STARTING -> RUNNING
2024-03-15 14:45:12 RUNNING -> FAILURE
2024-03-15 14:45:13 FAILURE -> INACTIVE

Engineer's assumption: script bug
Reality: remote agent process died mid-execution

⚠ Production Trap:

If ACTIVATED persists after the job's upstream conditions have demonstrably been met, check how the scheduler is evaluating those conditions (Event Processor logs), not the job. The job didn't fail to start. The scheduler hasn't acted on its dependencies.

🎯 Key Takeaway

A job's current status tells you where it is; the transition history tells you why.
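Reading the transition history can be partly automated by checking each step against a list of expected transitions. The sketch below is illustrative only: the allowed pairs are taken from the sequences discussed in this section (using QUE_WAIT, the status name from the reference table) and are not a complete AutoSys state machine, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative, incomplete list of expected transitions ("FROM>TO").
ALLOWED_TRANSITIONS="
INACTIVE>ACTIVATED
INACTIVE>STARTING
ACTIVATED>STARTING
ACTIVATED>QUE_WAIT
QUE_WAIT>STARTING
STARTING>RUNNING
RUNNING>SUCCESS
RUNNING>FAILURE
RUNNING>TERMINATED
FAILURE>RESTART
RESTART>STARTING
FAILURE>INACTIVE
"

# Return 0 if moving from status $1 to status $2 is on the list.
valid_transition() {
  local pair
  for pair in $ALLOWED_TRANSITIONS; do
    [ "$pair" = "$1>$2" ] && return 0
  done
  return 1
}

# Walk a status history given as arguments; report the first unexpected step.
check_path() {
  local prev="" cur
  for cur in "$@"; do
    if [ -n "$prev" ] && ! valid_transition "$prev" "$cur"; then
      echo "unexpected: $prev -> $cur"
      return 1
    fi
    prev="$cur"
  done
  echo "ok"
}
```

`check_path RUNNING SUCCESS INACTIVE` flags the SUCCESS to INACTIVE step, which this section treats as a sign of manual intervention.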

Exit Codes vs. AutoSys Status Codes: The Lie You've Been Told

Here's what nobody tells you: AutoSys status codes and process exit codes are different layers of reality. A job can exit with code 0 (success to the OS) and still be flagged FAILURE because of how the job definition maps exit codes. Conversely, a script that dumps core with exit code 139 can be registered as SUCCESS if the definition widens the success range that far (for example, with a high max_exit_success).

Exit code mapping is a minefield. Default behavior: any non-zero exit is FAILURE. But production systems routinely treat exit code 2 as SUCCESS because some old Perl script uses 2 for "no data found", which is not an error. Miss that mapping in either direction and you end up debugging an AutoSys FAILURE that is actually a business logic success.

Stop looking at status codes in isolation. Cross-reference them with the actual exit code recorded in the job's logs. If you see a non-success status with exit code 0, suspect a runtime limit: term_run_time kills the job (max_run_alarm only raises an alarm), and the exit code on the report may be stale. If you see SUCCESS with exit code 137, someone sent SIGKILL to the process and the job's success range swallowed it.

Map your exit codes explicitly. Put it in version control. Treat the defaults as suspicious.

ExitCodeMapping.yml (YAML)

# io.thecodeforge: devops tutorial
# Illustrative YAML notation, not JIL. exit_code_mapping is shorthand for this
# article; in JIL the closest controls are max_exit_success and, in newer
# releases, success_codes / fail_codes.

# Profile for a job that 'succeeds' when no data exists
job_name: crm_export_loader
machine: crm-worker-03
command: /opt/scripts/load_crm_data.sh

# Default AutoSys behavior: non-zero exit = FAILURE
# This mapping changes the rules:
exit_code_mapping:
  - exit_code: 0
    description: Data loaded successfully
    status: SUCCESS
  - exit_code: 2
    description: No new data available
    status: SUCCESS
  - exit_code: 1
    description: Database connection lost
    status: FAILURE
  - exit_code: 137
    description: Process killed by OOM
    status: TERMINATED

term_run_time: 5
# Minutes. If the job runs past this limit, AutoSys terminates it
# (max_run_alarm would only raise an alarm, not kill the job).
# Exit code from the kill signal = 143, not from the script
# Without an explicit rule, exit 143 is just another non-zero code

Output

Engineer sees in AutoSys console:

Status: TERMINATED
Exit Code: 0

Reality check from job log:

"Job exceeded term_run_time of 5 minutes. AutoSys killed process with SIGTERM."

The exit code 0 was from an earlier log entry. The actual termination exit was 143, but the report still showed the last value captured before the kill.

🔥Senior Shortcut:

Always decide explicitly how exit codes 143 (128 + SIGTERM) and 137 (128 + SIGKILL) are classified. If you see those codes on a job with no explicit rule for them, AutoSys may be lying to you about whether the job ran clean.

🎯 Key Takeaway

Exit codes tell you what the OS saw; status codes tell you what AutoSys decided. Never trust one without the other.
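When cross-referencing, it helps to decode the exit code before arguing about the status. The sketch below relies on the common shell convention that a code above 128 means "128 plus the signal number"; a program can also return such a value on its own, so treat the result as a strong hint rather than proof. `describe_exit` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: give a plain-language reading of a process exit code.
# Codes above 128 follow the shell convention 128 + signal number.
describe_exit() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: process reported success"
  elif [ "$code" -gt 128 ]; then
    echo "exit $code: likely killed by signal $(( code - 128 ))"
  else
    echo "exit $code: process reported an error"
  fi
}
```

Here 137 decodes to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM), the two codes called out above.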

● Production incident · POST-MORTEM · severity: high

The 2 AM PEND_MACH That Wasn't a Network Problem

Symptom

autorep shows 47 jobs all with PE status on the same agent machine. autoping -m agent-host returns MISSING. No recent changes to network or firewall.

Assumption

The engineer assumed network or firewall because 'agent unreachable' sounds like connectivity. They didn't check the most common cause first.

Root cause

The Remote Agent service writes logs to /var/log/autosys. When /var fills 100%, the agent process stops. The agent doesn't crash with an error — it just exits. AutoSys's Event Processor keeps trying to contact it, sees no response, and leaves jobs in PEND_MACH. The agent machine had a 20GB /var partition. Log rotation failed 3 days ago. The partition filled completely.

Fix

1. SSH to agent machine: ssh agent-host
2. df -h → /var 100% used
3. Find large files: du -sh /var/log/* | sort -h
4. Rotate or delete old logs: sudo logrotate -f /etc/logrotate.conf
5. Restart agent: /opt/CA/WorkloadAutomationAE/agent/bin/agent_start
6. Verify: autoping -m agent-host → ACTIVE
7. Jobs will auto-resume from PEND_MACH to RUNNING

Prevention: Set up disk space monitoring at 80% for agent machines. Auto-rotate logs weekly.

Key lesson

PEND_MACH on multiple jobs → same machine → disk space is #1 cause.
Diagnosis order: disk space → agent process → network → everything else.
df -h is the first command, not the last. It saves hours.
Agent machines need disk monitoring. Full disk stops the agent silently.
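The 80% monitoring rule can be sketched as a filter over `df -P` output. This assumes the POSIX column layout (capacity in column 5, mount point in column 6) and mount points without spaces; `full_filesystems` is a hypothetical name, not part of AutoSys.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; $1 = threshold percent (default 80).
full_filesystems() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header row
      pct = $5
      sub(/%/, "", pct)           # "100%" -> "100"
      if (pct + 0 >= t) print $6, pct "%"
    }
  '
}
```

On an agent host this could run as `ssh agent-host df -P | full_filesystems 80`, with any output treated as an alert.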

Production debug guide: FA, PE, TE, OH/OI (the statuses that need your attention)

Symptom · 01: FAILURE (FA), job exited with non-zero code

Fix:
1. Check std_err_file in job definition.
2. autorep -J JOB -r -1 (previous run).
3. Fix root cause (code, missing file, permission).
4. FORCE_STARTJOB after fix.

Symptom · 02: PEND_MACH (PE), agent unreachable

Fix:
1. ssh agent-host, df -h (full disk = #1).
2. ps -ef | grep autosys (agent running?).
3. autoping -m AGENT (from server).
4. Clear disk space, restart agent.

Symptom · 03: TERMINATED (TE), job was killed

Fix:
1. autorep -J JOB -q | grep term_run_time (limit too low?).
2. Check if KILLJOB was sent manually.
3. Increase term_run_time or fix hang reason.
4. Rerun with STARTJOB or FORCE_STARTJOB.

Symptom · 04: ON_HOLD (OH) or ON_ICE (OI), job intentionally held

Fix:
1. Verify hold is intentional (sendevent history).
2. OFF_HOLD: sendevent -E JOB_OFF_HOLD -J JOB (runs immediately if conditions met).
3. OFF_ICE: sendevent -E JOB_OFF_ICE -J JOB (waits for next schedule).

★ Status Codes: 60-Second Diagnosis
Run these commands when you see problem statuses.

FAILURE: find why

Immediate action: check error file and last runs

Commands:
autorep -J JOBNAME -q | grep std_err_file
autorep -J JOBNAME -r -1

Fix now:
cat /path/to/error/file


AutoSys Status Codes — Complete Reference

Status	Abbrev.	Normal or problem?	Typical cause	Action needed?
INACTIVE	IN	Normal	Waiting for schedule/conditions	No
ACTIVATED	AC	Normal	Box is RUNNING, job waiting its turn	No — but monitor if box runs too long
STARTING	ST	Normal	Agent spinning up process	No — brief transition
RUNNING	RU	Normal	Job executing	No — but monitor duration
SUCCESS	SU	Normal	Exit code 0	No
FAILURE	FA	Problem	Non-zero exit code	Yes — investigate and fix
ON_HOLD	OH	Normal (if intentional)	Manual hold (sendevent)	Check if intentional
ON_ICE	OI	Normal (if intentional)	Manual freeze (sendevent)	Check if intentional
PEND_MACH	PE	Problem	Agent unreachable (full disk #1)	Yes — check agent machine
QUE_WAIT	QU	Possible problem	Machine load > max_load	Monitor — resolves automatically
TERMINATED	TE	Problem (usually)	Killed by term_run_time or manual	Investigate cause
RESTART	RE	Normal	Auto-retry after failure	Monitor for continued failure


Key takeaways

PEND_MACH = agent unreachable. First check disk space (df -h). 90% of cases.

ON_HOLD + OFF_HOLD = immediate start. ON_ICE + OFF_ICE = next cycle only. Master this difference.

ACTIVATED is normal: job waiting inside running box. Don't force-start it.

FAILURE requires investigation before restart. Check error log first, always.

QUE_WAIT = machine overload. PEND_MACH = agent unreachable. Don't confuse them.

Symptom

Job in ACTIVATED for 3 hours. Team force-starts it. The job's dependency wasn't ready. The job fails. The original job would have started correctly 10 minutes later.

Fix

ACTIVATED means 'waiting inside a running box'. It's not stuck unless the parent box has been running longer than the total expected duration of all jobs in the box.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): What does PEND_MACH status mean in AutoSys?
Q02 (Senior): What is the difference between ON_HOLD and ON_ICE in AutoSys?
Q03 (Junior): What does ACTIVATED status mean for a job inside a BOX?
Q04 (Senior): How do you force-start a job that is in FAILURE status?
Q05 (Senior): What is the most common cause of jobs going to PEND_MACH in bulk?
Q06 (Senior): What does QUE_WAIT status mean and how is it different from PEND_MACH?

Q01 of 06 (Senior): What does PEND_MACH status mean in AutoSys?

ANSWER

PEND_MACH (PE) means the job is waiting for its target machine to become available. The Remote Agent on that machine is not responding to the Event Processor.

Most common cause (90% of cases): The agent machine's filesystem is 100% full, causing the agent service to stop.

Other causes:
- Agent service not running
- Machine offline
- Network issue (firewall blocking port 7520)
- Agent machine overloaded (different from QUE_WAIT)

Diagnosis order:
1. SSH to agent machine: df -h (check disk space)
2. Check agent process: ps -ef | grep autosys
3. Test connectivity: autoping -m agent-host

Fix: Clear disk space, restart agent, or fix network issue.

