Senior 3 min · March 19, 2026

AutoSys Interview Questions — 50 Bank & Insurance Q&As

ON_HOLD runs immediately on release; ON_ICE waits for next cycle.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Production tested · last updated May 23, 2026 · 1,554 articles, all by Naren
Quick Answer
  • ON_HOLD: releasing starts job immediately if conditions met. ON_ICE: releasing waits for next scheduling cycle. Most common wrong answer.
  • PEND_MACH = agent unreachable. First check: disk space on agent (df -h). 90% of cases.
  • date_conditions defaults to 0 (time scheduling disabled). Most people assume it's 1.
  • FORCE_STARTJOB bypasses ALL conditions (time AND dependencies). STARTJOB respects them.
  • box_terminator: 1 stops entire box when job fails. Use on validation jobs only.
  • Global variables: SET_GLOBAL writes, autostatus -G reads, value() in JIL conditions.
Definition (~90s read)
What Is AutoSys?

AutoSys is a distributed job scheduling and workload automation platform from CA Technologies (now Broadcom), used primarily in financial services and insurance to orchestrate batch processing, ETL pipelines, and regulatory reporting. It solves the problem of managing thousands of interdependent jobs across heterogeneous systems (UNIX, Windows, mainframes) with deterministic sequencing, time-based triggers, and event-driven execution.

This article collects the AutoSys questions that actually come up in interviews at banks, insurance companies, and enterprise IT shops — and gives you the complete, correct answers, not the vague half-answers you'll find elsewhere.

In banking and insurance, AutoSys is the backbone for overnight batch cycles that process trades, calculate risk, generate statements, and feed downstream systems; failures here can mean regulatory breaches or multi-million-dollar losses. The platform's core components are the Event Server (a persistent database that stores job definitions and tracks job state), the Event Processor (the scheduling daemon that evaluates conditions and starts jobs), the Agent (runs on each machine to execute jobs), and the client interfaces (GUI, CLI, JIL).

JIL (Job Information Language) is the declarative scripting language used to define jobs, dependencies, conditions, and calendars — think of it as cron on steroids with cross-system awareness. Alternatives include Control-M, Tivoli Workload Scheduler, and open-source tools like Airflow or Rundeck, but AutoSys dominates legacy banking environments due to its mainframe heritage, robust failover, and audit trail capabilities.

You should not use AutoSys for real-time streaming, microservice orchestration, or lightweight cron replacements — it's designed for heavy, stateful, multi-step batch workflows where job order and error recovery are non-negotiable. The interview questions in this article target the specific failure modes and edge cases that trip up candidates: what happens when a job stays in RUNNING state for 8 hours, how to handle event server failover without losing job history, and why 'status: SUCCESS' doesn't always mean the job actually ran correctly.

Plain-English First

AutoSys interviews are specific. Interviewers know the tool. Vague answers about 'scheduling jobs' fail.

This guide assumes you've worked through the other articles in this track. It's your review. The questions are organised from foundational to advanced. The answers are complete, not truncated.

The most common wrong answer? ON_HOLD vs ON_ICE. That question appears in almost every interview. Get it right.

What AutoSys Interview Questions Actually Test

AutoSys is an enterprise job scheduling and workload automation tool used to define, manage, and monitor batch jobs across distributed systems. Its core mechanic is the Job Information Language (JIL), a declarative syntax that specifies job attributes like command, machine, start condition, and dependencies. Interview questions probe your ability to translate business scheduling requirements into JIL definitions and troubleshoot job failures in complex dependency chains.

In practice, AutoSys jobs are event-driven: a job triggers based on time, file arrival, or the exit code of another job. Key properties include box jobs (containers for grouping), global variables, and condition expressions (e.g., 'success(jobA) AND exitcode(jobB) = 0'). Understanding how AutoSys handles job states (SUCCESS, FAILURE, TERMINATED, RUNNING) and the role of the Event Server and Remote Agent is critical for answering scenario-based questions.

Use AutoSys when you need reliable, auditable batch processing with cross-platform orchestration — common in banking for end-of-day reconciliations, report generation, or data warehouse loads. It matters because a misconfigured dependency or missing file trigger can halt a critical business process, causing SLA breaches. Interviewers look for candidates who can design resilient schedules with proper error handling, restart logic, and monitoring.

JIL Is Not a Scripting Language
AutoSys JIL defines job metadata and dependencies, not logic. Do not confuse JIL conditions with shell scripting — they only evaluate exit codes and job states.
Production Insight
A bank's nightly trade settlement job chain failed because a predecessor job succeeded with exit code 1 (a warning), but the downstream job's condition required exitcode(prev) = 0.
Symptom: downstream job never triggered despite upstream completing; operations team found no error in logs, only a 'condition not met' status.
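Rule: define success criteria explicitly. Condition on success(prev) and set max_exit_success on the upstream job so that warning exit codes still count as success, or keep exitcode(prev) = 0 only when a warning really should block the chain. term_run_time and max_run_alarm guard against long runtimes; they do not inspect exit codes.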
Key Takeaway
AutoSys interview questions are 80% JIL syntax and condition logic, 20% troubleshooting failed jobs.
Master box jobs and global variables to reduce duplication and simplify dependency management.
Always design for restartability: use n_retrys for transient failures, term_run_time to stop runaway jobs, and File Watcher jobs (watch_file) so processing waits for its input instead of failing on a missing file.
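The exit-code incident above comes down to two things: how an exit code becomes a job status, and whether the downstream condition looks at the status or at the raw exit code. The following is a minimal Python sketch of that logic, not AutoSys code. It assumes the upstream job is defined with max_exit_success (a JIL attribute); the function names are illustrative only.

```python
def job_status(exit_code: int, max_exit_success: int = 0) -> str:
    """Map an exit code to a status the way max_exit_success does:
    any exit code from 0 up to max_exit_success counts as SUCCESS."""
    return "SUCCESS" if 0 <= exit_code <= max_exit_success else "FAILURE"


def success_condition(exit_code: int, max_exit_success: int = 0) -> bool:
    """Model of success(prev): it looks only at the upstream status."""
    return job_status(exit_code, max_exit_success) == "SUCCESS"


def exitcode_zero_condition(exit_code: int) -> bool:
    """Model of exitcode(prev) = 0: it looks only at the raw exit code."""
    return exit_code == 0


# Upstream exits with 1 (a warning) and is defined with max_exit_success: 1.
warning_exit = 1
print(job_status(warning_exit, max_exit_success=1))         # status of the upstream job
print(success_condition(warning_exit, max_exit_success=1))  # would a success() condition fire?
print(exitcode_zero_condition(warning_exit))                # would an exitcode() = 0 condition fire?
```

With these inputs the upstream job is a SUCCESS, the success() model fires, and the exitcode() = 0 model does not, which is the "condition not met" symptom from the incident.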
Figure: AutoSys Interview Topic Map (grouped by category; know these cold)
  • Architecture: Event Server vs Processor, component roles, HA / shadow server, PEND_MACH causes
  • JIL Commands: insert vs update, delete vs delete_box, autorep -q backup, override_job
  • Job Types: CMD / BOX / FW differences, box_name attribute, box_terminator use, FW min_file_size
  • Status Codes: all abbreviations, ON_HOLD vs ON_ICE, ACTIVATED meaning, TERMINATED causes
  • Scheduling: date_conditions gate, run_window purpose, run_calendar setup, timezone handling
  • Troubleshooting: failure workflow, PEND_MACH → disk check, restart procedure, CHANGE_STATUS use

Architecture and concepts

These questions test whether you understand what AutoSys actually is and how it works internally. They're usually early in the interview to establish baseline knowledge.

architecture_qa.txt
Q: What is AutoSys and what problem does it solve?
A: AutoSys is Broadcom's enterprise workload automation platform for scheduling,
   monitoring, and orchestrating batch jobs across multiple servers. It solves the
   scalability problems of cron: dependency management, centralised visibility, alerting,
   audit trails, and multi-server coordination.

Q: What are the main components of AutoSys architecture?
A: Event Server (database storing all definitions and events), Event Processor
   (scheduling daemon that evaluates conditions and triggers agents), Remote Agents
   (lightweight processes on each target machine), and Clients (CLI tools + WCC web UI).

Q: What happens when the Event Processor goes down?
A: Job triggering stops. Jobs that are currently RUNNING continue to completion (the
   agent handles execution independently), but no new jobs will be triggered until
   the Event Processor is restarted.
Interview tip — Event Processor vs Event Server
Interviewers often ask 'what's the difference?' The Event Server is the database (storage). The Event Processor is the daemon (evaluation). One stores state, the other triggers jobs.
Production Insight
A candidate answered 'The Event Processor writes to the Event Server.' That's backwards. The Event Processor reads from the Event Server. The Event Server is written to by agents and sendevent commands. The processor is stateless.
The interviewer asked a follow-up: 'If the Event Server goes down, do running jobs continue?' The candidate didn't know. Answer: Yes — the agent runs jobs independently. But job completion status cannot be written back.
Rule: Know which component does what. If you confuse direction, you fail the architecture section.
Key Takeaway
Event Server = database (storage). Event Processor = daemon (evaluation).
Event Processor down = no new jobs. Running jobs continue.
Agent down = jobs for that machine go to PEND_MACH. Other agents are unaffected.
Know the failure modes: silent, not sudden.
Component failure: what happens to jobs?
  • If the Event Processor crashes: running jobs continue, no new jobs start, and status updates queue.
  • If the Event Server is unreachable: running jobs continue, but completion status can't be saved. The agent may retry.
  • If the Remote Agent on a machine is down: jobs for that machine go to PEND_MACH. Other machines are unaffected.
  • If the network between server and agent is down: jobs for that machine go to PEND_MACH. The agent can't start jobs or report status.

JIL and job operations

These test practical JIL knowledge — what interviewers really want to know is whether you've actually used the tool, not just read about it.

jil_operations_qa.txt
Q: What is the difference between insert_job and update_job?
A: insert_job creates a new job definition — fails if job already exists.
   update_job modifies an existing job (partial update, only changed attributes).
   Fails if job doesn't exist.

Q: What is the difference between delete_job and delete_box?
A: delete_job on a box removes only the box, leaving inner jobs as standalone.
   delete_box removes the box AND all its inner jobs.

Q: How do you back up AutoSys job definitions?
A: autorep -J % -q > backup_$(date +%Y%m%d).jil
   This dumps all job definitions in JIL format to a file.

Q: How do you view the JIL definition of an existing job?
A: autorep -J jobname -q

Q: What does FORCE_STARTJOB do differently from STARTJOB?
A: FORCE_STARTJOB starts the job immediately bypassing all conditions
   (date_conditions, start_times, condition attribute). STARTJOB only triggers
   if conditions are currently met.
Most missed JIL question: delete_job vs delete_box
On a box: delete_job removes the box container. Inner jobs become standalone. delete_box removes box AND inner jobs. This is a common trick question — if you say 'delete_job removes the box and its jobs', you're wrong.
Production Insight
An operations engineer used delete_job on a production box thinking it would remove all inner jobs. It didn't. The box vanished. All inner jobs became orphaned standalone jobs. They continued running on their own schedules, independent of dependencies.
A trading settlement job ran 4 hours early because its parent box was gone. The box had enforced a start time. Without the box, the job ran at its own start time — which was 2 PM, not 6 PM.
Recovery: regenerate box definition from backup (autorep -J boxname -q had been saved). Reinsert box. Reassociate inner jobs with box_name attributes.
Rule: Always have current JIL backups. autorep -J % -q weekly. Delete box? Use delete_box or expect orphaned jobs.
Key Takeaway
insert_job vs update_job: create vs modify. delete_job vs delete_box: box-only vs box+children.
Backups: autorep -J % -q > backup.jil — do this weekly.
FORCE_STARTJOB bypasses ALL conditions. STARTJOB respects them.
JIL is case-sensitive on Linux. JOB vs Job are different.
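The delete_job vs delete_box difference is easier to remember as a data-structure operation. The sketch below is a toy Python model, not AutoSys behaviour in full: it tracks only each job's parent box, handles a single level of nesting, and uses hypothetical job names.

```python
from typing import Dict, Optional

# job name -> parent box name (None means standalone). Names are hypothetical.
Jobs = Dict[str, Optional[str]]


def delete_job(jobs: Jobs, name: str) -> Jobs:
    """Remove one definition; children of a deleted box become standalone."""
    return {
        job: (None if parent == name else parent)
        for job, parent in jobs.items()
        if job != name
    }


def delete_box(jobs: Jobs, box: str) -> Jobs:
    """Remove the box and every job whose box_name points at it."""
    return {
        job: parent
        for job, parent in jobs.items()
        if job != box and parent != box
    }


definitions: Jobs = {
    "settle_box": None,
    "settle_extract": "settle_box",
    "settle_load": "settle_box",
    "unrelated_job": None,
}

print(delete_job(definitions, "settle_box"))  # inner jobs survive as standalone
print(delete_box(definitions, "settle_box"))  # only the unrelated job is left
```

The first call leaves settle_extract and settle_load behind with no parent, which is the orphaned-job situation from the incident above.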

Status codes and troubleshooting

These test operational knowledge — have you actually been on-call for an AutoSys environment? Interviewers love status code questions because they separate theory from practice.

status_trouble_qa.txt
Q: What does PEND_MACH mean and what usually causes it?
A: PEND_MACH (PE) means the Remote Agent on the target machine is unavailable.
   Most common cause: the agent machine's filesystem is 100% full, stopping the
   agent service. Check disk space first: ssh machine01 'df -h'

Q: What is the difference between ON_HOLD and ON_ICE?
A: ON_HOLD: releasing (OFF_HOLD) starts the job immediately if conditions are currently met.
   ON_ICE: releasing (OFF_ICE) makes the job wait for conditions to reoccur in the
   next scheduling cycle — it does not start immediately.

Q: A job was failing every night for a week. What's your troubleshooting approach?
A: 1. Check std_err_file for the error pattern
   2. Check if it's always the same exit code (consistent root cause)
   3. Check autorep -J jobname -r -1 (then -r -2, and so on) to compare recent runs
   4. Check if it correlates with system events (deployments, maintenance)
   5. Engage the application team who owns the script

Q: How do you unblock downstream jobs after manually fixing a failed job?
A: sendevent -E CHANGE_STATUS -J fixed_job -s SUCCESS
   This marks the job SUCCESS so all downstream success() conditions are met.
ON_HOLD vs ON_ICE — the most common wrong answer
Most candidates say 'they're the same'. They're not. OFF_HOLD starts immediately. OFF_ICE waits for next schedule. If you get this wrong, you fail the status section. Know it cold.
Production Insight
A candidate correctly defined ON_HOLD vs ON_ICE. Then the interviewer asked: 'You have a job that runs at midnight. At 2 PM, you put it ON_ICE. At 3 PM, you release it. When does it run?'
The candidate thought: immediately. Wrong. ON_ICE release waits for the next scheduling cycle — midnight. The job ran at midnight, not 3 PM.
The candidate would have failed the real scenario. Operational experience matters more than definitions.
Rule: ON_HOLD = manual overrides during the day. ON_ICE = permanent schedule changes or avoiding out-of-cycle runs.
Key Takeaway
PEND_MACH = agent unreachable. First check: disk space.
ON_HOLD = immediate resume. ON_ICE = next scheduled cycle. Learn it.
CHANGE_STATUS -s SUCCESS unblocks downstream after manual fix.
Troubleshooting = logs + trends + correlation + escalation.
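The release rule for ON_HOLD and ON_ICE can be written as a small decision function. This is a simplified Python sketch of the rule as stated in this section, with an illustrative function name and arguments; it is not how the scheduler is implemented.

```python
from datetime import datetime


def next_start_after_release(held_as: str, released_at: datetime,
                             conditions_met_now: bool,
                             next_scheduled: datetime) -> datetime:
    """Return when the job starts after OFF_HOLD or OFF_ICE.

    held_as            -- "ON_HOLD" or "ON_ICE"
    conditions_met_now -- whether starting conditions are satisfied at release
    next_scheduled     -- the next time the starting conditions recur
    """
    if held_as not in ("ON_HOLD", "ON_ICE"):
        raise ValueError(f"unexpected status: {held_as}")
    if held_as == "ON_HOLD" and conditions_met_now:
        return released_at      # OFF_HOLD: start right away
    return next_scheduled       # OFF_ICE, or conditions not yet met


released = datetime(2026, 3, 19, 15, 0)   # released at 3 PM
midnight = datetime(2026, 3, 20, 0, 0)    # next scheduled cycle

print(next_start_after_release("ON_ICE", released, True, midnight))
print(next_start_after_release("ON_HOLD", released, True, midnight))
```

The ON_ICE call returns midnight regardless of the conditions flag, which matches the 2 PM / 3 PM interview scenario above.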

Advanced and scenario questions

These test whether you can reason about AutoSys in complex real-world situations. Senior-level interviews focus heavily on this section.

advanced_qa.txt
Q: Design an AutoSys workflow for end-of-day batch processing.
A: Use a 3-level hierarchy: master box (overall schedule) → section boxes
   (logical groupings: extract, transform, load, report) → CMD jobs inside each
   section box. Include a pre-check job as box_terminator, n_retrys on I/O jobs,
   alarm_if_fail on all critical jobs, and a post-check job to validate output.

Q: What is box_terminator and when would you use it?
A: box_terminator: 1 on a job means if that job fails, the entire parent box
   immediately moves to FAILURE and all remaining inner jobs are skipped.
   Use it on validation/pre-check jobs whose failure makes all downstream
   processing pointless.

Q: How do you handle a scenario where an upstream file sometimes arrives late?
A: Use a File Watcher job (job_type: FW) with a run_window covering the expected
   arrival period and an appropriate min_file_size. The downstream jobs condition
   on success(file_watcher_job). This way processing starts as soon as the file
   arrives rather than at a fixed time that may be too early.

Q: How do you pass data between AutoSys jobs?
A: Using global variables: the upstream script runs sendevent -E SET_GLOBAL
   -G "VAR_NAME=value". Downstream jobs read it via autostatus -G VAR_NAME or
   reference it in JIL conditions with value(VAR_NAME).
Senior interview tip — mention trade-offs and alternatives
When asked 'how would you design X', don't just give one answer. Say 'Option A is a box with a File Watcher. Option B is a scheduled job with polling. Option A is better because...' Show you can compare approaches.
Production Insight
A senior candidate was asked 'How would you handle a file that arrives in multiple chunks?'
Junior answer: 'Use a File Watcher.'
Senior answer: 'Use a manifest file. Upstream writes one .ready file after all chunks are complete. File Watcher watches .ready. This prevents triggering on partial data. Alternatively, use min_file_size set to the expected final size, but manifest is more reliable because chunk order is unpredictable.'
The senior answer showed consideration of edge cases, alternatives, and trade-offs. That's what gets the offer.
Rule: At senior level, every answer should include 'it depends' and then explain the trade-offs.
Key Takeaway
EOD workflow: hierarchical boxes + pre-check terminator + post-check validation.
box_terminator on validation jobs only. Optional jobs should never be terminators.
File Watcher for unpredictable arrival times. Must have min_file_size and run_window.
Global variables pass data. Use workflow prefixes to avoid collisions.
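The manifest approach from the chunked-file answer can be backed by a small check in the first downstream job. The Python sketch below assumes a hypothetical manifest format (one chunk filename per line, in the same directory as the manifest); the file names are made up for the demo.

```python
import tempfile
from pathlib import Path


def chunks_ready(manifest: Path, min_chunk_bytes: int = 1) -> bool:
    """Return True only when the manifest exists and every chunk it lists
    is present and at least min_chunk_bytes long."""
    if not manifest.is_file():
        return False
    names = [line.strip() for line in manifest.read_text().splitlines() if line.strip()]
    if not names:
        return False
    for name in names:
        chunk = manifest.parent / name
        if not chunk.is_file() or chunk.stat().st_size < min_chunk_bytes:
            return False
    return True


# Demo in a throwaway directory: the manifest lists two chunks, one is late.
with tempfile.TemporaryDirectory() as workdir:
    base = Path(workdir)
    (base / "trades_part1.csv").write_text("id,amount\n1,100\n")
    manifest = base / "trades.ready"
    manifest.write_text("trades_part1.csv\ntrades_part2.csv\n")
    print(chunks_ready(manifest))   # part2 has not arrived yet
    (base / "trades_part2.csv").write_text("id,amount\n2,250\n")
    print(chunks_ready(manifest))   # every listed chunk is now present
```

If the upstream always writes the .ready file last, the File Watcher alone is enough; a check like this only adds protection against an upstream that writes the manifest too early.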

AutoSys Event Server Failover — What the Docs Won't Tell You

You've designed a fault-tolerant architecture: a primary and a shadow Event Processor, dual Event Servers, automatic failover. Then the primary ES dies, and your jobs vanish into a black hole for twenty minutes.

The docs say failover takes 90 seconds. Reality says 5-15 minutes depending on deadlock detection timeouts, unreachable host timeouts, and whatever else your sysadmin configured in the alarm scripts. The Event Server failover timeout is NOT a hard limit. It's a TCP timeout stack. Every hop adds seconds.

Senior engineers set ES_FAILOVER_TIMEOUT to 300 seconds minimum. Anything lower causes false positives when the primary ES is just slow, not dead. And you MUST test failover manually during non-peak hours. Simulate a kill -9 on the ES process. Watch the client machines' logs. If they don't reconnect within 300 seconds, your network team has a problem.

Production lesson: Event Server redundancy gives you zero benefit if the failover doesn't complete before your SLA breach window. Test your recovery time objective (RTO) quarterly. Your auditor will ask for the proof. Have it ready.

eventServerFailoverTest.yml
# io.thecodeforge: devops tutorial
# Simulate ES failover and measure downtime
# Run on a non-production calendar day

event_server_primary:
  host: autosys-es-01.prod.corp
  process_name: eventor
  kill_command: "kill -9 $(pgrep -f eventor)"
  expected_failover_time_seconds: 300
  alert_slack_channel: "#ops-autosys-watchdog"

validation:
  check_every_seconds: 10
  max_retries: 30
  success_condition:
    - "Client machines reconnect to secondary ES"
    - "No scheduled job starts within 5 minutes of failover"
    - "System alarms trigger within 60 seconds"
  failure_action: "Page the on-call architect immediately"

post_test:
  verify_log: "/opt/autosys/logs/es_failover_2025-03-17.log"
  recovery_es_process_restart: true
Output
es_failover_2025-03-17.log
15:30:00 INFO Primary ES kill issued
15:30:02 WARN Client abc-app-01: ES unreachable
15:30:03 WARN Client xyz-db-02: ES unreachable
15:32:15 INFO Secondary ES elected as new primary
15:32:18 INFO Client abc-app-01 reconnected
15:32:19 INFO Client xyz-db-02 reconnected
Total failover time: 135 seconds
Status: PASS (threshold 300s)
Production Trap:
Most AutoSys teams never test failover. They assume it works. Then the SAN goes down on a Tuesday at 3 PM and your VP wants a root cause in 10 minutes. Run a failure drill quarterly. Document every delay.
Key Takeaway
ES failover timeouts are additive across network layers. Set ES_FAILOVER_TIMEOUT to 300s minimum. Test actual RTO quarterly, not just config.
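Measuring the drill result by hand invites arithmetic mistakes, so the RTO figure can be computed from the drill log instead. The Python sketch below parses the sample log shown above. It assumes that exact line layout (HH:MM:SS first, all entries on the same day); real log formats will differ, and the function name is illustrative.

```python
from datetime import datetime

SAMPLE_LOG = """\
15:30:00 INFO Primary ES kill issued
15:30:02 WARN Client abc-app-01: ES unreachable
15:30:03 WARN Client xyz-db-02: ES unreachable
15:32:15 INFO Secondary ES elected as new primary
15:32:18 INFO Client abc-app-01 reconnected
15:32:19 INFO Client xyz-db-02 reconnected
"""


def seconds_between(log: str, start_marker: str, end_marker: str) -> int:
    """Seconds from the first line containing start_marker to the
    last line containing end_marker."""
    def stamp(line: str) -> datetime:
        return datetime.strptime(line.split()[0], "%H:%M:%S")

    lines = log.splitlines()
    starts = [stamp(line) for line in lines if start_marker in line]
    ends = [stamp(line) for line in lines if end_marker in line]
    if not starts or not ends:
        raise ValueError("marker not found in log")
    return int((ends[-1] - starts[0]).total_seconds())


failover = seconds_between(SAMPLE_LOG, "kill issued", "elected as new primary")
all_clients_back = seconds_between(SAMPLE_LOG, "kill issued", "reconnected")
print(failover, all_clients_back)
print("PASS" if all_clients_back <= 300 else "FAIL")
```

Tracking the time until the last client reconnects, and not only the election time, gives a figure closer to what the SLA actually experiences.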

Global Virtual Machines — Why Your 'Simple' Box Stops Running

Junior engineers love Global Virtual Machines (GVM). One config, applies everywhere. Then someone adds a new local machine to the same farm, and suddenly jobs stop starting on it for no reason.

GVM works by matching machine attributes — hostname prefix, OS version, location code, custom tags. If your GVM regex is too permissive, it matches machines you don't intend. If it's too strict, it excludes machines you do intend. The worst part: there's no built-in query to show you what machines a GVM resolves to. You have to write a Python script to parse the AUDIT_DB.

Here's what 90% of production incidents boil down to: someone updates the hostname naming convention (e.g., from 'app-[env]-[num]' to 'app-[num]-[env]') but forgets to update the GVM regex. Jobs start failing silently. No alarm. Just machines sitting idle while boxes stay pending.

Fix: Never use hostname prefix in GVM. Use custom machine attributes — location, environment, role — that are explicitly set and reviewed in change management. Then your GVM becomes a simple AND/OR filter on stable tags. Audit your GVMs quarterly against current machine inventory. Automate it or schedule a manual check in Jira.

gvmAuditScript.yml
# io.thecodeforge: devops tutorial
# Audit GVM definitions against live machine inventory
# Run weekly cron job. Output to Slack.

gvm_name: "APP_LINUX_PROD_BOX"
regex_pattern: "app-[A-Za-z]{3}-[0-9]{3}"
expected_machines:
  - app-prd-001
  - app-prd-002
  - app-prd-003
  - app-prd-004
live_machines_with_prefix:
  - app-dev-001  # Wrong env — matches accidentally
  - app-prd-005  # Missing from GVM? Check if new

check_config:
  audit_db_job: "AUTOSYS_AUDIT_GVM_DEFINITIONS"
  alert_on_mismatch: true
  slack_channel: "#autosys-gvm-audit"
Output
GVM Audit: APP_LINUX_PROD_BOX
MATCHED (unexpected): app-dev-001, app-stg-004
MISSING (expected): app-prd-005 (newly provisioned)
Action: Update GVM regex to exclude non-prod. Add app-prd-005.
Key Takeaway
GVMs are fragile with hostname regex. Use custom machine attributes instead. Audit actual resolution weekly to catch drifts before jobs fail.
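The audit described above reduces to two set differences: machines the pattern matches that it should not, and machines it should match but does not. A minimal Python sketch follows, assuming you can already export the live machine list and the intended member list as plain hostnames. The function name and the inventory are hypothetical.

```python
import re
from typing import Iterable, Set, Tuple


def audit(pattern: str, expected: Iterable[str],
          inventory: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (unexpected, missing) for a hostname pattern.

    unexpected -- hosts the pattern matches that are not in the expected set
    missing    -- expected hosts the pattern fails to match
    """
    regex = re.compile(pattern)
    matched = {host for host in inventory if regex.fullmatch(host)}
    expected_set = set(expected)
    return matched - expected_set, expected_set - matched


inventory = ["app-prd-001", "app-prd-002", "app-prd-003",
             "app-prd-004", "app-prd-005", "app-dev-001"]
expected = [host for host in inventory if "-prd-" in host]

# Too permissive: any three-letter environment code matches.
print(audit(r"app-[a-z]{3}-[0-9]{3}", expected, inventory))
# Too strict: the newly provisioned 005 host is excluded.
print(audit(r"app-prd-00[1-4]", expected, inventory))
```

An empty pair of sets means the pattern resolves to exactly the intended machines; anything else is the drift the weekly audit is meant to catch.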
Production incident · Post-mortem · Severity: high

The Interview Answer That Didn't Match Production

Symptom
The candidate answered: 'ON_ICE, because I want the job to wait until the next cycle after the migration.' That's technically correct. But the interviewer wanted to hear 'ON_HOLD, because after the migration finishes, we want the job to run immediately, not wait until midnight.'
Assumption
The candidate memorised definitions but never applied them to real operations. They didn't understand the operational consequence of the difference.
Root cause
ON_HOLD: release triggers immediate start if conditions are currently true. ON_ICE: release requires time conditions to reoccur in the next scheduling cycle. During a database migration at 2 PM, a job that normally runs at midnight is held. After migration completes at 4 PM:
  • If ON_HOLD: release runs the job at 4 PM (good: you want validation now)
  • If ON_ICE: release does nothing until midnight (bad: you wait 8 hours to validate)
Fix
The candidate learned the rule: ON_HOLD for temporary pauses during business hours where you want immediate resume. ON_ICE for permanent schedule changes or when you don't want out-of-cycle runs. Interview tip: Always follow definition with 'In production, I would use ON_HOLD when... and ON_ICE when...'
Key lesson
  • Memorised definitions are not enough. Apply them to real scenarios.
  • ON_HOLD = immediate resume. ON_ICE = next scheduled cycle.
  • Database migrations: ON_HOLD. Schedule changes: ON_ICE.
  • Interviewers probe with 'when would you use this?' — always have an example.
Production debug guide: the 'walk me through how you'd fix this' questions (4 entries)
Symptom · 01
Job in PEND_MACH status at 2 AM
Fix
Step 1: SSH to agent machine. df -h (full disk is #1 cause). Step 2: ps -ef | grep autosys (agent running?). Step 3: Check network: telnet server 7520. Answer: Most likely full disk stopping agent.
Symptom · 02
Job shows SUCCESS but data didn't update
Fix
Look for sqlplus without error checking. Check std_out_file for ORA- errors. Answer: unless the script sets WHENEVER SQLERROR EXIT, sqlplus returns 0 even when the SQL fails. Always wrap it in a script that greps for ORA-.
Symptom · 03
File Watcher triggered on empty file
Fix
Check min_file_size. Default is 0. Increase to 1024+. Answer: Upstream wrote lock file first.
Symptom · 04
SAP job stuck PENDING, no error
Fix
XBP user password expired or account locked. Check with Basis team. Answer: AutoSys can't see SAP auth failures.
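For the 'SUCCESS but data didn't update' entry above, the wrapper's job is to turn an ORA- error in the captured output into a non-zero exit code so AutoSys marks the job as failed. A minimal Python sketch of that check follows; how the output is captured is left out, and the function name is illustrative.

```python
import re

# Oracle server errors look like ORA-00942: five digits after the prefix.
ORA_ERROR = re.compile(r"\bORA-\d{5}\b")


def exit_code_for(sqlplus_output: str) -> int:
    """Return 1 if the captured output contains an ORA- error, else 0."""
    return 1 if ORA_ERROR.search(sqlplus_output) else 0


clean_run = "1 row updated.\nCommit complete.\n"
failed_run = "ERROR at line 1:\nORA-00942: table or view does not exist\n"

print(exit_code_for(clean_run))
print(exit_code_for(failed_run))
```

A wrapper script would pass this value to sys.exit() after running sqlplus, so the job's exit code reflects the SQL outcome.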
Interview Command Recall: Must-Know Syntax
You will be asked these exact commands. Know them cold.
Back up all job definitions
Immediate action
Use autorep with -q flag
Commands
autorep -J % -q > backup_$(date +%Y%m%d).jil
Fix now
This is a complete backup in JIL format
Check why job isn't starting
Immediate action
View JIL definition and status
Commands
autorep -J JOBNAME -q
autorep -J JOBNAME -d
Fix now
-q shows conditions, -d shows status detail
Force-start a job
Immediate action
Use sendevent FORCE_STARTJOB
Commands
sendevent -E FORCE_STARTJOB -J JOBNAME
sendevent -E CHANGE_STATUS -J JOBNAME -s SUCCESS (to unblock downstream)
Fix now
FORCE_STARTJOB bypasses ALL conditions
Set a global variable
Immediate action
Use sendevent SET_GLOBAL
Commands
sendevent -E SET_GLOBAL -G "COUNT=100"
autostatus -G COUNT
Fix now
No spaces around =
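When a script sets the global variable, the 'no spaces around =' rule is worth enforcing in code so it is not left to memory. The Python sketch below builds the sendevent argument list shown above and rejects malformed names or values. The helper name is hypothetical, and the sketch only builds the arguments; it does not run sendevent.

```python
from typing import List


def set_global_args(name: str, value: str) -> List[str]:
    """Build the argument list for: sendevent -E SET_GLOBAL -G "NAME=value".

    Raises ValueError if the name is empty, contains whitespace or '=',
    or if the value has leading or trailing spaces.
    """
    if not name or "=" in name or any(ch.isspace() for ch in name):
        raise ValueError(f"invalid global variable name: {name!r}")
    if value != value.strip():
        raise ValueError("value must not have leading or trailing spaces")
    return ["sendevent", "-E", "SET_GLOBAL", "-G", f"{name}={value}"]


print(set_global_args("COUNT", "100"))
```

Passing the result to subprocess.run as a list avoids shell quoting issues around the NAME=value pair.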
AutoSys Interview Topics — What to Expect by Level
Topic area | Junior expected depth | Mid-level expected depth | Senior expected depth
Architecture | Name the components | Explain what each does, failure modes | Design HA, predict failure cascades
JIL commands | Basic insert/update/delete syntax | autorep flags, backup strategies | Complex JIL with conditions, variables
Status codes | Recognise SU/FA/RU/IN | PEND_MACH causes, ON_HOLD vs ON_ICE | Recovery procedures for each status
Scheduling | date_conditions, start_times | run_window, run_calendar | Complex calendars, timezone handling
Fault tolerance | n_retrys definition | box_terminator, term_run_time | HA design, recovery strategy
Troubleshooting | Check logs command | Systematic diagnosis workflow | Root cause analysis, prevention

Key takeaways

1. ON_HOLD vs ON_ICE: OFF_HOLD starts immediately if conditions are met. OFF_ICE waits for the next schedule. This is tested in almost every interview.
2. autorep flags: no flag or -s (summary status), -d (detail), -q (JIL dump), -r (a specific run, for example -r -1 for the previous run). Know them cold.
3. PEND_MACH = agent unreachable. First check: disk space (df -h). 90% of cases.
4. date_conditions defaults to 0 (disabled). Most people assume it's 1. That's the trap.
5. FORCE_STARTJOB bypasses ALL conditions (time AND dependencies). STARTJOB respects them.
6. box_terminator: 1 on validation jobs only. Never on optional jobs.
7. Senior answers include trade-offs: 'it depends' + comparison of approaches.
8. Have a real example ready for every concept: 'I used ON_HOLD when...'
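The date_conditions trap in takeaway 4 is a simple gate: time attributes are only honoured when date_conditions is 1, and an omitted attribute behaves like 0. Here is a tiny Python sketch of that gate, with job attributes modelled as a plain dictionary; the job values are made up.

```python
from typing import Dict


def time_scheduling_active(job: Dict[str, str]) -> bool:
    """start_times and related attributes only apply when date_conditions
    is 1; a missing date_conditions is treated as the default, 0."""
    return str(job.get("date_conditions", "0")).strip() == "1"


forgotten_gate = {"start_times": "00:00", "days_of_week": "all"}
with_gate = {"date_conditions": "1", "start_times": "00:00", "days_of_week": "all"}

print(time_scheduling_active(forgotten_gate))
print(time_scheduling_active(with_gate))
```

The first job has a start time but will never run on schedule, which is the usual cause when a newly inserted job 'never starts'.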

Common mistakes to avoid

Memorising answers without understanding the reasoning

Symptom
Candidate defines ON_HOLD vs ON_ICE perfectly. When asked 'which would you use during a database migration?', they guess wrong. Interviewer probes deeper and realises lack of operational experience.
Fix
For every definition, think of a production scenario where you would use it. Practice explaining both ON_HOLD and ON_ICE with real examples.

Not knowing the autorep flags

Symptom
Candidate says 'I would check the job status' but can't name autorep flags. Interviewer asks 'what's the difference between autorep -d and -q?' Candidate doesn't know.
Fix
Memorise: autorep alone (or -s) = summary status table. -d = detail (start/end times). -q = JIL dump (definition). -r = a specific run, for example -r -1 for the previous run.

Confusing ON_HOLD and ON_ICE

Symptom
Candidate says 'they're the same.' Most common wrong answer. Immediate negative signal.
Fix
Repeat: OFF_HOLD starts immediately if conditions met. OFF_ICE waits for next scheduling cycle. If you can't articulate the difference, you haven't operated AutoSys.

Being vague about troubleshooting — 'I would check the logs'

Symptom
Candidate says 'I would check the logs' without specifying which logs or what to look for. Interviewer hears 'I've never actually done this'.
Fix
Specific answers: 'First I check $AUTOUSER/out/event_demon.$AUTOSERV for condition evaluation. Then I check the job's std_err_file. Then I ssh to the agent and check the application log.'

Not having an end-of-day workflow design ready

Symptom
Interviewer asks 'design an EOD batch workflow'. Candidate stalls or gives a flat list of jobs without hierarchy, error handling, or validation.
Fix
Have a pattern ready: master box → section boxes (extract/transform/load/report) → CMD jobs. Include pre-check validation as box_terminator. Include post-check verification. Mention n_retrys on network I/O jobs. This shows you've built real workflows.
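The master box → section boxes → CMD jobs pattern is easier to explain with the JIL in front of you. The Python sketch below generates an illustrative JIL skeleton for it. Job names, script paths, the machine name, and the retry count are all hypothetical, and a real definition would add scheduling attributes such as date_conditions and start_times on the master box.

```python
from typing import List, Optional


def box_jil(name: str, parent: Optional[str] = None,
            condition: Optional[str] = None) -> str:
    """JIL text for a box, optionally nested in a parent box."""
    lines = [f"insert_job: {name}", "job_type: BOX"]
    if parent:
        lines.append(f"box_name: {parent}")
    if condition:
        lines.append(f"condition: {condition}")
    return "\n".join(lines)


def cmd_jil(name: str, box: str, command: str, machine: str,
            condition: Optional[str] = None,
            terminator: bool = False, retries: int = 0) -> str:
    """JIL text for a command job inside a box."""
    lines = [f"insert_job: {name}", "job_type: CMD", f"box_name: {box}",
             f"command: {command}", f"machine: {machine}", "alarm_if_fail: 1"]
    if condition:
        lines.append(f"condition: {condition}")
    if terminator:
        lines.append("box_terminator: 1")
    if retries > 0:
        lines.append(f"n_retrys: {retries}")
    return "\n".join(lines)


def eod_workflow(machine: str) -> str:
    """Master box, pre-check terminator, four chained section boxes, post-check."""
    blocks: List[str] = [
        box_jil("eod_master"),
        cmd_jil("eod_precheck", "eod_master", "/opt/batch/precheck.sh",
                machine, terminator=True),
    ]
    previous = "eod_precheck"
    for section in ("extract", "transform", "load", "report"):
        box = f"eod_{section}_box"
        blocks.append(box_jil(box, parent="eod_master",
                              condition=f"success({previous})"))
        # Retries only on the I/O-heavy sections.
        retries = 2 if section in ("extract", "load") else 0
        blocks.append(cmd_jil(f"eod_{section}", box,
                              f"/opt/batch/{section}.sh", machine,
                              retries=retries))
        previous = box
    blocks.append(cmd_jil("eod_postcheck", "eod_master",
                          "/opt/batch/postcheck.sh", machine,
                          condition=f"success({previous})"))
    return "\n\n".join(blocks)


print(eod_workflow("batchhost01"))
```

In an interview, the useful part is the shape: one terminator on the pre-check, each section box conditioned on the previous one, and the post-check conditioned on the last section.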

Interview Questions on This Topic

Q01 (Junior): What is AutoSys and what makes it better than cron for enterprise batch ...
Q02 (Senior): Explain the AutoSys architecture and the role of each component.
Q03 (Senior): What is the difference between ON_HOLD and ON_ICE? What happens when you...
Q04 (Senior): A job is in PEND_MACH status. Walk me through how you diagnose and fix i...
Q05 (Junior): What does date_conditions do and what is its default value?
Q06 (Senior): What is box_terminator and when would you use it?
Q07 (Senior): How do you design an AutoSys workflow for a complex end-of-day batch run...
Q08 (Senior): What is the difference between FORCE_STARTJOB and STARTJOB?
Q09 (Senior): How would you pass a record count from one AutoSys job to the next?
Q10 (Senior): Walk me through how you recover from a BOX job that went to FAILURE at 3...

Q01 of 10 (Junior)

What is AutoSys and what makes it better than cron for enterprise batch processing?

ANSWER
AutoSys is an enterprise workload automation platform. Better than cron because: cross-server dependencies (cron can't make job B wait for job A on another server), centralised monitoring (cron logs are per-server), alerting (cron only emails errors), audit trails (who changed what and when), retry logic (n_retrys), file-watching (event-driven), and global variables (cross-job data passing). Cron is fine for single-server, independent jobs. AutoSys is for multi-server, dependent workflows with SLAs and compliance requirements.
Frequently Asked Questions

1. What AutoSys questions come up most in interviews?
2. What is the most commonly asked AutoSys interview question?
3. Do I need hands-on AutoSys experience to pass the interview?
4. What is the PEND_MACH answer in AutoSys interviews?
5. How do I explain AutoSys to a non-technical interviewer?