AutoSys Architecture — Full /var Crashes Agents Silently
A full /var disk crashed the Remote Agent silently, leaving 500 jobs in PEND_MACH.
- AutoSys components: Event Server (database of all jobs/events), Event Processor (scheduler daemon), Remote Agent (job executor), client tools (jil, autorep, sendevent)
- Event Server stores all job definitions, event history, machine definitions; source of truth for entire AutoSys environment
- Event Processor runs continuously, evaluates conditions, triggers jobs on Remote Agents — only ONE per AutoSys instance
- Performance: Event Processor polls Event Server (~30s interval), so build that delay in as a buffer when designing real-time alerts
- Production trap: Remote Agent machine runs out of disk space — agent crashes, all jobs on that machine go PEND_MACH (stuck), no auto-recovery
- Biggest mistake: Running multiple Event Processors — corrupts job state, leads to duplicate job execution
Think of AutoSys like a restaurant. The Event Server is the order book — it stores every job definition and event. The Event Processor is the head chef — it reads the orders and decides what to cook next. The Remote Agents are the kitchen staff on different floors — they actually execute the work. The GUI is the front-of-house — you see what's happening and can make changes.
Before you write a single line of JIL or schedule your first job, it helps to understand how AutoSys actually works under the hood. The architecture is straightforward but knowing what each component does — and why — will save you a lot of head-scratching when things go wrong in production.
AutoSys has four major components that work together: the Event Server, the Event Processor, Remote Agents, and client tools. Each has a clear job, and understanding the flow between them makes debugging much easier.
By the end you'll know exactly how job definitions flow from JIL to Event Server to Event Processor to Remote Agent and back. You'll understand what PEND_MACH means and why it's the most common production issue. And you'll know the component that, when it fails, stops all job scheduling.
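To make that flow concrete, here is a minimal sketch of the client-side steps. The job name, machine name, and owner are hypothetical, and the helper functions are illustrative only. Nothing here talks to a real AutoSys instance: the first function just writes a JIL file, and the second prints the commands you would normally run at each stage.

```shell
#!/usr/bin/env bash
# Sketch only: job name, machine name, and owner are hypothetical values.

# Write a minimal command-job definition in JIL to the given file.
write_demo_jil() {
  local file="$1" job="$2" machine="$3"
  cat > "$file" <<EOF
insert_job: ${job}
job_type: CMD
machine: ${machine}
owner: autosys
command: /bin/echo "hello from ${job}"
EOF
}

# Print, without running them, the client-tool calls for each stage of the flow.
print_flow_commands() {
  local file="$1" job="$2"
  echo "jil < ${file}"                    # definition is loaded into the Event Server
  echo "sendevent -E STARTJOB -J ${job}"  # event the Event Processor picks up
  echo "autorep -J ${job}"                # status the Remote Agent reported back
}
```

Calling write_demo_jil /tmp/demo.jil demo_hello agent01 and then print_flow_commands /tmp/demo.jil demo_hello shows the three steps in order. Printing the commands instead of executing them keeps the sketch safe to try on a machine without AutoSys installed.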
The Event Server — the source of truth
The Event Server is a relational database (typically Sybase or Oracle) that stores everything AutoSys needs to operate. This includes all job definitions (what to run, when, where, under which conditions), all events that have occurred (job started, job succeeded, job failed), global variable values, machine definitions, calendar definitions, and monitor and report definitions.
When a job finishes and reports its status, that status goes into the Event Server. When the Event Processor needs to know whether a dependent job's condition is met, it queries the Event Server. It's the single source of truth for the entire AutoSys environment.
Purge old events regularly (for example with db_purge_events) to keep query performance acceptable.
The Event Processor — the brain
The Event Processor (also called the scheduler or the event daemon) is the most important component. It runs continuously, polling the Event Server for events. When it detects that a job's starting conditions are met — the right time has arrived, dependent jobs have succeeded, the machine is available — it triggers the job to run on the appropriate agent.
The Event Processor also handles time-based scheduling, evaluates job condition logic, and manages the overall state machine for each job. On Unix/Linux it's started with the eventor command. There is only ever one Event Processor running per AutoSys instance.
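Because exactly one Event Processor should be running, a monitoring check can treat both zero and more than one as failures. The sketch below is one possible way to express that. The function names are made up for illustration, and the process name you pass in is an assumption: use whatever name the scheduler daemon appears under on your hosts.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, and the process name is
# supplied by the caller because it can differ between installations.

# Count processes whose name matches exactly. pgrep never matches itself,
# which avoids the classic "grep counts its own process" mistake.
count_procs() {
  pgrep -x "$1" | wc -l | tr -d ' '
}

# Map a process count to a status word.
scheduler_status() {
  case "$1" in
    0) echo "DOWN" ;;
    1) echo "OK" ;;
    *) echo "DUPLICATE" ;;
  esac
}
```

A cron wrapper could call scheduler_status "$(count_procs eventor)" and page someone on anything other than OK. Separating the count from the classification makes the logic easy to test without a live scheduler.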
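Alert when the process count is not exactly one, for example: if [ "$(ps -ef | grep eventor | grep -v grep | wc -l)" -ne 1 ]; then alert; fi (alert stands in for your own notification command; grep -v grep keeps the check from counting itself).
Remote Agents — where the work actually happens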
Remote Agents run on every machine where AutoSys needs to execute jobs. When the Event Processor decides a job should run, it sends a message to the Remote Agent on the target machine. The agent starts the process, monitors it, captures the exit code, and reports the result back to the Event Server.
Agents can be extended with plugins for specific integrations — SAP, Oracle E-Business, PeopleSoft, and others. If an agent goes down, jobs that are supposed to run on that machine go into PEND_MACH status and wait until the agent comes back up.
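Once an agent is back, you usually want a list of the jobs that were left in PEND_MACH before forcing anything. The following is a rough sketch under stated assumptions: it expects report text on standard input with one job per line, the job name in the first column, and a status token somewhere after it. Real autorep layouts vary by release and may abbreviate the status, so the token to match is a parameter. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes one job per line, job name in column 1, and a
# status token (PEND_MACH by default) in some later column.

# Read report text on stdin; print the names of jobs whose status matches.
pend_mach_jobs() {
  awk -v pat="${1:-PEND_MACH}" '{
    for (i = 2; i <= NF; i++) if ($i == pat) { print $1; break }
  }'
}

# For each job name on stdin, print the forced-start command (DRY_RUN=1,
# the default) or actually run it (DRY_RUN=0).
force_start_jobs() {
  while IFS= read -r job; do
    [ -n "$job" ] || continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "sendevent -E FORCE_STARTJOB -J $job"
    else
      sendevent -E FORCE_STARTJOB -J "$job"
    fi
  done
}
```

Piping a saved report through pend_mach_jobs | force_start_jobs prints the commands for review. Defaulting to a dry run is a deliberate choice: force-starting hundreds of jobs at once is worth a human look first.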
Client tools — how you interact with AutoSys
Client tools are the interfaces you use to define, manage, and monitor jobs. The main ones are: jil — the command-line JIL processor for creating and modifying job definitions; autorep — reports job status and definitions; sendevent — manually triggers events like starting a job or putting it on hold; autostatus — checks the current status of a specific job; and the WCC Web UI — a browser-based dashboard for monitoring job flows visually.
Most experienced AutoSys administrators work primarily from the command line using jil, autorep, and sendevent. The GUI is useful for monitoring and for people less comfortable with CLI.
Use autorep -J % -d for monitoring and sendevent -E FORCE_STARTJOB for recovery. Avoid manual GUI actions in automated recovery procedures.
The Full Disk That Froze 500 Jobs
The Remote Agent wrote its log to /var/log/autosys/agent.log by default. A misconfigured application job generated 50GB of debug output overnight, filling /var. When the disk reached 100%, the agent service tried to write to its log, failed, and crashed. The Event Processor sent a heartbeat check to the agent, got no response, and marked the jobs due to run on that machine as PEND_MACH (pending machine). No new jobs could start there. Jobs that were already running kept running, but the crashed agent could not report their completion, so they stayed in RUNNING state until a human intervened.
logrotate with compression and 7-day retention.
3. Set disk_check_interval in agent config to check free space before writing.
4. For the offending job, limited log output to 100MB and added log rotation in the script.
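5. Added a cron job that restarts the AutoSys agent if it's down: if ! ps -ef | grep -q '[a]utosys_agent'; then /etc/init.d/autosys_agent start; fi. The [a] bracket stops grep from matching its own process, which would otherwise make the check always report the agent as running.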
6. Documented PEND_MACH resolution steps: check disk space, restart agent, then sendevent -E FORCE_STARTJOB for stuck jobs.
- Remote Agent crashes are silent. The agent stops without raising an alarm. Jobs go PEND_MACH without notification.
- Always monitor disk space on agent machines. An 80% full alert is your early warning. At 95%, trigger immediate action.
- PEND_MACH does not auto-resolve. Even if the agent restarts, jobs remain PEND_MACH until manually forced.
- The Event Processor cannot distinguish between a crashed agent and a slow agent. It just waits for heartbeat timeout.
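The 80% and 95% thresholds above translate into a small check that can run from cron on each agent machine. This is a minimal sketch, not a full monitoring setup. The function names are illustrative, and what you do with the resulting level (email, pager, ticket) is left to your own tooling.

```shell
#!/usr/bin/env bash
# Sketch only: thresholds follow the 80% warning / 95% critical guidance.

# Print the used percentage (digits only) for the filesystem holding a path.
# df -P gives a stable, POSIX-format layout with capacity in column 5.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Map a used percentage to an alert level.
disk_alert_level() {
  if [ "$1" -ge 95 ]; then
    echo "CRITICAL"
  elif [ "$1" -ge 80 ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}
```

For example, disk_alert_level "$(disk_used_pct /var)" prints one of OK, WARN, or CRITICAL for the filesystem that holds the agent log.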
ps -ef | grep autosys_agent. Check disk space on agent machine: df -h /var. Check agent log: /var/log/autosys/agent.log. Restart agent: /etc/init.d/autosys_agent restart. Then force restart stuck jobs: sendevent -E FORCE_STARTJOB -J job_name.
autoping. If down, restart: eventor. Also check Event Server database connectivity: isql -S autosys (Sybase). Check if Event Processor process exists: ps -ef | grep eventor.
$AUTOSYS/log/event_processor.log. Look for 'lost event' or 'queue overflow'. Increase Event Server buffer size or purge old events.
ps -ef | grep eventor | grep -v grep | wc -l should be 1. If >1, kill duplicate processes. Prevent by using flock on the eventor lock file.
max_threads in configuration.
/etc/init.d/autosys_agent restart. Then force start jobs: for job in $(autorep -J % -m MACHINE_NAME -d | grep PEND_MACH | awk '{print $1}'); do sendevent -E FORCE_STARTJOB -J "$job"; done
Key takeaways
Common mistakes to avoid
Assuming the Event Processor runs the job itself, when it doesn't
Running multiple Event Processor instances on the same AutoSys instance
ps -ef | grep eventor | grep -v grep | wc -l should be 1. Use a lock file in the startup script: flock -n /var/lock/eventor.lock -c eventor.
Not monitoring the Event Server database size
db_purge_events -date '01/01/2026'. Run analyze table on Event Server tables. Set retention policy: keep events for 90 days, archive older events to flat file.
Forgetting that the agent user account needs the right permissions
Not handling PEND_MACH recovery after agent restart
for job in $(autorep -J % -d | grep PEND_MACH | awk '{print $1}'); do sendevent -E FORCE_STARTJOB -J "$job"; done. Or use sendevent -E FORCE_STARTMACH -M machine_name to restart all jobs on that machine (confirm that your AutoSys release supports this event before relying on it).
Interview Questions on This Topic
What are the four main components of AutoSys architecture?
Frequently Asked Questions