Python os and pathlib Modules — File Paths, Directory Operations and Real-World Patterns
Every real Python application eventually touches the file system. Whether you're building a data pipeline that reads CSVs from a folder, a web scraper that saves results to disk, a CLI tool that organises photos, or a test suite that creates temporary directories — you need to navigate paths, check if files exist, create folders, and do it all in a way that doesn't break when a colleague runs your code on Windows instead of Mac. This is not optional knowledge; it's the plumbing behind almost every non-trivial Python project.
os Module — The Veteran Swiss Army Knife for File System Operations
The os module has been part of Python since version 1. Its job is to give you a portable interface to whatever operating system your code runs on. 'Portable' is the key word — os.path.join('reports', 'july', 'sales.csv') produces reports/july/sales.csv on Linux and Mac, but reports\july\sales.csv on Windows. Without this, hard-coded slashes are a silent bug waiting to ambush you the moment someone else runs your script.
The module splits into two concerns. First, os itself handles process-level work: environment variables, the current working directory, creating and removing directories, listing folder contents. Second, os.path handles operations on path strings — joining segments, checking existence, splitting filenames from their extensions.
You'll reach for os most often in scripts that need to inspect or modify the environment they're running in — reading a config path from an environment variable, making sure a required output directory exists before writing to it, or recursively walking a directory tree. It's procedural, explicit, and it works everywhere.
```python
import os

# --- 1. Current working directory ---
current_dir = os.getcwd()
print(f"Script is running from: {current_dir}")

# --- 2. Build a cross-platform path safely ---
# NEVER hard-code slashes. os.path.join handles the separator for you.
reports_path = os.path.join(current_dir, 'data', 'reports', 'july_sales.csv')
print(f"Target file path: {reports_path}")

# --- 3. Check existence before acting ---
# Trying to open a non-existent file raises FileNotFoundError.
# Always guard against this in production code.
if os.path.exists(reports_path):
    print("File exists — safe to open.")
else:
    print("File not found — creating parent directories now.")
    # exist_ok=True means no error if the folder already exists
    os.makedirs(os.path.dirname(reports_path), exist_ok=True)

# --- 4. Split a path into useful parts ---
file_directory = os.path.dirname(reports_path)  # everything except the filename
file_name = os.path.basename(reports_path)      # just 'july_sales.csv'
name_only, ext = os.path.splitext(file_name)    # ('july_sales', '.csv')
print(f"Directory : {file_directory}")
print(f"Filename : {file_name}")
print(f"Name only : {name_only}")
print(f"Extension : {ext}")

# --- 5. List directory contents (non-recursive) ---
script_dir = os.path.dirname(os.path.abspath(__file__))
print(f"\nFiles and folders in script directory:")
for entry in os.listdir(script_dir):
    full_entry_path = os.path.join(script_dir, entry)
    kind = 'DIR ' if os.path.isdir(full_entry_path) else 'FILE'
    print(f"  [{kind}] {entry}")

# --- 6. Read an environment variable with a safe fallback ---
log_level = os.environ.get('LOG_LEVEL', 'INFO')  # returns 'INFO' if not set
print(f"\nLog level from environment: {log_level}")
```
```
Target file path: /home/user/projects/myapp/data/reports/july_sales.csv
File not found — creating parent directories now.
Directory : /home/user/projects/myapp/data/reports
Filename : july_sales.csv
Name only : july_sales
Extension : .csv

Files and folders in script directory:
  [FILE] os_file_operations.py
  [DIR ] data
  [DIR ] tests

Log level from environment: INFO
```
pathlib Module — Object-Oriented Paths That Actually Make Sense
Introduced in Python 3.4, pathlib was born from a simple frustration: path manipulation using os.path is a collection of disconnected functions that you have to import and chain together in awkward ways. pathlib flips this around — a path becomes an object, and every operation is a method or property on that object. The result is code that reads like English.
The core class you'll use is Path. On Windows it automatically becomes a WindowsPath; on Unix it becomes a PosixPath. You don't pick — Python does. This means your code is genuinely cross-platform without you doing anything extra.
The slash operator (/) is overloaded on Path objects to join path segments. That means Path('data') / 'reports' / 'july_sales.csv' is valid Python and produces the correct path for the current OS. This single feature makes pathlib code dramatically more readable than the equivalent os.path.join chains. For any new code you write today, pathlib is the right default choice. Use os when you need environment variables, process utilities, or you're maintaining legacy code.
```python
from pathlib import Path

# --- 1. Create a Path object — the anchor for everything else ---
# Path(__file__) gives us this script's path as an object, not a raw string
script_path = Path(__file__).resolve()  # resolve() makes it absolute, no '..' segments
project_root = script_path.parent       # go up one level to the containing folder
print(f"Script  : {script_path}")
print(f"Project : {project_root}")

# --- 2. Build paths with / operator — readable and safe ---
data_dir = project_root / 'data'
output_file = data_dir / 'processed' / 'summary.csv'
print(f"Output file will go to: {output_file}")

# --- 3. Create directories — mkdir with parents + exist_ok ---
# parents=True creates every missing folder in the chain
# exist_ok=True suppresses the error if the folder already exists
output_file.parent.mkdir(parents=True, exist_ok=True)
print(f"Ensured directory exists: {output_file.parent}")

# --- 4. Inspect path properties — no functions, just attributes ---
print(f"\nPath anatomy for: {output_file}")
print(f"  .name   : {output_file.name}")    # 'summary.csv'
print(f"  .stem   : {output_file.stem}")    # 'summary'
print(f"  .suffix : {output_file.suffix}")  # '.csv'
print(f"  .parent : {output_file.parent}")  # parent directory
print(f"  .parts  : {output_file.parts}")   # tuple of every segment

# --- 5. Write and read files directly from the Path object ---
output_file.write_text("date,revenue\n2024-07-01,15200\n2024-07-02,18400\n")
content = output_file.read_text()
print(f"\nFile content written and read back:\n{content}")

# --- 6. Glob — find files matching a pattern recursively ---
print("All .csv files anywhere under data/:")
for csv_file in data_dir.rglob('*.csv'):  # rglob = recursive glob
    # relative_to() makes the output cleaner — no giant absolute paths
    print(f"  {csv_file.relative_to(project_root)}")

# --- 7. Check existence and file type ---
print(f"\nDoes output file exist? {output_file.exists()}")
print(f"Is it a file? {output_file.is_file()}")
print(f"Is it a directory? {output_file.is_dir()}")

# --- 8. Rename / move a file ---
archive_file = output_file.with_name('summary_archived.csv')  # same dir, new name
output_file.rename(archive_file)
print(f"\nFile renamed to: {archive_file.name}")
```
```
Project : /home/user/projects/myapp
Output file will go to: /home/user/projects/myapp/data/processed/summary.csv
Ensured directory exists: /home/user/projects/myapp/data/processed

Path anatomy for: /home/user/projects/myapp/data/processed/summary.csv
  .name   : summary.csv
  .stem   : summary
  .suffix : .csv
  .parent : /home/user/projects/myapp/data/processed
  .parts  : ('/', 'home', 'user', 'projects', 'myapp', 'data', 'processed', 'summary.csv')

File content written and read back:
date,revenue
2024-07-01,15200
2024-07-02,18400

All .csv files anywhere under data/:
  data/processed/summary.csv

Does output file exist? True
Is it a file? True
Is it a directory? False

File renamed to: summary_archived.csv
```
Real-World Pattern — Building a Cross-Platform File Organiser
Theory only sticks when you see it solve an actual problem. Here's a pattern you'll encounter constantly: you have an input folder with mixed files, and you need to sort them into subfolders by type, create a manifest of what was moved, and handle edge cases cleanly. This is the kind of task that separates someone who's read the docs from someone who's used the tools in production.
This example uses pathlib as the primary tool (for its readability) and dips into os for the environment variable — which is the natural split. Notice how Path objects flow through the entire function without a single string concatenation. Notice how exist_ok=True means you can run the script multiple times without it crashing on the second run. And notice how iterdir() gives you proper Path objects back, so you can call .suffix and .rename directly without any conversion.
This pattern is also the foundation for more complex tasks: swap iterdir() for rglob('*') to go recursive, add a dry_run flag that prints moves without executing them, or plug in a logging call instead of print. Real codebases are just these small patterns stacked on top of each other.
```python
import os
from pathlib import Path
from datetime import datetime

# --- Configuration via environment variable with sensible default ---
# In production you'd set INBOX_DIR in your .env or CI environment
inbox_dir = Path(os.environ.get('INBOX_DIR', './inbox'))
output_dir = Path(os.environ.get('OUTPUT_DIR', './organised'))
manifest = []  # we'll log every move here

# Map file extensions to human-friendly category folder names
EXTENSION_MAP = {
    '.jpg': 'images', '.jpeg': 'images', '.png': 'images', '.gif': 'images',
    '.mp4': 'videos', '.mov': 'videos',
    '.pdf': 'documents', '.docx': 'documents', '.txt': 'documents',
    '.csv': 'data', '.json': 'data', '.xlsx': 'data',
}


def organise_inbox(source_dir: Path, dest_dir: Path) -> list[dict]:
    """
    Move every file in source_dir into a category subfolder under dest_dir.
    Returns a list of move records for logging or auditing.
    """
    move_log = []

    if not source_dir.exists():
        print(f"Inbox directory not found: {source_dir}")
        return move_log

    # iterdir() yields every item in the directory as a Path object
    for item in source_dir.iterdir():
        if item.is_dir():
            # Skip subdirectories — only handle flat files in this version
            continue

        # .suffix returns the extension including the dot, e.g. '.pdf'
        # .lower() handles cases like '.JPG' vs '.jpg'
        extension = item.suffix.lower()
        category = EXTENSION_MAP.get(extension, 'misc')  # unknown types go to 'misc'
        category_dir = dest_dir / category

        # Create the category folder if it doesn't exist yet
        category_dir.mkdir(parents=True, exist_ok=True)

        destination = category_dir / item.name

        # Handle name collisions — append a timestamp rather than silently overwriting
        if destination.exists():
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            new_name = f"{item.stem}_{timestamp}{item.suffix}"
            destination = category_dir / new_name
            print(f"  Collision detected — renaming to: {new_name}")

        # .rename() moves the file; on some OS/filesystem combos use .replace() instead
        item.rename(destination)
        move_log.append({
            'original': str(item),
            'moved_to': str(destination),
            'category': category,
        })
        print(f"  Moved [{category:10s}] {item.name} -> {destination.relative_to(dest_dir)}")

    return move_log


def write_manifest(move_log: list[dict], dest_dir: Path) -> None:
    """Write a human-readable manifest of all file moves."""
    manifest_path = dest_dir / 'manifest.txt'
    lines = [f"File Organisation Manifest — {datetime.now().isoformat()}\n"]
    lines += [
        f"{r['category']:10s} | {r['original']} -> {r['moved_to']}"
        for r in move_log
    ]
    manifest_path.write_text('\n'.join(lines))
    print(f"\nManifest written to: {manifest_path}")


# --- Entry point ---
if __name__ == '__main__':
    # For demo purposes, create some fake files in the inbox
    inbox_dir.mkdir(parents=True, exist_ok=True)
    for fake_file in ['photo.jpg', 'report.pdf', 'data_export.csv', 'notes.txt', 'video.mp4']:
        (inbox_dir / fake_file).write_text(f"Demo content for {fake_file}")

    print(f"Organising files from: {inbox_dir}")
    print(f"Destination root : {output_dir}\n")

    results = organise_inbox(inbox_dir, output_dir)
    write_manifest(results, output_dir)
    print(f"\nDone. {len(results)} file(s) organised.")
```
```
Destination root : organised

  Moved [images    ] photo.jpg -> images/photo.jpg
  Moved [documents ] report.pdf -> documents/report.pdf
  Moved [data      ] data_export.csv -> data/data_export.csv
  Moved [documents ] notes.txt -> documents/notes.txt
  Moved [videos    ] video.mp4 -> videos/video.mp4

Manifest written to: organised/manifest.txt

Done. 5 file(s) organised.
```
os.walk vs pathlib.rglob — Choosing the Right Recursive Tool
When you need to traverse a directory tree recursively, you have two solid options: the classic os.walk() and pathlib's rglob(). They solve the same problem differently, and knowing when to pick each one marks you as someone who actually thinks about the tools they use.
os.walk() is a generator that yields a three-tuple for every directory it visits: (dirpath, list_of_subdirs, list_of_files). This gives you fine-grained control — you can modify the subdirectory list in-place to prune branches you don't want to descend into. That's powerful when you need to skip hidden directories, node_modules, or .git folders without visiting them at all.
pathlib.rglob('*.csv') is simpler: it returns a flat generator of Path objects matching the pattern, anywhere in the tree. You don't get the tree structure, just the matches. For the common case of 'find me all files of type X', rglob is less code and more readable. Use os.walk when you need to control traversal behaviour; use rglob when you just need the results.
```python
import os
from pathlib import Path

project_root = Path('./sample_project')

# --- Set up a sample directory tree for demonstration ---
for folder in ['src', 'src/utils', 'tests', 'docs', '.git', 'node_modules']:
    (project_root / folder).mkdir(parents=True, exist_ok=True)

for filepath in ['src/main.py', 'src/utils/helpers.py', 'tests/test_main.py',
                 'docs/readme.md', '.git/config', 'node_modules/package.json']:
    (project_root / filepath).write_text(f"# {filepath}")

print("=" * 55)
print("METHOD 1: pathlib rglob — simple pattern matching")
print("=" * 55)

# rglob('*.py') matches any .py file in any subdirectory
for python_file in sorted(project_root.rglob('*.py')):
    print(f"  {python_file.relative_to(project_root)}")
# OUTPUT INCLUDES node_modules and .git if they had .py files — no pruning

print()
print("=" * 55)
print("METHOD 2: os.walk — pruning hidden/vendor directories")
print("=" * 55)

# Directories we never want to descend into
SKIP_DIRS = {'.git', 'node_modules', '__pycache__', '.venv'}

for dirpath, subdirs, filenames in os.walk(project_root):
    # Modify subdirs IN-PLACE — this tells os.walk not to visit pruned folders
    # This is the KEY advantage of os.walk over rglob
    subdirs[:] = [directory for directory in subdirs if directory not in SKIP_DIRS]

    for filename in filenames:
        if filename.endswith('.py'):
            full_path = os.path.join(dirpath, filename)
            relative_path = os.path.relpath(full_path, project_root)
            print(f"  {relative_path}")

print()
print("Notice: .git and node_modules were skipped entirely by os.walk,")
print("but rglob would have descended into them if they contained .py files.")
```
```
=======================================================
METHOD 1: pathlib rglob — simple pattern matching
=======================================================
  src/main.py
  src/utils/helpers.py
  tests/test_main.py

=======================================================
METHOD 2: os.walk — pruning hidden/vendor directories
=======================================================
  src/main.py
  src/utils/helpers.py
  tests/test_main.py

Notice: .git and node_modules were skipped entirely by os.walk,
but rglob would have descended into them if they contained .py files.
```
| Feature / Aspect | os / os.path | pathlib.Path |
|---|---|---|
| Python version introduced | Python 1 (ancient, stable) | Python 3.4+ |
| Style | Procedural — functions on strings | Object-oriented — methods on objects |
| Path joining | os.path.join('a', 'b', 'c') | Path('a') / 'b' / 'c' |
| Get filename | os.path.basename(path) | path.name |
| Get extension | os.path.splitext(f)[1] | path.suffix |
| Check existence | os.path.exists(path) | path.exists() |
| Create directories | os.makedirs(path, exist_ok=True) | path.mkdir(parents=True, exist_ok=True) |
| Read file contents | open(path).read() | path.read_text() |
| Recursive file search | os.walk() with manual filtering | path.rglob('*.ext') |
| Prune traversal branches | Yes — modify subdirs[:] in os.walk | No — must filter results after the fact |
| Environment variables | os.environ.get('KEY', 'default') | Not supported — use os for this |
| Return type of listing | Strings | Path objects (immediately useful) |
| Readability for beginners | Moderate — many functions to remember | High — reads like plain English |
| Best used for | Env vars, process info, legacy code | All new file path manipulation code |
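To make the table concrete, here's a minimal sketch comparing the two APIs on the same hypothetical path (the file doesn't need to exist — these are pure string/object operations):

```python
import os
from pathlib import Path

# The same hypothetical path, built both ways
p_str = os.path.join('data', 'reports', 'july_sales.csv')   # procedural, returns str
p_obj = Path('data') / 'reports' / 'july_sales.csv'         # object-oriented, returns Path

# Both spellings land on the same answers
print(os.path.basename(p_str))       # july_sales.csv
print(p_obj.name)                    # july_sales.csv
print(os.path.splitext(p_str)[1])    # .csv
print(p_obj.suffix)                  # .csv
print(str(p_obj) == p_str)           # True — identical for the current OS
```

The difference is purely ergonomic: the pathlib version returns an object you can keep chaining methods on, while the os.path version hands back a plain string each time.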
🎯 Key Takeaways
- Never hard-code path separators — use pathlib's / operator or os.path.join() and let Python handle the OS difference between / and \\.
- Anchor file paths to __file__, not os.getcwd() — your script's location is fixed, but the working directory depends on where the user runs it from.
- pathlib is the right default for all new code — it returns objects you can keep working with, not raw strings that need re-parsing with more function calls.
- Use os.walk when you need to prune the traversal tree (skip vendor folders, hidden dirs) — modifying subdirs[:] in-place is a feature rglob simply doesn't have.
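The __file__-anchoring takeaway fits in two lines. A minimal sketch — settings.toml is a hypothetical config file used purely for illustration:

```python
from pathlib import Path

# Anchor to the script's own location, not the shell's working directory.
# This resolves to the same place no matter where the user runs the script from.
script_dir = Path(__file__).resolve().parent
config_path = script_dir / 'config' / 'settings.toml'  # hypothetical config file

print(f"Looking for config at: {config_path}")
```

Because resolve() produces an absolute path, config_path stays correct even if the process later changes its working directory.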
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Hard-coding path separators like 'data/reports/file.csv' or 'data\\reports\\file.csv' — your script crashes or produces wrong paths on a different OS. Fix: use os.path.join() or the pathlib / operator so the separator is always correct for the platform running the code.
- ✕ Mistake 2: Using os.getcwd() as the base for relative paths — if someone runs your script from a different working directory (e.g., python scripts/process.py from the project root), all your relative paths point to the wrong place. Fix: always anchor relative paths to __file__ using Path(__file__).resolve().parent or os.path.dirname(os.path.abspath(__file__)) so paths are relative to the script, not the shell.
- ✕ Mistake 3: Calling path.mkdir() without exist_ok=True in a script that might run more than once — raises FileExistsError on the second run, crashing your pipeline. Fix: always pass exist_ok=True (and parents=True if you're creating nested directories) unless you specifically need to error when the folder already exists.
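Mistake 3 is easy to reproduce. A self-contained sketch, run entirely inside a throwaway temp directory so nothing real is touched:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_name:
    target = Path(tmp_name) / 'out' / 'reports'

    target.mkdir(parents=True)  # first run: fine, the folder is created

    second_run_crashed = False
    try:
        target.mkdir(parents=True)  # "second run": folder already exists
    except FileExistsError:
        second_run_crashed = True   # this is the crash that kills pipelines

    # The idempotent version — safe to call any number of times
    target.mkdir(parents=True, exist_ok=True)
    idempotent_ok = target.is_dir()

print(second_run_crashed)  # True
print(idempotent_ok)       # True
```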
Interview Questions on This Topic
- Q: What's the difference between os.path.abspath() and Path.resolve() — and is there a case where they'd return different results?
- Q: If you need to recursively find all Python files in a project but skip node_modules and .git directories, would you use os.walk or pathlib.rglob, and why?
- Q: Path.rename() and Path.replace() both move files — what's the critical difference between them, and when has choosing the wrong one caused a real bug?
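For the last question, a small sketch run inside a throwaway temp directory shows the behaviour that matters: .replace() overwrites an existing target on every platform, while .rename() overwrites silently on POSIX but raises FileExistsError on Windows — the classic cross-platform trap.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    src = tmp / 'new.txt'
    dst = tmp / 'old.txt'
    src.write_text('new data')
    dst.write_text('old data')

    # .replace() is the portable "move and overwrite" — same outcome everywhere.
    # src.rename(dst) here would overwrite on POSIX but raise on Windows.
    src.replace(dst)

    moved_content = dst.read_text()
    source_gone = not src.exists()

print(moved_content)  # new data
print(source_gone)    # True — the source file was moved, not copied
```

Rule of thumb: if overwriting the destination is acceptable, use .replace(); if it should never happen, check destination.exists() first and handle the collision explicitly.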
Frequently Asked Questions
Should I use os.path or pathlib for new Python projects?
Use pathlib for all new code. It's been the recommended approach since Python 3.6 and produces cleaner, more readable code. The only time you still reach for os directly is for environment variables (os.environ), process utilities, or when maintaining older codebases that already use os.path throughout.
How do I convert between a pathlib Path object and a plain string?
Wrap the Path in str(): str(Path('data/file.csv')) gives you a plain string. Going the other way, just pass the string to Path(): Path('/home/user/data/file.csv'). Most modern Python libraries like open(), pandas.read_csv(), and json.load() accept Path objects directly, so you rarely need to convert at all.
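A quick sketch of the round trip (the path itself is made up for illustration):

```python
from pathlib import Path

p = Path('data') / 'file.csv'    # hypothetical path
as_string = str(p)               # Path -> plain string
back_to_path = Path(as_string)   # string -> Path

print(type(as_string).__name__)  # str
print(back_to_path == p)         # True — the round trip is lossless
```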
What's the difference between Path.glob() and Path.rglob()?
glob() searches only as deep as its pattern spells out — Path('data').glob('*.csv') finds CSVs directly inside data/ but not inside data/subfolders/. rglob() is recursive — Path('data').rglob('*.csv') finds every CSV file anywhere in the tree under data/; it's shorthand for glob('**/*.csv'). The 'r' stands for recursive. When in doubt, rglob is the safer choice if you're not sure how deep your files are nested.
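A self-contained sketch, building a tiny throwaway tree in a temp directory to show the difference:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_name:
    data = Path(tmp_name) / 'data'
    (data / 'sub').mkdir(parents=True)          # data/ and data/sub/
    (data / 'top.csv').write_text('a')          # directly inside data/
    (data / 'sub' / 'nested.csv').write_text('b')  # one level down

    shallow = sorted(p.name for p in data.glob('*.csv'))
    deep = sorted(p.name for p in data.rglob('*.csv'))

print(shallow)  # ['top.csv'] — glob stays at the top level
print(deep)     # ['nested.csv', 'top.csv'] — rglob descends the whole tree
```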