
Working with CSV Files in Python — Reading, Writing and Real-World Patterns

In Plain English 🔥
Imagine a spreadsheet full of student grades — rows of names, scores, and subjects — saved as a plain text file where each value is separated by a comma. That's a CSV file: Comma-Separated Values. Python's csv module is the tool that lets your program open that file, read each row like a line in a notebook, and write new rows like filling in a form. No fancy Excel app needed — just Python and a text file.

CSV files are everywhere. Your bank exports your transaction history as a CSV. Marketing teams dump campaign data into CSVs. Data scientists receive survey results as CSVs. If you write Python professionally, you will handle CSV files — probably by the end of your first week. Knowing how to do it correctly, not just barely, separates engineers who ship clean data pipelines from those who introduce subtle bugs that corrupt entire datasets.

The problem CSV files solve is deceptively simple: they give every program on earth a common language for tabular data. A spreadsheet created in Excel can be read by a Python script, processed, and written back out for a database to import — all without any special binary format. But that simplicity hides real complexity: what happens when a field contains a comma? What about quotes, newlines inside a cell, or different encodings from international data sources? Python's built-in csv module handles all of this — if you know how to tell it what to do.

By the end of this article you'll be able to read CSV files into clean Python data structures, write processed data back out correctly, handle the most common real-world edge cases like quoted fields and custom delimiters, and know exactly when to reach for pandas instead of the csv module. You'll also know the three mistakes that trip up even experienced developers.

Reading a CSV File the Right Way — and Why reader() Beats readlines()

A lot of developers first try to read a CSV by opening the file and calling readlines(), then splitting each line on commas. That works for five minutes — until a field contains a comma inside quotes, like a full address: '123 Main St, Apt 4'. Suddenly your split breaks the data into the wrong number of columns and your entire pipeline silently produces garbage.
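To see the failure concretely, here's a minimal sketch comparing the naive split against csv.reader on a quoted-comma line (the sample row is borrowed from the employees.csv example below):

```python
import csv
import io

line = 'Dave,"Sales, EMEA",71000'

# Naive approach: split on every comma, including the one inside the quotes
naive = line.split(',')
print(naive)    # ['Dave', '"Sales', ' EMEA"', '71000'] — 4 columns, wrong

# csv.reader respects the quoting and keeps the field intact
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)   # ['Dave', 'Sales, EMEA', '71000'] — 3 columns, correct
```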

Python's csv.reader() exists precisely to handle this. It understands the RFC 4180 standard for CSV formatting, which means it correctly parses quoted fields, escaped characters, and multi-line values. It wraps a file object and returns an iterator — so it reads one row at a time instead of loading the entire file into memory. That matters enormously when you're processing a 2GB sales export at midnight.

Always open CSV files with newline='' in the open() call. This is not optional. Without it, on Windows, the universal newline translation can corrupt rows by injecting extra blank lines. The Python docs explicitly require it, and skipping it is one of the most common silent bugs in beginner CSV code.

read_csv_basic.py · PYTHON
import csv

# GOOD: Open with newline='' as required by Python's csv docs
# This prevents Windows from mangling line endings inside quoted fields
with open('employees.csv', newline='', encoding='utf-8') as csv_file:

    # csv.reader wraps the file object — it handles quoted commas automatically
    csv_reader = csv.reader(csv_file)

    # Skip the header row so we don't process column names as data
    header = next(csv_reader)
    print(f'Columns: {header}')  # ['name', 'department', 'salary']

    # Each row is a plain Python list — nice and familiar
    for row in csv_reader:
        employee_name = row[0]
        department    = row[1]
        salary        = float(row[2])  # csv always gives strings — cast explicitly

        if salary > 70000:
            print(f'{employee_name} ({department}) earns ${salary:,.2f}')

# --- employees.csv content used above ---
# name,department,salary
# Alice,Engineering,95000
# Bob,Marketing,62000
# Carol,Engineering,78000
# Dave,"Sales, EMEA",71000   <-- comma inside quotes handled perfectly
▶ Output
Columns: ['name', 'department', 'salary']
Alice (Engineering) earns $95,000.00
Carol (Engineering) earns $78,000.00
Dave (Sales, EMEA) earns $71,000.00
⚠️
Watch Out: csv always gives you strings
Every value from csv.reader comes back as a string — even numbers. If you do math on salary without casting it first, Python won't necessarily error; it may just concatenate strings instead of adding numbers. Always cast to int() or float() explicitly at the point you read the value.
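A tiny illustration of the pitfall. Both operands are strings exactly as the csv module would return them, so + concatenates silently instead of adding:

```python
# Values exactly as csv.reader would hand them back: strings, not numbers
base_salary = '1200'
bonus = '300'

print(base_salary + bonus)                # '1200300' — silent concatenation, no error
print(float(base_salary) + float(bonus))  # 1500.0 — cast first, then do arithmetic
```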

DictReader — When Column Names Matter More Than Position

Accessing row data by index (row[0], row[1]) is fragile. If someone adds a column to the CSV, every index after the insertion point is now wrong. This is the kind of bug that only appears in production, at 2am, when someone sends a 'slightly updated' file.

csv.DictReader solves this by using the header row as keys, giving you each row as a regular dict (an OrderedDict on Python versions before 3.8). Instead of row[2], you write row['salary']. Your code now describes intent, not position. Column order changes become irrelevant.

DictReader also lets you supply fieldnames manually if the CSV has no header row — a situation you'll hit often with legacy data exports. If the header is already present in the file, DictReader reads and discards it automatically. If you supply fieldnames and the file also has a header row, the first row gets treated as data, which is a common gotcha worth knowing about.
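Here's a minimal sketch of that headerless case, using io.StringIO as a stand-in for a real file; the column names and rows are hypothetical:

```python
import csv
import io

# A legacy export with no header row — column names are known from the spec
raw_export = 'E-001,Alice,Engineering\nE-002,Bob,Marketing\n'

reader = csv.DictReader(
    io.StringIO(raw_export),
    fieldnames=['employee_id', 'name', 'department']  # supplied manually
)

for row in reader:
    # The first line is treated as data, not a header, because we passed fieldnames
    print(row['employee_id'], '->', row['name'])
```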

read_csv_dictreader.py · PYTHON
import csv
from collections import defaultdict

# DictReader: each row becomes a dict — robust against column reordering
with open('sales_q4.csv', newline='', encoding='utf-8') as csv_file:
    reader = csv.DictReader(csv_file)

    # Aggregate total sales per region — a real-world reporting task
    regional_totals = defaultdict(float)

    for row in reader:
        region      = row['region']           # access by name, not fragile index
        sale_amount = float(row['amount'])    # explicit cast from string
        regional_totals[region] += sale_amount

# Print a simple summary report
print('=== Q4 Sales by Region ===')
for region, total in sorted(regional_totals.items(), key=lambda item: item[1], reverse=True):
    print(f'{region:<15} ${total:>10,.2f}')

# --- sales_q4.csv content ---
# region,rep,amount
# North,Alice,12400.50
# South,Bob,8750.00
# North,Carol,9300.75
# West,Dave,15200.00
# South,Eve,6100.25
▶ Output
=== Q4 Sales by Region ===
North           $ 21,701.25
West            $ 15,200.00
South           $ 14,850.25
⚠️
Pro Tip: Use DictReader by default
Make DictReader your default choice for reading CSV files, not csv.reader. The tiny overhead is worth the resilience. Reserve csv.reader for performance-critical loops processing millions of rows where even dict lookup overhead matters — and profile first before optimising.

Writing CSV Files Correctly — Avoiding the Encoding and Quoting Traps

Writing CSV is where most bugs hide. The two most dangerous: wrong newline handling that produces double-spaced files on Windows, and hand-rolled string formatting that lets commas inside values silently corrupt the output file for whoever opens it next.

csv.writer and csv.DictWriter handle both automatically — if you let them. The writer decides when to quote a field based on the quoting parameter, defaulting to QUOTE_MINIMAL, which quotes any field that contains the delimiter, a quotechar, or a line terminator. You can override this to QUOTE_ALL if you're sending data to a system that expects all fields quoted.
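A quick sketch of the difference between the two quoting modes, writing to an in-memory buffer so no file is involved (the record values echo the orders example below):

```python
import csv
import io

record = ['ORD-001', 'Mouse, USB', 29.95]

for quoting, label in [(csv.QUOTE_MINIMAL, 'QUOTE_MINIMAL'),
                       (csv.QUOTE_ALL, 'QUOTE_ALL')]:
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(record)
    print(f'{label}: {buf.getvalue()!r}')

# QUOTE_MINIMAL: 'ORD-001,"Mouse, USB",29.95\r\n'       — only the comma field quoted
# QUOTE_ALL:     '"ORD-001","Mouse, USB","29.95"\r\n'   — every field quoted
```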

DictWriter is the mirror image of DictReader: you define the fieldnames once, write the header with writeheader(), then pass plain dicts for each row. This is the pattern used in real ETL pipelines — transform your data into clean dicts, then dump them all at the end. It keeps your transformation logic completely separate from your file-writing logic.

write_csv_dictwriter.py · PYTHON
import csv
import datetime

# Simulated processed data — imagine this came from a database query or API call
processed_orders = [
    {'order_id': 'ORD-001', 'customer': 'Alice',       'product': 'Laptop',        'total': 1299.99, 'status': 'shipped'},
    {'order_id': 'ORD-002', 'customer': 'Bob',         'product': 'Mouse, USB',    'total': 29.95,   'status': 'pending'},  # comma in product!
    {'order_id': 'ORD-003', 'customer': 'Carol',       'product': 'Keyboard',      'total': 89.00,   'status': 'shipped'},
    {'order_id': 'ORD-004', 'customer': 'Dave',        'product': 'Monitor 27"',   'total': 449.50,  'status': 'cancelled'},
]

output_filename = f'orders_export_{datetime.date.today()}.csv'
fieldnames = ['order_id', 'customer', 'product', 'total', 'status', 'exported_at']

# IMPORTANT: newline='' prevents double line breaks on Windows
# encoding='utf-8-sig' adds a BOM — makes Excel open the file correctly without garbled chars
with open(output_filename, mode='w', newline='', encoding='utf-8-sig') as csv_file:
    writer = csv.DictWriter(
        csv_file,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_MINIMAL  # only quote fields that need it
    )

    writer.writeheader()  # writes: order_id,customer,product,total,status,exported_at

    for order in processed_orders:
        # Add a computed field before writing — mix your logic here cleanly
        order['exported_at'] = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        writer.writerow(order)

print(f'Exported {len(processed_orders)} orders to {output_filename}')

# --- Resulting file content (open in a text editor to verify) ---
# order_id,customer,product,total,status,exported_at
# ORD-001,Alice,Laptop,1299.99,shipped,2024-01-15 09:30:00
# ORD-002,Bob,"Mouse, USB",29.95,pending,2024-01-15 09:30:00   <-- auto-quoted!
# ORD-003,Carol,Keyboard,89.0,shipped,2024-01-15 09:30:00
# ORD-004,Dave,"Monitor 27""",449.5,cancelled,2024-01-15 09:30:00  <-- quote escaped too
▶ Output
Exported 4 orders to orders_export_2024-01-15.csv
🔥
Why utf-8-sig for Excel?
Plain utf-8 files opened directly in Excel on Windows often show garbled characters for anything outside ASCII — accented names, currency symbols, Chinese characters. The '-sig' variant adds a Byte Order Mark (BOM) at the start of the file, which signals to Excel that the file is UTF-8 encoded. It's invisible to Python and every other modern tool but saves hours of 'why does my CSV look broken in Excel' debugging.

When to Drop the csv Module and Use pandas Instead

The csv module is perfect when you need lightweight, dependency-free file processing — a Lambda function, a CLI tool, a simple ETL script. But it has a hard ceiling. You're responsible for type casting every field, filtering rows with if statements, and aggregating with manual loops. For data analysis work, that's reinventing the wheel.

pandas.read_csv() gives you a DataFrame in one line. Column types are inferred automatically (though you should always verify them). Filtering, grouping, merging with other datasets, handling missing values — all built in. The tradeoff is a 20MB dependency and a slight startup cost. Worth it for analysis; overkill for a cron job that just reformats a file.

Know which tool to reach for. Use the csv module when you're writing production infrastructure that processes one file at a time and dependencies are a constraint. Use pandas when you're doing any kind of data exploration, transformation across multiple columns, or operations that would require more than 20 lines of csv module code.

csv_vs_pandas.py · PYTHON
import pandas as pd

# --- THE PANDAS WAY: read, filter, aggregate, export in ~10 lines ---

# dtype lets you be explicit about columns — never trust auto-inference for IDs or codes
employees_df = pd.read_csv(
    'employees.csv',
    dtype={'employee_id': str},  # prevent 00123 becoming 123
    parse_dates=['start_date'],  # auto-parse date columns
    encoding='utf-8'
)

# Filter to Engineering department earning above median salary
eng_df = employees_df[
    (employees_df['department'] == 'Engineering') &
    (employees_df['salary'] > employees_df['salary'].median())
]

# Group and summarise — try doing this cleanly with just the csv module
dept_summary = employees_df.groupby('department').agg(
    headcount=('employee_id', 'count'),
    avg_salary=('salary', 'mean'),
    total_payroll=('salary', 'sum')
).round(2).reset_index()

print(dept_summary.to_string(index=False))
print(f'\nEngineers above median: {len(eng_df)}')

# Write results back to CSV — index=False prevents pandas adding a row number column
dept_summary.to_csv('dept_summary_report.csv', index=False, encoding='utf-8-sig')
print('Summary report written.')

# --- Output assumes employees.csv has: Engineering x3, Marketing x2, Design x1 ---
▶ Output
 department  headcount  avg_salary  total_payroll
     Design          1    68000.00       68000.00
Engineering          3    87333.33      262000.00
  Marketing          2    64500.00      129000.00

Engineers above median: 2
Summary report written.
⚠️
Pro Tip: Always use index=False when writing CSVs with pandas
By default, DataFrame.to_csv() writes the row index (0, 1, 2...) as the first column. The person receiving that file will have an unnamed column of integers they never asked for. Always pass index=False unless you explicitly want the index saved — which is almost never.
Feature / Aspect    | csv module (built-in)            | pandas.read_csv()
Dependencies        | None — stdlib only               | Requires pandas (~20MB install)
Memory usage        | Row-by-row streaming — very low  | Loads entire file into RAM
Type inference      | None — everything is a string    | Automatic (verify with dtypes)
Filtering rows      | Manual if statements             | Boolean indexing in one line
Aggregation         | Manual loops with dicts          | groupby().agg() — built-in
Date parsing        | Manual strptime() calls          | parse_dates=['col'] parameter
Best for            | ETL scripts, CLIs, serverless    | Data analysis, reporting, EDA
Large files (>1GB)  | Excellent — streams row-by-row   | Needs chunking: chunksize param
Excel compatibility | utf-8-sig encoding trick         | Handled automatically
Error on bad rows   | Raises csv.Error                 | Configurable: on_bad_lines param

🎯 Key Takeaways

  • Always open CSV files with newline='' and an explicit encoding — skipping either causes silent, hard-to-debug data corruption on Windows and with non-ASCII characters.
  • Use DictReader over csv.reader by default — accessing columns by name makes your code resilient to column reordering, which happens constantly with real-world data sources.
  • Every value from the csv module is a string — cast to int() or float() explicitly at read time, never assume type, and never do arithmetic without casting first.
  • Reach for pandas when you need aggregation, filtering, or type inference across multiple columns; stick with the csv module for lightweight, dependency-free production scripts where every import matters.

⚠ Common Mistakes to Avoid

  • Mistake 1: Opening the file without newline='' — On Windows, text mode's newline translation interferes with the csv module's own handling: on write, the \r\n line terminator the writer emits gets translated into \r\r\n (an extra blank line per row), and on read, newlines embedded inside quoted fields can be mistranslated. The symptom is extra blank rows in your output or rows being split incorrectly. The fix is always: open('file.csv', newline='', encoding='utf-8') — no exceptions.
  • Mistake 2: Treating csv values as the correct type — Every value from csv.reader or DictReader is a string, even '42' and '3.14'. The silent symptom is string concatenation instead of arithmetic: '1200' + '300' gives '1200300', not 1500. The fix is to cast explicitly at the point of reading: salary = float(row['salary']) — and do it consistently, not lazily.
  • Mistake 3: Using pandas to_csv() without index=False — The symptom is the recipient receiving a CSV with a mysterious unnamed first column full of integers (0, 1, 2...). This corrupts imports into databases and confuses anyone who opens it in Excel. The fix: always write df.to_csv('output.csv', index=False) unless you have a deliberate reason to include the index.

Interview Questions on This Topic

  • Q: What's the difference between csv.reader and csv.DictReader, and when would you choose one over the other in a production ETL pipeline?
  • Q: Why does Python's documentation explicitly require you to pass newline='' when opening a CSV file, and what actually goes wrong if you forget it on Windows?
  • Q: If you receive a 4GB CSV file that won't fit in memory, how would you process it in Python — both with the csv module and with pandas?
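As a sketch of an answer to the last question, both approaches below keep memory roughly constant regardless of file size. The file path and the 'amount' column name are illustrative assumptions, not from a real dataset:

```python
import csv

def total_with_csv(path):
    """Stream row-by-row with the stdlib — memory stays flat at any file size."""
    total = 0.0
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            total += float(row['amount'])
    return total

def total_with_pandas(path, chunksize=100_000):
    """Read in chunks so only `chunksize` rows are ever in RAM (requires pandas)."""
    import pandas as pd
    total = 0.0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += chunk['amount'].sum()
    return total
```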

Frequently Asked Questions

How do I read a CSV file in Python without pandas?

Use Python's built-in csv module. Open the file with open('file.csv', newline='', encoding='utf-8') and wrap it with csv.DictReader() to get each row as a dictionary keyed by the column headers. This requires no external dependencies and streams the file row-by-row, making it memory-efficient for large files.

Why does my CSV have blank lines between every row when I write it in Python?

This is a Windows-specific issue caused by not passing newline='' to the open() call. Python's text mode translates newlines, and the csv module writes its own \r\n line endings — so each row ends up terminated by \r\r\n, which displays as a blank line between rows. The fix is always open('output.csv', mode='w', newline='', encoding='utf-8').

How do I handle a CSV where fields contain commas — like addresses or product names?

You don't have to do anything special — Python's csv module handles this automatically. Fields containing commas are wrapped in double-quotes in the CSV file itself (e.g., '"123 Main St, Apt 4"'). csv.reader and DictReader parse these correctly by default, following the RFC 4180 standard. The bug happens when people use split(',') instead of the csv module.

🔥
TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
