Skip to content
Home Database Cypher Query Language Basics

Cypher Query Language Basics

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Neo4j → Topic 2 of 3
A comprehensive guide to Cypher Query Language (CQL) — master the basics of pattern matching, graph traversal, and Neo4j data manipulation.
🧑‍💻 Beginner-friendly — no prior Database experience needed
In this tutorial, you'll learn
A comprehensive guide to Cypher Query Language (CQL) — master the basics of pattern matching, graph traversal, and Neo4j data manipulation.
  • Cypher is a pattern-based query language — you describe the SHAPE of your data using ASCII art, and the engine finds every instance of that shape in the graph. Think in nodes and paths, not tables and joins.
  • CREATE always creates new nodes, even if duplicates exist. MERGE checks for existence first and creates only if missing. Use MERGE for entities, CREATE for events.
  • Always use labels in your MATCH patterns — matching without a label forces a full database scan (O(n)), while matching with a label uses the index (O(log n)).
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer

Imagine you're trying to find 'friends of friends who live in London and like jazz' in a spreadsheet. You'd need multiple VLOOKUPs, joins, and temp tables — and it'd take 20 minutes to write. In a graph database, you draw the pattern: (me)-[:FRIEND]->(friend)-[:FRIEND]->(fof) WHERE fof.city = 'London' AND fof.likes = 'Jazz'. The database walks the graph like following breadcrumbs — it doesn't scan tables, it hops from node to node. That's Cypher: a language that lets you describe the shape of the data you want, and the engine finds every instance of that shape in the graph. The key difference from SQL: in SQL, relationships are inferred through foreign keys and reconstructed with joins at query time. In Cypher, relationships are first-class citizens — they're stored directly on disk as physical connections between nodes. When you traverse from one node to another, you're following a pointer, not scanning a join table. This is why a 6-level deep friend-of-friend query in SQL takes minutes but in Cypher takes milliseconds.

A recommendation engine I built for an e-commerce platform was generating 'customers who bought X also bought Y' queries using SQL. With 2 million products and 50 million purchase records, the query took 45 seconds — too slow for real-time recommendations on the product page. We rewrote the data layer in Neo4j with Cypher. Same logic, same data. The query dropped to 12 milliseconds. Not 12 seconds — 12 milliseconds. The difference wasn't better hardware or clever caching. It was that the graph stored relationships as direct pointers, so finding 'people who bought this product and what else they bought' was a two-hop traversal instead of a three-table join with a GROUP BY on 50 million rows.

Cypher is the query language for Neo4j — the most widely-used graph database. It's declarative (you describe what you want, not how to get it), pattern-based (you draw the shape of your data using ASCII art), and designed specifically for traversing relationships. If you've written SQL, the mental model shift is: stop thinking in tables and joins, start thinking in nodes and paths.

This guide covers every clause you'll use daily — CREATE, MATCH, WHERE, RETURN, SET, DELETE, MERGE, WITH, UNWIND, and CALL — with production examples from real systems. We'll also cover variable-length paths, aggregation, subqueries, indexing, and the six mistakes that cause the most pain in production.

Cypher Syntax Fundamentals: Nodes, Relationships, and Patterns

Before writing any query, you need to understand the three building blocks: nodes, relationships, and properties. Everything in Cypher is built on these.

Nodes represent entities — people, products, cities, orders. Written as (variable:Label {properties}). The parentheses look like circles (nodes in a graph). The label is a type tag (like a table name). Properties are key-value pairs stored on the node.

Relationships connect nodes — FRIEND, PURCHASED, LOCATED_IN. Written as -[variable:TYPE {properties}]->. The arrow indicates direction. Relationships always have exactly one type and exactly one direction. They can also store properties (like a 'since' date on a FRIEND relationship, or a 'quantity' on a PURCHASED relationship).

Patterns combine nodes and relationships into shapes: (a:Person)-[:FRIEND]->(b:Person). This is the core of Cypher — you write a pattern that describes the shape of data you want, and the engine finds every instance of that pattern in the graph.

Variables bind to nodes or relationships so you can reference them later in WHERE, RETURN, or WITH. If you don't need to reference something, you can leave it anonymous: (a)-[:FRIEND]->() finds all people that 'a' is friends with, without naming the destination.

Three rules that save you hours of debugging: 1. Node labels start with uppercase: :Person, :Product 2. Relationship types start with uppercase: :FRIEND, :PURCHASED 3. Property keys start with lowercase: name, email, created_at

io/thecodeforge/graph/SyntaxFundamentals.cypher · CYPHER
12345678910111213141516171819202122232425262728
// io.thecodeforge: Node syntax — (variable:Label {key: value})
// This creates a node with label 'Person' and two properties
CREATE (alice:Person {name: 'Alice', age: 32, city: 'London'})
RETURN alice;

// io.thecodeforge: Relationship syntax — -[variable:TYPE {key: value}]->
// This creates a directed relationship with a property
CREATE (bob:Person {name: 'Bob', age: 28, city: 'Paris'})
CREATE (alice)-[r:FRIEND {since: date('2019-06-15'), context: 'university'}]->(bob)
RETURN alice, r, bob;

// io.thecodeforge: Pattern — the fundamental unit of Cypher queries
// This pattern matches any Person who is FRIENDs with any other Person
MATCH (p1:Person)-[:FRIEND]->(p2:Person)
RETURN p1.name AS person, p2.name AS friend;

// io.thecodeforge: Anonymous nodes — when you don't need to reference them
// Finds all people Alice is friends with, without naming the relationship
MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(friend)
RETURN friend.name, friend.city;

// io.thecodeforge: Multiple labels — a node can have more than one
CREATE (charlie:Person:Developer {name: 'Charlie', language: 'Java'})
RETURN charlie;

// io.thecodeforge: Matching multiple labels — find all Developer-People
MATCH (dev:Person:Developer)
RETURN dev.name, dev.language;
▶ Output
╒══════════╤═══════════╤═══════════════╕
│"person" │"friend" │"friend.city" │
╞══════════╪═══════════╪═══════════════╡
│"Alice" │"Bob" │"Paris" │
└──────────┴───────────┴───────────────┘
💡Think in Shapes, Not Tables:
The single biggest mental shift from SQL to Cypher: stop thinking in 'what columns do I need?' and start thinking in 'what shape does my data look like?' If you can draw the pattern on a whiteboard, you can write it in Cypher. (a:Person)-[:BOUGHT]->(p:Product)<-[:BOUGHT]-(b:Person) reads exactly like what it finds: two people who bought the same product.

CREATE: Building Nodes and Relationships

CREATE is how you insert data into the graph. It always creates new nodes and relationships — even if duplicates already exist. This is the key difference from MERGE (covered later), which checks for existence first.

You can create single nodes, multiple nodes, nodes with relationships in one statement, and even entire subgraphs. The most common production pattern: create a node, create a relationship to an existing node, return what you created. This is how you'd model a new user signing up and connecting to an existing organization.

Properties can be any JSON-compatible type: strings, numbers, booleans, lists, dates, and temporal types. Neo4j stores them as key-value pairs on the node or relationship — there's no schema enforcement, which makes it easy to evolve your data model without migrations.

One gotcha: CREATE always creates. If you run the same CREATE statement twice, you get duplicate nodes. Use MERGE when you need idempotency (creating a node only if it doesn't already exist).

io/thecodeforge/graph/CreateNodes.cypher · CYPHER
123456789101112131415161718192021222324252627282930313233343536373839
// io.thecodeforge: Create a single node with properties
CREATE (order:Order {
  order_id: 'ORD-2024-001',
  total: 149.99,
  currency: 'GBP',
  status: 'pending',
  created_at: datetime()
})
RETURN order;

// io.thecodeforge: Create nodes and a relationship in one statement
// This is the most common production pattern — create entity + link to existing
CREATE (user:User {name: 'Dana', email: 'dana@thecodeforge.io'})
WITH user
MATCH (org:Organization {name: 'TheCodeForge'})
CREATE (user)-[rel:MEMBER_OF {role: 'engineer', joined: date('2024-01-15')}]->(org)
RETURN user.name, rel.role, org.name;

// io.thecodeforge: Create multiple nodes at once
CREATE
  (london:City {name: 'London', country: 'UK', population: 8982000}),
  (paris:City {name: 'Paris', country: 'France', population: 2161000}),
  (berlin:City {name: 'Berlin', country: 'Germany', population: 3645000});

// io.thecodeforge: Create an entire subgraph in one statement
// Product with variants, categories, and supplier — all linked
CREATE (p:Product {sku: 'WIDGET-PRO', name: 'Widget Pro', price: 29.99})
CREATE (cat:Category {name: 'Electronics'})
CREATE (sup:Supplier {name: 'AcmeCorp', country: 'China'})
CREATE (p)-[:IN_CATEGORY]->(cat)
CREATE (p)-[:SUPPLIED_BY {lead_time_days: 14}]->(sup)
RETURN p.name, cat.name, sup.name;

// io.thecodeforge: Create a relationship between EXISTING nodes
// This is how you link entities after initial creation
MATCH (alice:Person {name: 'Alice'})
MATCH (bob:Person {name: 'Bob'})
CREATE (alice)-[:FRIEND {since: date('2023-03-10')}]->(bob)
RETURN 'Friendship created' AS result;
▶ Output
╒════════════════╤══════════╤═══════════════════╕
│"user.name" │"rel.role"│"org.name" │
╞════════════════╪══════════╪═══════════════════╡
│"Dana" │"engineer"│"TheCodeForge" │
└────────────────┴──────────┴───────────────────┘
⚠ CREATE Always Creates — Even Duplicates:
Running the same CREATE statement twice creates duplicate nodes. In production, this corrupts your graph — you'll have two 'Alice' nodes with no way to tell which is real. Use MERGE when you need idempotency. Rule of thumb: use CREATE for event-based data (orders, log entries) where duplicates are impossible by design. Use MERGE for entity-based data (users, products) where the same entity might be referenced multiple times.

MATCH and WHERE: Reading and Filtering Data

MATCH is the read operation in Cypher — equivalent to SELECT in SQL. You write a pattern, and the engine finds every instance of that pattern in the graph. MATCH alone returns everything matching the pattern. WHERE filters the results.

The WHERE clause supports the full set of comparison and logical operators: =, <>, <, >, <=, >=, AND, OR, NOT, IN, STARTS WITH, ENDS WITH, CONTAINS, IS NULL, IS NOT NULL, and regular expressions (=~ 'regex').

Performance note: always start your MATCH from a specific node, not an anonymous scan. (user:User {email: 'x'})-[:ORDERED]->(order) is fast because it starts from the User index. (n)-[:ORDERED]->(order) without a starting point forces a full graph scan.

Multiple MATCH clauses act like joins — each MATCH finds a pattern, and Cypher joins them on shared variables. If you want an outer join effect, use OPTIONAL MATCH (returns null when the pattern isn't found).

io/thecodeforge/graph/MatchAndFilter.cypher · CYPHER
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455
// io.thecodeforge: Basic MATCH — find all developers
MATCH (dev:Developer)
RETURN dev.name, dev.language, dev.years_experience;

// io.thecodeforge: MATCH with WHERE — filtering on properties
MATCH (dev:Developer)
WHERE dev.language = 'Java' AND dev.years_experience >= 5
RETURN dev.name AS senior_java_dev;

// io.thecodeforge: Inline property matching — shorthand for simple WHERE
// Equivalent to the query above but more concise
MATCH (dev:Developer {language: 'Java'})
WHERE dev.years_experience >= 5
RETURN dev.name AS senior_java_dev;

// io.thecodeforge: String matching — STARTS WITH, ENDS WITH, CONTAINS
MATCH (user:User)
WHERE user.email ENDS WITH '@thecodeforge.io'
  AND user.name STARTS WITH 'A'
RETURN user.name, user.email;

// io.thecodeforge: IN operator — matching against a list
MATCH (dev:Developer)
WHERE dev.language IN ['Java', 'Kotlin', 'Scala']
RETURN dev.name, dev.language
ORDER BY dev.language;

// io.thecodeforge: IS NULL / IS NOT NULL — checking for missing properties
MATCH (order:Order)
WHERE order.shipped_at IS NULL
  AND order.status = 'confirmed'
RETURN order.order_id, order.total
ORDER BY order.created_at DESC
LIMIT 10;

// io.thecodeforge: Regular expression matching
MATCH (product:Product)
WHERE product.sku =~ 'WIDGET-.*-PRO'
RETURN product.name, product.sku;

// io.thecodeforge: OPTIONAL MATCH — like a LEFT JOIN in SQL
// Returns users even if they have no orders
MATCH (user:User)
OPTIONAL MATCH (user)-[:PLACED]->(order:Order)
RETURN user.name,
       count(order) AS order_count,
       CASE WHEN order IS NULL THEN 'no orders' ELSE 'has orders' END AS status;

// io.thecodeforge: Multiple MATCH clauses — chained traversal
// Find developers who work on projects in a specific category
MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project)
MATCH (proj)-[:IN_CATEGORY]->(cat:Category {name: 'Backend'})
WHERE dev.years_experience >= 3
RETURN dev.name, proj.name, cat.name
ORDER BY dev.years_experience DESC;
▶ Output
╒═══════════════════╤══════════════╤═══════════╕
│"dev.name" │"proj.name" │"cat.name" │
╞═══════════════════╪══════════════╪═══════════╡
│"Alice" │"ForgeAPI" │"Backend" │
│"Charlie" │"DataPipeline"│"Backend" │
└───────────────────┴──────────────┴───────────┘
🔥Start Your MATCH From an Index:
Cypher's query planner looks for the most selective starting point. If you MATCH (dev:Developer {email: 'x'})-[:WORKS_ON]->(proj), it starts from the email index (O(log n)) and traverses outward. If you MATCH (dev)-[:WORKS_ON]->(proj:Project {name: 'x'}), it starts from the project index. But if you MATCH (dev)-[:WORKS_ON]->(proj) with no indexed starting point, it scans every developer in the graph. Always anchor your pattern on a node you can index.

RETURN, ORDER BY, LIMIT, and DISTINCT: Shaping Your Results

RETURN defines what the query gives back — equivalent to SELECT in SQL. You can return nodes, relationships, properties, computed values, and aggregations. ORDER BY sorts results (ascending by default, use DESC for descending). LIMIT caps the result set. DISTINCT removes duplicates.

You can alias results with AS for cleaner output. You can use CASE expressions for conditional logic. You can call functions like toString(), toInteger(), datetime(), and coalesce() inside RETURN.

Pagination pattern: use SKIP and LIMIT together for cursor-based pagination. Don't use SKIP for large offsets (skip 10000 rows is slow) — use a WHERE clause on the sort key instead (WHERE user.created_at < last_seen_date).

io/thecodeforge/graph/ReturnAndSort.cypher · CYPHER
123456789101112131415161718192021222324252627282930313233343536373839404142434445
// io.thecodeforge: RETURN with aliases and computed values
MATCH (dev:Developer)
RETURN dev.name AS developer,
       dev.language AS primary_language,
       dev.years_experience AS experience,
       CASE
         WHEN dev.years_experience >= 10 THEN 'Principal'
         WHEN dev.years_experience >= 5 THEN 'Senior'
         WHEN dev.years_experience >= 2 THEN 'Mid'
         ELSE 'Junior'
       END AS level
ORDER BY dev.years_experience DESC;

// io.thecodeforge: DISTINCT — remove duplicate results
MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project)
RETURN DISTINCT dev.name AS developer
ORDER BY dev.name;

// io.thecodeforge: Pagination with SKIP and LIMIT
// Page 1: first 10 results
MATCH (user:User)
RETURN user.name, user.email, user.created_at
ORDER BY user.created_at DESC
LIMIT 10;

// Page 2: skip first 10, return next 10
MATCH (user:User)
RETURN user.name, user.email, user.created_at
ORDER BY user.created_at DESC
SKIP 10 LIMIT 10;

// io.thecodeforge: RETURN relationships and their properties
MATCH (a:Person)-[r:FRIEND]->(b:Person)
RETURN a.name AS person,
       b.name AS friend,
       r.since AS friends_since,
       duration.between(r.since, date()).years AS years_of_friendship
ORDER BY r.since ASC;

// io.thecodeforge: Coalesce — provide defaults for missing properties
MATCH (product:Product)
RETURN product.name,
       coalesce(product.description, 'No description available') AS description,
       coalesce(product.rating, 0) AS rating
ORDER BY product.rating DESC;
▶ Output
╒══════════════╤══════════════════╤════════════╤═════════╕
│"developer" │"primary_language"│"experience"│"level" │
╞══════════════╪══════════════════╪════════════╪═════════╡
│"Alice" │"Java" │12 │Principal│
│"Charlie" │"Java" │8 │Senior │
│"Dana" │"Python" │4 │Mid │
└──────────────┴──────────────────┴────────────┴─────────┘
💡Paginate With WHERE, Not SKIP:
SKIP 10000 LIMIT 10 forces the engine to process and discard 10,000 rows before returning 10. For large datasets, use keyset pagination instead: WHERE user.created_at < $last_seen_date ORDER BY user.created_at DESC LIMIT 10. This lets the engine use the index to jump directly to the right position. Same result, 1000x faster on large graphs.

SET and REMOVE: Updating Nodes and Relationships

SET adds or updates properties on nodes and relationships. REMOVE deletes properties or removes labels from nodes. These are the UPDATE and ALTER equivalents in Cypher.

SET can assign a single property, multiple properties at once, or copy properties from a map. It can also add labels to a node (SET n:NewLabel). REMOVE does the inverse — removes properties or labels.

Production pattern: combine MATCH + SET for targeted updates. Always filter by an indexed property (like an ID or email) to avoid scanning the entire graph. Use coalesce() with SET to preserve existing values when partial updates arrive from an API.

io/thecodeforge/graph/SetAndRemove.cypher · CYPHER
1234567891011121314151617181920212223242526272829303132333435363738394041
// io.thecodeforge: SET — update a single property
MATCH (user:User {email: 'alice@thecodeforge.io'})
SET user.last_login = datetime()
RETURN user.name, user.last_login;

// io.thecodeforge: SET — update multiple properties at once
MATCH (order:Order {order_id: 'ORD-2024-001'})
SET order.status = 'shipped',
    order.shipped_at = datetime(),
    order.tracking_number = 'TRACK-99821'
RETURN order.order_id, order.status, order.tracking_number;

// io.thecodeforge: SET — update from a map (common in REST API handlers)
// The map comes from the request body in production
WITH {status: 'delivered', delivered_at: '2024-03-15T14:30:00Z'} AS updates
MATCH (order:Order {order_id: 'ORD-2024-001'})
SET order += updates  // += merges the map into existing properties
RETURN order.order_id, order.status, order.delivered_at;

// io.thecodeforge: SET — add a label to an existing node
MATCH (user:User {email: 'alice@thecodeforge.io'})
SET user:Admin
RETURN labels(user) AS labels;  // ['User', 'Admin']

// io.thecodeforge: REMOVE — delete a property
MATCH (user:User {email: 'alice@thecodeforge.io'})
REMOVE user.temporary_token
RETURN user.name, user.temporary_token;  // null — property is gone

// io.thecodeforge: REMOVE — remove a label from a node
MATCH (user:User {email: 'alice@thecodeforge.io'})
REMOVE user:Admin
RETURN labels(user) AS labels;  // ['User'] — Admin label removed

// io.thecodeforge: Production pattern — partial update with coalesce
// Preserve existing rating if the update doesn't include one
MATCH (product:Product {sku: 'WIDGET-PRO'})
SET product.name = coalesce($update_name, product.name),
    product.price = coalesce($update_price, product.price),
    product.updated_at = datetime()
RETURN product.name, product.price;
▶ Output
╒══════════════╤══════════╤═════════════════════════╕
│"order.order_id"│"order.status"│"order.tracking_number"│
╞══════════════╪══════════╪═════════════════════════╡
│"ORD-2024-001"│"shipped" │"TRACK-99821" │
└──────────────┴──────────┴─────────────────────────┘
⚠ SET += Merges, SET = Replaces:
SET user += {name: 'Alice V2'} merges the map into existing properties — other properties stay. SET user = {name: 'Alice V2'} replaces ALL properties with just the ones in the map — everything else is deleted. In a REST PATCH endpoint, use +=. In a REST PUT endpoint, use =. Getting this wrong in production silently deletes user data.

DELETE and DETACH DELETE: Removing Data

DELETE removes nodes and relationships. DETACH DELETE removes a node and ALL its relationships in one operation. This distinction matters — you cannot delete a node that still has relationships. If you try, Neo4j throws an error. DETACH DELETE bypasses this by removing relationships first, then the node.

In production, soft deletes (setting a 'deleted: true' property) are almost always better than hard deletes. Hard deletes break referential integrity in the graph — other nodes that referenced the deleted node now have dangling relationships. Soft deletes preserve the graph structure while hiding the node from queries (WHERE n.deleted IS NULL OR n.deleted = false).

When you must hard delete, always use DETACH DELETE for nodes. Using plain DELETE on a node with relationships will fail.

io/thecodeforge/graph/DeleteOperations.cypher · CYPHER
123456789101112131415161718192021222324252627282930313233343536373839
// io.thecodeforge: DELETE — remove a relationship only
// The nodes remain; only the connection is severed
MATCH (alice:Person {name: 'Alice'})-[r:FRIEND]->(bob:Person {name: 'Bob'})
DELETE r
RETURN 'Friendship removed' AS result;

// io.thecodeforge: DETACH DELETE — remove a node and ALL its relationships
// Use this when deleting an entity entirely
MATCH (user:User {email: 'spam@bot.net'})
DETACH DELETE user;

// io.thecodeforge: Soft delete — production pattern
// Mark as deleted instead of removing; preserves graph integrity
MATCH (order:Order {order_id: 'ORD-2024-999'})
SET order.deleted = true,
    order.deleted_at = datetime(),
    order.deleted_by = 'admin@thecodeforge.io'
RETURN order.order_id, order.deleted;

// io.thecodeforge: Filtering out soft-deleted nodes in queries
// Every MATCH in your app should include this filter
MATCH (order:Order)
WHERE order.deleted IS NULL
RETURN order.order_id, order.total
ORDER BY order.created_at DESC
LIMIT 20;

// io.thecodeforge: Delete all relationships of a specific type
// Useful for bulk cleanup during data migrations
MATCH ()-[r:TEMPORARY_LINK]->()
DELETE r
RETURN count(r) AS deleted_count;

// io.thecodeforge: Conditional delete — only delete if certain conditions met
MATCH (session:Session)
WHERE session.last_active < datetime() - duration('P30D')
  AND session.active = false
DETACH DELETE session
RETURN count(session) AS expired_sessions_deleted;
▶ Output
╒═══════════════════════════╕
│"expired_sessions_deleted" │
╞═══════════════════════════╡
│147 │
└───────────────────────────┘
⚠ Never DELETE a Node Without DETACH:
MATCH (n:User {id: 'x'}) DELETE n will throw 'Cannot delete node, because it still has relationships.' You must either delete the relationships first or use DETACH DELETE. In production, forgetting DETACH is the #1 cause of delete failures in batch jobs. If you're unsure, always use DETACH DELETE — removing orphaned relationships is almost always the right behavior.

MERGE: Idempotent Create-or-Find

MERGE is CREATE + MATCH in one operation. It checks if a pattern exists — if yes, it matches it; if no, it creates it. This makes MERGE the idempotent alternative to CREATE, essential for scripts that might run multiple times, ETL pipelines, and any scenario where duplicate data is dangerous.

MERGE has two companion clauses: ON CREATE SET (properties to set only when creating a new node) and ON MATCH SET (properties to set only when matching an existing node). Together, these give you upsert behavior: insert if new, update if existing.

Critical gotcha: MERGE matches the ENTIRE pattern, including relationships. If you MERGE (a)-[:FRIEND]->(b) and the nodes exist but the relationship doesn't, MERGE creates the relationship. But if only (a) exists and (b) doesn't, MERGE creates both (a) and (b) — it doesn't look up (b) by name first. This is the most common source of MERGE confusion.

io/thecodeforge/graph/MergeOperations.cypher · CYPHER
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849
// io.thecodeforge: MERGE — find or create a node
// First run: creates the node. Second run: finds it.
MERGE (p:Project {name: 'ForgeAPI'})
ON CREATE SET p.created_at = datetime(),
              p.version = '1.0',
              p.status = 'active'
ON MATCH SET p.last_accessed = datetime()
RETURN p.name, p.version, p.status;

// io.thecodeforge: MERGE — idempotent relationship creation
// Prevents duplicate FRIEND relationships between the same two people
MATCH (alice:Person {name: 'Alice'})
MATCH (bob:Person {name: 'Bob'})
MERGE (alice)-[r:FRIEND]->(bob)
ON CREATE SET r.since = date(), r.source = 'manual'
RETURN alice.name, type(r) AS relationship, bob.name, r.since;

// io.thecodeforge: MERGE with unique constraint — production pattern
// Requires: CREATE CONSTRAINT unique_email FOR (u:User) REQUIRE u.email IS UNIQUE
MERGE (user:User {email: 'dana@thecodeforge.io'})
ON CREATE SET user.name = 'Dana',
              user.created_at = datetime(),
              user.signup_source = 'organic'
ON MATCH SET user.last_login = datetime()
RETURN user.name, user.email, user.created_at;

// io.thecodeforge: MERGE on a relationship path — create if not exists
// This is the upsert pattern for connection-based data
MATCH (product:Product {sku: 'WIDGET-PRO'})
MATCH (category:Category {name: 'Electronics'})
MERGE (product)-[r:IN_CATEGORY]->(category)
ON CREATE SET r.assigned_at = datetime(),
              r.assigned_by = 'system'
RETURN product.name, type(r), category.name;

// io.thecodeforge: MERGE for ETL — batch upsert from external data
// Load a list of users and upsert each one
UNWIND [
  {email: 'alice@thecodeforge.io', name: 'Alice', score: 95},
  {email: 'bob@thecodeforge.io', name: 'Bob', score: 87},
  {email: 'newuser@thecodeforge.io', name: 'Eve', score: 92}
] AS row
MERGE (u:User {email: row.email})
ON CREATE SET u.name = row.name,
              u.score = row.score,
              u.first_seen = datetime()
ON MATCH SET u.score = row.score,
             u.last_updated = datetime()
RETURN u.email, u.name, u.score;
▶ Output
╒══════════════════════════╤══════════╤═════════╤═════════════════════╕
│"u.email" │"u.name" │"u.score"│"u.first_seen" │
╞══════════════════════════╪══════════╪═════════╪═════════════════════╡
│"alice@thecodeforge.io" │"Alice" │95 │2024-01-10T08:00:00Z │
│"bob@thecodeforge.io" │"Bob" │87 │2024-01-10T08:00:01Z │
│"newuser@thecodeforge.io" │"Eve" │92 │2024-03-15T14:30:00Z │
└──────────────────────────┴──────────┴─────────┴─────────────────────┘
⚠ MERGE Creates What Doesn't Exist — Be Specific:
MERGE (a:Person {name: 'Alice'})-[:FRIEND]->(b:Person {name: 'Bob'}) checks if the ENTIRE pattern exists. If Alice exists but Bob doesn't, MERGE creates a NEW Bob node — it doesn't look up Bob by name in the broader graph. To avoid phantom nodes, MATCH the nodes first, then MERGE the relationship: MATCH (a:Person {name: 'Alice'}) MATCH (b:Person {name: 'Bob'}) MERGE (a)-[:FRIEND]->(b).

WITH: Chaining Subqueries and Aggregation Pipelines

WITH is the pipe operator of Cypher — it passes results from one part of the query to the next. Think of it as the equivalent of a subquery boundary or a GROUP BY + HAVING in SQL. Every clause after WITH operates on the results of the clause before it.

WITH has three critical uses: 1. Aggregation pipelines — aggregate data, then filter or sort on the aggregated result. In SQL, you'd use HAVING. In Cypher, you aggregate in one WITH, then filter in the next WHERE. 2. Query chaining — break a complex query into readable stages. Each WITH acts as a checkpoint. 3. Limiting scope — pass only specific variables forward, which the engine uses to optimize the query plan.

The rule: any variable not listed in the WITH clause is no longer available downstream. This is deliberate — it lets the query planner know exactly what data flows forward, enabling optimizations.

io/thecodeforge/graph/WithClause.cypher · CYPHER
12345678910111213141516171819202122232425262728293031323334353637383940414243444546
// io.thecodeforge: WITH for aggregation — find prolific contributors
// Count projects per developer, then filter on the count
MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project)
WITH dev, count(proj) AS project_count
WHERE project_count >= 3
RETURN dev.name AS developer, project_count
ORDER BY project_count DESC;

// io.thecodeforge: WITH for query chaining — multi-stage filtering
// Stage 1: Find active projects
// Stage 2: Find developers on those projects
// Stage 3: Aggregate and filter
MATCH (proj:Project {status: 'active'})
WITH proj
MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj)
WITH dev, collect(proj.name) AS projects, count(proj) AS num_projects
WHERE num_projects >= 2
RETURN dev.name, projects, num_projects
ORDER BY num_projects DESC;

// io.thecodeforge: WITH with LIMIT — process in batches
// Find the 100 most recent users, then do expensive operations on them
MATCH (user:User)
WITH user
ORDER BY user.created_at DESC
LIMIT 100
MATCH (user)-[:PLACED]->(order:Order)
WITH user, count(order) AS order_count, sum(order.total) AS total_spent
WHERE total_spent > 500
RETURN user.name, user.email, order_count, total_spent
ORDER BY total_spent DESC;

// io.thecodeforge: WITH as a scope boundary — only pass what you need
MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project)
WITH dev.name AS name, dev.language AS lang, count(proj) AS projects
// dev is no longer available here — only name, lang, projects
WHERE lang = 'Java'
RETURN name, lang, projects
ORDER BY projects DESC;

// io.thecodeforge: WITH + COLLECT — build lists per group
MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product)
WITH cat, collect(product.name) AS products, count(product) AS product_count
WHERE product_count >= 5
RETURN cat.name AS category, products, product_count
ORDER BY product_count DESC;
▶ Output
╒══════════════╤══════════════════════════════════════════════╤═══════════════╕
│"category" │"products" │"product_count"│
╞══════════════╪══════════════════════════════════════════════╪═══════════════╡
│"Electronics" │["Widget Pro","Gadget X","Sensor V2",...] │12 │
│"Clothing" │["Tee Basic","Hoodie Dev","Cap Logo",...] │8 │
└──────────────┴──────────────────────────────────────────────┴───────────────┘
🔥WITH Is Mandatory After Aggregation:
You cannot chain a WHERE clause directly after an aggregation like count(). The aggregation result must pass through a WITH (or RETURN) first. MATCH (dev)-[:CONTRIBUTES_TO]->(proj) WITH dev, count(proj) AS cnt WHERE cnt > 3 works. MATCH (dev)-[:CONTRIBUTES_TO]->(proj) WHERE count(proj) > 3 does NOT work — it's a syntax error. This is the Cypher equivalent of SQL's HAVING vs WHERE distinction.

Aggregation Functions: COUNT, COLLECT, SUM, AVG, and Friends

Cypher supports the standard set of aggregation functions: count(), collect(), sum(), avg(), min(), max(), percentileCont(), percentileDisc(), and stDev(). The two you'll use most are count() and collect().

count() has two forms: count(n) counts non-null values of n, count() counts all rows regardless of nulls. This distinction matters — count(user) skips rows where user is null (like from an OPTIONAL MATCH), while count() includes them.

collect() is the most powerful aggregation function in Cypher — it gathers all values into a list. This is how you build nested results: group by a category, collect the items in that category into a list, return the category with its items. It replaces SQL's GROUP_CONCAT and subqueries for building nested JSON responses.

All aggregation functions ignore null values except count(*).

io/thecodeforge/graph/AggregationFunctions.cypher · CYPHER
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849
// io.thecodeforge: COUNT — two forms matter
// count(*) counts all rows including nulls
// count(order) counts only non-null orders
MATCH (user:User)
OPTIONAL MATCH (user)-[:PLACED]->(order:Order)
RETURN user.name,
       count(*) AS total_rows,           // includes users with no orders
       count(order) AS orders_placed      // excludes null orders
ORDER BY orders_placed DESC;

// io.thecodeforge: COLLECT — build nested lists
// Group products by category, return category with its products
MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product)
WITH cat.name AS category, collect(product) AS products
RETURN category, size(products) AS count, products
ORDER BY count DESC;

// io.thecodeforge: COLLECT with property extraction — cleaner output
// Collect just the names, not full node objects
MATCH (proj:Project)<-[:CONTRIBUTES_TO]-(dev:Developer)
WITH proj.name AS project, collect(dev.name) AS contributors
RETURN project, contributors
ORDER BY size(contributors) DESC;

// io.thecodeforge: SUM and AVG — numeric aggregations
MATCH (order:Order)-[:CONTAINS]->(item:OrderItem)
WITH order,
     sum(item.quantity * item.unit_price) AS order_total
RETURN avg(order_total) AS avg_order_value,
       min(order_total) AS min_order,
       max(order_total) AS max_order,
       sum(order_total) AS total_revenue,
       count(order) AS total_orders;

// io.thecodeforge: PERCENTILE — distribution analysis
MATCH (user:User)-[:PLACED]->(order:Order)
WITH user, count(order) AS order_count
RETURN percentileCont(order_count, 0.5) AS median_orders,
       percentileCont(order_count, 0.95) AS p95_orders,
       percentileCont(order_count, 0.99) AS p99_orders,
       avg(order_count) AS mean_orders;

// io.thecodeforge: COLLECT for top-N per group
// Top 3 products per category by rating
MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product)
WITH cat, product
ORDER BY product.rating DESC
WITH cat.name AS category, collect(product)[0..3] AS top_products
RETURN category, top_products;
▶ Output
╒════════════════╤═════════════╤══════════╤══════════╤══════════════════╤═════════════╕
│"avg_order_value"│"min_order" │"max_order"│"total_revenue"│"total_orders" │
╞════════════════╪═════════════╪══════════╪══════════════════╪═══════════════╡
│73.42 │12.99 │892.50 │1,847,293.00 │25,162 │
└────────────────┴─────────────┴──────────┴──────────────────┴───────────────┘
💡collect()[0..N] for Top-N Per Group:
The pattern collect(x) ORDER BY x.score DESC then collect(x)[0..3] is Cypher's equivalent of a window function with ROW_NUMBER(). It gives you the top N items per group without a subquery. This is the standard pattern for leaderboards, top products per category, and most-active users per team.

Variable-Length Paths: Traversing Multiple Hops

Variable-length paths let you traverse a variable number of relationships in one pattern. Instead of writing out each hop explicitly, you specify a range: [1..3] means 1 to 3 hops, [] means any number of hops, [*3] means exactly 3 hops.

This is the feature that makes graph databases shine for social networks (friends of friends), supply chains (supplier of supplier), and recommendation engines (users who bought what you bought bought what?).

Critical warning: unbounded paths ([*] with no upper limit) will explore every reachable node in the graph and can cause out-of-memory errors on large graphs. ALWAYS set a maximum depth. For most use cases, 1-5 hops is sufficient. Beyond 6 hops, the results are usually meaningless (everyone on Facebook is within 4.7 hops of everyone else).

Use shortestPath() and allShortestPaths() when you specifically want the shortest route between two nodes, not all possible paths.

io/thecodeforge/graph/VariableLengthPaths.cypher · CYPHER
1234567891011121314151617181920212223242526272829303132333435363738
// io.thecodeforge: Friends of friends — 2 hops exactly
MATCH (me:Person {name: 'Alice'})-[:FRIEND*2]->(fof)
WHERE fof <> me  // exclude Alice herself
RETURN DISTINCT fof.name AS friend_of_friend
ORDER BY fof.name;

// io.thecodeforge: Friends up to 3 levels deep — 1 to 3 hops
MATCH (me:Person {name: 'Alice'})-[:FRIEND*1..3]->(connection)
WHERE connection <> me
RETURN DISTINCT connection.name AS connection,
       min(length(shortestPath((me)-[:FRIEND*]->(connection)))) AS degrees
ORDER BY degrees, connection.name;

// io.thecodeforge: Supply chain — find suppliers of suppliers
MATCH (product:Product {name: 'Widget Pro'})-[:SUPPLIED_BY*1..4]->(supplier)
RETURN product.name AS product,
       supplier.name AS ultimate_supplier,
       supplier.country AS origin_country;

// io.thecodeforge: Shortest path between two people
MATCH path = shortestPath(
  (alice:Person {name: 'Alice'})-[:FRIEND*]-(bob:Person {name: 'Bob'})
)
RETURN [n IN nodes(path) | n.name] AS path_names,
       length(path) AS degrees_of_separation;

// io.thecodeforge: All shortest paths (if multiple equally-short routes exist)
MATCH path = allShortestPaths(
  (alice:Person {name: 'Alice'})-[:FRIEND*]-(bob:Person {name: 'Bob'})
)
RETURN [n IN nodes(path) | n.name] AS path_names,
       length(path) AS hops;

// io.thecodeforge: Bounded traversal with filtering at each hop
// Find paths where every intermediate person is over 25
MATCH (alice:Person {name: 'Alice'})-[:FRIEND*1..3]->(target:Person)
WHERE all(node IN nodes(path) WHERE node.age >= 25 OR node = alice OR node = target)
RETURN DISTINCT target.name, length(path) AS hops;
▶ Output
╒════════════════════════╤═══════════════════════╕
│"path_names" │"degrees_of_separation"│
╞════════════════════════╪═══════════════════════╡
│["Alice","Charlie","Bob"]│2 │
└────────────────────────┴───────────────────────┘
⚠ Never Use Unbounded Paths in Production:
MATCH (a)-[]->(b) with no upper limit explores every reachable node from 'a'. On a graph with 1 million nodes and dense relationships, this can consume all available memory and crash the server. Always use [1..5] or similar bounded ranges. If you need to find all reachable nodes, use the APOC procedure apoc.path.expandConfig with a maxDepth parameter and a timeout.

UNWIND: Working with Lists and Expanding Collections

UNWIND takes a list and expands it into individual rows — the opposite of collect(). If collect() gathers rows into a list, UNWIND spreads a list into rows. This is essential for processing batch data, transforming aggregated results back into individual records, and working with JSON arrays passed from application code.

The most common production use: UNWIND a parameter list to create or merge multiple nodes in one query. Instead of executing 100 separate CREATE statements (100 round trips), you UNWIND a list of 100 items and create all nodes in one query (1 round trip).

UNWIND + MERGE is the standard ETL pattern for loading data into Neo4j.

io/thecodeforge/graph/UnwindOperations.cypher · CYPHER
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647
// io.thecodeforge: UNWIND — expand a list into rows
UNWIND ['Alice', 'Bob', 'Charlie'] AS name
MERGE (p:Person {name: name})
RETURN p.name;

// io.thecodeforge: UNWIND for batch creation — 1 query, 100 nodes
// In production, $users comes from a JSON request body
UNWIND $users AS user_data
MERGE (u:User {email: user_data.email})
ON CREATE SET u.name = user_data.name,
              u.created_at = datetime()
RETURN count(u) AS users_processed;

// io.thecodeforge: UNWIND to create relationships in batch
// Connect a list of skills to a developer in one query
MATCH (dev:Developer {name: 'Alice'})
UNWIND ['Java', 'Spring Boot', 'Neo4j', 'Kubernetes'] AS skill_name
MERGE (s:Skill {name: skill_name})
MERGE (dev)-[:HAS_SKILL {level: 'expert'}]->(s)
RETURN dev.name, collect(s.name) AS skills;

// io.thecodeforge: UNWIND after COLLECT — flatten and reprocess
// Find all categories, collect their products, then flatten for analysis
MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product)
WITH cat, collect(product) AS products
UNWIND products AS product
RETURN cat.name AS category,
       product.name AS product,
       product.price AS price
ORDER BY cat.name, product.price DESC;

// io.thecodeforge: UNWIND with index — track position in the list
UNWIND ['alpha', 'beta', 'gamma', 'delta'] AS item
WITH item, range(0, 3)[indexOf(['alpha','beta','gamma','delta'], item)] AS idx
MERGE (n:Item {name: item})
SET n.sort_order = idx
RETURN n.name, n.sort_order
ORDER BY n.sort_order;

// io.thecodeforge: UNWIND for data cleanup — split comma-separated values
MATCH (legacy:LegacyRecord)
WHERE legacy.tags IS NOT NULL
WITH legacy, split(legacy.tags, ',') AS tag_list
UNWIND tag_list AS tag
MERGE (t:Tag {name: trim(tag)})
MERGE (legacy)-[:TAGGED_WITH]->(t)
RETURN count(DISTINCT legacy) AS records_processed, count(DISTINCT t) AS tags_created;
▶ Output
╒═══════════════════════╤═════════════════╕
│"records_processed" │"tags_created" │
╞═══════════════════════╪═════════════════╡
│4,291 │187 │
└───────────────────────┴─────────────────┘
💡UNWIND $list Is the Batch API Pattern:
Every Neo4j driver (Java, Python, JavaScript) supports parameterized queries. Pass a list as $items, UNWIND it in Cypher, and process all items in a single database round trip. This is 100x faster than executing individual queries in a loop from your application code. In Spring Data Neo4j, use @Query with UNWIND $batch AS row MERGE ... for batch operations.

Subqueries with CALL: Encapsulating Complex Logic

CALL introduces subqueries in Cypher — self-contained query blocks that execute independently and return results to the outer query. There are two forms: CALL {} for inline subqueries (like SQL subqueries) and CALL procedure.name() for invoking stored procedures (like stored procedures in SQL).

CALL {} subqueries are essential for isolating aggregation, avoiding variable scope conflicts, and breaking complex queries into readable chunks. Each CALL {} block has its own scope — variables must be explicitly passed in and out.

CALL procedure is how you invoke APOC procedures, Graph Data Science algorithms, and custom procedures. The syntax is CALL db.indexes() YIELD indexName, state RETURN indexName.

io/thecodeforge/graph/SubqueriesAndCall.cypher · CYPHER
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556
// io.thecodeforge: CALL {} — inline subquery for isolated aggregation
// Find developers, and for each, count their projects in a subquery
MATCH (dev:Developer)
CALL {
  WITH dev
  MATCH (dev)-[:CONTRIBUTES_TO]->(proj:Project)
  RETURN count(proj) AS project_count, collect(proj.name) AS project_names
}
RETURN dev.name, project_count, project_names
ORDER BY project_count DESC;

// io.thecodeforge: CALL {} — find the most recent order per user
MATCH (user:User)
CALL {
  WITH user
  MATCH (user)-[:PLACED]->(order:Order)
  RETURN order
  ORDER BY order.created_at DESC
  LIMIT 1
}
RETURN user.name, order.order_id, order.total, order.created_at
ORDER BY order.created_at DESC;

// io.thecodeforge: CALL db.indexes — list all indexes in the database
CALL db.indexes()
YIELD name, type, entityType, labelsOrTypes, properties, state
RETURN name, type, entityType, labelsOrTypes, properties, state
ORDER BY entityType, name;

// io.thecodeforge: CALL db.schema.visualization — see the graph schema
CALL db.schema.visualization()
YIELD nodes, relationships
RETURN nodes, relationships;

// io.thecodeforge: CALL apoc (if APOC is installed) — advanced procedures
// Export a subgraph to JSON for backup or migration
CALL apoc.export.json.query(
  'MATCH (u:User)-[:PLACED]->(o:Order) RETURN u, o LIMIT 100',
  null,
  {stream: true}
)
YIELD data
RETURN data;

// io.thecodeforge: CALL with UNION — combine results from multiple patterns
MATCH (dev:Developer {name: 'Alice'})
CALL {
  WITH dev
  MATCH (dev)-[:CONTRIBUTES_TO]->(proj:Project)
  RETURN proj.name AS item, 'project' AS type
  UNION
  WITH dev
  MATCH (dev)-[:HAS_SKILL]->(skill:Skill)
  RETURN skill.name AS item, 'skill' AS type
}
RETURN dev.name, collect({name: item, type: type}) AS contributions;
▶ Output
╒══════════╤══════════════╤═══════════════════════════════════════════╕
│"dev.name"│"project_count"│"project_names" │
╞══════════╪══════════════╪═══════════════════════════════════════════╡
│"Alice" │4 │["ForgeAPI","DataPipeline","AuthSvc","..."│
│"Charlie" │3 │["DataPipeline","ETLRunner","Monitoring"] │
│"Dana" │1 │["ForgeAPI"] │
└──────────┴──────────────┴───────────────────────────────────────────┘
🔥CALL {} Isolates Scope — Use It to Avoid Variable Conflicts:
If your query has multiple MATCH clauses that reuse the same variable names, the query planner can get confused about which variable refers to what. Wrapping each section in CALL {} gives it a clean scope. This is especially important in complex queries with 4+ MATCH clauses — the readability improvement alone is worth the extra syntax.

Indexes and Constraints: Schema Management for Performance

Neo4j is often described as 'schema-optional,' but that doesn't mean you should skip indexing. Without indexes, every MATCH on a property scans every node with that label — O(n). With a range index, the same MATCH uses a B-tree lookup — O(log n). With a uniqueness constraint, you get both an index AND a guarantee that no duplicates can be created.

Three types of indexes: 1. Range index — the default, supports equality and range queries (=, <, >, IN). CREATE INDEX FOR (n:Label) ON (n.property) 2. Text index — optimized for CONTAINS, ENDS WITH, and full-text search. CREATE TEXT INDEX FOR (n:Label) ON (n.property) 3. Point index — for geospatial queries. CREATE POINT INDEX FOR (n:Label) ON (n.location)

Constraints enforce data integrity: 1. Uniqueness — no two nodes with the same label can have the same property value. CREATE CONSTRAINT FOR (n:User) REQUIRE n.email IS UNIQUE 2. Node key — uniqueness + existence (the property must exist and be unique). CREATE CONSTRAINT FOR (n:User) REQUIRE (n.email, n.tenant_id) IS NODE KEY 3. Existence — the property must exist (not null). CREATE CONSTRAINT FOR (n:User) REQUIRE n.email IS NOT NULL

Production rule: create uniqueness constraints on every entity's primary identifier before loading data. This prevents duplicates at the database level and gives you a free index for lookups.

io/thecodeforge/graph/IndexesAndConstraints.cypher · CYPHER
1234567891011121314151617181920212223242526272829303132333435
// io.thecodeforge: Create a range index — for fast property lookups
CREATE INDEX user_email_index FOR (u:User) ON (u.email);
CREATE INDEX product_sku_index FOR (p:Product) ON (p.sku);
CREATE INDEX order_status_index FOR (o:Order) ON (o.status);

// io.thecodeforge: Create a composite index — for queries that filter on multiple props
CREATE INDEX user_tenant_email_index FOR (u:User) ON (u.tenant_id, u.email);

// io.thecodeforge: Create a text index — for CONTAINS and ENDS WITH queries
CREATE TEXT INDEX product_name_text_index FOR (p:Product) ON (p.name);

// io.thecodeforge: Create a uniqueness constraint — prevents duplicate entities
CREATE CONSTRAINT user_email_unique FOR (u:User) REQUIRE u.email IS UNIQUE;
CREATE CONSTRAINT product_sku_unique FOR (p:Product) REQUIRE p.sku IS UNIQUE;
CREATE CONSTRAINT order_id_unique FOR (o:Order) REQUIRE o.order_id IS UNIQUE;

// io.thecodeforge: Create a node key — composite uniqueness + existence
CREATE CONSTRAINT user_tenant_key FOR (u:User) REQUIRE (u.email, u.tenant_id) IS NODE KEY;

// io.thecodeforge: Create an existence constraint — property must exist
CREATE CONSTRAINT user_email_exists FOR (u:User) REQUIRE u.email IS NOT NULL;

// io.thecodeforge: List all indexes and their status
SHOW INDEXES YIELD name, type, entityType, labelsOrTypes, properties, state
ORDER BY entityType, name;

// io.thecodeforge: List all constraints
SHOW CONSTRAINTS YIELD name, type, entityType, labelsOrTypes, properties
ORDER BY entityType, name;

// io.thecodeforge: Drop an index (if you need to recreate it)
DROP INDEX user_email_index;

// io.thecodeforge: Drop a constraint
DROP CONSTRAINT user_email_unique;
▶ Output
╒══════════════════════════╤═════════╤════════════╤═══════════════╤══════════════╤═════════╕
│"name" │"type" │"entityType"│"labelsOrTypes"│"properties" │"state" │
╞══════════════════════════╪═════════╪════════════╪═══════════════╪══════════════╪═════════╡
│"user_email_index" │"RANGE" │"NODE" │["User"] │["email"] │"ONLINE" │
│"product_sku_index" │"RANGE" │"NODE" │["Product"] │["sku"] │"ONLINE" │
│"order_status_index" │"RANGE" │"NODE" │["Order"] │["status"] │"ONLINE" │
└──────────────────────────┴─────────┴────────────┴───────────────┴──────────────┴─────────┘
⚠ Constraints Create Indexes — Don't Duplicate:
A uniqueness constraint automatically creates a range index on the same property. If you create both CREATE INDEX FOR (u:User) ON (u.email) and CREATE CONSTRAINT FOR (u:User) REQUIRE u.email IS UNIQUE, you have two indexes doing the same work — wasting memory and slowing down writes. Rule: if you need uniqueness, create the constraint (it includes an index). If you only need fast lookups without uniqueness, create the index.

Real-World Patterns: Recommendation Engine, Fraud Detection, and Social Graph

The value of Cypher becomes clear when you see it applied to real problems. Here are three production patterns that demonstrate why graph databases outperform relational databases for connected data.

Recommendation Engine — 'Customers who bought X also bought Y.' In SQL, this requires a self-join on the purchases table with a GROUP BY, which gets exponentially slower as purchase history grows. In Cypher, it's a two-hop traversal: find everyone who bought the same product, then find what else they bought.

Fraud Detection — Find rings of accounts that share addresses, phone numbers, or devices. In SQL, this requires multiple joins across identity tables. In Cypher, you match a pattern where multiple accounts connect to the same shared resource.

Social Graph — Find the shortest path between two people, suggest connections based on mutual friends, identify communities. These are all natural graph traversals that Cypher handles natively.

io/thecodeforge/graph/RealWorldPatterns.cypher · CYPHER
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253
// io.thecodeforge: RECOMMENDATION ENGINE
// 'Customers who bought this also bought' — collaborative filtering
MATCH (customer:Customer)-[:PURCHASED]->(product:Product {sku: 'WIDGET-PRO'})
<-[:PURCHASED]-(other:Customer)-[:PURCHASED]->(recommended:Product)
WHERE recommended <> product  // exclude the product they already bought
  AND NOT (customer)-[:PURCHASED]->(recommended)  // exclude what they already own
RETURN recommended.name AS product,
       recommended.sku AS sku,
       count(other) AS bought_by_others
ORDER BY bought_by_others DESC
LIMIT 5;

// io.thecodeforge: FRAUD DETECTION — shared identity signals
// Find accounts that share a phone number OR address — potential fraud ring
MATCH (a1:Account)-[:HAS_PHONE]->(phone:Phone)<-[:HAS_PHONE]-(a2:Account)
WHERE a1 <> a2
WITH a1, a2, collect(DISTINCT phone.number) AS shared_phones
MATCH (a1)-[:HAS_ADDRESS]->(addr:Address)<-[:HAS_ADDRESS]-(a2)
WITH a1, a2, shared_phones, collect(DISTINCT addr.full_address) AS shared_addresses
WHERE size(shared_phones) >= 1 AND size(shared_addresses) >= 1
RETURN a1.account_id AS account_1,
       a2.account_id AS account_2,
       shared_phones,
       shared_addresses,
       size(shared_phones) + size(shared_addresses) AS risk_score
ORDER BY risk_score DESC
LIMIT 20;

// io.thecodeforge: SOCIAL GRAPH — mutual friends suggestion
MATCH (me:Person {name: 'Alice'})-[:FRIEND]->(mutual:Person)-[:FRIEND]->(suggestion:Person)
WHERE NOT (me)-[:FRIEND]->(suggestion)  // not already friends
  AND suggestion <> me                  // not yourself
RETURN suggestion.name AS suggested_friend,
       count(mutual) AS mutual_friends,
       collect(mutual.name) AS through
ORDER BY mutual_friends DESC
LIMIT 10;

// io.thecodeforge: SOCIAL GRAPH — community detection (simplified)
// Find groups of 3+ people who are all friends with each other
MATCH (a:Person)-[:FRIEND]->(b:Person)-[:FRIEND]->(c:Person)-[:FRIEND]->(a)
WHERE id(a) < id(b) AND id(b) < id(c)  // avoid counting same triangle multiple times
RETURN a.name, b.name, c.name AS triangle
ORDER BY a.name;

// io.thecodeforge: KNOWLEDGE GRAPH — find all entities connected to a concept
MATCH (concept:Concept {name: 'Microservices'})-[:RELATED_TO*1..2]->(related)
WHERE related <> concept
RETURN DISTINCT labels(related)[0] AS type,
       related.name AS entity,
       related.description AS description
ORDER BY type, entity
LIMIT 20;
▶ Output
╒═══════════════════╤══════════╤═══════════════════════════════════════╕
│"product" │"sku" │"bought_by_others" │
╞═══════════════════╪══════════╪═══════════════════════════════════════╡
│"Gadget X" │"GADG-001"│847 │
│"Sensor V2" │"SENS-002"│623 │
│"Widget Mini" │"WIDM-003"│412 │
│"Power Pack" │"POWR-004"│298 │
│"Cable Pro" │"CABL-005"│156 │
└───────────────────┴──────────┴───────────────────────────────────────┘
💡The Recommendation Query Is Just Two Hops:
The entire 'customers who bought X also bought Y' feature is: (me)-[:PURCHASED]->(product)<-[:PURCHASED]-(other)-[:PURCHASED]->(rec). That's it. Four nodes, three relationships, one pattern. In SQL, this is a self-join with aggregation and exclusion subqueries — 15+ lines of SQL that gets slower as the purchase table grows. In Cypher, it's a single pattern that runs in milliseconds because the graph stores the relationships as direct pointers.

PROFILE and EXPLAIN: Query Performance Analysis

Every query you write for production should be profiled first. Neo4j provides two tools: EXPLAIN shows the execution plan without running the query (fast, safe for production). PROFILE runs the query and shows the execution plan with actual row counts and database hits (slower, shows real performance).

The execution plan is a tree of operators: NodeByLabelScan (scans all nodes with a label), NodeIndexSeek (uses an index), Expand (traverses a relationship), Filter (applies WHERE conditions), ProduceResult (returns data). Your goal: minimize NodeByLabelScan operations and maximize NodeIndexSeek operations.

The 'Eager' operator is a red flag — it means the engine had to load all intermediate results into memory before continuing. This happens with certain aggregation patterns and can cause memory issues on large graphs. If you see Eager, consider restructuring the query.

Database hits are the unit of work — fewer hits = faster query. A query with 100 database hits is fast. A query with 10 million database hits is slow, regardless of how many rows it returns.

io/thecodeforge/graph/QueryProfiling.cypher · CYPHER
1234567891011121314151617181920212223242526272829303132333435363738394041
// io.thecodeforge: EXPLAIN — show plan without executing (safe for production)
EXPLAIN
MATCH (dev:Developer {language: 'Java'})-[:CONTRIBUTED_TO]->(proj:Project)
WHERE proj.status = 'active'
RETURN dev.name, proj.name;

// io.thecodeforge: PROFILE — execute and show real metrics
PROFILE
MATCH (dev:Developer {language: 'Java'})-[:CONTRIBUTED_TO]->(proj:Project)
WHERE proj.status = 'active'
RETURN dev.name, proj.name;

// io.thecodeforge: BAD QUERY — no label, forces full scan
// This scans every node in the database
EXPLAIN
MATCH (n)
WHERE n.email = 'alice@thecodeforge.io'
RETURN n;

// io.thecodeforge: GOOD QUERY — uses label + index
// This uses the User label index and the email property index
EXPLAIN
MATCH (u:User {email: 'alice@thecodeforge.io'})
RETURN u;

// io.thecodeforge: Identifying the Eager operator
// If you see Eager in the plan, the engine is buffering all results
EXPLAIN
MATCH (order:Order)
WITH order
ORDER BY order.created_at DESC
MATCH (order)-[:CONTAINS]->(item:OrderItem)
RETURN order, collect(item) AS items;

// io.thecodeforge: Checking if an index is being used
// Run this after creating an index to verify the plan uses it
EXPLAIN
MATCH (u:User {email: 'test@thecodeforge.io'})
RETURN u.name;
// If the plan shows NodeIndexSeek, the index is working.
// If it shows NodeByLabelScan, the index isn't being used — check your query.
▶ Output
PROFILE output for the good query:

+------------------+----------------+---------+-----------+
| Operator | Estimated Rows | DB Hits | Memory (Bytes) |
+------------------+----------------+---------+-----------+
| ProduceResult | 1 | 0 | 0 |
| +NodeIndexSeek | 1 | 4 | 120 |
+------------------+----------------+---------+-----------+

Total database hits: 4 ← This is fast.


PROFILE output for the bad query:

+----------------------+----------------+-----------+---------------+
| Operator | Estimated Rows | DB Hits | Memory (Bytes)|
+----------------------+----------------+-----------+---------------+
| ProduceResult | 1 | 0 | 0 |
| +Filter | 1 | 2,847,293 | 480 |
| +NodeByLabelScan | 2,847,293 | 2,847,294 | 32,768 |
+----------------------+----------------+-----------+---------------+

Total database hits: 5,694,587 ← This is slow. Add an index.
🔥NodeIndexSeek Good, NodeByLabelScan Bad:
When reading a PROFILE or EXPLAIN output, look for these two operators. NodeIndexSeek means the engine used an index — fast. NodeByLabelScan means it scanned every node with that label — slow. If your query shows NodeByLabelScan where you expected NodeIndexSeek, check that the index exists (SHOW INDEXES) and that your query filters on the indexed property.
AspectRelational (SQL)Graph (Cypher)
Data ModelTables and RowsNodes and Relationships
Relationship LogicForeign Keys & JoinsDirect Traversal (Pointers)
Query SyntaxVerbose SELECT/JOINASCII-Art Pattern Matching
Performance at DepthDegrades exponentially with join depthConstant for local traversals (index-free adjacency)
SchemaRigid — ALTER TABLE for changesSchema-optional — add properties without migration
Many-to-ManyRequires junction tableNative — just draw multiple relationships
AggregationGROUP BY + HAVINGWITH + WHERE (after aggregation)
SubqueriesNested SELECT / CTECALL {} blocks
Batch OperationsINSERT INTO ... SELECTUNWIND $list MERGE ...
Full-Text SearchFull-Text Index / LIKE '%x%'TEXT INDEX + CONTAINS / apoc.text

🎯 Key Takeaways

  • Cypher is a pattern-based query language — you describe the SHAPE of your data using ASCII art, and the engine finds every instance of that shape in the graph. Think in nodes and paths, not tables and joins.
  • CREATE always creates new nodes, even if duplicates exist. MERGE checks for existence first and creates only if missing. Use MERGE for entities, CREATE for events.
  • Always use labels in your MATCH patterns — matching without a label forces a full database scan (O(n)), while matching with a label uses the index (O(log n)).
  • WITH is the pipe operator of Cypher — it chains query stages, enables aggregation filtering (the HAVING equivalent), and limits variable scope for query optimization.
  • Variable-length paths ([1..5]) are Cypher's superpower for social networks, supply chains, and recommendation engines — but NEVER use unbounded paths ([]) in production.
  • UNWIND expands lists into rows — use it with MERGE for batch operations that process thousands of records in a single database round trip.
  • Profile every production query with EXPLAIN or PROFILE. Look for NodeIndexSeek (good) vs NodeByLabelScan (bad). If you see Eager, restructure the query.
  • Create uniqueness constraints on every entity's primary identifier before loading data — this prevents duplicates at the database level and gives you a free index.
  • collect() gathers rows into a list, UNWIND spreads a list into rows. Together they're the foundation of nested result structures and batch processing in Cypher.
  • Soft deletes (setting a 'deleted' property) are almost always better than hard deletes (DETACH DELETE) in production — hard deletes break graph integrity.
  • Cypher's recommendation engine query is just two hops: (me)-[:PURCHASED]->(product)<-[:PURCHASED]-(other)-[:PURCHASED]->(rec). In SQL, this is a 15-line self-join. In Cypher, it's a single pattern.
  • Use keyset pagination (WHERE created_at < $cursor LIMIT 10) instead of SKIP-based pagination (SKIP 100000 LIMIT 10) — SKIP forces the engine to discard rows, keyset jumps directly to the right position.

⚠ Common Mistakes to Avoid

    Creating Cartesian products by matching two unrelated patterns — MATCH (a:User), (b:Product) without a relationship between them produces a × b rows, which crashes the database on large datasets. Always connect your patterns with relationships: MATCH (a:User)-[:PURCHASED]->(b:Product).

    b:Product).

    Using CREATE when you need MERGE — running the same CREATE script twice creates duplicate nodes. Use MERGE for entities (users, products) and CREATE only for events (orders, log entries) where duplicates are impossible by design.

    by design.

    Matching without labels — MATCH (n) WHERE n.email = 'x' scans every node in the database (O(n)). Always use labels: MATCH (u:User) WHERE u.email = 'x' uses the label index (O(log n)).

    (O(log n)).

    Unbounded variable-length paths — MATCH (a)-[*]->(b) explores every reachable node and can cause out-of-memory errors. Always set a maximum depth: MATCH (a)-[*1..5]->(b).

    1..5]->(b).

    Forgetting that DELETE fails on nodes with relationships — MATCH (n:User {id: 'x'}) DELETE n throws an error if the node has relationships. Use DETACH DELETE to remove the node and its relationships in one operation.

    operation.

    Using SET = instead of SET += for partial updates — SET user = {name: 'Alice'} replaces ALL properties with just {name: 'Alice'}, deleting everything else. Use SET user += {name: 'Alice'} to merge the update into existing properties.

    properties.

    Not creating indexes before loading data — loading 1 million nodes and then creating the index is slower than creating the index first. The index rebuilds from scratch after bulk load. Create constraints and indexes first, then load data.

    load data.

    Using SKIP for large offsets — SKIP 100000 LIMIT 10 forces the engine to process and discard 100,000 rows. Use keyset pagination with WHERE instead: WHERE user.created_at < $cursor ORDER BY user.created_at DESC LIMIT 10.

    C LIMIT 10.

Interview Questions on This Topic

  • QExplain 'Index-free Adjacency' and how it makes Cypher traversals constant-time regardless of database size. How does this differ from SQL's approach to relationships?
  • QWhat's the difference between CREATE and MERGE? When would you use each, and what happens if you run the same MERGE statement twice?
  • QHow does the WITH clause work in Cypher? Give an example where you aggregate data and then filter on the aggregated result — what's the SQL equivalent?
  • QWhat is a variable-length path query? Write a Cypher query to find all 'friends of friends' up to 3 levels deep, and explain why you should never use an unbounded path in production.
  • QHow do you profile a Cypher query? What does NodeIndexSeek vs NodeByLabelScan tell you, and what does the 'Eager' operator signify?
  • QExplain the difference between MATCH and OPTIONAL MATCH. In what scenario would you use OPTIONAL MATCH, and how does it relate to SQL's LEFT JOIN?
  • QHow does MERGE handle race conditions when two concurrent transactions try to create the same node? What role do uniqueness constraints play?
  • QWhat is the UNWIND clause and how would you use it to batch-insert 10,000 records in a single query? Why is this faster than executing 10,000 individual CREATE statements?
  • QDesign a Cypher query for a recommendation engine that finds 'customers who bought X also bought Y.' Explain why this is faster in a graph database than in SQL.
  • QYou have a graph with 50 million nodes and a query that's taking 30 seconds. Walk me through your debugging process — what do you check first, and what are the most common causes of slow Cypher queries?

Frequently Asked Questions

Does Cypher support ACID transactions like SQL?

Yes, Neo4j is fully ACID-compliant. Every Cypher query runs within a transaction — if any part of the query fails, the entire transaction is rolled back. For multi-statement transactions, you can use BEGIN and COMMIT explicitly through the driver API. Neo4j also uses optimistic locking: if two transactions modify the same node concurrently, the second one fails with a deadlock exception and must be retried. In your application code, wrap Cypher calls in a retry loop with exponential backoff.

What is the difference between CREATE and MERGE?

CREATE always creates a new node or relationship, even if an identical one already exists — this leads to duplicates. MERGE checks if the pattern exists first: if it does, MERGE matches it (does nothing); if it doesn't, MERGE creates it. MERGE is the idempotent version of CREATE. Use CREATE for event-based data (orders, log entries) where each record is inherently unique. Use MERGE for entity-based data (users, products) where the same entity might be referenced multiple times. MERGE also supports ON CREATE SET and ON MATCH SET for upsert behavior.

How do I handle many-to-many relationships in Cypher?

Many-to-many relationships are natural in graph databases — no junction table needed. Just create direct relationships between the nodes: (student:Student)-[:ENROLLED_IN]->(course:Course). If a student is enrolled in 5 courses, they have 5 direct relationships. If a course has 200 students, it has 200 incoming relationships. Querying is a simple pattern match: MATCH (s:Student)-[:ENROLLED_IN]->(c:Course) RETURN s.name, c.name. No join tables, no foreign keys, no performance degradation as the dataset grows.

What is the 'ASCII-art' syntax exactly?

Cypher uses text characters to represent graph elements. Parentheses (node) represent a node (they look like circles). Brackets -[rel:TYPE]-> represent a directed relationship (the arrow shows direction). Combining them: (a:Person)-[:FRIEND]->(b:Person) visually represents 'a Person node connected by a FRIEND relationship to another Person node.' This makes the query look like the data it queries — if you can draw it on a whiteboard, you can write it in Cypher.

Can I use Cypher to query non-graph data?

Cypher is optimized for graph traversals — using it for flat, non-connected data (like a simple key-value store) is inefficient. If your data has no relationships or the relationships are trivial (one-to-many without traversal needs), a relational database or document store is simpler and faster. Cypher's advantage kicks in when you need to traverse relationships: friends of friends, supply chain hops, recommendation paths, fraud ring detection. If your queries are mostly 'give me row X where column Y = Z,' you don't need a graph.

How do I import data from CSV into Neo4j?

Use LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row to read CSV files. The WITH HEADERS clause treats the first row as column names, accessible as row.column_name. Combine with MERGE for idempotent imports: LOAD CSV WITH HEADERS FROM 'file:///users.csv' AS row MERGE (u:User {email: row.email}) SET u.name = row.name. For large files (>100K rows), use PERIODIC COMMIT (Neo4j 4.x) or CALL { } IN TRANSACTIONS (Neo4j 5.x) to batch the writes and avoid running out of memory.

How do I paginate results efficiently in Cypher?

Don't use SKIP for large offsets — SKIP 100000 LIMIT 10 forces the engine to process and discard 100,000 rows. Use keyset pagination instead: pass the last seen value and filter on it. For example: MATCH (u:User) WHERE u.created_at < $last_seen_date RETURN u ORDER BY u.created_at DESC LIMIT 10. This lets the engine use the index to jump directly to the right position. Same pagination result, 1000x faster on large datasets. Store $last_seen_date from the last item of the previous page.

What is the WITH clause and when do I need it?

WITH is Cypher's pipe operator — it passes results from one query stage to the next. You need it in three situations: (1) after aggregation to filter on the aggregated result (like SQL's HAVING), (2) to chain multiple query stages for readability, and (3) to limit which variables pass forward (the engine uses this for optimization). Any variable not listed in the WITH clause becomes unavailable downstream. Example: MATCH (dev)-[:CONTRIBUTES_TO]->(proj) WITH dev, count(proj) AS cnt WHERE cnt > 3 — the count must pass through a WITH before you can filter on it.

How do I optimize a slow Cypher query?

Run PROFILE on the query and look for three things: (1) NodeByLabelScan — the engine is scanning all nodes with a label instead of using an index. Fix: create an index on the filtered property. (2) Eager operator — the engine is buffering all intermediate results in memory. Fix: restructure the query to reduce intermediate result size. (3) High database hits — the engine is doing too much work. Fix: start your MATCH from the most selective node (the one with the fewest matches), use labels everywhere, and avoid unbounded variable-length paths.

Can I use Cypher with Spring Data Neo4j?

Yes, Spring Data Neo4j (SDN) is the official integration. You define entity classes with @Node annotations, repositories that extend Neo4jRepository, and custom queries with @Query("cypher here"). SDN handles object-graph mapping — it converts Neo4j nodes and relationships into Java objects and back. For complex queries, use @Query with native Cypher. For batch operations, use UNWIND $batch AS row MERGE ... in a @Query method. SDN works with both Neo4j Community and Enterprise editions.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousIntroduction to Graph Databases and Neo4jNext →Neo4j Use Cases — When to Use a Graph Database
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged