Cypher Query Language Basics
- Cypher is a pattern-based query language — you describe the SHAPE of your data using ASCII art, and the engine finds every instance of that shape in the graph. Think in nodes and paths, not tables and joins.
- CREATE always creates new nodes, even if duplicates exist. MERGE checks for existence first and creates only if missing. Use MERGE for entities, CREATE for events.
- Always use labels in your MATCH patterns — matching without a label forces a full database scan (O(n)), while matching with a label uses the index (O(log n)).
Imagine you're trying to find 'friends of friends who live in London and like jazz' in a spreadsheet. You'd need multiple VLOOKUPs, joins, and temp tables — and it'd take 20 minutes to write. In a graph database, you draw the pattern: (me)-[:FRIEND]->(friend)-[:FRIEND]->(fof) WHERE fof.city = 'London' AND fof.likes = 'Jazz'. The database walks the graph like following breadcrumbs — it doesn't scan tables, it hops from node to node. That's Cypher: a language that lets you describe the shape of the data you want, and the engine finds every instance of that shape in the graph. The key difference from SQL: in SQL, relationships are inferred through foreign keys and reconstructed with joins at query time. In Cypher, relationships are first-class citizens — they're stored directly on disk as physical connections between nodes. When you traverse from one node to another, you're following a pointer, not scanning a join table. This is why a 6-level deep friend-of-friend query in SQL takes minutes but in Cypher takes milliseconds.
A recommendation engine I built for an e-commerce platform was generating 'customers who bought X also bought Y' queries using SQL. With 2 million products and 50 million purchase records, the query took 45 seconds — too slow for real-time recommendations on the product page. We rewrote the data layer in Neo4j with Cypher. Same logic, same data. The query dropped to 12 milliseconds. Not 12 seconds — 12 milliseconds. The difference wasn't better hardware or clever caching. It was that the graph stored relationships as direct pointers, so finding 'people who bought this product and what else they bought' was a two-hop traversal instead of a three-table join with a GROUP BY on 50 million rows.
Cypher is the query language for Neo4j — the most widely-used graph database. It's declarative (you describe what you want, not how to get it), pattern-based (you draw the shape of your data using ASCII art), and designed specifically for traversing relationships. If you've written SQL, the mental model shift is: stop thinking in tables and joins, start thinking in nodes and paths.
This guide covers every clause you'll use daily — CREATE, MATCH, WHERE, RETURN, SET, DELETE, MERGE, WITH, UNWIND, and CALL — with production examples from real systems. We'll also cover variable-length paths, aggregation, subqueries, indexing, and the six mistakes that cause the most pain in production.
Cypher Syntax Fundamentals: Nodes, Relationships, and Patterns
Before writing any query, you need to understand the three building blocks: nodes, relationships, and properties. Everything in Cypher is built on these.
Nodes represent entities — people, products, cities, orders. Written as (variable:Label {properties}). The parentheses look like circles (nodes in a graph). The label is a type tag (like a table name). Properties are key-value pairs stored on the node.
Relationships connect nodes — FRIEND, PURCHASED, LOCATED_IN. Written as -[variable:TYPE {properties}]->. The arrow indicates direction. Relationships always have exactly one type and exactly one direction. They can also store properties (like a 'since' date on a FRIEND relationship, or a 'quantity' on a PURCHASED relationship).
Patterns combine nodes and relationships into shapes: (a:Person)-[:FRIEND]->(b:Person). This is the core of Cypher — you write a pattern that describes the shape of data you want, and the engine finds every instance of that pattern in the graph.
Variables bind to nodes or relationships so you can reference them later in WHERE, RETURN, or WITH. If you don't need to reference something, you can leave it anonymous: (a)-[:FRIEND]->() finds all people that 'a' is friends with, without naming the destination.
Three rules that save you hours of debugging: 1. Node labels start with uppercase: :Person, :Product 2. Relationship types start with uppercase: :FRIEND, :PURCHASED 3. Property keys start with lowercase: name, email, created_at
// io.thecodeforge: Node syntax — (variable:Label {key: value}) // This creates a node with label 'Person' and two properties CREATE (alice:Person {name: 'Alice', age: 32, city: 'London'}) RETURN alice; // io.thecodeforge: Relationship syntax — -[variable:TYPE {key: value}]-> // This creates a directed relationship with a property CREATE (bob:Person {name: 'Bob', age: 28, city: 'Paris'}) CREATE (alice)-[r:FRIEND {since: date('2019-06-15'), context: 'university'}]->(bob) RETURN alice, r, bob; // io.thecodeforge: Pattern — the fundamental unit of Cypher queries // This pattern matches any Person who is FRIENDs with any other Person MATCH (p1:Person)-[:FRIEND]->(p2:Person) RETURN p1.name AS person, p2.name AS friend; // io.thecodeforge: Anonymous nodes — when you don't need to reference them // Finds all people Alice is friends with, without naming the relationship MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(friend) RETURN friend.name, friend.city; // io.thecodeforge: Multiple labels — a node can have more than one CREATE (charlie:Person:Developer {name: 'Charlie', language: 'Java'}) RETURN charlie; // io.thecodeforge: Matching multiple labels — find all Developer-People MATCH (dev:Person:Developer) RETURN dev.name, dev.language;
│"person" │"friend" │"friend.city" │
╞══════════╪═══════════╪═══════════════╡
│"Alice" │"Bob" │"Paris" │
└──────────┴───────────┴───────────────┘
CREATE: Building Nodes and Relationships
CREATE is how you insert data into the graph. It always creates new nodes and relationships — even if duplicates already exist. This is the key difference from MERGE (covered later), which checks for existence first.
You can create single nodes, multiple nodes, nodes with relationships in one statement, and even entire subgraphs. The most common production pattern: create a node, create a relationship to an existing node, return what you created. This is how you'd model a new user signing up and connecting to an existing organization.
Properties can be any JSON-compatible type: strings, numbers, booleans, lists, dates, and temporal types. Neo4j stores them as key-value pairs on the node or relationship — there's no schema enforcement, which makes it easy to evolve your data model without migrations.
One gotcha: CREATE always creates. If you run the same CREATE statement twice, you get duplicate nodes. Use MERGE when you need idempotency (creating a node only if it doesn't already exist).
// io.thecodeforge: Create a single node with properties CREATE (order:Order { order_id: 'ORD-2024-001', total: 149.99, currency: 'GBP', status: 'pending', created_at: datetime() }) RETURN order; // io.thecodeforge: Create nodes and a relationship in one statement // This is the most common production pattern — create entity + link to existing CREATE (user:User {name: 'Dana', email: 'dana@thecodeforge.io'}) WITH user MATCH (org:Organization {name: 'TheCodeForge'}) CREATE (user)-[rel:MEMBER_OF {role: 'engineer', joined: date('2024-01-15')}]->(org) RETURN user.name, rel.role, org.name; // io.thecodeforge: Create multiple nodes at once CREATE (london:City {name: 'London', country: 'UK', population: 8982000}), (paris:City {name: 'Paris', country: 'France', population: 2161000}), (berlin:City {name: 'Berlin', country: 'Germany', population: 3645000}); // io.thecodeforge: Create an entire subgraph in one statement // Product with variants, categories, and supplier — all linked CREATE (p:Product {sku: 'WIDGET-PRO', name: 'Widget Pro', price: 29.99}) CREATE (cat:Category {name: 'Electronics'}) CREATE (sup:Supplier {name: 'AcmeCorp', country: 'China'}) CREATE (p)-[:IN_CATEGORY]->(cat) CREATE (p)-[:SUPPLIED_BY {lead_time_days: 14}]->(sup) RETURN p.name, cat.name, sup.name; // io.thecodeforge: Create a relationship between EXISTING nodes // This is how you link entities after initial creation MATCH (alice:Person {name: 'Alice'}) MATCH (bob:Person {name: 'Bob'}) CREATE (alice)-[:FRIEND {since: date('2023-03-10')}]->(bob) RETURN 'Friendship created' AS result;
│"user.name" │"rel.role"│"org.name" │
╞════════════════╪══════════╪═══════════════════╡
│"Dana" │"engineer"│"TheCodeForge" │
└────────────────┴──────────┴───────────────────┘
MATCH and WHERE: Reading and Filtering Data
MATCH is the read operation in Cypher — equivalent to SELECT in SQL. You write a pattern, and the engine finds every instance of that pattern in the graph. MATCH alone returns everything matching the pattern. WHERE filters the results.
The WHERE clause supports the full set of comparison and logical operators: =, <>, <, >, <=, >=, AND, OR, NOT, IN, STARTS WITH, ENDS WITH, CONTAINS, IS NULL, IS NOT NULL, and regular expressions (=~ 'regex').
Performance note: always start your MATCH from a specific node, not an anonymous scan. (user:User {email: 'x'})-[:ORDERED]->(order) is fast because it starts from the User index. (n)-[:ORDERED]->(order) without a starting point forces a full graph scan.
Multiple MATCH clauses act like joins — each MATCH finds a pattern, and Cypher joins them on shared variables. If you want an outer join effect, use OPTIONAL MATCH (returns null when the pattern isn't found).
// io.thecodeforge: Basic MATCH — find all developers MATCH (dev:Developer) RETURN dev.name, dev.language, dev.years_experience; // io.thecodeforge: MATCH with WHERE — filtering on properties MATCH (dev:Developer) WHERE dev.language = 'Java' AND dev.years_experience >= 5 RETURN dev.name AS senior_java_dev; // io.thecodeforge: Inline property matching — shorthand for simple WHERE // Equivalent to the query above but more concise MATCH (dev:Developer {language: 'Java'}) WHERE dev.years_experience >= 5 RETURN dev.name AS senior_java_dev; // io.thecodeforge: String matching — STARTS WITH, ENDS WITH, CONTAINS MATCH (user:User) WHERE user.email ENDS WITH '@thecodeforge.io' AND user.name STARTS WITH 'A' RETURN user.name, user.email; // io.thecodeforge: IN operator — matching against a list MATCH (dev:Developer) WHERE dev.language IN ['Java', 'Kotlin', 'Scala'] RETURN dev.name, dev.language ORDER BY dev.language; // io.thecodeforge: IS NULL / IS NOT NULL — checking for missing properties MATCH (order:Order) WHERE order.shipped_at IS NULL AND order.status = 'confirmed' RETURN order.order_id, order.total ORDER BY order.created_at DESC LIMIT 10; // io.thecodeforge: Regular expression matching MATCH (product:Product) WHERE product.sku =~ 'WIDGET-.*-PRO' RETURN product.name, product.sku; // io.thecodeforge: OPTIONAL MATCH — like a LEFT JOIN in SQL // Returns users even if they have no orders MATCH (user:User) OPTIONAL MATCH (user)-[:PLACED]->(order:Order) RETURN user.name, count(order) AS order_count, CASE WHEN order IS NULL THEN 'no orders' ELSE 'has orders' END AS status; // io.thecodeforge: Multiple MATCH clauses — chained traversal // Find developers who work on projects in a specific category MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project) MATCH (proj)-[:IN_CATEGORY]->(cat:Category {name: 'Backend'}) WHERE dev.years_experience >= 3 RETURN dev.name, proj.name, cat.name ORDER BY dev.years_experience DESC;
│"dev.name" │"proj.name" │"cat.name" │
╞═══════════════════╪══════════════╪═══════════╡
│"Alice" │"ForgeAPI" │"Backend" │
│"Charlie" │"DataPipeline"│"Backend" │
└───────────────────┴──────────────┴───────────┘
RETURN, ORDER BY, LIMIT, and DISTINCT: Shaping Your Results
RETURN defines what the query gives back — equivalent to SELECT in SQL. You can return nodes, relationships, properties, computed values, and aggregations. ORDER BY sorts results (ascending by default, use DESC for descending). LIMIT caps the result set. DISTINCT removes duplicates.
You can alias results with AS for cleaner output. You can use CASE expressions for conditional logic. You can call functions like toString(), toInteger(), datetime(), and coalesce() inside RETURN.
Pagination pattern: use SKIP and LIMIT together for cursor-based pagination. Don't use SKIP for large offsets (skip 10000 rows is slow) — use a WHERE clause on the sort key instead (WHERE user.created_at < last_seen_date).
// io.thecodeforge: RETURN with aliases and computed values MATCH (dev:Developer) RETURN dev.name AS developer, dev.language AS primary_language, dev.years_experience AS experience, CASE WHEN dev.years_experience >= 10 THEN 'Principal' WHEN dev.years_experience >= 5 THEN 'Senior' WHEN dev.years_experience >= 2 THEN 'Mid' ELSE 'Junior' END AS level ORDER BY dev.years_experience DESC; // io.thecodeforge: DISTINCT — remove duplicate results MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project) RETURN DISTINCT dev.name AS developer ORDER BY dev.name; // io.thecodeforge: Pagination with SKIP and LIMIT // Page 1: first 10 results MATCH (user:User) RETURN user.name, user.email, user.created_at ORDER BY user.created_at DESC LIMIT 10; // Page 2: skip first 10, return next 10 MATCH (user:User) RETURN user.name, user.email, user.created_at ORDER BY user.created_at DESC SKIP 10 LIMIT 10; // io.thecodeforge: RETURN relationships and their properties MATCH (a:Person)-[r:FRIEND]->(b:Person) RETURN a.name AS person, b.name AS friend, r.since AS friends_since, duration.between(r.since, date()).years AS years_of_friendship ORDER BY r.since ASC; // io.thecodeforge: Coalesce — provide defaults for missing properties MATCH (product:Product) RETURN product.name, coalesce(product.description, 'No description available') AS description, coalesce(product.rating, 0) AS rating ORDER BY product.rating DESC;
│"developer" │"primary_language"│"experience"│"level" │
╞══════════════╪══════════════════╪════════════╪═════════╡
│"Alice" │"Java" │12 │Principal│
│"Charlie" │"Java" │8 │Senior │
│"Dana" │"Python" │4 │Mid │
└──────────────┴──────────────────┴────────────┴─────────┘
SET and REMOVE: Updating Nodes and Relationships
SET adds or updates properties on nodes and relationships. REMOVE deletes properties or removes labels from nodes. These are the UPDATE and ALTER equivalents in Cypher.
SET can assign a single property, multiple properties at once, or copy properties from a map. It can also add labels to a node (SET n:NewLabel). REMOVE does the inverse — removes properties or labels.
Production pattern: combine MATCH + SET for targeted updates. Always filter by an indexed property (like an ID or email) to avoid scanning the entire graph. Use coalesce() with SET to preserve existing values when partial updates arrive from an API.
// io.thecodeforge: SET — update a single property MATCH (user:User {email: 'alice@thecodeforge.io'}) SET user.last_login = datetime() RETURN user.name, user.last_login; // io.thecodeforge: SET — update multiple properties at once MATCH (order:Order {order_id: 'ORD-2024-001'}) SET order.status = 'shipped', order.shipped_at = datetime(), order.tracking_number = 'TRACK-99821' RETURN order.order_id, order.status, order.tracking_number; // io.thecodeforge: SET — update from a map (common in REST API handlers) // The map comes from the request body in production WITH {status: 'delivered', delivered_at: '2024-03-15T14:30:00Z'} AS updates MATCH (order:Order {order_id: 'ORD-2024-001'}) SET order += updates // += merges the map into existing properties RETURN order.order_id, order.status, order.delivered_at; // io.thecodeforge: SET — add a label to an existing node MATCH (user:User {email: 'alice@thecodeforge.io'}) SET user:Admin RETURN labels(user) AS labels; // ['User', 'Admin'] // io.thecodeforge: REMOVE — delete a property MATCH (user:User {email: 'alice@thecodeforge.io'}) REMOVE user.temporary_token RETURN user.name, user.temporary_token; // null — property is gone // io.thecodeforge: REMOVE — remove a label from a node MATCH (user:User {email: 'alice@thecodeforge.io'}) REMOVE user:Admin RETURN labels(user) AS labels; // ['User'] — Admin label removed // io.thecodeforge: Production pattern — partial update with coalesce // Preserve existing rating if the update doesn't include one MATCH (product:Product {sku: 'WIDGET-PRO'}) SET product.name = coalesce($update_name, product.name), product.price = coalesce($update_price, product.price), product.updated_at = datetime() RETURN product.name, product.price;
│"order.order_id"│"order.status"│"order.tracking_number"│
╞══════════════╪══════════╪═════════════════════════╡
│"ORD-2024-001"│"shipped" │"TRACK-99821" │
└──────────────┴──────────┴─────────────────────────┘
DELETE and DETACH DELETE: Removing Data
DELETE removes nodes and relationships. DETACH DELETE removes a node and ALL its relationships in one operation. This distinction matters — you cannot delete a node that still has relationships. If you try, Neo4j throws an error. DETACH DELETE bypasses this by removing relationships first, then the node.
In production, soft deletes (setting a 'deleted: true' property) are almost always better than hard deletes. Hard deletes break referential integrity in the graph — other nodes that referenced the deleted node now have dangling relationships. Soft deletes preserve the graph structure while hiding the node from queries (WHERE n.deleted IS NULL OR n.deleted = false).
When you must hard delete, always use DETACH DELETE for nodes. Using plain DELETE on a node with relationships will fail.
// io.thecodeforge: DELETE — remove a relationship only // The nodes remain; only the connection is severed MATCH (alice:Person {name: 'Alice'})-[r:FRIEND]->(bob:Person {name: 'Bob'}) DELETE r RETURN 'Friendship removed' AS result; // io.thecodeforge: DETACH DELETE — remove a node and ALL its relationships // Use this when deleting an entity entirely MATCH (user:User {email: 'spam@bot.net'}) DETACH DELETE user; // io.thecodeforge: Soft delete — production pattern // Mark as deleted instead of removing; preserves graph integrity MATCH (order:Order {order_id: 'ORD-2024-999'}) SET order.deleted = true, order.deleted_at = datetime(), order.deleted_by = 'admin@thecodeforge.io' RETURN order.order_id, order.deleted; // io.thecodeforge: Filtering out soft-deleted nodes in queries // Every MATCH in your app should include this filter MATCH (order:Order) WHERE order.deleted IS NULL RETURN order.order_id, order.total ORDER BY order.created_at DESC LIMIT 20; // io.thecodeforge: Delete all relationships of a specific type // Useful for bulk cleanup during data migrations MATCH ()-[r:TEMPORARY_LINK]->() DELETE r RETURN count(r) AS deleted_count; // io.thecodeforge: Conditional delete — only delete if certain conditions met MATCH (session:Session) WHERE session.last_active < datetime() - duration('P30D') AND session.active = false DETACH DELETE session RETURN count(session) AS expired_sessions_deleted;
│"expired_sessions_deleted" │
╞═══════════════════════════╡
│147 │
└───────────────────────────┘
MERGE: Idempotent Create-or-Find
MERGE is CREATE + MATCH in one operation. It checks if a pattern exists — if yes, it matches it; if no, it creates it. This makes MERGE the idempotent alternative to CREATE, essential for scripts that might run multiple times, ETL pipelines, and any scenario where duplicate data is dangerous.
MERGE has two companion clauses: ON CREATE SET (properties to set only when creating a new node) and ON MATCH SET (properties to set only when matching an existing node). Together, these give you upsert behavior: insert if new, update if existing.
Critical gotcha: MERGE matches the ENTIRE pattern, including relationships. If you MERGE (a)-[:FRIEND]->(b) and the nodes exist but the relationship doesn't, MERGE creates the relationship. But if only (a) exists and (b) doesn't, MERGE creates both (a) and (b) — it doesn't look up (b) by name first. This is the most common source of MERGE confusion.
// io.thecodeforge: MERGE — find or create a node // First run: creates the node. Second run: finds it. MERGE (p:Project {name: 'ForgeAPI'}) ON CREATE SET p.created_at = datetime(), p.version = '1.0', p.status = 'active' ON MATCH SET p.last_accessed = datetime() RETURN p.name, p.version, p.status; // io.thecodeforge: MERGE — idempotent relationship creation // Prevents duplicate FRIEND relationships between the same two people MATCH (alice:Person {name: 'Alice'}) MATCH (bob:Person {name: 'Bob'}) MERGE (alice)-[r:FRIEND]->(bob) ON CREATE SET r.since = date(), r.source = 'manual' RETURN alice.name, type(r) AS relationship, bob.name, r.since; // io.thecodeforge: MERGE with unique constraint — production pattern // Requires: CREATE CONSTRAINT unique_email FOR (u:User) REQUIRE u.email IS UNIQUE MERGE (user:User {email: 'dana@thecodeforge.io'}) ON CREATE SET user.name = 'Dana', user.created_at = datetime(), user.signup_source = 'organic' ON MATCH SET user.last_login = datetime() RETURN user.name, user.email, user.created_at; // io.thecodeforge: MERGE on a relationship path — create if not exists // This is the upsert pattern for connection-based data MATCH (product:Product {sku: 'WIDGET-PRO'}) MATCH (category:Category {name: 'Electronics'}) MERGE (product)-[r:IN_CATEGORY]->(category) ON CREATE SET r.assigned_at = datetime(), r.assigned_by = 'system' RETURN product.name, type(r), category.name; // io.thecodeforge: MERGE for ETL — batch upsert from external data // Load a list of users and upsert each one UNWIND [ {email: 'alice@thecodeforge.io', name: 'Alice', score: 95}, {email: 'bob@thecodeforge.io', name: 'Bob', score: 87}, {email: 'newuser@thecodeforge.io', name: 'Eve', score: 92} ] AS row MERGE (u:User {email: row.email}) ON CREATE SET u.name = row.name, u.score = row.score, u.first_seen = datetime() ON MATCH SET u.score = row.score, u.last_updated = datetime() RETURN u.email, u.name, u.score;
│"u.email" │"u.name" │"u.score"│"u.first_seen" │
╞══════════════════════════╪══════════╪═════════╪═════════════════════╡
│"alice@thecodeforge.io" │"Alice" │95 │2024-01-10T08:00:00Z │
│"bob@thecodeforge.io" │"Bob" │87 │2024-01-10T08:00:01Z │
│"newuser@thecodeforge.io" │"Eve" │92 │2024-03-15T14:30:00Z │
└──────────────────────────┴──────────┴─────────┴─────────────────────┘
WITH: Chaining Subqueries and Aggregation Pipelines
WITH is the pipe operator of Cypher — it passes results from one part of the query to the next. Think of it as the equivalent of a subquery boundary or a GROUP BY + HAVING in SQL. Every clause after WITH operates on the results of the clause before it.
WITH has three critical uses: 1. Aggregation pipelines — aggregate data, then filter or sort on the aggregated result. In SQL, you'd use HAVING. In Cypher, you aggregate in one WITH, then filter in the next WHERE. 2. Query chaining — break a complex query into readable stages. Each WITH acts as a checkpoint. 3. Limiting scope — pass only specific variables forward, which the engine uses to optimize the query plan.
The rule: any variable not listed in the WITH clause is no longer available downstream. This is deliberate — it lets the query planner know exactly what data flows forward, enabling optimizations.
// io.thecodeforge: WITH for aggregation — find prolific contributors // Count projects per developer, then filter on the count MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project) WITH dev, count(proj) AS project_count WHERE project_count >= 3 RETURN dev.name AS developer, project_count ORDER BY project_count DESC; // io.thecodeforge: WITH for query chaining — multi-stage filtering // Stage 1: Find active projects // Stage 2: Find developers on those projects // Stage 3: Aggregate and filter MATCH (proj:Project {status: 'active'}) WITH proj MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj) WITH dev, collect(proj.name) AS projects, count(proj) AS num_projects WHERE num_projects >= 2 RETURN dev.name, projects, num_projects ORDER BY num_projects DESC; // io.thecodeforge: WITH with LIMIT — process in batches // Find the 100 most recent users, then do expensive operations on them MATCH (user:User) WITH user ORDER BY user.created_at DESC LIMIT 100 MATCH (user)-[:PLACED]->(order:Order) WITH user, count(order) AS order_count, sum(order.total) AS total_spent WHERE total_spent > 500 RETURN user.name, user.email, order_count, total_spent ORDER BY total_spent DESC; // io.thecodeforge: WITH as a scope boundary — only pass what you need MATCH (dev:Developer)-[:CONTRIBUTES_TO]->(proj:Project) WITH dev.name AS name, dev.language AS lang, count(proj) AS projects // dev is no longer available here — only name, lang, projects WHERE lang = 'Java' RETURN name, lang, projects ORDER BY projects DESC; // io.thecodeforge: WITH + COLLECT — build lists per group MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product) WITH cat, collect(product.name) AS products, count(product) AS product_count WHERE product_count >= 5 RETURN cat.name AS category, products, product_count ORDER BY product_count DESC;
│"category" │"products" │"product_count"│
╞══════════════╪══════════════════════════════════════════════╪═══════════════╡
│"Electronics" │["Widget Pro","Gadget X","Sensor V2",...] │12 │
│"Clothing" │["Tee Basic","Hoodie Dev","Cap Logo",...] │8 │
└──────────────┴──────────────────────────────────────────────┴───────────────┘
count(). The aggregation result must pass through a WITH (or RETURN) first. MATCH (dev)-[:CONTRIBUTES_TO]->(proj) WITH dev, count(proj) AS cnt WHERE cnt > 3 works. MATCH (dev)-[:CONTRIBUTES_TO]->(proj) WHERE count(proj) > 3 does NOT work — it's a syntax error. This is the Cypher equivalent of SQL's HAVING vs WHERE distinction.Aggregation Functions: COUNT, COLLECT, SUM, AVG, and Friends
Cypher supports the standard set of aggregation functions: count(), collect(), sum(), avg(), min(), max(), percentileCont(), percentileDisc(), and stDev(). The two you'll use most are count() and collect().
count() has two forms: count(n) counts non-null values of n, count() counts all rows regardless of nulls. This distinction matters — count(user) skips rows where user is null (like from an OPTIONAL MATCH), while count() includes them.
collect() is the most powerful aggregation function in Cypher — it gathers all values into a list. This is how you build nested results: group by a category, collect the items in that category into a list, return the category with its items. It replaces SQL's GROUP_CONCAT and subqueries for building nested JSON responses.
All aggregation functions ignore null values except count(*).
// io.thecodeforge: COUNT — two forms matter // count(*) counts all rows including nulls // count(order) counts only non-null orders MATCH (user:User) OPTIONAL MATCH (user)-[:PLACED]->(order:Order) RETURN user.name, count(*) AS total_rows, // includes users with no orders count(order) AS orders_placed // excludes null orders ORDER BY orders_placed DESC; // io.thecodeforge: COLLECT — build nested lists // Group products by category, return category with its products MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product) WITH cat.name AS category, collect(product) AS products RETURN category, size(products) AS count, products ORDER BY count DESC; // io.thecodeforge: COLLECT with property extraction — cleaner output // Collect just the names, not full node objects MATCH (proj:Project)<-[:CONTRIBUTES_TO]-(dev:Developer) WITH proj.name AS project, collect(dev.name) AS contributors RETURN project, contributors ORDER BY size(contributors) DESC; // io.thecodeforge: SUM and AVG — numeric aggregations MATCH (order:Order)-[:CONTAINS]->(item:OrderItem) WITH order, sum(item.quantity * item.unit_price) AS order_total RETURN avg(order_total) AS avg_order_value, min(order_total) AS min_order, max(order_total) AS max_order, sum(order_total) AS total_revenue, count(order) AS total_orders; // io.thecodeforge: PERCENTILE — distribution analysis MATCH (user:User)-[:PLACED]->(order:Order) WITH user, count(order) AS order_count RETURN percentileCont(order_count, 0.5) AS median_orders, percentileCont(order_count, 0.95) AS p95_orders, percentileCont(order_count, 0.99) AS p99_orders, avg(order_count) AS mean_orders; // io.thecodeforge: COLLECT for top-N per group // Top 3 products per category by rating MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product) WITH cat, product ORDER BY product.rating DESC WITH cat.name AS category, collect(product)[0..3] AS top_products RETURN category, top_products;
│"avg_order_value"│"min_order" │"max_order"│"total_revenue"│"total_orders" │
╞════════════════╪═════════════╪══════════╪══════════════════╪═══════════════╡
│73.42 │12.99 │892.50 │1,847,293.00 │25,162 │
└────────────────┴─────────────┴──────────┴──────────────────┴───────────────┘
ROW_NUMBER(). It gives you the top N items per group without a subquery. This is the standard pattern for leaderboards, top products per category, and most-active users per team.Variable-Length Paths: Traversing Multiple Hops
Variable-length paths let you traverse a variable number of relationships in one pattern. Instead of writing out each hop explicitly, you specify a range: [1..3] means 1 to 3 hops, [] means any number of hops, [*3] means exactly 3 hops.
This is the feature that makes graph databases shine for social networks (friends of friends), supply chains (supplier of supplier), and recommendation engines (users who bought what you bought bought what?).
Critical warning: unbounded paths ([*] with no upper limit) will explore every reachable node in the graph and can cause out-of-memory errors on large graphs. ALWAYS set a maximum depth. For most use cases, 1-5 hops is sufficient. Beyond 6 hops, the results are usually meaningless (everyone on Facebook is within 4.7 hops of everyone else).
Use shortestPath() and allShortestPaths() when you specifically want the shortest route between two nodes, not all possible paths.
// io.thecodeforge: Friends of friends — 2 hops exactly MATCH (me:Person {name: 'Alice'})-[:FRIEND*2]->(fof) WHERE fof <> me // exclude Alice herself RETURN DISTINCT fof.name AS friend_of_friend ORDER BY fof.name; // io.thecodeforge: Friends up to 3 levels deep — 1 to 3 hops MATCH (me:Person {name: 'Alice'})-[:FRIEND*1..3]->(connection) WHERE connection <> me RETURN DISTINCT connection.name AS connection, min(length(shortestPath((me)-[:FRIEND*]->(connection)))) AS degrees ORDER BY degrees, connection.name; // io.thecodeforge: Supply chain — find suppliers of suppliers MATCH (product:Product {name: 'Widget Pro'})-[:SUPPLIED_BY*1..4]->(supplier) RETURN product.name AS product, supplier.name AS ultimate_supplier, supplier.country AS origin_country; // io.thecodeforge: Shortest path between two people MATCH path = shortestPath( (alice:Person {name: 'Alice'})-[:FRIEND*]-(bob:Person {name: 'Bob'}) ) RETURN [n IN nodes(path) | n.name] AS path_names, length(path) AS degrees_of_separation; // io.thecodeforge: All shortest paths (if multiple equally-short routes exist) MATCH path = allShortestPaths( (alice:Person {name: 'Alice'})-[:FRIEND*]-(bob:Person {name: 'Bob'}) ) RETURN [n IN nodes(path) | n.name] AS path_names, length(path) AS hops; // io.thecodeforge: Bounded traversal with filtering at each hop // Find paths where every intermediate person is over 25 MATCH (alice:Person {name: 'Alice'})-[:FRIEND*1..3]->(target:Person) WHERE all(node IN nodes(path) WHERE node.age >= 25 OR node = alice OR node = target) RETURN DISTINCT target.name, length(path) AS hops;
│"path_names" │"degrees_of_separation"│
╞════════════════════════╪═══════════════════════╡
│["Alice","Charlie","Bob"]│2 │
└────────────────────────┴───────────────────────┘
UNWIND: Working with Lists and Expanding Collections
UNWIND takes a list and expands it into individual rows — the opposite of collect(). If collect() gathers rows into a list, UNWIND spreads a list into rows. This is essential for processing batch data, transforming aggregated results back into individual records, and working with JSON arrays passed from application code.
The most common production use: UNWIND a parameter list to create or merge multiple nodes in one query. Instead of executing 100 separate CREATE statements (100 round trips), you UNWIND a list of 100 items and create all nodes in one query (1 round trip).
UNWIND + MERGE is the standard ETL pattern for loading data into Neo4j.
// io.thecodeforge: UNWIND — expand a list into rows UNWIND ['Alice', 'Bob', 'Charlie'] AS name MERGE (p:Person {name: name}) RETURN p.name; // io.thecodeforge: UNWIND for batch creation — 1 query, 100 nodes // In production, $users comes from a JSON request body UNWIND $users AS user_data MERGE (u:User {email: user_data.email}) ON CREATE SET u.name = user_data.name, u.created_at = datetime() RETURN count(u) AS users_processed; // io.thecodeforge: UNWIND to create relationships in batch // Connect a list of skills to a developer in one query MATCH (dev:Developer {name: 'Alice'}) UNWIND ['Java', 'Spring Boot', 'Neo4j', 'Kubernetes'] AS skill_name MERGE (s:Skill {name: skill_name}) MERGE (dev)-[:HAS_SKILL {level: 'expert'}]->(s) RETURN dev.name, collect(s.name) AS skills; // io.thecodeforge: UNWIND after COLLECT — flatten and reprocess // Find all categories, collect their products, then flatten for analysis MATCH (cat:Category)<-[:IN_CATEGORY]-(product:Product) WITH cat, collect(product) AS products UNWIND products AS product RETURN cat.name AS category, product.name AS product, product.price AS price ORDER BY cat.name, product.price DESC; // io.thecodeforge: UNWIND with index — track position in the list UNWIND ['alpha', 'beta', 'gamma', 'delta'] AS item WITH item, range(0, 3)[indexOf(['alpha','beta','gamma','delta'], item)] AS idx MERGE (n:Item {name: item}) SET n.sort_order = idx RETURN n.name, n.sort_order ORDER BY n.sort_order; // io.thecodeforge: UNWIND for data cleanup — split comma-separated values MATCH (legacy:LegacyRecord) WHERE legacy.tags IS NOT NULL WITH legacy, split(legacy.tags, ',') AS tag_list UNWIND tag_list AS tag MERGE (t:Tag {name: trim(tag)}) MERGE (legacy)-[:TAGGED_WITH]->(t) RETURN count(DISTINCT legacy) AS records_processed, count(DISTINCT t) AS tags_created;
│"records_processed" │"tags_created" │
╞═══════════════════════╪═════════════════╡
│4,291 │187 │
└───────────────────────┴─────────────────┘
Subqueries with CALL: Encapsulating Complex Logic
CALL introduces subqueries in Cypher — self-contained query blocks that execute independently and return results to the outer query. There are two forms: CALL {} for inline subqueries (like SQL subqueries) and CALL procedure.name() for invoking stored procedures (like stored procedures in SQL).
CALL {} subqueries are essential for isolating aggregation, avoiding variable scope conflicts, and breaking complex queries into readable chunks. Each CALL {} block has its own scope — variables must be explicitly passed in and out.
CALL procedure is how you invoke APOC procedures, Graph Data Science algorithms, and custom procedures. The syntax is CALL db.indexes() YIELD indexName, state RETURN indexName.
// io.thecodeforge: CALL {} — inline subquery for isolated aggregation // Find developers, and for each, count their projects in a subquery MATCH (dev:Developer) CALL { WITH dev MATCH (dev)-[:CONTRIBUTES_TO]->(proj:Project) RETURN count(proj) AS project_count, collect(proj.name) AS project_names } RETURN dev.name, project_count, project_names ORDER BY project_count DESC; // io.thecodeforge: CALL {} — find the most recent order per user MATCH (user:User) CALL { WITH user MATCH (user)-[:PLACED]->(order:Order) RETURN order ORDER BY order.created_at DESC LIMIT 1 } RETURN user.name, order.order_id, order.total, order.created_at ORDER BY order.created_at DESC; // io.thecodeforge: CALL db.indexes — list all indexes in the database CALL db.indexes() YIELD name, type, entityType, labelsOrTypes, properties, state RETURN name, type, entityType, labelsOrTypes, properties, state ORDER BY entityType, name; // io.thecodeforge: CALL db.schema.visualization — see the graph schema CALL db.schema.visualization() YIELD nodes, relationships RETURN nodes, relationships; // io.thecodeforge: CALL apoc (if APOC is installed) — advanced procedures // Export a subgraph to JSON for backup or migration CALL apoc.export.json.query( 'MATCH (u:User)-[:PLACED]->(o:Order) RETURN u, o LIMIT 100', null, {stream: true} ) YIELD data RETURN data; // io.thecodeforge: CALL with UNION — combine results from multiple patterns MATCH (dev:Developer {name: 'Alice'}) CALL { WITH dev MATCH (dev)-[:CONTRIBUTES_TO]->(proj:Project) RETURN proj.name AS item, 'project' AS type UNION WITH dev MATCH (dev)-[:HAS_SKILL]->(skill:Skill) RETURN skill.name AS item, 'skill' AS type } RETURN dev.name, collect({name: item, type: type}) AS contributions;
│"dev.name"│"project_count"│"project_names" │
╞══════════╪══════════════╪═══════════════════════════════════════════╡
│"Alice" │4 │["ForgeAPI","DataPipeline","AuthSvc","..."│
│"Charlie" │3 │["DataPipeline","ETLRunner","Monitoring"] │
│"Dana" │1 │["ForgeAPI"] │
└──────────┴──────────────┴───────────────────────────────────────────┘
Indexes and Constraints: Schema Management for Performance
Neo4j is often described as 'schema-optional,' but that doesn't mean you should skip indexing. Without indexes, every MATCH on a property scans every node with that label — O(n). With a range index, the same MATCH uses a B-tree lookup — O(log n). With a uniqueness constraint, you get both an index AND a guarantee that no duplicates can be created.
Three types of indexes: 1. Range index — the default, supports equality and range queries (=, <, >, IN). CREATE INDEX FOR (n:Label) ON (n.property) 2. Text index — optimized for CONTAINS, ENDS WITH, and full-text search. CREATE TEXT INDEX FOR (n:Label) ON (n.property) 3. Point index — for geospatial queries. CREATE POINT INDEX FOR (n:Label) ON (n.location)
Constraints enforce data integrity: 1. Uniqueness — no two nodes with the same label can have the same property value. CREATE CONSTRAINT FOR (n:User) REQUIRE n.email IS UNIQUE 2. Node key — uniqueness + existence (the property must exist and be unique). CREATE CONSTRAINT FOR (n:User) REQUIRE (n.email, n.tenant_id) IS NODE KEY 3. Existence — the property must exist (not null). CREATE CONSTRAINT FOR (n:User) REQUIRE n.email IS NOT NULL
Production rule: create uniqueness constraints on every entity's primary identifier before loading data. This prevents duplicates at the database level and gives you a free index for lookups.
// io.thecodeforge: Create a range index — for fast property lookups CREATE INDEX user_email_index FOR (u:User) ON (u.email); CREATE INDEX product_sku_index FOR (p:Product) ON (p.sku); CREATE INDEX order_status_index FOR (o:Order) ON (o.status); // io.thecodeforge: Create a composite index — for queries that filter on multiple props CREATE INDEX user_tenant_email_index FOR (u:User) ON (u.tenant_id, u.email); // io.thecodeforge: Create a text index — for CONTAINS and ENDS WITH queries CREATE TEXT INDEX product_name_text_index FOR (p:Product) ON (p.name); // io.thecodeforge: Create a uniqueness constraint — prevents duplicate entities CREATE CONSTRAINT user_email_unique FOR (u:User) REQUIRE u.email IS UNIQUE; CREATE CONSTRAINT product_sku_unique FOR (p:Product) REQUIRE p.sku IS UNIQUE; CREATE CONSTRAINT order_id_unique FOR (o:Order) REQUIRE o.order_id IS UNIQUE; // io.thecodeforge: Create a node key — composite uniqueness + existence CREATE CONSTRAINT user_tenant_key FOR (u:User) REQUIRE (u.email, u.tenant_id) IS NODE KEY; // io.thecodeforge: Create an existence constraint — property must exist CREATE CONSTRAINT user_email_exists FOR (u:User) REQUIRE u.email IS NOT NULL; // io.thecodeforge: List all indexes and their status SHOW INDEXES YIELD name, type, entityType, labelsOrTypes, properties, state ORDER BY entityType, name; // io.thecodeforge: List all constraints SHOW CONSTRAINTS YIELD name, type, entityType, labelsOrTypes, properties ORDER BY entityType, name; // io.thecodeforge: Drop an index (if you need to recreate it) DROP INDEX user_email_index; // io.thecodeforge: Drop a constraint DROP CONSTRAINT user_email_unique;
│"name" │"type" │"entityType"│"labelsOrTypes"│"properties" │"state" │
╞══════════════════════════╪═════════╪════════════╪═══════════════╪══════════════╪═════════╡
│"user_email_index" │"RANGE" │"NODE" │["User"] │["email"] │"ONLINE" │
│"product_sku_index" │"RANGE" │"NODE" │["Product"] │["sku"] │"ONLINE" │
│"order_status_index" │"RANGE" │"NODE" │["Order"] │["status"] │"ONLINE" │
└──────────────────────────┴─────────┴────────────┴───────────────┴──────────────┴─────────┘
Real-World Patterns: Recommendation Engine, Fraud Detection, and Social Graph
The value of Cypher becomes clear when you see it applied to real problems. Here are three production patterns that demonstrate why graph databases outperform relational databases for connected data.
Recommendation Engine — 'Customers who bought X also bought Y.' In SQL, this requires a self-join on the purchases table with a GROUP BY, which gets exponentially slower as purchase history grows. In Cypher, it's a two-hop traversal: find everyone who bought the same product, then find what else they bought.
Fraud Detection — Find rings of accounts that share addresses, phone numbers, or devices. In SQL, this requires multiple joins across identity tables. In Cypher, you match a pattern where multiple accounts connect to the same shared resource.
Social Graph — Find the shortest path between two people, suggest connections based on mutual friends, identify communities. These are all natural graph traversals that Cypher handles natively.
// io.thecodeforge: RECOMMENDATION ENGINE // 'Customers who bought this also bought' — collaborative filtering MATCH (customer:Customer)-[:PURCHASED]->(product:Product {sku: 'WIDGET-PRO'}) <-[:PURCHASED]-(other:Customer)-[:PURCHASED]->(recommended:Product) WHERE recommended <> product // exclude the product they already bought AND NOT (customer)-[:PURCHASED]->(recommended) // exclude what they already own RETURN recommended.name AS product, recommended.sku AS sku, count(other) AS bought_by_others ORDER BY bought_by_others DESC LIMIT 5; // io.thecodeforge: FRAUD DETECTION — shared identity signals // Find accounts that share a phone number OR address — potential fraud ring MATCH (a1:Account)-[:HAS_PHONE]->(phone:Phone)<-[:HAS_PHONE]-(a2:Account) WHERE a1 <> a2 WITH a1, a2, collect(DISTINCT phone.number) AS shared_phones MATCH (a1)-[:HAS_ADDRESS]->(addr:Address)<-[:HAS_ADDRESS]-(a2) WITH a1, a2, shared_phones, collect(DISTINCT addr.full_address) AS shared_addresses WHERE size(shared_phones) >= 1 AND size(shared_addresses) >= 1 RETURN a1.account_id AS account_1, a2.account_id AS account_2, shared_phones, shared_addresses, size(shared_phones) + size(shared_addresses) AS risk_score ORDER BY risk_score DESC LIMIT 20; // io.thecodeforge: SOCIAL GRAPH — mutual friends suggestion MATCH (me:Person {name: 'Alice'})-[:FRIEND]->(mutual:Person)-[:FRIEND]->(suggestion:Person) WHERE NOT (me)-[:FRIEND]->(suggestion) // not already friends AND suggestion <> me // not yourself RETURN suggestion.name AS suggested_friend, count(mutual) AS mutual_friends, collect(mutual.name) AS through ORDER BY mutual_friends DESC LIMIT 10; // io.thecodeforge: SOCIAL GRAPH — community detection (simplified) // Find groups of 3+ people who are all friends with each other MATCH (a:Person)-[:FRIEND]->(b:Person)-[:FRIEND]->(c:Person)-[:FRIEND]->(a) WHERE id(a) < id(b) AND id(b) < id(c) // avoid counting same triangle multiple times RETURN a.name, b.name, c.name AS triangle ORDER BY a.name; // io.thecodeforge: KNOWLEDGE GRAPH — find all entities connected to a concept MATCH (concept:Concept {name: 'Microservices'})-[:RELATED_TO*1..2]->(related) WHERE related <> concept RETURN DISTINCT labels(related)[0] AS type, related.name AS entity, related.description AS description ORDER BY type, entity LIMIT 20;
│"product" │"sku" │"bought_by_others" │
╞═══════════════════╪══════════╪═══════════════════════════════════════╡
│"Gadget X" │"GADG-001"│847 │
│"Sensor V2" │"SENS-002"│623 │
│"Widget Mini" │"WIDM-003"│412 │
│"Power Pack" │"POWR-004"│298 │
│"Cable Pro" │"CABL-005"│156 │
└───────────────────┴──────────┴───────────────────────────────────────┘
PROFILE and EXPLAIN: Query Performance Analysis
Every query you write for production should be profiled first. Neo4j provides two tools: EXPLAIN shows the execution plan without running the query (fast, safe for production). PROFILE runs the query and shows the execution plan with actual row counts and database hits (slower, shows real performance).
The execution plan is a tree of operators: NodeByLabelScan (scans all nodes with a label), NodeIndexSeek (uses an index), Expand (traverses a relationship), Filter (applies WHERE conditions), ProduceResult (returns data). Your goal: minimize NodeByLabelScan operations and maximize NodeIndexSeek operations.
The 'Eager' operator is a red flag — it means the engine had to load all intermediate results into memory before continuing. This happens with certain aggregation patterns and can cause memory issues on large graphs. If you see Eager, consider restructuring the query.
Database hits are the unit of work — fewer hits = faster query. A query with 100 database hits is fast. A query with 10 million database hits is slow, regardless of how many rows it returns.
// io.thecodeforge: EXPLAIN — show plan without executing (safe for production) EXPLAIN MATCH (dev:Developer {language: 'Java'})-[:CONTRIBUTED_TO]->(proj:Project) WHERE proj.status = 'active' RETURN dev.name, proj.name; // io.thecodeforge: PROFILE — execute and show real metrics PROFILE MATCH (dev:Developer {language: 'Java'})-[:CONTRIBUTED_TO]->(proj:Project) WHERE proj.status = 'active' RETURN dev.name, proj.name; // io.thecodeforge: BAD QUERY — no label, forces full scan // This scans every node in the database EXPLAIN MATCH (n) WHERE n.email = 'alice@thecodeforge.io' RETURN n; // io.thecodeforge: GOOD QUERY — uses label + index // This uses the User label index and the email property index EXPLAIN MATCH (u:User {email: 'alice@thecodeforge.io'}) RETURN u; // io.thecodeforge: Identifying the Eager operator // If you see Eager in the plan, the engine is buffering all results EXPLAIN MATCH (order:Order) WITH order ORDER BY order.created_at DESC MATCH (order)-[:CONTAINS]->(item:OrderItem) RETURN order, collect(item) AS items; // io.thecodeforge: Checking if an index is being used // Run this after creating an index to verify the plan uses it EXPLAIN MATCH (u:User {email: 'test@thecodeforge.io'}) RETURN u.name; // If the plan shows NodeIndexSeek, the index is working. // If it shows NodeByLabelScan, the index isn't being used — check your query.
+------------------+----------------+---------+-----------+
| Operator | Estimated Rows | DB Hits | Memory (Bytes) |
+------------------+----------------+---------+-----------+
| ProduceResult | 1 | 0 | 0 |
| +NodeIndexSeek | 1 | 4 | 120 |
+------------------+----------------+---------+-----------+
Total database hits: 4 ← This is fast.
PROFILE output for the bad query:
+----------------------+----------------+-----------+---------------+
| Operator | Estimated Rows | DB Hits | Memory (Bytes)|
+----------------------+----------------+-----------+---------------+
| ProduceResult | 1 | 0 | 0 |
| +Filter | 1 | 2,847,293 | 480 |
| +NodeByLabelScan | 2,847,293 | 2,847,294 | 32,768 |
+----------------------+----------------+-----------+---------------+
Total database hits: 5,694,587 ← This is slow. Add an index.
| Aspect | Relational (SQL) | Graph (Cypher) |
|---|---|---|
| Data Model | Tables and Rows | Nodes and Relationships |
| Relationship Logic | Foreign Keys & Joins | Direct Traversal (Pointers) |
| Query Syntax | Verbose SELECT/JOIN | ASCII-Art Pattern Matching |
| Performance at Depth | Degrades exponentially with join depth | Constant for local traversals (index-free adjacency) |
| Schema | Rigid — ALTER TABLE for changes | Schema-optional — add properties without migration |
| Many-to-Many | Requires junction table | Native — just draw multiple relationships |
| Aggregation | GROUP BY + HAVING | WITH + WHERE (after aggregation) |
| Subqueries | Nested SELECT / CTE | CALL {} blocks |
| Batch Operations | INSERT INTO ... SELECT | UNWIND $list MERGE ... |
| Full-Text Search | Full-Text Index / LIKE '%x%' | TEXT INDEX + CONTAINS / apoc.text |
🎯 Key Takeaways
- Cypher is a pattern-based query language — you describe the SHAPE of your data using ASCII art, and the engine finds every instance of that shape in the graph. Think in nodes and paths, not tables and joins.
- CREATE always creates new nodes, even if duplicates exist. MERGE checks for existence first and creates only if missing. Use MERGE for entities, CREATE for events.
- Always use labels in your MATCH patterns — matching without a label forces a full database scan (O(n)), while matching with a label uses the index (O(log n)).
- WITH is the pipe operator of Cypher — it chains query stages, enables aggregation filtering (the HAVING equivalent), and limits variable scope for query optimization.
- Variable-length paths ([1..5]) are Cypher's superpower for social networks, supply chains, and recommendation engines — but NEVER use unbounded paths ([]) in production.
- UNWIND expands lists into rows — use it with MERGE for batch operations that process thousands of records in a single database round trip.
- Profile every production query with EXPLAIN or PROFILE. Look for NodeIndexSeek (good) vs NodeByLabelScan (bad). If you see Eager, restructure the query.
- Create uniqueness constraints on every entity's primary identifier before loading data — this prevents duplicates at the database level and gives you a free index.
- collect() gathers rows into a list, UNWIND spreads a list into rows. Together they're the foundation of nested result structures and batch processing in Cypher.
- Soft deletes (setting a 'deleted' property) are almost always better than hard deletes (DETACH DELETE) in production — hard deletes break graph integrity.
- Cypher's recommendation engine query is just two hops: (me)-[:PURCHASED]->(product)<-[:PURCHASED]-(other)-[:PURCHASED]->(rec). In SQL, this is a 15-line self-join. In Cypher, it's a single pattern.
- Use keyset pagination (WHERE created_at < $cursor LIMIT 10) instead of SKIP-based pagination (SKIP 100000 LIMIT 10) — SKIP forces the engine to discard rows, keyset jumps directly to the right position.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QExplain 'Index-free Adjacency' and how it makes Cypher traversals constant-time regardless of database size. How does this differ from SQL's approach to relationships?
- QWhat's the difference between CREATE and MERGE? When would you use each, and what happens if you run the same MERGE statement twice?
- QHow does the WITH clause work in Cypher? Give an example where you aggregate data and then filter on the aggregated result — what's the SQL equivalent?
- QWhat is a variable-length path query? Write a Cypher query to find all 'friends of friends' up to 3 levels deep, and explain why you should never use an unbounded path in production.
- QHow do you profile a Cypher query? What does NodeIndexSeek vs NodeByLabelScan tell you, and what does the 'Eager' operator signify?
- QExplain the difference between MATCH and OPTIONAL MATCH. In what scenario would you use OPTIONAL MATCH, and how does it relate to SQL's LEFT JOIN?
- QHow does MERGE handle race conditions when two concurrent transactions try to create the same node? What role do uniqueness constraints play?
- QWhat is the UNWIND clause and how would you use it to batch-insert 10,000 records in a single query? Why is this faster than executing 10,000 individual CREATE statements?
- QDesign a Cypher query for a recommendation engine that finds 'customers who bought X also bought Y.' Explain why this is faster in a graph database than in SQL.
- QYou have a graph with 50 million nodes and a query that's taking 30 seconds. Walk me through your debugging process — what do you check first, and what are the most common causes of slow Cypher queries?
Frequently Asked Questions
Does Cypher support ACID transactions like SQL?
Yes, Neo4j is fully ACID-compliant. Every Cypher query runs within a transaction — if any part of the query fails, the entire transaction is rolled back. For multi-statement transactions, you can use BEGIN and COMMIT explicitly through the driver API. Neo4j also uses optimistic locking: if two transactions modify the same node concurrently, the second one fails with a deadlock exception and must be retried. In your application code, wrap Cypher calls in a retry loop with exponential backoff.
What is the difference between CREATE and MERGE?
CREATE always creates a new node or relationship, even if an identical one already exists — this leads to duplicates. MERGE checks if the pattern exists first: if it does, MERGE matches it (does nothing); if it doesn't, MERGE creates it. MERGE is the idempotent version of CREATE. Use CREATE for event-based data (orders, log entries) where each record is inherently unique. Use MERGE for entity-based data (users, products) where the same entity might be referenced multiple times. MERGE also supports ON CREATE SET and ON MATCH SET for upsert behavior.
How do I handle many-to-many relationships in Cypher?
Many-to-many relationships are natural in graph databases — no junction table needed. Just create direct relationships between the nodes: (student:Student)-[:ENROLLED_IN]->(course:Course). If a student is enrolled in 5 courses, they have 5 direct relationships. If a course has 200 students, it has 200 incoming relationships. Querying is a simple pattern match: MATCH (s:Student)-[:ENROLLED_IN]->(c:Course) RETURN s.name, c.name. No join tables, no foreign keys, no performance degradation as the dataset grows.
What is the 'ASCII-art' syntax exactly?
Cypher uses text characters to represent graph elements. Parentheses (node) represent a node (they look like circles). Brackets -[rel:TYPE]-> represent a directed relationship (the arrow shows direction). Combining them: (a:Person)-[:FRIEND]->(b:Person) visually represents 'a Person node connected by a FRIEND relationship to another Person node.' This makes the query look like the data it queries — if you can draw it on a whiteboard, you can write it in Cypher.
Can I use Cypher to query non-graph data?
Cypher is optimized for graph traversals — using it for flat, non-connected data (like a simple key-value store) is inefficient. If your data has no relationships or the relationships are trivial (one-to-many without traversal needs), a relational database or document store is simpler and faster. Cypher's advantage kicks in when you need to traverse relationships: friends of friends, supply chain hops, recommendation paths, fraud ring detection. If your queries are mostly 'give me row X where column Y = Z,' you don't need a graph.
How do I import data from CSV into Neo4j?
Use LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row to read CSV files. The WITH HEADERS clause treats the first row as column names, accessible as row.column_name. Combine with MERGE for idempotent imports: LOAD CSV WITH HEADERS FROM 'file:///users.csv' AS row MERGE (u:User {email: row.email}) SET u.name = row.name. For large files (>100K rows), use PERIODIC COMMIT (Neo4j 4.x) or CALL { } IN TRANSACTIONS (Neo4j 5.x) to batch the writes and avoid running out of memory.
How do I paginate results efficiently in Cypher?
Don't use SKIP for large offsets — SKIP 100000 LIMIT 10 forces the engine to process and discard 100,000 rows. Use keyset pagination instead: pass the last seen value and filter on it. For example: MATCH (u:User) WHERE u.created_at < $last_seen_date RETURN u ORDER BY u.created_at DESC LIMIT 10. This lets the engine use the index to jump directly to the right position. Same pagination result, 1000x faster on large datasets. Store $last_seen_date from the last item of the previous page.
What is the WITH clause and when do I need it?
WITH is Cypher's pipe operator — it passes results from one query stage to the next. You need it in three situations: (1) after aggregation to filter on the aggregated result (like SQL's HAVING), (2) to chain multiple query stages for readability, and (3) to limit which variables pass forward (the engine uses this for optimization). Any variable not listed in the WITH clause becomes unavailable downstream. Example: MATCH (dev)-[:CONTRIBUTES_TO]->(proj) WITH dev, count(proj) AS cnt WHERE cnt > 3 — the count must pass through a WITH before you can filter on it.
How do I optimize a slow Cypher query?
Run PROFILE on the query and look for three things: (1) NodeByLabelScan — the engine is scanning all nodes with a label instead of using an index. Fix: create an index on the filtered property. (2) Eager operator — the engine is buffering all intermediate results in memory. Fix: restructure the query to reduce intermediate result size. (3) High database hits — the engine is doing too much work. Fix: start your MATCH from the most selective node (the one with the fewest matches), use labels everywhere, and avoid unbounded variable-length paths.
Can I use Cypher with Spring Data Neo4j?
Yes, Spring Data Neo4j (SDN) is the official integration. You define entity classes with @Node annotations, repositories that extend Neo4jRepository, and custom queries with @Query("cypher here"). SDN handles object-graph mapping — it converts Neo4j nodes and relationships into Java objects and back. For complex queries, use @Query with native Cypher. For batch operations, use UNWIND $batch AS row MERGE ... in a @Query method. SDN works with both Neo4j Community and Enterprise editions.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.