Seaborn — NaN in Hue Silently Deletes Rows
Three NaN region values in a Seaborn hue column silently removed those rows from the chart, creating a false sales drop for that group.
- Seaborn is a high-level statistical visualization library built on Matplotlib
- It expects tidy data: one row per observation, one column per variable
- Figure-level functions (relplot, catplot) create FacetGrids and manage subplots automatically
- Axes-level functions (boxplot, scatterplot) give you direct Matplotlib control
- Performance: plotting 10k points takes ~0.5s; for larger datasets, sample or enable rasterization
- Production pitfall: NaN values in hue columns cause silent row drops, distorting group comparisons
- Biggest mistake: calling plt.title() after a figure-level function — the title lands on the wrong panel
Every data project hits the same wall: you have the numbers, but you can't see them. A DataFrame full of customer ages, purchase values, and churn flags is just a rectangle of digits until someone visualises it. Seaborn exists precisely for that moment — the moment between 'I have data' and 'I understand data'. It's used daily by data scientists at companies like Spotify and Airbnb to explore datasets before modelling and to communicate findings to non-technical stakeholders.
The real problem Seaborn solves isn't just aesthetics, though its defaults are beautiful. It solves the complexity problem. To draw a grouped box plot with error bars and a sensible colour palette in pure Matplotlib takes 40 lines and a lot of Stack Overflow. In Seaborn it takes three. More importantly, Seaborn understands the concept of 'tidy data' — it knows what a DataFrame is, it reads column names directly, and it maps statistical relationships onto visual properties automatically. That's a fundamentally different abstraction level.
By the end of this article you'll know which Seaborn chart to reach for in six real-world scenarios, why the Figure-level vs Axes-level distinction matters when you're building dashboards, how to customise without fighting the library, and the common mistakes that silently ruin charts for beginners. You'll also be ready to answer the Seaborn questions that come up in data analyst and data science interviews.
Seaborn's Mental Model: Tidy Data, Figure-Level vs Axes-Level
Before you write a single line of Seaborn, you need to understand its two core assumptions, because breaking either one causes confusing bugs.
First: Seaborn expects tidy data. That means one observation per row and one variable per column. If your DataFrame has columns called 'Jan_Sales', 'Feb_Sales', 'Mar_Sales', Seaborn will fight you. The correct shape has a 'Month' column and a 'Sales' column — one row per month per product. Pandas' melt() function is your friend here.
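As a minimal sketch of the reshape described above, assuming a hypothetical wide table with a 'product' column and one column per month (all names and values are illustrative):

```python
import pandas as pd

# Hypothetical wide table: one column per month (illustrative values).
wide = pd.DataFrame({
    "product": ["A", "B"],
    "Jan_Sales": [100, 80],
    "Feb_Sales": [120, 90],
    "Mar_Sales": [130, 70],
})

# Reshape to tidy/long form: one row per product per month.
tidy = wide.melt(id_vars="product", var_name="Month", value_name="Sales")

# Strip the "_Sales" suffix so the Month column holds only the month name.
tidy["Month"] = tidy["Month"].str.replace("_Sales", "", regex=False)

print(tidy)
```

The resulting frame has six rows (two products times three months), which is the shape Seaborn's x=, y=, and hue= arguments expect.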
Second: Seaborn has two tiers of functions. Axes-level functions like histplot(), scatterplot(), and boxplot() draw onto a single Matplotlib Axes object — they behave like normal Matplotlib and you can combine them freely. Figure-level functions like displot(), relplot(), and catplot() create their own Figure and can produce multi-panel grids via a 'col=' or 'row=' argument. They return a FacetGrid object, not an Axes, which is why calling plt.title() on one produces the wrong result.
Knowing this split stops you spending an hour wondering why your title is in the wrong place or why subplots won't cooperate.
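The split can be seen in what each call returns. The sketch below uses synthetic data with made-up column names ('price', 'sales', 'channel') and a headless Matplotlib backend; it is an illustration under those assumptions, not the only way to set titles.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(5, 50, 100),
    "sales": rng.normal(100, 15, 100),
    "channel": np.tile(["Online", "Store"], 50),
})

# Axes-level: draws on the Axes you pass in and returns that Axes.
fig, ax = plt.subplots()
ax = sns.scatterplot(data=df, x="price", y="sales", ax=ax)
ax.set_title("Sales vs price")

# Figure-level: creates its own Figure and returns a FacetGrid.
grid = sns.relplot(data=df, x="price", y="sales", col="channel")
# Title the whole figure, not a single panel; y nudges it above the panels.
grid.figure.suptitle("Sales vs price by channel", y=1.03)
```

Note that grid.figure is a different Figure from fig, which is why pyplot-level calls after a figure-level function can end up acting on the wrong object.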
Choosing the Right Chart: Six Real-World Scenarios
The most common Seaborn mistake isn't bad syntax — it's reaching for the wrong chart. Here's the decision framework professionals actually use.
Distribution of a single numeric variable? Use histplot() with kde=True to overlay the density curve. It answers 'is this data normally distributed, skewed, or bimodal?' before you choose a statistical test.
Relationship between two numeric variables? scatterplot() with hue= for a third categorical dimension. Add a regression line with lmplot() when you want to communicate correlation to a non-technical audience.
Comparing a numeric variable across categories? boxplot() for showing spread and outliers, violinplot() when sample size is large enough to trust the density estimate (roughly n > 30 per group), and barplot() only when mean + uncertainty is the right summary.
Correlation across many numeric columns? heatmap() on a correlation matrix. This is the chart that identifies multicollinearity before you build a regression model.
Change over time? lineplot() with hue= for multiple groups. Seaborn automatically aggregates and draws confidence intervals when multiple observations exist per x value.
Distribution across two categorical dimensions? heatmap() on a pivot table, or pointplot() with both x= and hue= for overlapping line-point combos.
Customising Seaborn Without Fighting It — Themes, Palettes, and Matplotlib Escape Hatches
Seaborn's defaults are intentionally good. The trap beginners fall into is immediately overriding everything and ending up with something worse than the default. The right mental model is: let Seaborn do 80%, then use Matplotlib for the final 20%.
The sns.set_theme() call at the top of your script is the single most powerful line. It sets the background, grid style, font scale, and colour palette for every chart that follows. Choose from five styles: 'darkgrid', 'whitegrid', 'dark', 'white', and 'ticks'. For presentations use 'white'; for exploratory analysis 'whitegrid' helps you read values.
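A minimal sketch of that top-of-script call; the specific style, palette, and font scale chosen here are just one reasonable combination, not a recommendation from the library.

```python
import seaborn as sns

# Applies to every chart created after this line.
sns.set_theme(style="whitegrid", palette="colorblind", font_scale=1.1)

# axes_style() with no arguments reports the style parameters now in effect.
current = sns.axes_style()
print(current["axes.grid"])
```

Because set_theme() changes global Matplotlib settings, it is usually called once per script or notebook rather than before each chart.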
Colour palettes deserve real thought. The 'colorblind' palette is the professional default: it stays distinguishable for people with deuteranopia and protanopia (red-green colour vision deficiency affects about 8% of men). For sequential data (low to high) use 'Blues' or 'YlOrRd'. For diverging data (negative to positive, like correlations) use 'coolwarm' or 'RdBu_r'. Avoid rainbow colormaps such as Matplotlib's 'jet' or 'rainbow' for categorical data, because they imply an ordering where none exists.
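The three palette families can be requested by name through sns.color_palette(). This is a small sketch; the number of sequential colours (five) is an arbitrary choice for illustration.

```python
import seaborn as sns

# Categorical: a list-like palette of RGB tuples.
categorical = sns.color_palette("colorblind")

# Sequential: five steps from light to dark blue.
sequential = sns.color_palette("Blues", n_colors=5)

# Diverging: returned as a Matplotlib colormap, e.g. for heatmap(cmap=...).
diverging = sns.color_palette("coolwarm", as_cmap=True)

print(len(categorical), len(sequential))
```

Any of these can be passed to a plotting function through palette= (for hue groups) or cmap= (for heatmaps).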
For anything Seaborn can't do natively, you always have access to the underlying Matplotlib object. Axes-level functions return the Axes; figure-level functions expose their Figure via grid.figure and individual axes via grid.axes_dict.
Pairplots and FacetGrids — Exploring Entire Datasets in One Call
Once you've got individual charts under control, Seaborn's real superpower for exploratory data analysis is the multi-chart grid. Two functions deliver this: pairplot() and FacetGrid.
pairplot() is the tool you run on a new dataset before you do anything else. It draws every numeric column against every other numeric column — scatter plots off-diagonal, distributions on-diagonal — and colour-codes by a categorical variable. In five seconds you can see which pairs of features are linearly related, which ones cluster by class, and which ones are skewed. It's the fastest possible dataset overview.
FacetGrid is the manual version. You control exactly which variable goes on rows, which goes on columns, and then you map any Axes-level Seaborn or Matplotlib function onto every panel. This is how you build dashboards programmatically — one loop builds 12 charts, perfectly aligned, with shared axes.
Both are figure-level, so the plt.title() caveat from Section 1 applies. The payoff is that the layout, spacing, and legend are all handled for you.
Handling Large Datasets and Plot Performance
Seaborn's default settings work well for datasets up to tens of thousands of points. Beyond that, performance degrades and patterns become invisible due to overplotting. Here are the strategies used in production.
First, always sample when exploring. df.sample(n=3000) preserves the shape of the data and reduces rendering time from minutes to seconds. Use the random_state parameter for reproducibility.
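A short sketch of reproducible sampling; the 100,000-row frame is synthetic and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({
    "x": rng.normal(size=100_000),
    "y": rng.normal(size=100_000),
})

# Same random_state gives the same 3,000 rows on every run.
subset = big.sample(n=3000, random_state=42)

print(len(subset))
```

A random sample approximates the overall distribution but can miss rare groups, so it is worth checking group counts before drawing conclusions about small categories.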
Second, use transparency and small markers. scatterplot with alpha=0.1, s=5 can show density without a solid blob. For even larger data, switch to a 2D histogram: sns.histplot(data=df, x='col1', y='col2') bins the points into rectangles and colours by count, and sns.jointplot(data=df, x='col1', y='col2', kind='hex') does the same with hexagonal bins.
Third, enable rasterization for vector graphics. When saving to PDF or SVG, vectorised scatter plots with 100k points can produce files that crash viewers. Pass rasterized=True to scatterplot() (or scatter_kws={'rasterized': True} to regplot() and lmplot()) to rasterize only the scatter layer while keeping axes and labels as vectors.
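A sketch of rasterizing only the point layer, using synthetic data. scatterplot() forwards extra keyword arguments to Matplotlib's scatter, so rasterized=True can be passed directly; regplot() takes the same flag through scatter_kws. The dpi value is an arbitrary choice that controls the resolution of the rasterized layer.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.normal(size=50_000)})
df["y"] = 0.5 * df["x"] + rng.normal(size=len(df))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Points become a bitmap; axes, ticks, and labels stay vector.
sns.scatterplot(data=df, x="x", y="y", s=5, alpha=0.3, rasterized=True, ax=ax1)

# Same idea for regplot: the flag goes to the scatter layer via scatter_kws.
sns.regplot(data=df, x="x", y="y", ci=None,
            scatter_kws={"rasterized": True, "s": 5, "alpha": 0.3}, ax=ax2)

# Write the PDF into memory here; pass a filename to save to disk instead.
buf = io.BytesIO()
fig.savefig(buf, format="pdf", dpi=150)
```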
Fourth, downsample along time series. For longitudinal data, resample to a lower frequency (e.g., hourly to daily averages) before plotting. Use pandas resample() and aggregate by mean.
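For the time-series case, a minimal sketch with a synthetic hourly series (the 'load' name and the 30-day span are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# 30 days of hourly readings.
index = pd.date_range("2024-01-01", periods=24 * 30, freq=pd.Timedelta(hours=1))
hourly = pd.Series(rng.normal(50, 5, len(index)), index=index, name="load")

# Collapse to one point per day by averaging the 24 hourly values.
daily = hourly.resample("D").mean()

print(len(hourly), len(daily))
```

The daily series has 30 points instead of 720, which is what gets handed to lineplot(). Averaging hides within-day spikes, so min or max may be a better aggregate when extremes matter.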
Finally, use catplot with kind='box' and multiple plots via col= to split data into manageable panels instead of one overwhelming chart.
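A sketch of the panel-splitting approach, assuming six hypothetical stores and two channels. col_wrap=3 wraps the six panels into two rows of three.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(8)
n = 1200
df = pd.DataFrame({
    "store": rng.choice([f"Store {i}" for i in range(1, 7)], n),
    "channel": rng.choice(["Online", "In-person"], n),
    "sales": rng.normal(100, 20, n),
})

# One box-plot panel per store instead of one crowded chart.
grid = sns.catplot(
    data=df, x="channel", y="sales",
    col="store", col_wrap=3,
    kind="box", height=3, aspect=1.2,
)
```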
| Feature / Aspect | Axes-Level Functions (e.g. boxplot) | Figure-Level Functions (e.g. catplot) |
|---|---|---|
| Returns | Matplotlib Axes object | FacetGrid object |
| Use plt.title()? | Yes, works as expected | No, use grid.figure.suptitle() |
| Multi-panel grids | Manual (plt.subplots) | Built-in via col=, row= params |
| Combine with other charts | Easy — pass ax= param | Harder — use .map_dataframe() |
| Best for | Dashboard panels, custom layouts | Exploratory faceting, quick multi-group views |
| Legend control | Full Matplotlib control | Via grid.add_legend() method |
| Figure size control | figsize on plt.subplots() | height= and aspect= params |
Key Takeaways
- Tidy data is non-negotiable – melt wide tables before plotting.
- Figure-level functions own the figure; axes-level functions draw on your axes.
- Use 'colorblind' palette for accessible, professional charts.
- Always sample before pairplot or scatter with >10k points.
- Rasterize scatter layers to keep PDF/SVG output small.
- Heatmap correlations before modelling to catch multicollinearity.
Common Mistakes to Avoid
- Calling plt.title() after a figure-level function
Symptom: Title appears on a single panel instead of the whole figure, causing confusion in multi-panel charts.
Fix: Use grid.figure.suptitle('Title') for figure-level plots. For axes-level, plt.title() works fine.
- Not handling NaN in hue column
Symptom: Seaborn silently drops rows with NaN in hue, x, or y, so group sizes look wrong and comparisons are biased.
Fix: Always run df['hue_col'].isna().sum() before plotting. Fill with 'Unknown' or drop those rows explicitly.
- Plotting too many points without sampling
Symptom: Chart is extremely slow, freezes the kernel, or produces huge vector files that crash PDF viewers.
Fix: Use df.sample(n=3000, random_state=42) for scatter plots. For large datasets, use hexbin or kdeplot.
- Using rainbow palette for categorical data
Symptom: Colours imply an ordering that doesn't exist, confusing viewers. Also inaccessible to colour-blind readers.
Fix: Use 'colorblind' or 'Set2' for categorical data. Use 'coolwarm' or 'RdBu' for diverging, 'Blues' for sequential.
- Not melting wide-format data before plotting
Symptom: Seaborn expects tidy data (one row per observation). Pivoted columns like 'Jan_Sales', 'Feb_Sales' break grouping and cause errors.
Fix: Use pd.melt() to transform wide data into long format with a 'month' column and a 'sales' column.
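The NaN check from the hue-column mistake above can be sketched as follows. The 'region', 'month', and 'sales' columns and their values are invented for illustration; the point is to make the missing group visible before any chart is drawn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2, 3, 3],
    "region": ["North", "South", None, "North", None, "South", None, "North"],
    "sales": [120, 95, 110, 130, 105, 90, 115, 125],
})

# Step 1: count missing hue values before plotting.
n_missing = int(df["region"].isna().sum())
print(f"{n_missing} rows have no region")

# Step 2a: keep the rows and give them an explicit label...
labelled = df.assign(region=df["region"].fillna("Unknown"))

# Step 2b: ...or drop them deliberately, so the loss is a recorded decision.
dropped = df.dropna(subset=["region"])

# The 'Unknown' group now shows up in the chart instead of disappearing.
ax = sns.scatterplot(data=labelled, x="month", y="sales", hue="region")
```

Which option is right depends on the analysis: labelling keeps totals intact, while dropping keeps group comparisons clean.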
Interview Questions on This Topic
- What is the difference between figure-level and axes-level functions in Seaborn? Give examples. (Senior)
- How does Seaborn handle missing data (NaN) in the hue column? What should you do to avoid problems? (Mid-level)
- Explain the concept of tidy data and why Seaborn requires it. How do you reshape non-tidy data? (Junior)
- What performance optimisation strategies would you recommend for Seaborn plots on large datasets? (Senior)
- How do you create a multi-panel chart with Seaborn where each panel shows a different category? Give a code example using FacetGrid. (Mid-level)
Frequently Asked Questions
Why does my title appear on only one small panel instead of the whole figure?
You called plt.title() after a figure-level function (catplot, relplot, etc.). Figure-level functions return a FacetGrid object, not a matplotlib Axes. Use grid.figure.suptitle('Your Title') instead.
How do I fix missing categories in a grouped seaborn chart?
Check for NaN in the hue column using df['hue_col'].isna().sum(). Seaborn silently drops rows with NaN. Fill missing values with df['hue_col'].fillna('Unknown') or use dropna() before plotting.
What palette should I use for accessible visualisation?
Use palette='colorblind' in Seaborn. It is based on the Wong (2011) colour set and is designed to stay distinguishable for people with the most common forms of colour vision deficiency (deuteranopia, protanopia). It's the professional default.
How do I combine a seaborn chart with a matplotlib annotation?
For axes-level functions, you get back an Axes object: use ax.annotate(). For figure-level, access individual axes via grid.axes_dict[category] or grid.axes.flat[index]. You have full matplotlib access after the Seaborn call.
What is the difference between lmplot and regplot?
lmplot() is a figure-level function that creates a FacetGrid and adds a regression line per hue group, which makes it great for multi-panel communication. regplot() is an axes-level function that draws a regression line on a single axes, giving you more control over the underlying matplotlib figure.
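A small sketch contrasting the two calls on synthetic data (the 'group' column is an illustrative assumption). regplot() is the axes-level function and returns an Axes; lmplot() is the figure-level function and returns a FacetGrid. ci=None skips the bootstrap confidence band to keep the example fast.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(9)
n = 200
df = pd.DataFrame({
    "x": rng.uniform(0, 10, n),
    "group": rng.choice(["a", "b"], n),
})
df["y"] = 2.0 * df["x"] + rng.normal(0, 2, n)

# Axes-level: one regression line on an Axes you control.
fig, ax = plt.subplots()
reg_ax = sns.regplot(data=df, x="x", y="y", ci=None, ax=ax)

# Figure-level: one panel and one regression line per group.
grid = sns.lmplot(data=df, x="x", y="y", col="group", hue="group", ci=None)
```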