How to Visualize Vector Embeddings: A Complete Guide to t-SNE, UMAP, and DBSCAN

Category: AI Concepts
Tags: vector embeddings, t-SNE, UMAP, DBSCAN, PCA, dimensionality reduction, clustering

When you’re building a RAG pipeline or debugging a text classification model, there comes a point where you’re staring at an array of 1,536 numbers and your brain just goes blank. Do these numbers actually form a meaningful structure? Are similar documents really clustering together? Or did the embedding model train itself in completely the wrong direction? You can’t tell just by looking at raw numbers.

This post is about the tools that let you see those 1,536 numbers. We’ll cover three families of techniques: dimensionality reduction (projecting high-dimensional space onto a 2D plane), clustering (automatically discovering hidden structure), and similarity measurement (quantifying the relationship between two vectors). For each one, we’ll go from the underlying math to practical interpretation.

What Is a Vector Embedding?

An embedding encodes meaning as a direction in number space. Feed the word “cat” to a model, and it outputs something like [0.23, -0.71, 0.04, ...] — an array of hundreds of floating-point values. What matters isn’t any individual number; it’s the direction and distance between two vectors.

Think of a coordinate system. In 2D, two points being close together means their physical distance is small. In embedding space, two vectors being close together means their semantic content is similar. OpenAI’s text-embedding-3-small uses 1,536 dimensions; sentence-transformers’ all-MiniLM-L6-v2 uses 384. Since humans can only directly visualize up to three dimensions, we need ways to compress high-dimensional space into something we can see.


Dimensionality Reduction: PCA, t-SNE, UMAP

PCA — Find the Directions of Maximum Variance

PCA (Principal Component Analysis) was developed by Karl Pearson in 1901¹. It works by identifying the axes along which the data has the most variance, then building a new coordinate system from those axes.

The mathematical core is an eigendecomposition of the covariance matrix:

Covariance matrix (X mean-centered): C = (1/n) * X^T * X
Eigendecomposition: C * v = λ * v

The eigenvector v defines the direction of the new axis; the eigenvalue λ tells you how much variance that axis explains. The first principal component (PC1) captures the most variance; PC2 is orthogonal to PC1 and captures the second most.
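The decomposition above can be reproduced in a few lines of NumPy. This is a sketch on random data, not a replacement for sklearn's PCA:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                 # PCA assumes mean-centered data

C = (X.T @ X) / len(X)                 # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]      # sort by explained variance, descending
pcs = eigvecs[:, order[:2]]            # eigenvectors for PC1 and PC2

reduced = X @ pcs                      # project onto the top two components
print(reduced.shape)                   # (200, 2)
```

The explained variance ratio is simply each sorted eigenvalue divided by the sum of all eigenvalues.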

Key parameters and interpretation

n_components is the only parameter you really need to set. Set it to 2 to get a 2D scatter plot. Always check the Explained Variance Ratio: if PC1+PC2 together explain more than 80% of total variance, your 2D projection is trustworthy. If it’s below 30%, you’re losing most of the information.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)  # shape: (n, 2)
print(pca.explained_variance_ratio_)     # something like [0.42, 0.18]

Reading the result: If clusters separate cleanly in a PCA scatter plot, the embeddings have a linearly separable structure. If points stretch along a long axis (a cigar shape), one particular semantic direction dominates.

Pros and cons: Fast, deterministic (same result every run), and invertible. The downside is that PCA only captures linear relationships — nonlinear structure common in embedding spaces may appear as one big undifferentiated blob. PCA is best used as a preprocessing step (e.g., compress 512 dimensions down to 50, then run t-SNE on that).


t-SNE — Preserve Neighborhood Relationships as Probabilities

t-SNE (t-distributed Stochastic Neighbor Embedding) was published by van der Maaten and Hinton in 2008². Its goal is to preserve the neighborhood structure of high-dimensional data when projecting it to low dimensions.

The core idea is to minimize the difference between two probability distributions.

High-dimensional similarity (Gaussian):

p(j|i) = exp(-||x_i - x_j||^2 / (2 * σ_i^2)) / Σ_k exp(-||x_i - x_k||^2 / (2 * σ_i^2))
p_ij = (p(j|i) + p(i|j)) / (2n)

Low-dimensional similarity (t-distribution, df=1):

q_ij = (1 + ||y_i - y_j||^2)^(-1) / Σ_(k≠l) (1 + ||y_k - y_l||^2)^(-1)

Cost function (KL Divergence minimization):

C = KL(P || Q) = Σ_i Σ_j p_ij * log(p_ij / q_ij)

The key insight is why t-SNE uses a Gaussian in high dimensions but a t-distribution (fat-tailed) in low dimensions. A Gaussian assigns near-zero probability to distant points, effectively ignoring them. A t-distribution assigns significantly higher probability to the same distance. This asymmetry means points that were far apart in high dimensions actively repel each other in the low-dimensional layout, rather than collapsing together.

Key parameters

| Parameter | Default | Effect |
| --- | --- | --- |
| perplexity | 30 | Effective number of neighbors per point. Lower values emphasize local structure; higher values emphasize global structure |
| n_iter | 1000 | Number of optimization iterations. Increase if the result hasn’t converged |
| learning_rate | 200 | Too low → points clump into a ball; too high → points scatter |

How perplexity changes the picture: At perplexity=5, you get many tight, small clusters with sharp boundaries. At perplexity=50, individual clusters merge and large-scale groupings emerge. It’s good practice to run with perplexity at 5, 30, and 100 and compare all three.

Interpretation pitfalls: In a t-SNE plot, the distance between clusters is meaningless. Two clusters that appear close on screen are not necessarily close in the original high-dimensional space — t-SNE preserves local neighborhood relationships, not global distance structure. Results also vary between runs (always fix random_state), and you cannot add new data points to an existing t-SNE result (no out-of-sample extension).

[!KEY] In a t-SNE visualization, “these two clusters look far apart on screen” does not mean they’re semantically distant. Focus on the shape and density of clusters; ignore inter-cluster distances.
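A minimal scikit-learn run illustrates the workflow; the digits dataset (subset for speed) stands in for real embeddings:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X = X[:500]                                # small subset keeps the run fast

# random_state fixed: t-SNE is stochastic and layouts differ between runs.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
layout = tsne.fit_transform(X)
print(layout.shape)                        # (500, 2)
```

Note that TSNE has no transform() method: the entire dataset must be re-fit to add new points, which is exactly the out-of-sample limitation described above.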


UMAP — Fuzzy Simplicial Sets on a Topological Manifold

UMAP (Uniform Manifold Approximation and Projection) was published by McInnes and Healy in 2018³, grounded in Riemannian geometry and fuzzy topology. With over 8,100 GitHub stars, it has become the primary alternative to t-SNE.

The core assumption is that high-dimensional data lies on some smooth manifold, and the goal is to preserve that manifold’s topological structure in a lower-dimensional representation.

For each point x_i, let ρ_i be the distance to its nearest neighbor (the local scale) and σ_i be the connectivity strength:

High-dimensional similarity:
v_ij = exp(-(d(x_i, x_j) - ρ_i) / σ_i)
w_ij = v_ij + v_ji - v_ij * v_ji   (fuzzy union)

Low-dimensional similarity (t-distribution variant):
q_ij = (1 + a * ||y_i - y_j||^(2b))^(-1)

Cost function (Cross-entropy):
C = Σ_ij [w_ij * log(w_ij/q_ij) + (1-w_ij) * log((1-w_ij)/(1-q_ij))]

Parameters a and b are fit automatically from your min_dist and spread settings.

Key parameters

| Parameter | Default | Effect |
| --- | --- | --- |
| n_neighbors | 15 | Analogous to t-SNE’s perplexity. Lower → local structure; higher → global structure |
| min_dist | 0.1 | How tightly points cluster in low-dimensional space. Lower → more compact clusters |
| metric | euclidean | Distance function. cosine is often better for embeddings |
| spread | 1.0 | Overall scale of the embedding space |

The critical difference from t-SNE: UMAP preserves global structure to a meaningful degree. The relative distance between two clusters actually carries information — clusters that are far apart in high-dimensional space tend to remain far apart in UMAP. UMAP is also dramatically faster (often tens of times faster for large datasets), supports out-of-sample extension via transform(), and can use n_components > 2 as a preprocessing step before clustering.

Interpretation tips: min_dist=0.0 makes points within the same cluster collapse tightly together, making cluster boundaries very sharp. min_dist=0.5 lets you see internal cluster structure. For embedding analysis, metric='cosine' often gives better results because cosine similarity better captures semantic similarity in high-dimensional spaces.

[!KEY] UMAP preserves global structure better than t-SNE. However, n_neighbors below 5 introduces artifacts, while values above 100 blur cluster boundaries. For embedding analysis, 20–50 is generally a good starting point.


Clustering: K-Means, DBSCAN, HDBSCAN

Once you’ve visualized the structure with dimensionality reduction, use clustering to find that structure automatically.

K-Means — Partition Around Centroids

K-Means is the most classical algorithm for dividing data into K clusters. The objective is to minimize within-cluster variance:

Cost function (Within-Cluster Sum of Squares, WCSS):
J = Σ_k Σ_{x_i ∈ C_k} ||x_i - μ_k||^2

μ_k = (1/|C_k|) * Σ_{x_i ∈ C_k} x_i   (cluster centroid)

The algorithm is straightforward: randomly initialize K centroids → assign each point to its nearest centroid → recompute centroids → repeat until convergence.

Key parameters

The fundamental constraint is that you must specify K upfront. Common approaches to finding the optimal K are the Elbow Method and the Silhouette Score:

Silhouette Score:
s(i) = (b(i) - a(i)) / max(a(i), b(i))

a(i): mean distance from i to all other points in the same cluster
b(i): mean distance from i to all points in the nearest other cluster
s(i) range: -1 to 1  (higher is better)

Pros and cons: Simple to implement and fast. However, it can only detect convex (spherical) clusters, is sensitive to outliers, and requires knowing K in advance. For embedding analysis, Spherical K-Means (using cosine distance instead of Euclidean) is a better fit.
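The silhouette-based search for K can be sketched as follows; the three well-separated synthetic blobs are a stand-in for embedding clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs standing in for embedding clusters.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)   # 3 — the silhouette peaks at the true cluster count
```

On real embeddings the peak is rarely this sharp; treat the score as a guide, not an oracle.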


DBSCAN — Find Clusters by Density

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was published by Ester, Kriegel, Sander, and Xu in 1996⁴. It defines clusters as regions of high density.

The algorithm is defined by two parameters: eps (ε), the radius that defines a neighborhood, and min_samples, the minimum number of neighbors required to be considered a core point.

Neighborhood set:
N_eps(x) = { y ∈ D | d(x, y) <= eps }

Core point condition:
|N_eps(x)| >= min_samples

Direct density-reachability:
x → y: y ∈ N_eps(x) and x is a core point

Density-connectedness:
x ↔ y: ∃ z such that z reaches both x and y

How eps and min_samples affect results

If eps is too large, everything merges into one massive cluster. Too small, and most points become noise (labeled -1). Raising min_samples requires denser regions to qualify as clusters. The standard way to find a good eps is to plot a k-nearest neighbor distance graph: set k=min_samples, sort each point’s distance to its k-th neighbor in ascending order, and look for the “elbow” where the distance starts to rise sharply — that’s your eps.

Pros and cons: Can detect arbitrarily shaped clusters, requires no K, and automatically flags outliers as -1. The downside is that eps is tricky to tune, and it struggles when clusters have widely varying densities (common in embedding spaces with many outliers).
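The k-distance heuristic and a DBSCAN run can be sketched together. The two-moons dataset is a stand-in for a non-convex structure K-Means cannot capture; eps=0.2 is an illustrative value, not a universal default:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=400, noise=0.05, random_state=42)

# k-distance graph: sort each point's distance to its k-th neighbor;
# the elbow of this curve is the candidate eps.
k = 5
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kth_dists = np.sort(dists[:, -1])

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)   # 2 — one cluster per moon, noise labeled -1
```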


HDBSCAN — Cluster Across Density Levels

HDBSCAN (Hierarchical DBSCAN) is based on the theoretical framework by Campello, Moulavi, and Sander (2013)⁵, implemented as a library by McInnes, Healy, and Astels in 2017⁶. It eliminates the eps parameter and can detect clusters at multiple density levels simultaneously.

The core concepts are mutual reachability distance and the condensed tree:

Core distance:
d_core(x) = d(x, k-th nearest neighbor)

Mutual reachability distance:
d_mreach(x, y) = max(d_core(x), d_core(y), d(x, y))

Build MST (Minimum Spanning Tree) on these distances →
Construct hierarchical cluster tree →
Compress into condensed tree →
Select clusters with maximum stability

Cluster stability formula:

S(C) = Σ_{x ∈ C} (λ_max(x) - λ_birth(C))

λ = 1/d (inverse of distance, i.e., density level)
λ_birth(C): density level at which the cluster first appears
λ_max(x): density level at which the point leaves the cluster

Key parameters

| Parameter | Default | Effect |
| --- | --- | --- |
| min_cluster_size | 5 | Minimum cluster size. Larger values produce fewer, bigger clusters |
| min_samples | None | Core point threshold. Higher values make clustering more conservative |
| cluster_selection_method | eom | eom (default) prefers large, stable clusters; leaf prefers smaller clusters |

Pros and cons: Works well without tuning eps, handles clusters of unequal density, and supports soft clustering — each point gets a probability of cluster membership, so you can identify borderline points. In practice, HDBSCAN tends to be far more robust than DBSCAN for embedding analysis.


Similarity Measurement: Cosine Similarity vs. Euclidean Distance

Two main approaches exist for quantifying how similar two embeddings are.

Cosine Similarity

cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)

A · B = Σ_i A_i * B_i
||A|| = sqrt(Σ_i A_i^2)

Range: -1 to 1
 1: identical direction (semantically the same)
 0: orthogonal (no meaningful relationship)
-1: opposite direction (semantically opposite)

Cosine similarity compares only the direction of vectors, ignoring magnitude. This makes it well suited for embeddings, because embedding models encode meaning as direction, not magnitude. “Dog” and “puppy” might have embeddings pointing in similar directions but with different lengths.
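The magnitude-invariance is easy to verify directly, here with a hand-rolled helper rather than sklearn's version:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: direction only, magnitude ignored."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))   # ≈ 1.0: same direction despite double the length
print(cosine_similarity(a, -a))      # ≈ -1.0: opposite direction
```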

Euclidean Distance

euclidean_distance(A, B) = sqrt(Σ_i (A_i - B_i)^2)

Range: 0 to ∞
0: identical points
Larger values → farther apart

Euclidean distance measures the absolute positional difference between vectors. When embeddings are L2-normalized (||v|| = 1), Euclidean distance and cosine similarity are monotonically related:

||A - B||^2 = 2 - 2 * cosine_similarity(A, B)
(when A and B are both unit vectors)

So for L2-normalized embeddings, both metrics produce the same ranking. For unnormalized embeddings, cosine similarity is more reliable. OpenAI embeddings are already L2-normalized; sentence-transformers varies by model.
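The identity can be checked numerically on a pair of random unit vectors:

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
A, B = normalize(rng.normal(size=(2, 384)))   # two random L2-normalized vectors

cos = float(A @ B)
sq_dist = float(np.sum((A - B) ** 2))

# ||A - B||^2 = 2 - 2 * cosine_similarity(A, B) holds for unit vectors
print(np.isclose(sq_dist, 2 - 2 * cos))       # True
```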

When to use which:

  • RAG retrieval, semantic search: cosine similarity
  • K-Means clustering: Euclidean after normalization, or Spherical K-Means
  • Outlier detection: Euclidean distance (absolute position matters)
  • Preprocessing for t-SNE/UMAP: cosine distance (metric='cosine')

Practical Workflow: Combining the Tools

A typical workflow for exploring embedding space looks like this:

graph TD
    A[Collect raw embeddings] --> B[Reduce to 50 dims with PCA]
    B --> C{Goal}
    C --> D[Visualize: UMAP 2D]
    C --> E[Cluster: HDBSCAN]
    D --> F[Color points by cluster label]
    E --> F
    F --> G[Review outliers and borderline samples]

Step 1 — Preprocessing: For millions of embeddings, use PCA to compress down to 50–100 dimensions first. Running UMAP directly on 1,536 dimensions is slow, and distance calculations in very high-dimensional noisy space are unreliable (the curse of dimensionality).

Step 2 — Visualization: Apply UMAP (n_neighbors=15, min_dist=0.1, metric='cosine') for a 2D projection.

Step 3 — Clustering: Apply HDBSCAN (min_cluster_size=10) on the PCA-reduced space — not on the 2D UMAP projection. Running clustering on 2D coordinates introduces distortions from the projection itself.

Step 4 — Validation: Compute Silhouette Score per cluster, and manually read representative samples from each cluster to verify that the labels are meaningful. Points with HDBSCAN probabilities_ below 0.5 are borderline cases worth examining closely.

Common mistakes and how to avoid them

  • Using t-SNE coordinates as clustering features: t-SNE prioritizes local structure, which severely distorts global distances. Always cluster in the original space (or PCA-reduced space), not the 2D t-SNE layout.
  • Insisting on K-Means without choosing K: Embedding clusters are rarely spherical. HDBSCAN typically produces far more natural results.
  • Not checking normalization: Run np.linalg.norm(embeddings, axis=1) on your embeddings. Values close to 1.0 → already normalized; otherwise, apply sklearn.preprocessing.normalize().
  • Too-small perplexity / n_neighbors: With 50,000+ embeddings, perplexity=5 will show meaningless fragmented structure. Start near the square root of your dataset size.

Modern Visualization Tools

Nomic Atlas

Nomic Atlas is a platform for interactively exploring millions of embeddings in a web browser. Upload your embedding array via the Python SDK and it automatically builds an interactive map with UMAP projection and clustering applied. An official example is included in the OpenAI Cookbook.

import nomic
from nomic import atlas

project = atlas.map_embeddings(
    embeddings=embeddings,   # numpy array (n, dim)
    data=metadata,           # list of metadata dicts
    id_field='id',
    colorable_fields=['category']
)

Nomic provides several large public maps — 5.4 million Twitter tweets, 6.4 million Stable Diffusion images — which are useful references for developing intuition about embedding structure.

Apple Embedding Atlas

An open-source tool released by Apple in 2025⁷ under the MIT license. It implements the UMAP algorithm in WebAssembly so your data never leaves the browser. The project has over 3,600 GitHub stars. It includes a built-in density clustering algorithm (implemented in Rust) for automatic labeling, and is available as a Jupyter Notebook widget.

pip install embedding-atlas
embedding-atlas serve embeddings.parquet

It supports interactive exploration of millions of embeddings locally, with cross-filtering and metadata search. For environments where data privacy matters, it’s a strong alternative to Nomic Atlas.


Closing Thoughts

Making sense of vector embeddings isn’t something any single tool can do on its own. The practical approach is a combination of four: PCA for preprocessing, UMAP for structural visualization, HDBSCAN for discovering clusters, and cosine similarity for validating individual relationships.

The key is understanding what each tool assumes and where it breaks down. Avoid trusting inter-cluster distances in t-SNE, applying K-Means to 2D projections, or using Euclidean distance on unnormalized embeddings — and your analysis becomes substantially more reliable.


Footnotes

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.

  2. van der Maaten, L. J. P., & Hinton, G. E. (2008). Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605. https://www.jmlr.org/papers/v9/vandermaaten08a.html

  3. McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426. https://arxiv.org/abs/1802.03426

  4. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231.

  5. Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 160–172.

  6. McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205. https://doi.org/10.21105/joss.00205

  7. Ren, D., Hohman, F., Lin, H., & Moritz, D. (2025). Embedding Atlas: Low-Friction, Interactive Embedding Visualization. arXiv:2505.06386. https://github.com/apple/embedding-atlas
