In mathematics and statistics, a probability vector or stochastic vector is a vector with non-negative entries that add up to one.

Underlying every probability vector is an experiment that can produce an outcome. To connect this experiment with mathematics, one introduces a discrete random variable, a function that assigns a numerical value to each possible outcome. For example, if the experiment consists of rolling a single die, the random variable may be defined as the number of pips on the upward face. The possible values of this random variable are the integers 1, 2, …, 6. The associated probability vector then has six components, each representing the probability of obtaining the corresponding outcome. More generally, a probability vector of length n represents the distribution of probabilities across the n possible numerical outcomes of a discrete random variable.

The vector gives us the probability mass function of that random variable, which is the standard way of characterizing a discrete probability distribution.[1]
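
The die-roll example above can be sketched in Python: the probability mass function of a fair die is a vector of six equal entries. The helper name `is_probability_vector` is illustrative, not a standard library function.

```python
# Minimal sketch: a probability vector has non-negative entries summing to one.
# is_probability_vector is a hypothetical helper, not a library function.

def is_probability_vector(p, tol=1e-9):
    """Check non-negative entries that sum to one (up to floating-point tolerance)."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) < tol

# Fair six-sided die: each outcome 1..6 has probability 1/6.
die_pmf = [1 / 6] * 6

print(is_probability_vector(die_pmf))     # True
print(is_probability_vector([0.5, 0.6]))  # sums to 1.1 -> False
```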

Examples

Here are some examples of probability vectors. The vectors can be either columns or rows.

  •  $x_0 = \begin{bmatrix} 0.5 \\ 0.25 \\ 0.25 \end{bmatrix}$
  •  $x_1 = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}$
  •  $x_2 = \begin{bmatrix} 0.65 \\ 0.35 \end{bmatrix}$
  •  $x_3 = \begin{bmatrix} 0.3 & 0.5 & 0.07 & 0.1 & 0.03 \end{bmatrix}$

Geometric interpretation

Writing out the vector components of a vector $p$ as

$$p = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix} \quad \text{or} \quad p = \begin{bmatrix} p_1 & p_2 & \cdots & p_n \end{bmatrix}$$

the vector components must sum to one:

$$\sum_{i=1}^{n} p_i = 1$$

Each individual component must have a probability between zero and one:

$$0 \le p_i \le 1$$

for all $i$. Therefore, the set of stochastic vectors coincides with the standard $(n-1)$-simplex. It is a point if $n = 1$, a segment if $n = 2$, a (filled) triangle if $n = 3$, a (filled) tetrahedron if $n = 4$, etc.
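
One consequence of the simplex picture is convexity: any convex combination of two probability vectors is again a probability vector. A minimal sketch (the helper name `mix` is illustrative):

```python
# Sketch: the simplex is convex, so mixing two probability vectors
# with weight t in [0, 1] yields another probability vector.
# mix is a hypothetical helper name.

def mix(p, q, t):
    """Convex combination t*p + (1-t)*q of two equal-length vectors."""
    return [t * a + (1 - t) * b for a, b in zip(p, q)]

p = [1.0, 0.0, 0.0]  # a vertex of the 2-simplex (a filled triangle)
q = [1 / 3, 1 / 3, 1 / 3]  # the centre of the 2-simplex
r = mix(p, q, 0.5)

print(r)                        # midpoint between the vertex and the centre
print(abs(sum(r) - 1) < 1e-9)   # True: still sums to one
```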

Properties

  • The mean of the components of any probability vector is $1/n$.
  • The shortest probability vector has the value $1/n$ as each component of the vector, and has a length of $1/\sqrt{n}$.
  • The longest probability vector has the value 1 in a single component and 0 in all others, and has a length of 1.
  • The shortest vector corresponds to maximum uncertainty, the longest to maximum certainty.
  • The length of a probability vector is equal to $\sqrt{n\sigma^2 + 1/n}$, where $\sigma^2$ is the variance of the elements of the probability vector.
  • The bounds on the variance are $0 \le \sigma^2 \le (n-1)/n^2$.
  • The derivative with respect to $n$ of the maximum variance is $\frac{d}{dn}\,\frac{n-1}{n^2} = \frac{2-n}{n^3}$.
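
The length–variance identity and the variance bound in the list above can be checked numerically. A small sketch (the helper names `length` and `variance` are illustrative, not library functions):

```python
import math

def length(p):
    """Euclidean length of the vector."""
    return math.sqrt(sum(x * x for x in p))

def variance(p):
    """Population variance of the components."""
    n = len(p)
    mean = sum(p) / n  # always 1/n for a probability vector
    return sum((x - mean) ** 2 for x in p) / n

p = [0.1, 0.2, 0.3, 0.4]
n = len(p)

# Length equals sqrt(n * sigma^2 + 1/n).
print(abs(length(p) - math.sqrt(n * variance(p) + 1 / n)) < 1e-9)  # True

# Variance lies within its bounds 0 <= sigma^2 <= (n - 1) / n^2.
print(0 <= variance(p) <= (n - 1) / n ** 2)  # True
```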

Significance of the bounds on variance

edit

Many natural experiments have a large number of possible outcomes, but the bounds on variance show that as n increases the maximum attainable variance decreases toward 0. A small variance means the components are nearly equal, which corresponds to high uncertainty. As a result, probability vectors become less informative for large n, which motivates the common practice of binning outcomes to reduce the effective number of categories. While binning discards information about the original fine-grained outcomes, it reveals the coarser structure that is otherwise obscured.
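
Binning can be sketched as summing the probabilities of the outcomes merged into each bin. The 2-to-1 grouping below is an arbitrary illustrative choice, and `bin_pairs` is a hypothetical helper:

```python
# Sketch: binning a fine-grained probability vector into coarser categories.
# bin_pairs is a hypothetical helper; the pairwise grouping is illustrative.

def bin_pairs(p):
    """Merge adjacent pairs of outcomes by summing their probabilities."""
    return [p[i] + p[i + 1] for i in range(0, len(p), 2)]

fine = [1 / 6] * 6        # fair die: six outcomes
coarse = bin_pairs(fine)  # three bins: {1,2}, {3,4}, {5,6}

print(coarse)                        # three roughly equal probabilities
print(abs(sum(coarse) - 1) < 1e-9)   # True: still a probability vector
```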

This notion of uncertainty is closely related to entropy as used in information theory and to entropy in statistical mechanics.

References

  1. ^ Jacobs, Konrad (1992), Discrete Stochastics, Basler Lehrbücher [Basel Textbooks], vol. 3, Birkhäuser Verlag, Basel, p. 45, doi:10.1007/978-3-0348-8645-1, ISBN 3-7643-2591-7, MR 1139766.