Full Note on Maximum Likelihood Estimation (MLE)
If you look under the hood of many machine learning algorithms, from simple linear regression to massive deep neural networks, you’ll find one core mathematical idea doing much of the work: Maximum Likelihood Estimation (MLE).
It sounds like an intimidating academic term, but it's actually a very intuitive concept once it clicks. Let's walk through what MLE actually is, how the math works, and why it's the exact same thing as the "loss functions" you hear about in modern AI.
Part 1: The Core Intuition
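Before we touch any math, we need to clear up a common confusion. In everyday English, people use "probability" and "likelihood" to mean the exact same thing. In statistics, they answer opposite questions about the same formula: one fixes the parameter and asks about the data, the other fixes the data and asks about the parameter.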
Probability vs. Likelihood
Think about it like this:
Scenario A (Probability): Imagine I give you a coin, and I tell you upfront that it's heavily biased. It will land on Heads 80% of the time ($\theta = 0.8$).
Probability asks: Given this known coin, what is the chance of flipping 3 Heads in a row?
Here, the rule (parameter) is known, and we are guessing the data.
Scenario B (Likelihood): Now imagine the reverse. I hand you a mystery coin. You have no idea how it behaves. You flip it 3 times and get: Heads, Heads, Heads.
Likelihood asks: Given this data we just saw, what is the most plausible bias ($\theta$) of this coin?
Here, the data is known, and we are guessing the rule (parameter).
This is the entire philosophy of MLE: In the real world, we collect data. We assume there's some underlying mathematical rule (or model parameters) that generated that data, but we don't know what it is. Maximum Likelihood Estimation is simply the process of finding the specific model parameters that make the data we actually collected look as mathematically probable as possible.
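To make the distinction concrete, here is a minimal Python sketch of the two scenarios. The function name `prob_of_flips` and the short list of candidate biases are illustrative choices for this note, not part of any library.

```python
def prob_of_flips(theta, flips):
    """Probability of one exact sequence of flips (1 = Heads, 0 = Tails)
    for a coin whose chance of Heads is theta."""
    p = 1.0
    for flip in flips:
        p *= theta if flip == 1 else (1.0 - theta)
    return p

three_heads = [1, 1, 1]

# Scenario A (probability): the parameter is fixed, the data is the question.
prob_a = prob_of_flips(0.8, three_heads)  # 0.8 * 0.8 * 0.8

# Scenario B (likelihood): the data is fixed, the parameter is the question.
candidates = [0.2, 0.5, 0.8, 1.0]  # hypothetical biases to compare
likelihoods = {theta: prob_of_flips(theta, three_heads) for theta in candidates}
best_theta = max(likelihoods, key=likelihoods.get)

print(prob_a)
print(best_theta)
```

With only three flips, all of them Heads, the candidate $\theta = 1.0$ scores highest. That is a useful reminder that MLE trusts the data completely, so tiny samples can produce extreme estimates.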
Part 2: The Mathematical Framework
Alright, let's put some formal math behind this idea.
1. The Likelihood Function
Let’s say we have a dataset $X = \{x_1, x_2, ..., x_n\}$. We usually assume these data points are Independent and Identically Distributed (i.i.d.). That’s just a fancy way of saying one data point doesn't influence another, and they are all drawn from the same distribution.
Let $\theta$ represent the unknown parameters of our model (like the weights in a neural network).
Because the points are independent, the probability of observing our entire dataset given $\theta$ is just the individual probabilities of each data point multiplied together (for continuous data, read $P$ as a probability density):
$$L(\theta | X) = P(x_1 | \theta) \times P(x_2 | \theta) \times ... \times P(x_n | \theta)$$
$$L(\theta | X) = \prod_{i=1}^{n} P(x_i | \theta)$$
This equation is the Likelihood Function. Our main goal in MLE is to find the value for $\theta$ that makes this equation output the highest possible number.
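As a rough sketch of that product in code, assume each data point is a coin flip so that $P(x_i | \theta)$ is a Bernoulli probability. The helper names `bernoulli_prob` and `likelihood` and the sample flips are made up for illustration.

```python
import math

def bernoulli_prob(x, theta):
    """P(x | theta) for a single coin flip: x is 1 (Heads) or 0 (Tails)."""
    return theta if x == 1 else 1.0 - theta

def likelihood(theta, data):
    """L(theta | X): product of the per-point probabilities, assuming i.i.d. data."""
    return math.prod(bernoulli_prob(x, theta) for x in data)

data = [1, 0, 1, 1]  # made-up flips, used only for illustration

# Same data, different candidate parameters: the output changes with theta.
for theta in (0.25, 0.5, 0.75):
    print(theta, likelihood(theta, data))
```

The data never changes inside the loop; only $\theta$ does. That is what it means to treat the likelihood as a function of the parameter.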
2. The Log-Likelihood Hack
Trying to maximize that equation directly is a complete nightmare for computers and for mathematicians doing calculus. Here's why:
Computer Underflow: Probabilities are numbers between 0 and 1. If you multiply 10,000 of them together (like you would for a dataset with 10,000 rows), the result is usually smaller than the smallest number floating-point arithmetic can represent, so the computer rounds it to exactly 0.
Calculus is Hard: To find the maximum of a function, we have to take its derivative. Doing that on a massive chain of multiplications requires applying the Product Rule thousands of times. It's too messy.
To fix this, mathematicians use a clever hack: we take the natural logarithm (written $\ln$, or simply $\log$ in this note) of the Likelihood function. Because the logarithm is strictly increasing, whatever $\theta$ maximizes the Log-Likelihood also maximizes the original Likelihood.
By using the basic log rule $\log(a \times b) = \log(a) + \log(b)$, we can turn that terrible chain of multiplication into a beautiful, easy-to-manage sum:
$$\ell(\theta | X) = \log L(\theta | X) = \sum_{i=1}^{n} \log P(x_i | \theta)$$
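The underflow problem and the log fix are easy to see numerically. The sketch below assumes a hypothetical dataset of 10,000 rows where every row has probability 0.5; real per-row probabilities would differ, but the effect is the same.

```python
import math

probs = [0.5] * 10_000  # hypothetical per-row probabilities

raw_product = math.prod(probs)             # direct product of 10,000 probabilities
log_sum = sum(math.log(p) for p in probs)  # the same quantity on the log scale

print(raw_product)  # 0.0: the true value is far below the smallest float
print(log_sum)      # a perfectly ordinary negative number
```

The product is useless for optimization because every candidate $\theta$ would score 0, while the sum of logs stays well within floating-point range.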
3. Finding the Maximum
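Now, to find our optimal parameters ($\theta_{MLE}$), we take the derivative (gradient) of this new Log-Likelihood function, set it to 0, and solve for $\theta$ (for simple models this gives a closed-form answer; for models like neural networks it doesn't, and the gradient is followed numerically instead):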
$$\frac{\partial}{\partial \theta} \ell(\theta | X) = 0$$
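When the equation is awkward to solve by hand, the same condition can be solved numerically. The sketch below is one possible approach, not the only one: it uses bisection on the derivative of a coin-flip log-likelihood, with made-up counts of 7 Heads and 3 Tails. The names `score` and `solve_score` are illustrative, and the method assumes at least one Head and one Tail.

```python
def score(theta, heads, tails):
    """Derivative of the Bernoulli log-likelihood
    heads*log(theta) + tails*log(1 - theta) with respect to theta."""
    return heads / theta - tails / (1.0 - theta)

def solve_score(heads, tails, lo=1e-9, hi=1.0 - 1e-9, iters=100):
    """Find the theta in (0, 1) where the derivative is zero."""
    # The derivative is positive near 0, negative near 1, and strictly
    # decreasing in between, so bisection closes in on the unique root.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid, heads, tails) > 0:
            lo = mid   # still climbing: the maximum is to the right
        else:
            hi = mid   # already descending: the maximum is to the left
    return (lo + hi) / 2.0

theta_hat = solve_score(heads=7, tails=3)  # hypothetical counts
print(theta_hat)
```

The answer lands on the observed fraction of Heads, 7 out of 10, which is what solving the equation by hand would give for this model.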
Part 3: The Bridge to Machine Learning
So, how does this actually connect to writing code in PyTorch, TensorFlow, or Scikit-Learn?
If you've studied any machine learning, you know we rarely talk about "maximizing likelihood." Instead, everyone talks about "minimizing loss." Here's the secret: maximizing a function is mathematically identical to minimizing its negative. So, in machine learning, instead of maximizing the Log-Likelihood, we just slap a negative sign in front of it and minimize it. This is called the Negative Log-Likelihood (NLL):
$$\text{Loss} = - \sum_{i=1}^{n} \log P(x_i | \theta)$$
Let's look at how two of the most famous ML loss functions are literally just MLE in a trench coat.
Proof 1: Linear Regression and Mean Squared Error (MSE)
In Linear Regression, we try to predict a continuous number $y$ based on some input $x$. We usually assume the true value equals our model's prediction plus random, normally distributed (Gaussian) noise.
The normal distribution formula for our prediction looks like this, where our model's prediction is the mean ($\hat{y} = \theta^T x$) and the variance is $\sigma^2$:
$$P(y_i | x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right)$$
Let's plug that directly into our Negative Log-Likelihood formula:
$$\text{NLL} = - \sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right) \right]$$
Using our log rules, we split the terms:
$$\text{NLL} = - \sum_{i=1}^{n} \left[ \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right]$$
Now, here is where it gets fun. We only care about finding the best $\theta$ (which is hiding inside our prediction $\hat{y}_i$). That first ugly log term? It doesn't contain $\theta$, so it's just a constant (we are treating $\sigma^2$ as fixed). It doesn't affect our optimization at all, so we can throw it away. The $\frac{1}{2\sigma^2}$ part? Also just a positive constant scaling factor, which changes the height of the loss but not where its minimum sits. Throw it away too.
Look at what we have left:
$$\text{Minimize:} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
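Conclusion: What's left is the sum of squared errors, and dividing it by $n$ to get the Mean Squared Error (MSE) doesn't move the minimum. So minimizing the MSE in Machine Learning is mathematically exactly the same thing as running Maximum Likelihood Estimation assuming Gaussian noise with a fixed variance.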
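A quick numerical check of this claim, under simplifying assumptions: a one-parameter model $\hat{y} = w x$, a small made-up dataset, and a brute-force grid of candidate slopes instead of a real optimizer. The names `sse` and `gaussian_nll` are illustrative.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data, roughly y = 2x

def sse(w):
    """Sum of squared errors for the model y_hat = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(w, sigma2):
    """Negative log-likelihood under Gaussian noise with variance sigma2."""
    n = len(xs)
    return 0.5 * n * math.log(2 * math.pi * sigma2) + sse(w) / (2 * sigma2)

grid = [i / 1000 for i in range(1000, 3001)]  # candidate slopes 1.000 to 3.000

w_sse = min(grid, key=sse)
w_nll_a = min(grid, key=lambda w: gaussian_nll(w, sigma2=0.5))
w_nll_b = min(grid, key=lambda w: gaussian_nll(w, sigma2=4.0))

print(w_sse, w_nll_a, w_nll_b)
```

All three searches pick the same slope. The choice of $\sigma^2$ changes the value of the loss but not the location of its minimum, which is why it can be dropped.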
Proof 2: Logistic Regression and Cross-Entropy Loss
What about classification? In binary classification, our target $y$ is either 0 or 1. This follows a Bernoulli distribution. For each input $x_i$, our model spits out a probability $p_i$ that $y_i=1$.
The Bernoulli formula is remarkably clever:
$$P(y_i | x_i) = p_i^{y_i} (1 - p_i)^{(1 - y_i)}$$
(Think about it: if $y_i=1$, the second half becomes $(1-p_i)^0 = 1$, so we are just left with $p_i$. If $y_i=0$, the first half disappears and we are left with $(1-p_i)$.)
Let's plug this into the Negative Log-Likelihood:
$$\text{NLL} = - \sum_{i=1}^{n} \log \left[ p_i^{y_i} (1 - p_i)^{(1 - y_i)} \right]$$
Apply the log rules:
$$\text{NLL} = - \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
Conclusion: If you've trained a classifier, you recognize this instantly. This is the exact formula for Binary Cross-Entropy (BCE) Loss. When your neural network minimizes Cross-Entropy, it is literally just doing Maximum Likelihood Estimation.
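The equivalence can be checked directly. In this sketch the labels and the per-example predicted probabilities (`probs`) are made up, and both functions are hand-written for illustration rather than taken from a framework.

```python
import math

labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.7]  # hypothetical model outputs, P(y = 1) per example

def bernoulli_likelihood(ys, ps):
    """Product of the Bernoulli probabilities of the observed labels."""
    return math.prod(p ** y * (1 - p) ** (1 - y) for y, p in zip(ys, ps))

def binary_cross_entropy(ys, ps):
    """Summed binary cross-entropy over all examples."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps))

nll = -math.log(bernoulli_likelihood(labels, probs))
bce = binary_cross_entropy(labels, probs)

print(nll, bce)  # the two values agree up to floating-point rounding
```

Libraries commonly report the mean over examples rather than the sum. That rescales the loss by $1/n$ without changing which parameters minimize it.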
Part 4: Step-by-Step Math Examples (The Basics)
To make sure this really sticks, let's work through some manual calculations to see how MLE finds the right parameters across totally different distributions in practice.
Example 1: The Biased Coin (Bernoulli Distribution)
Let's say you flip a coin 5 times and get: Heads, Heads, Tails, Heads, Tails. Let $H=1$ and $T=0$.
Our dataset is $X = \{1, 1, 0, 1, 0\}$.
We want to find the MLE for the probability of heads, which we'll call $\theta$.
Step 1: Write the Likelihood Function
For a Bernoulli trial, $P(x) = \theta^x (1-\theta)^{1-x}$.
$$L(\theta | X) = \prod_{i=1}^{5} \theta^{x_i} (1-\theta)^{1-x_i}$$
Since we got 3 Heads ($x=1$) and 2 Tails ($x=0$):
$$L(\theta) = \theta^3 (1-\theta)^2$$
Step 2: Take the Log-Likelihood
$$\ell(\theta) = 3\log(\theta) + 2\log(1-\theta)$$
Step 3: Take the Derivative and set to 0
$$\frac{d}{d\theta} \ell(\theta) = \frac{3}{\theta} - \frac{2}{1-\theta} = 0$$
Step 4: Solve for $\theta$
$$3(1-\theta) = 2\theta \implies 3 - 3\theta = 2\theta \implies 5\theta = 3 \implies \theta_{MLE} = 0.6$$
Result: The math tells us the maximum likelihood estimate is 0.6 (or 60%), perfectly matching our observed 3 out of 5 heads!
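As a sanity check on the calculus, a brute-force search over candidate values of $\theta$ should land on the same answer. This is only a sketch: the grid step of 0.01 and the function name `log_likelihood` are arbitrary choices.

```python
import math

flips = [1, 1, 0, 1, 0]  # Heads, Heads, Tails, Heads, Tails

def log_likelihood(theta, data):
    """Sum of log-probabilities of each flip under a coin with P(Heads) = theta."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
theta_grid = max(grid, key=lambda t: log_likelihood(t, flips))

theta_closed = sum(flips) / len(flips)  # closed form: fraction of Heads

print(theta_grid, theta_closed)
```

Both routes agree, and the closed form generalizes: for any sequence of coin flips, the Bernoulli MLE is the number of Heads divided by the number of flips.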
Example 2: Call Center Wait Times (Exponential Distribution)
Let's say the time it takes for a call center to answer the phone follows an Exponential distribution: $P(x) = \lambda e^{-\lambda x}$.
You record 3 wait times (in minutes): 2, 4, 6.
What is the maximum likelihood estimate for the rate parameter $\lambda$?
Step 1: Write the Likelihood Function
$$L(\lambda | X) = (\lambda e^{-2\lambda}) \times (\lambda e^{-4\lambda}) \times (\lambda e^{-6\lambda}) = \lambda^3 e^{-12\lambda}$$
Step 2: Take the Log-Likelihood
$$\ell(\lambda) = 3\log(\lambda) - 12\lambda$$
Step 3: Take the Derivative and set to 0
$$\frac{d}{d\lambda} \ell(\lambda) = \frac{3}{\lambda} - 12 = 0$$
Step 4: Solve for $\lambda$
$$12\lambda = 3 \implies \lambda_{MLE} = 0.25$$
Result: The rate parameter $\lambda$ is $0.25$ calls per minute. This means the average wait time is $1/0.25 = 4$ minutes.
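The same steps generalize: for $n$ exponential observations the log-likelihood is $n\log(\lambda) - \lambda \sum x_i$, which peaks at $\lambda = n / \sum x_i$. A small sketch with illustrative names checks this against the numbers above.

```python
import math

waits = [2.0, 4.0, 6.0]  # recorded wait times in minutes

def exp_log_likelihood(lam, data):
    """Log-likelihood of i.i.d. exponential data: n*log(lam) - lam*sum(data)."""
    return len(data) * math.log(lam) - lam * sum(data)

lam_hat = len(waits) / sum(waits)  # closed-form MLE: n / sum of observations
mean_wait = 1.0 / lam_hat          # implied average wait time

print(lam_hat, mean_wait)

# Nudging lambda in either direction should lower the log-likelihood.
print(exp_log_likelihood(lam_hat, waits) > exp_log_likelihood(0.24, waits))
print(exp_log_likelihood(lam_hat, waits) > exp_log_likelihood(0.26, waits))
```

The implied average wait equals the plain sample mean of the three recorded times, which is a handy way to remember the exponential MLE.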
Part 5: Advanced Real-World Complications
In the real world, datasets are messy. Variances aren't always known, data is sometimes hidden or cut off, and probabilities have strict constraints. Let's see how MLE handles these complications.
Complication 1: Unknown Mean AND Unknown Variance
Example 3: Sensor Calibration (Gaussian with Two Parameters)
In the linear regression derivation, we treated the variance (the noise) as a known constant. But what if you buy a cheap temperature sensor and you don't know the true temperature ($\mu$) or how noisy the sensor is ($\sigma^2$)?
You take 3 readings: 10°C, 12°C, 14°C. We must estimate both $\mu$ and $\sigma^2$ simultaneously using partial derivatives.
Let's call the variance $v$ (so $v = \sigma^2$) just to make the calculus easier to read. Our dataset is $X = \{10, 12, 14\}$, so $n=3$.
Step 1: Write the Log-Likelihood Function
The full Gaussian log-likelihood for $n$ data points is:
$$\ell(\mu, v) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(v) - \frac{1}{2v} \sum_{i=1}^{n} (x_i - \mu)^2$$
For our 3 data points, this becomes:
$$\ell(\mu, v) = -\frac{3}{2} \log(2\pi) - \frac{3}{2} \log(v) - \frac{1}{2v} \left[ (10-\mu)^2 + (12-\mu)^2 + (14-\mu)^2 \right]$$
Step 2: Find the Mean ($\mu$) first
We take the partial derivative with respect to $\mu$ and set it to 0. (The first two log terms disappear because they don't contain $\mu$).
$$\frac{\partial}{\partial \mu} \ell(\mu, v) = - \frac{1}{2v} [ 2(10-\mu)(-1) + 2(12-\mu)(-1) + 2(14-\mu)(-1) ] = 0$$
Since $1/(2v)$ can't be zero, the inside must be zero:
$$(10-\mu) + (12-\mu) + (14-\mu) = 0 \implies 36 - 3\mu = 0 \implies \mathbf{\mu_{MLE} = 12}$$
Step 3: Find the Variance ($v$)
Now we take the partial derivative with respect to $v$. We plug in our known $\mu=12$.
The sum of squares is: $(10-12)^2 + (12-12)^2 + (14-12)^2 = (-2)^2 + 0^2 + (2)^2 = 4 + 0 + 4 = 8$.
So our equation is just:
$$\ell(v) = -\frac{3}{2} \log(2\pi) - \frac{3}{2} \log(v) - \frac{8}{2v}$$
Take the derivative (remember the derivative of $1/v$ is $-1/v^2$):
$$\frac{\partial}{\partial v} \ell(v) = -\frac{3}{2v} + \frac{8}{2v^2} = 0$$
Multiply everything by $2v^2$ to clear the fractions:
$$-3v + 8 = 0 \implies 3v = 8 \implies \mathbf{v_{MLE} = \frac{8}{3} \approx 2.67}$$
Result: The MLE for the room temperature is 12°C, and the MLE for the sensor's variance is about 2.67 (in °C²). Note that this divides the sum of squares by $n = 3$, not by $n - 1$. MLE handles multi-parameter optimization gracefully using partial derivatives!
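Python's standard `statistics` module can reproduce both estimates. The sketch below also computes the more familiar sample variance for comparison; the variable names are illustrative.

```python
import statistics

readings = [10.0, 12.0, 14.0]  # sensor readings in °C

mu_hat = statistics.fmean(readings)            # MLE of the mean: the sample average
var_mle = statistics.pvariance(readings)       # divides by n: matches the MLE
var_sample = statistics.variance(readings)     # divides by n - 1: the usual sample variance

print(mu_hat, var_mle, var_sample)
```

The MLE of the variance is the "population" formula, so it is smaller than the $n-1$ version. With only 3 readings the gap is noticeable; with large datasets it becomes negligible.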
Complication 2: Censored Data (We don't have the full story)
Example 4: Server Hard Drive Failures (Survival Analysis)
You are testing the lifespan of hard drives. You start 3 drives at the same time.
Drive A dies after exactly 2 years.
Drive B dies after exactly 4 years.
Drive C is still running perfectly at year 5 when your boss demands the report.
You don't know when Drive C will die, only that it survived at least 5 years. This is called Right-Censored Data. If you just pretend Drive C died at year 5, you will underestimate how long drives last. We assume lifespans follow an Exponential distribution.
Step 1: Mix the Likelihoods
For Drives A and B, we know the exact time of death, so we use the standard Probability Density Function (PDF): $f(x) = \lambda e^{-\lambda x}$.
For Drive C, we only know it survived past year 5, so we use the Survival Function (1 minus the Cumulative Distribution Function): $S(x) = e^{-\lambda x}$.
$$L(\lambda | X) = f(2) \times f(4) \times S(5)$$
$$L(\lambda) = (\lambda e^{-2\lambda}) \times (\lambda e^{-4\lambda}) \times (e^{-5\lambda})$$
Add the exponents together:
$$L(\lambda) = \lambda^2 e^{-11\lambda}$$
Step 2: Take the Log-Likelihood & Derivative
$$\ell(\lambda) = 2\log(\lambda) - 11\lambda$$
$$\frac{d}{d\lambda} \ell(\lambda) = \frac{2}{\lambda} - 11 = 0$$
Step 3: Solve for $\lambda$
$$\frac{2}{\lambda} = 11 \implies \mathbf{\lambda_{MLE} = \frac{2}{11} \approx 0.1818}$$
Result: The failure rate is about 0.1818 failures per year, which means the estimated average lifespan of a drive is $1 / \lambda_{MLE} = 11/2 = \mathbf{5.5 \text{ years}}$. Notice how elegantly MLE accounted for the drive that hadn't died yet!
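The pattern in this derivation generalizes for the exponential model: each observed failure contributes one factor of $\lambda$, and every drive (failed or not) contributes its running time to the exponent, so the estimate is the number of failures divided by the total observed time. The sketch below assumes that model; the function name `exp_mle_censored` is hypothetical.

```python
failures = [2.0, 4.0]  # Drives A and B: exact failure times in years
censored = [5.0]       # Drive C: still running when the test stopped

def exp_mle_censored(failure_times, censor_times):
    """Exponential-rate MLE with right-censoring:
    number of observed failures / total time all units were observed."""
    total_time = sum(failure_times) + sum(censor_times)
    return len(failure_times) / total_time

lam_hat = exp_mle_censored(failures, censored)           # 2 failures over 11 drive-years
lam_naive = exp_mle_censored(failures + censored, [])    # wrongly treats C as dying at year 5

print(1.0 / lam_hat)    # estimated mean lifespan with censoring handled
print(1.0 / lam_naive)  # shorter estimate from the naive shortcut
```

Comparing the two printed lifespans shows the cost of the shortcut: pretending the surviving drive failed at year 5 makes drives look less reliable than the data supports.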
Complication 3: Strict Mathematical Constraints
Example 5: A/B/C Marketing Test (Multinomial Distribution)
You run an email marketing campaign with 3 possible outcomes: Click ($p_1$), Hover ($p_2$), or Ignore ($p_3$). You send it to 100 people and observe:
50 Clicks
30 Hovers
20 Ignores
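The complication? Every recipient lands in exactly one of the three outcomes, so the probabilities must add up to exactly 1 (i.e., $p_1 + p_2 + p_3 = 1$). If we treat $p_1$, $p_2$, and $p_3$ as three free variables and just take standard derivatives, nothing stops the math from pushing all of them up together, which gives "probabilities" that sum to more than 1. That is impossible.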
To fix this, we enforce the constraint by substitution: Since they sum to 1, we know that $p_3 = 1 - p_1 - p_2$.
Step 1: Write the Likelihood Function
For categorical data, we use the Multinomial distribution. (We can ignore the factorial constants at the front because they disappear when we take the derivative).
$$L(p_1, p_2) \propto p_1^{50} \times p_2^{30} \times (1 - p_1 - p_2)^{20}$$
Step 2: Take the Log-Likelihood
$$\ell(p_1, p_2) = 50\log(p_1) + 30\log(p_2) + 20\log(1 - p_1 - p_2)$$
Step 3: Take Partial Derivatives
First, with respect to $p_1$ (treat $p_2$ as a constant):
$$\frac{\partial \ell}{\partial p_1} = \frac{50}{p_1} - \frac{20}{1 - p_1 - p_2} = 0$$
Move the negative term over and cross-multiply:
$$\frac{50}{p_1} = \frac{20}{1 - p_1 - p_2} \implies 50(1 - p_1 - p_2) = 20p_1$$
Notice that $(1 - p_1 - p_2)$ is just $p_3$. So this simplifies to:
$$50p_3 = 20p_1 \implies \mathbf{p_1 = 2.5 p_3}$$
Next, take the derivative with respect to $p_2$:
$$\frac{\partial \ell}{\partial p_2} = \frac{30}{p_2} - \frac{20}{1 - p_1 - p_2} = 0$$
Following the exact same algebra:
$$30(1 - p_1 - p_2) = 20p_2 \implies 30p_3 = 20p_2 \implies \mathbf{p_2 = 1.5 p_3}$$
Step 4: Solve Using the Constraint
We know $p_1 + p_2 + p_3 = 1$. Let's plug in what we found!
$$(2.5 p_3) + (1.5 p_3) + p_3 = 1$$
$$5 p_3 = 1 \implies \mathbf{p_3 = 0.20}$$
Now plug $p_3$ back into our other equations:
$$p_1 = 2.5(0.20) = \mathbf{0.50}$$
$$p_2 = 1.5(0.20) = \mathbf{0.30}$$
Result: The math confirms our intuition: $p_1=50\%$, $p_2=30\%$, and $p_3=20\%$, which are exactly the observed fractions 50/100, 30/100, and 20/100. MLE respected the probability constraint and still landed on the obvious answer!
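In code, the result amounts to dividing each count by the total. The sketch below also compares the log-likelihood at the MLE against one alternative that still sums to 1, as a spot check rather than a proof. The outcome labels and the alternative probabilities are made up for illustration.

```python
import math

counts = {"click": 50, "hover": 30, "ignore": 20}

def multinomial_mle(counts):
    """MLE for category probabilities: each count divided by the total."""
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def log_likelihood(probs, counts):
    """Multinomial log-likelihood, dropping the constant factorial term."""
    return sum(c * math.log(probs[k]) for k, c in counts.items())

p_hat = multinomial_mle(counts)

# A hypothetical alternative that moves probability from "click" to "hover"
# while still summing to 1.
p_shifted = {"click": 0.45, "hover": 0.35, "ignore": 0.20}

print(p_hat)
print(log_likelihood(p_hat, counts) > log_likelihood(p_shifted, counts))
```

Because each estimate is a count divided by the same total, the estimates sum to 1 automatically, so the constraint is satisfied without any extra work.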