
The idea of a sequence, in the context of biology, can be seen everywhere, including in our daily life. I will be talking about important concepts such as the derivation of entropy, mutation, and the derivation of the master equation in the context of biology. First, let's assume that we have an alphabet of size $m$ with characters $a_1, a_2, \ldots, a_m$. We want to find the probability of producing a sequence with a specific number of characters from each letter of the alphabet. Say we have $n_1$ copies of $a_1$, $n_2$ copies of $a_2$, ..., and $n_m$ copies of $a_m$. Write the sum $n_1 + n_2 + \cdots + n_m$ as $N$, and let the probabilities for each character to appear be $p_1, p_2, \ldots, p_m$. We then have the following relations:

$$\sum p_i = 1$$

$$\sum_{n_1 + n_2 + \cdots + n_m = N} {N \choose n_1, n_2, \ldots, n_m} \prod_{i=1}^m p_i^{n_i} = 1$$

Specifically, the probability of observing $\#(a_1) = n_1, \#(a_2) = n_2, \ldots, \#(a_m) = n_m$ is $${N \choose n_1, n_2, \ldots, n_m} \prod_{i=1}^m p_i^{n_i}$$
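As a quick sanity check, here is a minimal Python sketch, with an assumed 3-letter alphabet and made-up probabilities, that computes this multinomial probability and verifies the normalization identity above by summing over all compositions of $N$:

```python
import itertools
from math import factorial, prod

def multinomial_prob(counts, probs):
    """Probability of observing exactly these character counts in a sequence
    of length N = sum(counts), with characters drawn i.i.d. with probs."""
    N = sum(counts)
    coeff = factorial(N)
    for n in counts:
        coeff //= factorial(n)          # multinomial coefficient N! / (n_1! ... n_m!)
    return coeff * prod(p ** n for p, n in zip(probs, counts))

# Hypothetical 3-letter alphabet with assumed probabilities.
probs = [0.5, 0.3, 0.2]
N = 6

# Summing over every composition (n_1, n_2, n_3) with n_1 + n_2 + n_3 = N
# should give 1, matching the normalization identity above.
total = sum(
    multinomial_prob((n1, n2, N - n1 - n2), probs)
    for n1 in range(N + 1)
    for n2 in range(N + 1 - n1)
)
print(total)                                # ~1.0
print(multinomial_prob((3, 2, 1), probs))   # P(#a1 = 3, #a2 = 2, #a3 = 1)
```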

This probability quantifies the information content of the sequence, as it indicates how likely or unlikely it is to observe a specific combination of character counts. The greater the number of possible arrangements with the same composition of characters, the higher the information content, or entropy, of the sequence.

If we have a sufficiently long sequence of length $M$, then we expect to see about $p_i M$ occurrences of the character $a_i$. This is intuitive: if we toss a coin many times, we estimate the probability of heads from the frequency of heads across all concatenated trials. Next, consider how much information we gain. If one sequence has size $n$ for some arbitrary $n$, then a sequence of size $2n$ should give us twice as much information, so information should be additive in the sequence length. A similar separation appears in computational complexity, where an oracle limited to $O(n)$ time can solve strictly fewer problems than one allowed $O(n^2)$ time, though there the gap is measured on an exponential scale, while here we only deal with multiplication, due to the nature of the multinomial coefficient.
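To illustrate the first claim, here is a small simulation sketch; the alphabet and probabilities are assumed for illustration, not taken from any particular biological data. For increasingly long sequences drawn character by character, the empirical frequency $n_i/M$ settles near $p_i$:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical alphabet and assumed probabilities; M is the sequence length.
alphabet = ["a1", "a2", "a3"]
probs = [0.5, 0.3, 0.2]

for M in (100, 10_000, 1_000_000):
    seq = random.choices(alphabet, weights=probs, k=M)
    counts = Counter(seq)
    # Empirical frequency n_i / M approaches p_i as M grows.
    print(M, [round(counts[a] / M, 4) for a in alphabet])
```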

Since the information in a sequence should be additive while the number of arrangements multiplies, it is natural to take the logarithm of the multinomial coefficient, from which we will recover the entropy of the system.

$$\ln { N \choose n_1, n_2, \ldots, n_m} = \ln(N!) - \ln(n_1!) - \ln(n_2!) - \cdots - \ln(n_m!)$$

$$\approx N \ln N - N - n_1 \ln n_1 + n_1 - n_2 \ln n_2 + n_2 - \cdots - n_m \ln n_m + n_m$$

using Stirling's approximation $\ln n! \approx n \ln n - n$. Since $n_1 + n_2 + \cdots + n_m = N$, the $-N$ and $+n_i$ terms cancel, leaving

$$= (n_1 + n_2 + \cdots + n_m) \ln N - n_1 \ln n_1 - n_2 \ln n_2 - \cdots - n_m \ln n_m $$

$$= -N\sum_{i=1}^m \frac{n_i}{N} \ln\left(\frac{n_i}{N}\right)$$
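To see how good this approximation is, the following sketch, using assumed example compositions, compares the exact $\ln$ of the multinomial coefficient (computed via log-gamma) against the entropy expression $-N\sum_i (n_i/N)\ln(n_i/N)$; the relative error shrinks as $N$ grows:

```python
from math import lgamma, log

def log_multinomial(counts):
    """Exact ln of the multinomial coefficient:
    ln N! - sum ln n_i!, using lgamma(n + 1) = ln n!."""
    N = sum(counts)
    return lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)

def entropy_approx(counts):
    """Stirling approximation: -N * sum (n_i/N) ln(n_i/N)."""
    N = sum(counts)
    return -N * sum((n / N) * log(n / N) for n in counts if n > 0)

# Assumed example compositions; the relative error shrinks as N grows.
for counts in [(5, 3, 2), (50, 30, 20), (5000, 3000, 2000)]:
    exact, approx = log_multinomial(counts), entropy_approx(counts)
    print(counts, round(exact, 2), round(approx, 2))
```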