Pseudo Random Generator (PRG) is widely used, from basic rand() calls in C programs, cryptography, to randomized algorithm and network simulations. The key idea of PRG is to leverage deterministic arithmetic operations to produce random numbers that appear to be truely random, a naive example (yet a poor source of random numbers) is the John von Neumann’s approach in 1946 which takes the square of the previous number and extract the middle digits.

The property of determinism might seem counter-intuitive, one could enhance the intuition leveraging RANDOM internal Bash function (or /dev/random, /dev/urandom etc):

# Non-deterministic upon multiple runs since seed is randomly initialized
# Deterministic across multiple runs due to the given fixed seed

As John von Neumann said “Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.” — a natural doubt one might have at first glance given that the generated number is determined by the previous number (TAOCP Vol. II). How to justify a PRG algorithm design in practice, i.e., what does it mean by the output sequence appears to be truely random (which is vague)? The way to define it quantitatively is to step back and define it as a proposition under concrete conditionals. A set of sequences appears to be random w.r.t. a set of a collection of statistical tests (output a binary decision on whether the source is a truely random source using certain algorithms computed over the provided sequences) if it behaves the same as truely random sequences under these specified tests. We can’t brute force all possible tests in practice yet the better/more the tests, the more confidence we have in terms of the closeness of the approximation.

PRG is extensively used in the story of cryptography. As is known, One-time Pad is perfectly secure (a.k.a. information-theoretic security) for ciphertext-only attackers even with unlimited computing power, i.e., as Claude Shannon put it, the mutual information \(I(m;c)=0\) where m and c are random variables for plaintext and ciphertext messages respectively. An alternative definition is \(\forall m_{1}, m_{2} \in M, \forall c \in C, P_{k\in K}(E_{k}(m_{1})=c) = P_{k\in K}(E_{k}(m_{2})=c)\ \forall k \in K\). In other words, \(k\oplus m\) should be a (continuous) random variable of uniform distribution, which can be proved by leveraging the useful property of XOR: \(Z=X\oplus Y\) is uniform if \(X\) is a arbitrary random variable over \(\{0,1\}^{n}\) and \(Y\) is an independent uniform variable (the introduction of an independent uniform random variable “wipes out” the original distribution after XOR).

However, in practice, it is hard to generate a truly random key with at least the size of the message. Meanwhile, we have to secretely share such symmetric key each time since it can not be reused. To make OTP practical, instead of generating truely random keys, we could use PRG \(G\) which takes an initial seed and generates deterministically pseudorandom bits, i.e., \(G: \{0,1\}^{l} \rightarrow \{0,1\}^{n}, l \leq n\). That gives us so called stream cipher to overcome the limitations of OTP.

Is PRG secure? One track is to leverage the previous definition of “appears to be random” and interpret PRG security as computational indistinguishability from a true random source under statistical tests. A formal formulation (A Graduate Course in Applied Cryptography Attack Game 3.1): A PRG is secure if the adversary A’s advantage (\(Adv(A, G) = |P(\hat{b}=1|challenger\ sends\ G(k))-\frac{1}{2}|\)) is negligible for all efficient statistical tests.

An alternative (actually can be proved equivalent to the other) track is that PRG has to be next-bit-unpredictable where unpredictability can be formulated as following (A Graduate Course in Applied Cryptography 3.5 Attack Game 3.2): A PRG is predictable if there exists an efficient prediction algorithm \(A\) (given PRG output at indicies \(0 \leq i \leq L-1\) and output next bit) for the adversary s.t. its advantage \(Adv(A, G) = |P(A(G(k)_{0,1,\ldots,L-1})=G(k)_{L})-\frac{1}{2}|\geq \epsilon\), where \(\epsilon\) is a non-negligible value (practically/not rigirously, (non-)negligilble can be intepreted as a small scalar, e.g., non-negligible \(\epsilon=2^{-30}\), negligible \(\epsilon=2^{-100}\)). Similarly, a PRG is unpredicatable if the advantage is less than a negligible.

Note that for both tracks above to justify a secure PRG, we actually allow a tiny leakage of information. Therefore, perfect security no longer holds for the corresponding ciphers and we need an alternative security definition for ciphers against a computationally bounded adversary. That gives us semantic security, weaker yet practical: instead of enforcing strict equality as in perfect security, we only need them to be close enough, i.e., \(|P_{k\in K}(E_{k}(m_{1})=c) - P_{k\in K}(E_{k}(m_{2})=c)| \leq \epsilon\). A more vivid/practical formulation is again via an attack game as in A Graduate Course in Applied Cryptography 2.2.2 Attack Game 2.1. For a given cipher \(\mathcal{E}=\{E,D\}\) over \((K, M, C)\), experiement \(b \in \{0,1\}\):

  • Adversary A chooses \(m_{0}, m_{1} \in \mathcal{M}\) where \(\|m_{0}\|=\|m_{1}\|\).
  • The challenger computes \(k \in K, c = E_{k}(m_{b})\) and sends c to A.
  • A decides \(\hat{b}\in\{0,1\}\)

A cipher \(\mathcal{E}\) is said to be semantically secure if for all efficient adversaries, the advantage \(Adv(A, \mathcal{E}) = \lvert P(W_{0})-P(W_{1})\rvert\leq\epsilon\) where we denote \(W_{b}\) as the event that A outputs 1 in experiment b. Such formulation is equivalent to Indistinguishability under Chosen Plaintext Attack (IND-CPA) in symmetric case (IND-CPA can also be applied to public key crypto).