Bretagnolle–Huber inequality

Inequality in information theory

In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions $P$ and $Q$ by a concave and bounded function of the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \parallel Q)$. The bound can be viewed as an alternative to the well-known Pinsker's inequality: when $D_{\mathrm{KL}}(P \parallel Q)$ is large (larger than 2, for instance[1]), Pinsker's inequality is vacuous, while the Bretagnolle–Huber bound remains strictly smaller than 1 and hence non-vacuous. It is used in statistics and machine learning to prove information-theoretic lower bounds that rely on hypothesis testing.[2] (The Bretagnolle–Huber–Carol inequality is a related concentration inequality for multinomially distributed random variables which bounds the total variation distance.)
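For a concrete comparison (an illustrative sketch, not part of the cited sources; the value of the divergence is an arbitrary choice), the two bounds can be evaluated side by side, recalling that Pinsker's inequality reads $d_{\mathrm{TV}} \leq \sqrt{D_{\mathrm{KL}}/2}$:

```python
import math

def pinsker_bound(kl: float) -> float:
    # Pinsker's inequality: d_TV <= sqrt(KL / 2)
    return math.sqrt(kl / 2)

def bretagnolle_huber_bound(kl: float) -> float:
    # Bretagnolle–Huber: d_TV <= sqrt(1 - exp(-KL))
    return math.sqrt(1 - math.exp(-kl))

kl = 3.0  # arbitrary example value of the KL divergence
print(pinsker_bound(kl))            # ~1.22 > 1, vacuous for a total variation distance
print(bretagnolle_huber_bound(kl))  # ~0.97 < 1, still informative
```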

Formal statement

Preliminary definitions

Let $P$ and $Q$ be two probability distributions on a measurable space $(\mathcal{X}, \mathcal{F})$. Recall that the total variation distance between $P$ and $Q$ is defined by

$$ d_{\mathrm{TV}}(P,Q) = \sup_{A \in \mathcal{F}} \{ |P(A) - Q(A)| \}. $$

The Kullback–Leibler divergence is defined as follows:

$$ D_{\mathrm{KL}}(P \parallel Q) = \begin{cases} \int_{\mathcal{X}} \log\!\left( \frac{dP}{dQ} \right) dP & \text{if } P \ll Q, \\ +\infty & \text{otherwise}. \end{cases} $$

In the above, the notation $P \ll Q$ stands for absolute continuity of $P$ with respect to $Q$, and $\frac{dP}{dQ}$ stands for the Radon–Nikodym derivative of $P$ with respect to $Q$.
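For distributions on a finite space, both quantities reduce to finite sums, which makes the inequality easy to check numerically. A minimal sketch (the function names and example distributions below are illustrative, not from the source):

```python
import math

def tv_distance(P, Q):
    """Total variation distance between two discrete distributions,
    given as sequences of probabilities over the same finite space."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def kl_divergence(P, Q):
    """Kullback-Leibler divergence D_KL(P || Q), assuming P << Q
    (q > 0 wherever p > 0); terms with p == 0 contribute nothing."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

# Arbitrary example distributions on three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.4, 0.4]
print(tv_distance(P, Q))    # 0.3
print(kl_divergence(P, Q))  # ~0.23
```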

General statement

The Bretagnolle–Huber inequality says:

$$ d_{\mathrm{TV}}(P,Q) \leq \sqrt{1 - \exp(-D_{\mathrm{KL}}(P \parallel Q))} \leq 1 - \frac{1}{2}\exp(-D_{\mathrm{KL}}(P \parallel Q)). $$
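As a quick sanity check (an illustrative sketch; the Bernoulli parameters are arbitrary), both bounds can be verified in closed form for a pair of Bernoulli distributions:

```python
import math

# Bernoulli(p) versus Bernoulli(q); arbitrary example parameters.
p, q = 0.5, 0.8

tv = abs(p - q)  # total variation distance between the two Bernoullis
kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

first_bound = math.sqrt(1 - math.exp(-kl))   # sqrt(1 - exp(-KL))
second_bound = 1 - 0.5 * math.exp(-kl)       # 1 - exp(-KL) / 2

assert tv <= first_bound <= second_bound
print(tv, first_bound, second_bound)  # 0.3  ~0.447  0.6
```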

Alternative version

The following version is directly implied by the bound above, but some authors[2] prefer stating it this way. Let $A \in \mathcal{F}$ be any event. Then

$$ P(A) + Q(\bar{A}) \geq \frac{1}{2}\exp(-D_{\mathrm{KL}}(P \parallel Q)), $$

where $\bar{A} = \mathcal{X} \smallsetminus A$ is the complement of $A$.

Indeed, by definition of the total variation distance, for any $A \in \mathcal{F}$,

$$ \begin{aligned} Q(A) - P(A) \leq d_{\mathrm{TV}}(P,Q) &\leq 1 - \frac{1}{2}\exp(-D_{\mathrm{KL}}(P \parallel Q)) \\ &= Q(A) + Q(\bar{A}) - \frac{1}{2}\exp(-D_{\mathrm{KL}}(P \parallel Q)). \end{aligned} $$

Cancelling $Q(A)$ on both sides and rearranging, we obtain the claimed lower bound on $P(A) + Q(\bar{A})$.

Proof

We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89),[3] which differ from the original proof[4] (see Canonne's note[1] for a modernized account of the original argument).

The proof is in two steps:

1. Prove, using the Cauchy–Schwarz inequality, that the total variation distance is controlled by the Bhattacharyya coefficient (whose square appears on the right-hand side of the inequality below):

$$ 1 - d_{\mathrm{TV}}(P,Q)^2 \geq \left( \int \sqrt{PQ} \right)^2 $$

2. Prove by a clever application of Jensen’s inequality that

$$ \left( \int \sqrt{PQ} \right)^2 \geq \exp(-D_{\mathrm{KL}}(P \parallel Q)) $$
  • Step 1:
First notice that
$$ d_{\mathrm{TV}}(P,Q) = 1 - \int \min(P,Q) = \int \max(P,Q) - 1. $$
To see this, denote $A^* = \arg\max_{A \in \mathcal{F}} |P(A) - Q(A)|$ and, without loss of generality, assume that $P(A^*) > Q(A^*)$, so that $d_{\mathrm{TV}}(P,Q) = P(A^*) - Q(A^*)$. On $A^*$ the density of $P$ dominates that of $Q$, and the opposite holds on $\bar{A^*}$, so we can rewrite
$$ d_{\mathrm{TV}}(P,Q) = \int_{A^*} \max(P,Q) - \int_{A^*} \min(P,Q). $$
Adding and removing $\int_{\bar{A^*}} \max(P,Q)$ or $\int_{\bar{A^*}} \min(P,Q)$, we obtain both identities.
Then
$$ \begin{aligned} 1 - d_{\mathrm{TV}}(P,Q)^2 &= (1 - d_{\mathrm{TV}}(P,Q))(1 + d_{\mathrm{TV}}(P,Q)) \\ &= \int \min(P,Q) \int \max(P,Q) \\ &\geq \left( \int \sqrt{\min(P,Q)\max(P,Q)} \right)^2 \\ &= \left( \int \sqrt{PQ} \right)^2, \end{aligned} $$
where the inequality is the Cauchy–Schwarz inequality applied to $\sqrt{\min(P,Q)}$ and $\sqrt{\max(P,Q)}$, and the last equality holds because $PQ = \min(P,Q)\max(P,Q)$.
  • Step 2:
We write $(\cdot)^2 = \exp(2\log(\cdot))$ and apply Jensen's inequality, using the concavity of the logarithm:
$$ \begin{aligned} \left( \int \sqrt{PQ} \right)^2 &= \exp\left( 2\log\left( \int \sqrt{PQ} \right) \right) \\ &= \exp\left( 2\log\left( \int P \sqrt{\frac{Q}{P}} \right) \right) \\ &= \exp\left( 2\log\left( \operatorname{E}_P\left[ \left( \sqrt{\frac{P}{Q}} \right)^{-1} \right] \right) \right) \\ &\geq \exp\left( \operatorname{E}_P\left[ -\log\left( \frac{P}{Q} \right) \right] \right) = \exp(-D_{\mathrm{KL}}(P \parallel Q)). \end{aligned} $$
Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.
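The two intermediate quantities in the proof can be checked numerically on a small discrete example (an illustrative sketch; the Bernoulli parameters are arbitrary):

```python
import math

# Bernoulli(p) versus Bernoulli(q); arbitrary example parameters.
p, q = 0.5, 0.8
P = [p, 1 - p]
Q = [q, 1 - q]

tv = 0.5 * sum(abs(a - b) for a, b in zip(P, Q))      # total variation distance
bc = sum(math.sqrt(a * b) for a, b in zip(P, Q))      # Bhattacharyya coefficient
kl = sum(a * math.log(a / b) for a, b in zip(P, Q))   # KL divergence

# Step 1: 1 - d_TV^2 >= BC^2 ;  Step 2: BC^2 >= exp(-KL)
assert 1 - tv**2 >= bc**2 >= math.exp(-kl)
print(1 - tv**2, bc**2, math.exp(-kl))  # 0.91  0.9  0.8
```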

Examples of applications

Sample complexity of biased coin tosses

Source:[1]

The question is: how many coin tosses are needed to distinguish a fair coin from a biased one?

Assume you have two coins: a fair coin (Bernoulli distributed with mean $p_1 = 1/2$) and an $\varepsilon$-biased coin ($p_2 = 1/2 + \varepsilon$). Then, in order to identify the biased coin with probability at least $1 - \delta$ (for some $\delta > 0$), the number of tosses $n$ must satisfy

$$ n \geq \frac{1}{2\varepsilon^2} \log\left( \frac{1}{2\delta} \right). $$

In order to obtain this lower bound, we impose that the total variation distance between the joint distributions of the two sequences of $n$ samples is at least $1 - 2\delta$. This is because any test that identifies the correct coin with probability at least $1 - \delta$ under each hypothesis has error probabilities summing to at most $2\delta$, and this sum is always at least $1 - d_{\mathrm{TV}}$ (as in the derivation of the alternative version above). Denote by $P_1^{n}$ and $P_2^{n}$ the respective joint distributions of the $n$ coin tosses for each coin.

We have

$$ \begin{aligned} (1 - 2\delta)^2 &\leq d_{\mathrm{TV}}\left( P_1^{n}, P_2^{n} \right)^2 \\ &\leq 1 - e^{-D_{\mathrm{KL}}(P_1^{n} \parallel P_2^{n})} \\ &= 1 - e^{-n D_{\mathrm{KL}}(P_1 \parallel P_2)} \\ &= 1 - e^{-n \frac{\log(1/(1 - 4\varepsilon^2))}{2}}. \end{aligned} $$
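Here the penultimate equality uses the additivity of the KL divergence over the $n$ independent tosses, $D_{\mathrm{KL}}(P_1^{n} \parallel P_2^{n}) = n\,D_{\mathrm{KL}}(P_1 \parallel P_2)$, and the last equality uses the closed form of the KL divergence between the two coins:

$$ D_{\mathrm{KL}}(P_1 \parallel P_2) = \frac{1}{2}\log\frac{1/2}{1/2+\varepsilon} + \frac{1}{2}\log\frac{1/2}{1/2-\varepsilon} = \frac{1}{2}\log\frac{1}{1 - 4\varepsilon^2}. $$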

The result is obtained by rearranging the terms.
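As a rough illustration of the resulting sample sizes (an illustrative sketch; the helper name and the values of $\varepsilon$ and $\delta$ are arbitrary):

```python
import math

def min_tosses(eps: float, delta: float) -> int:
    """Lower bound n >= log(1 / (2 * delta)) / (2 * eps**2) from the display above."""
    return math.ceil(math.log(1 / (2 * delta)) / (2 * eps**2))

# Arbitrary example: detecting a 1% bias with failure probability at most 5%.
print(min_tosses(eps=0.01, delta=0.05))  # 11513
```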

Information-theoretic lower bound for k-armed bandit games

In the multi-armed bandit setting, a lower bound on the minimax regret of any bandit algorithm can be proved using Bretagnolle–Huber and its consequences for hypothesis testing (see Chapter 15 of Bandit Algorithms[2]).

History

The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber, and published in the proceedings of the Strasbourg Probability Seminar.[4] Alexandre Tsybakov's book[3] features an early re-publication of the inequality and its attribution to Bretagnolle and Huber, which is presented as an early and less general version of Assouad's lemma (see notes 2.8). A constant improvement on Bretagnolle–Huber was proved in 2014 as a consequence of an extension of Fano's Inequality.[5]

References

  1. ^ a b c Canonne, Clément (2022). "A short note on an inequality between KL and TV". arXiv:2202.07198 [math.PR].
  2. ^ a b c Lattimore, Tor; Szepesvari, Csaba (2020). Bandit Algorithms (PDF). Cambridge University Press. Retrieved 18 August 2022.
  3. ^ a b Tsybakov, Alexandre B. (2010). Introduction to nonparametric estimation. Springer Series in Statistics. Springer. doi:10.1007/b13794. ISBN 978-1-4419-2709-5. OCLC 757859245. S2CID 42933599.
  4. ^ a b Bretagnolle, J.; Huber, C. (1978), "Estimation des densités : Risque minimax", Séminaire de Probabilités XII, Lecture notes in Mathematics, vol. 649, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 342–363, doi:10.1007/bfb0064610, ISBN 978-3-540-08761-8, S2CID 122597694, retrieved 2022-08-20
  5. ^ Gerchinovitz, Sébastien; Ménard, Pierre; Stoltz, Gilles (2020-05-01). "Fano's Inequality for Random Variables". Statistical Science. 35 (2). arXiv:1702.05985. doi:10.1214/19-sts716. ISSN 0883-4237. S2CID 15808752.