<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-valin-videocodec-pvq-02"
ipr="noDerivativesTrust200902">
<front>
<title abbrev="Video PVQ">Pyramid Vector Quantization for Video Coding</title>
<author initials="JM." surname="Valin" fullname="Jean-Marc Valin">
<organization>Mozilla</organization>
<address>
<postal>
<street>331 E. Evelyn Avenue</street>
<city>Mountain View</city>
<region>CA</region>
<code>94041</code>
<country>USA</country>
</postal>
<email>jmvalin@jmvalin.ca</email>
</address>
</author>
<date day="9" month="March" year="2015" />
<abstract>
<t>This document proposes applying pyramid vector quantization (PVQ) to video coding.</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<t>
This draft describes a proposal for adapting the Opus <xref
target="RFC6716">RFC 6716</xref> energy conservation
principle to video coding based on a pyramid vector quantizer (PVQ)
<xref target="Pyramid-VQ"/>.
One potential advantage of conserving the energy of the AC coefficients in
video coding is that textures are preserved rather than low-passed.
Also, by introducing a fixed-resolution PVQ-type quantizer, we
automatically gain a simple activity masking model.</t>
<t>The main challenge of adapting this scheme to video is that we have a
good prediction (the reference frame), so we are essentially starting
from a point that is already on the PVQ hyper-sphere,
rather than at the origin as in CELT.
Other challenges are the introduction of a quantization matrix and the
fact that we want the reference (motion-predicted) data to correspond
exactly to one of the entries in our codebook. This proposal is described
in greater detail in <xref target="Perceptual-VQ"/>, as well as in the
demo <xref target="PVQ-demo"/>.
</t>
</section>
<section title="Gain-Shape Coding and Activity Masking">
<t>The main idea behind the proposed video coding scheme is to code
groups of DCT coefficients as a scalar gain and a unit-norm "shape" vector.
A block's AC coefficients may all be part of the same group, or may be
divided by frequency (e.g. by octave) and/or by directionality (horizontal
vs. vertical).
</t>
<t>It is desirable for a single quality parameter to control the resolution
of both the gain and the shape.
Ideally, that quality parameter should also take into account activity
masking, that is, the fact that the eye is less sensitive to regions of
an image that have more detail.
According to Jason Garrett-Glaser, the perceptual analysis in the x264
encoder uses a resolution proportional to the variance of the AC
coefficients raised to the power a, with a=0.173.
For gain-shape quantization, this is equivalent to using a resolution
of g^(2a), where g is the gain.
We can derive a scalar quantizer that follows this resolution:
<figure align="center">
<artwork align="center"><![CDATA[
               b
g = Q_g * gamma  ,
]]></artwork>
</figure>
where gamma is the gain quantization index, b=1/(1-2*a), and Q_g is the
gain resolution and main quality parameter.
</t>
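<t>As an illustrative, non-normative sketch, the following Python expresses
the companded gain quantizer above; the function and constant names are
assumptions made for this example:
<figure align="center">
<artwork align="left"><![CDATA[
import math

A = 0.173                  # activity masking exponent (from x264)
B = 1.0 / (1.0 - 2.0 * A)  # companding exponent b

def quantize_gain(g, q_g):
    """Map a band gain to the nearest gain index gamma."""
    return int(round((g / q_g) ** (1.0 / B)))

def dequantize_gain(gamma, q_g):
    """Reconstruct the gain: g = Q_g * gamma^b."""
    return q_g * gamma ** B
]]></artwork>
</figure>
</t>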
<t>An important aspect of the current proposal is the use of prediction.
In the case of the gain, there is usually a significant correlation
with the gain of neighboring blocks.
One way to predict the gain of a block is to compute the gain of the
coefficients obtained through intra or inter prediction.
Another way is to use the encoded gain of the neighboring blocks to
explicitly predict the gain of the current block.
</t>
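<t>Both prediction options can be sketched as follows; the averaging rule
for the neighbor-based predictor is an assumption for illustration, not
part of the proposal:
<figure align="center">
<artwork align="left"><![CDATA[
import numpy as np

def gain_from_prediction(r_d):
    """Option 1: gain of the intra- or inter-predicted coefficients."""
    return float(np.sqrt(r_d @ r_d))

def gain_from_neighbors(left_gain, above_gain):
    """Option 2 (hypothetical rule): combine coded neighbor gains."""
    return 0.5 * (left_gain + above_gain)
]]></artwork>
</figure>
</t>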
</section>
<section title="Householder Reflection">
<t>Let vector x_d denote the (pre-normalization) DCT band to be coded in
the current block and vector r_d denote the corresponding reference
(based on intra prediction or motion compensation). The encoder
computes and encodes the "band gain" g = sqrt(x_d^T x_d).
The normalized band is computed as
<figure align="center">
<artwork align="center"><![CDATA[
       x_d
x = --------- ,
    || x_d ||
]]></artwork>
</figure>
with the normalized reference vector r similarly computed based on r_d.
The encoder then finds the position and sign of the largest component in vector r:
<figure align="center">
<artwork align="center"><![CDATA[
m = argmax_i | r_i |
s = sign(r_m)
]]></artwork>
</figure>
and computes the Householder reflection that reflects r to -s e_m, where
e_m is a unit vector that points in the direction of dimension m.
The reflection vector is given by
<figure align="center">
<artwork align="center"><![CDATA[
v = r + s e_m .
]]></artwork>
</figure>
The encoder reflects the normalized band to find the unit-norm vector
<figure align="center">
<artwork align="center"><![CDATA[
          v^T x
z = x - 2 ----- v .
          v^T v
]]></artwork>
</figure>
</t>
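<t>A minimal sketch of these steps, assuming NumPy vectors for the
normalized band x and normalized reference r (the function name is an
assumption):
<figure align="center">
<artwork align="left"><![CDATA[
import numpy as np

def householder_reflect(x, r):
    """Reflect the normalized band x with the reflection that maps
    the normalized reference r onto -s * e_m."""
    m = int(np.argmax(np.abs(r)))    # position of largest component
    s = 1.0 if r[m] >= 0 else -1.0   # its sign
    v = r.copy()
    v[m] += s                        # v = r + s * e_m
    z = x - 2.0 * (v @ x) / (v @ v) * v
    return z, m, s
]]></artwork>
</figure>
</t>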
<t>The closer the current band is to the reference band, the closer
z is to -s e_m.
This can be represented either as an angle, or as a coordinate on a
projected pyramid.
</t>
</section>
<section title="Angle-Based Encoding">
<t>Assuming no quantization, the similarity can be represented by the angle
<figure align="center">
<artwork align="center"><![CDATA[
theta = arccos(-s z_m) .
]]></artwork>
</figure>
If theta is quantized and transmitted to the decoder, then z can be
reconstructed as
<figure align="center">
<artwork align="center"><![CDATA[
z = -s cos(theta) e_m + sin(theta) z_r ,
]]></artwork>
</figure>
where z_r is a unit vector based on z that excludes dimension m.
</t>
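<t>As a sketch of the reconstruction formula, assuming z_r is stored as a
full-length vector with a zero in dimension m:
<figure align="center">
<artwork align="left"><![CDATA[
import numpy as np

def reconstruct_band(theta, z_r, m, s):
    """Rebuild the reflected band z from the coded angle theta and
    the unit shape vector z_r (zero in dimension m)."""
    z = np.sin(theta) * z_r
    z[m] = -s * np.cos(theta)
    return z
]]></artwork>
</figure>
</t>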
<t>
The vector z_r can be quantized using PVQ.
Let y be a vector of integers that satisfies
<figure align="center">
<artwork align="center"><![CDATA[
sum_i(|y[i]|) = K ,
]]></artwork>
</figure>
with K determined in advance. The PVQ search then finds the vector y
that maximizes y^T z_r / sqrt(y^T y). The quantized version of z_r is
<figure align="center">
<artwork align="center"><![CDATA[
          y
z_rq = ------- .
       || y ||
]]></artwork>
</figure>
</t>
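<t>A non-normative sketch of the search, adding one pulse at a time
greedily (production encoders such as Opus use a faster
projection-plus-refinement search):
<figure align="center">
<artwork align="left"><![CDATA[
import numpy as np

def pvq_search(z_r, k):
    """Place k unit pulses so as to maximize y^T z_r / ||y||."""
    y = np.zeros_like(z_r)
    for _ in range(k):
        best_j, best_cos = 0, -np.inf
        for j in range(len(z_r)):
            t = y.copy()
            t[j] += 1.0 if z_r[j] >= 0 else -1.0
            c = (t @ z_r) / np.sqrt(t @ t)
            if c > best_cos:
                best_cos, best_j = c, j
        y[best_j] += 1.0 if z_r[best_j] >= 0 else -1.0
    # The quantized shape is then z_rq = y / ||y||.
    return y
]]></artwork>
</figure>
</t>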
<t>If we assume that MSE is a good criterion for optimizing the resolution,
then the angle quantization resolution should be (roughly)
<figure align="center">
<artwork align="center"><![CDATA[
             dg      1       b
Q_theta = -------- * -  =  -----  .
          d(gamma)   g     gamma
]]></artwork>
</figure>
To derive the optimal K, we need to consider the normalized distortion of a
Laplace-distributed variable, found experimentally to be approximately
<figure align="center">
<artwork align="center"><![CDATA[
      (N-1)^2 + C*(N-1)
D_p = ----------------- ,
            24*K^2
]]></artwork>
</figure>
with C ~= 4.2. The distortion due to the gain is
<figure align="center">
<artwork align="center"><![CDATA[
      b^2*Q_g^2*gamma^(2*b-2)
D_g = ----------------------- .
                12
]]></artwork>
</figure>
Since PVQ codes N-2 degrees of freedom, its distortion should also be
(N-2) times the gain distortion, which eventually leads us to the optimal
number of pulses
<figure align="center">
<artwork align="center"><![CDATA[
    gamma*sin(theta)       / N + C - 2 \
K = ---------------- sqrt | --------- | .
           b               \     2     /
]]></artwork>
</figure>
</t>
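<t>A small non-normative sketch of this computation (the argument names
are assumptions):
<figure align="center">
<artwork align="left"><![CDATA[
import math

def optimal_pulses(gamma, theta, n, b, c=4.2):
    """Number of PVQ pulses K from the distortion balance above."""
    return int(round(gamma * math.sin(theta) / b
                     * math.sqrt((n + c - 2.0) / 2.0)))
]]></artwork>
</figure>
</t>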
<t>The value of K does not need to be coded because all the variables it
depends on are known to the decoder.
However, because Q_theta depends on the gain, this can lead to
unacceptable loss propagation behavior in the case where inter prediction
is used for the gain.
This problem can be worked around by making the approximation
sin(theta)~=theta.
With this approximation, K depends only on the theta
quantization index, with no dependency on the gain.
Alternatively, instead of quantizing theta, we can quantize sin(theta),
which also removes the dependency on the gain.
In the general case, we quantize f(theta) and then assume that
sin(theta)~=f(theta). A possible choice of f(theta) is a quadratic
function of the form
<figure align="center">
<artwork align="center"><![CDATA[
                              2
f(theta) = a1 theta - a2 theta  ,
]]></artwork>
</figure>
where a1 and a2 are two constants satisfying the constraint that
f(pi/2)=pi/2.
The value of f(theta) can also be predicted, but in cases where we care
about error propagation, it should only be predicted from information
coded in the current frame.
</t>
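<t>For illustration, one way to implement such an f(theta) under the
f(pi/2)=pi/2 constraint; the value of a2 below is an arbitrary assumption:
<figure align="center">
<artwork align="left"><![CDATA[
import math

def f_theta(theta, a2=0.5):
    """Quadratic stand-in for sin(theta); a1 follows from the
    constraint f(pi/2) = pi/2."""
    a1 = 1.0 + a2 * math.pi / 2.0
    return a1 * theta - a2 * theta * theta
]]></artwork>
</figure>
</t>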
</section>
<section title="Bi-prediction">
<t>We can use this scheme for bi-prediction by introducing a second theta
parameter.
For the case of two (normalized) reference frames r1 and r2,
we introduce s1=(r1+r2)/2 and s2=(r1-r2)/2.
We start by using s1 as a reference, apply the Householder reflection to
both x and s2, and evaluate theta1.
From there, we derive a second Householder reflection from the reflected
version of s2 and apply it to z. The result is that the theta2 parameter
controls how the current image compares to the two reference images.
It should even be possible to use this in the case of fades, using two
references that both precede the frame being encoded.
</t>
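<t>A hedged sketch of the first stage of this scheme; the derivation of
theta2 from the second reflection is only indicated in a comment:
<figure align="center">
<artwork align="left"><![CDATA[
import numpy as np

def reflect(x, v):
    """Householder reflection of x using reflection vector v."""
    return x - 2.0 * (v @ x) / (v @ v) * v

def biprediction_theta1(x, r1, r2):
    """x, r1, r2 are unit-norm; returns theta1 plus the reflected
    band z and the reflected second reference t."""
    s1 = (r1 + r2) / 2.0
    s2 = (r1 - r2) / 2.0
    u1 = s1 / np.linalg.norm(s1)
    m = int(np.argmax(np.abs(u1)))
    s = 1.0 if u1[m] >= 0 else -1.0
    v1 = u1.copy()
    v1[m] += s
    z = reflect(x, v1)    # reflect the input band
    t = reflect(s2, v1)   # reflect the second reference
    theta1 = np.arccos(-s * z[m])
    # A second reflection derived from t would then yield theta2.
    return theta1, z, t
]]></artwork>
</figure>
</t>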
</section>
<section title="Coefficient Coding">
<t>Encoding coefficients quantized with PVQ differs from encoding
scalar-quantized coefficients in that the sum of the coefficient magnitudes
is known (it is equal to K). It is possible to take advantage of the known
K value either by modeling the distribution of the coefficient magnitudes
or by modeling the zero runs. In the case of magnitude modeling, the
expected magnitude of coefficient n is modeled as
<figure align="center">
<artwork align="center"><![CDATA[
                    K_n
E(|y_n|) = alpha * ----- ,
                   N - n
]]></artwork>
</figure>
where K_n is the number of pulses left after encoding coefficients 0 to n-1
and alpha depends on the distribution of the coefficients. For run-length
modeling, the expected position of the next non-zero coefficient is given by
<figure align="center">
<artwork align="center"><![CDATA[
                  N - n
E(|run|) = beta * ----- ,
                   K_n
]]></artwork>
</figure>
where beta also models the coefficient distribution.
</t>
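<t>Both models can be sketched as follows; as each coefficient is coded,
K_n shrinks by the coded magnitude, so the expectations adapt (a
non-normative illustration):
<figure align="center">
<artwork align="left"><![CDATA[
def expected_magnitude(k_n, big_n, n, alpha):
    """E(|y_n|) given the K_n pulses still to be placed."""
    return alpha * k_n / (big_n - n)

def expected_run(k_n, big_n, n, beta):
    """Expected distance to the next non-zero coefficient."""
    return beta * (big_n - n) / k_n
]]></artwork>
</figure>
</t>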
</section>
<section title="Development Repository">
<t>The algorithms in this proposal are being developed as part of
Xiph.Org's Daala project.
The code is available in the Daala git repository at
<eref target="https://git.xiph.org/daala.git"/>. See
<eref target="https://xiph.org/daala/"/> for more information.
</t>
</section>
<section anchor="IANA" title="IANA Considerations">
<t>This document makes no request of IANA.</t>
</section>
<section anchor="Security" title="Security Considerations">
<t>This draft has no security considerations.</t>
</section>
<section anchor="Acknowledgements" title="Acknowledgements">
<t>Thanks to Jason Garrett-Glaser, Timothy Terriberry, Greg Maxwell, and
Nathan Egge for their contribution to this document.</t>
</section>
</middle>
<back>
<references title="Informative References">
<?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.6716.xml"?>
<reference anchor="Pyramid-VQ">
<front>
<title>A Pyramid Vector Quantizer</title>
<author initials="T." surname="Fischer" fullname=""><organization/></author>
<date month="July" year="1986" />
</front>
<seriesInfo name="IEEE Transactions on Information Theory" value="vol. 32, pp. 568-583" />
</reference>
<reference anchor="Perceptual-VQ" target="http://jmvalin.ca/papers/spie_pvq.pdf">
<front>
<title>Perceptual Vector Quantization for Video Coding</title>
<author initials="JM." surname="Valin" fullname=""><organization/></author>
<author initials="TB." surname="Terriberry" fullname=""><organization/></author>
<date month="February" year="2015" />
</front>
<seriesInfo name="Proceedings of SPIE Visual Information Processing and Communication" value=""/>
</reference>
<reference anchor="PVQ-demo" target="https://people.xiph.org/~jm/daala/pvq_demo/">
<front>
<title>Daala: Perceptual Vector Quantization (PVQ)</title>
<author initials="JM." surname="Valin" fullname=""><organization/></author>
<date month="November" year="2014" />
</front>
</reference>
</references>
</back>
</rfc>