
\documentclass[11pt,letterpaper]{article}
\pagestyle{headings}
\bibliographystyle{alpha}
\title{Ogg Daala Design Rationale}
\author{Timothy B. Terriberry}
\date{}
\begin{document}
\maketitle
\section{Introduction}
The purpose of this document is to discuss some of the various design
trade-offs in creating a video codec, and the rationale behind each of the
decisions made for Ogg Daala.
The goal of this document is to ensure that decisions are made in a consistent
and considered manner.
\section{General Structure}
The general structure of the video sequence is motion-compensated prediction
followed by some form of two-dimensional transform of the residual error.
The final transform coefficients for each frame are quantized, and all data is
entropy encoded.
This has been the basis for all of the most popular video codecs, such as
MPEG1, MPEG2, H.263, MPEG4 parts 2 and 10 (H.264), VC1, etc., as no other
strategy has proven as effective at eliminating temporal redundancy.
In addition, much of the software for video encoding and playback has been
written with these codecs in mind.
This means it often has built-in assumptions about the way codecs work that may
make interoperation with a new technique more difficult or impossible without
a redesign.
This is not to say that techniques such as three-dimensional wavelets are being
discarded out of hand.
In their current state, however, they are less efficient from both a
compression and an execution standpoint.
They do offer temporal scalability.
This feature, however, is not useful for many applications and can be achieved
in other ways, with slightly less flexibility, by utilizing P and B frames
properly.
The cost of this scalability is increased latency, and minimizing
latency \textit{is} strongly desirable for many applications.
Until a convincing demonstration that this situation has changed is available,
the 2D+MC model is assumed.
\subsection{Lossless Encoding}
One of the biggest operational issues with many codecs today is that they are
fundamentally incapable of encoding a given video stream without quality
degradation, regardless of the number of bits that are thrown at it.
A single encode may not degrade the video in a perceptible manner, but a
sequence is often encoded, transferred, edited and re-encoded repeatedly.
Compounding the problem, many encoders also contain bugs which introduce
additional artifacts at any bit rate.
Lossless encoding allows for simple detection of such bugs.
Some extensions to the new H.264 codec have been proposed that would allow for
a lossless mode, but this is accomplished by switching to a PCM coding mode
that was included essentially to provide a theoretical upper bound on
bit rate.
The argument against lossless encoding is simply that it consumes too much
space to be practical for most purposes.
In fact, lossless codecs do exist and are used when it is known a priori that
additional editing and encoding are needed, and when these operations can be
done locally without transferring twenty-gigabyte or larger files around.
There is no way, however, to progressively scale between the extremes of lossy
and lossless encoding.
This is because the transforms used in current lossy codecs inherently
introduce loss, even before quantization is employed to reduce the bit rate.
What makes this issue so exasperating is that there is no reason why this has
to be the case.
The only stage of the encoding and decoding process which introduces loss by
design in order to reduce the bit rate is the quantization stage.
Even worse, such loosely-defined transforms also lead to compatibility issues.
IEEE standard 1180 was introduced to address this issue by ensuring that
implementations are sufficiently accurate.
But many encoders or decoders are used that are not IEEE 1180 compliant, simply
because their implementations are much faster or do not require floating-point
hardware.
Buggy encoders may produce video which slowly turns green or purple as it plays
due to consistent round-off errors, or ``drift".
Even if the encoder checks the output, the problem will likely go unnoticed, as
the decoder will probably use a similarly non-compliant transform, with
matching defects.
One of the reasons for not specifying an exact transform definition was to
allow hardware and software designers additional flexibility to optimize it.
However, the fastest known algorithm for an 8-point DCT is over a decade old
\cite{LLM89}.
Hardware has also sufficiently advanced so that making a transform fast enough
is no longer the critical problem.
This video codec, therefore, utilizes a transform that is exactly defined.
As a consequence, it must consist of entirely integer operations, as the
floating-point processors on different hardware are not compatible.
This is also useful for embedded devices, which may not have a floating-point
unit at all.
The only other stage of the encoding and decoding process that could introduce
unnecessary loss is color conversion.
This is mostly an application issue.
As long as the codec can handle the most common color formats, there is no
reason for editing or encoding software to introduce additional color
conversions.
\section{Bit Stream Format}
This section summarizes the considerations that went into the design of the
bit stream format.
This includes issues relating both to how symbols are compressed and to how
they are organized into packets.
\subsection{Entropy Coder}
Huffman coding has long been used for video codecs because of its extreme
simplicity and speed.
It is also unencumbered by patents.
Unfortunately, more advanced techniques such as arithmetic coding produce
results that are far superior, as was demonstrated in research for the H.26L
codec.
The obvious reason is that arithmetic coding allows symbols to be encoded with
a fractional number of bits, in particular fewer than one.
The less obvious reason is that arithmetic coding allows much more flexible
probability models.
Changing the probability distribution of the symbols being coded requires as
many as $O(n)$ operations to generate a new Huffman table, since every symbol
in the alphabet may change.
Arithmetic coding, by contrast, only requires that the sums of the
probabilities of the first $k$ symbols be known.
Fast algorithms exist which can compute and update these sums in $O(\log n)$
time~\cite{Fen94,Fen95,Mof99}.
This makes adapting the symbol probabilities after each decoded symbol
practical, which allows the coder to adapt to non-stationary sources.
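As an illustration of the kind of structure involved, the following is a
minimal C sketch of a Fenwick (binary indexed) tree, one of the
cumulative-frequency structures described in the references above, supporting
$O(\log n)$ update and prefix-sum queries; the names and sizes here are
illustrative only, not part of the bitstream design.
\begin{verbatim}
/* Minimal Fenwick tree sketch for cumulative frequency counts.
 * Indices are 1-based internally; the sums array is assumed to start
 * zero-initialized. */
#define FT_SIZE 256

typedef struct {
  int sums[FT_SIZE + 1];
} fenwick_tree;

/* Add delta to the frequency count of symbol i (0-based), O(log n). */
static void ft_update(fenwick_tree *ft, int i, int delta) {
  for (i++; i <= FT_SIZE; i += i & -i) ft->sums[i] += delta;
}

/* Return the sum of the counts of symbols 0..i inclusive, O(log n). */
static int ft_prefix_sum(const fenwick_tree *ft, int i) {
  int total = 0;
  for (i++; i > 0; i -= i & -i) total += ft->sums[i];
  return total;
}
\end{verbatim}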
Range coding was a technique published by an IBM researcher~\cite{Mar79} around
the same time that arithmetic coding was developed~\cite{Pas76}.
The idea is extremely similar, except that range coding can process output in
chunks larger than a bit at a time.
This allows byte-level implementations, which are more efficient.
Also, it is believed that range coding is not encumbered by patents.
Either way, it is not too much longer until the critical IBM patents on
arithmetic encoding expire.
Thus, this codec uses a range coder for entropy encoding.
\subsubsection{A Review of Arithmetic and Range Coding}
In order to discuss further design decisions, a quick review of arithmetic and
range coding is helpful.
Probabilities are modeled as integer frequency counts.
Each symbol $s$ can be assigned an integer $f_s$ representing its relative
frequency.
The sum of the counts of all the symbols up to and including $s$ in the
alphabet is denoted $F_s$.
The sum of all of these integers is $t$, the total frequency.
The alphabet size and symbol probabilities can be changed arbitrarily for each
encoded symbol, so long as the decoder can make the same changes given the
decoded information to that point.
During encoding, an interval in the range $[0,1)$ is selected, based upon the
probabilities of the symbols being encoded.
As each new symbol is encoded, the range is narrowed in proportion to the
estimated probability of the symbol, and offset according to the probabilities
of earlier symbols in the alphabet.
At the end of the encoding, a number in the final interval is selected to
represent the entire encoded sequence.
Practical implementations represent the interval as $[L,L+R)$, where $L$ and
$R$ are two fixed-length integers: the lower bound of the interval and its
range, respectively.
A partition function, $f(c,t,R)$, where $c\in[0,t]$ is a cumulative frequency
count, carves up the range and offsets the lower bound as each symbol is
encoded.
The update formulas are
\begin{eqnarray*}
L'&=&L+f(F_{s-1},t,R)\\
R'&=&f(F_s,t,R)-f(F_{s-1},t,R).
\end{eqnarray*}
Clearly, $R$ must be greater than $t$, or some symbols could potentially be
mapped to an empty interval.
Whenever $R$ becomes too small, the interval--$R$ and $L$--is scaled by a power
of two, and one or more bits of $L$ are output.
Allowing $R$ to become smaller before renormalizing requires fewer operations,
but reduces the accuracy of the partitioning operation and reduces the bits
available for the probability estimates.
Because of the fixed bit-length of $R$, a similar restriction must be placed on
$t$.
This means that only approximations of the true probabilities are used, and
this can be a source of compression inefficiency.
This is not normally an issue, since the probabilities are usually estimated,
not exactly known.
Range coding always outputs a byte at a time, however, making 8 fewer bits
available for frequency counts than with normal arithmetic coding.
Various algorithms may place even greater restrictions on $t$, which could make
the effects significant.
During decoding, the same interval is reconstructed by examining the encoded
output.
A fixed-length integer, $V$, contains the bits of the output that correspond to
the current value of $L$.
As $L$ is shifted during renormalization, $V$ is also shifted, and new bits are
read from the input to fill the bottom.
Normally, only $D=V-L$ is actually computed and stored for efficiency.
Then the partition function is inverted, computing a value $c$ for which
$D\in[f(c,t,R),f(c+1,t,R))$.
The frequency counts are then searched for the symbol $s$ such that
$c\in[F_{s-1},F_s)$.
This is the decoded symbol.
Once this is known, the operations of the encoder can be mimicked.
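As a concrete, deliberately simplified illustration, the following C sketch
shows one encoding step and the corresponding decoding step built on the update
formulas above, using the classic partition function $f(c,t,R)=(c\cdot R)//t$
discussed in the next subsection; renormalization, carry handling, and I/O are
omitted, and all names are hypothetical.
\begin{verbatim}
/* Schematic range coder state and one symbol step.  fl = F_{s-1},
 * fh = F_s, t = total frequency.  Renormalization is omitted. */
typedef struct {
  unsigned low;  /* L: lower bound of the current interval */
  unsigned rng;  /* R: size of the current interval */
  unsigned dif;  /* D = V - L, maintained by the decoder only */
} rc_state;

/* Classic partition function f(c,t,R) = (c*R)//t; the alternatives
 * discussed below could be substituted here. */
static unsigned rc_partition(unsigned c, unsigned t, unsigned rng) {
  return (unsigned)((unsigned long long)c*rng/t);
}

/* Inverse: largest c such that f(c,t,R) <= dif. */
static unsigned rc_partition_inv(unsigned dif, unsigned t, unsigned rng) {
  return (unsigned)((((unsigned long long)dif + 1)*t - 1)/rng);
}

static void rc_encode(rc_state *rc, unsigned fl, unsigned fh, unsigned t) {
  unsigned lo = rc_partition(fl, t, rc->rng);
  unsigned hi = rc_partition(fh, t, rc->rng);
  rc->low += lo;      /* L' = L + f(F_{s-1}, t, R) */
  rc->rng = hi - lo;  /* R' = f(F_s, t, R) - f(F_{s-1}, t, R) */
}

/* The caller searches the counts for the symbol s with the returned c in
 * [F_{s-1}, F_s), then calls rc_decode_update() with that symbol's counts. */
static unsigned rc_decode(const rc_state *rc, unsigned t) {
  return rc_partition_inv(rc->dif, t, rc->rng);
}

static void rc_decode_update(rc_state *rc, unsigned fl, unsigned fh,
 unsigned t) {
  unsigned lo = rc_partition(fl, t, rc->rng);
  unsigned hi = rc_partition(fh, t, rc->rng);
  rc->dif -= lo;      /* keep D relative to the new lower bound */
  rc->rng = hi - lo;
}
\end{verbatim}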
\subsubsection{Carry Propagation}
Because the interval $[L,L+R)$ can straddle the value $1$ after
renormalization, the bits of $L$ shifted off the left end are not always the
final bits of the code word.
Carry propagation can change an arbitrarily long sequence of one bits to zeros.
For general purpose coders, this can be solved by bit stuffing, which
artificially clamps the current interval to be entirely above or below $1$.
When the number of bits outstanding is kept as a simple 32-bit count, this
wastes at most one bit every four billion output bits.
This adds complexity to the decoder, however, which must now mimic these
actions of the encoder so that it too can perform the bit stuffing at the
proper place.
A range coder allows counting outstanding bits at the byte level, and so for
Daala's purposes it should be sufficient to restrict the size of a packet to
4 GB, thus eliminating the need for bit stuffing.
The 32-bit counter is only needed on the encoding side, and so embedded 16-bit
decoders can still use 16-bit arithmetic.
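The following C sketch illustrates one common way of buffering outstanding
bytes on the encoder side; it is only a sketch of the general technique, with
hypothetical names, and is not necessarily the exact scheme Daala would use.
\begin{verbatim}
/* Byte-level handling of outstanding (carry-able) bytes in the encoder.
 * 'c' is the byte shifted out of L plus a possible ninth carry bit.
 * A 0xFF byte cannot be written immediately, because a later carry could
 * still turn it into 0x00; such bytes are simply counted in 'ext'. */
typedef struct {
  int           rem;  /* last byte not yet written, or -1 initially */
  unsigned long ext;  /* number of buffered 0xFF bytes (32-bit count) */
} rc_carry_buf;

static void rc_carry_out(rc_carry_buf *buf, unsigned c,
 void (*write_byte)(unsigned)) {
  if (c == 0xFF) {
    /* Still ambiguous: a future carry may propagate through this byte. */
    buf->ext++;
  }
  else {
    unsigned carry = c >> 8;  /* 1 if a carry reached the earlier bytes */
    if (buf->rem >= 0) write_byte((unsigned)buf->rem + carry);
    while (buf->ext > 0) {
      /* A carry turns the buffered 0xFF bytes into 0x00. */
      write_byte((0xFF + carry) & 0xFF);
      buf->ext--;
    }
    buf->rem = (int)(c & 0xFF);
  }
}
\end{verbatim}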
\subsubsection{Binary Coding}
Range coding can be further simplified by reducing the number of symbols to
two: 0 and 1.
Every symbol to be encoded is first converted to some binary representation,
and then the bits of that representation are modeled and coded.
This makes maintaining the probability estimates trivial, but requires
$O(\log n)$ bits to be encoded for every symbol.
This still requires $O(\log n)$ operations to update the statistics.
Computing the partition function $f$ and its inverse is usually more expensive
than updating the statistics.
An alphabet of size $n$, however, usually utilizes $n-1$ free parameters in its
statistical model, compared to $\log(n)$ parameters for $\log(n)$ binary
models.
Thus, the binary models require less storage, and also incur less overhead for
modeling errors, at least on a \textit{per-symbol} basis.
They also require more symbols to be encoded, however.
This is not to say that $n$-ary models could not be designed which required
fewer parameters.
Indeed, in predicting items like motion vector lengths, there should be strong
correlations between neighboring symbols in the alphabet.
Traditional models do not account for this in any way, i.e., the observation of
a value $k$ does not increase the probability that $k+1$ and $k-1$ might also
be observed.
Binary-tree models actually make some such adjustments, but not across tree
branches.
It is expected that the number of contexts with a relatively large alphabet
will be small, and in fact must be small to avoid context dilution, so the
increased memory overhead for the model parameters should be moderate.
Hence larger alphabets are more flexible and potentially faster than binary
alphabets, and only require a moderate increase in implementation complexity.
Therefore a restriction to a binary alphabet is not desired.
The unrestricted version is called ``multialphabet" arithmetic coding in the
literature.
\subsubsection{The Partition Function}
The normal partition function for arithmetic coding is
\begin{displaymath}
f(c,t,R)=(c\cdot R)//t,
\end{displaymath}
where $//$ denotes integer division.
Rounding instead of truncation in the division can increase the efficiency, but
at the cost of increased complexity in the decoder, which must properly invert
the operation.
Moffat et al. propose the more efficient
\begin{displaymath}
f(c,t,R)=\left\{
\begin{array}{ll}c\cdot(R//t)&{\rm if\ }c<t,\\R&{\rm otherwise},\end{array}
\right.
\end{displaymath}
which has the advantage of only requiring one division per encoding operation,
regardless of how many times $f$ is evaluated \cite{MNW98}.
Note that both of these are approximations to the optimal partition function,
which requires extended precision arithmetic.
The compression inefficiency due to the latter is small when $t$ is much
smaller than $R$.
When $t$ is restricted to 6 fewer bits than $R$, the loss is only $0.1\%$.
There are several methods for designing a multiply-free partition function.
Here, the term ``multiply" from the literature refers to scaling by the
estimated probability, $c/t$, and so includes the division step.
Most methods restrict $R$ or the probabilities to a special form.
Methods specific to binary coding use a small number of values for $R$ and
the frequency counts, and can be driven with lookup tables or shifts and adds
\cite{LR81, PMLA88, HV92, HV94, HM97}.
These can be extended to multialphabet coding by using a small number of values
to approximate $R//t$ from Moffat's partition function instead
\cite{RM89, CKW91, PS93, Lei95, FP97}.
This guarantees that no symbol gets allocated a zero-sized range, but care must
be taken to ensure that the approximate values of the partition function do
not exceed the real $R$.
Many of these impose restrictions on the type of probability models that can be
used, whether or not they can be adaptive, and how the adaptive ones are
updated.
Stuiver and Moffat, however, take the approach of defining a new partition
function \cite{SM98}.
Moffat's function above is highly inaccurate when $t$ approaches $R$, as the
truncation error increases and all of the extra space is allocated to the last
symbol.
They instead normalize $t$ to fall in $(R/2,R]$, which forces $R//t$ to be $1$,
and then evenly distribute the truncation errors with the function
\begin{displaymath}
f(c,t,R)=\left\{
\begin{array}{ll}c&{\rm if\ }c<d,\\2c-d&{\rm otherwise},\end{array}
\right.
\end{displaymath}
where $d=2t-R$.
The value $d$ divides the frequency counts such that $d$ values in the range
$[0,t)$ are allocated single units in $R$, and $t-d$ are allocated double
units.
This approach imposes no special form on $R$ or the probability estimates.
The requirement that $t$ fall in $(R/2,R]$ can be handled by scaling $t$ and
the representative frequency counts by a power of two right before coding.
This can be computed with a lookup table, or by using special hardware
instructions available on some processors to locate the high-order bit.
Once the high-order bits are the same, one additional shift may be required to
bring $t$ into the proper range.
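A minimal C sketch of this normalization step follows; a loop is used here for
clarity, though a lookup table or a count-leading-zeros instruction could
replace it, and the names are illustrative.
\begin{verbatim}
/* Compute the power-of-two shift that scales t into (R/2, R].
 * Assumes 0 < t <= rng on entry; the caller applies the same shift to the
 * cumulative counts fl and fh used for the current symbol. */
static int rc_norm_shift(unsigned t, unsigned rng) {
  int shift = 0;
  while ((t << shift) <= (rng >> 1)) shift++;
  return shift;
}
\end{verbatim}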
The maximum error in the multiplication when using this function is a
factor of two, so the error is no more than one bit per symbol.
If one assumes the probabilities being used are accurate and that the symbols
being encoded are independent of each other, this can be reduced to
$\log(2e^{-1}\log e)\approx 0.0861$ bits per symbol.
Values close to this are observed, even when these assumptions are grossly
violated \cite{SM98}.
The partition function as given over-estimates symbol probabilities at the end
of the interval, and under-estimates symbol probabilities at the beginning of
the interval.
It is more beneficial if the probabilities of the more probable symbols are the
ones that are over-estimated, since the relative error is smaller.
In most contexts, the smaller symbols have a larger probability of occurring.
Therefore a partition function with the roles of the two piecewise segments
reversed is used:
\begin{displaymath}
f(c,t,R)=\left\{
\begin{array}{ll}2c&{\rm if\ }c<d,\\c+d&{\rm otherwise},\end{array}
\right.
\end{displaymath}
where $d=R-t$.
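A minimal C sketch of this partition function and its inverse (assuming $t$ has
already been normalized into $(R/2,R]$ as described above; names are
illustrative) is given here.
\begin{verbatim}
/* Partition function with the piecewise segments reversed: cumulative
 * counts early in the alphabet (assumed more probable) receive double-width
 * units.  Requires t in (R/2, R], so that d = R - t < t. */
static unsigned rc_part(unsigned c, unsigned t, unsigned rng) {
  unsigned d = rng - t;
  return c < d ? 2*c : c + d;
}

/* Inverse: return the cumulative count c such that
 * dif lies in [f(c,t,R), f(c+1,t,R)). */
static unsigned rc_part_inv(unsigned dif, unsigned t, unsigned rng) {
  unsigned d = rng - t;
  return dif < 2*d ? dif >> 1 : dif - d;
}
\end{verbatim}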
\subsubsection{Word Size}
For a word size $w$, $w-1$ bits are available for $R$ and $L$, since one bit is
used to simplify carry detection.
Since range coding allows $R$ to shrink by up to 8 bits, $w-9$ bits are
available for frequency counts.
%The number of symbols encoded in most contexts in a frame is relatively small.
%Therefore a word size of 16 bits allows 7 bits to be used for frequency counts,
% which should be sufficient, and allows fast operation on embedded processors.
%Note that if the second, more accurate, partition function were used, frequency
% counts of 7 bits would incur an error near 1 bit per symbol, far higher than
% the theoretic $0.0861$ bits per symbol incurred by the third function.
\subsubsection{Probability Estimator}
The $\frac{1}{b}$-biased $k$-ary Dirichlet estimator provides probability
estimates for a stationary zero-order stochastic process with an inefficiency
of $O(\log n)$ bits, where $n$ is the number of symbols encoded
\cite{Sht87, STW95}:
\begin{displaymath}
P(s_i|S) = \frac{b\cdot {\rm occ}(s_i,S)+1}{bn+k}.
\end{displaymath}
Here, $s_i$ is the $i$th symbol in the alphabet, $S$ is the string of symbols
observed so far, ${\rm occ}(s_i,S)$ is the number of occurrences of $s_i$ in
$S$, $n$ is the length of $S$, $k$ is the size of the alphabet, and $b$ is the
bias parameter.
When $b=2$ and $k=2$, this is also called the Krichevsky-Trofimov estimator
\cite{KT81}.
The bias parameter controls how quickly the model adapts to observed symbols.
In an actual implementation, the frequency count $f_s$ for each symbol is
simply initialized to $1$.
Each time a symbol is observed, $b$ is added to its count.
In order to prevent the statistics from growing too large, and to allow
adaptation to non-stationary sources, the counts are scaled by a factor of
$1/2$, rounding up, when the total $t$ exceeds a threshold parameter.
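A minimal C sketch of this adaptive estimator follows; the names are
illustrative, and the bias and threshold would come from the codebook
mechanism described below.
\begin{verbatim}
/* Adaptive frequency counts for one context.  Counts start at 1, the bias b
 * is added on each observation, and everything is halved (rounding up) when
 * the total exceeds the threshold. */
#define CTX_MAX_SYMS 16

typedef struct {
  unsigned f[CTX_MAX_SYMS];  /* per-symbol frequency counts */
  unsigned t;                /* total frequency */
  unsigned nsyms;
  unsigned bias;             /* b */
  unsigned thresh;           /* rescaling threshold, e.g. 1 << 7 */
} ctx_model;

static void ctx_init(ctx_model *m, unsigned nsyms, unsigned bias,
 unsigned thresh) {
  unsigned i;
  m->nsyms = nsyms;
  m->bias = bias;
  m->thresh = thresh;
  m->t = nsyms;
  for (i = 0; i < nsyms; i++) m->f[i] = 1;
}

static void ctx_observe(ctx_model *m, unsigned s) {
  unsigned i;
  m->f[s] += m->bias;
  m->t += m->bias;
  if (m->t > m->thresh) {
    /* Halve the counts, rounding up, and recompute the total. */
    m->t = 0;
    for (i = 0; i < m->nsyms; i++) {
      m->f[i] = (m->f[i] + 1) >> 1;
      m->t += m->f[i];
    }
  }
}
\end{verbatim}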
\subsubsection{Code Books}
Unlike codebooks for non-adaptive codes, codebooks for range codes are not
actually required.
They do provide several advantages, however.
First, as the number of symbols encoded in most contexts in a frame is assumed
to be small, initial probabilities can be specified to reduce the time needed
for the model to adapt.
Second, because different contexts fit the model of a stationary zero-order
source to different degrees, different bias and threshold parameters in the
estimator allow the models to adapt at different rates.
Far more significantly, certain symbols may be given an initial probability of
zero.
This prevents the symbol from ever being decoded.
This allows the encoder to incur no bit rate penalty whatsoever for features of
the bitstream that it does not use.
This can, for example, allow the encoder to selectively switch between half-pel
or quarter-pel motion vector resolution on a frame-by-frame basis.
For these reasons, several codebooks can be defined in the codec setup headers
by the encoder for use in range coding.
Codebooks are selected on a frame-by-frame basis, and the current codebook is
specified by index for each frame.
Specifying both the initial probabilities and the update rate is sufficient to
eliminate the need for a custom threshold for each context, so the default of
$2^7$ is always used.
This provides the maximum precision possible for rescaling operations.
\subsection{Packetization}
\subsubsection{Bit Rate Peeling}
One of the features of the Ogg Vorbis codec is that audio packets can be
truncated at any point and still result in a valid, decodable stream.
This feature is primarily useful for the specific application of real-time
video streaming.
It does not make sense with a 2D+MC transform model, however.
The reason is that the encoder uses the decoded output of previous frames for
the motion-compensated prediction, as this is the same information that will
be available to the decoder.
Reducing the quality of these frames changes the prediction and thus the
error residual.
This means that the encoded error residual of the predicted frames no longer
corresponds to the real error residual, introducing additional artifacts.
These artifacts are then used to predict future frames, causing progressively
increasing error.
It might be possible to include this feature in B frames, which are not used to
predict other frames.
B frames make up only a small part of the total bit rate, however, and so do
not provide much rate flexibility.
Bit rate peeling is therefore not considered.
\section{Residual Coding}
\subsection{Lapped Block Transforms}
Lapped block transforms retain the advantages of block transforms--low
complexity and locality of reference--but eliminate one of the primary
disadvantages: blocking artifacts.
Most current block-based transform codecs utilize a deblocking filter on the
decoding side, after the inverse transform is applied, to reduce blocking
artifacts.
These filters have several disadvantages.
In the face of severe quantization they are often not sufficient to overcome
blocking artifacts.
Also, their effects are usually ignored during quantization in the encoder,
which can lead to artificial smoothing of real edges.
To avoid such problems, adaptive filters are used, which require conditional
checks that are expensive on modern CPUs with deep pipelines.
In the case of H.264, it was estimated that the deblocking filter required over
one-third of the total decoding time.%TODO: Find reference.
This brings about one of the primary advantages of an out-of-loop filter, like
that used by MPEG4 part 2, which is that slower systems can disable it to
reduce CPU utilization at the cost of visual quality.
Both H.264 and Theora have gone the alternative route: using a loop filter so
that the encoder can take advantage of the increased quality of the
reconstructed frames during prediction.
Lapped block transforms, like loop filters, would also impose a fixed CPU cost
that could not be optionally disabled by slower CPUs; however, unlike the
adaptive filter of H.264, their structure is regular.
Lapped block transforms address the other problems by introducing a
complementary pre-filter, which is the exact inverse of the deblocking
post-filter.
The pre-filter essentially lengthens the basis functions normally used by the
block transform and causes them to decay at the ends.
These longer basis functions allow for increased decorrelation between blocks,
so that sharp discontinuities at their edges after pre-filtering no longer
lead to sharp discontinuities in the reconstructed signal after
post-filtering.
Actual edges along block boundaries, on the other hand, now produce larger
high-frequency components, so that normal quantization procedures can take
them into account.
\subsubsection{General Structure}
A general structure is known for separable pre-/post-filter pairs which
guarantees perfect reconstruction, finite impulse response and linear phase
\cite{Tra01a}.
The structure is complete, such that all filters with these properties are
represented.
The form of the pre-filter which allows increasing the length of the block
basis functions by a factor of two is
\[
{\mathbf P}=\frac{1}{2}
\left[
\begin{array}{cc}{\mathbf I}&{\mathbf J}\\{\mathbf J}&-{\mathbf I}\end{array}
\right]
\left[
\begin{array}{cc}{\mathbf U}&{\mathbf 0}\\{\mathbf 0}&{\mathbf V}\end{array}
\right]
\left[
\begin{array}{cc}{\mathbf I}&{\mathbf J}\\{\mathbf J}&-{\mathbf I}\end{array}
\right].
\]
${\mathbf I}$, ${\mathbf J}$, ${\mathbf U}$ and ${\mathbf V}$ are
equal-sized square matrices, at most half the size of the block transform.
Here ${\mathbf I}$ is the identity matrix, ${\mathbf J}$ is an anti-diagonal
matrix of $1$'s, sometimes called the reversal matrix, and ${\mathbf U}$ and
${\mathbf V}$ are arbitrary invertible matrices.
Note that these matrices can be smaller than half the size of the block
transform.
This allows increased blocking artifacts to be traded off against decreased
ringing.
Longer basis functions are also possible by using additional filter stages.
Restricting ${\mathbf U}$ and ${\mathbf V}$ to further be orthogonal produces
orthogonal transforms, which have nice error analysis properties, but usually
inferior energy-compaction abilities.
The pre-filter is applied before the block transform, at an offset, such that
its support straddles two blocks, with one half taken from each.
Boundary conditions are handled simply by not applying the filter when its
support would overlap the boundary.
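To make the structure concrete, the following C sketch applies the pre-filter
${\mathbf P}$ above to a length-$2K$ vector straddling a block boundary, in
floating point and with ${\mathbf U}={\mathbf I}$ for simplicity; the names,
block size, and choice of ${\mathbf V}$ are illustrative only.
\begin{verbatim}
/* Apply P = (1/2) [I J; J -I] [U 0; 0 V] [I J; J -I] to x[0..2K-1], where
 * x[0..K-1] come from the end of one block and x[K..2K-1] from the start of
 * the next.  U is taken to be the identity; V is a K-by-K matrix supplied by
 * the caller. */
#define K 8

static void prefilter(double x[2*K], const double V[K][K]) {
  double top[K], bot[K], tmp[K];
  int i, j;
  /* First butterfly: [a; b] -> [a + Jb; Ja - b]. */
  for (i = 0; i < K; i++) {
    top[i] = x[i] + x[2*K - 1 - i];
    bot[i] = x[K - 1 - i] - x[K + i];
  }
  /* Middle stage: top half by U (identity here), bottom half by V. */
  for (i = 0; i < K; i++) {
    tmp[i] = 0;
    for (j = 0; j < K; j++) tmp[i] += V[i][j]*bot[j];
  }
  /* Second butterfly and the overall factor of 1/2. */
  for (i = 0; i < K; i++) {
    x[i] = 0.5*(top[i] + tmp[K - 1 - i]);
    x[K + i] = 0.5*(top[K - 1 - i] - tmp[i]);
  }
}
\end{verbatim}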
\subsubsection{Approximate Forms}
The matrix ${\mathbf U}$ has very little effect on coding gain, and so can be
taken to be the identity to simplify the transform without much loss
\cite{Tra01a}.
The matrix ${\mathbf V}$ may then be parameterized as a diagonal scale matrix
surrounded by rotation matrices using the singular value decomposition.
Tran notes that some of the rotation angles are always very small, and suggests
two simpler forms with just one rotation matrix composed of only $K-1$ angles,
where $K$ is the size of ${\mathbf V}$ \cite{Tra01a}.
He also suggests approximating each rotation by two lifting steps:
$x_{i+1}=x_{i+1}+px_i$ followed by $x_i=x_i+ux_{i+1}$.
The Type-III transform arranges the rotations in the $x_{K-1}-x_{K-2},\ldots,
x_2-x_1,x_1-x_0$ planes.
The Type-IV transform splits the two lifting steps up so that first the steps
$x_1=x_1+p_0x_0,x_2=x_2+p_1x_1,\ldots,x_{K-1}=x_{K-1}+p_{K-2}x_{K-2}$ are
applied and then the steps $x_{K-2}=x_{K-2}+u_{K-2}x_{K-1},\ldots,
x_1=x_1+u_1x_2,x_0=x_0+u_0x_1$ are applied.
\end{gr_placeholder_never_emitted}
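The following C sketch shows the Type-IV lifting structure just described, in
floating point with illustrative names; the placement of the diagonal scale
factors here is illustrative, and the Type-III form would instead apply the
two lifting steps of each rotation together, plane by plane.
\begin{verbatim}
/* Type-IV lifting structure for the K-1 rotation angles of V, applied to
 * x[0..K-1]: all of the p lifting steps first, then all of the u steps in
 * reverse order, followed (illustratively) by the diagonal scale factors. */
#define K 8

static void v_type_iv(double x[K], const double p[K - 1],
 const double u[K - 1], const double s[K]) {
  int i;
  for (i = 1; i < K; i++) x[i] += p[i - 1]*x[i - 1];
  for (i = K - 2; i >= 0; i--) x[i] += u[i]*x[i + 1];
  for (i = 0; i < K; i++) x[i] *= s[i];
}
\end{verbatim}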
An optimized $8\times 16$ filter with ${\mathbf U}={\mathbf I}$ can achieve a
coding gain of $9.6151$ dB.
Using the Type-III approximation reduces this to $9.6115$ dB, and using the
Type-IV approximation still yields $9.6005$ dB.
Forcing the transform to be orthogonal in any of these cases imposes
approximately a $0.2$ dB penalty \cite{Tra01b}.
Thus the penalty for using either approximation is very small.
In general, the structure of the Type-III transform matches better with the
sizes of the coefficients, so that the smaller coefficients with larger
relative approximation error occur towards the end of the transform.
The Type-IV transform still remains competitive in terms of coding gain,
however, and a special use for it will be seen shortly.
\subsubsection{Regularity Constraints}
The transforms can be made $\{1,2\}$-regular, so that with just the DC
coefficient of each block, smooth ramp signals may be reconstructed.
These ramps match very well with the offset not coded in our motion
compensation scheme.
Regularity is imposed with the additional constraint
${\mathbf V}{\mathbf q}=M{\mathbf u}$.
Here ${\mathbf q}=[1,3,\ldots,2K-1]^T$, ${\mathbf u}=[1,1,\ldots,1]^T$ and
$M\in\{2K,2K+1\}$ is the size of the block transform.
For the Type-III transform, this is equivalent to the constraints
\begin{eqnarray*}
s_0 &=&M(1-u_0)\\
s_i &=&\frac{M}{2i+1}(1-u_i)-\frac{2i-1}{2i+1}s_{i-1}p_{i-1}\\
& &{\rm \ for\ }i=1,\ldots,K-2\\
s_{K-1}&=&\frac{M}{M-1}-\frac{M-3}{M-1}s_{K-2}p_{K-2}
\end{eqnarray*}
where $s_i$ is the scale factor for each row of ${\mathbf V}$.
Similar constraints exist for the Type-IV transform:
\begin{eqnarray*}
s_0 &=&M(1-u_0)\\
s_i &=&\frac{M}{2i+1}(1-u_i-(1-u_{i-1})p_{i-1})\\
& &{\rm \ for\ }i=1,\ldots,K-2\\
s_{K-1}&=&\frac{M}{M-1}(1-(1-u_{K-2})p_{K-2}).
\end{eqnarray*}
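A small C sketch that evaluates these constraints, computing the scale factors
from given lifting coefficients, is shown below; it follows the formulas above
directly, and the names and sizes are illustrative only.
\begin{verbatim}
/* Compute the scale factors s[0..K-1] implied by the {1,2}-regularity
 * constraints, given lifting coefficients p[0..K-2] and u[0..K-2] and the
 * block transform size M.  type4 selects the Type-IV constraints. */
#define K 8

static void regularity_scales(double s[K], const double p[K - 1],
 const double u[K - 1], double M, int type4) {
  int i;
  s[0] = M*(1 - u[0]);
  for (i = 1; i <= K - 2; i++) {
    if (type4) {
      s[i] = M/(2*i + 1)*(1 - u[i] - (1 - u[i - 1])*p[i - 1]);
    }
    else {
      s[i] = M/(2*i + 1)*(1 - u[i])
       - (2.0*i - 1)/(2*i + 1)*s[i - 1]*p[i - 1];
    }
  }
  if (type4) s[K - 1] = M/(M - 1)*(1 - (1 - u[K - 2])*p[K - 2]);
  else s[K - 1] = M/(M - 1) - (M - 3)/(M - 1)*s[K - 2]*p[K - 2];
}
\end{verbatim}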
\subsubsection{Optimal Dyadic Rational Approximations}
Finally, all the coefficients $u_i,p_i,s_i$ may be approximated by dyadic
rationals so that the transform may be implemented entirely with integer
operations.
Because the scale factors of the post-filter are $1/s_i$, and decoder speed is
more important, it would be desirable to approximate $1/s_i$ as a dyadic
rational instead of $s_i$, but this imposes severe restrictions on the $s_i$.
For example, $1/s_0$ must be a power of two, or $M(1-u_0)$ would have a
denominator not divisible by two.
Settling for $K$ integer divisions in the decoder allows transforms with much
higher coding gain.
One can also see that the constraints for the Type-IV transform do not have the
same chain of dependencies on previous scale factors that those of the
Type-III transform do.
This makes it much easier to satisfy these constraints, and so the best
approximation of the Type-IV transform at a given precision might actually be
better than that of the Type-III transform.
This is in fact confirmed to be the case for $8\times 16$ filters.
The transform with the optimal coding gain for a given number of bits of
precision in the dyadic rational approximations can be obtained by using
numeric methods to find the optimal point at some large precision, and then
performing a brute force search around this for parameters $u_i$ and $p_i$
which satisfy the constraints.
This brute force search was done by searching the faces of an expanding
hyper-cube around the approximation of the optimal parameters.
The coding gain function has a single local maximum and behaves very smoothly
near the critical point, justifying the use of simple numeric methods to find
the global optimum.
In order to speed up the search, once the maximum value of the coding gain
function on a face of the cube falls below that of the best solution already
found, that face stops expanding, and need not be considered again.
When all faces have stopped, the global optimum has been found.
This method is exponential in the number of parameters, since it must examine
all dyadic rationals in a volume of parameter space.
Going just two levels out in the 14-dimensional parameter space needed to
optimize a $16\times 32$ transform yields over 34 billion lattice points.
Some shortcutting may be made by throwing out pieces of hyper-cube faces when
it can be shown that any one of the constraints is not satisfied over it, but
the computational complexity still grows very quickly.
\subsubsection{Embedded Concerns}
For hardware and embedded systems, it is desirable to keep the dynamic range
of all calculations within 16 bits.
This is even desirable for modern 32-bit processors, many of which include SIMD
instruction sets that can operate on four 16-bit operands at a time.
However, residuals may already require 10 bits to represent.
An additional two bits are required for the output of each of two
pre-filtering operations, one in the horizontal and one in the vertical
direction.
This severely limits the size of the dyadic rationals that can be used when
computing the transform.
Already, for an $8\times 16$ pre/post filter, a dyadic rational approximation
that fits within 16 bits incurs a $0.3$ dB loss in coding gain.
This can be reduced in theory by altering the structure of the lifting steps to
reduce the dynamic range of critical values in the calculation, but this
introduces round-off errors.
The optimization process described above assumes that round-off errors due to
lifting steps can be ignored, so that all transforms may be treated as linear.
It is unknown which leads to a lower coding gain: using a less accurate
approximation of the transform coefficients or using a less accurate
representation of the values being transformed.
However, as restructuring introduces errors earlier in the calculation, it will
likely cause the greater loss.
Liang and Tran recommend implementing the multiplications by the dyadic
rationals using only right-shifts \cite{LT01}.
Hence $\frac{3}{8}x$ would be implemented as
$\lfloor\frac{x}{4}\rfloor+\lfloor\frac{x}{8}\rfloor$.
This introduces still further rounding errors, though it should be better than
simply using a less accurate approximation of the coefficient.
The errors can be further reduced by left shifting where the room is available.
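As a tiny illustration, a multiplication by $\frac{3}{8}$ along the lines just
described might look like the following in C, assuming (as most codec code
does) that right-shifting a signed value behaves as an arithmetic shift, i.e.,
a floor division by a power of two.
\begin{verbatim}
/* Multiply x by 3/8 using only shifts and adds: 3/8 = 1/4 + 1/8.
 * Each shift truncates (floors) separately, which is the source of the
 * extra rounding error mentioned in the text. */
static int mul_3_8(int x) {
  return (x >> 2) + (x >> 3);
}
\end{verbatim}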
This approach might have patent issues, however.
This approach also does not solve the problem of multiplying by the scale
factors.
Unlike normal lifting steps, this multiplication is not trivially invertible.
Since the scale factors are always greater than or equal to one, inversion is
normally achieved by adding one to a positive result on the encoder's side
and simply doing the normal division on the decoder's side.
Whatever approximation scheme is used to reduce the dynamic range of this
operation, care must be taken that the decoder will be able to easily invert
the process.
In light of all these considerations, considerable effort could be put forth to
design a transform with a small dynamic range that retains as many desirable
properties as possible.
All of this effort can be avoided by simply using 32-bit arithmetic to perform
the transforms.
Therefore, although all coefficients produced by the lapped transforms
considered here will fit within a 16-bit signed integer, 32-bit operations
will be required for intermediate results.
\subsubsection{Boundary Handling}
It has already been said that pre/post filters which would overlap the boundary
of an image can simply be discarded.
However, if the size of the image is not a multiple of the block size, special
handling still needs to be done in the region where a block is only partially
filled by the image.
The usual method of handling this is to extend the signal to fill the block,
and then apply the transform.
There are a number of ways to extend the signal, however.
The most practical ways are to compute some simple function of the signal, such
as mirroring it or copying the last value into the empty region.
In theory, it should be possible to choose values that guarantee that there are
as many trailing zeros output from the transform as there were added values.
The actual values can be found by a straightforward method.
Arbitrary values are assumed for the unknowns, the transform computed, and the
desired coefficients are set to zero.
Then the inverse transform is computed, and the results of the inverse are used
for the unknowns.
This is repeated until the process converges.
This method will work when the inputs can be represented to arbitrary precision
and the transform is a linear one.
It may not converge with bounded representations and lifting steps in the
transform which round or truncate values.
Therefore, the bitstream syntax should not be required to produce these zero
terms, or lossless coding might be compromised.
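For concreteness, here is a minimal C sketch of the iterative extension
procedure described above, in floating point with illustrative names; the
forward and inverse block transforms are assumed to be supplied as callbacks,
and a fixed iteration count stands in for a real convergence test.
\begin{verbatim}
/* Iteratively choose values for the n_pad unknown samples at the end of a
 * block of size n (n <= MAX_BLOCK_SIZE) so that the last n_pad transform
 * coefficients are approximately zero.  fwd() and inv() are the forward and
 * inverse block transforms, assumed linear, operating in place. */
#define MAX_BLOCK_SIZE 32

static void extend_block(double *x, int n, int n_pad,
 void (*fwd)(double *, int), void (*inv)(double *, int), int max_iters) {
  int i, iter;
  /* Start from an arbitrary guess for the unknowns. */
  for (i = n - n_pad; i < n; i++) x[i] = 0;
  for (iter = 0; iter < max_iters; iter++) {
    double y[MAX_BLOCK_SIZE];
    for (i = 0; i < n; i++) y[i] = x[i];
    fwd(y, n);
    /* Force the trailing coefficients to zero... */
    for (i = n - n_pad; i < n; i++) y[i] = 0;
    inv(y, n);
    /* ...and keep only the reconstructed values in the unknown positions. */
    for (i = n - n_pad; i < n; i++) x[i] = y[i];
  }
}
\end{verbatim}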
\section{Perceptual Coding}
Perceptual factors influence only encoder design.
They should have minimal to no effect on the bitstream format.
Models which describe the sensitivity of the human visual system are
parameterized by the viewing environment.
Thus, size is measured in degrees of arc spanned after projection, contrast is
relative to the surrounding lighting and background, etc.
Either a ``standard" viewing environment must be assumed, or whatever
procedures are used must take parameters which describe the salient
features of the environment.
At a minimum these should include expected image size, viewing distance, and
background luminance.
\subsection{Region of Interest Coding}
Motion is empirically 5 times as important as any other factor when determining
regions of interest in a scene \cite{Osb99, NC96}.
The other factors used by Osberger are contrast, size, shape, location, and
differentiation between foreground and background.
When importance weightings were averaged over $16\times 16$ blocks, $50\%$ of
viewer attention was consistently located in the most important $10\%$ of the
image for still images.
A study found that over $90\%$ of fixations occurred in less than $18\%$ of the
area of any video scene, while Osberger's most important $20\%$ contained less
than $70\%$ of the attention, so there is still room for improvement in
identifying regions of interest.
Humans get cues about their own motion both visually and from their inner ear.
Angular velocities around $0.7$ cycles/minute (6 deg/sec) mark the point where
visual cues and measurements in the inner ear are equally important.
Above that, the inner ear is more important, and below that, visual cues are.
This can be used as a threshold for determining which motions are too fast to
track and which are important.
This corresponds well with empirical results by Osberger that suggest a value
of 5 deg/sec for the maximum sensitivity to motion \cite{Osb99}.
There is typically a $0.1$ to $0.2$ second delay between the time a new object
appears in the visual field and the time the eye begins to track its motion.
Most methods of region of interest coding require transmitting some sort of
side information that describes where the regions are.
One method does this by transmitting model parameters for areas of expected
foveation.%TODO: \cite{}.
This allows both the encoder and decoder to scale wavelet coefficients by the
same values, and then transmit or decode the bits in normal significance
order.
This has the drawback of making the attention model part of the bitstream
format, so that future research cannot make drastic modifications to it, and
of restricting the use of more perceptually based quantization methods, e.g.,
contrast-based methods.
Viewing the example images in the paper also reveals the drastic dependence on
viewing conditions of the foveation model itself, as far too much information
is thrown away in ``unimportant" regions.
Considering that ROI selection is only accurate half the time, this approach is
likely to do more harm than good.
A more flexible method is simply to use adaptive quantization to control how
bits are distributed.
This may require an increased amount of side information, but leaves control
of the model entirely up to the encoder.
Because the encoder also has some control over the entropy encoding of the
quantization parameter, simple encoders which do not wish to use adaptive
quantization, or use only a limited version of it, can reduce or eliminate the
bit rate overhead altogether.
\subsection{Supra-Threshold Quantization}
For current codecs, it is usually the case that pre-processing images to
remove detail is more effective than adaptive quantization.
This is because pre-processing can be done without introducing additional
artifacts such as blocking \cite{Osb99}.
For a well-designed codec, this should not be the case.
It is generally believed that the detection threshold for stimuli with
different orientations and frequencies is a non-linear Minkowski sum of the
detection threshold for each individual stimulus, though the exact exponent
varies depending on testing conditions.
The exponents determined by experiments range from $\beta=2.2$ to
$\beta=\infty$ (no summation) for sine-wave and Gabor targets presented
against a uniform background or an unnatural masker.
Experiments using natural images as a masker reveal that the summation is
approximately linear ($\beta=1$) \cite{CH02a}.
The conclusion of this is that it is acceptable to optimize each channel of a
filter bank independently, without taking into account inter-channel
dependencies.
The other important observation is that human sensitivity to relative changes
in quality above the threshold of detection does not proceed in a linear
fashion.
This means that it is not valid to attempt to find quantizers which produce
equivalent distortions at sub-threshold levels, and then scale them linearly
to supra-threshold levels in order to save bits.
Chandler and Hemami have performed experiments on quantizing various wavelet
channels, but their results rely on a specific viewing environment, including
a specific image size and level of wavelet decomposition \cite{CH02b}.
It is not clear how to generalize their results to other viewing environments
or more adaptive transforms.
%%CELT-style AVQ:
%% Code L1 energy in (band, block?) with exponential codebook
%% Code distribution of energy with PVQ codebook
%% Lattice spacing equal to energy quantizer
%% Distribution _not_ uniform
%% Always use a predictor:
%% Inter predictor
%% MC, then forward transform
%% Intra predictor
%% Predict from neighboring post-transform coefficients
%% Use one or more linear predictors
%% Can build custom, optimal predictor for various texture orientations
%% Can we pick which predictor to use by classifying source coefficients?
%% Advantage: no side information
%% Predictor gives predicted energy (code residual from prediction)
%% Predictor gives predicted PVQ pulse vector
%% Scale predictor by change in energy
%% Code residual from predicted pulse distribution
%% Take advantage of lattice structure (at least one coefficient completely
%% determined by energy constraint)
%% Take advantage of non-uniform distribution (residual likely to be near 0)
%% SILK-style codebook training sufficient?
%% Scalar quantization with ranges restricted?
%% Challenges:
%% Want PVQ codebook offset so no quantization need be applied to predictor
%% Want Energy codebook offset so predictor*1.0 is in codebook
%% Bit-exact exponential codebook requires O(N) multiplies to search (enc+dec)
%% Big look-up table of scale factors?
%% Doubles the number of muls needed for search by a no-table implementation.
%% Instead of e_{n+1}=e_n*s>>b, s_{n+1}=s_n*s_1>>b, e_{n+1}=e_0*s_{n+1}>>b.
\bibliography{daala}
\end{document}