- \documentclass[11pt,letterpaper]{article}
- \pagestyle{headings}
- \bibliographystyle{alpha}
- \title{Ogg Daala Design Rationale}
- \author{Timothy B. Terriberry}
- \date{}
- \begin{document}
- \maketitle
- \section{Introduction}
- The purpose of this document is to discuss some of the various design
- trade-offs in creating a video codec, and the rationale behind each of the
- decisions made for Ogg Daala.
- The goal of this document is to ensure that decisions are made in a consistent
- and considered manner.
- \section{General Structure}
- The general structure of the video sequence is motion-compensated prediction
- followed by some form of two-dimensional transform of the residual error.
- The final transform coefficients for each frame are quantized, and all data is
- entropy encoded.
- This has been the basis for all of the most popular video codecs, such as
- MPEG1, MPEG2, H.263, MPEG4 parts 2 and 10 (H.264), VC1, etc., as no other
- strategy has proven as effective for eliminating temporal redundancy.
- In addition, much of the software for video encoding and playback has been
- written with these codecs in mind.
- This means it often has built-in assumptions about the way codecs work that may
- make interoperation with a new technique more difficult or impossible without
- a redesign.
- This is not to say that things such as three-dimensional wavelets are being
- discarded out-of-hand.
- In their current state, however, they are less efficient from both a
- compression and an execution standpoint.
- They do offer temporal scalability.
- This feature, however, is not useful for many applications and can be achieved
- in other ways with slightly less flexibility by utilizing P and B frames
- properly.
- The cost of this scalability, however, is increased latency, and minimizing
- latency \textit{is} strongly desirable for many applications.
- Until a convincing demonstration that this situation has changed is available,
- the 2D+MC model is assumed.
- \subsection{Lossless Encoding}
- One of the biggest operational issues with many codecs today is that they are
- fundamentally incapable of encoding a given video stream without quality
- degradation, regardless of the number of bits that are thrown at it.
- A single encode may not degrade the video in a perceptible manner, but a
- sequence is often encoded, transferred, edited and re-encoded repeatedly.
- Compounding the problem, many encoders also contain bugs which introduce
- additional artifacts at any bit rate.
- Lossless encoding allows for simple detection of such bugs.
- Some extensions to the new H.264 codec have been proposed that would allow for
- a lossless mode, but this is accomplished by switching to a PCM coding mode
- that was included essentially to provide a theoretical upper bound on the bit
- rate.
- The argument against lossless encoding is simply that it consumes too much
- space to be practical for most purposes.
- In fact, lossless codecs do exist and are used when it is known a priori that
- additional editing and encoding are needed, and when these operations can be
- done locally without transferring twenty-gigabyte or larger files around.
- There is no way to scale progressively between the extremes of lossy and
- lossless encoding.
- This is because the transforms used in current lossy codecs inherently
- introduce loss, even before quantization is employed to reduce the bit rate.
- What makes this issue so exasperating is that there is no reason why this has
- to be the case.
- The only stage of the encoding and decoding process which introduces loss by
- design in order to reduce the bit rate is the quantization stage.
- Even worse, such loosely-defined transforms also lead to compatibility issues.
- IEEE standard 1180 was introduced to address this issue by ensuring that
- implementations are sufficiently accurate.
- But many encoders or decoders are used that are not IEEE 1180 compliant, simply
- because their implementations are much faster or do not require floating-point
- hardware.
- Buggy encoders may produce video which slowly turns green or purple as it plays
- due to consistent round-off errors, or ``drift".
- Even if the encoder checks the output, the problem will likely go unnoticed, as
- the decoder will probably use a similarly non-compliant transform, with
- matching defects.
- One of the reasons for not specifying an exact transform definition was to
- allow hardware and software designers additional flexibility to optimize it.
- However, the fastest known algorithm for an 8-point DCT is over a decade old
- \cite{LLM89}.
- Hardware has also sufficiently advanced so that making a transform fast enough
- is no longer the critical problem.
- This video codec, therefore, utilizes a transform that is exactly defined.
- As a consequence, it must consist of entirely integer operations, as the
- floating-point processors on different hardware are not compatible.
- This is also useful for embedded devices, which may not have a floating-point
- unit at all.
- The only other stage of the encoding and decoding process that could introduce
- unnecessary loss is color conversion.
- This is mostly an application issue.
- As long as the codec can handle the most common color formats, there is no
- reason for editing or encoding software to introduce additional color
- conversions.
- \section{Bit Stream Format}
- This section summarizes the considerations that went into the design of the
- bit stream format.
- This includes issues relating both to how symbols are compressed and to how
- they are organized into packets.
- \subsection{Entropy Coder}
- Huffman coding has long been used for video codecs because of its extreme
- simplicity and speed.
- It is also unencumbered by patents.
- Unfortunately, more advanced techniques such as arithmetic coding produce
- results that are far superior, as was demonstrated in research for the H.26L
- codec.
- The obvious reason is that arithmetic coding allows symbols to be encoded with
- a fractional number of bits, in particular fewer than one bit for highly
- probable symbols.
- The less obvious reason is that arithmetic coding allows much more flexible
- probability models.
- Changing the probability distribution of the symbols being coded requires as
- many as $O(n)$ operations to generate a new Huffman table, since every symbol
- in the alphabet may change.
- Arithmetic coding, by contrast, only requires the sums of probabilities of
- the first $k$ symbols be known.
- Fast algorithms exist which can compute and update these sums in $O(\log n)$
- time~\cite{Fen94,Fen95,Mof99}.
- This makes adapting the symbol probabilities after each decoded symbol
- practical, which allows for the coder to adapt to non-stationary sources.
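- To make the $O(\log n)$ bound concrete, the sketch below maintains the
- cumulative counts $F_s$ with a Fenwick (binary indexed) tree.
- It is a minimal illustration of the cited technique and is not drawn from any
- particular codec implementation.
- \begin{verbatim}
- /* Minimal Fenwick (binary indexed) tree for cumulative frequency counts.
-  * Both operations run in O(log n) time. */
- #include <stdio.h>
- 
- #define NSYMS 16  /* alphabet size (power of two only for simplicity) */
- 
- static int tree[NSYMS + 1];  /* 1-based Fenwick tree of frequency counts */
- 
- /* Add delta to the count of symbol s (0-based). */
- static void freq_add(int s, int delta) {
-   for (int i = s + 1; i <= NSYMS; i += i & -i) tree[i] += delta;
- }
- 
- /* Return F_s, the sum of the counts of symbols 0..s. */
- static int freq_cumulative(int s) {
-   int sum = 0;
-   for (int i = s + 1; i > 0; i -= i & -i) sum += tree[i];
-   return sum;
- }
- 
- int main(void) {
-   for (int s = 0; s < NSYMS; s++) freq_add(s, 1);  /* initialize counts to 1 */
-   freq_add(3, 2);                           /* symbol 3 observed, with bias 2 */
-   printf("F_3 = %d, t = %d\n", freq_cumulative(3), freq_cumulative(NSYMS - 1));
-   return 0;
- }
- \end{verbatim}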
- Range coding was a technique published by an IBM researcher~\cite{Mar79} around
- the same time that arithmetic coding was developed~\cite{Pas76}.
- The idea is extremely similar, except that range coding can process output in
- chunks larger than a bit at a time.
- This allows byte-level implementations, which are more efficient.
- Also, it is believed that range coding is not encumbered by patents.
- Either way, it is not too much longer until the critical IBM patents on
- arithmetic encoding expire.
- Thus, this codec uses a range coder for entropy encoding.
- \subsubsection{A Review of Arithmetic and Range Coding}
- In order to discuss further design decisions, a quick review of arithmetic and
- range coding is helpful.
- Probabilities are modeled as integer frequency counts.
- Each symbol $s$ can be assigned an integer $f_s$ representing its relative
- frequency.
- The sum of the counts of all the symbols up to and including $s$ in the
- alphabet is denoted $F_s$.
- The sum of all of these integers is $t$, the total frequency.
- The alphabet size and symbol probabilities can be changed arbitrarily for each
- encoded symbol, so long as the decoder can make the same changes given the
- decoded information to that point.
- During encoding, an interval in the range $[0,1)$ is selected, based upon the
- probabilities of the symbols being encoded.
- As each new symbol is encoded, the range is narrowed in proportion to the
- estimated probability of the symbol, and offset according to the probabilities
- of earlier symbols in the alphabet.
- At the end of the encoding, a number in the final interval is selected to
- represent the entire encoded sequence.
- Practical implementations represent the interval as $[L,L+R)$, where $L$ and
- $R$ are two fixed-length integers: the lower bound of the interval and its
- range, respectively.
- A partition function, $f(c,t,R)$, where $c\in[0,t)$ is a frequency count,
- carves up the range and offsets the lower bound as each symbol is encoded.
- The update formulas are
- \begin{eqnarray*}
- L'&=&L+f(F_{s-1},t,R)\\
- R'&=&f(F_s,t,R)-f(F_{s-1},t,R).
- \end{eqnarray*}
- Clearly, $R$ must be greater than $t$, or some symbols could potentially be
- mapped to an empty interval.
- Whenever $R$ becomes too small, the interval--$R$ and $L$--is scaled by a power
- of two, and one or more bits of $L$ are output.
- Allowing $R$ to become smaller before renormalizing requires fewer operations,
- but reduces the accuracy of the partitioning operation and reduces the bits
- available for the probability estimates.
- Because of the fixed bit-length of $R$, a similar restriction must be placed on
- $t$.
- This means that only approximations of the true probabilities are used, and
- this can be a source of compression inefficiency.
- This is not normally an issue, since the probabilities are usually estimated,
- not exactly known.
- Range coding always outputs a byte at a time, however, making 8 fewer bits
- available for frequency counts than with normal arithmetic coding.
- Various algorithms may place even greater restrictions on $t$, which could make
- the effects significant.
- During decoding, the same interval is reconstructed by examining integer
- output.
- A fixed-length integer, $V$, contains the bits of the output that correspond to
- the current value of $L$.
- As $L$ is shifted during renormalization, $V$ is also shifted, and new bits are
- read from the input to fill the bottom.
- Normally, only $D=V-L$ is actually computed and stored for efficiency.
- Then the partition function is inverted, computing a value $c$ for which
- $D\in[f(c,t,R),f(c+1,t,R))$.
- The frequency counts are then searched for the symbol $s$ such that
- $c\in[F_{s-1},F_s)$.
- This is the decoded symbol.
- Once this is known, the operations of the encoder can be mimicked.
- \subsubsection{Carry Propagation}
- Because the interval $[L,L+R)$ can straddle the value $1$ after
- renormalization, the bits of $L$ shifted off the left end are not always the
- final bits of the code word.
- Carry propagation can change an arbitrarily long sequence of one bits to zeros.
- For general purpose coders, this can be solved by bit stuffing, which
- artificially clamps the current interval to be entirely above or below $1$.
- When the number of bits outstanding is kept as a simple 32-bit count, this
- wastes at most one bit every four billion output bits.
- This adds increased complexity to the decoder, however, which must now mimic
- these actions of the encoder so that it too can perform this bit stuffing at
- the proper place.
- A range coder allows counting outstanding bits at the byte level, and so for
- Daala's purposes it should be sufficient to restrict the size of a packet to
- 4 GB, thus eliminating the need for bit stuffing.
- The 32-bit counter is only needed on the encoding side, and so embedded 16-bit
- decoders can still use 16-bit arithmetic.
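- As an illustration of byte-level carry handling (the technique is standard for
- range coders; the code below is only a sketch and not Daala's actual
- implementation), the encoder can buffer the last unwritten byte together with
- a count of the 0xFF bytes that follow it:
- \begin{verbatim}
- /* Sketch of byte-wise carry handling in a range encoder.  The 64-bit "low"
-  * holds the 32-bit lower bound of the interval; a carry lands in bit 32.
-  * Bytes equal to 0xFF cannot be written immediately, since a later carry
-  * could still change them, so only a count of them is kept. */
- #include <stdint.h>
- #include <stdio.h>
- 
- typedef struct {
-   uint64_t low;   /* lower bound of the interval */
-   uint8_t  cache; /* last byte shifted out of low, not yet written */
-   uint64_t nff;   /* run of 0xFF bytes buffered after the cache byte */
- } rc_enc;         /* zero-initialize; the stream then starts with a leading
-                    * 0x00 (or 0x01) byte that can absorb an initial carry */
- 
- static void emit_byte(unsigned b) { printf("%02X ", b & 0xFF); }
- 
- /* Called whenever the top byte of low is renormalized out. */
- static void rc_shift_low(rc_enc *rc) {
-   if ((uint32_t)rc->low < 0xFF000000u || (rc->low >> 32) != 0) {
-     /* Either no future carry can reach the buffered bytes, or a carry
-      * arrived and must be applied to them now. */
-     unsigned carry = (unsigned)(rc->low >> 32);
-     emit_byte(rc->cache + carry);
-     for (; rc->nff > 0; rc->nff--) emit_byte(0xFF + carry);
-     rc->cache = (uint8_t)(rc->low >> 24);
-   } else {
-     rc->nff++;  /* top byte is 0xFF and no carry yet: just count it */
-   }
-   rc->low = (rc->low & 0x00FFFFFFu) << 8;
- }
- 
- int main(void) {
-   rc_enc rc = {0, 0, 0};
-   rc.low = 0xFFFFFF00u;      /* top byte 0xFF: counted, not yet written */
-   rc_shift_low(&rc);
-   rc.low += 1ull << 32;      /* pretend a later interval update carried out */
-   rc_shift_low(&rc);         /* prints "01 00": the carry bumped the leading
-                               * byte and flipped the buffered 0xFF to 0x00 */
-   printf("\n");
-   return 0;
- }
- \end{verbatim}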
- \subsubsection{Binary Coding}
- Range coding can be further simplified by reducing the number of symbols to
- two: 0 and 1.
- Every symbol to be encoded is first converted to some binary representation,
- and then the bits of that representation are modeled and coded.
- This makes maintaining the probability estimates trivial, but requires
- $O(\log n)$ bits to be encoded for every symbol.
- This still requires $O(\log n)$ operations to update the statistics.
- Computing the partition function $f$ and its inverse is usually more expensive
- than updating the statistics.
- An alphabet of size $n$, however, usually utilizes $n-1$ free parameters in its
- statistical model, compared to $\log(n)$ parameters for $\log(n)$ binary
- models.
- Thus, the binary models require less storage, and also incur less overhead for
- modeling errors, at least on a \textit{per-symbol} basis.
- They also require more symbols to be encoded, however.
- This is not to say that $n$-ary models could not be designed which required
- fewer parameters.
- Indeed, in predicting items like motion vector lengths, there should be strong
- correlations between neighboring symbols in the alphabet.
- Traditional models do not account for this in any way, i.e., the observation of
- a value $k$ does not increase the probability that $k+1$ and $k-1$ might also
- be observed.
- Binary-tree models actually make some such adjustments, but not across tree
- branches.
- It is expected that the number of contexts with a relatively large alphabet
- will be small, and in fact must be small to avoid context dilution, so the
- increased memory overhead for the model parameters should be moderate.
- Hence larger alphabets are more flexible and potentially faster than binary
- alphabets, and only require a moderate increase in implementation complexity.
- Therefore a restriction to a binary alphabet is not desired.
- The unrestricted version is called ``multialphabet" arithmetic coding in the
- literature.
- \subsubsection{The Partition Function}
- The normal partition function for arithmetic coding is
- \begin{displaymath}
- f(c,t,R)=(c\cdot R)//t,
- \end{displaymath}
- where $//$ denotes integer division.
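- As a concrete illustration (the numbers are arbitrary), consider a three-symbol
- alphabet with $f_a=2$, $f_b=5$ and $f_c=1$, so that $F_a=2$, $F_b=7$ and
- $F_c=t=8$.
- With $L=0$ and $R=128$, encoding the symbol $b$ gives
- \begin{eqnarray*}
- L'&=&L+f(F_a,t,R)=0+(2\cdot 128)//8=32,\\
- R'&=&f(F_b,t,R)-f(F_a,t,R)=(7\cdot 128)//8-32=80,
- \end{eqnarray*}
- an interval whose width is exactly $5/8$ of the original, matching the
- estimated probability of $b$.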
- Rounding instead of truncation in the division can increase the efficiency, but
- at the cost of increased complexity of the decoder, which must properly invert
- the operation.
- Moffat et al. propose the more efficient
- \begin{displaymath}
- f(c,t,R)=\left\{
- \begin{array}{ll}c\cdot(R//t)&{\rm if\ }c<t,\\R&{\rm otherwise},\end{array}
- \right.
- \end{displaymath}
- which has the advantage of only requiring one division per encoding operation,
- regardless of how many times $f$ is evaluated \cite{MNW98}.
- Note that both of these are approximations to the optimal partition function,
- which requires extended precision arithmetic.
- The compression inefficiency due to the latter is small when $t$ is much
- smaller than $R$.
- When $t$ is restricted to 6 fewer bits than $R$, the loss is only $0.1\%$.
- There are several methods for designing a multiply-free partition function.
- Here, the term ``multiply" from the literature refers to scaling by the
- estimated probability, $c/t$, and so includes the division step.
- Most methods restrict $R$ or the probabilities to a special form.
- Methods specific to binary coding use a small number of values for $R$ and
- the frequency counts, and can be driven with lookup tables or shifts and adds
- \cite{LR81, PMLA88, HV92, HV94, HM97}.
- These can be extended to multialphabet coding by using a small number of values
- to approximate $R//t$ from Moffat's partition function instead
- \cite{RM89, CKW91, PS93, Lei95, FP97}.
- This guarantees that no symbol gets allocated a zero-sized range, but care must
- be taken to ensure that the approximate values of the partition function do
- not exceed the real $R$.
- Many of these impose restrictions on the type of probability models that can be
- used, whether or not they can be adaptive and how the adaptive ones are
- updated.
- Stuiver and Moffat, however, take the approach of defining a new partition
- function \cite{SM98}.
- Moffat's function above is highly inaccurate when $t$ approaches $R$, as the
- truncation error increases and all the extra space is allocated to the last
- symbol.
- They instead normalize $t$ to fall in $(R/2,R]$, which forces $R//t$ to be $1$,
- and then evenly distribute the truncation errors with the function
- \begin{displaymath}
- f(c,t,R)=\left\{
- \begin{array}{ll}c&{\rm if\ }c<d,\\2c-d&{\rm otherwise},\end{array}
- \right.
- \end{displaymath}
- where $d=2t-R$.
- The value $d$ divides the frequency counts such that $d$ values in the range
- $[0,t)$ are allocated single units in $R$, and $t-d$ are allocated double
- units.
- This approach imposes no special form on $R$ or the probability estimates.
- The requirement that $t$ fall in $(R/2,R]$ can be handled by scaling $t$ and
- the representative frequency counts by a power of two right before coding.
- This can be computed with a lookup table, or by using special hardware
- instructions available on some processors to locate the high-order bit.
- Once the high-order bits are the same, one additional shift may be required to
- bring $t$ into the proper range.
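- A minimal sketch of that normalization is given below; the builtin used to
- locate the high-order bit is illustrative, and a small lookup table would work
- equally well.
- \begin{verbatim}
- #include <stdint.h>
- 
- /* Return the left shift s that brings t into the range (R/2, R].
-  * Assumes 1 <= t <= R.  The frequency counts are then scaled by the same
-  * power of two before the partition function is evaluated. */
- static int freq_shift(uint32_t t, uint32_t R) {
-   int s = __builtin_clz(t) - __builtin_clz(R);  /* align the high-order bits */
-   if ((t << s) > R) s--;                /* one additional shift may be needed */
-   return s;
- }
- \end{verbatim}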
- The maximum error in the multiplication when using this function is a
- factor of two, so the error is no more than one bit per symbol.
- If one assumes that the probabilities being used are accurate and that the
- symbols being encoded are independent of each other, this can be reduced to
- $\log(2e^{-1}\log e)\approx 0.0861$ bits per symbol.
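- Here the logarithms are base two; numerically,
- \begin{displaymath}
- \log\left(2e^{-1}\log e\right)=\log(0.7358\cdot 1.4427)
- \approx\log(1.0615)\approx 0.0861{\rm\ bits\ per\ symbol}.
- \end{displaymath}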
- Values close to this are observed, even when these assumptions are grossly not
- met \cite{SM98}.
- The partition function as given over-estimates symbol probabilities at the end
- of the interval, and under-estimates symbol probabilities at the beginning of
- the interval.
- It is more beneficial if the probabilities of more probable symbols are the
- ones that are over-estimated, since the relative error is smaller.
- In most contexts, the symbols earlier in the alphabet have a larger
- probability of occurring.
- Therefore a partition function with the roles of the two piecewise segments
- reversed is used:
- \begin{displaymath}
- f(c,t,R)=\left\{
- \begin{array}{ll}2c&{\rm if\ }c<d,\\c+d&{\rm otherwise},\end{array}
- \right.
- \end{displaymath}
- where $d=R-t$.
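- A sketch of this reversed partition function and of the inversion step needed
- by the decoder is given below; it assumes $t$ has already been scaled into
- $(R/2,R]$ as described above and is illustrative rather than normative.
- \begin{verbatim}
- #include <stdint.h>
- 
- /* Reversed partition function: positions at the start of the frequency-count
-  * range receive the double-width slots.  Assumes R/2 < t <= R, with R at most
-  * 31 bits.  Encoder update for symbol s, as in the earlier update formulas:
-  *   L' = L + rc_partition(F[s - 1], t, R);
-  *   R' = rc_partition(F[s], t, R) - rc_partition(F[s - 1], t, R);        */
- static uint32_t rc_partition(uint32_t c, uint32_t t, uint32_t R) {
-   uint32_t d = R - t;            /* count of double-width slots */
-   return c < d ? 2 * c : c + d;
- }
- 
- /* Decoder side: given D = V - L, return the c for which
-  * rc_partition(c) <= D < rc_partition(c + 1). */
- static uint32_t rc_invert(uint32_t D, uint32_t t, uint32_t R) {
-   uint32_t d = R - t;
-   return D < 2 * d ? D >> 1 : D - d;
- }
- \end{verbatim}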
-
- \subsubsection{Word Size}
- For a word size $w$, $w-1$ bits are available for $R$ and $L$, since one bit is
- used to simplify carry detection.
- Since range coding allows $R$ to shrink by up to 8 bits, $w-9$ bits are
- available for frequency counts.
- %The number of symbols encoded in most contexts in a frame is relatively small.
- %Therefore a word size of 16 bits allows 7 bits to be used for frequency counts,
- % which should be sufficient, and allows fast operation on embedded processors.
- %Note that if the second, more accurate, partition function were used, frequency
- % counts of 7 bits would incur an error near 1 bit per symbol, far higher than
- % the theoretic $0.0861$ bits per symbol incurred by the third function.
- \subsubsection{Probability Estimator}
- The $\frac{1}{b}$-biased $k$-ary Dirichlet estimator provides probability
- estimates for a stationary zero-order stochastic process with an inefficiency
- of $O(\log n)$ bits, where $n$ is the number of symbols encoded
- \cite{Sht87, STW95}:
- \begin{displaymath}
- P(s_i|S) = \frac{b\cdot {\rm occ}(s_i,S)+1}{bn+k}.
- \end{displaymath}
- Here, $s_i$ is the $i$th symbol in the alphabet, $S$ is the string of symbols
- observed so far, ${\rm occ}(s_i,S)$ is the number of occurrences of $s_i$ in
- $S$, $n$ is the length of $S$, $k$ is the size of the alphabet, and $b$ is the
- bias parameter.
- When $b=2$ and $k=2$, this is also called the Krichevsky-Trofimov estimator
- \cite{KT81}.
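- With those parameters the estimate reduces to the familiar ``add-$\frac{1}{2}$"
- rule:
- \begin{displaymath}
- P(s_i|S)=\frac{2\cdot{\rm occ}(s_i,S)+1}{2n+2}
- =\frac{{\rm occ}(s_i,S)+\frac{1}{2}}{n+1}.
- \end{displaymath}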
- The bias parameter controls how quickly the model adapts to observed symbols.
- In an actual implementation, the frequency count $f_s$ for each symbol is
- simply initialized to $1$.
- Each time a symbol is observed, $b$ is added to its count.
- In order to prevent statistics from accumulating to be too large, and to allow
- adaptation to non-stationary sources, the counts are scaled by a factor of
- $1/2$, rounding up, when the total $t$ exceeds a threshold parameter.
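- A minimal sketch of this update rule follows; the alphabet size and bias are
- illustrative parameters, and the threshold of $2^7$ is the default mentioned
- below.
- \begin{verbatim}
- #include <stdint.h>
- 
- #define NSYMS     8    /* alphabet size (illustrative) */
- #define BIAS      2    /* adaptation rate b */
- #define THRESHOLD 128  /* rescale when the total exceeds this (2^7) */
- 
- typedef struct {
-   uint16_t f[NSYMS];  /* per-symbol frequency counts */
-   uint32_t t;         /* running total of the counts */
- } freq_model;
- 
- static void model_init(freq_model *m) {
-   for (int i = 0; i < NSYMS; i++) m->f[i] = 1;  /* counts start at 1 */
-   m->t = NSYMS;
- }
- 
- /* Update the model after observing symbol s. */
- static void model_update(freq_model *m, int s) {
-   m->f[s] += BIAS;
-   m->t += BIAS;
-   if (m->t > THRESHOLD) {
-     /* Halve all counts, rounding up, so no count ever falls to zero. */
-     m->t = 0;
-     for (int i = 0; i < NSYMS; i++) {
-       m->f[i] = (uint16_t)((m->f[i] + 1) >> 1);
-       m->t += m->f[i];
-     }
-   }
- }
- \end{verbatim}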
- \subsubsection{Code Books}
- Unlike codebooks for non-adaptive codes, codebooks for range codes are not
- actually required.
- They do provide several advantages, however.
- First, as the number of symbols encoded in most contexts in a frame is assumed
- to be small, initial probabilities can be specified to reduce the time needed
- for the model to adapt.
- Second, because different contexts fit the model of a stationary zero-order
- source to different degrees, different bias and threshold parameters in the
- estimator allow the models to adapt at different rates.
- Third, and far more significant than the previous two, certain symbols may be
- given an initial probability of zero.
- This prevents the symbol from ever being decoded.
- This allows the encoder to incur no bit rate penalty whatsoever for features of
- the bitstream that it does not use.
- This can, for example, allow the encoder to selectively switch between half-pel
- or quarter-pel motion vector resolution on a frame-by-frame basis.
- For these reasons, several codebooks can be defined in the codec setup headers
- by the encoder for use in range coding.
- Codebooks are selected on a frame by frame basis, and the current codebook is
- specified by index for each frame.
- Specifying both the initial probabilities and the update rate is sufficient to
- eliminate the need for a custom threshold for each context, so the default of
- $2^7$ is always used.
- This provides the maximum precision possible for rescaling operations.
- \subsection{Packetization}
- \subsubsection{Bit Rate Peeling}
- One of the features of the Ogg Vorbis codec is that audio packets can be
- truncated at any point and still result in a valid, decodable stream.
- This feature is primarily useful for the specific application of real-time
- video streaming.
- It does not make sense with a 2D+MC transform model, however.
- The reason is that the encoder uses the decoded output of previous frames for
- the motion-compensated prediction, as this is the same information that will
- be available to the decoder.
- Reducing the quality of these frames changes the prediction and thus the
- error residual.
- This means that the encoded error residual of the predicted frames no longer
- corresponds to the real error residual, introducing additional artifacts.
- These artifacts are then used to predict future frames, causing progressively
- increasing error.
- It might be possible to include this feature in B frames, which are not used to
- predict other frames.
- B frames make up only a small part of the total bit rate, however, and so do
- not provide much rate flexibility.
- Bit rate peeling is therefore not considered.
- \section{Residual Coding}
- \subsection{Lapped Block Transforms}
- Lapped block transforms retain the advantages of block transforms--low
- complexity and locality of reference--but eliminate one of the primary
- disadvantages: blocking artifacts.
- Most current block-based transform codecs utilize a deblocking filter on the
- decoding side, after the inverse transform is applied, to reduce blocking
- artifacts.
- These filters have several disadvantages.
- In the face of severe quantization they are often not sufficient to overcome
- blocking artifacts.
- Also, their effects are usually ignored during quantization in the encoder,
- which can lead to artificial smoothing of real edges.
- To avoid such problems, adaptive filters are used which require conditional
- checks that are expensive on modern CPUs with deep pipelines.
- In the case of H.264, it was estimated that the deblocking filter required over
- one-third of the total decoding time.%TODO: Find reference.
- This highlights one of the primary advantages of an out-of-loop filter, like
- that used by MPEG4 part 2: slower systems can disable it to reduce CPU
- utilization at the cost of visual quality.
- Both H.264 and Theora have gone the alternative route: using a loop-filter so
- that the encoder can take advantage of the increased quality of the
- reconstructed frames during prediction.
- Lapped block transforms, like loop filters, would also impose a fixed CPU cost
- that could not be optionally disabled on slower CPUs.
- Unlike the adaptive filter of H.264, however, their structure is regular.
- Lapped block transforms address the other problems by introducing a
- complementary pre-filter, which is the exact inverse of the deblocking
- post-filter.
- The pre-filter essentially lengthens the basis functions normally used by the
- block transform and causes them to decay at the ends.
- These longer basis functions allow for increased decorrelation between blocks,
- so that sharp discontinuities at their edges after pre-filtering no longer
- lead to sharp discontinuities in the reconstructed signal after
- post-filtering.
- Actual edges along block boundaries, on the other hand, now produce larger
- high-frequency components, so that normal quantization procedures can take
- them into account.
- \subsubsection{General Structure}
- A general structure is known for separable pre-/post-filter pairs which
- guarantees perfect reconstruction, finite impulse response and linear phase
- \cite{Tra01a}.
- The structure is complete, such that all filters with these properties are
- represented.
- The form of the pre-filter which allows increasing the length of the block
- basis functions by a factor of two is
- \[
- {\mathbf P}=\frac{1}{2}
- \left[
- \begin{array}{cc}{\mathbf I}&{\mathbf J}\\{\mathbf J}&-{\mathbf I}\end{array}
- \right]
- \left[
- \begin{array}{cc}{\mathbf U}&{\mathbf 0}\\{\mathbf 0}&{\mathbf V}\end{array}
- \right]
- \left[
- \begin{array}{cc}{\mathbf I}&{\mathbf J}\\{\mathbf J}&-{\mathbf I}\end{array}
- \right].
- \]
- Each of ${\mathbf I}$, ${\mathbf J}$, ${\mathbf U}$ and ${\mathbf V}$ is a
- square matrix of the same size, at most half the size of the block transform.
- Here ${\mathbf I}$ is the identity matrix, ${\mathbf J}$ is an anti-diagonal
- matrix of $1$'s, sometimes called the reversal matrix, and ${\mathbf U}$ and
- ${\mathbf V}$ are arbitrary invertible matrices.
- Note that these matrices can be smaller than half the size of the block
- transform.
- This allows increased blocking artifacts to be traded off against decreased
- ringing.
- Longer basis functions are also possible by using additional filter stages.
- Restricting ${\mathbf U}$ and ${\mathbf V}$ to further be orthogonal produces
- orthogonal transforms, which have nice error analysis properties, but usually
- inferior energy-compaction abilities.
- The pre-filter is applied before the block transform, at an offset, such that
- its support straddles two blocks, with one half taken from each.
- Boundary conditions are handled simply by not applying the filter when its
- support would overlap the boundary.
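- To make the three-stage structure concrete, the sketch below applies
- ${\mathbf P}$ directly to the $2K$ samples straddling a block boundary, with
- ${\mathbf U}={\mathbf I}$ and an arbitrary, unoptimized ${\mathbf V}$; it is
- purely illustrative.
- \begin{verbatim}
- #include <stdio.h>
- 
- #define K 2  /* half the pre-filter support (illustrative size) */
- 
- /* Arbitrary invertible V; a real filter would use optimized values. */
- static const double V[K][K] = {{1.0, 0.2}, {-0.3, 1.1}};
- 
- /* Apply P = (1/2) B diag(I, V) B, with B = [[I, J], [J, -I]], to the 2K
-  * samples x[] that straddle a block boundary. */
- static void prefilter(double x[2 * K]) {
-   double a[K], b[K], vb[K];
-   for (int i = 0; i < K; i++) {          /* first butterfly stage, B */
-     a[i] = x[i] + x[2 * K - 1 - i];
-     b[i] = x[K - 1 - i] - x[K + i];
-   }
-   for (int i = 0; i < K; i++) {          /* middle stage: U = I, then V */
-     vb[i] = 0;
-     for (int j = 0; j < K; j++) vb[i] += V[i][j] * b[j];
-   }
-   for (int i = 0; i < K; i++) {          /* second butterfly and the 1/2 */
-     x[i]     = 0.5 * (a[i] + vb[K - 1 - i]);
-     x[K + i] = 0.5 * (a[K - 1 - i] - vb[i]);
-   }
- }
- 
- int main(void) {
-   double x[2 * K] = {1, 2, 3, 4};
-   prefilter(x);                          /* with V = I this would be a no-op */
-   printf("%g %g %g %g\n", x[0], x[1], x[2], x[3]);
-   return 0;
- }
- \end{verbatim}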
- \subsubsection{Approximate Forms}
- The matrix ${\mathbf U}$ has very little effect on coding gain, and so can be
- taken to be the identity to simplify the transform without much loss
- \cite{Tra01a}.
- The matrix ${\mathbf V}$ may then be parameterized as a diagonal scale matrix
- surrounded by rotation matrices using the singular value decomposition.
- Tran notes that some of the rotation angles are always very small, and
- suggests two simpler forms with just one rotation matrix composed of only
- $K-1$ angles,
- where $K$ is the size of ${\mathbf V}$ \cite{Tra01a}.
- He also suggests approximating the rotation by two lifting steps:
- $x_{i+1}=x_{i+1}+px_i$ followed by $x_i=x_i+ux_{i+1}$.
- The Type-III transform arranges the rotations in the $x_{K-1}-x_{K-2},\ldots,
- x_2-x_1,x_1-x_0$ planes.
- The Type-IV transform splits the two lifting steps up so that first the steps
- $x_1=x_1+p_0x_0,x_2=x_2+p_1x_1,\ldots,x_{K-1}=x_{K-1}+p_{K-2}x_{K-2}$ are
- applied and then the steps $x_{K-2}=x_{K-2}+u_{K-2}x_{K-1},\ldots,
- x_1=x_1+u_1x_2,x_0=x_0+u_0x_1$ are applied.
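- The difference between the two orderings can be summarized by the sketch below;
- the lifting coefficients are placeholders, and the per-row scale factors $s_i$
- discussed later are omitted.
- \begin{verbatim}
- #include <stdio.h>
- 
- #define K 4  /* size of V (illustrative) */
- 
- /* Placeholder lifting coefficients; real values come from the coding-gain
-  * optimization described in the text. */
- static const double p[K - 1] = {0.1, 0.2, 0.3};
- static const double u[K - 1] = {0.4, 0.5, 0.6};
- 
- /* Type-III: complete each rotation (p step, then u step) before moving on,
-  * working from the x[K-1]-x[K-2] plane down to the x[1]-x[0] plane. */
- static void v_type3(double x[K]) {
-   for (int i = K - 2; i >= 0; i--) {
-     x[i + 1] += p[i] * x[i];
-     x[i]     += u[i] * x[i + 1];
-   }
- }
- 
- /* Type-IV: all p steps first, in ascending order, then all u steps, in
-  * descending order, so each step feeds the next. */
- static void v_type4(double x[K]) {
-   for (int i = 0; i <= K - 2; i++) x[i + 1] += p[i] * x[i];
-   for (int i = K - 2; i >= 0; i--) x[i]     += u[i] * x[i + 1];
- }
- 
- int main(void) {
-   double a[K] = {1, 1, 1, 1}, b[K] = {1, 1, 1, 1};
-   v_type3(a);
-   v_type4(b);
-   for (int i = 0; i < K; i++) printf("%g %g\n", a[i], b[i]);
-   return 0;
- }
- \end{verbatim}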
- An optimized $8\times 16$ filter with ${\mathbf U}={\mathbf I}$ can achieve a
- coding gain of $9.6151$ dB.
- Using the Type-III approximation reduces this to $9.6115$ dB, and using the
- Type-IV approximation still yields $9.6005$ dB.
- Forcing the transform to be orthogonal in any of these cases imposes
- approximately a $0.2$ dB penalty \cite{Tra01b}.
- Thus the penalty for using either approximation is very small.
- In general, the structure of the Type-III transform matches better with the
- sizes of the coefficients, so that the smaller coefficients with larger
- relative approximation error occur towards the end of the transform.
- The Type-IV transform still remains competitive in terms of coding gain,
- however, and a special use for it will be seen shortly.
- \subsubsection{Regularity Constraints}
- The transforms can be made $\{1,2\}$-regular, so that with just the DC
- coefficient of each block, smooth ramp signals may be reconstructed.
- These ramps match very well with the offset not coded in our motion
- compensation scheme.
- Regularity is imposed with the additional constraint
- ${\mathbf V}{\mathbf q}=M{\mathbf u}$.
- Here ${\mathbf q}=[1,3,\ldots,2K-1]^T$, ${\mathbf u}=[1,1,\ldots,1]^T$ and
- $M\in\{2K,2K+1\}$ is the size of the block transform.
- For the Type-III transform, this is equivalent to the constraints
- \begin{eqnarray*}
- s_0 &=&M(1-u_0)\\
- s_i &=&\frac{M}{2i+1}(1-u_i)-\frac{2i-1}{2i+1}s_{i-1}p_{i-1}\\
- & &{\rm \ for\ }i=1,\ldots,K-2\\
- s_{K-1}&=&\frac{M}{M-1}-\frac{M-3}{M-1}s_{K-2}p_{K-2}
- \end{eqnarray*}
- where $s_i$ is the scale factor for each row of ${\mathbf V}$.
- Similar constraints exist for the Type-IV transform:
- \begin{eqnarray*}
- s_0 &=&M(1-u_0)\\
- s_i &=&\frac{M}{2i+1}(1-u_i-(1-u_{i-1})p_{i-1})\\
- & &{\rm \ for\ }i=1,\ldots,K-2\\
- s_{K-1}&=&\frac{M}{M-1}(1-(1-u_{K-2})p_{K-2}).
- \end{eqnarray*}
- \subsubsection{Optimal Dyadic Rational Approximations}
- Finally, all the coefficients $u_i,p_i,s_i$ may be approximated by dyadic
- rationals so that the transform may be implemented entirely with integer
- operations.
- Because the scale factors of the post filter are $1/s_i$, and decoder speed is
- more important, it would be desirable to approximate $1/s_i$ as a dyadic
- rational instead of $s_i$, but this imposes severe restrictions on the $s_i$.
- For example, $1/s_0$ must be a power of two, or $M(1-u_0)$ would have a
- denominator not divisible by two.
- Settling for $K$ integer divisions in the decoder allows transforms with much
- higher coding gain.
- One can also see that the constraints for the Type-IV transform do not have the
- same chain of dependencies on previous scale factors that those of the
- Type-III transform do.
- This makes it much easier to satisfy these constraints, and so the best
- approximation of the Type-IV transform to a given precision might actually be
- better than that of the Type-III transform.
- This is in fact confirmed to be the case for $8\times 16$ filters.
- The transform with the optimal coding gain for a given number of bits of
- precision in the dyadic rational approximations can be obtained by using
- numeric methods to find the optimal point at some large precision, and then
- performing a brute force search around this for parameters $u_i$ and $p_i$
- which satisfy the constraints.
- This brute force search was done by searching the faces of an expanding
- hyper-cube around the approximation of the optimal parameters.
- The coding gain function has a single local maximum and behaves very smoothly
- near the critical point, justifying the use of simple numeric methods to find
- the global optimum.
- In order to speed up the search, once the maximum value of the coding gain
- function on a face of the cube falls below that of the best solution already
- found, that face stops expanding, and need not be considered again.
- When all faces have stopped, the global optimum has been found.
- This method is exponential in the number of parameters, since it must examine
- all dyadic rationals in a volume of parameter space.
- Just two levels out, the 14-dimensional parameter space needed to optimize a
- $16\times 32$ transform contains over 34 billion lattice points.
- Some shortcutting may be made by throwing out pieces of hyper-cube faces when
- it can be shown that any one of the constraints is not satisfied over it, but
- the computational complexity still grows very quickly.
- \subsubsection{Embedded Concerns}
- For hardware and embedded systems, it is desirable to keep the dynamic range
- of all calculations within 16 bits.
- This is even desirable for modern 32-bit processors, many of which include SIMD
- instruction sets that can operate on 4 16-bit operands at a time.
- However, residuals may already require 10 bits to represent.
- An additional two bits are required for the output of each of two
- pre-filtering operations, one in the horizontal and one in the vertical
- direction.
- This severely limits the size of the dyadic rationals that can be used when
- computing the transform.
- Already, for an $8\times 16$ pre/post filter, a dyadic rational approximation
- that fits within 16 bits incurs a $0.3$ dB loss in coding gain.
- This can be reduced in theory by altering the structure of the lifting steps to
- reduce the dynamic range of critical values in the calculation, but this
- introduces round-off errors.
- The optimization process described above assumes that round-off errors due to
- lifting steps can be ignored, so that all transforms may be treated as linear.
- It is unknown which leads to a lower coding gain: using a less accurate
- approximation of the transform coefficients or using a less accurate
- representation of the values being transformed.
- However, as restructuring introduces errors earlier in the calculation, it will
- likely cause the greater loss.
- Liang and Tran recommend implementing the multiplications by the dyadic
- rationals only using right-shifts \cite{LT01}.
- Hence $\frac{3}{8}x$ would be implemented as
- $\lfloor\frac{x}{4}\rfloor+\lfloor\frac{x}{8}\rfloor$.
- This introduces still further rounding errors, though it should be better than
- simply using a less accurate approximation of the coefficient.
- The errors can be further reduced by left shifting where the room is available.
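- For instance, the two options just described differ only in where the
- truncation occurs (the numbers below are a trivial illustration, not taken from
- the codec):
- \begin{verbatim}
- #include <stdio.h>
- 
- /* 3/8 * x for nonnegative x using only right shifts, as suggested by Liang
-  * and Tran: each shift truncates separately, so two errors accumulate. */
- static int mul_3_8_rshift(int x) {
-   return (x >> 2) + (x >> 3);
- }
- 
- /* The same product with a left shift first: a single truncation, but the
-  * intermediate value 3*x needs two extra bits of dynamic range. */
- static int mul_3_8_lshift(int x) {
-   return ((x << 1) + x) >> 3;
- }
- 
- int main(void) {
-   printf("%d %d\n", mul_3_8_rshift(999), mul_3_8_lshift(999));  /* 373 374 */
-   return 0;
- }
- \end{verbatim}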
- The shift-based approach might have patent issues, however.
- This approach also does not solve the problem of multiplying by the scale
- factors.
- Unlike normal lifting steps, this multiplication is not trivially invertible.
- Since the scale factors are always greater than or equal to one, inversion is
- normally attained by adding one to a positive result on the encoder's side
- and simply doing the normal division on the decoder's side.
- Whatever approximation scheme is used to reduce the dynamic range of this
- operation, care must be taken that the decoder will be able to easily invert
- the process.
- In light of all these considerations, considerable effort could be put forth to
- design a transform with a small dynamic range that retains as many desirable
- properties as possible.
- All of this effort can be avoided by simply using 32-bit arithmetic to perform
- the transforms.
- Therefore, although all coefficients produced by lapped transforms considered
- here will fit within a 16-bit signed integer, 32-bit operations will be
- required for intermediate results.
- \subsubsection{Boundary Handling}
- It has already been said that pre/post filters which would overlap the boundary
- of an image can simply be discarded.
- However, if the size of the image is not a multiple of the block size, special
- handling still needs to be done in the region where a block is only partially
- filled by the image.
- The usual method of handling this is to extend the signal to fill the block,
- and then apply the transform.
- There are a number of ways to extend the signal, however.
- The most practical ways are to compute some simple function of the signal, such
- as mirroring it or copying the last value into the empty region.
- In theory, it should be possible to choose values that guarantee that there are
- as many trailing zeros output from the transform as there were added values.
- The actual values can be found by a straightforward method.
- Arbitrary values are assumed for the unknowns, the transform computed, and the
- desired coefficients are set to zero.
- Then the inverse transform is computed, and the results of the inverse are used
- for the unknowns.
- This is repeated until the process converges.
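- A toy sketch of that fixed-point iteration is given below, using a
- floating-point $4$-point DCT as a stand-in for the real block transform; the
- sizes and sample values are arbitrary.
- \begin{verbatim}
- #include <math.h>
- #include <stdio.h>
- 
- #define N     4  /* block size (illustrative) */
- #define KNOWN 3  /* samples actually covered by the image */
- #define PI    3.14159265358979323846
- 
- /* Orthonormal DCT-II and its inverse, standing in for the real transform. */
- static void fdct(const double *x, double *X) {
-   for (int k = 0; k < N; k++) {
-     double s = k == 0 ? sqrt(1.0 / N) : sqrt(2.0 / N);
-     X[k] = 0;
-     for (int n = 0; n < N; n++) X[k] += s * x[n] * cos(PI * (n + 0.5) * k / N);
-   }
- }
- 
- static void idct(const double *X, double *x) {
-   for (int n = 0; n < N; n++) {
-     x[n] = 0;
-     for (int k = 0; k < N; k++) {
-       double s = k == 0 ? sqrt(1.0 / N) : sqrt(2.0 / N);
-       x[n] += s * X[k] * cos(PI * (n + 0.5) * k / N);
-     }
-   }
- }
- 
- int main(void) {
-   double x[N] = {10, 12, 15, 0}, X[N];   /* the last sample is the unknown */
-   for (int it = 0; it < 100; it++) {
-     fdct(x, X);
-     for (int k = KNOWN; k < N; k++) X[k] = 0;  /* force trailing zeros */
-     idct(X, x);
-     x[0] = 10; x[1] = 12; x[2] = 15;           /* restore the known samples */
-   }
-   printf("extended value: %g\n", x[N - 1]);
-   return 0;
- }
- \end{verbatim}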
- This method will work when the inputs can be represented to arbitrary precision
- and the transform is a linear one.
- It may not converge with bounded representations and lifting steps in the
- transform which round or truncate values.
- Therefore, the bitstream syntax should not be required to produce these zero
- terms, or lossless coding might be compromised.
- \section{Perceptual Coding}
- Perceptual factors influence only encoder design.
- They should have minimal to no effect on the bitstream format.
- Models which describe the sensitivity of the human visual system are
- parameterized by the viewing environment.
- Thus, size is measured in degrees of arc spanned after projection, contrast is
- relative to the surrounding lighting and background, etc.
- Either a ``standard" viewing environment must be assumed, or whatever
- procedures that are used must take parameters which describe the salient
- features of the environment.
- At a minimum these should include expected image size, viewing distance, and
- background luminance.
- \subsection{Region of Interest Coding}
- Motion is empirically 5 times as important as any other factor when determining
- regions of interest in a scene \cite{Osb99, NC96}.
- The other factors used by Osberger are contrast, size, shape, location, and
- differentiation between foreground and background.
- When importance weightings were averaged over $16\times 16$ blocks, $50\%$ of
- viewer attention was consistently located in the most important $10\%$ of the
- image for still images.
- A study found that over $90\%$ of fixations occurred in less than $18\%$ of the
- area of any video scene, while Osberger's most important $20\%$ contained less
- than $70\%$ of the attention, so there is still room for improvement in
- identifying regions of interest.
- Humans get cues about their own motion from both vision and their inner ear.
- Angular velocities around $0.7$ cycles/minute (6 deg/sec) mark the point where
- visual cues and measurements in the inner ear are equally important.
- Above that, the inner ear is more important, and below it, visual cues are.
- This can be used as a threshold for determining which motions are too fast to
- track and which are important.
- This corresponds well with empirical results by Osberger that suggest a value
- of 5 deg/sec for the maximum sensitivity to motion \cite{Osb99}.
- There is typically a $0.1$ to $0.2$ second delay between the time a new object
- appears in the visual field and the time the eye begins to track its motion.
- Most methods of region of interest coding require transmitting some sort of
- side information that describes where the regions are.
- One method does this by transmitting model parameters for areas of expected
- foveation.%TODO: \cite{}.
- This allows both the encoder and decoder to scale wavelet coefficients by the
- same values, and then transmit or decode the bits in normal significance
- order.
- This has the drawback of making the attention model part of the bitstream
- format, so that future research cannot make drastic modifications to it, and
- of restricting the use of more perceptually based quantization methods, e.g.
- contrast-based methods.
- Viewing the example images in the paper also reveals the foveation model's
- drastic dependence on viewing conditions, as far too much information is
- thrown away in ``unimportant" regions.
- Considering that ROI selection is only accurate half the time, this approach is
- likely to do more harm than good.
- A more flexible method is simply to use adaptive quantization to control how
- bits are distributed.
- This may require an increased amount of side information, but leaves control
- of the model entirely up to the encoder.
- Because the encoder also has some control over the entropy encoding of the
- quantization parameter, simple encoders which do not wish to use adaptive
- quantization or use only a limited version of it can reduce or eliminate the
- bit rate overhead altogether.
- \subsection{Supra-Threshold Quantization}
- For current codecs, it is usually the case that pre-processing images to
- remove detail is more effective than adaptive quantization.
- This is because pre-processing can be done without introducing additional
- artifacts such as blocking artifacts \cite{Osb99}.
- For a well-designed codec, this should not be the case.
- It is generally believed that the detection threshold for stimuli with
- different orientations and frequencies is a non-linear Minkowski sum of the
- detection threshold for each individual stimulus, though the exact exponent
- varies depending on testing conditions.
- The exponents determined by experiments range from $\beta=2.2$ to
- $\beta=\infty$ (no summation) for sine-wave and Gabor targets presented
- against a uniform background or an unnatural masker.
- Experiments using natural images as a masker reveal that the summation is
- approximately linear ($\beta=1$) \cite{CH02a}.
- The conclusion of this is that it is acceptable to optimize each channel of a
- filter bank independently, without taking into account inter-channel
- dependencies.
- The other important observation is that human sensitivity to relative changes
- in quality above the threshold of detection does not proceed in a linear
- fashion.
- This means that it is not valid to attempt to find quantizers which produce
- equivalent distortions at sub-threshold levels, and then scale them linearly
- to supra-threshold levels in order to save bits.
- Chandler and Hemami have performed experiments for quantizing various wavelet
- channels, but their results rely on a specific viewing environment, including
- a specific image size and level of wavelet decomposition \cite{CH02b}.
- It is not clear how to generalize their results to other viewing environments or
- more adaptive transforms.
- %%CELT-style AVQ:
- %% Code L1 energy in (band, block?) with exponential codebook
- %% Code distribution of energy with PVQ codebook
- %% Lattice spacing equal to energy quantizer
- %% Distribution _not_ uniform
- %% Always use a predictor:
- %% Inter predictor
- %% MC, then forward transform
- %% Intra predictor
- %% Predict from neighboring post-transform coefficients
- %% Use one or more linear predictors
- %% Can build custom, optimal predictor for various texture orientations
- %% Can we pick which predictor to use by classifying source coefficients?
- %% Advantage: no side information
- %% Predictor gives predicted energy (code residual from prediction)
- %% Predictor gives predicted PVQ pulse vector
- %% Scale predictor by change in energy
- %% Code residual from predicted pulse distribution
- %% Take advantage of lattice structure (at least one coefficient completely
- %% determined by energy constraint)
- %% Take advantage of non-uniform distribution (residual likely to be near 0)
- %% SILK-style codebook training sufficient?
- %% Scalar quantization with ranges restricted?
- %% Challenges:
- %% Want PVQ codebook offset so no quantization need be applied to predictor
- %% Want Energy codebook offset so predictor*1.0 is in codebook
- %% Bit-exact exponential codebook requires O(N) multiplies to search (enc+dec)
- %% Big look-up table of scale factors?
- %% Doubles the number of muls needed for search by a no-table implementation.
- %% Instead of e_{n+1}=e_n*s>>b, s_{n+1}=s_n*s_1>>b, e_{n+1}=e_0*s_{n+1}>>b.
- \bibliography{daala}
- \end{document}
|