\documentclass[11pt,letterpaper]{article}
%\usepackage[]{epsfig}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsxtra}
\usepackage[pdfpagemode=None,pdfstartview=FitH,pdfview=FitH,colorlinks=true,
pdftitle=Adaptive\ Motion\ Compensation\ Without\ Blocking\ Artifacts,
pdfauthor=Timothy\ B.\ Terriberry]{hyperref}
%Allow inclusion of EPS/PDF figures with TeX labels
\newif\ifpdf
\ifx\pdfoutput\undefined
\pdffalse
\else
\pdfoutput=1
\pdftrue
\fi
\ifpdf
\pdfcompresslevel=9
\providecommand{\figinput}[1]{\input{#1.pdftex_t}}
\usepackage[pdftex]{graphicx}
\DeclareGraphicsRule{.pdftex}{pdf}{.pdftex}{}
\else
\providecommand{\figinput}[1]{\input{#1.pstex_t}}
\usepackage{graphicx}
\DeclareGraphicsRule{.pstex}{eps}{.pstex}{}
\fi
\pagestyle{headings}
\bibliographystyle{alpha}
\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\w}[1]{\ensuremath{{w_{#1}}}}
\newcommand{\s}[1]{\ensuremath{{s_{#1}}}}
\newcommand{\m}[1]{\ensuremath{{\mathbf{m}_{#1}}}}
\title{Adaptive Motion Compensation Without Blocking Artifacts}
\author{Timothy B. Terriberry}
\date{}
\begin{document}
\maketitle
\begin{abstract}
We present a method for adaptively switching between Control Grid
Interpolation (CGI) and Overlapped Block Motion Compensation (OBMC) that
avoids discontinuities along block boundaries.
We also extend the method to support adaptive subdivision of the control grid.
Finally, we present a multiresolution blending method that improves the
sharpness of the predictor while reducing computational complexity.
\end{abstract}
\section{Introduction}
Most modern video codecs still use straightforward Block Matching Algorithms
(BMA)~\cite{DM95} to perform motion compensation, despite their simplicity.
The blocking artifacts they produce coincide with similar blocking artifacts in
the transform stage, so there is little motivation to avoid them.
However, as interest increases in lapped transforms or wavelets---which avoid
blocking artifacts in the transform stage---there is a need for motion
compensation methods that are themselves free from blocking artifacts.
Historically, two of the main methods proposed have been Overlapped Block
Motion Compensation (OBMC) and Control Grid Interpolation (CGI).
\subsection{Overlapped Block Motion Compensation}
OBMC was introduced by Watanabe and Singhal expressly to improve prediction
near block boundaries~\cite{WS91}.
It uses a piecewise constant motion model estimated over a regular grid, like
BMA.
Blocking artifacts are avoided by weighting the predictions around each grid
point with a smoothly varying window function.
The prediction gain is due to the blending of several samples of the reference
image, which is typically highly spatially correlated.
This compensates for motion uncertainty away from the grid points, provided
there are no sharp edges~\cite{OS94,ZSNKI02}.
The primary difficulty with OBMC is its tendency to blur sharp
features~\cite{Ish98}.
Compared to normal half-pel BMA, Katto and Ohta report a $0.4$ dB
improvement in the SNR of the prediction simply by reusing motion vectors
(MVs) estimated by BMA~\cite{KO95}.
Kuo and Kuo report improvements between $0.5$ and $0.8$ dB vs. whole-pel BMA,
and an additional $0.2$ to $0.5$ dB gain from local iterative refinement of
the MVs~\cite{KK97}.
The different MV accuracies are important, as Girod points out that the
blending used in half-pel BMA has noise-reducing properties similar to those
of a multi-hypothesis prediction method like OBMC~\cite{Gir00}.
\subsection{Control Grid Interpolation}
CGI, introduced by Sullivan and Baker, moves the blending from the image domain
to the motion domain~\cite{SB91}.
It interpolates the MVs between grid points to produce a single predictor,
which helps avoid the over-smoothing problems of OBMC.
However, when the real motion is discontinuous, such as near an occluding edge,
gross warping errors can occur.
Related methods use triangular patches~\cite{NH91} or content-adaptive
meshes~\cite{ATB95}, with affine or projective motion models~\cite{WL96} and
varying increases in computational requirements.
In this work we focus on \textit{backwards-mapping} approaches, which use a
semi-regular control grid in the current frame.
\textit{Forward-mapping} approaches, which allow arbitrary control point
placement, are not considered.
Initially, CGI offered only subjective improvements, with little to no gain in
objective PSNR~\cite{SB91,AT98}.
However, significant compression gains are to be had by improving the motion
estimation process.
Chen and Willson report gains of $1.8$ to $2.4$ dB using iterative dynamic
programming instead of local refinement, while reducing the bits spent on
MVs by $25\%$ to $30\%$~\cite{CW00}.
This is in stark contrast to OBMC, which only improved by $0.1$ to $0.2$ dB,
though with a similar bitrate reduction.
Nosratinia proposes adapting the CGI interpolation kernel to reduce the
inter-block dependency when it would benefit PSNR.
His kernels vary between bilinear at one extreme, and a step function at the
other, effectively switching between CGI and BMA, but he uses the same kernel
over an entire frame~\cite{Nos01}.
\subsection{Adaptive Switching Algorithms}
In fact, one of the primary drawbacks with BMA, OBMC, and CGI is that each uses
the same motion model over the entire image, while true motion fields are
better modeled as piecewise smooth, with isolated discontinuities.
Many researchers have experimented with adaptively switching between various
methods on a block-by-block basis.
Kuo and Kuo switch between BMA and OBMC to reduce encoder and decoder
complexity while maintaining the quality of OBMC~\cite{KK97}.
Ishwar reports gains of $0.5$ to $0.9$ dB over OBMC alone, and similarly,
$0.5$ to $1.0$ dB gains over CGI alone by switching with BMA in each
case~\cite{Ish98}.
Altunbasak and Tekalp report improvements at low bitrates by switching between
CGI with varying levels of mesh subdivision, OBMC, and BMA~\cite{AT98}.
Heising \textit{et al.} also use a wavelet coder to demonstrate that switching
among BMA, OBMC, and CGI provides better performance than any one of the three
approaches alone, or any pair of them~\cite{HMCP01}.
The main drawback of all of these adaptive methods is the introduction of
blocking artifacts when switching between the models.
Section~\ref{sec:switching} outlines an approach that avoids these artifacts.
\subsection{Adaptive Subdivision}
Another strategy for adapting to the true motion field is to vary the
resolution of the grid on which MVs are sampled~\cite{CYC90}.
Lee showed that simultaneous rate-distortion optimization of the BMA block size
and the quantizers used to code the residual produced gains of up to $1.7$ dB
over a fixed $16\times 16$ block size at 64 kbit/s~\cite{Lee95}.
Huang and Hsu proposed using variable block sizes for CGI and achieved a
$29\%$ savings in MV bitrate with a $0.2$ dB increase in PSNR compared to
triangle-based CGI~\cite{HH94}.
In order to preserve continuity of the motion field, the MVs at vertices that
split the edge of a larger block are fixed at the value interpolated from that
edge's endpoints.
Zhang \textit{et al.} used variable block sizes with OBMC and achieved
improvements of $0.2$ to $0.9$ dB in SNR over variable-size BMA at 60
kbit/s~\cite{ZAS98}.
To avoid blocking artifacts, they subdivide the blocks in the decoded quadtree
until none of the interpolation kernels overlap more than one adjacent block
in a given direction and then assign them all the same MV.
These two schemes are incompatible with each other, but we present a unifying
method in Section~\ref{sec:subdiv} that is also compatible with our
artifact-free switching algorithm.
\section{Continuous Switching Between OBMC and CGI}
\label{sec:switching}
In this section, we describe a method for switching between OBMC and CGI
without introducing artificial discontinuities along block boundaries.
This is accomplished by labeling the block \textit{edges} with the type of
interpolation to use, instead of the block itself.
Each edge is labeled with a `V' for vector interpolation (CGI) or a `B' for
blending in the image domain (OBMC).
All that is then required is a set of interpolation formulas for the block
interior that yield the desired interpolation type along each edge.
We use bilinear interpolation for both OBMC and CGI.
For OBMC, neither the raised cosine nor trapezoidal windows have been shown to
be consistently better~\cite{ZSNKI02}, and the bilinear window simplifies the
adaptive subdivision introduced in the next section.
Orchard and Sullivan propose adapting the window function to a sequence to
optimize prediction gain, but when iterative motion estimation is used, the
advantage over a bilinear window is minimal~\cite{OS94}.
For CGI, the bilinear motion model is substantially less complex than
projective or affine models, and has the advantage of being uniquely defined
over a quadrilateral domain, unlike the affine model.
Nosratinia only applied his optimized kernels when motion estimation was
performed by BMA, and they performed between $0.2$ dB better and $1.5$ dB
worse than a bilinear kernel with iteratively refined MVs~\cite{Nos01}.
However, encoders with strict complexity requirements would be better off
simply using OBMC, where motion estimation is less sensitive to the
interdependency between adjacent MVs.
We use square blocks with an origin in the upper-left corner and $x$ and $y$
coordinates that vary between $0$ and $1$.
The vertices and edges of the blocks are indexed in a clockwise manner, as
illustrated in Figure~\ref{fig:blockidx}.
The bilinear weights for each vertex used in vector interpolation are $\w k$,
and those for blending the resulting predictors are $\s k$, defined as:
\begin{align*}
\w0 = \s0 & = (1-x)\cdot (1-y) \\
\w1 = \s1 & = x\cdot (1-y) \\
\w2 = \s2 & = x\cdot y \\
\w3 = \s3 & = (1-x)\cdot y
\end{align*}
Although $\w k$ and $\s k$ are equivalent in this section, we will modify
$\s k$ in the next section in order to perform adaptive subdivision.
Therefore we make the distinction now.
$I$ is the reference image, and for simplicity we denote the predictor
$I(x+m_x(x,y),y+m_y(x,y))$ for the pixel at $(x,y)$ with motion vector
$\m{}=(m_x(x,y),m_y(x,y))$ as simply $I(\m{})$.
\begin{figure}[tb]
\center
\scalebox{0.99}{\figinput{blockidx}}
\caption{Vertex and edge indices for a block.}
\label{fig:blockidx}
\end{figure}
With this notation in hand, we give formulas for the sixteen possible edge
labelings.
There are six unique cases; the others may be obtained by rotational symmetry.
\setlength{\multlinegap}{0pt}
\begin{description}
\item[VVVV]
\begin{multline}
\label{eq:vvvv}
I\left(\w0\m0+\w1\m1+
\w2\m2+\w3\m3\right)\hfill
\end{multline}
\item[BVVV]
\begin{multline}
\label{eq:bvvv}
I\left((\w0+\w1)\m0+
\w2\m2+\w3\m3\right)\cdot \s0\,+ \hfill\\
I\left((\w0+\w1)\m1+
\w2\m2+\w3\m3\right)\cdot \s1\,+ \hfill\\
I\left(\w0\m0+\w1\m1+
\w2\m2+\w3\m3\right)\cdot\left(\s2+\s3\right)\hfill
\end{multline}
\item[BVBV]
\begin{multline}
\label{eq:bvbv}
I\left((\w0+\w1)\m0+
(\w2+\w3)\m3\right)\cdot\left(\s3+\s0\right)\,+ \hfill\\
I\left((\w0+\w1)\m1+
(\w2+\w3)\m2\right)\cdot\left(\s1+\s2\right)\hfill
\end{multline}
\item[VVBB]
\begin{multline}
\label{eq:vvbb}
I\left((1-\w1)\m0+\w1\m1\right)\cdot \s0\,+\,
I\left(\w1\m1+(1-\w1)\m2\right)\cdot \s2\,+ \hfill\\
I\left(\w0\m0+\w1\m1+\w2\m2+\w3\m3\right)\cdot \s1\,+\,
I\left(\m3\right)\cdot \s3\hfill
\end{multline}
\item[VBBB]
\begin{multline}
\label{eq:vbbb}
I\left((1-\w1)\m0+\w1\m1\right)\cdot \s0\,+\,
I\left(\m2\right)\cdot \s2\,+ \hfill\\
I\left(\w0\m0+(1-\w0)\m1\right)\cdot \s1\,+\,
I\left(\m3\right)\cdot \s3\hfill
\end{multline}
\item[BBBB]
\begin{multline}
\label{eq:bbbb}
I\left(\m0\right)\cdot \s0+
I\left(\m1\right)\cdot \s1+
I\left(\m2\right)\cdot \s2+
I\left(\m3\right)\cdot \s3\hfill
\end{multline}
\end{description}
It is a straightforward exercise to verify that these give the desired behavior
along each edge.
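For example, consider~\eqref{eq:bvvv} along the edge joining $\m0$ and $\m1$,
where $y=0$, so that $\s2=\s3=0$ and $\w0+\w1=1$.
The prediction reduces to
\begin{align*}
I\left(\m0\right)\cdot(1-x)+I\left(\m1\right)\cdot x\ ,
\end{align*}
the OBMC blending of~\eqref{eq:bbbb} restricted to that edge, as required for a
`B' edge.
Along the opposite edge, $y=1$, so $\s0=\s1=0$ and $\s2+\s3=1$, leaving
\begin{align*}
I\left(x\m2+(1-x)\m3\right)\ ,
\end{align*}
the CGI interpolation of~\eqref{eq:vvvv} restricted to that edge, as required
for a `V' edge.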
These formulas are not unique, but have been chosen to be compatible with the
adaptive subdivision introduced in the next section.
Each associates a bilinearly interpolated vector with each vertex and then
bilinearly blends the resulting predictors.
Using the same vector interpolation formula for multiple vertices makes the
prediction more like the CGI in~\eqref{eq:vvvv}.
Using multiple copies of the same nodes within a particular interpolation
formula makes the prediction more like the OBMC in~\eqref{eq:bbbb}.
A similar set of formulas exists for triangular meshes using barycentric
weights.
\section{Adaptive Subdivision}
\label{sec:subdiv}
In order to extend our method to handle variable block sizes while maintaining
continuity along the edges, we borrow a data structure from the related
graphics problem of surface simplification, called the semi-regular 4-8
mesh~\cite{VG00,DWSMAM97}.
This is normally used to represent a hierarchical triangle mesh, but we use it
for variable-sized quadrilaterals because its subdivision rules provide a
simple means of preserving continuity across block boundaries.
The vertices in the 4-8 mesh are arranged in a quadtree, where subdivision
proceeds in two stages, as illustrated in Figure~\ref{fig:4-8sub}.
In the first stage, a new vertex is added to the center of a quadrilateral.
This subdivides the quadrilateral into four \textit{quadrants}, but does not
add any additional vertices to the edges.
Such edges are called \textit{unsplit}.
In the second stage, each of the quadrilateral's edges is split and connected
to the center vertex, forming four new quadrilaterals.
One useful property of this dual-stage subdivision is that the number of
vertices in the mesh merely doubles after each stage, instead of quadrupling
as it would under normal quadtree subdivision.
This provides more fine-grained control over the number of MVs transmitted.
\begin{figure}[tb]
\center
\includegraphics[width=\columnwidth]{4-8sub}
\caption{The first four subdivision levels in a 4-8 mesh.
Solid dots indicate vertices with transmitted MVs.
New vertices at each level have arrows pointing to their two parents.
Vertices from the previous level have arrows from each of their four children.
In both cases, some may lie in an adjacent block.}
\label{fig:4-8sub}
\end{figure}
The subdivision is constrained by assigning every vertex two parents in the
tree: the two adjacent vertices from the immediately preceding subdivision
level.
A vertex may not be added to the mesh until both of its parents are present.
This ensures adjacent blocks never differ by more than one level of
subdivision.
Therefore, we need only specify how to handle a block that has undergone stage
one subdivision, but still has one or more unsplit edges, as illustrated in
Figure~\ref{fig:vbunsplit}.
Such a block is divided into quadrants, and each is interpolated separately
using the formulas from Section~\ref{sec:switching}, with some restrictions
and modifications.
\begin{figure}[tb]
\center
\figinput{vbunsplit}
\caption{Interpolation setup for a quadrant resulting from stage one
subdivision.
The MVs used at each corner are labeled, as is the type of interpolation
performed along each edge.
$c$ marks the exterior corner, which receives the extra blending weight along
the top edge.}
\label{fig:vbunsplit}
\end{figure}
The first restriction we impose is that an interior edge perpendicular to an
unsplit edge must use the same interpolation type.
It is actually well-defined to have an interior `B' edge adjacent to an unsplit
`V' edge, but having an interior `V' edge adjacent to an unsplit `B' edge
would require that the MV converge towards two values simultaneously, blended
in equal proportion.
It might be possible to develop a scheme that would accomplish this, but it
would be overly complex, and the edge itself would not actually behave like
other true `V' edges.
We do not expect either case to be useful in practice and so disallow them
both.
Hence, an unsplit `V' edge can be handled by simply averaging its two MVs
and using the result for that corner of the quadrilateral.
An unsplit `B' edge is only slightly more complicated.
The same two MVs are used along the edge as before, but we shift some of the
weight used for blending from the middle of the edge to the exterior corner.
More precisely, the weights $\s k$ are replaced by modified weights $\s{k}'$.
For example, if $c$ is the index of the vertex in the exterior corner,
$\oplus$ denotes addition modulo four, and $c\oplus 1$ is the index of the
corner bisecting the unsplit edge, then
\begin{align*}
\s{c}' & = \s{c}+\frac{1}{2}\s{c\oplus 1} &
\s{c\oplus 1}' & = \frac{1}{2}\s{c\oplus 1}\ .
\end{align*}
The remaining weights are unchanged.
A similar modification is used if it is $c\oplus 3$ that lies on the unsplit
`B' edge.
The modifications are cumulative.
That is, if both $c\oplus 1$ and $c\oplus 3$ lie on unsplit `B' edges,
\begin{align*}
\s{c}' & = \s{c}+\frac{1}{2}\s{c\oplus 1}+\frac{1}{2}\s{c\oplus 3} &
\s{c\oplus 1}' & = \frac{1}{2}\s{c\oplus 1}\\
\s{c\oplus 3}' & = \frac{1}{2}\s{c\oplus 3} &
\s{c\oplus 2}' & = \s{c\oplus 2}\ .
\end{align*}
It is clear that this strategy matches an adjacent quadrilateral along an
unsplit edge, but careful examination will verify that it also matches other
quadrants along the interior edges.
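Note also that these modifications only transfer weight from $\s{c\oplus 1}$
and $\s{c\oplus 3}$ to $\s{c}$, so that
\begin{align*}
\s{c}'+\s{c\oplus 1}'+\s{c\oplus 2}'+\s{c\oplus 3}' & =
 \s{c}+\s{c\oplus 1}+\s{c\oplus 2}+\s{c\oplus 3}=1\ ,
\end{align*}
and the modified weights still sum to one at every pixel.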
Each of the resulting weights can be evaluated with finite differences at the
cost of one addition per pixel, plus setup overhead.
The blending can still be done with three multiplies per pixel, since the
weights always sum to one.
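For illustration, the following C sketch (with a hypothetical block size,
floating-point weights rather than the fixed-point arithmetic a real
implementation would use, and the unmodified weights $\s k$) evaluates the
weights incrementally and performs the three-multiply blend:
\begin{verbatim}
/* Illustrative only: blend four predictor blocks p0..p3 of size BxB
 * with the bilinear weights s_k, updating the weights by finite
 * differences (one addition per weight per pixel) and using the fact
 * that the weights sum to one to blend with three multiplies. */
#define B 16 /* hypothetical block size */

void blend_block(float *dst, const float *p0, const float *p1,
                 const float *p2, const float *p3) {
  const float d = 1.0f / B;
  for (int j = 0; j < B; j++) {
    float y = (j + 0.5f) * d;
    float x0 = 0.5f * d;
    /* Weights at the start of the row and their per-column steps. */
    float s1 = x0 * (1 - y), ds1 = d * (1 - y);
    float s2 = x0 * y,       ds2 = d * y;
    float s3 = (1 - x0) * y, ds3 = -d * y;
    for (int i = 0; i < B; i++) {
      int k = j * B + i;
      /* s0 = 1 - s1 - s2 - s3, so three multiplies suffice. */
      dst[k] = p0[k] + s1 * (p1[k] - p0[k]) + s2 * (p2[k] - p0[k])
             + s3 * (p3[k] - p0[k]);
      s1 += ds1;
      s2 += ds2;
      s3 += ds3;
    }
  }
}
\end{verbatim}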
%The blending now requires four multiplies\footnote{A three-multiply version is
% still possible, but requires a much more complex inner loop}, however,
% compared to the three required by an optimized version of the bilinear
% blending kernel.
The mesh itself may require more vertices than the unconstrained meshes of
previous work to achieve a given level of subdivision in a local area, but
requires fewer bits to encode the subdivision itself, simply because there are
fewer admissible meshes.
As long as a $(0,0)$ MV residual can be efficiently encoded, the worst-case
rate of the 4-8 mesh should be close to that of a similar, unconstrained mesh.
\section{Multiresolution Blending}
\label{sec:blend}
While the bilinear blending functions we have chosen ensure a gradual
transition from one block to the next, they can still produce unwanted
artifacts.
For large image features, a structured color shift may still be visible, though
blurred, while small image features may appear doubled or superimposed.
To solve this problem, Burt and Adelson proposed a multiresolution blending
procedure~\cite{BA83}.
They decompose each image to be blended with a wavelet transform and then
perform the blending with a transition region commensurate with the scale of
each decomposition level.
The coefficients from each level are then recomposed to form the blended image.
Each $I(\cdot)$ term in our interpolation formulas forms a predictor image that
must be blended.
For simplicity, we perform only one level of decomposition into $LL$, $HL$,
$LH$, and $HH$ bands using a Haar wavelet.
This can be done in only a few additions and subtractions, since common scale
factors can be ignored.
We assign the same weight to corresponding coefficients in all four bands,
given by $\s k$ evaluated at the center of the $2\times 2$ block of pixels
they represent.
The $LL$ band is then bilinearly blended as normal.
But in the $HL$, $LH$, and $HH$ bands, we choose the sample with the largest
weight and discard the others.
In fact, because the shapes of the weight functions $\s k$ are regular, in
practice we need not ever compute the discarded samples.
The Haar transform is then reversed to produce the final prediction.
Because the bilinear blending occurs only in the $LL$ band, fully $\frac{3}{4}$
of the multiplications in this step are avoided.
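A minimal sketch of this blend for a single $2\times 2$ group of pixels
follows; the helper name and argument layout are hypothetical, floating point
is used for clarity, and a real implementation would also avoid computing the
discarded high-band samples:
\begin{verbatim}
/* Illustrative only: multiresolution blend of one 2x2 pixel group.
 * a, b, c, d hold the top-left, top-right, bottom-left and
 * bottom-right samples of each of the four predictors, and w[] holds
 * the weights s_k evaluated at the center of the group.  Common Haar
 * scale factors are folded into the final 1/4. */
static void haar_blend_2x2(const int a[4], const int b[4],
                           const int c[4], const int d[4],
                           const float w[4], float out[4]) {
  float ll = 0;
  int best = 0;
  /* The LL band is bilinearly blended as normal. */
  for (int k = 0; k < 4; k++)
    ll += w[k] * (a[k] + b[k] + c[k] + d[k]);
  /* In HL, LH and HH, keep the predictor with the largest weight. */
  for (int k = 1; k < 4; k++) if (w[k] > w[best]) best = k;
  float hl = a[best] - b[best] + c[best] - d[best];
  float lh = a[best] + b[best] - c[best] - d[best];
  float hh = a[best] - b[best] - c[best] + d[best];
  /* Invert the Haar transform to produce the final prediction. */
  out[0] = 0.25f * (ll + hl + lh + hh);
  out[1] = 0.25f * (ll - hl + lh - hh);
  out[2] = 0.25f * (ll + hl - lh - hh);
  out[3] = 0.25f * (ll - hl - lh + hh);
}
\end{verbatim}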
\section{Implementation and Motion Estimation}
The algorithms in Sections~\ref{sec:switching},~\ref{sec:subdiv},
and~\ref{sec:blend} have been implemented in a (TODO: describe codec).
To reduce aliasing, the reference frames are first upsampled by a factor of
two in each dimension using a Wiener filter~\cite{Wer96}.
This corrects for aliasing at the half-pel locations, where errors are the
largest~\cite{WM03}.
Samples at finer fractional displacements are bilinearly interpolated from the
upsampled image.
Luma blocks are square with sizes ranging from $16\times 16$ to $4\times 4$.
At the coarsest resolution, the control points are placed in the center of each
macroblock, and an extra ring of control points is added outside the image
with MVs fixed at $(0,0)$.
This accomplishes several things.
It makes the minimum number of transmitted MVs the same as in BMA, it aligns
the interpolation kernels with the piecewise-linear segments of any
$(1,2)$-regular subband filter, and it simplifies boundary cases when decoding
the adaptive mesh.
In the chroma planes, color subsampling may produce smaller, rectangular
blocks.
Multiresolution blending is disabled for chroma blocks if either dimension is
smaller than $4$.
We perform rate-distortion (R-D) optimization during motion estimation to
balance the error attainable against the number of bits required to achieve
it~\cite{OR98}.
These two metrics are related by a Lagrangian multiplier in the cost function
\begin{align}
J & = D+\lambda R
\end{align}
For a particular choice of $\lambda$, a minimal value of $J$ represents a point
on the optimal operational rate-distortion curve: i.e., the best achievable
distortion for a given rate, or vice versa.
The value of $\lambda$ required to achieve a given rate can be found with a
bisection search, or a reasonable estimate may be obtained directly from the
target quantizer setting~\cite{SW98}.
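For illustration, selecting among a set of candidates with this cost function
amounts to the following (the array names are hypothetical):
\begin{verbatim}
/* Illustrative only: choose the candidate with the smallest Lagrangian
 * cost J = D + lambda*R.  dist[] holds each candidate's SAD and
 * rate[] its estimated rate in bits. */
static int best_candidate(const int *dist, const int *rate, int ncand,
                          double lambda) {
  int best = 0;
  double best_j = dist[0] + lambda * rate[0];
  for (int i = 1; i < ncand; i++) {
    double j = dist[i] + lambda * rate[i];
    if (j < best_j) {
      best_j = j;
      best = i;
    }
  }
  return best;
}
\end{verbatim}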
Since the residual errors after motion compensation are reasonably
well-approximated by an exponential distribution, we use the Sum of Absolute
Differences (SAD) in the luma plane as our distortion metric~\cite{SLH00}.
We approximate the rate of each MV with the number of bits required to
represent the magnitude of each component in binary (after subtracting a
predictor), plus one bit for the sign.
A magnitude of zero requires one bit to represent.
The rate term for each vertex also includes a bit for each flag that indicates
the presence of a child (2 per vertex on average), and a bit for each edge
label (2 at level $0$, and 3 at levels $2$ and $4$).
The actual motion information is arithmetically encoded; this approximation is
used to avoid having to update the rate term for every vertex each time one is
changed.
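For illustration, this rate approximation for a single MV component residual
can be written as follows (whether a zero residual also carries a sign bit is
an interpretation; the actual arithmetic coder determines the real rates):
\begin{verbatim}
/* Illustrative only: approximate rate, in bits, of one MV component
 * residual d: the bits needed for |d| in binary (one bit for a zero
 * magnitude) plus a sign bit, here charged only when d is nonzero. */
static int mv_component_bits(int d) {
  int mag = d < 0 ? -d : d;
  int bits = 0;
  do {
    bits++;
    mag >>= 1;
  } while (mag > 0);
  return bits + (d != 0);
}
\end{verbatim}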
We use a median-of-four predictor for almost every motion vector, as
illustrated in Figure~\ref{fig:mvpred}.
The middle two values of each MV component are averaged, rounding towards even
values.
The only exception is if a predictor lies inside a $16\times 16$ block that
comes after the current one in raster order, in which case we use the median
of the three remaining predictors.
This occurs for the level $2$ and $4$ vertices on the right and bottom block
edges.
Excluding this predictor allows the mesh to be encoded one block at a time,
instead of level-by-level.
This permits lower latency in the encoder, and it increases cache coherency and
decreases memory footprint in the decoder.
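For illustration, the median-of-four computation for one component can be
written as follows; the exact rounding convention of the real coder may
differ, but halves are rounded here towards an even result:
\begin{verbatim}
/* Illustrative only: median-of-four prediction of one MV component.
 * The two middle values are averaged, rounding halves towards an even
 * result.  Assumes arithmetic right shift of negative values. */
static int med4(int a, int b, int c, int d) {
  int v[4] = { a, b, c, d };
  /* Tiny insertion sort; only the two middle values are needed. */
  for (int i = 1; i < 4; i++) {
    int t = v[i];
    int j = i;
    while (j > 0 && v[j - 1] > t) { v[j] = v[j - 1]; j--; }
    v[j] = t;
  }
  int sum = v[1] + v[2];
  int avg = sum >> 1;
  if ((sum & 1) && (avg & 1)) avg++;
  return avg;
}
\end{verbatim}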
\begin{figure}[tb]
\center
\includegraphics[width=\columnwidth]{mvpred}
\caption{The predictors used for the MVs at each level of the mesh.
Except at level $0$, these are all ancestors of that MV, and thus are required
to be present.}
\label{fig:mvpred}
\end{figure}
The number of combinations of subdivision levels, MVs, and edge labelings
available makes finding a globally optimal set of parameters impractical.
The problem of finding optimal subdivision levels alone is known to be
NP-hard~\cite{AS94}.
The estimation procedure outlined here attempts to balance speed with
compression performance, though it could certainly be improved by future
research.
It proceeds in several stages.
\subsection{Initial Estimates}
First, an EPZS${}^2$-style algorithm is used to produce an initial MV estimate
at each subdivision level using BMA~\cite{Tou02}.
This algorithm computes several MV candidates using spatial and temporal
neighbors and assuming constant speed or acceleration.
The candidates are ranked by reliability and the search terminates early if one
is found with a SAD below a dynamically chosen threshold.
Otherwise, a local gradient search is performed using a square pattern around
the best candidate vector.
An alternate version, dubbed simply EPZS, uses a small diamond pattern instead
of a square.
However, because duplicated SAD checks are easy to eliminate, the complexity
savings are relatively small: $3$ or $5$ new candidates for the square vs.
$3$ new candidates for the diamond, per iteration.
This makes the square pattern more attractive, as it produces slightly better
MVs.
Tourapis does not use a Lagrangian cost function, relying solely on the
correlation among predictors and the termination thresholds to achieve
smoothly varying, easily compressible MVs.
However, incorporating a rate term greatly improves the quality of the result.
%TODO: Cite?
Therefore we use $J$ to select the best candidate at each stage.
We continue to perform threshold checks on the raw SAD values, however.
The thresholds in EPZS${}^2$ also ensure extra computation time is spent only
on blocks whose predictor can reasonably be expected to improve.
At level $0$, $16\times 16$ blocks centered on the control points are used,
which correspond to macroblocks.
Levels $1$ and $2$ use $8\times 8$ blocks, while $3$ and $4$ use $4\times 4$
blocks.
As the mesh is subdivided, existing vertices do not have their MVs re-estimated
with the smaller block sizes, even though the area each MV influences is
reduced.
The larger support used by previous searches yields more reliable estimates,
and the accuracy of BMA is already highest at the block center~\cite{ZSNKI02}.
All MVs are estimated only up to whole-pel accuracy at this stage.
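For illustration, the candidate-then-refine structure described above can be
sketched as follows; this is a schematic outline, not the actual EPZS${}^2$
implementation, and the candidate ordering, threshold, and the sad() and
mv\_cost() callbacks are all assumed to be supplied externally:
\begin{verbatim}
/* Schematic sketch only.  Candidates are assumed ordered from most to
 * least reliable; sad() and mv_cost() are hypothetical callbacks. */
typedef struct { int x, y; } mv;

static mv estimate_mv(const mv *cand, int ncand, int thresh,
                      double lambda,
                      int (*sad)(mv), int (*mv_cost)(mv)) {
  mv best = cand[0];
  double best_j = 1e30;
  for (int i = 0; i < ncand; i++) {
    int d = sad(cand[i]);
    if (d < thresh) return cand[i];      /* early termination */
    double j = d + lambda * mv_cost(cand[i]);
    if (j < best_j) { best_j = j; best = cand[i]; }
  }
  /* Local refinement with a one-pixel square pattern, repeated until
   * no neighbor of the current best vector improves the cost. */
  for (int improved = 1; improved; ) {
    improved = 0;
    mv center = best;
    for (int dy = -1; dy <= 1; dy++) {
      for (int dx = -1; dx <= 1; dx++) {
        mv m = { center.x + dx, center.y + dy };
        double j = sad(m) + lambda * mv_cost(m);
        if (j < best_j) { best_j = j; best = m; improved = 1; }
      }
    }
  }
  return best;
}
\end{verbatim}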
\subsection{Adaptive Subdivision}
The second step fixes the mesh subdivision.
During this stage, the SAD for each block is computed using the proposed
prediction method instead of BMA.
Since MV interdependency has thus far been ignored, we use `B' labels for all
of the edges, as these dependencies have a less pronounced effect on the
prediction quality with OBMC.
The MVs produced by the previous stage are held fixed in this one; only the
mesh subdivision level changes.
The extra subdivision required to add a vertex to the 4-8 mesh is similar to
the implicit subdivision used by Zhang \textit{et al.} in their variable block
size OBMC scheme~\cite{ZAS98}.
The difference is that we optimize over and encode such subdivision explicitly.
We use a global R-D optimization strategy with general mesh decimations, as
proposed by Balmelli~\cite{Bal01}.
This is a greedy approach that starts with a full mesh and successively
decimates vertices.
Restricting decimation candidates to the leaves of the mesh often produces
sequences where the distortion goes \textit{down} as the rate is decreased,
clearly indicating that the previous rate allocation was not optimal.
General mesh decimations, on the other hand, allow any vertex to be removed at
a given step, not just the leaves.
If a non-leaf is decimated, all of its children are decimated as well.
This helps smooth out non-monotonicities in the distortion measure during the
decimation process, especially at low rates.
The following notation is needed to describe the algorithm.
The current mesh is denoted by $M$, and $M_v$ is the \textit{merging domain} of
$v$ in $M$: the set of vertices in $M$ that must be removed to remove $v$.
This includes $v$ and all of its undecimated children.
Additionally, the variation $\Delta\underline{u}(M_v)$ contains the pair
$(\Delta D(M_v),\Delta R(M_v))$: the change in distortion (SAD) and rate
caused by removing $M_v$ from $M$.
We also refer to the change in SAD in a single block $b$ caused by removing a
single vertex $v$ as $\Delta D_b(v)$.
Finally, $A_v$ is the set of ancestors of $v$ in $M$.
Some minor additions to Balmelli's original algorithm are made to handle the
fact that distortion is measured over squares, not triangles.
The steps of the algorithm are:
\begin{enumerate}
\item For all $v$, compute $\Delta\underline{u}(M_v)$.
\item Do
\begin{enumerate}
\item $v^*\leftarrow\argmin_{v\in M} -\frac{\Delta D(M_v)}{\Delta R(M_v)}$
\item If $-\frac{\Delta D(M_{v^*})}{\Delta R(M_{v^*})}>\lambda$, stop.
\item For all $w\in M_{v^*}$, sorted by depth from deepest to shallowest:
\begin{enumerate}
\item For all $a\in A_w$, $\Delta\underline{u}(M_a)\leftarrow
\Delta\underline{u}(M_a)-\Delta\underline{u}(M_w)$
\item Remove $w$ from the mesh.
\item\label{step:deltad}
If $w$ was on an even level, then for each adjacent block $b$ with a $w'\in M$
such that $w'$ lies on the same level as $w$:
\begin{enumerate}
\item $\delta D\leftarrow (\Delta D_b(w')$ for that block after decimating
$w)-(\Delta D_b(w')$ for that block before decimating $w)$
\item\label{step:deltadupdate} For all $a\in \{w'\}\cup A_{w'}\setminus A_{w}$,
$\Delta D(M_a)\leftarrow\Delta D(M_a)+\delta D$
\end{enumerate}
\end{enumerate}
\end{enumerate}
\end{enumerate}
These steps ensure that $\Delta\underline{u}(M_v)$ contains the up-to-date
changes in the global rate and distortion after each merging domain is
decimated.
This update process properly accounts for overlapping merging domains due to an
inclusion-exclusion principle; see Balmelli for details~\cite{Bal01}.
Step~\ref{step:deltad} handles the case of decimating one corner of a block,
$w$, when the opposite corner, $w'$, remains.
This changes $\Delta D_b(w')$, the cost of decimating the opposite corner, and
that change must be propagated to each merging domain to which $w'$ belongs.
No change needs to be made to the common ancestors of $w$ and $w'$, however:
once $\Delta D(M_{w'})$ is updated, the normal update process is sufficient.
The addition of these extra steps does not affect the computational complexity
of the algorithm, which is $\Theta(n\log n)$, where $n$ is the size of the
initial mesh.
The distortion measurements needed to initialize and update
$\Delta\underline{u}(M_v)$ can be computed once, in advance, by computing the
SAD value of each block in all sizes and with all possible combinations of
unsplit edges.
All told, each pixel in the image is used in exactly nine SAD computations.
Also, since the mesh only undergoes four levels of subdivision, there are only
a small number of unique merging domains and ancestor sets.
These can be computed offline in advance to simplify the decimation process.
To compute the set difference $A_{w'}\setminus A_{w}$, we note that $w$ and
$w'$ share a single common parent, $p$.
The common ancestors of $w$ and $w'$ are then exactly the set
$\{p\}\cup A_{p}$, meaning one can add $\delta D$ to the nodes in $A_{w'}$ and
then subtract it from the nodes in $\{p\}\cup A_{p}$ to effect the set
difference in Step~\ref{step:deltadupdate}.
Alternatively, one could simply use a larger set of lookup tables.
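For illustration, the outer greedy loop can be sketched as follows; the data
layout and the fixed-size ancestor and merging-domain lists are hypothetical,
and the Step~\ref{step:deltad} fixups and any priority-queue bookkeeping are
omitted:
\begin{verbatim}
/* Schematic sketch only: greedy decimation by distortion-per-bit. */
typedef struct vertex vertex;
struct vertex {
  double  dd, dr;     /* current Delta D(M_v) and Delta R(M_v)      */
  int     in_mesh;
  int     n_anc;      /* ancestors A_v                              */
  vertex *anc[8];
  int     n_dom;      /* merging domain M_v, sorted deepest first   */
  vertex *dom[16];
};

static void decimate(vertex **verts, int nverts, double lambda) {
  for (;;) {
    vertex *best = NULL;
    double best_ratio = 0;
    for (int i = 0; i < nverts; i++) {
      vertex *v = verts[i];
      if (!v->in_mesh || v->dr >= 0) continue;
      double ratio = -v->dd / v->dr;  /* distortion per bit saved */
      if (best == NULL || ratio < best_ratio) {
        best = v;
        best_ratio = ratio;
      }
    }
    if (best == NULL || best_ratio > lambda) break;
    for (int k = 0; k < best->n_dom; k++) {
      vertex *w = best->dom[k];
      if (!w->in_mesh) continue;
      for (int a = 0; a < w->n_anc; a++) {
        w->anc[a]->dd -= w->dd;       /* update the ancestors' costs */
        w->anc[a]->dr -= w->dr;
      }
      w->in_mesh = 0;
      /* The delta-D fixups described in the text would go here. */
    }
  }
}
\end{verbatim}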
\subsection{Iterative Refinement and Edge Labeling}
The next stage uses the iterated dynamic programming (DP) proposed by Chen and
Willson to assign edge labels and refine the MVs, accounting for their
interdependencies~\cite{CW00}.
In this scheme, a single row (resp. column) of MVs and edge labels is optimized
at a time using a Viterbi trellis~\cite{For73}, while the rest remain fixed.
If there is no direct block edge between two consecutive MVs in a row (resp.
column) then the trellis stops, and a new one is started.
This continues until the entire row (resp. column) has been examined.
The process is then repeated until the total change in Lagrangian cost $J$
falls below a given threshold.
Each vertex being examined has $h$ candidate MVs and $2h$ trellis states, one
for each combination of candidate and edge label.
One state for each MV candidate uses a `B' label between that vertex and the
next vertex, and the other uses a `V' label.
However, the edge label associated with the \textit{next} vertex does not
affect the SADs of the blocks around the current vertex.
Therefore, we need only check $2h^2$ trellis edges between states, not $4h^2$.
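Concretely, if $J_j(i,\ell)$ denotes the minimum cost of any path that assigns
the $j$th vertex in the chain its $i$th candidate MV and gives the edge to the
next vertex the label $\ell\in\{\mathrm{B},\mathrm{V}\}$, then one way to
write the update is
\begin{align*}
J_j(i,\ell) & = \min_{i',\ell'}\left[J_{j-1}(i',\ell')
 +\Delta D_j(i',\ell',i)+\lambda\,\Delta R_j(i',\ell',i)\right]\ ,
\end{align*}
where $\Delta D_j$ and $\Delta R_j$ are the changes in distortion and rate
described below.
Since $\ell$ does not appear on the right-hand side, only $h$ distinct
minimizations, each over the $2h$ predecessor states, are required, which is
where the count of $2h^2$ trellis edges comes from.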
\subsubsection{Rate and Distortion Changes}
We use the \textit{change} in rate and distortion to compute the cost of each
path in the trellis.
A single MV can influence the distortion of as many as 12 neighboring blocks.
Only the ones to the left (resp. above) are added to the current cost of each
path.
When the following edge label and MV are chosen, an additional 2 to 8 blocks
may be added.
If necessary, the blocks to the right (resp. below) are added after the last
MV in the trellis.
%TODO: Figure.
Unfortunately, the rate of a MV depends on the values of the MVs used to
predict it.
Chen and Willson assume MVs use 1-D differential coding, as in MPEG-1, but
modern video codecs use a median of three or four neighboring MVs to predict
the current one.
This means that several (not necessarily consecutive) MVs on the DP path may be
used to predict a given MV, and the corresponding change in rate is not known
until a MV has been chosen for all of them.
If we were to consider all possible combinations of candidates for the
predictors, the number of trellis edges would increase by several orders of
magnitude.
This seems excessively wasteful, since as long as the changes to the MVs are
small, the median operation ensures only one of them is likely to have any
influence on the predicted value in the first place.
Instead, we immediately compute the rate change in each predicted
vector---excluding those that themselves lie further along the DP path, since
we do not yet know what MV will be encoded.
We do this assuming all MVs not already considered by the DP remain fixed, and
add the change to the cost of the current path.
If changing a subsequent MV on the path causes the rate of one of these
predicted MVs to change again, the new rate change is used from then on.
Because we essentially discard a large number of trellis states of limited
utility, we might theoretically discard the path that does not change MVs or
edge labels, even though its true cost is lower than the ones we keep.
Thus, as a safety precaution, we check the final cost of the best path, and do
not apply it if it is greater than zero.
This does occur in practice, but very rarely.
Other, simpler alternatives to this approach are also possible.
For example, we tried only considering rate changes for MVs on the actual DP
path, which is much like Chen and Willson's approach.
However, on frames with complex motion, we have seen dramatic improvements in
visual quality and motion field smoothness by properly accounting for
\textit{all} of the rate changes.
This is because a level $0$ MV, for example, may be used to predict up to $20$
other MVs, only $6$ of which lie on a given DP path.
In a dense mesh, the rate changes off the path may dominate the ones on it.
Finally, there are the edge labels to consider.
Optimizing them for compressibility now would bias them towards the default `B'
label used in earlier stages.
Therefore we ignore the cost of labels during this stage.
This is equivalent to our previous assumption that they will each be coded with
one bit.
\subsubsection{Complexity Reduction}
Chen and Willson showed that using a logarithmic search instead of an
exhaustive one for the DP resulted in an average PSNR loss of only $0.05$ dB
and an average MV bitrate increase of $55$ bits per frame.
We take an even more aggressive approach, and replace the logarithmic search
with a diamond search~\cite{ZM00}.
Because the complexity of a given DP chain increases as $O(h^2)$, reducing the
candidate count can give a substantial performance boost.
The logarithmic search uses a $9$-candidate square pattern in each phase, with
displacements that shrink by a factor of two from one phase to the next.
Chen and Willson used a 3-phase search to achieve an effective search radius of
$\pm 7\times 7$ pixels.
In our framework, this requires examining $3\times 2\times 9^2=486$ trellis
edges per MV for one full iteration.
However, the large displacement patterns only alter a small number of MVs that
have usually been grossly mis-estimated.
They are even likely to cause further mis-estimation in areas with repeated
structure or a lack of texture, which makes further refinement less effective.
One alternative is to simply discard them, and only perform the square pattern
search with one-pixel displacements.
The chance of being trapped in a local minimum is increased, but three times as
many iterations can be performed in the same amount of time.
On the other hand, a small diamond search pattern has only $5$ candidates,
making it even more attractive.
The diamond search (DS) algorithm has several problems when used for BMA.
It fails to identify strong global motions, and at higher resolutions produces
non-smooth motion fields due to an increase in large, uniform areas.
However, EPZS${}^2$ does an excellent job of identifying global motion, and the
use of a Lagrangian cost function helps ensure easily compressible MVs.
Although the performance advantage of DS over a square search is slight in
BMA, in a DP context, the small diamond only requires examining
$2\times 5^2=50$ trellis edges per MV per iteration.
This allows more than nine times as many iterations as the full logarithmic
search in the same amount of time.
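For reference, the two one-pixel refinement patterns compared here consist of
the following offsets (both include the current position):
\begin{verbatim}
static const int square_pattern[9][2] = {
  {  0,  0 }, { -1, -1 }, {  0, -1 }, {  1, -1 }, { -1,  0 },
  {  1,  0 }, { -1,  1 }, {  0,  1 }, {  1,  1 }
};
static const int diamond_pattern[5][2] = {
  {  0,  0 }, {  0, -1 }, { -1,  0 }, {  1,  0 }, {  0,  1 }
};
\end{verbatim}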
TODO: Initial results on coastguard suggest that the PSNR loss for using the
diamond pattern is only 0.0013 dB per frame!
The speedup, however, is dramatic.
%No quantization, small threshold, 100 frames:
%On average, $3.70$ full logarithmic search iterations are needed per frame,
% compared to $4.37$ diamond pattern iterations.
%> 35 dB quantization, large threshold:
On average, $2.02$ full logarithmic search iterations are needed per frame,
compared to $2.03$ diamond pattern iterations.
This represents about $9.6$ times fewer trellis edges examined, with virtually
no quality loss.
Of the 299 predicted frames, 150 have a slightly \textit{higher} PSNR when
using diamond search.
This needs to be tested more thoroughly (more sequences, more $\lambda$
values), and the effect on bitrate also needs to be measured.
Note that the computational complexity is still relatively high.
In a single iteration, each of the four edges of a block is traversed exactly
once by a DP path, during which the block's SAD is evaluated $50$ times, for a
total of $200$ SAD calculations per block.
This is nearly as many as full search BMA with a $\pm 7\times 7$ window, and
computing our interpolated, blended residuals has a much higher complexity.
Thus it is not suitable for a real-time implementation.
A DP solution for choosing edge labels alone, however, requires only $8$ SAD
calculations per block, and could feasibly be used in a real-time system.
%TODO: Use thresholds to eliminate blocks from the DP.
%Only examine vertices adjacent to a block above threshold.
%How do we determine thresholds?
%Because of the adaptive subdivision, we will not have the same set of
% consecutive blocks in adjacent frames.
%We instead mark each block whose SAD remains above its threshold, and consider
% only MVs that affect those blocks in subsequent iterations of the DP.
%This focuses extra computation time only on areas where the motion estimation
% is performing poorly, as that is where we have the most to gain.
%When used in DP, this is not the case.
%TODO: Test! Does that actually make it better?
\subsection{Sub-pel Refinement}
The same small diamond-pattern search can be used to refine the motion vectors
to sub-pel precision.
Our implementation supports half-, quarter-, or eighth-pel resolution MVs.
First, the DP process is iterated with the small diamond and half-pel
displacements until the change in Lagrangian cost $J$ for an iteration falls
below a given threshold.
%The first iteration is applied to all MVs in the mesh, not just those adjacent
% to a block with a high SAD.
%TODO: Optimize edge labels for compressibility.
Finer resolutions are only used if they provide an overall R-D benefit, which
is tested on a frame-by-frame basis.
Iteration is done with quarter-pel displacements first, followed, if
successful, by eighth-pel.
If the decrease in SAD from the finer-resolution MVs cannot balance the
2-bit-per-MV increase in bitrate, then the coarser vectors are used instead.
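Expressed with the same Lagrangian cost function $J$, this frame-level test
amounts to the following check (the variable names are hypothetical):
\begin{verbatim}
/* Illustrative only: keep the finer MV resolution for a frame when the
 * SAD it saves outweighs the roughly two extra bits per MV. */
static int keep_finer_resolution(long sad_coarse, long sad_fine,
                                 int num_mvs, double lambda) {
  return sad_fine + lambda * 2 * num_mvs < sad_coarse;
}
\end{verbatim}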
%TODO: Continue to allow edge labels to vary?
\section{Results}
The motion estimation algorithms described here were tested on several
standard sequences.
Because the presence of coding errors can have a large impact on motion
estimation performance, each reference frame was produced by encoding the
residual error using DPCM with a uniform quantizer and simple 1-pixel error
diffusion.
This was chosen because it is simple to implement---and thus easily
reproducible---and does not suffer from blocking artifacts.
The resulting reference frames have a PSNR of greater than $35$ dB.
\bibliography{daala}
\end{document}