1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000100110021003100410051006100710081009101010111012101310141015101610171018101910201021102210231024102510261027102810291030103110321033103410351036103710381039104010411042104310441045104610471048104910501051105210531054105510561057105810591060106110621063106410651066106710681069107010711072107310741075107610771078107910801081108210831084108510861087108810891090109110921093109410951096109710981099110011011102110311041105110611071108110911101111111211131114111511161117111811191120112111221123112411251126112711281129113011311132113311341135113611371138113911401141114211431144114511461147114811491150115111521153115411551156115711581159116011611162116311641165116611671168116911701171117211731174117511761177117811791180118111821183118411851186118711881189119011911192119311941195119611971198119912001201120212031204120512061207120812091210121112121213121412151216121712181219122012211222122312241225122612271228122912301231123212331234123512361237123812391240124112421243124412451246124712481249125012511252125312541255125612571258125912601261126212631264126512661267126812691270127112721273127412751276127712781279128012811282128312841285128612871288128912901291129212931294129512961297129812991300130113021303130413051306 |
- VP3 Bitstream Format and Decoding Process
- by Mike Melanson (mike at multimedia.cx)
- v0.5: December 8, 2004
- [December 8, 2004: Note that this document is not complete and likely
- will never be completed. However, it helped form the basis of Theora I
- specification available at
- http://www.theora.org/doc/Theora_I_spec.pdf ]
- Contents
- --------
- * Introduction
- * Underlying Coding Concepts
- * VP3 Coding Overview
- * VP3 Chunk Format
- * Decoding The Frame Header
- * Initializing The Quantization Matrices
- * Hilbert Coding Pattern
- * Unpacking The Block Coding Information
- * Unpacking The Macroblock Coding Mode Information
- * Unpacking The Macroblock Motion Vectors
- * Unpacking The DCT Coefficients
- * Reversing The DC Prediction
- * Reconstructing The Frame
- * Theora Specification
- * Appendix A: Quantization Matrices And Scale Factors
- * Appendix B: Macroblock Coding Mode Alphabets
- * Appendix C: DCT Coefficient VLC Tables
- * Appendix D: The VP3 IDCT
- * Acknowledgements
- * References
- * Changelog
- Introduction
- ------------
- A company named On2 (http://www.on2.com) created a video codec named
- VP3. Eventually, they decided to open source it. Like any body of code
- that was produced on a deadline, the source code was not particularly
- clean or well-documented. This makes it difficult to understand the
- fundamental operation of the codec.
- This document describes the VP3 bitstream format and decoding process at
- a higher level than source code.
- Underlying Coding Concepts
- --------------------------
- In order to understand the VP3 coding method it is necessary to
- understand the individual steps in the process. Like many multimedia
- compression algorithms VP3 does not consist of a single coding method.
- Rather, it uses a chain of methods to achieve compression.
- If you are acquainted with the MPEG video clique then many of VP3's
- coding concepts should look familiar as well. What follows is a list of
- the coding methods used in VP3 and a brief description of each.
- * Discrete Cosine Transform (DCT): This is a magical mathematical
- function that takes a group of numbers and turns it into another group
- of numbers. The transformed group of numbers exhibits some curious
- properties. Notably, larger numbers are concentrated in certain areas of
- the transformed group.
- A video codec like VP3 often operates on 8x8 blocks of numbers. When
- these 8x8 blocks are transformed using a DCT the larger numbers occur
- mostly in the up and left areas of the block with the largest number
- occurring as the first in the block (up-left corner). This number is
- called the DC coefficient. The other 63 numbers are called the AC
- coefficients.
- The DCT and its opposite operation, the inverse DCT, require a lot of
- multiplications. Much research and experimentation is focused of
- optimizing this phase of the coding/decoding process.
- * Quantization: This coding step tosses out information by essentially
- dividing a number to be coded by a factor and throwing away the
- remainder. The inverse process (dequantization) involves multiplying by
- the same factor to obtain a number that is close enough to the original.
- * Run Length Encoding (RLE): The concept behind RLE is to shorten runs
- of numbers that are the same. For example, the string "88888" is encoded
- as (5, 8), indicating a run of 5 '8' numbers. In VP3 (and MPEG/JPEG),
- RLE is used to record the number of zero-value coefficients that occur
- before a non-zero coefficient. For example:
- 0 0 0 0 5 0 2 0 0 0 9
- is encoded as:
- (4, 5), (1, 2), (3, 9)
- This indicates that a run of 4 zeroes is followed by a coefficient of 5;
- then a run of 1 zero is followed by 2; then a run of 3 zeroes is
- followed by 9.
- * Zigzag Ordering: After transforming and quantizing a block of samples,
- the samples are not in an optimal order for run length encoding. Zigzag
- ordering rearranges the samples to put more zeros between non-zero
- samples.
- * Differential (or Delta) Pulse Code Modulation (DPCM): 1 + 1 = 2. Got
- that? Seriously, that is what DPCM means. Rather than encoding absolute
- values, encode the differences between successive values. For example:
- 82 84 81 80 86 88 85
- Can be delta-encoded as:
- 82 +2 -3 -1 +6 +2 -3
- Most of the numbers turn into smaller numbers which require less
- information to encode.
- * Motion Compensation: Simply, this coding method specifies that a block
- from a certain position in the previous frame is to be copied into a new
- position in the current frame. This technique is often combined with DCT
- and DPCM coding, as well as fractional pixel motion.
- * Entropy Coding (a.k.a. Huffman Coding): This is the process of coding
- frequently occurring symbols with fewer bits than symbols that are not
- likely to occur as frequently.
- * Variable Length Run Length Booleans: An initial Boolean bit is
- extracted from the bitstream. A variable length code (VLC) is extracted
- from the bitstream and converted to a count. This count indicates that
- the next (count) elements are to be set to the Boolean value.
- Afterwards, the Boolean value is toggled, the next VLC is extracted and
- converted to a count, and the process continues until all elements are
- set to either 0 or 1.
- * YUV Colorspace: Like many modern video codecs, VP3 operates on a YUV
- colorspace rather than a RGB colorspace. Specifically, VP3 uses YUV
- 4:2:0, alias YUV420P, YV12. Note: Throughout the course of this
- document, the U and V planes (a.k.a., Cb and Cr planes) will be
- collectively referred to as C planes (color or chrominance planes).
- * Frame Types: VP3 has intra-coded frames, a.k.a. intraframes, I-frames,
- or keyframes. VP3 happens to call these golden frames. VP3 has
- interframes, a.k.a. predicted frames or P-frames. These frames can use
- information from either the previous interframe or from the previous
- golden frame.
- VP3 Overview
- ------------
- The first thing to understand about the VP3 coding method is that it
- encodes all 3 planes upside down. That is, the data is encoded from
- bottom-to-top rather than top-to-bottom as is done with many video
- codecs.
- VP3 codes a video frame by first breaking each of the 3 planes (Y, U,
- and V) into a series of 8x8 blocks called fragments. VP3 also has a
- notion of superblocks. Superblocks encapsulate 16 fragments arranged in
- a 4x4 matrix. Each plane has its own set of superblocks. Further, VP3
- also uses the notion of macroblocks which is the same as that found in
- JPEG/MPEG. One macroblock encompasses 4 blocks from the Y plane arranged
- in a 2x2 matrix, 1 block from the U plane, and 1 block from the V plane.
- While a fragment or a superblock applies to 1 and only 1 plane, a
- macroblock extends over all 3 planes.
- VP3 compresses golden frames by transforming each fragment with a
- discrete cosine transform. Each transformed sample is then quantized and
- the DC coefficient is reduced via DPCM using a combination of DC
- coefficients from surrounding fragments as predictors. Then, each
- fragment's DC coefficient is entropy-coded in the output bitstream,
- followed by each fragment's first AC coefficient, then each second AC
- coefficient, and so on.
- An interframe, naturally, is more complicated. While there is only one
- coding mode available for a golden frame (intra coding), there are 8
- coding modes that the VP3 coder can choose from for interframe
- macroblocks. Intra coding as seen in the keyframe is still available.
- The rest of the modes involve encoding a fragment diff, either from the
- previous frame or the golden frame, from the same coordinate or from the
- same coordinate plus a motion vector. All of the macroblock coding modes
- and motion vectors are encoded in an interframe bitstream.
- VP3 Chunk Format
- ----------------
- The high-level format of a compressed VP3 frame is laid out as:
- * chunk header
- * block coding information
- * macroblock coding mode information
- * motion vectors
- * DC coefficients
- * 1st AC coefficients
- * 2nd AC coefficients
- * ...
- * 63rd AC coefficients
- Decoding The Frame Header
- -------------------------
- The chunk header always contains at least 1 byte which has the following
- format:
- bit 7: 0 = golden frame, 1 = interframe
- bit 6: unused
- bits 5-0: Quality index (0..63)
- Further, if the frame is a golden frame, there are 2 more bytes in the
- header:
- byte 0: version byte 0
- byte 1:
- bits 7-3: VP3 version number (stored)
- bit 2: key frame coding method (0 = DCT key frame, only type
- supported)
- bits 1-0: unused, spare bits
- All frame headers are encoded with a quality index. This 6-bit value is
- used to index into 2 dequantizer scaling tables, 1 for DC values and 1
- for AC values. Each of the 3 dequantization tables is modified per these
- scaling values.
- Initializing The Quantization Matrices
- --------------------------------------
- VP3 has three static matrices for quantizing and dequantizing fragments.
- One matrix is for quantizing golden frame Y fragments, one matrix is for
- quantizing golden frame C fragments, and one matrix is for quantizing both
- golden frame and interframe Y or C fragments. While these matrices are
- static, they are adjusted according to quality index coded in the header.
- The quality index is an index into 2 64-element tables:
- dc_scale_factor[] and ac_scale_factor[]. Each quantization factor from
- each of the three quantization matrices is adjusted by the appropriate
- scale factor according to this formula:
- base quantizer * scale factor
- quantizer = -----------------------------
- 100
-
- where scale factor =
- dc_scale_factor[quality_index] for DC dequantizer
- ac_scale_factor[quality_index] for AC dequantizer
- The quantization matrices need to be recalculated at the beginning of a
- frame decode if the current frame's quality index is different from the
- previous frame's quality index.
- See Appendix A for the complete VP3 quantization matrices and scale factor
- tables.
- As an example, this is the base quantization matrix for golden frame Y
- fragments:
- 16 11 10 16 24 40 51 61
- 12 12 14 19 26 58 60 55
- 14 13 16 24 40 57 69 56
- 14 17 22 29 51 87 80 62
- 18 22 37 58 68 109 103 77
- 24 35 55 64 81 104 113 92
- 49 64 78 87 103 121 120 101
- 72 92 95 98 112 100 103 99
- If a particular coded frame specifies a quality index of 54. Element 54
- of the dc_scale_factor table is 20, thus:
- 16 * 20
- DC coefficient quantizer = ------- = 3
- 100
- Element 54 of the ac_scale_factor table is 24. The AC coefficient
- quantizers are each scaled using this factor, e.g.:
- 11 * 24
- ------- = 2
- 100
- 100 * 24
- -------- = 24
- 100
- [not complete; still need to explain how these quantizers are saturated
- and scaled with respect to the DCT process]
- Hilbert Coding Pattern
- ----------------------
- VP3 uses a Hilbert pattern to code fragments within a superblock. A
- Hilbert pattern is a recursive pattern that can grow quite complicated.
- The coding pattern that VP3 uses is restricted to this pattern subset,
- where each fragment in a superblock is represented by a 'X':
- X -> X X -> X
- | ^
- v |
- X <- X X <- X
- | ^
- v |
- X X -> X X
- | ^ | ^
- v | v |
- X -> X X -> X
- As an example of this pattern, consider a plane that is 256 samples wide
- and 64 samples high. Each fragment row will be 32 fragments wide. The
- first superblock in the plane will be comprised of these 16 fragments:
- 0 1 2 3 ... 31
- 32 33 34 35 ... 63
- 64 65 66 67 ... 95
- 96 97 98 99 ... 127
- The order in which these 16 fragments are coded is:
- 0 | 0 1 14 15
- 32 | 3 2 13 12
- 64 | 4 7 8 11
- 96 | 5 6 9 10
- All of the image coding information, including the block coding status
- and modes, the motion vectors, and the DCT coefficients, are all coded
- and decoded using this pattern. Thus, it is rather critical to have the
- pattern and all of its corner cases handled correctly. In the above
- example, if the bottom row and left column were not present due to the
- superblock being in a corner, the pattern proceeds as if the missing
- fragments were present, but the missing fragments are omitted in the
- final coding list. The coding order would be:
- 0, 1, 2, 3, 4, 7, 8, 13, 14
- Unpacking The Block Coding Information
- --------------------------------------
- After unpacking the frame header, the decoder unpacks the block coding
- information. The only information determined in this phase is whether a
- particular superblock and its fragments are coded in the current frame
- or unchanged from the previous frame. The actual coding method is
- determined in the next phase.
- If the frame is a golden frame then every superblock, macroblock, and
- fragment is marked as coded.
- If the frame is an interframe, then the block coding information must be
- decoded. This is the phase where a decoder will build a list of coded
- fragments for which coding mode, motion vector, and DCT coefficient data
- must be decoded.
- First, a list of partially-coded superblocks is unpacked from the
- stream. This list is coded as a series of variable-length run length
- codes (VLRLC). First, the code is initialized by reading the next bit in
- the stream. Then, while there are still superblocks remaining in the
- list, fetch a VLC from the stream according to this table:
- Codeword Run Length
- 0 1
- 10x 2-3
- 110x 4-5
- 1110xx 6-9
- 11110xxx 10-17
- 111110xxxx 18-33
- 111111xxxxxxxxxxxx 34-4129
- For example, a VLC of 1101 represents a run length of 5. If the VLRLC
- was initialized to 1, then the next 5 superblocks would be set to 1,
- indicating that they are partially coded in the current frame. Then the
- bit value is toggled to 0, another VLC is fetched from the stream and
- the process continues until each superblock has been marked either
- partially coded (1) or not (0).
- If any of the superblocks were marked as not partially coded in the
- previous step, then a list of fully-coded superblocks is unpacked next
- using the same VLRLC as the list of partially-coded superblocks.
- Initialize the VLRLC with the next bit in the stream. For each
- superblock that was not marked as partially coded, mark it with either a
- 0 or 1 according to the current VLRLC. By the end of this step, each
- superblock will be marked as either not coded, partially coded, or fully
- coded.
- Let's work through an example with an image frame that is 256x64 pixels.
- This means that the Y plane contains 4x2 superblocks and each of the C
- planes contains 2 superblocks each. The superblocks are numbered as
- follows:
- Y: 0 1 2 3 U: 8 9
- 4 5 6 7 V: 10 11
- This is the state of the bitstream:
- 1100011001101
- Which is interpreted as:
- initial 2 1's 1 0 4 1's 5 0's
- 1 100 0 1100 1101
- Superblocks 0-1 and 3-6 are marked as partially coded. Since there were
- blocks that were not marked, proceed to unpack the list of fully-coded
- superblocks. This is the state of the bitstream:
- 1101101
- Which is interpreted as:
- initial 3 1's 3 0's
- 1 101 100
- Superblocks 2, 7, and 8 are marked as fully coded while superblocks 9,
- 10, and 11 are marked as not coded.
- If any of the superblocks were marked as partially coded, the next data
- in the bitstream will define which fragments inside each partially-coded
- superblock are coded. This is the first place where the Hilbert pattern
- comes into play.
- For each partially-coded superblock, iterate through each fragment
- according to the Hilbert pattern. Use the VLRLC method, only with a
- different table, to determine which fragments are coded. The VLRLC table
- for fragment coding runs is:
- Codeword Run Length
- 0x 1-2
- 10x 3-4
- 110x 5-6
- 1110xx 7-10
- 11110xx 11-14
- 11111xxxx 15-30
- Continuing with the contrived example, superblocks 0 and 1 are both
- partially coded. This is the state of the bitstream:
- 0011001111010001111010...(not complete)
- Which is interpreted as:
- initial 2 0's 3 1's 13 0's 1 1 13 0's
- 0 01 100 1111010 00 1111010 ...
- This indicates that fragments 2-4 in superblock 0 are coded, while
- fragments 0, 1, and 5-15 are not. Note that the run of 12 0's cascades
- over into the next fragment, indicating that fragment 0 of superblock 1
- is not coded. Fragment 1 of superblock 1 is coded, while the rest of the
- superblock's fragments are not coded. The example ends there (a real
- bitstream should have enough data to describe all of the partially-coded
- superblocks). Superblock 2 is fully coded which means all 16 fragments
- are coded. Thus, superblocks 0-2 have the following coded fragments:
- 0 | x x x x x x x x 0 1 14 15
- 32 | 3 2 x x x 2 x x 3 2 13 12
- 64 | 4 x x x x x x x 4 7 8 11
- 96 | x x x x x x x x 5 6 9 10
- This is a good place to generate the list of coded fragment numbers for
- this frame. In this case, the list will begin as:
- 33 32 64 37 8 9 41 40 72 104 105 73 ...
- and so on through the remaining 8 fragments of superblock 2 and onto the
- fragments for the remaining superblocks that are either fully or
- partially coded.
- Unpacking The Macroblock Coding Mode Information
- ------------------------------------------------
- After unpacking the block coding information, the decoder unpacks the
- macroblock coding mode information. This process is simple when
- decoding a golden frame-- since the only possible decoding mode is INTRA,
- no macroblock coding mode information is transmitted. However, in an
- interframe, each coded macroblock is encoded with one of 8 methods:
- 0, INTER_NO_MV:
- current fragment =
- (fragment from previous frame @ same coordinates) +
- (DCT-encoded residual)
- 1, INTRA:
- current fragment = DCT-encoded block, just like in a golden frame
- 2, INTER_PLUS_MV:
- current fragment =
- (fragment from previous frame @ (same coords + motion vector)) +
- (DCT-encoded residual)
- 3, INTER_LAST_MV:
- same as INTER_PLUS_MV but using the last motion vector decoded from
- the bitstream
- 4, INTER_PRIOR_LAST;
- same as INTER_PLUS_MV but using the second-to-last motion vector
- decoded from the bitstream
- 5, USING_GOLDEN:
- same as INTER_NO_MV but referencing the golden frame instead of
- previous interframe
- 6, GOLDEN_MV:
- same as INTER_PLUS_MV but referencing the golden frame instead of
- previous interframe
- 7, INTER_FOURMV:
- same as INTER_PLUS_MV except that each of the 4 Y fragments gets its
- own motion vector, and the U and V fragments share the same motion
- vector which is the average of the 4 Y fragment vectors
- The MB coding mode information is encoded using one of 8 alphabets. The
- first 3 bits of the MB coding mode stream indicate which of the 8
- alphabets, 0..7, to use to decode the MB coding information in this frame.
- The reason for the different alphabets is to minimize the number of bits
- needed to encode this section of information. Each alphabet arranges the
- coding modes in a different order, indexing the 8 modes into 8 index
- slots. Index 0 is encoded with 1 bit (0), index 1 is encoded with 2 bits
- (10), index 2 is encoded with 3 bits (110), and so on up to indices 6 and
- 7 which are encoded with 6 bits each (1111110 and 1111111, respectively):
- index encoding
- ----- --------
- 0 0
- 1 10
- 2 110
- 3 1110
- 4 11110
- 5 111110
- 6 1111110
- 7 1111111
- For example, the coding modes are arranged in alphabet 1 as follows:
- index coding mode
- ----- -----------
- 0 MODE_INTER_LAST_MV
- 1 MODE_INTER_PRIOR_LAST
- 2 MODE_INTER_PLUS_MV
- 3 MODE_INTER_NO_MV
- 4 MODE_INTRA
- 5 MODE_USING_GOLDEN,
- 6 MODE_GOLDEN_MV
- 7 MODE_INTER_FOURMV
- This alphabet arrangement is designed for frames in which motion vectors
- based off of the previous interframe dominate.
- When unpacking MB coding mode information for a frame, the decoder first
- reads 3 bits from the stream to determine the alphabet. In this example,
- the 3 bits would be 001 to indicate alphabet 1. Consider this contrived
- bitstream following the alphabet number:
- 1010000011000011111110...
-
- The bits are read as follows:
- 10 10 0 0 0 0 110 0 0 0 1111111 0
- index: 1 1 0 0 0 0 2 0 0 0 7 0
- This arrangement of indices translates to this series of coding modes:
- index coding mode
- ----- -----------
- 1 MODE_INTER_PRIOR_LAST
- 1 MODE_INTER_PRIOR_LAST
- 0 MODE_INTER_LAST_MV
- 0 MODE_INTER_LAST_MV
- 0 MODE_INTER_LAST_MV
- 0 MODE_INTER_LAST_MV
- 2 MODE_INTER_PLUS_MV
- 0 MODE_INTER_LAST_MV
- 0 MODE_INTER_LAST_MV
- 0 MODE_INTER_LAST_MV
- 7 MODE_INTER_FOURMV
- 0 MODE_INTER_LAST_MV
-
- There are 6 pre-defined alphabets. Consult Appendix B for the complete
- alphabets. What happens if none of the 6 pre-defined alphabets fit? The
- VP3 encoder can choose to use alphabet 0 which indicates a custom
- alphabet. The 3-bit coding mode numbers for each index, 0..7, are stored
- after the alphabet number in the bitstream. For example, the sequence:
- 000 111 110 101 100 011 010 001 000
- would indicate coding alphabet 0 (custom alphabet), index 0 corresponds to
- coding mode 7 (INTER_FOURMV), index 1 corresponds to coding mode 6
- (GOLDEN_MV), and so on down to index 7 which would correspond to coding
- mode 0 (INTER_NO_MV).
- There is one more possible alphabet: Alphabet 7. This alphabet is
- reserved for when there is such a mixture of coding modes used in a frame
- that using any variable-length coding mode would result in more bits than
- a fixed-length representation. When alphabet 7 is specified, the decoder
- reads 3 bits at a time from the bitstream, and uses those directly as the
- macroblock coding modes.
- To recap, this is the general algorithm for decoding macroblock coding
- mode information:
- if (golden frame)
- all frames are intracoded, there is no MB coding mode information
- else
- read 3 bits from bitstream to determine alphabet
- if alphabet = 0
- this is a custom alphabet, populate index table with 8 3-bit coding
- modes read from bitstream
- foreach coded macroblock, unpack a coding mode:
- if alphabet = 7
- read 3 bits from the bitstream as the coding mode for the
- macroblock
- else
- read a VLC from the bitstream
- use the decoded VLC value to index into the coding mode alphabet
- selected for this frame and assign the indexed coding mode to
- this macroblock
-
- Unpacking The Macroblock Motion Vectors
- ---------------------------------------
- After unpacking the macroblock coding mode information, the decoder
- unpacks the macroblock motion vectors. This phase essentially assigns a
- motion vector to each of the 6 constituent fragments of any coded
- macroblock that requires motion vectors.
- If the frame is a golden frame then there is no motion compensation and
- no motion vectors are encoded in the bitstream.
- If the frame is an interframe, the next bit is read from the bitstream
- to determine the vector entropy coding method used. If the coding method
- is zero then all of the vectors will be unpacked using a VLC method. If
- the coding method is 1 then all of the vectors will be unpacked using a
- fixed length method.
- The VLC unpacking method reads 3 bits from the bitstream. These 3 bits
- comprise a number ranging from 0..7 which indicate the next action:
- 0, MV component = 0
- 1, MV component = 1
- 2, MV component = -1
- 3, MV component = 2, read next bit for sign
- 4, MV component = 3, read next bit for sign
- 5, MV component = 4 + (read next 2 bits), read next bit for sign
- range: (4..7, -4..-7)
- 6, MV component = 8 + (read next 3 bits), read next bit for sign
- range: (8..15, -8..-15)
- 7, MV component = 16 + (read next 4 bits), read next bit for sign
- range: (16..31, -16..-31)
- The fixed length vector unpacking method simply reads the next 5 bits
- from the bitstream, reads the next bit for sign, and calls the whole
- thing a motion vector component. This gives a range of (-31..31), which
- is the same range as the VLC method.
- For example, consider the following contrived motion vector bitstream:
- 000001011011111000...
- The stream is read as:
- 0 (000 010) (110 111 1 100 0)
- The first bit indicates the entropy method which, in this example, is
- variable length as opposed to fixed length. The next 3 bits are 0 which
- indicate a X MV component of 0. The next 3 bits are 2 which indicate a Y
- MV component of -1. The first motion vector encoded in this stream is
- (0, -1). The next 3 bits are 6 which indicate 8 + next 3 bits (7) with
- another bit indicating sign (1 in this case, which is negative). Thus,
- the X MV component is -15. The next 3 bits are 4 which indicate a Y MV
- component of 3 with one more bit for the sign (0 is positive). So the
- second motion vector encoded in this stream is (-15, 3).
- As an example of the fixed-length entropy method, consider the following
- contrived bitstream:
- 1010101101010...
- The stream is read as:
- 1 01010 1 10101 0
- The first bit indicates the fixed length entropy method. The first 5 bits
- are 10 followed by a negative sign bit. The next 5 bits are 21 followed by
- a positive sign bit. The first motion vector in this stream is (-10, 21).
- During this phase of the decoding process, it is traditional to assign all
- motion vectors for all coded macroblocks that require them, whether they
- are unpacked from the motion vector bitstream or copied from previous
- coded macroblocks. It is necessary to track the motion vectors for both
- the previous macroblock as well as the next-to-last (prior) macroblock.
- The general algorithm for this phase is as follows:
- foreach coded macroblock
- last MV = 0
- prior last MV = 0
- if coding mode = MODE_INTER_PLUS_MV or MODE_GOLDEN_MV
- read current MV pair from the bitstream and set all fragment motion
- vectors to that pair
- prior last MV = last MV
- last MV = current MV
-
- if coding mode = MODE_INTER_FOURMV
- read MV for first Y fragment in macroblock
- read MV for second Y fragment in macroblock
- read MV for third Y fragment in macroblock
- read MV for fourth Y fragment in macroblock
- set U & V fragment motion vectors to average of 4 Y vectors,
- calculated as follows:
- if sum of all 4 X motion components is positive, the X
- motion component for the U & V fragments is (sum + 2) / 4,
- otherwise, it is (sum - 2) / 4; repeat the same process for the
- Y components
- prior last MV = last MV
- last MV = MV for fourth Y fragment from this macroblock
-
- if coding mode = MODE_INTER_LAST_MV
- motion vectors for this macroblock are the same as last MV; note
- that in this case, the last MV remains the last MV and the prior
- last MV remains the prior last MV
-
- if coding mode = MODE_INTER_PRIOR_LAST
- motion vectors for this macroblock are the same as prior last MV
- prior last MV = last MV
- last MV = current MV (effectively, swap last and prior last vectors)
- Unpacking The DCT Coefficients
- ------------------------------
- After unpacking the macroblock motion vectors, the decoder unpacks the
- fragment DCT coefficient data. Each coded fragment has 64 DCT
- coefficients. Some of the coefficients will be non-zero. Many of the
- coefficients will, or should be 0 as this is where the coding method
- derives much of its compression.
- During this phase, the decoder will be unpacking DCT coefficients, zero
- runs, and end-of-block (EOB) codes. The decoder unpacks the the DC
- coefficients for all fragments, then all of the first AC coefficients,
- and so on until all of the 64 DCT coefficients are unpacked from the
- bitstream.
- To obtain the DCT coefficients, the decoder unpacks a series of VLCs
- from the bitstream which turn into a series of tokens ranging from
- 0..31. Each of these tokens specifies which action to take next. VP3
- defines 80 different 32-element histograms for VLC decoding:
- 16 histograms for DC token decoding
- 16 histograms for group 1 AC token decoding
- 16 histograms for group 2 AC token decoding
- 16 histograms for group 3 AC token decoding
- 16 histograms for group 4 AC token decoding
- The decoder fetches 4 bits from the bitstream that will be used to
- select a DC histogram and 4 bits that will be used to select 4 AC
- histograms, one for each AC group.
- The meaning of each of the 32 possible tokens follows. 'EB' stands for
- extra bits read from bitstream directly after the VLC token:
- 0, DCT_EOB_TOKEN
- set the current block to EOB, meaning that the block is marked as being
- fully unpacked
- 1, DCT_EOB_PAIR_TOKEN
- set the next 2 blocks to EOB
- 2. DCT_EOB_TRIPLE_TOKEN
- set the next 3 blocks to EOB
- 3, DCT_REPEAT_RUN_TOKEN
- set the next (2 EBs + 4) blocks to EOB
- 4, DCT_REPEAT_RUN2_TOKEN
- set the next (3 EBs + 8) blocks to EOB
- 5, DCT_REPEAT_RUN3_TOKEN
- set the next (4 EBs + 16) blocks to EOB
- 6, DCT_REPEAT_RUN4_TOKEN
- set the next (12 EBs) blocks to EOB
- 7, DCT_SHORT_ZRL_TOKEN
- skip (3 EBs + 1) positions in the output matrix
- 8, DCT_ZRL_TOKEN
- skip (6 EBs + 1) positions in the output matrix
- 9, ONE_TOKEN
- output 1 as coefficient
- 10, MINUS_ONE_TOKEN
- output -1 as coefficient
- 11, TWO_TOKEN
- output 2 as coefficient
- 12, MINUS_TWO_TOKEN
- output -2 as coefficient
- 13, 14, 15, 16, LOW_VAL_TOKENS
- next EB determines coefficient sign; coeff = DCT_VAL_CAT2_MIN (3) +
- (token - 13) (this gives a range of +/- 3..6)
- 17, DCT_VAL_CATEGORY3
- next EB determines coefficient sign; coeff = DCT_VAL_CAT3_MIN (7) + next
- EB (this gives a range of +/- 7..8)
- 18, DCT_VAL_CATEGORY4
- next EB determines coefficient sign; coeff = DCT_VAL_CAT4_MIN (9) + next
- 2 EBs (this gives a range of +/- 9..12)
- 19, DCT_VAL_CATEGORY5
- next EB determines coefficient sign; coeff = DCT_VAL_CAT5_MIN (13) +
- next 3 EBs (this gives a range of +/- 13..20)
- 20, DCT_VAL_CATEGORY6
- next EB determines coefficient sign; coeff = DCT_VAL_CAT6_MIN (21) +
- next 4 EBs (this gives a range of +/- 21..36)
- 21, DCT_VAL_CATEGORY7
- next EB determines coefficient sign; coeff = DCT_VAL_CAT7_MIN (37) +
- next 5 EBs (this gives a range of +/- 37..68)
- 22, DCT_VAL_CATEGORY8
- next EB determines coefficient sign; coeff = DCT_VAL_CAT8_MIN (69) +
- next 9 EBs (this gives a range of +/- 69..580)
- 23, 24, 25, 26, 27, DCT_RUN_CATEGORY1
- coefficient of +/- 1 preceded by a number of 0s; next EB determines sign
- of coefficient; skip (token - 22) 0s in the output matrix before
- placing the final coefficient (this gives a range of 1..5 0s)
- 28, DCT_RUN_CATEGORY1B
- coefficient of +/- 1 preceded by a number of 0s; next EB determines sign
- of coefficient; skip (next 2 EBs + 6) 0s in the output matrix before
- placing the final coefficient (this gives a range of 6..9 0s)
- 29, DCT_RUN_CATEGORY1C
- coefficient of +/- 1 preceded by a number of 0s; next EB determines sign
- of coefficient; skip (next 3 EBs + 10) 0s in the output matrix before
- placing the final coefficient (this gives a range of 10..17 0s)
- 30, DCT_RUN_CATEGORY2
- coefficient of +/- 2..3 preceded by a single zero; next EB determines
- sign of coefficient; coefficient = (next EB + 2)
- 31, DCT_RUN_CATEGORY2B (not specifically named in VP3 source)
- coefficient of +/- 2..3 preceded by 2 or 3 0s; next EB determines
- sign of coefficient; coefficient = (next EB + 2); skip (next EB + 2) 0s
- before placing coefficient in output matrix
- Note: EOB runs can, and often do, cross threshold stages and plane
- boundaries. For example, a decoder may have decoded all of the AC #2
- coefficients for all fragments and still have an EOB run of 2. That
- means that during the AC #3 decode process, the first 2 coded fragments
- that are not already EOB will be set to EOB.
- Let's work through a highly contrived example to illustrate the
- coefficient decoding process.
- [not finished]
- When the decoder is finished unpacking the DCT coefficients, the entire
- encoded VP3 frame bitstream should be consumed.
- Reversing The DC Prediction
- ---------------------------
- Now that all of the DCT coefficient data has been unpacked, the DC
- coefficients need to be fully reconstructed before the IDCT can be
- performed.
- VP3 uses a somewhat involved process for DC prediction which uses up to
- four DC coefficients from surrounding fragments. For each fragment to be
- transformed with the IDCT, the DC coefficient is predicted from weighted
- sum of the DC coefficients in the left (l), up-left (ul), up (u), and
- up-right (ur) fragments, if they are coded (not unchanged from the
- previous frame) in a compatible frame (current, previous, or golden).
- In a golden frame, the prediction is quite straightforward since all
- fragments will be coded. A fragment's DC prediction will fall into 1 of
- 5 groups:
- abbbbbbbbb
- cdddddddde
- cdddddddde
- cdddddddde
- cdddddddde
- * Group a is the top left corner fragment. There is nothing to predict
- from. This DC coefficient has a lot of energy and requires many bits to
- code.
- * Group b is the remainder of the top row of fragments. These fragments
- can only predict from the left fragment.
- * Group c is the left column of fragments, not including the top left
- fragment. These fragments have the top and top-right fragments from
- which to predict.
- * Group d is the main body of fragments. These fragments have access to
- all 4 predictors.
- * Group e is the right column of fragments, not including the top right
- fragment. These fragments can predict from the left, up-left and up
- fragments.
- The process of reversing prediction for interframes grows more complex.
- First, the decoder must evaluate which candidate fragments (l, ul, u, or
- ur) are available for as predictors. Then, it can only use fragments
- that are coded within the same frame (current, previous, or golden).
- Further, there are auxiliary predictors for each frame type that are
- initialized to 0 at the start of each video frame decode operation. The
- decoder falls back on these auxiliary predictors when it can not find
- any valid candidate predictors for the current fragment.
- To work through some examples, consider the following notation, e.g.:
- ul-C = up-left fragment, coded in the current frame
- u-P = up fragment, coded as a motion residual from the previous frame
- ur-C = up-right fragment, coded in the current frame
- l-G = left fragment, coded as a motion residual from the golden frame
- x-P = current fragment where DC prediction is being performed, coded
- as a motion residual from the previous frame
- This is a simple case:
- ul-C u-C ur-C
- l-C x-C
- The current fragment predicts from all four of the candidate fragments
- since they are coded in the same frame.
- ul-P u-C ur-C
- l-P x-P
- The current fragment predicts from the left and up-left fragments.
- ul-C u-P ur-G
- l-P x-G
- The current fragment predicts from the up-right fragment.
- ul-C u-C ur-C
- l-C x-G
- The current fragment does not predict from any of the candidate
- fragments since the current fragment is a motion residual from the
- golden frame. Rather, add the auxiliary golden frame predictor to the
- current fragment's DC coefficient. Save the new DC coefficient as the
- new golden frame auxiliary DC predictor.
- If the decoder only finds one valid candidate predictor, then it is used
- by itself. When the decoder finds multiple valid candidate fragments
- from which to predict DC, it applies a weighting function to the
- surrounding fragments' DC coefficients. The following table presents all
- 16 possible combinations of available/not available predictors and what
- to do in each case:
- ul u ur l
- -- -- -- --
- 0 0 0 0 no predictors available:
- use the last predictor saved for the frame type
- (either intra, inter, or golden)
- 0 0 0 1 left predictor available:
- pred = l.dc
- 0 0 1 0 up-right predictor available:
- pred = ur.dc
- 0 0 1 1 up-right, left predictors available:
- pred = (53 * ur.dc) + (75 * l.dc)
- --------------------------
- 128
- 0 1 0 0 up predictor available:
- pred = u.dc
- 0 1 0 1 up, left predictors available:
- pred = (u.dc + l.dc)
- -------------
- 2
- 0 1 1 0 up, up-right predictors available:
- discard up-right predictor
- pred = u.dc
- 0 1 1 1 up, up-right, left predictors available:
- discard up predictor
- pred = (53 * ur.dc) + (75 * l.dc)
- --------------------------
- 128
- 1 0 0 0 up-left predictor available:
- pred = ul.dc
- 1 0 0 1 up-left, left predictors available:
- discard up-left predictor
- pred = l.dc
- 1 0 1 0 up-left, up-right predictors available:
- pred = (ul.dc + ur.dc)
- ---------------
- 2
- 1 0 1 1 up-left, up-right, left predictors available:
- discard up-left predictor
- pred = (53 * ur.dc) + (75 * l.dc)
- --------------------------
- 128
- 1 1 0 0 up-left, up predictors available:
- discard up-left
- pred = u.dc
- 1 1 0 1 up-left, up, left predictors available:
- pred = (-26 * ul.dc + 29 * u.dc + 29 * l.dc)
- -------------------------------------
- 32
- 1 1 1 0 up-left, up, up-right predictors available:
- pred = (3 * ul.dc + 10 * u.dc + 3 * ur.dc)
- -----------------------------------
- 16
- 1 1 1 1 all 4 predictors available:
- discard up-right predictor
- pred = (-26 * ul.dc + 29 * u.dc + 29 * l.dc)
- -------------------------------------
- 32
- Note that this final prediction case ([ul u l]) risks outranging. The
- difference of the predicted DC is checked against u.dc, l.dc, and ul.dc,
- in that order, and if the difference is greater than 128 in any case,
- the predictor is assigned as that DC coefficient. In pseudocode:
- if (ABSOLUTE_VALUE(pred - u.dc) > 128)
- pref = u.dc
- else if (ABSOLUTE_VALUE(pred - l.dc) > 128)
- pref = l.dc
- else if (ABSOLUTE_VALUE(pred - ul.dc) > 128)
- pref = ul.dc
- The predicted value is, at long last, added to the fragment's decoded DC
- coefficient. Finally, the new DC coefficient is saved as the frame
- type's auxiliary predictor. For example, if this fragment is coded as a
- motion residual from the previous frame, save the fragment's DC
- coefficient as the previous frame auxiliary predictor.
- [still need to mention precise rounding considerations, a.k.a, the
- HIGHTBITDUPPED() macro]
- Reconstructing The Frame
- ------------------------
- rough outline:
- - foreach fragment:
- - if motion vector
- - copy motion fragment from appropriate frame into current frame
- (don't forget to account for unrestricted motion vectors)
- - dequantize fragment coefficients
- - run coefficients through inverse DCT
- - if INTRA coded fragment
- - output transformed coefficients
- - else
- - apply transformed residual to motion fragment
-
- [not finished]
- Theora Specification
- --------------------
- The Theora project leverages the VP3 codec into a new video coding
- system. The algorithm and bitstream format are the same as VP3 with a
- few minor differences:
- 1) The frame orientation is reversed-- VP3 is coded from bottom to top
- while Theora video is coded from top to bottom.
- [nope-- only true in the first few alpha releases; final Theora spec will
- be upside-down, the same as VP3]
- 2) Variable histograms-- VP3 uses a hardcoded set of histograms for DCT
- coefficient coding (described in section "Unpacking The DCT
- Coefficients"). Theora packs the histogram information in the header of
- the transport format (which is meant to be Ogg, but can probably be
- coerced into a variety of other multimedia container formats).
- 3) Variable quantization-- As with the histograms, Theora codes the
- quantization tables and quality thresholds (described in section
- "Initializing The Quantization Matrices") into the header.
- 4) [special VLRLC case for encoding unusually large runs of blocks;
- necessary for HD resolutions]
- [still need coding format of histogram and quantizer information]
- Appendix A: VP31 Quantization Matrices And Scale Factors
- --------------------------------------------------------
- The following quantization matrices and scale factor tables are hardcoded
- into the VP31 coding standard. These tables can vary according to the
- setup information transported along with a Theora file.
- Base quantization matrix for golden frame Y fragments (note that this
- is the same as JPEG):
- 16 11 10 16 24 40 51 61
- 12 12 14 19 26 58 60 55
- 14 13 16 24 40 57 69 56
- 14 17 22 29 51 87 80 62
- 18 22 37 58 68 109 103 77
- 24 35 55 64 81 104 113 92
- 49 64 78 87 103 121 120 101
- 72 92 95 98 112 100 103 99
- Base quantization matrix for golden frame C fragments (note that this
- is the same as JPEG):
-
- 17 18 24 47 99 99 99 99
- 18 21 26 66 99 99 99 99
- 24 26 56 99 99 99 99 99
- 47 66 99 99 99 99 99 99
- 99 99 99 99 99 99 99 99
- 99 99 99 99 99 99 99 99
- 99 99 99 99 99 99 99 99
- 99 99 99 99 99 99 99 99
- Base quantization matrix for interframe Y and C fragments:
- 16 16 16 20 24 28 32 40
- 16 16 20 24 28 32 40 48
- 16 20 24 28 32 40 48 64
- 20 24 28 32 40 48 64 64
- 24 28 32 40 48 64 64 64
- 28 32 40 48 64 64 64 96
- 32 40 48 64 64 64 96 128
- 40 48 64 64 64 96 128 128
- DC coefficient scale factor table:
- 220 200 190 180 170 170 160 160
- 150 150 140 140 130 130 120 120
- 110 110 100 100 90 90 90 80
- 80 80 70 70 70 60 60 60
- 60 50 50 50 50 40 40 40
- 40 40 30 30 30 30 30 30
- 30 20 20 20 20 20 20 20
- 20 10 10 10 10 10 10 10
- AC coefficient scale factor table:
- 500 450 400 370 340 310 285 265
- 245 225 210 195 185 180 170 160
- 150 145 135 130 125 115 110 107
- 100 96 93 89 85 82 75 74
- 70 68 64 60 57 56 52 50
- 49 45 44 43 40 38 37 35
- 33 32 30 29 28 25 24 22
- 21 19 18 17 15 13 12 10
- Appendix B: Macroblock Coding Mode Alphabets
- --------------------------------------------
- These are the 6 pre-defined alphabets used to decode macroblock coding
- mode information:
- Alphabet 1:
- index coding mode
- ----- -----------
- 0 MODE_INTER_LAST_MV
- 1 MODE_INTER_PRIOR_LAST
- 2 MODE_INTER_PLUS_MV
- 3 MODE_INTER_NO_MV
- 4 MODE_INTRA
- 5 MODE_USING_GOLDEN,
- 6 MODE_GOLDEN_MV
- 7 MODE_INTER_FOURMV
- Alphabet 2:
- index coding mode
- ----- -----------
- 0 MODE_INTER_LAST_MV
- 1 MODE_INTER_PRIOR_LAST
- 2 MODE_INTER_NO_MV
- 3 MODE_INTER_PLUS_MV
- 4 MODE_INTRA
- 5 MODE_USING_GOLDEN
- 6 MODE_GOLDEN_MV
- 7 MODE_INTER_FOURMV
- Alphabet 3:
- index coding mode
- ----- -----------
- 0 MODE_INTER_LAST_MV
- 1 MODE_INTER_PLUS_MV
- 2 MODE_INTER_PRIOR_LAST
- 3 MODE_INTER_NO_MV
- 4 MODE_INTRA
- 5 MODE_USING_GOLDEN
- 6 MODE_GOLDEN_MV
- 7 MODE_INTER_FOURMV
-
- Alphabet 4:
- index coding mode
- ----- -----------
- 0 MODE_INTER_LAST_MV
- 1 MODE_INTER_PLUS_MV
- 2 MODE_INTER_NO_MV
- 3 MODE_INTER_PRIOR_LAST
- 4 MODE_INTRA
- 5 MODE_USING_GOLDEN
- 6 MODE_GOLDEN_MV
- 7 MODE_INTER_FOURMV
- Alphabet 5:
- index coding mode
- ----- -----------
- 0 MODE_INTER_NO_MV
- 1 MODE_INTER_LAST_MV
- 2 MODE_INTER_PRIOR_LAST
- 3 MODE_INTER_PLUS_MV
- 4 MODE_INTRA
- 5 MODE_USING_GOLDEN
- 6 MODE_GOLDEN_MV
- 7 MODE_INTER_FOURMV
-
- Alphabet 6:
- index coding mode
- ----- -----------
- 0 MODE_INTER_NO_MV
- 1 MODE_USING_GOLDEN
- 2 MODE_INTER_LAST_MV
- 3 MODE_INTER_PRIOR_LAST
- 4 MODE_INTER_PLUS_MV
- 5 MODE_INTRA
- 6 MODE_GOLDEN_MV
- 7 MODE_INTER_FOURMV
- Appendix C: DCT Coefficient VLC Tables
- --------------------------------------
- - VP31 tables are hardcoded
- - Theora tables are transported with video stream
- [not finished]
- Appendix D: The VP3 IDCT
- ------------------------
- [not finished]
- Acknowledgements
- ----------------
- Thanks to Michael Niedermayer (michaelni at gmx dot at) for peer review,
- corrections, and recommendations for improvement.
- Dan Miller (dan at on2 dot com) for clarifications on pieces of the
- format.
- Timothy B. Terriberry (tterribe at vt dot edu) for clarification about the
- differences between VP3 and Theora, detailed explanation of motion
- vector mechanics.
- References
- ----------
- Tables necessary for decoding VP3:
- http://mplayerhq.hu/cgi-bin/cvsweb.cgi/~checkout~/ffmpeg/libavcodec/vp3data.h?content-type=text/x-cvsweb-markup&cvsroot=FFMpeg
- Official VP3 site:
- http://www.vp3.com/
- Theora, based on VP3:
- http://www.theora.org/
- On2, creators of the VP3 format:
- http://www.on2.com/
- ChangeLog
- ---------
- v0.5: December 8, 2004
- - reworked section "Reversing The DC Prediction" to include a tabular
- representation of all 16 prediction modes
- v0.4: March 2, 2004
- - renamed and expanded section "Initializing The Quantization Matrices"
- - outlined section "Reconstructing The Frame"
- - moved Theora Differences Appendix to its own section entitled "Theora
- Specification"
- - added Appendix: Quantization Matrices And Scale Factors
- - added Appendix: DCT Coefficient VLC Tables
- v0.3: February 29, 2004
- - expanded section "Unpacking The Macroblock Coding Mode Information"
- - expanded section "Unpacking The Macroblock Motion Vectors"
- - added Appendix: Macroblock Coding Mode Alphabets
- v0.2: October 9, 2003
- - expanded section "Reversing the DC Prediction"
- - added Appendix: Theora Differences
- v0.1: June 17, 2003
- - initial release, nowhere near complete
|