<HTML><HEAD><TITLE>xiph.org: Ogg Vorbis documentation</TITLE>
<BODY bgcolor="#ffffff" text="#202020" link="#006666" vlink="#000000">
<nobr><img src="white-ogg.png"><img src="vorbisword2.png"></nobr><p>
<h1><font color=#000070>
Stereo Channel Coupling in the Vorbis CODEC
</font></h1>
<em>Last update to this document: June 27, 2001</em><br>
<h2>Abstract</h2> The Vorbis audio CODEC provides channel coupling
mechanisms designed to reduce effective bitrate by both eliminating
interchannel redundancy and eliminating stereo image information
labeled inaudible or undesirable according to spatial psychoacoustic
models. This document describes both the mechanical coupling
mechanisms available within the Vorbis specification and the
specific stereo coupling models used by the reference
<tt>libvorbis</tt> CODEC provided by xiph.org.
<h2>Terminology</h2> Terminology as used in this document is based on
common terminology associated with contemporary CODECs such as MPEG-1
Audio Layer 3 (mp3). However, some differences in terminology are
useful in the context of Vorbis, as Vorbis functions somewhat
differently than most current formats. For clarity, a few terms are
defined beforehand here, and others will be defined where they first
appear in context.<p>
<h3>Subjective and Objective</h3>
<em>Objective</em> fidelity is a measure, based on a computable,
mechanical metric, of how carefully an output matches an input. For
example, a stereo amplifier may claim to introduce less than 0.01%
total harmonic distortion when amplifying an input signal; this claim
is easy to verify given proper equipment, and any number of testers are
likely to arrive at exactly the same results. One need not listen to
the equipment to make this measurement.<p>
However, given two amplifiers with identical, verifiable objective
specifications, listeners may strongly prefer the sound quality of one
over the other. This is actually the case in the decades-old debate
[some would say jihad] among audiophiles involving vacuum tube versus
solid state amplifiers. There are people who can tell the difference,
and strongly prefer one over the other despite seemingly identical,
measurable quality. This preference is <em>subjective</em> and
difficult to measure but nonetheless real.<p>
Individual elements of subjective differences often can be quantified,
but overall subjective quality generally is not measurable. Different
observers are likely to disagree on the exact results of a subjective
test, as each observer's perspective differs. When measuring
subjective qualities, the best one can hope for is average, empirical
results that show statistical significance across a group.<p>
Perceptual codecs are most concerned with subjective, not objective,
quality. This is why evaluating a perceptual codec via distortion
measures and sonograms alone is useless; these objective measures may
provide insight into the quality or functioning of a codec, but cannot
answer the much squishier subjective question, "Does it sound
good?". The tube amplifier example is perhaps not the best, as very few
people can hear, or care to hear, the minute differences between tubes
and transistors, whereas the subjective differences in perceptual
codecs tend to be quite large even when objective differences are
not.<p>
<h3>Fidelity, Artifacts and Differences</h3> Audio <em>artifacts</em>
and loss of fidelity (or, more simply put, audio <em>differences</em>)
are not the same thing.<p>
A loss of fidelity implies differences between the perceived input and
output signal; it does not necessarily imply that the differences in
output are displeasing or that the output sounds poor (although this
is often the case). Tube amplifiers are <em>not</em> higher fidelity
than modern solid state and digital systems. They simply produce a
form of distortion and coloring that is either unnoticeable or actually
pleasing to many ears.<p>
As compared to an original signal using hard metrics, all perceptual
codecs [ASPEC, ATRAC, MP3, WMA, AAC, TwinVQ, AC3 and Vorbis included]
lose objective fidelity in order to reduce bitrate. This is fact. The
idea is to lose fidelity in ways that cannot be perceived. However,
most current streaming applications demand bitrates lower than what
can be achieved by sacrificing only objective fidelity; this is also
fact, despite whatever various company press releases might claim.
Subjective fidelity eventually must suffer in one way or another.<p>
The goal is to choose the best possible tradeoff such that the
fidelity loss is graceful and not obviously noticeable. Most listeners
of FM radio do not realize how much lower fidelity that medium is as
compared to compact discs or DAT. However, when compared directly to
source material, the difference is obvious. A cassette tape is lower
fidelity still, and yet the degradation, relatively speaking, is
graceful and generally easy to overlook. Compare this graceful loss
of quality to an average 44.1kHz stereo mp3 encoded at 80 or 96kbps.
The mp3 might actually be higher objective fidelity but subjectively
sounds much worse.<p>
Thus, when a CODEC <em>must</em> sacrifice subjective quality in order
to satisfy a user's requirements, the result should be a
<em>difference</em> that is generally either difficult to notice
without comparison, or easy to ignore. An <em>artifact</em>, on the
other hand, is an element introduced into the output that is
immediately noticeable, obviously foreign, and undesired. The famous
'underwater' or 'twinkling' effect synonymous with low bitrate (or
poorly encoded) mp3 is an example of an <em>artifact</em>. This
working definition differs slightly from common usage, but the coined
distinction between differences and artifacts is useful for our
discussion.<p>
The goal, when it is absolutely necessary to sacrifice subjective
fidelity, is obviously to strive for differences and not artifacts.
The vast majority of CODECs today fail at this task miserably,
predictably, and regularly in one way or another. Avoiding such
failures when it is necessary to sacrifice subjective quality is a
fundamental design objective of Vorbis, and that objective is reflected
in Vorbis's channel coupling design.<p>
<h2>Mechanisms</h2>
In encoder release beta 4 and earlier, Vorbis supported multiple
channel encoding, but the channels were encoded entirely separately
with no cross-analysis or redundancy elimination between channels.
This multichannel strategy is very similar to mp3's <em>dual
stereo</em> mode, and Vorbis uses the same name for its analogous
uncoupled multichannel modes.<p>
However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
later implement, a coupled channel strategy. Vorbis has two specific
mechanisms that may be used alone or in conjunction to implement
channel coupling. The first is <em>channel interleaving</em> via
residue backend #2, and the second is <em>square polar mapping</em>.
These two general mechanisms are particularly well suited to coupling
due to the structure of Vorbis encoding, as we'll explore below. Using
both, we can implement totally <em>lossless stereo image
coupling</em>, as well as various lossy models that seek to eliminate
inaudible or unimportant aspects of the stereo image in order to
reduce bitrate. The exact coupling implementation is generalized to
allow the encoder a great deal of flexibility in implementation of a
stereo model without requiring any significant complexity increase
over the combinatorially simpler mid/side joint stereo of mp3 and
other current audio codecs.<p>
Channel interleaving may be applied directly to more than a single
channel, and polar mapping is hierarchical such that polar coupling may be
extrapolated to an arbitrary number of channels and is not restricted
to only stereo, quadraphonics, ambisonics or 5.1 surround. However,
the scope of this document restricts itself to the stereo coupling
case.<p>
<h3>Square Polar Mapping</h3>
<h4>maximal correlation</h4>
Recall that, in the basic structure of a Vorbis I stream, the encoder
first generates from the input audio a spectral 'floor' function that
serves as an MDCT-domain whitening filter. This floor is meant to
represent the rough envelope of the frequency spectrum, using whatever
metric the encoder cares to define. This floor is subtracted from the log
frequency spectrum, effectively normalizing the spectrum by frequency.
Each input channel is associated with a unique floor function.<p>
The basic idea behind any stereo coupling is that the left and right
channels usually correlate. This correlation is even stronger if one
first accounts for energy differences in any given frequency band
across left and right; think for example of individual instruments
mixed into different portions of the stereo image, or a stereo
recording with a dominant feature not perfectly in the center. The
floor functions, each specific to a channel, provide the perfect means
of normalizing left and right energies across the spectrum to maximize
correlation before coupling. This feature of the Vorbis format is not
a convenient accident.<p>
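The normalization described above can be illustrated with a minimal sketch. It assumes the spectrum and floor are already expressed in dB, so that subtracting the floor is a plain per-bin subtraction; the function name <tt>subtract_floor</tt> and the array layout are hypothetical and are not part of libvorbis.<p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not libvorbis API: subtract a channel's floor
   curve from its log (dB) spectrum, leaving the whitened residue. */
void subtract_floor(const double *spectrum_db, const double *floor_db,
                    double *residue_db, size_t n)
{
    assert(spectrum_db != NULL && floor_db != NULL && residue_db != NULL);
    for (size_t i = 0; i < n; i++)
        residue_db[i] = spectrum_db[i] - floor_db[i];
}
```

If the right channel carries the same material as the left but 6 dB quieter, and each channel's floor tracks its own envelope, the two residues come out identical, which is the correlation that coupling then exploits.<p>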
Because we strive to maximally correlate the left and right channels
and generally succeed in doing so, left and right residue is typically
nearly identical. We could use channel interleaving (discussed below)
alone to efficiently remove the redundancy between the left and right
channels as a side effect of entropy encoding, but a polar
representation gives benefits when left/right correlation is
strong.<p>
<h4>point and diffuse imaging</h4>
The first advantage of a polar representation is that it effectively
separates the spatial audio information into a 'point image'
(magnitude) at a given frequency and located somewhere in the sound
field, and a 'diffuse image' (angle) that fills a large amount of
space simultaneously. Even if we preserve only the magnitude (point)
data, a detailed and carefully chosen floor function in each channel
provides us with a free, fine-grained, frequency-relative intensity
stereo*. Angle information represents diffuse sound fields, such as
reverberation that fills the entire space simultaneously.<p>
*<em>Because the Vorbis model supports a number of different possible
stereo models and these models may be mixed, we do not use the term
'intensity stereo' when talking about Vorbis; instead we use the terms
'point stereo', 'phase stereo' and subcategories of each.</em><p>
The majority of a stereo image is representable by polar magnitude
alone, as strong sounds tend to be produced at near-point sources;
even non-diffuse, fast, sharp echoes track very accurately using
magnitude representation almost alone (for those experimenting with
Vorbis tuning, this strategy works much better with the precise,
piecewise control of floor 1; the continuous approximation of floor 0
results in unstable imaging). Reverberation and diffuse sounds tend
to contain less energy and be psychoacoustically dominated by the
point sources embedded in them. Thus, we again tend to concentrate
more represented energy into a predictably smaller number of values.
Separating representation of point and diffuse imaging also allows us
to model and manipulate point and diffuse qualities separately.<p>
<h4>controlling bit leakage and symbol crosstalk</h4> Because polar
representation concentrates represented energy into fewer large
values, we reduce bit 'leakage' during cascading (multistage VQ
encoding) as a secondary benefit. A single large, monolithic VQ
codebook is more efficient than a cascaded book due to entropy
'crosstalk' among symbols between different stages of a multistage cascade.
Polar representation is a way of further concentrating entropy into
predictable locations so that codebook design can take steps to
improve multistage codebook efficiency. It also allows us to cascade
various elements of the stereo image independently.<p>
<h4>eliminating trigonometry and rounding</h4>
Rounding and computational complexity are potential problems with a
polar representation. As our encoding process involves quantization,
mixing a polar representation and quantization makes it potentially
impossible, depending on implementation, to construct a coupled stereo
mechanism that results in bit-identical decompressed output compared
to an uncoupled encoding, should the encoder desire it.<p>
Vorbis uses a mapping that preserves the most useful qualities of
polar representation, relies only on addition/subtraction, and makes
it trivial before or after quantization to represent an
angle/magnitude through a one-to-one mapping from possible left/right
value permutations. We do this by basing our polar representation on
the unit square rather than the unit circle.<p>
Given a magnitude and angle, we recover left and right using the
following function (note that A/B may be left/right or right/left
depending on the coupling definition used by the encoder):<p>
<pre>
  if(magnitude>0){
    /* positive magnitude */
    if(angle>0){
      A=magnitude;
      B=magnitude-angle;
    }else{
      B=magnitude;
      A=magnitude+angle;
    }
  }else{
    /* zero or negative magnitude: the sign of the angle term flips */
    if(angle>0){
      A=magnitude;
      B=magnitude+angle;
    }else{
      B=magnitude;
      A=magnitude-angle;
    }
  }
</pre>
The function is antisymmetric for positive and negative magnitudes in
order to eliminate a redundant value when quantizing. For example, if
we're quantizing to integer values, we can visualize a magnitude of 5
and an angle of -2 as follows:<p>
<img src="squarepolar.png">
<p>
This representation loses or replicates no values; if A and B each
range over the integers -5 through 5, the number of possible Cartesian
permutations is 121. Represented in square polar notation, the
possible values are:
<pre>
 0, 0
-1,-2  -1,-1  -1, 0  -1, 1
 1,-2   1,-1   1, 0   1, 1
-2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3
 2,-4   2,-3   ... following the pattern ...
 ...    5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9
</pre>
...for a grand total of 121 possible values, the same number as in
Cartesian representation (note that, for example, <tt>5,-10</tt> is
the same as <tt>-5,10</tt>, so there's no reason to represent
both. <tt>2,10</tt> cannot happen, and there's no reason to account for it.)
It's also obvious that this mapping is exactly reversible.<p>
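The reversibility claim can be checked mechanically. The sketch below restates the decode function from the text and pairs it with a forward (encode) mapping. The forward mapping is an assumption made for illustration and is not given in this document: it takes whichever of A and B has the larger absolute value as the magnitude, with ties going to B. All function names are hypothetical.<p>

```c
#include <assert.h>
#include <stdlib.h>

/* Decode: the mapping given in the text, with A and B as outputs. */
void square_polar_decode(int magnitude, int angle, int *A, int *B)
{
    if (magnitude > 0) {
        if (angle > 0) { *A = magnitude; *B = magnitude - angle; }
        else           { *B = magnitude; *A = magnitude + angle; }
    } else {
        if (angle > 0) { *A = magnitude; *B = magnitude + angle; }
        else           { *B = magnitude; *A = magnitude - angle; }
    }
}

/* Encode: an assumed inverse, not taken from the text. The value with
   the larger absolute value becomes the magnitude; ties go to B. */
void square_polar_encode(int A, int B, int *magnitude, int *angle)
{
    assert(magnitude != NULL && angle != NULL);
    if (abs(A) > abs(B)) {
        *magnitude = A;
        *angle = (A > 0) ? A - B : B - A;
    } else {
        *magnitude = B;
        *angle = (B > 0) ? A - B : B - A;
    }
}

/* Count the A/B pairs in [-range, range] that survive encode -> decode. */
int count_round_trips(int range)
{
    int ok = 0;
    for (int A = -range; A <= range; A++) {
        for (int B = -range; B <= range; B++) {
            int m, g, A2, B2;
            square_polar_encode(A, B, &m, &g);
            square_polar_decode(m, g, &A2, &B2);
            if (A2 == A && B2 == B)
                ok++;
        }
    }
    return ok;
}
```

With this choice of encoder, <tt>count_round_trips(5)</tt> covers all 121 Cartesian pairs, and the magnitude 5, angle -2 example decodes to A=3, B=5.<p>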
<h3>Channel interleaving</h3>
We can remap an A/B vector using polar mapping into a magnitude/angle
vector, and it's clear that, in general, this concentrates energy in
the magnitude vector and reduces the amount of information to encode
in the angle vector. Encoding these vectors independently with
residue backend #0 or residue backend #1 will result in substantial
bitrate savings. However, there are still implicit correlations
between the magnitude and angle vectors. The most obvious is that the
amplitude of the angle is bounded by its corresponding magnitude
value.<p>
Entropy coding the results, then, further benefits from the entropy
model being able to compress magnitude and angle simultaneously. For
this reason, Vorbis implements residue backend #2, which preinterleaves
a number of input vectors (in the stereo case, two, A and B) into a
single output vector (with the elements in the order of
A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
each vector to be coded by the vector quantization backend consists of
matching magnitude and angle values.<p>
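The interleaving order described above can be sketched as follows. This is only an illustration of the element ordering, not the residue backend itself; the function name <tt>interleave</tt> and its signature are hypothetical.<p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: interleave `channels` input vectors of length n
   into out[], in the order A_0, B_0, A_1, B_1, ... A_n-1, B_n-1.
   out[] must have room for channels * n elements. */
void interleave(const int *const *vectors, size_t channels, size_t n,
                int *out)
{
    assert(vectors != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t ch = 0; ch < channels; ch++)
            out[i * channels + ch] = vectors[ch][i];
}
```

In the stereo case the two input vectors would be the magnitude and angle vectors, so each adjacent pair in the output holds one magnitude and its matching angle.<p>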
The astute reader, at this point, will notice that in the theoretical
case in which we can use monolithic codebooks of arbitrarily large
size, we can directly interleave and encode left and right without
polar mapping; in that case, the polar mapping does not appear to lend any
benefit whatsoever to the efficiency of the entropy coding. In fact,
it is perfectly possible and reasonable to build a Vorbis encoder that
dispenses with polar mapping entirely and merely interleaves the
channels. Libvorbis-based encoders may configure such an encoding and
it will work as intended.<p>
However, when we leave the ideal/theoretical domain, we notice that
polar mapping does give additional practical benefits, as discussed in
the above section on polar mapping and summarized again here:<p>
<ul>
<li>Polar mapping aids in controlling entropy 'leakage' between stages
of a cascaded codebook.
<li>Polar mapping separates the stereo image
into point and diffuse components which may be analyzed and handled
differently.
</ul>
<h2>Stereo Models</h2>
<h3>Dual Stereo</h3>
Dual stereo refers to stereo encoding where the channels are entirely
separate; they are analyzed and encoded as entirely distinct entities.
This terminology is familiar from mp3.<p>
<h3>Lossless Stereo</h3>
Using polar mapping and/or channel interleaving, it's possible to
couple Vorbis channels losslessly, that is, construct a stereo
coupling encoding that both saves space and decodes
bit-identically to dual stereo. OggEnc 1.0 and later offer this
mode.<p>
Overall, this stereo mode is overkill; however, it offers a safe
alternative to users concerned about the slightest possible
degradation to the stereo image or archival quality audio.<p>
<h3>Phase Stereo</h3>
Phase stereo is the least aggressive means of gracefully dropping
resolution from the stereo image; it affects only diffuse imaging.<p>
It's often quoted that the human ear is nearly entirely deaf to signal
phase above about 4kHz; this is nearly true and a passable rule of
thumb, but it can be demonstrated that even an average user can tell
the difference between high frequency in-phase and out-of-phase noise.
Obviously then, the statement is not entirely true. However, it's
also the case that one must resort to nearly such an extreme
demonstration before finding the counterexample.<p>
'Phase stereo' is simply a more aggressive quantization of the polar
angle vector; above 4kHz it's generally quite safe to quantize noise
and noisy elements to only a handful of allowed phases. The phases of
high amplitude pure tones may or may not be preserved more carefully
(they are relatively rare and L/R tend to be in phase, so there is
generally little reason not to spend a few more bits on them).<p>
<h4>eight phase stereo</h4>
Vorbis implements phase stereo coupling by preserving the entirety of
the magnitude vector (essential to fine amplitude and energy resolution
overall) and quantizing the angle vector to one of only four possible
values. Given that the magnitude vector may be positive or negative,
this results in left and right phase having eight possible
permutations, thus 'eight phase stereo':<p>
<img src="eightphase.png"><p>
Left and right may be in phase (positive or negative), the most common
case by far, or out of phase by 90 or 180 degrees.<p>
<h4>four phase stereo</h4>
Four phase stereo takes the quantization one step further; it allows
only in-phase and 180 degree out-of-phase signals:<p>
<img src="fourphase.png"><p>
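As a rough sketch of the two quantizations above, the code below snaps a square polar angle to the nearest allowed phase. The allowed angle values are an assumption inferred from the square polar mapping, not something this document states: 0 for in phase, plus or minus the absolute magnitude for 90 degrees (one channel zero), and twice the absolute magnitude for 180 degrees. The names <tt>quantize_phase</tt>, <tt>PHASE_EIGHT</tt> and <tt>PHASE_FOUR</tt> are hypothetical, and the nearest-value rule is only one possible choice.<p>

```c
#include <assert.h>
#include <stdlib.h>

enum phase_model { PHASE_EIGHT, PHASE_FOUR };

/* Snap *angle to the nearest allowed phase for the given model.
   Assumed allowed angles (not taken from the text):
     eight phase: 0, +|m|, -|m|, and 2|m| in either sign
     four phase:  0, and 2|m| in either sign
   A result of +2|m| is rewritten as (-magnitude, -2|m|), which decodes
   to the same A/B pair and stays inside the non-redundant range. */
void quantize_phase(enum phase_model model, int *magnitude, int *angle)
{
    assert(magnitude != NULL && angle != NULL);
    int m = abs(*magnitude);
    /* Candidates in preference order; ties keep the earlier entry. */
    int eight[5] = {0, m, -m, 2 * m, -2 * m};
    int four[3] = {0, 2 * m, -2 * m};
    const int *candidates = (model == PHASE_EIGHT) ? eight : four;
    int count = (model == PHASE_EIGHT) ? 5 : 3;

    int best = candidates[0];
    for (int i = 1; i < count; i++) {
        if (abs(*angle - candidates[i]) < abs(*angle - best))
            best = candidates[i];
    }
    if (m != 0 && best == 2 * m) {
        *magnitude = -*magnitude;
        best = -2 * m;
    }
    *angle = best;
}
```

For example, under these assumptions a magnitude of 5 with angle -4 becomes angle -5 (one channel silent) in eight phase mode, but angle 0 (in phase) in four phase mode; the magnitude is left untouched unless the 180 degree case needs the sign flip.<p>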
<h3>Point Stereo</h3>
Point stereo eliminates the possibility of out-of-phase signal
entirely. Any diffuse quality to a sound source tends to collapse
inward to a point somewhere within the stereo image. A practical
example would be balanced reverberations within a large, live space;
normally the sound is diffuse and soft, giving a sonic impression of
volume. In point stereo, the reverberations would still exist, but
sound fairly firmly centered within the image (assuming the
reverberation was centered overall; if the reverberation is stronger
to the left, then the point of localization in point stereo would be
to the left). This effect is most noticeable at low and mid
frequencies and using headphones (which grant perfect stereo
separation). Point stereo is a graceful but generally easy to
detect degradation to the sound quality and is thus used in frequency
ranges where it is least noticeable.<p>
<h3>Mixed Stereo</h3>
Mixed stereo is the simultaneous use of more than one of the above
stereo encoding models, generally using more aggressive modes in
higher frequencies, lower amplitudes or 'nearly' in-phase sound.<p>
It is also the case that near-DC frequencies should be encoded using
lossless coupling to avoid frame blocking artifacts.<p>
<h3>Vorbis Stereo Modes</h3>
Vorbis, for the most part, uses lossless stereo and a number of mixed
modes constructed out of the above models. As of the current pre-1.0
testing version of the encoder, oggenc supports the following modes.
Oggenc's default choice varies by bitrate and each mode is selectable
by the user:<p>
<dl>
<dt>dual stereo
<dd>uncoupled stereo encoding<p>
<dt>lossless stereo
<dd>lossless stereo coupling; produces exactly equivalent output to dual stereo<p>
<dt>eight phase stereo
<dd>a mixed mode combining lossless stereo for frequencies to approximately 4kHz (and for all strong pure tones) and eight phase stereo above<p>
<dt>aggressive eight phase stereo
<dd>a mixed mode combining lossless stereo for frequencies to approximately 2kHz (and for all strong pure tones) and eight phase stereo above<p>
<dt>eight/four phase stereo
<dd>a mixed mode combining lossless stereo
for bass, eight phase stereo for noisy content and lossless stereo for
tones to approximately 4kHz, and four phase stereo above 4kHz<p>
<dt>eight phase/point stereo
<dd>a mixed mode combining lossless stereo
for bass, eight phase stereo for noisy content and lossless stereo for
tones to approximately 4kHz, and point stereo above 4kHz<p>
<dt>aggressive eight phase/point stereo
<dd>a mixed mode combining lossless stereo
for bass, eight phase stereo to approximately 2kHz and point stereo above 2kHz<p>
<dt>point stereo
<dd>a mixed mode combining lossless stereo to approximately 4kHz and point stereo above 4kHz<p>
<dt>aggressive point stereo
<dd>a mixed mode combining lossless stereo to approximately 1-2kHz and point stereo above<p>
</dl>
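The simpler two-band modes in the list above can be pictured as a crossover frequency plus a model on each side of it. The sketch below is only an illustration under that assumption: the struct, enum and function names are hypothetical, they do not correspond to oggenc options or libvorbis structures, and the special handling of bass, noisy content and strong pure tones described for some modes is deliberately left out.<p>

```c
#include <assert.h>
#include <stddef.h>

enum stereo_model {
    MODEL_LOSSLESS,
    MODEL_EIGHT_PHASE,
    MODEL_FOUR_PHASE,
    MODEL_POINT
};

/* Hypothetical description of a two-band mixed mode. */
struct mixed_mode {
    double crossover_hz;      /* approximate boundary between the bands */
    enum stereo_model below;  /* model used below the crossover */
    enum stereo_model above;  /* model used at and above the crossover */
};

/* Pick the stereo model for a spectral line centered at `hz`. */
enum stereo_model model_for_frequency(const struct mixed_mode *mode,
                                      double hz)
{
    assert(mode != NULL && hz >= 0.0);
    return (hz < mode->crossover_hz) ? mode->below : mode->above;
}
```

Under this picture, 'point stereo' would be <tt>{4000.0, MODEL_LOSSLESS, MODEL_POINT}</tt> and 'aggressive eight phase stereo' would be <tt>{2000.0, MODEL_LOSSLESS, MODEL_EIGHT_PHASE}</tt>; the crossover values are the approximate figures from the list, not exact encoder settings.<p>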
<hr>
<a href="http://www.xiph.org/">
<img src="white-xifish.png" align=left border=0>
</a>
<font size=-2 color=#505050>
Ogg is a <a href="http://www.xiph.org">Xiphophorus</a> effort to
protect essential tenets of Internet multimedia from corporate
hostage-taking; Open Source is the net's greatest tool to keep
everyone honest. See <a href="http://www.xiph.org/about.html">About
Xiphophorus</a> for details.
<p>
Ogg Vorbis is the first Ogg audio CODEC. Anyone may
freely use and distribute the Ogg and Vorbis specification,
whether in a private, public or corporate capacity. However,
Xiphophorus and the Ogg project (xiph.org) reserve the right to set
the Ogg/Vorbis specification and certify specification compliance.<p>
Xiphophorus's Vorbis software CODEC implementation is distributed
under a BSD-like License. This does not restrict third parties from
distributing independent implementations of Vorbis software under
other licenses.<p>
OggSquish, Vorbis, Xiphophorus and their logos are trademarks (tm) of
<a href="http://www.xiph.org/">Xiphophorus</a>. These pages are
copyright (C) 1994-2001 Xiphophorus. All rights reserved.<p>
</body>