
Filename: 288-privcount-with-shamir.txt
Title: Privacy-Preserving Statistics with Privcount in Tor (Shamir version)
Author: Nick Mathewson, Tim Wilson-Brown, Aaron Johnson
Created: 1-Dec-2017
Supersedes: 280
Status: Accepted
0. Acknowledgments

   Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented
   the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended
   PrivEx's differential privacy guarantees to multiple counters in
   PrivCount:

   https://github.com/privcount/privcount/blob/master/README.markdown#research-background

   Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental
   PrivCount code, based on the PrivEx secret-sharing variant. This
   implementation includes contributions from the PrivEx authors, and
   others:

   https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown

   This research was supported in part by NSF grants CNS-1111539,
   CNS-1314637, CNS-1526306, CNS-1619454, and CNS-1640548.

   The use of a Shamir secret-sharing-based approach is due to a
   suggestion by Aaron Johnson (iirc); Carolin Zöbelein did some helpful
   analysis here.

   Aaron Johnson and Tim Wilson-Brown made improvements to the draft
   proposal.
1. Introduction and scope

   PrivCount is a privacy-preserving way to collect aggregate statistics
   about the Tor network without exposing the statistics from any single
   Tor relay.

   This document describes the behavior of the in-Tor portion of the
   PrivCount system. It DOES NOT describe the counter configurations,
   or any other parts of the system. (These will be covered in separate
   proposals.)
2. PrivCount overview

   Here follows an oversimplified summary of PrivCount, with enough
   information to explain the Tor side of things. The actual operation
   of the non-Tor components is trickier than described below.

   In PrivCount, a Data Collector (DC, in this case a Tor relay) shares
   numeric data with N different Tally Reporters (TRs). (A Tally
   Reporter performs the summing and unblinding roles of the Tally
   Server and Share Keeper from experimental PrivCount.)

   All N Tally Reporters together can reconstruct the original data, but
   no (N-1)-sized subset of the Tally Reporters can learn anything about
   the data.

   (In reality, the Tally Reporters don't reconstruct the original data
   at all! Instead, they will reconstruct a _sum_ of the original data
   across all participating relays.)
   In brief, the system works as follows:

   To share data, for each counter value V to be shared, the Data
   Collector first adds Gaussian noise to V in order to produce V', uses
   (K,N) Shamir secret-sharing to generate N shares of V' (K<=N, K being
   the reconstruction threshold), encrypts each share to a different
   Tally Reporter, and sends each encrypted share to the Tally Reporter
   it is encrypted for.

   The Tally Reporters then agree on the set S of Data Collectors that
   sent data to all of them, and each Tally Reporter forms a share of
   the aggregate value by decrypting the shares it received from the
   Data Collectors in S and adding them together. The Tally Reporters
   then, collectively, perform secret reconstruction, thereby learning
   the sum of all the different values V'.
   The use of Shamir secret sharing lets us survive up to N-K crashing
   TRs. Waiting until the end to agree on a set S of surviving relays
   lets us survive an arbitrary number of crashing DCs. In order to
   prevent bogus data from corrupting the tally, the Tally Reporters can
   perform the aggregation step multiple times, each time proceeding
   with a different subset of S and taking the median of the resulting
   values.

   Relay subsets should be chosen at random to avoid relays manipulating
   their subset membership(s). If a shared random value is required, all
   relays must submit their results, and then the next revealed shared
   random value can be used to select relay subsets. (Tor's shared
   random value can be calculated as soon as all commits have been
   revealed. So all relay results must be received *before* any votes
   are cast in the reveal phase for that shared random value.)
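   To illustrate the median-over-subsets idea (a simplified plaintext
   sketch: in the real system the aggregation runs over shares, and all
   names here are ours, not part of this proposal):

```python
import random
import statistics

P = 2**62 - 2**30 - 1  # the suggested field modulus (see section 3)

def robust_aggregate(dc_set, aggregate, n_subsets, subset_size, rng):
    """Aggregate over several random subsets of the surviving DC set S
    and take the median, limiting the influence of bogus data from any
    single Data Collector."""
    dcs = sorted(dc_set)
    results = []
    for _ in range(n_subsets):
        subset = rng.sample(dcs, subset_size)   # random subset of S
        results.append(aggregate(subset))
    return statistics.median(results)

# Example: sum a per-DC value, with one corrupt DC reporting garbage.
values = {"dc%d" % i: 10 for i in range(9)}
values["dc9"] = 10**6                            # bogus report
total = robust_aggregate(values.keys(),
                         lambda s: sum(values[d] for d in s),
                         n_subsets=5, subset_size=7,
                         rng=random.Random(1))
```

   Subsets that happen to exclude the corrupt DC report the honest sum,
   so the median is far more robust than a single all-inclusive sum.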
   Below we describe the algorithm in more detail, and describe the data
   format to use.
3. The algorithm

   All values below are B-bit integers modulo some prime P; we suggest
   B=62 and P = 2**62 - 2**30 - 1 (hex 0x3fffffffbfffffff). The size of
   this field is an upper limit on the largest sum we can calculate; it
   is not a security parameter.

   There are N Tally Reporters: every participating relay must agree on
   which N exist, and on their current public keys. We suggest listing
   them in the consensus networkstatus document. All parties must also
   agree on some ordering of the Tally Reporters. Similarly, all parties
   must also agree on some value K<=N.

   There are a number of well-known "counters", identified by ASCII
   identifiers. Each counter is a value that the participating relays
   will know how to count. Let C be the number of counters.
3.1. Data Collector (DC) side

   At the start of each period, every Data Collector ("client" below)
   initializes their state as follows:

      1. For every Tally Reporter with index i, the client constructs a
         random 32-byte value SEED_i. The client then generates a
         pseudorandom bitstream using the SHAKE-256 XOF with SEED_i as
         its input, and divides this stream into C values, with the c'th
         value denoted by MASK(i, c).

         [To divide the stream into values, consider the stream 8 bytes
         at a time as unsigned integers in network (big-endian) order.
         For each such integer, clear the top (64-B) bits. If the result
         is less than P, then include the integer as one of the
         MASK(i, .) values. Otherwise, discard this 8-byte segment and
         proceed to the next value.]
      2. The client encrypts SEED_i using the public key of Tally
         Reporter i, and remembers this encrypted value. It discards
         SEED_i.

      3. For every counter c, the client generates a noise value Z_c
         from an appropriate Gaussian distribution. If the noise value
         is negative, the client adds P to bring Z_c into the range
         0...(P-1).

         (The noise MUST be sampled using the procedure in Appendix C.)

         The client then uses Shamir secret sharing to generate N shares
         (x,y) of Z_c, 1 <= x <= N, with the x'th share to be used by
         the x'th Tally Reporter. See Appendix A for more on Shamir
         secret sharing. See Appendix B for another idea about X
         coordinates.
         The client picks a random value CTR_c and stores it in the
         counter, which serves to locally blind the counter.

         The client then subtracts (MASK(x, c) + CTR_c) from each y,
         giving "encrypted shares" of the form (x, y0), where
         y0 = y - CTR_c - MASK(x, c).

         The client then discards all MASK values, all original shares
         (x, y), the standalone CTR values, and the noise value Z_c. For
         each counter c, it keeps only the blinded counter itself (which
         holds CTR_c) and the N encrypted shares of the form (x, y0).
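   The stream-division procedure in step 1 can be sketched in Python as
   follows (the function name and buffer sizing are illustrative, not
   part of this proposal):

```python
import hashlib

B = 62
P = 2**62 - 2**30 - 1   # hex 0x3fffffffbfffffff

def derive_masks(seed_i: bytes, num_counters: int):
    """Derive MASK(i, 0) .. MASK(i, C-1) from SEED_i: read the SHAKE-256
    stream 8 bytes at a time, clear the top (64-B) bits, and keep the
    value only if it is less than P (rejection sampling)."""
    masks = []
    length = 16 * num_counters + 64   # ample; rejections are ~2**-32 likely
    stream = hashlib.shake_256(seed_i).digest(length)
    offset = 0
    while len(masks) < num_counters:
        if offset + 8 > len(stream):
            # Extend the stream; SHAKE output is prefix-consistent.
            length *= 2
            stream = hashlib.shake_256(seed_i).digest(length)
        value = int.from_bytes(stream[offset:offset + 8], "big")
        value &= (1 << B) - 1          # clear the top (64-B) bits
        offset += 8
        if value < P:
            masks.append(value)
    return masks
```

   Because SHAKE-256 output is deterministic for a given seed, the TR
   that later decrypts SEED_i regenerates exactly the same MASK values.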
   To increment a counter by some value "inc":

      1. The client adds "inc" to the counter value, modulo P.

         (This step is designed to be as cheap as possible, since it
         will happen more frequently than any other step in the
         computation.)

   Aggregate counter values that are close to P/2 MUST be scaled to
   avoid overflow. See Appendix D for more information. (We do not think
   that any counters on the current Tor network will require scaling.)
   To publish the counter values:

      1. The client publishes, in the format described below:

         The list of counters it knows about.

         The list of TRs it knows about.

         For each TR with index i:

            For each counter c:

               The value (i, y0 + counter), where "counter" is the
               current (blinded) value of counter c. This corresponds to
               the share for the i'th TR of counter c: since the counter
               holds CTR_c plus the accumulated increments V, the
               published y coordinate equals y + V - MASK(i, c), and
               CTR_c never leaves the relay.

            SEED_i, as encrypted earlier to the i'th TR's public key.
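   As an illustration of how the blinding arithmetic fits together, here
   is a minimal sketch of the DC-side counter lifecycle. The class and
   parameter names are ours, and we assume the published y coordinate is
   the stored y0 value plus the final counter value (so that CTR_c
   cancels out):

```python
import random

P = 2**62 - 2**30 - 1

class BlindedCounter:
    """One counter c. `noise_shares` are the Shamir shares (x, y) of the
    noise value Z_c, and `masks[x]` stands in for MASK(x, c)."""
    def __init__(self, noise_shares, masks, rng):
        ctr = rng.randrange(P)             # CTR_c blinds the counter
        self.value = ctr
        # "encrypted shares": y0 = y - CTR_c - MASK(x, c)  (mod P)
        self.shares = {x: (y - ctr - masks[x]) % P
                       for x, y in noise_shares}
        # CTR_c, the MASKs, and the original shares are now discarded.

    def increment(self, inc):
        self.value = (self.value + inc) % P

    def publish(self):
        # y0 + counter = y + V - MASK(x, c): CTR_c cancels and never
        # leaves the relay; the TR re-adds MASK(x, c) after decryption.
        return {x: (y0 + self.value) % P
                for x, y0 in self.shares.items()}
```

   Note that a crash before publication loses only blinded state:
   nothing recoverable about the counter is ever stored in the clear.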
3.2. Tally Reporter (TR) side

   This section is less completely specified than the Data Collector's
   behavior: I expect that the TRs will be easier to update as we
   proceed.

   (Each TR has a long-term identity key (ed25519). It also has a
   sequence of short-term curve25519 keys, each associated with a single
   round of data collection.)

      1. When a group of TRs receives information from the Data
         Collectors, they collectively choose a set S of DCs and a set
         of counters such that every TR in the group has a valid entry
         for every counter, from every DC in the set.

         To be valid, an entry must not only be well-formed, but must
         also have the x coordinate in its shares corresponding to the
         TR's position in the list of TRs.
      2. For each Data Collector's report, the i'th TR decrypts its part
         of the client's report using its curve25519 key. It uses SEED_i
         and SHAKE-256 to regenerate MASK(i, 0) through MASK(i, C-1).
         Then, for each received share (note that x=i), the TR
         reconstructs a true share of the noisy counter value for that
         DC and counter c by adding MASK(x, c) back to the y coordinate,
         yielding the share (x, y_final).
      3. For every counter in the set, each TR computes the sum of the
         y_final values from all clients.

      4. For every counter in the set, each TR publishes its share of
         the sum as (x, SUM(y_final)).

      5. If at least K TRs publish correctly, then the sum can be
         reconstructed using Lagrange polynomial interpolation. (See
         Appendix A.)

      6. If the reconstructed sum is greater than P/2, it is probably a
         negative value. The value can be obtained by subtracting P from
         the sum. (Negative values are generated when negative noise is
         added to small signals.)

      7. If scaling has been applied, the sum is scaled by the scaling
         factor. (See Appendix D.)
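   For a single counter, steps 2 through 6 above reduce to a few lines
   of modular arithmetic. This sketch uses illustrative names and omits
   decryption and interpolation:

```python
P = 2**62 - 2**30 - 1

def unblind_and_sum(x, reports):
    """TR at X coordinate x: re-add MASK(x, c) to each DC's published y
    value (step 2), then sum the y_final values mod P (steps 3-4).
    `reports` is a list of (published_y, mask) pairs, where each mask is
    MASK(x, c) regenerated from that DC's decrypted SEED."""
    total = 0
    for published_y, mask in reports:
        y_final = (published_y + mask) % P   # (x, y_final) is a true share
        total = (total + y_final) % P
    return (x, total)

def to_signed(reconstructed_sum):
    """Step 6: reconstructed sums above P/2 represent negative values."""
    if reconstructed_sum > P // 2:
        return reconstructed_sum - P
    return reconstructed_sum
```

   The pair returned by unblind_and_sum() is exactly the TR's published
   share of the aggregate, ready for the interpolation in step 5.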
4. The document format

4.1. The counters document

   This document format builds on the line-based directory format used
   for other Tor documents, described in Tor's dir-spec.txt.

   Using this format, we describe a "counters" document that publishes
   the shares collected by a given DC, for a single TR.

   The "counters" document has these elements:

   "privctr-dump-format" SP VERSION SP SigningKey

      [At start, exactly once]

      Describes the version of the dump format, and provides an ed25519
      signing key to identify the relay. The signing key is encoded in
      base64 with padding stripped. VERSION is "alpha" now, but should
      be "1" once this document is finalized.
   "starting-at" SP IsoTime

      [Exactly once]

      The start of the time period when the statistics here were
      collected.

   "ending-at" SP IsoTime

      [Exactly once]

      The end of the time period when the statistics here were
      collected.

   "share-parameters" SP Number SP Number

      [Exactly once]

      The number of shares needed to reconstruct the client's
      measurements (K), and the number of shares produced (N),
      respectively.

   "tally-reporter" SP Identifier SP Integer SP Key

      [At least twice]

      The curve25519 public key of each Tally Reporter that the relay
      believes in. (If the list does not match the list of participating
      Tally Reporters, they won't be able to find the relay's values
      correctly.) The identifiers are non-space, non-NUL character
      sequences. The Key values are encoded in base64 with padding
      stripped; they must be unique within each counters document. The
      Integer values are the X coordinates of the shares associated with
      each Tally Reporter.
   "encrypted-to-key" SP Key

      [Exactly once]

      The curve25519 public key to which the report below is encrypted.
      Note that it must match one of the Tally Reporter options above.

   "report" NL
   "-----BEGIN ENCRYPTED MESSAGE-----" NL
   Base64Data
   "-----END ENCRYPTED MESSAGE-----" NL

      [Exactly once]

      An encrypted document, encoded in base64. The plaintext format is
      described in section 4.2 below. The encryption is as specified in
      section 5 below, with STRING_CONSTANT set to "privctr-shares-v1".

   "signature" SP Signature

      [At end, exactly once]

      The Ed25519 signature of all the fields in the document, from the
      first byte, up to but not including the "signature" keyword here.
      The signature is encoded in base64 with padding stripped.
4.2. The encrypted "shares" document

   The shares document is sent, encrypted, in the "report" element
   above. Its plaintext contents include these fields:

   "encrypted-seed" NL
   "-----BEGIN ENCRYPTED MESSAGE-----" NL
   Base64Data
   "-----END ENCRYPTED MESSAGE-----" NL

      [At start, exactly once]

      An encrypted document, encoded in base64. The plaintext value is
      the 32-byte value SEED_i for this TR. The encryption is as
      specified in section 5 below, with STRING_CONSTANT set to
      "privctr-seed-v1".

   "d" SP Keyword SP Integer

      [Any number of times]

      For each counter, the name of the counter, and the obfuscated Y
      coordinate of this TR's share for that counter. (The Y coordinate
      is the published value described in section 3.1 above: the
      encrypted share's y0 value plus the final counter value.) The
      order of counters must correspond to the order used when
      generating the MASK() values; different clients do not need to
      choose the same order.
5. Hybrid encryption

   This scheme is taken from rend-spec-v3.txt, section 2.5.3, replacing
   "secret_input" and "STRING_CONSTANT". It is a hybrid encryption
   method for encrypting a message to a curve25519 public key PK.

   We generate a new curve25519 keypair (sk, pk).

   We run the algorithm of rend-spec-v3.txt 2.5.3, replacing
   "secret_input" with Curve25519(sk, PK) | SigningKey, where SigningKey
   is the DC's signing key. (Including the DC's SigningKey here prevents
   one DC from replaying another one's data.)

   We transmit the encrypted data as in rend-spec-v3.txt 2.5.3,
   prepending pk.
Appendix A. Shamir secret sharing for the impatient

   In Shamir secret sharing, you want to split a value in a finite field
   into N shares, such that any K of the N shares can reconstruct the
   original value, but K-1 shares give you no information at all.

   The key insight here is that you can reconstruct a degree-K
   polynomial given K+1 distinct points on its curve, but not given K
   points.

   So, to split a secret, we are going to generate a degree-(K-1)
   polynomial. We'll make the Y intercept of the polynomial be our
   secret, and choose all the other coefficients at random from our
   field.

   Then we compute the (x, y) coordinates for x in [1, N]. Now we have N
   points, any K of which can be used to find the original polynomial.

   Moreover, we can do what PrivCount wants here, because adding the y
   coordinates of shares gives us shares of the sum: if P1 is the
   polynomial made to share secret A and P2 is the polynomial made to
   share secret B, and if (x, y1) is on P1 and (x, y2) is on P2, then
   (x, y1+y2) will be on P1+P2 ... and moreover, the y intercept of
   P1+P2 will be A+B.

   To reconstruct a secret from a set of shares, you have to either go
   learn about Lagrange polynomials, or just blindly copy a formula from
   your favorite source.

   Here is such a formula, as pseudocode^Wpython, assuming that each
   share is an object with a _x field and a _y field:
      def interpolate(shares):
          accumulator = FE(0)
          for sh in shares:
              product_num = FE(1)
              product_denom = FE(1)
              for sh2 in shares:
                  if sh2 is sh:
                      continue
                  product_num *= sh2._x
                  product_denom *= (sh2._x - sh._x)
              accumulator += (sh._y * product_num) / product_denom
          return accumulator
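   The pseudocode above can be made concrete over the suggested prime P.
   The following runnable sketch (names ours) also demonstrates the
   share-addition property described earlier:

```python
import random

P = 2**62 - 2**30 - 1   # the suggested field modulus

def split(secret, k, n, rng=random):
    """Shamir-share `secret` with threshold k into n shares (x, y)."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(k - 1)]
    def poly_at(x):
        y = 0
        for c in reversed(coeffs):       # Horner evaluation mod P
            y = (y * x + c) % P
        return y
    return [(x, poly_at(x)) for x in range(1, n + 1)]

def interpolate(shares):
    """Lagrange interpolation at x=0; since P is prime, the modular
    inverse is pow(denom, P-2, P) by Fermat's little theorem."""
    acc = 0
    for x, y in shares:
        num, denom = 1, 1
        for x2, _ in shares:
            if x2 == x:
                continue
            num = (num * x2) % P
            denom = (denom * (x2 - x)) % P
        acc = (acc + y * num * pow(denom, P - 2, P)) % P
    return acc

# Any K shares reconstruct; pointwise sums of shares share the sum.
a_shares = split(20, 3, 5)
b_shares = split(22, 3, 5)
summed = [(x, (y1 + y2) % P)
          for (x, y1), (_, y2) in zip(a_shares, b_shares)]
```

   Interpolating any three of the summed shares yields 42 = 20 + 22,
   which is exactly the aggregation trick the TRs rely on.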
Appendix B. An alternative way to pick X coordinates

   Above we describe a system where everybody knows the same TRs and
   puts them in the same order, and then does Shamir secret sharing
   using "x" as the x coordinate for the x'th TR.

   But what if we remove that requirement by having x be based on a hash
   of the public key of the TR? Everything would still work, so long as
   all users chose the same K value. It would also let us migrate TR
   sets a little more gracefully.
Appendix C. Sampling floating-point Gaussian noise for differential privacy

   Background:

   When we add noise to a counter value (signal), we want the added
   noise to protect all of the bits in the signal, to ensure
   differential privacy. But because noise values are generated from
   random double(s) using floating-point calculations, the resulting low
   bits are not distributed evenly enough to ensure differential
   privacy.

   As implemented in the C "double" type, IEEE 754 double-precision
   floating-point numbers contain 53 significant bits in their mantissa.
   This means that noise calculated using doubles cannot ensure
   differential privacy for client activity larger than 2**53:

      * if the noise is scaled to the magnitude of the signal using
        multiplication, then the low bits are unprotected;
      * if the noise is not scaled, then the high bits are unprotected.

   But the operations in the noise transform also suffer from
   floating-point inaccuracy, further affecting the low bits in the
   mantissa. So we can only protect client activity up to 2**46 with
   Laplacian noise. (We assume that the limit for Gaussian noise is
   similar.)

   Our noise generation procedure further reduces this limit to 2**42.
   For byte counters, 2**42 is 4 terabytes, or the observed bandwidth of
   a 1 Gbps relay running at full speed for 9 hours. It may be several
   years before we want to protect this much client activity. However,
   since the mitigation is relatively simple, we specify that it MUST be
   implemented.
   Procedure:

   Data collectors MUST sample noise as follows:

      1. Generate random double(s) in [0, 1] that are integer multiples
         of 2**-53.

         TODO: the Gaussian transform in step 2 may require open
         intervals

      2. Generate a Gaussian floating-point noise value at random with
         sigma 1, using the random double(s) generated in step 1.

      3. Multiply the floating-point noise by the floating-point sigma
         value.

      4. Truncate the scaled noise to an integer to remove the
         fractional bits. (These bits can never correspond to signal
         bits, because PrivCount only collects integer counters.)

      5. If the floating-point sigma value from step 3 is large enough
         that any noise value could be greater than or equal to 2**46,
         we need to randomise the low bits of the integer scaled noise
         value. (This ensures that the low bits of the signal are always
         hidden by the noise.)

         If we use the sample_unit_gaussian() transform in
         nickm/privcount_nm:

         A. The maximum r value is sqrt(-2.0*ln(2**-53)) ~= 8.57, and
            the maximal sin(theta) values are +/- 1.0. Therefore, the
            generated noise values can be greater than or equal to 2**46
            when the sigma value is greater than 2**42.

         B. Therefore, the number of low bits that need to be
            randomised is:

               N = floor(sigma / 2**42)

         C. We randomise the lowest N bits of the integer noise by
            replacing them with a uniformly distributed N-bit integer
            value in 0...(2**N)-1.

      6. Add the integer noise to the integer counter, before the
         counter is incremented in response to events. (This ensures
         that the signal value is always protected.)
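   A sketch of the procedure above, assuming the Box-Muller method as
   the step-2 Gaussian transform (this is not the referenced
   sample_unit_gaussian() implementation, and the function name is
   ours):

```python
import math
import random

def sample_noise(sigma, rng=None):
    """Sample integer Gaussian noise per the procedure above."""
    rng = rng or random.SystemRandom()
    # Step 1: doubles in [0, 1] that are integer multiples of 2**-53.
    # Box-Muller needs log(u1), so 0 is resampled (the open-interval
    # caveat noted in the TODO above).
    u1 = 0.0
    while u1 == 0.0:
        u1 = rng.getrandbits(53) * 2.0 ** -53
    u2 = rng.getrandbits(53) * 2.0 ** -53
    # Step 2: unit Gaussian via Box-Muller.
    noise = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    # Steps 3-4: scale by sigma, then truncate the fractional bits.
    scaled = int(noise * sigma)
    # Step 5: randomise the lowest N = floor(sigma / 2**42) bits.
    n_bits = int(sigma) >> 42
    if n_bits > 0:
        scaled = (scaled & ~((1 << n_bits) - 1)) | rng.getrandbits(n_bits)
    return scaled
```

   Python's bitwise operators on negative integers behave as if on
   two's-complement values of unbounded width, so the low-bit
   replacement in step 5 works for negative noise as well.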
   This procedure is security-sensitive: changing the order of
   multiplications, truncations, or bit replacements can expose the low
   or high bits of the signal or noise.

   As long as the noise is sampled using this procedure, the low bits of
   the signal are protected. So we do not need to "bin" any signals.

   The impact of randomising more bits than necessary is minor, but if
   we fail to randomise an unevenly distributed bit, client activity can
   be exposed. Therefore, we choose to randomise all bits that could
   potentially be affected by floating-point inaccuracy.
   Justification:

   Although this analysis applies to Laplacian noise, we assume a
   similar analysis applies to Gaussian noise. (If we add Laplacian
   noise on DCs, the total ends up with a Gaussian distribution anyway.)

   TODO: check that the 2**46 limit applies to Gaussian noise.

   This procedure results in a Gaussian distribution for the higher ~42
   bits of the noise. We can safely ignore the value of the lower bits
   of the noise, because they are insignificant for our reporting.

   This procedure is based on section 5.2 of:

      "On Significance of the Least Significant Bits For Differential
      Privacy"
      Ilya Mironov, ACM CCS 2012
      https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/lsbs.pdf

   We believe that this procedure is safe, because we neither round nor
   smooth the noise values. The truncation in step 4 has the same effect
   as Mironov's "safe snapping" procedure. Randomising the low bits
   removes the 2**46 limit on the sigma value, at the cost of departing
   slightly from the ideal infinite-precision Gaussian distribution.
   (But we already know that these bits are distributed poorly, due to
   floating-point inaccuracy.)

   Mironov's analysis assumes that a clamp() function is available to
   clamp large signal and noise values to an infinite floating-point
   value. Instead of clamping, PrivCount's arithmetic wraps modulo P. We
   believe that this is safe, because any reported values this large
   will be meaningless modulo P. And they will not expose any client
   activity, because "modulo P" is an arithmetic transform of the summed
   noised signal value.
   Alternatives:

   We could round the encrypted value to the nearest multiple of the
   unprotected bits. But this relies on the MASK() value being a
   uniformly distributed random value, and it is less generic.

   We could also simply fail when we reach the 2**42 limit on the sigma
   value, but we do not want to design a system with a limit that low.

   We could use a pure-integer transform to create Gaussian noise, and
   avoid floating-point issues entirely. But we have not been able to
   find an efficient pure-integer Gaussian or Laplacian noise transform.
   Nor do we know if such a transform can be used to ensure differential
   privacy.
Appendix D. Scaling large counters

   We do not believe that scaling will be necessary to collect PrivCount
   statistics in Tor. As of November 2017, the Tor network advertises a
   capacity of 200 Gbps, or 2**51 bytes per day. We can measure counters
   as large as ~2**61 before reaching the P/2 counter limit.

   If scaling becomes necessary, we can scale event values (and noise
   sigmas) by a scaling factor before adding them to the counter.
   Scaling may introduce a bias in the final result, but this should be
   insignificant for reporting.
Appendix Z. Remaining client-side uncertainties

   [These are the uncertainties at the client side. I'm not considering
   TR-only operations here unless they affect clients.]

   Should we do a multi-level thing for the signing keys? That is, have
   an identity key for each TR and each DC, and use those to sign
   short-term keys?

   How to tell the DCs the parameters of the system, including:
      - who the TRs are, and what their keys are?
      - what the counters are, and how much noise to add to each?
      - how do we impose a delay when the noise parameters change?
        (this delay ensures differential privacy even when the old and
        new counters are compared)
      - or should we try to monotonically increase counter noise?
      - when the collection intervals start and end?
      - what happens in networks where some relays report some counters,
        and other relays report other counters?
      - do we just pick the latest counter version, as long as enough
        relays support it?
        (it's not safe to report multiple copies of counters)

   How do the TRs agree on which DCs' counters to collect?

   How is data uploaded to DCs?

   What to say about persistence on the DC side?