123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568 |
- Filename: 288-privcount-with-shamir.txt
- Title: Privacy-Preserving Statistics with Privcount in Tor (Shamir version)
- Author: Nick Mathewson, Tim Wilson-Brown, Aaron Johnson
- Created: 1-Dec-2017
- Supercedes: 280
- Status: Accepted
- 0. Acknowledgments
- Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented
- the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended
- PrivEx's differential privacy guarantees to multiple counters in
- PrivCount:
- https://github.com/privcount/privcount/blob/master/README.markdown#research-background
- Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental
- PrivCount code, based on the PrivEx secret-sharing variant. This
- implementation includes contributions from the PrivEx authors, and
- others:
- https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown
- This research was supported in part by NSF grants CNS-1111539,
- CNS-1314637, CNS-1526306, CNS-1619454, and CNS-1640548.
- The use of a Shamir secret-sharing-based approach is due to a
- suggestion by Aaron Johnson (iirc); Carolin Zöbelein did some helpful
- analysis here.
- Aaron Johnson and Tim Wilson-Brown made improvements to the draft proposal.
- 1. Introduction and scope
- PrivCount is a privacy-preserving way to collect aggregate statistics
- about the Tor network without exposing the statistics from any single
- Tor relay.
- This document describes the behavior of the in-Tor portion of the
- PrivCount system. It DOES NOT describe the counter configurations,
- or any other parts of the system. (These will be covered in separate
- proposals.)
- 2. PrivCount overview
- Here follows an oversimplified summary of PrivCount, with enough
- information to explain the Tor side of things. The actual operation
- of the non-Tor components is trickier than described below.
- In PrivCount, a Data Collector (DC, in this case a Tor relay) shares
- numeric data with N different Tally Reporters (TRs). (A Tally Reporter
- performs the summing and unblinding roles of the Tally Server and Share
- Keeper from experimental PrivCount.)
- All N Tally Reporters together can reconstruct the original data, but
- no (N-1)-sized subset of the Tally Reporters can learn anything about
- the data.
- (In reality, the Tally Reporters don't reconstruct the original data
- at all! Instead, they will reconstruct a _sum_ of the original data
- across all participating relays.)
- In brief, the system works as follow:
- To share data, for each counter value V to be shared, the Data Collector
- first adds Gaussian noise to V in order to produce V', uses (K,N) Shamir
- secret-sharing to generate N shares of V' (K<=N, K being the
- reconstruction threshold), encrypts each share to a different Tally
- Reporter, and sends each encrypted share to the Tally Reporter it
- is encrypted for.
- The Tally Reporters then agree on the set S of Data Collectors that sent
- data to all of them, and each Tally Reporter forms a share of the aggregate
- value by decrypting the shares it received from the Data Collectors in S
- and adding them together. The Tally Reporters then, collectively, perform
- secret reconstruction, thereby learning the sum of all the different
- values V'.
- The use of Shamir secret sharing lets us survive up to N-K crashing TRs.
- Waiting until the end to agree on a set S of surviving relays lets us
- survive an arbitrary number of crashing DCs. In order to prevent bogus
- data from corrupting the tally, the Tally Reporters can perform the
- aggregation step multiple times, each time proceeding with a different
- subset of S and taking the median of the resulting values.
- Relay subsets should be chosen at random to avoid relays manipulating their
- subset membership(s). If an shared random value is required, all relays must
- submit their results, and then the next revealed shared random value can
- be used to select relay subsets. (Tor's shared random value can be
- calculated as soon as all commits have been revealed. So all relay results
- must be received *before* any votes are cast in the reveal phase for that
- shared random value.)
- Below we describe the algorithm in more detail, and describe the data
- format to use.
- 3. The algorithm
- All values below are B-bit integers modulo some prime P; we suggest
- B=62 and P = 2**62 - 2**30 - 1 (hex 0x3fffffffbfffffff). The size of
- this field is an upper limit on the largest sum we can calculate; it
- is not a security parameter.
- There are N Tally Reporters: every participating relay must agree on
- which N exist, and on their current public keys. We suggest listing
- them in the consensus networkstatus document. All parties must also
- agree on some ordering the Tally Reporters. Similarly, all parties
- must also agree on some value K<=N.
- There are a number of well-known "counters", identified known by ASCII
- identifiers. Each counter is a value that the participating relays
- will know how to count. Let C be the number of counters.
- 3.1. Data Collector (DC) side
- At the start of each period, every Data Collector ("client" below)
- initializes their state as follows
- 1. For every Tally Reporter with index i, the client constructs a
- random 32-byte random value SEED_i. The client then generates
- a pseudorandom bitstream of using the SHAKE-256
- XOF with SEED_i as its input, and divides this stream into
- C values, with the c'th value denoted by MASK(i, c).
- [To divide the stream into values, consider the stream 8 bytes at a
- time as unsigned integers in network (big-endian) order. For each
- such integer, clear the top (64-B) bits. If the result is less than
- P, then include the integer as one of the MASK(i, .) values.
- Otherwise, discard this 8-byte segment and proceed to the next
- value.]
- 2. The client encrypts SEED_i using the public key of Tally
- Reporter i, and remembers this encrypted value. It discards
- SEED_i.
- 3. For every counter c, the client generates a noise value Z_c
- from an appropriate Gaussian distribution. If the noise value is
- negative, the client adds P to bring Z_c into the range 0...(P-1).
- (The noise MUST be sampled using the procedure in Appendix C.)
- The client then uses Shamir secret sharing to generate
- N shares (x,y) of Z_c, 1 <= x <= N, with the x'th share to be used by
- the x'th Tally Reporter. See Appendix A for more on Shamir secret
- sharing. See Appendix B for another idea about X coordinates.
- The client picks a random value CTR_c and stores it in the counter,
- which serves to locally blind the counter.
- The client then subtracts (MASK(x, c)+CTR_c) from y, giving
- "encrypted shares" of (x, y0) where y0 = y-CTR_c.
- The client then discards all MASK values, all CTR values, and all
- original shares (x,y), all CTR and the noise value Z_c. For each
- counter c, it remembers CTR_c, and N shares of the form (x, y).
- To increment a counter by some value "inc":
- 1. The client adds "inc" to counter value, modulo P.
- (This step is chosen to be optimal, since it will happen more
- frequently than any other step in the computation.)
- Aggregate counter values that are close to P/2 MUST be scaled to
- avoid overflow. See Appendix D for more information. (We do not think
- that any counters on the current Tor network will require scaling.)
- To publish the counter values:
- 1. The client publishes, in the format described below:
- The list of counters it knows about
- The list of TRs it knows about
- For each TR:
- For each counter c:
- A list of (i, y-CTR_c-MASK(x,c)), which corresponds
- to the share for the i'th TR of counter c.
- SEED_i as encrypted earlier to the i'th TR's public key.
- 3.2. Tally Reporter (TR) side
- This section is less completely specified than the Data Collector's
- behavior: I expect that the TRs will be easier to update as we proceed.
- (Each TR has a long-term identity key (ed25519). It also has a
- sequence of short-term curve25519 keys, each associated with a single
- round of data collection.)
- 1. When a group of TRs receives information from the Data Collectors,
- they collectively chose a set S of DCs and a set of counters such
- that every TR in the group has a valid entry for every counter,
- from every DC in the set.
- To be valid, an entry must not only be well-formed, but must also
- have the x coordinate in its shares corresponding to the
- TR's position in the list of TRs.
- 2. For each Data Collector's report, the i'th TR decrypts its part of
- the client's report using its curve25519 key. It uses SEED_i and
- SHAKE-256 to regenerate MASK(0) through MASK(C-1). Then for each
- share (x, y-CTR_c-MASK(x,c)) (note that x=i), the TR reconstructs the
- true share of the value for that DC and counter c by adding
- V+MASK(x,c) to the y coordinate to yield the share (x, y_final).
- 3. For every counter in the set, each TR computes the sum of the
- y_final values from all clients.
- 4. For every counter in the set, each TR publishes its a share of
- the sum as (x, SUM(y_final)).
- 5. If at least K TRs publish correctly, then the sum can be
- reconstructed using Lagrange polynomial interpolation. (See
- Appendix A).
- 6. If the reconstructed sum is greater than P/2, it is probably a negative
- value. The value can be obtained by subtracting P from the sum.
- (Negative values are generated when negative noise is added to small
- signals.)
- 7. If scaling has been applied, the sum is scaled by the scaling factor.
- (See Appendix D.)
- 4. The document format
- 4.1. The counters document.
- This document format builds on the line-based directory format used
- for other tor documents, described in Tor's dir-spec.txt.
- Using this format, we describe a "counters" document that publishes
- the shares collected by a given DC, for a single TR.
- The "counters" document has these elements:
- "privctr-dump-format" SP VERSION SP SigningKey
- [At start, exactly once]
- Describes the version of the dump format, and provides an ed25519
- signing key to identify the relay. The signing key is encoded in
- base64 with padding stripped. VERSION is "alpha" now, but should
- be "1" once this document is finalized.
- "starting-at" SP IsoTime
- [Exactly once]
- The start of the time period when the statistics here were
- collected.
- "ending-at" SP IsoTime
- [Exactly once]
- The end of the time period when the statistics here were
- collected.
- "share-parameters" SP Number SP Number
- [Exactly once]
- The number of shares needed to reconstruct the client's
- measurements (K), and the number of shares produced (N),
- respectively.
- "tally-reporter" SP Identifier SP Integer SP Key
- [At least twice]
- The curve25519 public key of each Tally Reporter that the relay
- believes in. (If the list does not match the list of
- participating Tally Reporters, they won't be able to find the
- relay's values correctly.) The identifiers are non-space,
- non-nul character sequences. The Key values are encoded in
- base64 with padding stripped; they must be unique within each
- counters document. The Integer values are the X coordinate of
- the shares associated with each Tally Reporter.
- "encrypted-to-key" SP Key
- [Exactly once]
- The curve25519 public key to which the report below is encrypted.
- Note that it must match one of the Tally Reporter options above.
- "report" NL
- "----- BEGIN ENCRYPTED MESSAGE-----" NL
- Base64Data
- "----- END ENCRYPTED MESSAGE-----" NL
- [Exactly once]
- An encrypted document, encoded in base64. The plaintext format is
- described in section 4.2. below. The encryption is as specified in
- section 5 below, with STRING_CONSTANT set to "privctr-shares-v1".
- "signature" SP Signature
- [At end, exactly once]
- The Ed25519 signature of all the fields in the document, from the
- first byte, up to but not including the "signature" keyword here.
- The signature is encoded in base64 with padding stripped.
- 4.2. The encrypted "shares" document.
- The shares document is sent, encrypted, in the "report" element above.
- Its plaintext contents include these fields:
- "encrypted-seed" NL
- "----- BEGIN ENCRYPTED MESSAGE-----" NL
- Base64Data
- "----- END ENCRYPTED MESSAGE-----" NL
- [At start, exactly once.]
- An encrypted document, encoded in base64. The plaintext value is
- the 32-byte value SEED_i for this TR. The encryption is as
- specified in section 5 below, with STRING_CONSTANT set to
- "privctr-seed-v1".
- "d" SP Keyword SP Integer
- [Any number of times]
- For each counter, the name of the counter, and the obfuscated Y
- coordinate of this TR's share for that counter. (The Y coordinate
- is calculated as y-CTR_c as in 3.1 above.) The order of counters
- must correspond to the order used when generating the MASK() values;
- different clients do not need to choose the same order.
- 5. Hybrid encryption
- This scheme is taken from rend-spec-v3.txt, section 2.5.3, replacing
- "secret_input" and "STRING_CONSTANT". It is a hybrid encryption
- method for encrypting a message to a curve25519 public key PK.
- We generate a new curve25519 keypair (sk,pk).
- We run the algorithm of rend-spec-v3.txt 2.5.3, replacing
- "secret_input" with Curve25519(sk,PK) | SigningKey, where
- SigningKey is the DC's signing key. (Including the DC's SigningKey
- here prevents one DC from replaying another one's data.)
- We transmit the encrypted data as in rend-spec-v3.txt 2.5.3,
- prepending pk.
- Appendix A. Shamir secret sharing for the impatient
- In Shamir secret sharing, you want to split a value in a finite
- field into N shares, such that any K of the N shares can
- reconstruct the original value, but K-1 shares give you no
- information at all.
- The key insight here is that you can reconstruct a K-degree
- polynomial given K+1 distinct points on its curve, but not given
- K points.
- So, to split a secret, we going to generate a (K-1)-degree
- polynomial. We'll make the Y intercept of the polynomial be our
- secret, and choose all the other coefficients at random from our
- field.
- Then we compute the (x,y) coordinates for x in [1, N]. Now we
- have N points, any K of which can be used to find the original
- polynomial.
- Moreover, we can do what PrivCount wants here, because adding the
- y coordinates of N shares gives us shares of the sum: If P1 is
- the polynomial made to share secret A and P2 is the polynomial
- made to share secret B, and if (x,y1) is on P1 and (x,y2) is on
- P2, then (x,y1+y2) will be on P1+P2 ... and moreover, the y
- intercept of P1+P2 will be A+B.
- To reconstruct a secret from a set of shares, you have to either
- go learn about Lagrange polynomials, or just blindly copy a
- formula from your favorite source.
- Here is such a formula, as pseudocode^Wpython, assuming that
- each share is an object with a _x field and a _y field.
- def interpolate(shares):
- for sh in shares:
- product_num = FE(1)
- product_denom = FE(1)
- for sh2 in shares:
- if sh2 is sh:
- continue
- product_num *= sh2._x
- product_denom *= (sh2._x - sh._x)
- accumulator += (sh._y * product_num) / product_denom
- return accumulator
- Appendix B. An alternative way to pick X coordinates
- Above we describe a system where everybody knows the same TRs and
- puts them in the same order, and then does Shamir secret sharing
- using "x" as the x coordinate for the x'th TR.
- But what if we remove that requirement by having x be based on a hash
- of the public key of the TR? Everything would still work, so long as
- all users chose the same K value. It would also let us migrate TR
- sets a little more gracefully.
- Appendix C. Sampling floating-point Gaussian noise for differential privacy
- Background:
- When we add noise to a counter value (signal), we want the added noise to
- protect all of the bits in the signal, to ensure differential privacy.
- But because noise values are generated from random double(s) using
- floating-point calculations, the resulting low bits are not distributed
- evenly enough to ensure differential privacy.
- As implemented in the C "double" type, IEEE 754 double-precision
- floating-point numbers contain 53 significant bits in their mantissa. This
- means that noise calculated using doubles can not ensure differential
- privacy for client activity larger than 2**53:
- * if the noise is scaled to the magnitude of the signal using
- multiplication, then the low bits are unprotected,
- * if the noise is not scaled, then the high bits are unprotected.
- But the operations in the noise transform also suffer from floating-point
- inaccuracy, further affecting the low bits in the mantissa. So we can only
- protect client activity up to 2**46 with Laplacian noise. (We assume that
- the limit for Gaussian noise is similar.)
- Our noise generation procedure further reduces this limit to 2**42. For
- byte counters, 2**42 is 4 Terabytes, or the observed bandwidth of a 1 Gbps
- relay running at full speed for 9 hours. It may be several years before we
- want to protect this much client activity. However, since the mitigation is
- relatively simple, we specify that it MUST be implemented.
- Procedure:
- Data collectors MUST sample noise as follows:
- 1. Generate random double(s) in [0, 1] that are integer multiples of
- 2**-53.
- TODO: the Gaussian transform in step 2 may require open intervals
- 2. Generate a Gaussian floating-point noise value at random with sigma 1,
- using the random double(s) generated in step 1.
- 3. Multiply the floating-point noise by the floating-point sigma value.
- 4. Truncate the scaled noise to an integer to remove the fractional bits.
- (These bits can never correspond to signal bits, because PrivCount only
- collects integer counters.)
- 5. If the floating-point sigma value from step 3 is large enough that any
- noise value could be greater than or equal to 2**46, we need to
- randomise the low bits of the integer scaled noise value. (This ensures
- that the low bits of the signal are always hidden by the noise.)
- If we use the sample_unit_gaussian() transform in nickm/privcount_nm:
- A. The maximum r value is sqrt(-2.0*ln(2**-53)) ~= 8.57, and the
- maximal sin(theta) values are +/- 1.0. Therefore, the generated
- noise values can be greater than or equal to 2**46 when the sigma
- value is greater than 2**42.
- B. Therefore, the number of low bits that need to be randomised is:
- N = floor(sigma / 2**42)
- C. We randomise the lowest N bits of the integer noise by replacing them
- with a uniformly distributed N-bit integer value in 0...(2**N)-1.
- 6. Add the integer noise to the integer counter, before the counter is
- incremented in response to events. (This ensures that the signal value
- is always protected.)
- This procedure is security-sensitive: changing the order of
- multiplications, truncations, or bit replacements can expose the low or
- high bits of the signal or noise.
- As long as the noise is sampled using this procedure, the low bits of the
- signal are protected. So we do not need to "bin" any signals.
- The impact of randomising more bits than necessary is minor, but if we fail
- to randomise an unevenly distributed bit, client activity can be exposed.
- Therefore, we choose to randomise all bits that could potentially be affected
- by floating-point inaccuracy.
- Justification:
- Although this analysis applies to Laplacian noise, we assume a similar
- analysis applies to Gaussian noise. (If we add Laplacian noise on DCs,
- the total ends up with a Gaussian distribution anyway.)
- TODO: check that the 2**46 limit applies to Gaussian noise.
- This procedure results in a Gaussian distribution for the higher ~42 bits
- of the noise. We can safely ignore the value of the lower bits of the noise,
- because they are insignificant for our reporting.
- This procedure is based on section 5.2 of:
- "On Significance of the Least Significant Bits For Differential Privacy"
- Ilya Mironov, ACM CCS 2012
- https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/lsbs.pdf
- We believe that this procedure is safe, because we neither round nor smooth
- the noise values. The truncation in step 4 has the same effect as Mironov's
- "safe snapping" procedure. Randomising the low bits removes the 2**46 limit
- on the sigma value, at the cost of departing slightly from the ideal
- infinite-precision Gaussian distribution. (But we already know that these
- bits are distributed poorly, due to floating-point inaccuracy.)
- Mironov's analysis assumes that a clamp() function is available to clamp
- large signal and noise values to an infinite floating-point value.
- Instead of clamping, PrivCount's arithmetic wraps modulo P. We believe that
- this is safe, because any reported values this large will be meaningless
- modulo P. And they will not expose any client activity, because "modulo P"
- is an arithmetic transform of the summed noised signal value.
- Alternatives:
- We could round the encrypted value to the nearest multiple of the
- unprotected bits. But this relies on the MASK() value being a uniformly
- distributed random value, and it is less generic.
- We could also simply fail when we reach the 2**42 limit on the sigma value,
- but we do not want to design a system with a limit that low.
- We could use a pure-integer transform to create Gaussian noise, and avoid
- floating-point issues entirely. But we have not been able to find an
- efficient pure-integer Gaussian or Laplacian noise transform. Nor do we
- know if such a transform can be used to ensure differential privacy.
- Appendix D. Scaling large counters
- We do not believe that scaling will be necessary to collect PrivCount
- statistics in Tor. As of November 2017, the Tor network advertises a
- capacity of 200 Gbps, or 2**51 bytes per day. We can measure counters as
- large as ~2**61 before reaching the P/2 counter limit.
- If scaling becomes necessary, we can scale event values (and noise sigmas)
- by a scaling factor before adding them to the counter. Scaling may introduce
- a bias in the final result, but this should be insignificant for reporting.
- Appendix Z. Remaining client-side uncertainties
- [These are the uncertainties at the client side. I'm not considering
- TR-only operations here unless they affect clients.]
- Should we do a multi-level thing for the signing keys? That is, have
- an identity key for each TR and each DC, and use those to sign
- short-term keys?
- How to tell the DCs the parameters of the system, including:
- - who the TRs are, and what their keys are?
- - what the counters are, and how much noise to add to each?
- - how do we impose a delay when the noise parameters change?
- (this delay ensures differential privacy even when the old and new
- counters are compared)
- - or should we try to monotonically increase counter noise?
- - when the collection intervals start and end?
- - what happens in networks where some relays report some counters, and
- other relays report other counters?
- - do we just pick the latest counter version, as long as enough relays
- support it?
- (it's not safe to report multiple copies of counters)
- How the TRs agree on which DCs' counters to collect?
- How data is uploaded to DCs?
- What to say about persistence on the DC side?
|