
Filename: 300-walking-onions.txt
Title: Walking Onions: Scaling and Saving Bandwidth
Author: Nick Mathewson
Created: 5-Feb-2019
Status: Draft

0. Status

   This proposal describes a mechanism called "Walking Onions" for
   scaling the Tor network and reducing the amount of client bandwidth
   used to maintain a client's view of the Tor network.

   This is a draft proposal; there are problems left to be solved and
   questions left to be answered.  Once those parts are done, we can
   fill in section 4 with the final details of the design.

1. Introduction

   In the current Tor network design, we assume that every client has a
   complete view of all the relays in the network.  To achieve this,
   clients download consensus directories at regular intervals, and
   download descriptors for every relay listed in the directory.

   The substitution of microdescriptors for regular descriptors
   (proposal 158) and the use of consensus diffs (proposal 140) have
   lowered the number of bytes that clients must dedicate to directory
   operations.  But we still face the problem that, if we require each
   client to know about every relay in the network, each client's
   directory traffic will grow linearly with the number of relays.

   Another drawback of our current system is that client directory
   traffic is front-loaded: clients need to fetch an entire directory
   before they can begin building circuits.  This imposes extra delays
   on clients, and extra load on the network.

   To anonymize the world, we will need to scale to a much larger
   number of relays and clients: requiring clients to know about every
   relay in the set simply won't scale, and requiring every new client
   to download a large document is also problematic.

   There are obvious responses here, and some other anonymity tools
   have taken them.  It's possible to have each client use only a
   fraction of the relays in a network--but doing so opens the client
   to _epistemic attacks_, in which the difference between clients'
   views of the network is used to partition their traffic.  It's also
   possible to move the problem of selecting relays from the client to
   the relays themselves, and let each relay select the next relay in
   turn--but this choice opens the client to _route capture attacks_,
   in which a malicious relay selects only other malicious relays.

   In this proposal, I'll describe a design for eliminating up-front
   client directory downloads.  Clients still choose relays at random,
   but without ever having to hold a list of all the relays.  This
   design does not require clients to trust relays any more than they
   do today, and does not open clients to epistemic attacks.

   I hope to maintain feature parity with the current Tor design; I'll
   list the places in which I haven't figured out how to do so yet.

   I'm naming this design "walking onions".  The walking onion (Allium
   x proliferum) reproduces by growing tiny little bulbs at the end of
   a long stalk.  When the stalk gets too top-heavy, it flops over, and
   the little bulbs start growing somewhere new.

   The rest of this document will run as follows.  In section 2, I'll
   explain the ideas behind the "walking onions" design, and how they
   can eliminate the need for regular directory downloads.  In section
   3, I'll answer a number of follow-up questions that arise, and
   explain how to keep various features of Tor working.  Section 4
   (not yet written) will elaborate all the details needed to turn
   this proposal into a concrete set of specification changes.

2. Overview

2.1. Recapping proposal 141

   Back in Proposal 141 ("Download server descriptors on demand"),
   Peter Palfrader proposed an idea for eliminating ahead-of-time
   descriptor downloads.  Instead of fetching all the descriptors in
   advance, a client would fetch the descriptor for each relay in its
   path right before extending the circuit to that relay.  For
   example, if a client has a circuit from A->B and wants to extend
   the circuit to C, the client asks B for C's descriptor, and then
   extends the circuit to C.

   (Note that the client needs to fetch the descriptor every time it
   extends the circuit, so that an observer can't tell whether the
   client already had the descriptor or not.)

   There are a couple of limitations to this design:

      * It still requires clients to download a consensus.

      * It introduces an extra round-trip to each hop of the circuit
        extension process.

   I'll show how to solve these problems in the two sections below.

2.2. An observation about the ntor handshake

   I'll start with an observation about our current circuit extension
   handshake, ntor: it should not actually be necessary to know a
   relay's onion key before extending to it.

   Right now, the client sends:

      NODEID     (The relay's identity)
      KEYID      (The relay's public onion key)
      CLIENT_PK  (A Diffie-Hellman public key)

   and the relay responds with:

      SERVER_PK  (A Diffie-Hellman public key)
      AUTH       (A function of the relay's private keys and
                  *all* of the public keys)

   Both parties generate shared symmetric keys from the same inputs
   that are used to create the AUTH value.

   The important insight here is that we could easily change this
   handshake so that the client sends only CLIENT_PK, and receives
   NODEID and KEYID as part of the response.

   In other words, the client needs to know the relay's onion key to
   _complete_ the handshake, but doesn't actually need to know the
   relay's onion key in order to _initiate_ the handshake.

   This is the insight that will let us save a round trip: when the
   client goes to extend a circuit from A->B to C, it can send B a
   request to extend to C and retrieve C's descriptor in a single
   step.  Specifically, the client sends only CLIENT_PK, and relay B
   can include C's keys as part of the EXTENDED cell.
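
   The message ordering above can be sketched with a toy
   Diffie-Hellman exchange.  This is NOT the real ntor protocol and is
   not secure: the group, the encodings, and the AUTH construction
   below are all illustrative assumptions.  The point is only that the
   client commits to nothing but CLIENT_PK, and learns NODEID and
   KEYID from the response, yet both sides still derive the same keys.

```python
# Toy sketch of the reordered handshake (NOT the real ntor protocol,
# and not secure; all parameters here are illustrative assumptions).
import hashlib
import secrets

P = 2**127 - 1   # toy prime modulus (far too small for real use)
G = 5            # toy generator

def keypair():
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)

def h(*parts):
    """Toy hash of all handshake inputs."""
    return hashlib.sha256(b"|".join(str(p).encode() for p in parts)).digest()

# Relay's long-term identity and onion key (normally learned from a
# descriptor; here the client will learn them from the response).
node_id = b"relay-R1-identity"
onion_priv, onion_pub = keypair()

# 1. Client initiates knowing neither NODEID nor KEYID: it sends only
#    its ephemeral public key CLIENT_PK.
x, client_pk = keypair()

# 2. Relay responds with SERVER_PK plus its NODEID and onion key, and
#    an AUTH value binding all of the public keys together.
y, server_pk = keypair()
shared_relay = (pow(client_pk, y, P), pow(client_pk, onion_priv, P))
auth = h(node_id, onion_pub, client_pk, server_pk, *shared_relay)

# 3. Only now does the client *complete* the handshake, using the keys
#    it just received, and check AUTH.
shared_client = (pow(server_pk, x, P), pow(onion_pub, x, P))
assert h(node_id, onion_pub, client_pk, server_pk, *shared_client) == auth
```

   Both sides compute the same shared secrets, so the relay's onion
   key was needed only to finish the handshake, never to start it.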

2.3. Extending by certified index

   Now I'll explain how the client can avoid having to download a list
   of relays entirely.

   First, let's look at how a client chooses a random relay today.
   The client puts all of the relays in a list, and computes a
   weighted bandwidth for each one.  For example, suppose the relay
   identities are R1, R2, R3, R4, and R5, and their bandwidth weights
   are 50, 40, 30, 20, and 10.  The client makes a table like this:

      Relay   Weight   Range of index values
      R1      50       0..49
      R2      40       50..89
      R3      30       90..119
      R4      20       120..139
      R5      10       140..149

   To choose a random relay, the client picks a random index value
   between 0 and 149 inclusive, and looks up the corresponding relay
   in the table.  For example, if the client's random number is 77, it
   will choose R2.  If its random number is 137, it chooses R4.
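
   As a minimal sketch, today's client-side selection amounts to a
   cumulative-weight lookup (the relay names and weights below are the
   example values from the table, not real network data):

```python
# Weighted selection by index, as in the example table above.
import random

relays = [("R1", 50), ("R2", 40), ("R3", 30), ("R4", 20), ("R5", 10)]

def pick_relay(index):
    """Map an index in [0, total_weight) to a relay by walking the
    cumulative weight ranges."""
    upper = 0
    for name, weight in relays:
        upper += weight
        if index < upper:
            return name
    raise ValueError("index out of range")

total = sum(w for _, w in relays)       # 150 in this example
choice = pick_relay(random.randrange(total))

# The worked examples from the text: index 77 -> R2, index 137 -> R4.
assert pick_relay(77) == "R2"
assert pick_relay(137) == "R4"
```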

   The key observation behind the "walking onions" design is that the
   client doesn't actually need to construct this table itself.
   Instead, we will have this table constructed by the authorities and
   distributed to all the relays.

   Here's how it works: let's have the authorities make a new kind of
   consensus-like thing.  We'll call it an Efficient Network Directory
   with Individually Verifiable Entries, or "ENDIVE" for short.  This
   will differ from the client's index table above in two ways.
   First, every entry in the ENDIVE is normalized so that the
   bandwidth weights sum to 2^32, making the maximum index value
   2^32-1:

      Relay   Normalized weight   Range of index values
      R1      0x55555546          0x00000000..0x55555545
      R2      0x44444438          0x55555546..0x9999997d
      R3      0x3333332a          0x9999997e..0xcccccca7
      R4      0x2222221c          0xcccccca8..0xeeeeeec3
      R5      0x1111113c          0xeeeeeec4..0xffffffff
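
   A sketch of the normalization step follows.  The rounding rule here
   (dump the rounding slack into the last entry) is an assumption of
   mine; the proposal doesn't yet specify how the authorities round,
   so the per-relay values differ slightly from the example table,
   but the ranges still tile 0x00000000..0xffffffff exactly:

```python
# Normalize bandwidth weights so the index ranges cover exactly
# 0 .. 2**32 - 1.  The rounding scheme is an illustrative assumption.
TOTAL = 2**32

def normalize(weights):
    total_in = sum(weights)
    scaled = [w * TOTAL // total_in for w in weights]
    scaled[-1] += TOTAL - sum(scaled)   # absorb rounding slack
    # Convert the normalized weights into inclusive (low, high) ranges.
    ranges, low = [], 0
    for s in scaled:
        ranges.append((low, low + s - 1))
        low += s
    return ranges

ranges = normalize([50, 40, 30, 20, 10])
assert ranges[0][0] == 0x00000000
assert ranges[-1][1] == 0xFFFFFFFF
```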

   Second, every entry in the ENDIVE is timestamped and signed by the
   authorities independently, so that when a client sees a line from
   the table above, it can verify that it came from an authentic
   ENDIVE.  When a client has chosen a random index, one of these
   entries will prove to the client that a given relay corresponds to
   that index.  Because of this property, we'll be calling these
   entries "Separable Network Index Proofs", or "SNIPs" for short.

   For example, a single SNIP from the table above might consist of:

      * A range of times during which this SNIP is valid
      * R1's identity
      * R1's ntor onion key
      * R1's address
      * The index range 0x00000000..0x55555545
      * A signature of all of the above, by a number of authorities

   Let's put it together.  Suppose that the client has a circuit from
   A->B, and it wants to extend to a new relay, chosen randomly,
   weighted by bandwidth.

   1. The client picks a random index value between 0 and 2**32 - 1.
      It sends that index to relay B in its EXTEND cell, along with a
      g^x value for the ntor handshake.

      Note: the client doesn't send an address or identity for the
      next relay, since it doesn't know what relay it has chosen!
      (The combination of an index and a g^x value is what I'm calling
      a "walking onion".)

   2. Now, relay B looks up the index in its most recent ENDIVE, to
      learn which relay the client selected.

      (For example, suppose that the client's random index value is
      0x50000001.  This index value falls within the range
      0x00000000..0x55555545 in the table above, so relay B sees that
      the client has chosen R1 as its next hop.)

   3. Relay B sends a CREATE cell to R1 as usual.  When it gets a
      CREATED reply, it includes the authority-signed SNIP for R1 as
      part of the EXTENDED cell.

   4. As part of verifying the handshake, the client verifies that the
      SNIP was signed by enough authorities, that its timestamp is
      recent enough, and that it actually corresponds to the random
      index that the client selected.
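
   Steps 2 and 4 above can be sketched as follows.  The SNIP fields
   and the binary-search lookup are illustrative assumptions about
   structures this proposal hasn't yet specified; the index ranges are
   the example values from the ENDIVE table:

```python
# Sketch of relay-side index lookup (step 2) and the client-side SNIP
# check (step 4).  Fields and lookup strategy are assumptions.
import bisect
from dataclasses import dataclass

@dataclass
class Snip:
    relay: str
    low: int          # inclusive start of the index range
    high: int         # inclusive end of the index range
    valid_until: int  # end of validity window (toy timestamp)

# Relay B's copy of the ENDIVE, sorted by index range.
endive = [
    Snip("R1", 0x00000000, 0x55555545, valid_until=1000),
    Snip("R2", 0x55555546, 0x9999997d, valid_until=1000),
    Snip("R3", 0x9999997e, 0xcccccca7, valid_until=1000),
    Snip("R4", 0xcccccca8, 0xeeeeeec3, valid_until=1000),
    Snip("R5", 0xeeeeeec4, 0xffffffff, valid_until=1000),
]

def lookup(index):
    """Step 2: relay B maps the client's random index to a SNIP."""
    pos = bisect.bisect_right([s.low for s in endive], index) - 1
    return endive[pos]

def client_accepts(snip, index, now, sig_ok):
    """Step 4: the client checks the signatures, the freshness, and
    that the SNIP's range really covers the index it chose."""
    return sig_ok and now <= snip.valid_until and snip.low <= index <= snip.high

snip = lookup(0x50000001)   # the worked example: falls in R1's range
assert snip.relay == "R1"
assert client_accepts(snip, 0x50000001, now=999, sig_ok=True)
```

   Note that the client rejects the SNIP if the range does not cover
   its own index, which is what stops a hostile relay from substituting
   a relay of its choosing.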

   Notice the properties we get with this design:

      - Clients can extend circuits without having a list of all the
        relays.

      - Because the client's random index needs to match a routing
        entry signed by the authorities, the client is still selecting
        a relay randomly by weight.  A hostile relay cannot choose
        which relay to send the client to.

   On a failure to extend, a relay should still report the routing
   entry for the other relay that it couldn't connect to.  As before,
   a client will start a new circuit if a partially constructed
   circuit is a partial failure.

   We could achieve a reliability/security tradeoff by letting clients
   offer the relay a choice of two or more indices to extend to.  This
   would help reliability, but give the relay more influence over the
   path.  We'd need to analyze this impact.

   In the next section, I'll discuss a bunch of details that we need
   to straighten out in order to make this design work.

3. Sorting out the details

3.1. Will these routing entries fit in EXTEND2 and EXTENDED2 cells?

   The EXTEND2 cell is probably big enough for this design.  The
   random index that the client sends can be a new "link specifier"
   type, replacing the IP and identity link specifiers.

   The EXTENDED2 cell is likely to need to grow here.  We'll need to
   implement proposal 249 ("Allow CREATE cells with >505 bytes of
   handshake data") so that EXTEND2 and EXTENDED2 cells can be larger.

3.2. How should SNIPs be signed?

   We have a few options, and I'd like to look into the possibilities
   here more closely.

   The simplest possibility is to use **multiple signatures** on each
   SNIP, the way we do today for consensuses.  These signatures should
   be made using medium-term Ed25519 keys from the authorities.  At a
   cost of 64 bytes per signature, with 9 authorities, we would need
   576 bytes for each SNIP.  These signatures could be batch-verified
   to save time on the client side.  Since generating a signature
   takes around 20 usec on my mediocre laptop, authorities should be
   able to generate this many signatures fairly easily.

   Another possibility is to use a **threshold signature** on each
   SNIP, so that the authorities collaboratively generate a short
   signature that the clients can verify.  There are multiple
   threshold signature schemes that we could consider here, though I
   haven't yet found one that looks perfect.

   Another possibility is to organize the SNIPs in a **merkle tree
   with a signed root**.  For this design, clients could download the
   signed root periodically, and receive the hash-path from the signed
   root to the SNIP.  This design might help with
   certificate-transparency-style designs, and it would be necessary
   if we ever want to move to a postquantum signature algorithm that
   requires large signatures.

   Another possibility (due to a conversation among Chelsea Komlo,
   Sajin Sasy, and Ian Goldberg) is to **use SNARKs**.  (Why not? All
   the cool kids are doing it!)  For this, we'd have the clients
   download a signed hash of the ENDIVE periodically, and have the
   authorities generate a SNARK for each SNIP, proving its presence in
   that document.

3.3. How can we detect authority misbehavior?

   We might want to take countermeasures against the possibility that
   a quorum of corrupt or compromised authorities gives some relays a
   different set of SNIPs than it gives other relays.

   If we incorporate a merkle tree or a hash chain in the design, we
   can use mechanisms similar to certificate transparency to ensure
   that the authorities have a consistent log of all the entries that
   they have ever handed out.

3.4. How many types of weighted node selection are there, and how do
     we handle them?

   Right now, there are multiple weights that we use in Tor:

      * Weight for exit
      * Weight for guard
      * Weight for middle node

   We also filter nodes for several properties, such as the flags
   they have.

   To reproduce this behavior, we should enumerate the various weights
   and filters that we use, and (if there are not too many) create a
   separate index for each.  For example, the Guard index would weight
   every node for selection as a guard, assigning 0 weight to
   non-Guard nodes.  The Exit index would weight every node for
   selection as an exit, assigning 0 weight to non-Exit nodes.
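
   The per-role index construction above can be sketched as follows.
   The flag names and the simple "bandwidth or zero" weighting rule
   are illustrative assumptions; real Tor additionally applies
   position-dependent weight fractions that this sketch omits:

```python
# Sketch of building a separate selection index per role, giving zero
# weight to relays that lack the relevant flag (assumed rule).
relays = [
    # (name, bandwidth, flags)
    ("R1", 50, {"Guard", "Exit"}),
    ("R2", 40, {"Guard"}),
    ("R3", 30, {"Exit"}),
    ("R4", 20, set()),
]

def build_index(flag):
    """Return (name, low, high) index ranges for one role; relays
    without the flag get zero weight and are omitted entirely."""
    table, low = [], 0
    for name, bw, flags in relays:
        weight = bw if flag in flags else 0
        if weight:
            table.append((name, low, low + weight - 1))
            low += weight
    return table

guard_index = build_index("Guard")   # only R1 and R2 appear
exit_index = build_index("Exit")     # only R1 and R3 appear
```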

   When choosing a relay, the client would have to specify which index
   to use.  We could either have a separate (labeled) set of SNIPs for
   each index, or we could have each SNIP contain a separate (labeled)
   index range for each index.

   REGRESSION: the client's choice of which index to use would leak
   the next router's position and purpose in the circuit.  This is
   information that we believe relays can infer today, but leaking it
   is not a desired feature.

3.5. Does this design break onion service introduction handshakes?

   In rend-spec-v3.txt section 3.3.2, we specify a variant of ntor for
   use in INTRODUCE2 handshakes.  It allows the client to send
   encrypted data as part of its initial ntor handshake, but requires
   the client to know the onion service's onion key before it sends
   its initial handshake.

   That won't be a problem for us here, though: we still require
   clients to fetch onion service descriptors before contacting an
   onion service.

3.6. How does the onion service directory work here?

   The onion service directory is implemented as a hash ring, where
   each relay's position in the hash ring is decided by a hash of its
   identity, the current date, and a shared random value that the
   authorities compute each day.

   To implement this hash ring using walking onions, we would need to
   have an extra index based not on bandwidth, but on position in the
   hash ring.  Onion services and clients could then build a circuit,
   and extend it one more hop, specifying their desired index in the
   hash ring.

   We could either have a command to retrieve a trio of hashring-based
   routing entries by index, or to retrieve (or connect to?) the n'th
   item after a given hashring entry.
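
   A sketch of such a position-based index follows.  The hash inputs
   mirror the text (identity, date, shared random value), but the
   exact encoding and the successor rule are assumptions of mine, not
   the rend-spec-v3 formulas:

```python
# Sketch of a hash-ring index: positions come from a hash of identity,
# date, and the daily shared random value (encoding is illustrative).
import hashlib

def ring_position(identity, date, shared_random):
    """A relay's position on the ring, as a 64-bit integer."""
    data = identity + b"|" + date + b"|" + shared_random
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def ring_successors(identities, date, srv, target, n=3):
    """Return the n relay identities at or after `target` on the ring,
    wrapping around to the start if necessary."""
    ring = sorted((ring_position(i, date, srv), i) for i in identities)
    after = [ident for pos, ident in ring if pos >= target]
    wrapped = [ident for _, ident in ring]
    return (after + wrapped)[:n]
```

   A "retrieve the trio after a given entry" command would then amount
   to a ring_successors() call with n=3 on the relay side.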

3.7. How can clients choose guard nodes?

   We can reuse the fallback directories here.  A newly bootstrapping
   client would connect to a fallback directory, build a three-hop
   circuit, and finally extend the three-hop circuit by indexing to a
   random guard node.  The random guard node's SNIP would contain the
   information that the client needs to build real circuits through
   that guard in the future.  Because the client would be building a
   three-hop circuit, the fallback directory would not learn the
   client's guards.

   (Note that even if the extend attempt fails, we should still pick
   the node as a possible guard based on its routing entry, so that
   other nodes can't veto our choice of guards.)

3.8. Does the walking onions design preclude postquantum circuit
     handshakes?

   Not at all!  Both proposal 263 (ntru) and proposal 270 (newhope)
   work by having the client generate an ephemeral key as part of its
   initial handshake.  The client does not need to know the relay's
   onion key to do this, so we can still integrate those proposals
   with this one.

3.9. Does the walking onions design stop us from changing the network
     topology?

   For Tor to continue to scale, we will someday need to accept that
   not every relay can be simultaneously connected to every other
   relay.  Therefore, we will need to move from our current clique
   topology assumption to some other topology.

   There are also proposals to change node selection rules to generate
   routes providing better performance, or improved resistance to
   local adversaries.

   We can, I think, implement this kind of proposal by changing the
   way that ENDIVEs are generated.  Instead of giving every relay the
   same ENDIVE, the authorities would generate different ENDIVEs for
   different relays, depending on the probability distribution of
   which relay should be chosen after which in the network topology.
   In the extreme case, this would produce O(n) ENDIVEs and O(n^2)
   SNIPs.  In practice, I hope that we could do better by having the
   network topology be non-clique, and by having many relays share
   the same distribution of successors.

3.10. How can clients handle exit policies?

   This is an unsolved challenge.  If the client tells the middle
   relay its target port, it leaks information inappropriately.

   One possibility is to try to gather exit policies into common
   categories, such as "most ports supported" and "most common ports
   supported".

   Another (inefficient) possibility is for clients to keep trying
   exits until they find one that works.

   Another (inefficient) possibility is to require that clients who
   use unusual ports fall back to the old mechanism for route
   selection.

3.11. Can this approach support families?

   This is an unsolved challenge.

   One (inefficient) possibility is for clients to generate circuits
   and discard those that use multiple relays in the same family.

   One (not quite compatible) possibility is for the authorities to
   sort the ENDIVE so that relays in the same family are adjacent to
   one another.  The index-bounds part of each SNIP would also have
   to include the bounds of the family.  This approach is not quite
   compatible with the status quo, because it prevents relays from
   belonging to more than one family.

   One interesting possibility (due to Chelsea Komlo, Sajin Sasy, and
   Ian Goldberg) is for the middle node to take responsibility for
   family enforcement.  In this design, the client might offer the
   middle node multiple options for the next relay's index, and the
   middle node would choose the first such relay that is neither in
   its own family nor in its predecessor's family.  We'd need to look
   for a way to make sure that the middle node wasn't biasing the
   path selection.
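
   A sketch of that middle-node rule follows.  The candidate format,
   the lookup callback, and the family representation are all
   illustrative assumptions, not specified structures:

```python
# Sketch of middle-node family enforcement: take the first offered
# index whose relay is in neither excluded family (assumed rule).
def pick_next_hop(candidate_indices, lookup_snip, my_family, pred_family):
    """Return the SNIP for the first acceptable candidate index, or
    None if every candidate's relay is in an excluded family."""
    excluded = my_family | pred_family
    for index in candidate_indices:
        snip = lookup_snip(index)
        if snip["relay"] not in excluded:
            return snip
    return None  # the client would have to rebuild the circuit

# Toy ENDIVE lookup: index -> SNIP-like dict.
snips = {1: {"relay": "X"}, 2: {"relay": "Y"}, 3: {"relay": "Z"}}
chosen = pick_next_hop([1, 2, 3], snips.get,
                       my_family={"X"}, pred_family={"Y"})
```

   Note that always taking the *first* acceptable candidate is exactly
   the kind of deterministic choice whose path-selection bias would
   need the analysis mentioned above.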

   (TODO: come up with more ideas here.)

3.12. Can walking onions support IP-based and country-based
      restrictions?

   This is an unsolved challenge.

   If the user's restrictions do not exclude most paths, one
   (inefficient) possibility is for the user to generate paths until
   they generate one that they like.  This idea becomes inefficient
   if the user is excluding most paths.

   Another (inefficient and fingerprintable) possibility is to require
   that clients who use complex path restrictions fall back to the old
   mechanism for route selection.

   (TODO: come up with better ideas here.)

3.13. What scaling problems have we not solved with this design?

   The walking onions design doesn't solve (on its own) the problem
   that the authorities need to know about every relay, and arrange
   to have every relay tested.

   The walking onions design doesn't solve (on its own) the problem
   that relays need to have a list of all the relays.  (But see
   section 3.9 above.)

3.14. Should we still have clients download a consensus when they're
      using walking onions?

   There are some fields in the current consensus directory documents
   that clients will still need, like the list of supported protocols
   and network parameters.  A client that uses walking onions should
   download a new flavor of consensus document that contains only
   these fields, and does not list any relays.  In some signature
   schemes, this consensus would contain a digest of the ENDIVE --
   see section 3.2 above.

   (Note that this document would be a "consensus document" but not a
   "consensus directory", since it doesn't list any relays.)

4. Putting it all together

   [This is the section where, in a later version of this proposal, I
   would specify the exact behavior and data formats to be used here.
   Right now, I'd say we're too early in the design phase.]

A.1. Acknowledgments

   Thanks to Peter Palfrader for his original design in proposal 141,
   and to the designers of PIR-Tor, both of which inspired aspects of
   this Walking Onions design.

   Thanks to Chelsea Komlo, Sajin Sasy, and Ian Goldberg for feedback
   on an earlier version of this design.

   Thanks to David Goulet, Teor, and George Kadianakis for commentary
   on earlier versions of this draft.

A.2. Additional ideas

   Teor notes that there are ways to try to get this idea to apply to
   one-pass circuit construction, something like the old onion design.
   We might even be able to derive indices and keys from the same
   seeds.  I don't see a way to do this without losing forward
   secrecy, but it might be worth looking at harder.