  1. Filename: 302-padding-machines-for-onion-clients.txt
  2. Title: Hiding onion service clients using padding
  3. Author: George Kadianakis, Mike Perry
  4. Created: Thursday 16 May 2019
  5. Status: Closed
  6. Implemented-In: 0.4.1.1-alpha
  7. NOTE: Please look at section 3 of padding-spec.txt now, not this document.
  8. 0. Overview
  9. Tor clients use "circuits" to do anonymous communications. There are various
  10. types of circuits. Some of them are for navigating the normal Internet,
  11. others are for fetching Tor directory information, others are for connecting
  12. to onion services, while others are simply for measurements and testing.
  13. It's currently possible for MITM-type adversaries (like Tor-network-level
  14. and local-area-network adversaries) to distinguish Tor circuit types from
  15. each other using a wide array of metadata and distinguishers.
  16. In this proposal, we study various techniques that can be used to
  17. distinguish client-side onion service circuits and provide WTF-PAD circuit
  18. padding machines (using prop#254) to hide them against certain adversaries.
  19. 1. Motivation
  20. We are writing this proposal for various reasons:
  21. 1) We believe that in an ideal setting MITM adversaries should not be able
  22. to distinguish circuit types by inspecting traffic. Tor traffic should
  23. look amorphous to an outside observer to maximize uncertainty and
  24. anonymity properties.
  25. Client-side onion service circuits are an easy target for this proposal,
  26. because we believe we can improve their privacy with low bandwidth
  27. overhead.
  28. 2) We want to start experimenting with the WTF-PAD subsystem of Tor, and
  29. this use-case provides us with a good testbed.
  30. 3) We hope that by actually starting to use the WTF-PAD subsystem of Tor, we
  31. will encourage more researchers to start experimenting with it.
  32. 2. Scope of the proposal [SCOPE]
  33. Given the above, this proposal sets forth to use the WTF-PAD system to hide
  34. client-side onion service circuits against the classifiers of the paper by
  35. Kwon et al. [0].
  36. By client-side onion service circuits we refer to these two types of circuits:
  37. - Client-side introduction circuits: Circuit from client to the introduction point
  38. - Client-side rendezvous circuits: Circuit from client to the rendezvous point
  39. Service-side onion service circuits are not in scope for this proposal, and
  40. this is because hiding those would require more bandwidth and also more
  41. advanced WTF-PAD features.
  42. Furthermore, this proposal only aims to cloak the naive distinguishing
  43. features mentioned in the [KNOWN_DISTINGUISHERS] section, and can by no
  44. means guarantee that client-side onion service circuits are totally
  45. indistinguishable by other means.
  46. The machines specified in this proposal are meant to be lightweight and
  47. created for a specific purpose. This means that they can be easily extended
  48. with additional states to do more advanced hiding.
  49. 3. Known distinguishers against onion service circuits [KNOWN_DISTINGUISHERS]
  50. Over the past years it's been assumed that motivated adversaries can
  51. distinguish onion-service traffic from normal Tor traffic given their
  52. special characteristics.
  53. As far as we know, there has been relatively little research-level work done
  54. in this direction. The main article published in this area is the USENIX
  55. paper "Circuit Fingerprinting Attacks: Passive Deanonymization of Tor Hidden
  56. Services" by Kwon et al. [0]
  57. The above paper deals with onion service circuits in sections 3.2 and 5.1.
  58. It uses the following three "naive" circuit features to distinguish circuits:
  59. 1) Circuit construction sequence
  60. 2) Number of incoming and outgoing cells
  61. 3) Duration of Activity ("DoA")
  62. All onion service circuits have particularly loud signatures in the above
  63. characteristics, but WTF-PAD (prop#254) gives us tools to effectively
  64. silence those signatures to the point where the paper's classifiers won't
  65. work.
  66. 4. Hiding circuit features using WTF-PAD
  67. According to section [KNOWN_DISTINGUISHERS] there are three circuit features
  68. we are attempting to hide. Here is how we plan to do this using the WTF-PAD
  69. system:
  70. 1) Circuit construction sequence
  71. The USENIX paper uses the directions of the first 10 cells sent in a
  72. circuit to fingerprint them. Client-side onion service circuits have
  73. unique circuit construction sequences and hence they can be fingerprinted
  74. using just the first 10 cells.
  75. We use WTF-PAD to destroy this feature of onion service circuits by
  76. carefully sending padding cells (relay DROP cells) during circuit
  77. construction and making them look exactly like most general Tor circuits
  78. up till the end of the circuit construction sequence.
  79. 2) Number of incoming and outgoing cells
  80. The USENIX paper uses the amount of incoming and outgoing cells to
  81. distinguish circuit types. For example, client-side introduction circuits
  82. have the same amount of incoming and outgoing cells, whereas client-side
  83. rendezvous circuits have more incoming than outgoing cells.
  84. We use WTF-PAD to destroy this feature by changing the number of cells
  85. sent in introduction circuits. We leave rendezvous circuits as is, since
  86. the actual rendezvous traffic flow usually closely resembles that of normal
  87. Tor circuits.
  88. 3) Duration of Activity ("DoA")
  89. The USENIX paper uses the period of time during which circuits send and
  90. receive cells to distinguish circuit types. For example, client-side
  91. introduction circuits are really short lived, whereas service-side
  92. introduction circuits are very long lived. OTOH, rendezvous circuits have
  93. the same median lifetime as general Tor circuits which is 10 minutes.
  94. We use WTF-PAD to destroy this feature of client-side introduction
  95. circuits by setting a special WTF-PAD option, which keeps the circuits
  96. open for 10 minutes completely mimicking the DoA of general Tor circuits.
  97. 4.1. A dive into general circuit construction sequences [CIRCCONSTRUCTION]
  98. In this section we give an overview of what circuit construction looks like
  99. to a network or guard-level adversary. We use this knowledge to make the
  100. right padding machines that can make intro and rend circuits look like these
  101. general circuits.
  102. In particular, most general Tor circuits used to surf the web or download
  103. directory information start with the following 6-cell relay cell sequence (cells
  104. surrounded in [brackets] are outgoing, the others are incoming):
  105. [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [BEGIN] -> CONNECTED
  106. When this is done, the client has established a 3-hop circuit and also
  107. opened a stream to the other end. Usually after this comes a series of DATA
  108. cells that either fetch pages, establish an SSL connection or fetch
  109. directory information:
  110. [DATA] -> [DATA] -> DATA -> DATA
  111. The above stream of 10 relay cells defines the grand majority of general
  112. circuits that come out of Tor browser during our testing, and it's what we
  113. are going to use to make introduction and rendezvous circuits blend in.
  114. Please note that in this section we only investigate relay cells and not
  115. connection-level cells like CREATE/CREATED or AUTHENTICATE/etc. that are
  116. used during the link-layer handshake. The rationale is that connection-level
  117. cells depend on the type of guard used and are not an effective fingerprint
  118. for a network/guard-level adversary.
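
The 10-cell sequence above can be reduced to the only thing the adversary in
this section is assumed to see: cell directions. The sketch below is an
illustration only; the tuple layout and the '+'/'-' encoding are assumptions
made for this example and are not part of Tor.

```python
# Direction-only view of the 10-cell general circuit sequence.
# "out" = sent by the client (bracketed in the text), "in" = received.
GENERAL_CIRCUIT = [
    ("EXTEND2", "out"), ("EXTENDED2", "in"),
    ("EXTEND2", "out"), ("EXTENDED2", "in"),
    ("BEGIN", "out"), ("CONNECTED", "in"),
    ("DATA", "out"), ("DATA", "out"),
    ("DATA", "in"), ("DATA", "in"),
]


def direction_trace(cells):
    """Collapse a cell list into a string of '+' (outgoing) and '-' (incoming)."""
    return "".join("+" if direction == "out" else "-" for _, direction in cells)


print(direction_trace(GENERAL_CIRCUIT))  # +-+-+-++--
```

Any circuit whose first 10 relay cells produce the same trace is
indistinguishable from a general circuit under this particular feature.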
  119. 5. WTF-PAD machines
  120. For the purposes of this proposal we will make use of four WTF-PAD machines
  121. as follows:
  122. - Client-side introduction circuit hiding machine (origin-side)
  123. - Client-side introduction circuit hiding machine (relay-side)
  124. - Client-side rendezvous circuit hiding machine (origin-side)
  125. - Client-side rendezvous circuit hiding machine (relay-side)
  126. In the following sections we will analyze these machines.
  127. 5.1. Client-side introduction circuit hiding machines [INTRO_CIRC_HIDING]
  128. These two machines are meant to hide client-side introduction circuits. The
  129. origin-side machine sits on the client and sends padding towards the
  130. introduction circuit, whereas the relay-side machine sits on the middle-hop
  131. (second hop of the circuit) and sends padding towards the client. The
  132. padding from the origin-side machine terminates at the middle-hop and does
  133. not get forwarded to the actual introduction point.
  134. Both of these machines only get activated for introduction circuits, and
  135. only after an INTRODUCE1 cell has been sent out.
  136. This means that before the machine gets activated our cell flow looks like this:
  137. [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [INTRODUCE1]
  138. Comparing the above with section [CIRCCONSTRUCTION], we see that the above
  139. cell sequence matches the one from general circuits up to the first 7 cells.
  140. However, in normal introduction circuits this is followed by an
  141. INTRODUCE_ACK and then the circuit gets torn down, which does not match
  142. the sequence from [CIRCCONSTRUCTION].
  143. Hence when our machine is used, after sending an [INTRODUCE1] cell, we also
  144. send a [PADDING_NEGOTIATE] cell, which gets answered by a PADDING_NEGOTIATED
  145. cell and an INTRODUCE_ACK cell. This makes us match the [CIRCCONSTRUCTION]
  146. sequence up to the first 10 cells.
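
As a rough check of the "7 cells" and "10 cells" claims, the sketch below
compares direction traces of an unpadded and a padded introduction circuit
with the general sequence from [CIRCCONSTRUCTION]. The helper names and the
'+'/'-' encoding are assumptions made for this example. The two incoming
cells after [PADDING_NEGOTIATE] are listed in the order the text gives them;
their relative order does not affect the direction trace.

```python
GENERAL = "+-+-+-++--"  # first 10 cells of a general circuit: '+' out, '-' in


def trace(cells):
    """'[NAME]' marks an outgoing cell, as in the text; anything else is incoming."""
    return "".join("+" if c.startswith("[") else "-" for c in cells)


def matching_prefix(a, b):
    """Number of leading positions where two direction traces agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SETUP = ["[EXTEND2]", "EXTENDED2"] * 3 + ["[INTRODUCE1]"]
unpadded = SETUP + ["INTRODUCE_ACK"]
padded = SETUP + ["[PADDING_NEGOTIATE]", "PADDING_NEGOTIATED", "INTRODUCE_ACK"]

print(matching_prefix(trace(unpadded), GENERAL))  # 7
print(matching_prefix(trace(padded), GENERAL))    # 10
```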
  147. After that, we continue sending padding from the relay-side machine so as to
  148. fake a directory download, or an SSL connection setup. We also want to
  149. continue sending padding so that the connection stays up longer to destroy
  150. the "Duration of Activity" fingerprint.
  151. To calculate the padding overhead, we see that the origin-side machine just
  152. sends a single [PADDING_NEGOTIATE] cell, whereas the relay-side machine
  153. sends a PADDING_NEGOTIATED cell and between 7 and 10 DROP cells. This means
  154. that the average overhead of this machine is 11 padding cells.
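
The figure of 11 cells can be reproduced with a short calculation. This
sketch assumes the number of DROP cells is spread uniformly over 7..10; the
text only gives the range, so the distribution is an assumption made here.

```python
import math
from statistics import mean

origin_cells = 1            # one PADDING_NEGOTIATE from the client
relay_fixed_cells = 1       # one PADDING_NEGOTIATED from the middle hop
drop_counts = range(7, 11)  # 7, 8, 9 or 10 DROP cells (assumed uniform)

average = origin_cells + relay_fixed_cells + mean(drop_counts)
print(average)              # 10.5
print(math.ceil(average))   # 11
```

Under that assumption the mean is 10.5 cells, which the proposal rounds up
to 11.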
  155. In terms of WTF-PAD terminology, these machines have three states (START,
  156. OBF, END). They move from the START to OBF state when the first
  157. non-padding cell is received on the circuit, and they stay in the OBF
  158. state until all the padding gets depleted. The OBF state is controlled by
  159. a histogram which specifies the parameters described in the paragraphs
  160. above. After all the padding finishes, it moves to END state.
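
The state transitions described above can be modelled as a toy state
machine. This is only a sketch of the relay-side behaviour: the class,
method names and the uniform 7..10 padding budget are hypothetical, the
histogram that controls padding timing is not modelled, and none of this is
Tor's actual circuit padding API.

```python
import random
from enum import Enum


class State(Enum):
    START = "start"
    OBF = "obf"
    END = "end"


class IntroHidingMachine:
    """Toy model of the START -> OBF -> END machine (hypothetical names)."""

    def __init__(self, rng=None):
        self.state = State.START
        self.rng = rng if rng is not None else random.Random()
        self.remaining = 0  # padding cells left to send while in OBF
        self.sent = 0       # padding cells sent so far

    def on_nonpadding_received(self):
        # The first non-padding cell moves START -> OBF and fixes the budget.
        if self.state is State.START:
            self.remaining = self.rng.randint(7, 10)  # assumed uniform 7..10
            self.state = State.OBF

    def on_padding_timer(self):
        # Each timer expiry in OBF sends one DROP cell; depletion moves to END.
        if self.state is not State.OBF:
            return False
        self.remaining -= 1
        self.sent += 1
        if self.remaining == 0:
            self.state = State.END
        return True


machine = IntroHidingMachine(random.Random(0))
machine.on_nonpadding_received()
while machine.on_padding_timer():
    pass
print(machine.state.name, machine.sent)
```

After the loop the machine is in END and has sent between 7 and 10 padding
cells, whatever the seed.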
  161. We also set a special WTF-PAD flag which keeps the circuit open even after
  162. the introduction is performed. In particular, with this feature the circuit
  163. will stay alive for the same duration as normal web circuits before they
  164. expire (usually 10 minutes).
  165. 5.2. Client-side rendezvous circuit hiding machines
  166. The rendezvous circuit machines apply to client-side rendezvous circuits and
  167. only after the rendezvous point has been established (REND_ESTABLISHED has
  168. been received). Up to that point, the following cell sequence has been
  169. observed on the circuit:
  170. [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [ESTABLISH_REND] -> REND_ESTABLISHED
  171. which matches the general circuit construction sequence [CIRCCONSTRUCTION]
  172. up to the first 6 cells. However after that, normal rendezvous circuits
  173. receive a RENDEZVOUS2 cell followed by a [BEGIN] and a CONNECTED, which does
  174. not fit the circuit construction sequence we are trying to imitate.
  175. Hence our machine gets activated right after REND_ESTABLISHED is received,
  176. and continues by sending a [PADDING_NEGOTIATE] and a [DROP] cell, before
  177. receiving a PADDING_NEGOTIATED and a DROP cell, effectively blending into
  178. the general circuit construction sequence on the first 10 cells.
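
The same kind of check works for rendezvous circuits. The sketch below uses
the '+'/'-' direction encoding (an assumption of this example, not a Tor
format) to confirm that an unpadded rendezvous circuit diverges from the
general sequence after 6 cells, while the padded one matches all 10.

```python
from itertools import takewhile

GENERAL = "+-+-+-++--"  # first 10 cells of a general circuit: '+' out, '-' in


def trace(cells):
    """'[NAME]' marks an outgoing cell, as in the text; anything else is incoming."""
    return "".join("+" if c.startswith("[") else "-" for c in cells)


def matching_prefix(a, b):
    """Count leading positions where both traces carry the same direction."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


SETUP = ["[EXTEND2]", "EXTENDED2"] * 2 + ["[ESTABLISH_REND]", "REND_ESTABLISHED"]
unpadded = SETUP + ["RENDEZVOUS2", "[BEGIN]", "CONNECTED"]
padded = SETUP + ["[PADDING_NEGOTIATE]", "[DROP]", "PADDING_NEGOTIATED", "DROP"]

print(matching_prefix(trace(unpadded), GENERAL))  # 6
print(matching_prefix(trace(padded), GENERAL))    # 10
```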
  179. After that our machine gets deactivated, and we let the actual rendezvous
  180. circuit shape the traffic flow. Since rendezvous circuits usually imitate
  181. general circuits (their purpose is to surf the web), we can expect that they
  182. will look alike.
  183. In terms of overhead, this machine is quite light. Both sides send 2 padding
  184. cells, for a total of 4 padding cells.
  185. 6. Overhead analysis
  186. Given the parameters above, intro circuit machines have an overhead of 11
  187. padding cells, and rendezvous circuit machines have an overhead of 4
  188. padding cells. This means that for every intro and rendezvous circuit
  189. there will be an overhead of 15 padding cells on average, which is about
  190. 7.5KB.
  191. In the PrivCount paper [1] we learn that the Tor network sees about 12
  192. million successful descriptor fetches per day. We can use this figure to
  193. assume that the Tor network also sees about 12 million intro and rendezvous
  194. circuits per day. Given the 7.5KB overhead of each of these circuits, we get
  195. that our padding machines impose an additional 94GB overhead per day on the
  196. network, which is about 3.9GB per hour.
  197. XXX Isn't this kinda intense????? Using the graphs from metrics we see that
  198. the Tor network has total capacity of 300 Gbit/s which is about 135000GB per
  199. hour, so 3.9GB per hour is not that much, but still...
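
The arithmetic in this section can be re-derived with a few lines. The
sketch assumes 512-byte cells (which is what the 7.5KB figure for 15 cells
implies) and decimal gigabytes; both are assumptions of this example. With
those choices the daily and hourly totals come out slightly below the
rounded figures quoted above, but in the same range.

```python
CELL_BYTES = 512                  # assumption: implied by 15 cells ~ 7.5KB
CELLS_PER_PAIR = 11 + 4           # intro machines + rendezvous machines
CIRCUIT_PAIRS_PER_DAY = 12_000_000
CAPACITY_GBIT_PER_SEC = 300

bytes_per_pair = CELLS_PER_PAIR * CELL_BYTES
daily_gb = bytes_per_pair * CIRCUIT_PAIRS_PER_DAY / 1e9
hourly_gb = daily_gb / 24
capacity_gb_per_hour = CAPACITY_GBIT_PER_SEC / 8 * 3600
share = hourly_gb / capacity_gb_per_hour

print(bytes_per_pair)             # 7680
print(round(daily_gb, 2))         # 92.16
print(round(hourly_gb, 2))        # 3.84
print(capacity_gb_per_hour)       # 135000.0
print(f"{share:.4%} of network capacity")
```

Seen this way, the padding amounts to a few thousandths of a percent of the
capacity figure used above.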
  200. 7. Discussion
  201. 7.1. Alternative approaches
  202. These machines try to hide onion service client-side circuits by obfuscating
  203. their looks. This is a reasonable approach, but if the resulting circuits
  204. look unlike any other Tor circuits, they would still be fingerprintable just
  205. by that fact.
  206. Another approach we could take is to make normal client circuits look like
  207. onion service circuits, or just make normal clients establish fake onion
  208. service circuits periodically. The hope here is that the adversary won't be
  209. able to distinguish fake onion service circuits from real ones. This
  210. approach has not been taken yet, mainly because it requires additional
  211. WTF-PAD features and poses greater overhead risks.
  212. 7.2. Future work
  213. As discussed in [SCOPE], this proposal only aims to hide some very specific
  214. features of client-side onion service circuits. There is lots of work to be
  215. done here to see what other features can be used to distinguish such
  216. circuits, and also what other classifiers can be built using deep learning
  217. and whatnot.
  218. ---
  219. [0]: https://www.usenix.org/node/190967
  220. https://blog.torproject.org/technical-summary-usenix-fingerprinting-paper
  221. [1]: "Understanding Tor Usage with Privacy-Preserving Measurement"
  222. by Akshaya Mani, T Wilson-Brown, Rob Jansen, Aaron Johnson, and Micah Sherr
  223. In Proceedings of the Internet Measurement Conference 2018 (IMC 2018).