  1. BridgeDB specification
  2. Karsten Loesing
  3. Nick Mathewson
  4. 0. Preliminaries
  5. This document specifies how BridgeDB processes bridge descriptor files
  6. to learn about new bridges, maintains persistent assignments of bridges
  7. to distributors, and decides which bridges to give out upon user
  8. requests.
  9. Some of the decisions here may be suboptimal: this document is meant to
  10. specify current behavior as of August 2013, not to specify ideal
  11. behavior.
  12. 1. Importing bridge network statuses and bridge descriptors
  13. BridgeDB learns about bridges by parsing bridge network statuses,
  14. bridge descriptors, and extra info documents as specified in Tor's
  15. directory protocol. BridgeDB parses one bridge network status file
  16. first and at least one bridge descriptor file and potentially one extra
  17. info file afterwards.
  18. BridgeDB scans its files when it receives a SIGHUP signal.
  19. BridgeDB does not validate signatures on descriptors or networkstatus
  20. files: the operator needs to make sure that these documents have come
  21. from a Tor instance that did the validation for us.
  22. 1.1. Parsing bridge network statuses
  23. Bridge network status documents state which bridges are known to the
  24. bridge authority and which flags the bridge authority assigns to
  25. them.
  26. We expect bridge network statuses to contain at least the following
  27. lines for every bridge in the given order (format fully specified in Tor's
  28. directory protocol):
  29. "r" SP nickname SP identity SP digest SP publication SP IP SP ORPort
  30. SP DirPort NL
  31. "a" SP address ":" port NL (no more than 8 instances)
  32. "s" SP Flags NL
  33. BridgeDB parses the identity and the publication timestamp from the "r"
  34. line, the OR address(es) and ORPort(s) from the "a" line(s), and the
  35. assigned flags from the "s" line, specifically checking the assignment
  36. of the "Running" and "Stable" flags.
  37. BridgeDB memorizes all bridges that have the Running flag as the set of
  38. running bridges that can be given out to bridge users.
  39. BridgeDB memorizes assigned flags if it wants to ensure that sets of
  40. bridges given out contain at least a given number of bridges
  41. with these flags.
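The parsing steps above can be sketched as follows. This is a minimal illustration, not BridgeDB's actual parser: the names `StatusEntry` and `parse_network_status` are hypothetical, and the sketch assumes the identity in the "r" line is base64-encoded without padding, as in Tor's directory protocol.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class StatusEntry:
    fingerprint: str            # hex-encoded identity from the "r" line
    published: str              # publication timestamp from the "r" line
    or_addresses: list = field(default_factory=list)  # from "a" lines
    flags: set = field(default_factory=set)           # from the "s" line


def parse_network_status(text):
    """Return fingerprint -> StatusEntry for bridges with the Running flag."""
    entries, current = [], None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "r" and len(parts) >= 9:
            # r nickname identity digest date time IP ORPort DirPort
            ident = parts[2]
            raw = base64.b64decode(ident + "=" * (-len(ident) % 4))
            current = StatusEntry(raw.hex().upper(), parts[4] + " " + parts[5])
            entries.append(current)
        elif parts[0] == "a" and current is not None:
            # No more than 8 "a" lines are expected per bridge.
            if len(current.or_addresses) < 8:
                current.or_addresses.append(parts[1])
        elif parts[0] == "s" and current is not None:
            current.flags.update(parts[1:])
    return {e.fingerprint: e for e in entries if "Running" in e.flags}
```

Only entries carrying the Running flag are returned, which mirrors the "set of running bridges" described above; the remaining flags stay available for later flag-based selection.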
  42. 1.2. Parsing bridge descriptors
  43. BridgeDB learns about a bridge's most recent IP address and OR port
  44. from parsing bridge descriptors.
  45. In theory, both IP address and OR port of a bridge are also contained
  46. in the "r" line of the bridge network status, so there is no mandatory
  47. reason for parsing bridge descriptors. But the functionality described
  48. in this section is still implemented in case we need data from the
  49. bridge descriptor in the future.
  50. Bridge descriptor files may contain one or more bridge descriptors.
  51. We expect a bridge descriptor to contain at least the following lines in
  52. the stated order:
  53. "@purpose" SP purpose NL
  54. "router" SP nickname SP IP SP ORPort SP SOCKSPort SP DirPort NL
  55. "published" SP timestamp NL
  56. ["opt" SP] "fingerprint" SP fingerprint NL
  57. "router-signature" NL Signature NL
  58. BridgeDB parses the purpose, IP, ORPort, nickname, and fingerprint
  59. from these lines.
  60. BridgeDB skips bridge descriptors if the fingerprint is not contained
  61. in the bridge network status parsed earlier or if the bridge does not
  62. have the Running flag.
  63. BridgeDB discards bridge descriptors which have a different purpose
  64. than "bridge". BridgeDB can be configured to accept only descriptors
  65. with another purpose, or not to discard descriptors based on purpose
  66. at all.
  67. BridgeDB memorizes the IP addresses and OR ports of the remaining
  68. bridges.
  69. If there is more than one bridge descriptor with the same fingerprint,
  70. BridgeDB memorizes the IP address and OR port of the most recently
  71. parsed bridge descriptor.
  72. If BridgeDB does not find a bridge descriptor for a bridge contained in
  73. the bridge network status parsed before, it does not add that bridge
  74. to the set of bridges to be given out to bridge users.
  75. 1.3. Parsing extra-info documents
  76. BridgeDB learns if a bridge supports a pluggable transport by parsing
  77. extra-info documents.
  78. Extra-info documents contain the name of the bridge (but only if it is
  79. named), the bridge's fingerprint, the type of pluggable transport(s) it
  80. supports, and the IP address and port number on which each transport
  81. listens, respectively.
  82. Extra-info documents may contain zero or more entries per bridge. We expect
  83. an extra-info entry to contain the following lines in the stated order:
  84. "extra-info" SP name SP fingerprint NL
  85. "transport" SP transport SP IP ":" PORT ARGS NL
  86. BridgeDB parses the fingerprint, transport type, IP address, port and any
  87. arguments that are specified on these lines. BridgeDB skips the name. If
  88. the fingerprint is invalid, BridgeDB skips the entry. BridgeDB memorizes
  89. the transport type, IP address, port number, and any arguments that are
  90. provided and then it assigns them to the corresponding bridge based on the
  91. fingerprint. Arguments are comma-separated and are of the form k=v,k=v.
  92. Bridges that do not have an associated extra-info entry are not invalid.
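A rough sketch of the extra-info handling described in this section follows. The function name `parse_extra_info` is hypothetical, and two details are assumptions of the sketch rather than statements from this document: a fingerprint counts as valid when it is 40 hexadecimal characters, and ARGS is separated from the port by a space.

```python
import re

# Assumption: a valid fingerprint is 40 hexadecimal characters.
FINGERPRINT_RE = re.compile(r"^[0-9A-Fa-f]{40}$")


def parse_extra_info(text):
    """Return fingerprint -> list of (transport, address, port, args)."""
    transports, fingerprint = {}, None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "extra-info":
            # The name (parts[1]) is skipped; an invalid fingerprint
            # causes the whole entry to be skipped.
            fp = parts[2].upper()
            fingerprint = fp if FINGERPRINT_RE.match(fp) else None
        elif len(parts) >= 3 and parts[0] == "transport" and fingerprint:
            address, _, port = parts[2].rpartition(":")
            args = {}
            if len(parts) > 3:
                # Arguments are comma-separated k=v pairs.
                for pair in parts[3].split(","):
                    key, _, value = pair.partition("=")
                    args[key] = value
            transports.setdefault(fingerprint, []).append(
                (parts[1], address, int(port), args))
    return transports
```

Bridges that never appear in the returned mapping simply have no pluggable transports attached, which matches the note that a missing extra-info entry does not make a bridge invalid.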
  93. 2. Assigning bridges to distributors
  94. A "distributor" is a mechanism by which bridges are given (or not
  95. given) to clients. The current distributors are "email", "https",
  96. and "unallocated".
  97. BridgeDB assigns bridges to distributors based on an HMAC hash of the
  98. bridge's ID and a secret and makes these assignments persistent.
  99. Persistence is achieved by using a database to map node ID to
  100. distributor.
  101. Each bridge is assigned to exactly one distributor (including
  102. the "unallocated" distributor).
  103. BridgeDB may be configured to support only a non-empty subset of the
  104. distributors specified in this document.
  105. BridgeDB may be configured to use different probabilities for assigning
  106. new bridges to distributors.
  107. BridgeDB does not change existing assignments of bridges to
  108. distributors, even if probabilities for assigning bridges to
  109. distributors change or distributors are disabled entirely.
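One way to picture the assignment rule is the sketch below. It is an illustration under assumptions, not BridgeDB's implementation: `assign_distributor` is a hypothetical name, a plain dict stands in for the database, HMAC-SHA1 is assumed as the hash, and the configured probabilities are modeled as positive integer shares.

```python
import hashlib
import hmac


def assign_distributor(bridge_id, secret, shares, existing):
    """Return the distributor for bridge_id, keeping any stored assignment.

    shares:   distributor name -> positive integer weight (assumed model
              of the configurable assignment probabilities)
    existing: dict standing in for the persistent database
    """
    if bridge_id in existing:
        # Existing assignments never change, even if shares change.
        return existing[bridge_id]
    digest = hmac.new(secret, bridge_id, hashlib.sha1).digest()
    point = int.from_bytes(digest, "big") % sum(shares.values())
    for name in sorted(shares):
        if point < shares[name]:
            existing[bridge_id] = name
            return name
        point -= shares[name]
```

Because the stored assignment is consulted first, changing the shares or disabling a distributor later only affects bridges that have not been seen before.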
  110. 3. Giving out bridges upon requests
  111. Upon receiving a client request, a BridgeDB distributor provides a
  112. subset of the bridges assigned to it.
  113. BridgeDB only gives out bridges that are contained in the most recently
  114. parsed bridge network status and that have the Running flag set (see
  115. Section 1).
  116. BridgeDB may be configured to give out a different number of bridges
  117. (typically 4) depending on the distributor.
  118. BridgeDB may define an arbitrary number of rules. These rules may
  119. specify the criteria by which a bridge is selected. Specifically,
  120. the available rules restrict the IP address version, OR port number,
  121. transport type, bridge relay flag, or country in which the bridge
  122. should not be blocked.
  123. 4. Selecting bridges to be given out based on IP addresses
  124. BridgeDB may be configured to support one or more distributors which
  125. give out bridges based on the requestor's IP address. Currently, this
  126. is how the HTTPS distributor works.
  127. The goal is to avoid handing out all the bridges to users in a similar
  128. IP space and time.
  129. # Someone else should look at proposals/ideas/old/xxx-bridge-disbursement
  130. # to see if this section is missing relevant pieces from it. -KL
  131. BridgeDB fixes the set of bridges to be returned for a defined time
  132. period.
  133. BridgeDB considers all IP addresses coming from the same /24 network
  134. as the same IP address and returns the same set of bridges. From here on,
  135. this non-unique address will be referred to as the IP address's 'area'.
  136. BridgeDB divides the IP address space equally into a small number of
  137. disjoint clusters (typically 4) and returns different results for requests
  138. coming from addresses that are placed into different clusters.
  139. # Note, changed term from "areas" to "disjoint clusters" -MF
  140. # I found that BridgeDB is not strict in returning only bridges for a
  141. # given area. If a ring is empty, it considers the next one. Is this
  142. # expected behavior? -KL
  143. #
  144. # This does not appear to be the case, anymore. If a ring is empty, then
  145. # BridgeDB simply returns an empty set of bridges. -MF
  146. #
  147. # I also found that BridgeDB does not make the assignment to areas
  148. # persistent in the database. So, if we change the number of rings, it
  149. # will assign bridges to other rings. I assume this is okay? -KL
  150. BridgeDB maintains a list of proxy IP addresses and returns the same
  151. set of bridges to requests coming from these IP addresses.
  152. The bridges returned to proxy IP addresses do not come from the same
  153. set as those for the general IP address space.
  154. BridgeDB can be configured to include bridge fingerprints in replies
  155. along with bridge IP addresses and OR ports.
  156. BridgeDB can be configured to display a CAPTCHA which the user must solve
  157. before BridgeDB returns the requested bridges.
  158. The current algorithm is as follows. An IP-based distributor splits
  159. the bridges uniformly into a set of "rings" based on an HMAC of their
  160. ID. Some of these rings are "area" rings for parts of IP space; some
  161. are "category" rings for categories of IPs (like proxies). When a
  162. client makes a request from an IP, the distributor first sees whether
  163. the IP is in one of the categories it knows. If so, the distributor
  164. returns bridges from the category rings. If not, the distributor
  165. maps the IP into an "area" (that is, a /24), and then uses an HMAC to
  166. map the area to one of the area rings.
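The category check and the area mapping can be sketched as below. The names `ip_to_area` and `choose_ring` are hypothetical; the sketch assumes IPv4 addresses, HMAC-SHA1, a single "proxy" category, and a simple modulo to pick the area ring, none of which is prescribed by this document.

```python
import hashlib
import hmac
import ipaddress


def ip_to_area(ip):
    """Collapse an IPv4 address to its /24 network (its "area")."""
    return str(ipaddress.ip_network(ip + "/24", strict=False))


def choose_ring(ip, key, n_area_rings, proxies=frozenset()):
    """Return ("category", name) or ("area", ring_index) for a request."""
    if ip in proxies:
        # Known categories are checked first.
        return ("category", "proxy")
    digest = hmac.new(key, ip_to_area(ip).encode(), hashlib.sha1).digest()
    return ("area", int.from_bytes(digest, "big") % n_area_rings)
```

For example, `ip_to_area("198.51.100.7")` yields `"198.51.100.0/24"`, so every address in that /24 is directed to the same area ring.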
  167. When the IP-based distributor determines from which area ring it is handing
  168. out bridges, it identifies which rules it will use to choose appropriate
  169. bridges. Using this information, it searches its cache of rings for one
  170. that already adheres to the criteria specified in this request. If one
  171. exists, then BridgeDB maps the current "epoch" (N-hour period) and the
  172. IP's area (/24) to a point on the ring based on HMAC, and hands out
  173. bridges at that point. If a ring does not already exist which satisfies this
  174. request, then a new ring is created and filled with bridges that fulfill
  175. the requirements. This ring is then used to select bridges as described.
  176. "Mapping X to Y based on an HMAC" above means the following:
  177. - We keep all of the elements of Y in some order, with a mapping
  178. from all 160-bit strings to positions in Y.
  179. - We take an HMAC of X using some fixed string as a key to get a
  180. 160-bit value. We then map that value to the next position of Y.
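The two steps above resemble a hash ring. The sketch below shows one possible shape, with the hypothetical class name `HashRing`. It assumes HMAC-SHA1 for the 160-bit values and, as a further assumption, places each element of Y at the position given by its own HMAC; the document only requires that the elements are kept "in some order".

```python
import bisect
import hashlib
import hmac


class HashRing:
    def __init__(self, key, items):
        self.key = key
        # Assumption: each element sits at the 160-bit position of its HMAC.
        self.points = sorted((self._hmac(item), item) for item in items)

    def _hmac(self, data):
        return hmac.new(self.key, data, hashlib.sha1).digest()

    def lookup(self, x, count=1):
        """Return up to count items starting at the next position after HMAC(x)."""
        if not self.points:
            return []
        start = bisect.bisect_left(self.points, (self._hmac(x),))
        n = min(count, len(self.points))
        # Wrap around the end of the sorted list to close the ring.
        return [self.points[(start + i) % len(self.points)][1] for i in range(n)]
```

The same X always lands on the same position as long as the key and the ring contents are unchanged, which is what makes the answers stable within an epoch.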
  181. When giving out bridges based on a position in a ring, BridgeDB first
  182. looks at flag requirements and port requirements. For example,
  183. BridgeDB may be configured to "Give out at least L bridges with port
  184. 443, and at least M bridges with Stable, and at most N bridges
  185. total." To do this, BridgeDB combines the following results:
  186. - The first L bridges in the ring after the position that have the
  187. port 443, and
  188. - The first M bridges in the ring after the position that have the
  189. Stable flag and that it has not already decided to give out, and
  190. - The first N-L-M bridges in the ring after the position that it
  191. has not already decided to give out.
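The three-part combination can be illustrated as follows. This is a sketch with assumed data shapes: `select_bridges` is a hypothetical name, bridges are modeled as dicts with "port" and "flags" keys, and `ring_order` is taken to be the ring contents already rotated to start at the chosen position.

```python
def select_bridges(ring_order, n_port443, n_stable, n_total):
    """Pick bridges per the L / M / N rule; ring_order starts at the position."""
    chosen = []

    def take(limit, predicate):
        # Walk the ring in order, skipping bridges already chosen.
        for bridge in ring_order:
            if limit <= 0:
                break
            if bridge not in chosen and predicate(bridge):
                chosen.append(bridge)
                limit -= 1

    take(n_port443, lambda b: b["port"] == 443)            # first L with port 443
    take(n_stable, lambda b: "Stable" in b["flags"])       # first M with Stable
    take(n_total - n_port443 - n_stable, lambda b: True)   # first N-L-M of the rest
    return chosen
```

With L = 1, M = 1 and N = 3, the result holds the first port-443 bridge, then the first Stable bridge not yet chosen, then one further bridge from the start of the ring.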
  192. After BridgeDB selects appropriate bridges to return to the requestor, it
  193. then prioritises the ordering of them in a list so that as many criteria
  194. are fulfilled as possible within the first few bridges. This list is then
  195. truncated to N bridges, if possible. N is currently defined as a
  196. piecewise function of the number of bridges in the ring such that:
  197.           /
  198.          |  1,   if len(ring) < 20
  199.          |
  200.      N = |  2,   if 20 <= len(ring) < 100
  201.          |
  202.          |  3,   if 100 <= len(ring)
  203.           \
  204. The bridges in this sublist, containing no more than N bridges, are the
  205. bridges returned to the requestor.
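As a small worked example, the piecewise definition of N can be written as a function. The name `bridges_per_answer` is hypothetical, and a ring of exactly 100 bridges is placed in the last case here.

```python
def bridges_per_answer(ring_size):
    """Return N, the maximum number of bridges per answer, for a ring size."""
    if ring_size < 20:
        return 1
    if ring_size < 100:
        return 2
    return 3
```

A ring of 19 bridges therefore yields at most 1 bridge per answer, a ring of 20 to 99 yields at most 2, and larger rings yield at most 3.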
  206. 5. Selecting bridges to be given out based on email addresses
  207. BridgeDB can be configured to support one or more distributors that
  208. give out bridges based on the requestor's email address. Currently,
  209. this is how the email distributor works.
  210. The goal is to bootstrap based on one or more popular email services'
  211. Sybil prevention algorithms.
  212. # Someone else should look at proposals/ideas/old/xxx-bridge-disbursement
  213. # to see if this section is missing relevant pieces from it. -KL
  214. BridgeDB rejects email addresses containing other characters than the
  215. ones that RFC2822 allows.
  216. BridgeDB may be configured to reject email addresses containing other
  217. characters it might not process correctly.
  218. # I don't think we do this, is it worthwhile? -MF
  219. BridgeDB rejects email addresses coming from other domains than a
  220. configured set of permitted domains.
  221. BridgeDB normalizes email addresses by removing "." characters from the
  222. local part and by removing the part of it after the first "+" character.
  223. BridgeDB can be configured to discard requests that do not have the
  224. value "pass" in their X-DKIM-Authentication-Result header or that do not
  225. have this header. The X-DKIM-Authentication-Result header is set by
  226. the incoming mail stack that needs to check DKIM authentication.
  227. BridgeDB does not return a new set of bridges to the same email address
  228. until a given time period (typically a few hours) has passed.
  229. # Why don't we fix the bridges we give out for a global 3-hour time period
  230. # like we do for IP addresses? This way we could avoid storing email
  231. # addresses. -KL
  232. # The 3-hour value is probably much too short anyway. If we take longer
  233. # time values, then people get new bridges when bridges show up, as
  234. # opposed to when we decide to reset the bridges we give them. (Yes, this
  235. # problem exists for the IP distributor). -NM
  236. # I'm afraid I don't fully understand what you mean here. Can you
  237. # elaborate? -KL
  238. #
  239. # Assuming an average churn rate, if we use short time periods, then a
  240. # requestor will receive new bridges based on rate-limiting and will (likely)
  241. # eventually work their way around the ring; eventually exhausting all bridges
  242. # available to them from this distributor. If we use a longer time period,
  243. # then each time the period expires there will be more bridges in the ring
  244. # thus reducing the likelihood of all bridges being blocked and increasing
  245. # the time and effort required to enumerate all bridges. (This is my
  246. # understanding, not from Nick) -MF
  247. # Also, we presently need the cache to prevent replays and because if a user
  248. # sent multiple requests with different criteria in each then we would leak
  249. # additional bridges otherwise. -MF
  250. BridgeDB can be configured to include bridge fingerprints in replies
  251. along with bridge IP addresses and OR ports.
  252. BridgeDB can be configured to sign all replies using a PGP signing key.
  253. BridgeDB periodically discards old email-address-to-bridge mappings.
  254. BridgeDB rejects too frequent email requests coming from the same
  255. normalized address.
  256. To map previously unseen email addresses to a set of bridges, BridgeDB
  257. proceeds as follows:
  258. - It normalizes the email address as above, by stripping out dots,
  259. removing all of the localpart after the +, and putting it all
  260. in lowercase. (Example: "John.Doe+bridges@example.COM" becomes
  261. "johndoe@example.com".)
  262. - It maps an HMAC of the normalized address to a position on its ring
  263. of bridges.
  264. - It hands out bridges starting at that position, based on the
  265. port/flag requirements, as specified at the end of section 4.
  266. See section 4 for the details of how bridges are selected from the ring
  267. and returned to the requestor.
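The normalization and domain check described in this section can be sketched as below. The function name `normalize_email` and the `allowed_domains` parameter are hypothetical, and the sketch does not attempt the RFC 2822 character check.

```python
def normalize_email(address, allowed_domains):
    """Return the normalized address, or raise ValueError if it is rejected."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local:
        raise ValueError("not an email address")
    domain = domain.lower()
    if domain not in allowed_domains:
        raise ValueError("domain not permitted")
    # Drop everything after the first "+", strip dots, and lowercase.
    local = local.split("+", 1)[0].replace(".", "").lower()
    return local + "@" + domain
```

Applied to the example above, `normalize_email("John.Doe+bridges@example.COM", {"example.com"})` returns `"johndoe@example.com"`. The normalized address is then the input to the HMAC that selects the ring position.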
  268. 6. Selecting unallocated bridges to be stored in file buckets
  269. # Kaner should have a look at this section. -NM
  270. BridgeDB can be configured to reserve a subset of bridges and not give
  271. them out via one of the distributors.
  272. BridgeDB assigns reserved bridges to one or more file buckets of fixed
  273. sizes and writes these file buckets to disk for manual distribution.
  274. BridgeDB ensures that a file bucket always contains the requested
  275. number of running bridges.
  276. If the requested number of bridges in a file bucket is reduced or the
  277. file bucket is not required anymore, the unassigned bridges are
  278. returned to the reserved set of bridges.
  279. If a bridge stops running, BridgeDB replaces it with another bridge
  280. from the reserved set of bridges.
  281. # I'm not sure if there's a design bug in file buckets. What happens if
  282. # we add a bridge X to file bucket A, and X goes offline? We would add
  283. # another bridge Y to file bucket A. OK, but what if X comes back? We
  284. # cannot put it back in file bucket A, because it's full. Are we going to
  285. # add it to a different file bucket? Doesn't that mean that most bridges
  286. # will be contained in most file buckets over time? -KL
  287. #
  288. # This should be handled the same as if the file bucket is reduced in size.
  289. # If X returns, then it should be added to the appropriate distributor. -MF
  290. 7. Displaying Bridge Information
  291. After bridges are selected using one of the methods described in
  292. Sections 4 - 6, they are output in one of two formats. Bridges are
  293. formatted as:
  294. <address:port> NL
  295. Pluggable transports are formatted as:
  296. <transportname> SP <address:port> [SP arglist] NL
  297. where arglist is an optional space-separated list of key-value pairs in
  298. the form of k=v.
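The two output formats can be produced by a helper along these lines. The name `format_bridge_line` is hypothetical, and the sketch assumes that arguments arrive as a dict of key-value pairs.

```python
def format_bridge_line(address, port, transport=None, args=None):
    """Format one bridge line, with or without a pluggable transport."""
    fields = []
    if transport:
        fields.append(transport)
    fields.append("%s:%d" % (address, port))
    if transport and args:
        # arglist is a space-separated list of k=v pairs.
        fields.extend("%s=%s" % (k, v) for k, v in args.items())
    return " ".join(fields) + "\n"
```

For instance, `format_bridge_line("192.0.2.1", 443)` gives `"192.0.2.1:443\n"`, and passing `transport="obfs2"` with `args={"k": "v"}` and port 5555 gives `"obfs2 192.0.2.1:5555 k=v\n"`.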
  299. Previously, each line was prefixed with the "bridge" keyword, as follows:
  300. "bridge" SP <address:port> NL
  301. "bridge" SP <transportname> SP <address:port> [SP arglist] NL
  302. # We don't do this anymore because Vidalia and TorLauncher don't expect it.
  303. # See the commit message for b70347a9c5fd769c6d5d0c0eb5171ace2999a736.
  304. 8. Writing bridge assignments for statistics
  305. BridgeDB can be configured to write bridge assignments to disk for
  306. statistical analysis.
  307. The start of a bridge assignment is marked by the following line:
  308. "bridge-pool-assignment" SP YYYY-MM-DD HH:MM:SS NL
  309. YYYY-MM-DD HH:MM:SS is the time, in UTC, when BridgeDB has completed
  310. loading new bridges and assigning them to distributors.
  311. For every running bridge there is a line with the following format:
  312. fingerprint SP distributor (SP key "=" value)* NL
  313. The distributor is one of "email", "https", or "unallocated".
  314. Both "email" and "https" distributors support adding keys for "port",
  315. "flag" and "transport". Respectively, the port number, flag name, and
  316. transport types are the values. These are used to indicate that
  317. a bridge matches certain port, flag, or transport criteria of requests.
  318. The "https" distributor also allows the key "ring" with a number as
  319. value to indicate to which IP address area the bridge is returned.
  320. The "unallocated" distributor allows the key "bucket" with the file
  321. bucket name as value to indicate which file bucket a bridge is assigned
  322. to.
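A sketch of writing such an assignment block is shown below. The name `format_assignments` and the tuple layout of its input are assumptions made for illustration; the caller is expected to supply the UTC timestamp already formatted as YYYY-MM-DD HH:MM:SS.

```python
def format_assignments(timestamp, assignments):
    """Render one bridge-pool-assignment block.

    assignments: iterable of (fingerprint, distributor, [(key, value), ...])
    """
    lines = ["bridge-pool-assignment " + timestamp]
    for fingerprint, distributor, keys in assignments:
        fields = [fingerprint, distributor]
        fields += ["%s=%s" % (k, v) for k, v in keys]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"
```

A bridge in the "https" distributor might be written with keys such as `ring=3` and `port=443`, while a bridge in the "unallocated" distributor might carry only a `bucket` key or no keys at all.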