4_data.tex 1.7 KB

12345678910
  1. \section{Data and Setup}
  2. \label{sec:data}
  3. We discuss the data sources used to build Tutela.
  4. Ethereum transactions are downloaded from the \texttt{crypto\_ethereum} dataset using BigQuery, including all transactions from August 7th, 2015 to October 1st, 2021\footnote{In the web application, Tutela is updated weekly.}. In total, this amounts to 4 terabytes of data with over 1B rows.
  5. In addition, we assume access to a list of known addresses obtained from a public Kaggle challenge\footnote{See the list of labelled Ethereum addresses found at \url{https://www.kaggle.com/hamishhall/labelled-ethereum-addresses}.}, containing almost 20,000 labelled addresses corresponding to different centralized exchanges, decentralized exchanges, relayers, DeFi applications, and more.
  6. This list will be used to identify exchange addresses for heuristics and apply known constraints on the inferred identity of clustered addresses.
  7. Additionally, we create a partition of the transaction data from \texttt{crypto\_ethereum} pertaining to Tornado Cash pools. This is done by checking that the address receiving a transaction is a Tornado Cash smart contract (taken from the BigQuery dataset \texttt{tornado\_cash\_transactions}). To capture the transactions executed by the Ethereum virtual machine (e.g., through a smart contract), we use the \texttt{crypto\_ethereum.traces} table. In the special case that a withdrawal from a Tornado Cash pool is made by a relayer, we decode the input code using the contract ABI to find the recipient address. In total, we uncover around 97,365 deposit and 83,782 withdraw transactions across all pools. These two transaction sets will be used for Tornado Cash-specific heuristics.