Similar Scam Contract Detector
Description
This bot detects the creation of scam contracts based on bytecode similarity to known scam contracts reported by other Forta bots, such as bot 0xf715450e392acb385eabdb8fc94278b3821d2c9a148de777726673895c7283a0.
How does it work?
This bot listens to every contract creation event and retrieves the runtime bytecode of the created contract. It then builds a control flow graph (CFG) of the bytecode, extracts the instructions of every function, and vectorizes them with a doc2vec model. Finally, the vectorized function features of the contract are compared with those of the known scam contracts using FAISS. In other words, this bot performs function-level semantic similarity detection.
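The comparison step can be sketched as a nearest-neighbor search over function vectors. The sketch below is an illustration only, not the bot's code: it swaps FAISS for a brute-force NumPy search so it runs without extra dependencies, and the names `normalize` and `nearest_known_functions` are hypothetical. It assumes each function has already been embedded as a fixed-length, non-zero vector.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


def nearest_known_functions(query: np.ndarray, known: np.ndarray, top_k: int = 1):
    """For each query function vector, return the indices and cosine
    similarities of the top_k most similar known scam function vectors.

    Brute-force stand-in for an index lookup (assumption: the real bot
    delegates this search to FAISS).
    """
    sims = normalize(query) @ normalize(known).T      # shape (n_query, n_known)
    order = np.argsort(-sims, axis=1)[:, :top_k]      # most similar first
    return order, np.take_along_axis(sims, order, axis=1)


# Toy 2-dimensional "embeddings"; real doc2vec vectors would be much longer.
known = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[2.0, 0.1], [3.0, 3.0]])
indices, similarities = nearest_known_functions(query, known)
```

Because the vectors are normalized first, ranking by inner product is the same as ranking by cosine similarity, which is the quantity the similarity score is built on.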
When calculating the similarity between contracts, we define the similarity of contracts $C_1$ and $C_2$ as:
$$Sim(C_1, C_2) = \sum_{f_i \in C_1} \log \frac{P(f_i, f_2^*)}{P(f_i, \bar{f_2})},$$
where $f_i$ represents $C_1$'s $i$-th function, $f_2^*$ represents the function of $C_2$ that is most similar to $f_i$, and $\bar{f_2}$ represents the mean of all of $C_2$'s functions. $P(f_i, f_2^*)$ and $P(f_i, \bar{f_2})$ are the probabilities of $f_i$ being semantically similar to $f_2^*$ and $\bar{f_2}$ respectively. The probability $P(\cdot, \cdot)$ is calculated by:
$$P(f_i, f_j) = \frac{1}{1 + e^{-k \cdot \cos(f_i, f_j)}},$$
where $k$ is a hyperparameter and $\cos(\cdot, \cdot)$ is the cosine similarity between two vectors.
Finally, the similarity score will be normalized by:
$$ score = \frac{Sim(C_1, C_2)}{Sim(C_1, C_1)}$$
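The scoring equations above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the bot's actual implementation: the function names `prob`, `sim`, and `score` are hypothetical, the value `k=10.0` is an arbitrary placeholder for the hyperparameter, and the "mean of $C_2$'s functions" is interpreted as the mean of its function vectors.

```python
import numpy as np


def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def prob(cos_sim, k: float):
    """P(f_i, f_j): logistic function of k times the cosine similarity."""
    return 1.0 / (1.0 + np.exp(-k * cos_sim))


def sim(c1: np.ndarray, c2: np.ndarray, k: float = 10.0) -> float:
    """Sim(C_1, C_2) for function-vector matrices of shape (n_functions, dim)."""
    # Cosine similarity of each f_i to its best match f_2^* in C_2.
    best = cosine_matrix(c1, c2).max(axis=1)
    # Cosine similarity of each f_i to the mean vector of C_2 (assumed reading).
    mean_vec = c2.mean(axis=0, keepdims=True)
    mean = cosine_matrix(c1, mean_vec)[:, 0]
    return float(np.sum(np.log(prob(best, k) / prob(mean, k))))


def score(c1: np.ndarray, c2: np.ndarray, k: float = 10.0) -> float:
    """Normalized score: Sim(C_1, C_2) / Sim(C_1, C_1)."""
    return sim(c1, c2, k) / sim(c1, c1, k)


# Toy example: a contract compared with itself scores 1 by construction.
contract = np.array([[1.0, 0.0], [0.0, 1.0]])
self_score = score(contract, contract)
```

One caveat of this reading: if $C_1$ contains a single function, its best match and its mean coincide, so $Sim(C_1, C_1)$ is zero and the normalization is undefined. A real implementation would need to handle that case explicitly.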
Supported chains
All chains that Forta supports
Alerts
- NEW-SCAMMER-CONTRACT-CODE-HASH
  - Fired when a similar contract is identified based on the code similarity hash
  - Severity is always set to "medium"
  - Type is always set to "suspicious"
  - Metadata:
    - alert_hash: the alert hash from the handleAlert function that brought this scammer into scope (that is, the scammer contract that the new scammer contract is similar to)
    - new_scammer_eoa: the EOA that created the new scammer contract
    - new_scammer_contract_address: the address of the new scammer contract
    - scammer_eoa: the EOA that created the original scammer contract
    - scammer_contract_address: the original scammer contract
    - similarity_score: score that expresses the similarity between the original scammer contract and the new scammer contract identified
    - similarity_hash: the similarity hash for grouping the two contracts together
  - Labels:
    - New scammer EOA will be set as scam
    - New scammer contract address will be set as scam
    - Confidence will be calculated by the above-mentioned equation.
Test data
npm run sequence tx0x77ef021978dc893297a77a51990efab1ef9234006a1d97bb78678354d92de632,0xe350cf63228ae2277b0e5b49089c6f255acd481cea19892749357fe74edbd0f7,tx0xa2819befc5c19c3a51fbbea8557e4dfebd2be41cdd7359462c18027a364e7fae,0xc3b228892e92ebf86f7e71bc202279a0a4863ca83f73fa7c8df9a592a59943cb,tx0x77ef021978dc893297a77a51990efab1ef9234006a1d97bb78678354d92de632,tx0x136454296922d5c6908061434dcd3645995fe9419a147d0fe5eab6d5eb8fea9a
The above test script should raise two alerts: one for the second transaction (transactions are the entries that start with tx) and one for the third transaction.
Train the model
The model is trained on the slither-audited-smart-contracts dataset. After processing, there are more than 2,000,000 functions for the model to learn from in an unsupervised manner. Training takes roughly 3 hours on an M1 Max.
python construct_dataset.py && python train.py
Future work
- Train with a larger dataset, such as paradigm-data-portal.
- Flag the precise malicious function calls (requires flagging from upstream bots).