GUI Grounding Pre-training Data for SeeClick

This project builds the GUI grounding pre-training dataset for SeeClick. It takes URLs from the Common Crawl dataset, crawls the corresponding web pages with Selenium, and extracts web element grounding data for the continual pre-training of SeeClick.
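To make "web element grounding data" concrete, here is a minimal sketch of how one page could be turned into text-to-box records with Selenium. It is not the code in `main.py`: the function names, the `a, button` selector, the fixed window size, and the record layout (`text` plus a `bbox` normalized to the 0 to 1 range) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def to_grounding_record(text: str, rect: Dict[str, float],
                        viewport: Dict[str, float]) -> Optional[dict]:
    """Convert one element's pixel rectangle into a normalized record.

    Hypothetical record layout: {"text": ..., "bbox": [left, top, right, bottom]}.
    Returns None for empty text, zero-size boxes, or boxes outside the viewport.
    """
    text = text.strip()
    if not text or rect["width"] <= 0 or rect["height"] <= 0:
        return None
    left, top = rect["x"], rect["y"]
    right, bottom = left + rect["width"], top + rect["height"]
    if left < 0 or top < 0 or right > viewport["width"] or bottom > viewport["height"]:
        return None
    bbox = [
        round(left / viewport["width"], 3),
        round(top / viewport["height"], 3),
        round(right / viewport["width"], 3),
        round(bottom / viewport["height"], 3),
    ]
    return {"text": text, "bbox": bbox}


def collect_elements(url: str, screenshot_path: str) -> List[dict]:
    """Open a page in headless Chrome, save a screenshot, and return records."""
    # Imported lazily so the pure helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1280,720")  # assumed size
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.save_screenshot(screenshot_path)
        # Viewport size in CSS pixels; the page is not scrolled, so element
        # coordinates and viewport coordinates share the same origin.
        viewport = driver.execute_script(
            "return {width: window.innerWidth, height: window.innerHeight};"
        )
        records = []
        # Assumed selector: links and buttons only.
        for element in driver.find_elements(By.CSS_SELECTOR, "a, button"):
            if not element.is_displayed():
                continue
            record = to_grounding_record(element.text, element.rect, viewport)
            if record is not None:
                records.append(record)
        return records
    finally:
        driver.quit()
```

Normalizing the box by the viewport size keeps the records independent of the screenshot resolution, which is one reasonable choice for grounding data; the actual project may store coordinates differently.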

This is the English introduction of the project.

Chinese README

Project Structure

How to Use

  1. Prepare the Common Crawl dataset in advance and unzip it to a specific directory.
  2. Install the Chrome browser and the matching version of ChromeDriver.
  3. Install Python dependencies.
pip install -r requirements.txt
  4. Run preprocess_cdx.py to extract URLs and remove duplicates.
python preprocess_cdx.py --cdx_file_path /path/to/cdx --unique_cdx_file_path /path/to/unique_cdx
  5. Run main.py to crawl data.
python main.py --cdx_file_path /path/to/unique_cdx --out_root /path/to/output --num_workers 20
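
For readers who want a feel for what the URL extraction and deduplication step does, the following is a minimal sketch, not the actual contents of `preprocess_cdx.py`. It assumes each line of the CDX file follows the Common Crawl index layout `<urlkey> <timestamp> <json>` with a `url` field in the JSON part; the function name `iter_unique_urls` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_unique_urls(cdx_lines: Iterable[str]) -> Iterator[str]:
    """Yield each URL found in the CDX lines once, in first-seen order."""
    seen = set()
    for line in cdx_lines:
        # Assumed layout: "<urlkey> <timestamp> <json>". Split at most twice
        # so that spaces inside the JSON payload are left untouched.
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue  # skip malformed lines
        try:
            payload = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines whose JSON part cannot be parsed
        url = payload.get("url") if isinstance(payload, dict) else None
        if url and url not in seen:
            seen.add(url)
            yield url


if __name__ == "__main__":
    sample = [
        'com,example)/ 20230101000000 {"url": "https://example.com/", "status": "200"}',
        'com,example)/ 20230202000000 {"url": "https://example.com/", "status": "200"}',
        'org,example)/about 20230101000000 {"url": "https://example.org/about", "status": "200"}',
    ]
    for url in iter_unique_urls(sample):
        print(url)
```

Keeping the seen URLs in an in-memory set is the simplest approach; for very large CDX files a disk-backed or hashed set may be needed instead.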