
This version does not include the BatchNorm layers you would want for fine-tuning. If you plan to train the model, you should add these layers manually.
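One way to add the layers is to wrap every convolution in the network with a BatchNorm2d. The snippet below is only a minimal sketch: `add_batchnorm` and the `toy` network are hypothetical names used for illustration, and the toy network is a stand-in, not the actual FaceMesh definition from this repo.

```python
import torch
from torch import nn


def add_batchnorm(module: nn.Module) -> nn.Module:
    """Wrap every nn.Conv2d inside `module` as Sequential(conv, BatchNorm2d), in place."""
    # Snapshot the children first so the newly created wrappers are not revisited.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Conv2d):
            wrapped = nn.Sequential(child, nn.BatchNorm2d(child.out_channels))
            setattr(module, name, wrapped)
        else:
            add_batchnorm(child)  # recurse into containers and custom blocks
    return module


# Hypothetical stand-in network, NOT the real FaceMesh architecture.
toy = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.PReLU(16),
    nn.Sequential(
        nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16),  # depthwise
        nn.Conv2d(16, 32, kernel_size=1),                        # pointwise
    ),
)
toy = add_batchnorm(toy)
num_bn = sum(isinstance(m, nn.BatchNorm2d) for m in toy.modules())
```

A freshly initialized BatchNorm2d is close to an identity mapping in eval mode (zero running mean, unit running variance, unit scale, zero shift), so the converted weights should still give nearly the same outputs before fine-tuning starts.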

The procedure for conversion was pretty interesting:

I unpacked the ARCore iOS framework and extracted the TFLite model of FaceMesh. You can download it here
The paper doesn't state any architecture details, so I looked at the Netron graph visualization to reverse-engineer the operations and the number of input and output channels.
I re-created them in PyTorch and transferred the raw weights from the TFLite file semi-manually into the PyTorch model definition (see Convert-FaceMesh.ipynb for details).
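The weight-transfer step mostly comes down to reordering kernel axes, because TFLite and PyTorch store convolution kernels in different layouts. The following is a rough sketch of that idea, not the code from the notebook: `tflite_to_torch_kernel` and `load_conv` are hypothetical helpers, the random arrays stand in for tensors read from the TFLite file, and a depthwise channel multiplier of 1 is assumed.

```python
import numpy as np
import torch
from torch import nn


def tflite_to_torch_kernel(w: np.ndarray, depthwise: bool = False) -> torch.Tensor:
    """Reorder a TFLite conv kernel into PyTorch's (out, in/groups, kH, kW) layout."""
    if depthwise:
        # TFLite depthwise kernel: (1, kH, kW, C) -> PyTorch: (C, 1, kH, kW)
        w = w.transpose(3, 0, 1, 2)
    else:
        # TFLite regular kernel: (out, kH, kW, in) -> PyTorch: (out, in, kH, kW)
        w = w.transpose(0, 3, 1, 2)
    return torch.from_numpy(np.ascontiguousarray(w))


def load_conv(conv: nn.Conv2d, w: np.ndarray, b: np.ndarray, depthwise: bool = False) -> None:
    """Copy a TFLite kernel and bias into an existing nn.Conv2d."""
    with torch.no_grad():
        conv.weight.copy_(tflite_to_torch_kernel(w, depthwise))
        conv.bias.copy_(torch.from_numpy(b))


# Placeholder arrays standing in for tensors extracted from the TFLite file.
rng = np.random.default_rng(0)
w_regular = rng.standard_normal((16, 3, 3, 3)).astype(np.float32)    # (out, kH, kW, in)
w_depthwise = rng.standard_normal((1, 3, 3, 16)).astype(np.float32)  # (1, kH, kW, C)
bias = rng.standard_normal(16).astype(np.float32)

conv = nn.Conv2d(3, 16, kernel_size=3)
dw_conv = nn.Conv2d(16, 16, kernel_size=3, groups=16)
load_conv(conv, w_regular, bias)
load_conv(dw_conv, w_depthwise, bias, depthwise=True)
```

Comparing the PyTorch outputs against the TFLite interpreter on the same input is a sensible sanity check after a transfer like this.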

The predict_on_image function normalizes the image itself, so you can pass a resized image directly as an np.array.
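For readers who want to feed the network without going through predict_on_image, the normalization would have to be done by hand. The sketch below shows what such a step typically looks like; the [-1, 1] scaling, the channel-first layout, and the 192x192 input size are assumptions for illustration, and `normalize_image` is a hypothetical helper, not a function from this repo.

```python
import numpy as np


def normalize_image(img: np.ndarray) -> np.ndarray:
    """Map a uint8 HxWx3 image to a float32 CxHxW array in [-1, 1] (assumed scaling)."""
    x = img.astype(np.float32) / 127.5 - 1.0  # 0 -> -1.0, 255 -> 1.0
    return np.transpose(x, (2, 0, 1))         # HWC -> CHW


# Placeholder for a resized RGB frame; the 192x192 size is an assumption.
resized = np.full((192, 192, 3), 255, dtype=np.uint8)
x = normalize_image(resized)
```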

See the Inference-FaceMesh.ipynb notebook for a usage example.