🚀 Awesome LLMs on Device: A Must-Read Comprehensive Hub by Nexa AI

<div align="center">

Discord

On-device Model Hub / Nexa SDK Documentation

</div> <div style="text-align: center;"> <img src="resources/Summary_of_on-device_LLMs_evolution.jpeg" alt="Summary of on-device LLMs’ evolution" width="800"> <div style="font-size: 10px;">Summary of On-device LLMs’ Evolution</div> </div>

🌟 About This Hub

Welcome to the ultimate hub for on-device Large Language Models (LLMs)! This repository is your go-to resource for all things related to LLMs designed for on-device deployment. Whether you're a seasoned researcher, an innovative developer, or an enthusiastic learner, this comprehensive collection of cutting-edge knowledge is your gateway to understanding, leveraging, and contributing to the exciting world of on-device LLMs.

🚀 Why This Hub is a Must-Read

📚 What's Inside Our Hub

Foundations and Preliminaries

Evolution of On-Device LLMs

LLM Architecture Foundations

On-Device LLMs Training

Limitations of Cloud-Based LLM Inference and Advantages of On-Device Inference

The Performance Indicator of On-Device LLMs

Efficient Architectures for On-Device LLMs

| Model | Performance | Computational Efficiency | Memory Requirements |
|---|---|---|---|
| MobileLLM | High accuracy, optimized for sub-billion-parameter models | Embedding sharing, grouped-query attention (sketched below) | Reduced model size due to deep-and-thin structures |
| EdgeShard | Up to 50% latency reduction, 2× throughput improvement | Collaborative edge-cloud computing, optimal shard placement | Distributed model components reduce individual device load |
| LLMCad | Up to 9.3× speedup in token generation | Generate-then-verify, token tree generation | Smaller LLM for token generation, larger LLM for verification |
| Any-Precision LLM | Supports multiple precisions efficiently | Post-training quantization, memory-efficient design | Substantial memory savings with versatile model precisions |
| Breakthrough Memory | Up to 4.5× performance improvement | PIM and PNM technologies enhance memory processing | Enhanced memory bandwidth and capacity |
| MELTing Point | Provides systematic performance evaluation | Analyzes impacts of quantization, efficient model evaluation | Evaluates memory and computational efficiency trade-offs |
| LLMaaS on device | Significantly reduces context-switching latency | Stateful execution, fine-grained KV cache compression | Efficient memory management with tolerance-aware compression and swapping |
| LocMoE | Reduces training time per epoch by up to 22.24% | Orthogonal gating weights, locality-based expert regularization | Minimizes communication overhead with group-wise All-to-All and recompute pipeline |
| EdgeMoE | Significant performance improvements on edge devices | Expert-wise bitwidth adaptation, preloading experts | Efficient memory management through expert-by-expert computation reordering |
| JetMoE | Outperforms Llama2-7B and Llama2-13B-Chat with fewer parameters | Reduces inference computation by 70% using sparse activation | 8B total parameters, only 2B activated per input token |
| Pangu-$\pi$ Pro | Neural architecture, parameter initialization, and optimization strategy for billion-level parameter models | Embedding sharing, tokenizer compression | Reduced model size via architecture tweaking |
| Zamba2 | 2× faster time-to-first-token, 27% reduction in memory overhead, and 1.29× lower generation latency compared to Phi3-3.8B | Hybrid Mamba2/attention architecture with shared transformer block | 2.7B parameters, fewer KV states due to reduced attention |
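
Several techniques in the table recur across these papers. As a concrete anchor, below is a minimal PyTorch sketch of grouped-query attention, the KV-cache-saving mechanism the MobileLLM row cites: groups of query heads share a single key/value head, so the KV cache shrinks by the group factor. The dimensions, weight layout, and omission of RoPE and KV caching are our illustrative simplifications, not details from any of the papers above.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """Causal self-attention where groups of query heads share K/V heads.

    x: (batch, seq, dim); n_heads must be a multiple of n_kv_heads.
    """
    b, s, d = x.shape
    head_dim = d // n_heads
    # Project to many query heads but only a few key/value heads.
    q = (x @ wq).view(b, s, n_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(b, s, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, s, n_kv_heads, head_dim).transpose(1, 2)
    # Each group of n_heads // n_kv_heads query heads reuses one K/V head,
    # so the cached K/V tensors are that many times smaller than in MHA.
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(b, s, d)

# Toy usage: 8 query heads share 2 K/V heads -> 4x smaller KV cache.
d, n_heads, n_kv_heads = 512, 8, 2
head_dim = d // n_heads
x = torch.randn(1, 16, d)
wq = torch.randn(d, n_heads * head_dim)
wk = torch.randn(d, n_kv_heads * head_dim)
wv = torch.randn(d, n_kv_heads * head_dim)
y = grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads)  # (1, 16, 512)
```

On-device, the KV cache is often the dominant memory consumer during generation, which is why the table's "Memory Requirements" column keeps coming back to attention structure.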

Model Compression and Parameter Sharing

Collaborative and Hierarchical Model Approaches

Memory and Computational Efficiency

Mixture-of-Experts (MoE) Architectures

Hybrid Architectures

General Efficiency and Performance Improvements

Model Compression and Optimization Techniques for On-Device LLMs

Quantization

Pruning

Knowledge Distillation

Low-Rank Factorization
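
Each of the compression techniques above deserves its own deep dive; as a quick taste, here is a toy sketch of symmetric per-tensor int8 post-training quantization, the simplest form of the Quantization entry above. The tensor shapes and the single per-tensor scale are our illustrative choices, not a specific paper's recipe.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor PTQ: approximate w as scale * q with int8 q."""
    scale = w.abs().max() / 127.0  # map the largest magnitude to +/-127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Toy usage: fp32 weights (4 bytes each) become int8 (1 byte each) plus one scale.
w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
error = (w - dequantize_int8(q, scale)).abs().mean()
print(f"mean absolute rounding error: {error:.5f}")
```

Production systems typically use per-channel or group-wise scales and calibrate on sample data, but the principle is the same: store low-bit integers, rescale on the fly.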

Hardware Acceleration and Deployment Strategies

Popular On-Device LLMs Framework

Hardware Acceleration

Applications

Model Reference

| Model | Institute | Paper |
|---|---|---|
| Gemini Nano | Google | Gemini: A Family of Highly Capable Multimodal Models |
| Octopus series | Nexa AI | Octopus v2: On-device language model for super agent<br>Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent<br>Octopus v4: Graph of language models<br>Octopus: On-device language model for function calling of software APIs |
| OpenELM and Ferret-v2 | Apple | OpenELM is a large language model integrated within iOS to enhance application functionality.<br>Ferret-v2 improves on its predecessor with enhanced visual processing capabilities and an advanced training regimen. |
| Phi series | Microsoft | Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone |
| MiniCPM | Tsinghua University | A GPT-4V Level Multimodal LLM on Your Phone |
| Gemma2-9B | Google | Gemma 2: Improving Open Language Models at a Practical Size |
| Qwen2-0.5B | Alibaba Group | Qwen Technical Report |

Tutorials and Learning Resources

🤝 Join the On-Device LLM Revolution

We believe in the power of community! If you're passionate about on-device AI and want to contribute to this ever-growing knowledge hub, here's how you can get involved:

  1. Fork the repository
  2. Create a new branch for your brilliant additions
  3. Make your updates and push your changes
  4. Submit a pull request and become part of the on-device LLM movement

⭐ Star History ⭐

Star History Chart

📖 Cite Our Work

If our hub fuels your research or powers your projects, we'd be thrilled if you could cite our paper here:

```bibtex
@article{xu2024device,
  title={On-Device Language Models: A Comprehensive Review},
  author={Xu, Jiajun and Li, Zhiyuan and Chen, Wei and Wang, Qun and Gao, Xin and Cai, Qi and Ling, Ziyuan},
  journal={arXiv preprint arXiv:2409.00088},
  year={2024}
}
```

📄 License

This project is open-source and available under the MIT License. See the LICENSE file for more details.

Don't just read about the future of AI – be part of it. Star this repo, spread the word, and let's push the boundaries of on-device LLMs together! 🚀🌟