Github Social
Real-time collaborative repository recommendations based on GitHub stars.
About
The application shows GitHub projects related to a given repository by analysing GitHub stars.
The application uses offline data that is continuously updated from the GitHub API. The seed database was extracted from the GitHub Archive and GHTorrent websites. Specifically:
- List of GitHub Repositories and Users (stored in PostgreSQL)
- List of starred Repositories of each User (stored in Redis)
Used algorithm
The application uses a memory-based, item-based collaborative filtering algorithm, with a modified Sørensen–Dice coefficient measuring the similarity between two given repositories.
The approach is similar to that of predictor, with important differences, including:
- Instead of computing the intersection of stars between the given repository and each repository related to it, similarities are computed in bulk using the Redis zunionstore command.
- Similarities are computed and cached by a Lua script executed directly in the Redis instance.
- For repositories with thousands of stars, a representative sample of 100-5000 users is taken.
- Computations operate on sets of integers instead of sets of strings.
- Redis is used in 32-bit mode with an enlarged shared integer pool to reduce memory usage.
- A "popularity penalty factor" is used for discovering less popular repositories. The penalty factor can be provided by the user.
- For real-time recommendations, users with more than 1000 stars are ignored.
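The bulk computation in the first point can be illustrated without Redis. The sketch below is plain Ruby and is not the project's code: it mimics the result of unioning every stargazer's set of starred repositories with weight 1 and summed scores, so each repository B ends up with a score equal to the number of users it shares with the subject repository. The data layout, the method name `co_star_counts`, and the `max_stars` parameter are assumptions made for illustration.

```ruby
# Hypothetical in-memory stand-in for the Redis data:
# user ID => IDs of the repositories that user starred.
STARS_BY_USER = {
  1 => [10, 20, 30],
  2 => [10, 20],
  3 => [10, 30, 40],
  4 => [20, 40]
}.freeze

# Counts, for every repository B, how many stargazers of `repo` also starred B.
# Adding one point per user per repository gives the same totals as a union
# of the users' star sets with summed scores. Users with more than
# `max_stars` stars are skipped, as described for real-time recommendations.
def co_star_counts(stars_by_user, repo, max_stars: 1000)
  counts = Hash.new(0)
  stars_by_user.each_value do |starred|
    next unless starred.include?(repo)
    next if starred.size > max_stars

    starred.each { |other| counts[other] += 1 unless other == repo }
  end
  counts
end

counts = co_star_counts(STARS_BY_USER, 10)
puts counts.inspect
```

Each count is the intersection size |U(A) ∩ U(B)| for one related repository, obtained in a single pass over the stargazers of A rather than one set intersection per candidate.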
The similarity formula reads as follows:
S(A, B) = |U(A) ∩ U(B)| / (|U(A)| + P * |U(B)|)
Where A is the subject repository, B is a related repository, U(x) is the set of users who starred repository x, and P is the "popularity penalty factor" provided by the user in the UI.
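As a worked illustration of the formula, here is a minimal Ruby sketch using the standard library's Set. The method name `similarity` and the sample user IDs are made up for this example; it only shows how the penalty factor changes the ranking and does not reproduce the Redis-side implementation.

```ruby
require "set"

# S(A, B) = |U(A) ∩ U(B)| / (|U(A)| + P * |U(B)|)
# users_a, users_b: sets of IDs of users starring A and B.
# penalty: the popularity penalty factor P.
def similarity(users_a, users_b, penalty: 1.0)
  denominator = users_a.size + penalty * users_b.size
  return 0.0 if denominator.zero?

  (users_a & users_b).size / denominator.to_f
end

a       = Set[1, 2, 3, 4]                # stargazers of the subject repository A
niche   = Set[1, 2]                      # a small related repository
popular = Set[1, 2, 3, 5, 6, 7, 8, 9]    # a large related repository

[0, 1.0].each do |p|
  puts "P=#{p}: niche=#{similarity(a, niche, penalty: p).round(3)} " \
       "popular=#{similarity(a, popular, penalty: p).round(3)}"
end
```

With P = 0 the popular repository scores higher (0.75 versus 0.5). With P = 1 the ranking flips (0.25 versus about 0.333), which is how a larger penalty surfaces less popular repositories.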
The algorithm is implemented in redis_recommender.rb.
Performance
The algorithm can analyse hundreds of thousands of stars in well under 1 second while keeping memory usage below 1 GB on the GitHub dataset. A single Redis database with caching is enough to handle a GitHub-sized dataset.
Recommendation speed can be improved by adding more Redis slaves.
Requirements
- Ruby 2.1.0
- PostgreSQL 9.x
- Redis 3.0.0, preferably 32-bit
Technologies
- Ruby & Rails 4
- CoffeeScript & Angular 1.2
- Rails Assets
- SCSS, SLIM
- Sidekiq
Production installation
The application requires Redis and PostgreSQL database dumps. They can be downloaded using the bin/download script. Please download them only if you really need to test with live data.
curl -o db/dump.rdb http://sheerun.net/dump.rdb
curl -o db/dump.sql.gz http://sheerun.net/dump.sql.gz
You'll also need a Redis instance compiled in 32-bit mode with an increased shared integer count:
#define REDIS_SHARED_INTEGERS 15000000
make 32bit
After your Redis instance is up and running with the downloaded dump.rdb, and PostgreSQL has imported dump.sql.gz, you can bundle the application:
bundle install
bin/rake db:create
bin/rake db:migrate
You also need to create a GitHub application with the callback URL set to:
http://localhost:3000/auth/github/callback
And add a .env file with the following configuration:
GITHUB_KEY=xxx
GITHUB_SECRET=yyy
The application and the Sidekiq worker can be started with:
bin/foreman start
Contributing
We need help with the following:
- Making the recommendation engine even more performant
- Better front-end design and interaction (the author is a Ruby developer)
- Improvements to the recommendation algorithm to get better suggestions
- Testing, fixing and maintaining the application
If you think you could help, please open an issue or pull request on this repository.
License
This project is MIT-licensed.