Github Social

Real-time collaborative repository recommendations based on GitHub stars.

About

The application shows related GitHub projects by analysing GitHub stars.

The application uses offline data that is continuously updated from the GitHub API. The seed database was extracted from the GitHub Archive and GHTorrent websites. Specifically:

List of GitHub Repositories and Users (stored in PostgreSQL)
List of starred Repositories of each User (stored in Redis)

Used algorithm

The application uses a memory-based, item-based Collaborative Filtering algorithm with a modified Sørensen–Dice coefficient to measure the similarity between two given repositories.

We use an approach similar to predictor, with some important differences, among others:

Instead of computing the intersection of stars between the given repository and each related repository one by one, similarities are computed in bulk using the Redis zunionstore command.
Similarities are computed and cached by a Lua script executed directly in the Redis instance.
For repositories with thousands of stars, a representative sample of 100-5000 users is taken.
Computations run on sets of integers instead of sets of strings.
Redis is used in 32-bit mode with an increased shared integer pool to reduce memory usage.
A "popularity penalty factor" is used for discovering less popular repositories. The penalty factor can be provided by the user.
For real-time recommendations, users who have more than 1000 stars are ignored.
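
The counting step behind these points can be illustrated in plain Ruby. This is only a minimal sketch under stated assumptions, not the code in redis_recommender.rb: the two hashes are hypothetical in-memory stand-ins for the Redis data, co_star_counts is an invented name, and SAMPLE_SIZE is an assumed cap taken from the upper end of the 100-5000 range. Summing the star lists of every user who starred a repository yields, for each other repository, the number of shared stargazers, which is roughly what a summed union such as zunionstore produces in one command.

```ruby
require "set"

MAX_USER_STARS = 1000 # users with more stars are ignored (see the list above)
SAMPLE_SIZE    = 5000 # assumed cap; the README only gives a 100-5000 range

# stars_by_user: { user_id => Set of repo ids }  (hypothetical stand-in for Redis)
# users_by_repo: { repo_id => Set of user ids }  (hypothetical stand-in for Redis)
# Returns { related_repo_id => number of considered users who starred both }.
def co_star_counts(repo_id, stars_by_user, users_by_repo, rng: Random.new)
  users = users_by_repo.fetch(repo_id, Set.new).to_a
  # Large repositories: work on a random sample of stargazers only.
  users = users.sample(SAMPLE_SIZE, random: rng) if users.size > SAMPLE_SIZE

  counts = Hash.new(0)
  users.each do |user_id|
    starred = stars_by_user.fetch(user_id, Set.new)
    next if starred.size > MAX_USER_STARS # skip users who star almost everything

    # Plain-Ruby analogue of a summed union over the users' star sets.
    starred.each { |other| counts[other] += 1 unless other == repo_id }
  end
  counts
end

# Integer ids are used throughout, mirroring the "sets of integers" point above.
stars_by_user = {
  1 => Set[10, 20],
  2 => Set[10, 20, 30],
  3 => Set[10, 30],
  4 => Set[20]
}
users_by_repo = Hash.new { |hash, repo| hash[repo] = Set.new }
stars_by_user.each { |user, repos| repos.each { |repo| users_by_repo[repo] << user } }

counts = co_star_counts(10, stars_by_user, users_by_repo)
# Repositories 20 and 30 each share two stargazers with repository 10.
```

The resulting counts are the numerators of the similarity score; the denominator and the penalty factor are applied afterwards.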

The similarity formula reads as follows:

             |U(A) ∩ U(B)|
S(A, B) = -------------------
          |U(A)| + P * |U(B)|

Where A is the subject repository, B is the related repository, U(x) is the set of users who starred repository x, and P is the "popularity penalty factor" provided by the user in the UI.
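
As a worked example, the formula can be written as a small Ruby function. This is an illustrative sketch only: the function name, the default penalty of 1.0, and the sample user ids are assumptions, not taken from redis_recommender.rb. It shows how raising P pushes a very popular repository below a smaller one that overlaps less in absolute terms.

```ruby
require "set"

# users_a, users_b: sets of integer user ids that starred A and B.
# penalty: the popularity penalty factor P (the default of 1.0 is an assumption).
def similarity(users_a, users_b, penalty = 1.0)
  denominator = users_a.size + penalty * users_b.size
  return 0.0 if denominator.zero?

  (users_a & users_b).size / denominator.to_f
end

subject = Set[1, 2, 3, 4]
popular = Set.new(2..101)       # 100 stargazers, 3 shared with the subject
niche   = Set[1, 2, 200, 201]   # 4 stargazers, 2 shared with the subject

similarity(subject, popular, 0.0) # 3 / 4   = 0.75  (popular ranks first)
similarity(subject, niche,   0.0) # 2 / 4   = 0.5
similarity(subject, popular, 1.0) # 3 / 104 ≈ 0.029
similarity(subject, niche,   1.0) # 2 / 8   = 0.25  (niche ranks first)
```

With P = 1 the score is exactly half of the standard Sørensen–Dice coefficient 2|X ∩ Y| / (|X| + |Y|), so the ranking is the same as with the unmodified coefficient.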

The algorithm is implemented in redis_recommender.rb.

Performance

The algorithm is able to analyse hundreds of thousands of stars in well under 1 second while keeping memory usage below 1 GB on the GitHub dataset. One Redis database with caching is enough to handle a GitHub-size dataset.

Recommendation speed can be improved by introducing more Redis slaves.

Requirements

Ruby 2.1.0
PostgreSQL 9.x
Redis 3.0.0, preferably 32-bit

Technologies

Ruby & Rails 4
CoffeeScript & Angular 1.2
Rails Assets
SCSS, SLIM
Sidekiq

Production installation

The application requires Redis and PostgreSQL database dumps. They can be downloaded using the bin/download script. Please download them only if you really need to test with live data.

curl -o db/dump.rdb http://sheerun.net/dump.rdb
curl -o db/dump.sql.gz http://sheerun.net/dump.sql.gz

You'll also need a Redis instance compiled in 32-bit mode with an increased shared integer count:

#define REDIS_SHARED_INTEGERS 15000000

make 32bit

Once your Redis instance is up and running with the downloaded dump.rdb, and PostgreSQL has the imported dump.sql.gz, you can bundle the application:

bundle install
bin/rake db:create
bin/rake db:migrate

You also need to create a GitHub application with the callback URL set to:

http://localhost:3000/auth/github/callback

Then add a .env file with the following configuration:

GITHUB_KEY=xxx
GITHUB_SECRET=yyy
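
If you want to confirm that both values are visible to the process before starting the application, a small check like the following can help. It is a hypothetical helper, not part of the repository; only the two key names come from the configuration above.

```ruby
# Key names taken from the .env example; the helper itself is hypothetical.
REQUIRED_KEYS = %w[GITHUB_KEY GITHUB_SECRET].freeze

# Returns the required keys that are unset or blank in the given environment.
def missing_keys(env = ENV)
  REQUIRED_KEYS.select { |key| env[key].to_s.strip.empty? }
end

missing = missing_keys
warn "Missing configuration: #{missing.join(', ')}" unless missing.empty?
```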

The application and the Sidekiq worker can be started with:

bin/foreman start

Contributing

We need help with the following:

Making the recommendation engine even more performant
Better front-end design and interaction (the author is a Ruby developer)
Improvements to the recommendation algorithm to get better suggestions
Testing, fixing and maintaining the application

If you think you could help, please open an issue or a pull request on this repository.

License

This project is MIT-licensed.