Awesome
hits
A simple & easy way to see how many people have viewed your GitHub Repository.
Why?
We have a few projects on GitHub ... <br />
Sadly, we have had no idea how many people
are reading/using the projects (unless people star/watch the repo)
because GitHub only shares "traffic" stats
for the past 14 days and not in "real time".
Also, manually checking who has viewed a
project is exceptionally tedious when you have more than a handful of projects.
We want to know the popularity of each of our repos to know what people are finding useful and help us decide where we need to be investing our time.
What?
A simple way to add (very basic) analytics to your GitHub repos.
There are already many "badges" that people use in their repos. See: github.com/dwyl/repo-badges <br /> But we haven't seen one that gives a "hit counter" of the number of times a GitHub page has been viewed ... <br /> So, in today's mini project we're going to create a basic Web Counter.
https://en.wikipedia.org/wiki/Web_counter
What Data to Capture/Store?
The first question we asked ourselves was: What is the minimum possible amount of (useful/unique) info we can store per visit (to one of our projects)?
- date + time (timestamp) when the person visited the site/page. <br /> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now
- url being visited, i.e. which project was viewed.
- user-agent: the browser/device (or "crawler") visiting the site/page. https://en.wikipedia.org/wiki/User_agent
- IP Address of the client (for checking uniqueness).
- Language of the person's web browser. Note: While not "essential", we added Browser Language as the 5th piece of data (when it is set/sent by the browser/device) because it's insightful to know what language people are using so that we can determine if we should be translating/"localising" our content.
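The five fields above can all be read from a standard Node.js request object. The sketch below is only an illustration, not the project's actual code: the function name `extractHit` is hypothetical, and it assumes the server sees the client directly (no proxy in between), so the socket address is the client's IP.

```javascript
// Hypothetical helper: collect the five pieces of "hit" data from a
// Node.js http.IncomingMessage (or any object with the same shape).
function extractHit(req) {
  const headers = req.headers || {};
  // Accept-Language may list several languages, e.g. "en-GB,en;q=0.9".
  // Keep only the first (preferred) one, or '' when the header is absent.
  const language = (headers['accept-language'] || '').split(',')[0].trim();
  return {
    timestamp: Date.now(),                   // date + time of the visit
    url: req.url,                            // which page was requested
    userAgent: headers['user-agent'] || '',  // browser/device/crawler
    ip: req.socket ? req.socket.remoteAddress : '', // assumes no proxy
    language: language                       // browser language, if sent
  };
}
```

Node.js lower-cases incoming header names, which is why the lookups use `user-agent` and `accept-language`.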
"Common Log Format" (CLF) ?
We initially considered using the "Common Log Format" (CLF) because it's well-known/understood. see: https://en.wikipedia.org/wiki/Common_Log_Format
An example log entry:
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Real example:
84.91.136.21 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) 007 [05/Aug/2017:16:50:51 -0000] "GET github.com/dwyl/phase-two HTTP/1.0" 200 42247
The data makes sense when viewed as a table:
| IP Address of Client | User Identifier | User ID | Date+Time of Request | Request "Verb" and URL of Request | HTTP Status Code | Size of Response |
| -------------|:-----------|:--|:------------:|:--------:|:--|--|
| 84.91.136.21 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 007 | [05/Aug/2017:16:50:51 -0000] | "GET github.com/dwyl/phase-two HTTP/1.0" | 200 | 42247 |
On further reflection, we think the "Common Log Format" is inefficient as it contains a lot of duplicate and some useless data.
We can do better.
Alternative Log Format ("ALF")
From the CLF we can remove or condense:
- IP Address, User Identifier and User ID can be condensed into a single hash (see below).
- "GET" - the word is implied by the service we are running (we only accept GET requests)
- Response size is irrelevant and will be the same for most requests.
| Timestamp | URL | User Agent | IP Address | Language | Hit Count |
| ------------- |:------------|:------------|:------------:|:--------:|:--|
| 1436570536950 | github.com/dwyl/the-book | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 84.91.136.21 | EN-GB | 42 |
In the log entry (example) described above the first 3 pieces of data identify the "user" requesting the page/resource, so rather than duplicating the data in an inefficient string, we can hash it!
Any repeating user-identifying data should be concatenated and hashed.
Log entries are stored as a ("pipe" delimited) String
which can be parsed and re-formatted into any other format:
1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42
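As a rough sketch of how such a pipe-delimited entry could be written and read back, consider the two helpers below. The names `FIELDS`, `formatEntry` and `parseEntry` are hypothetical, and the code assumes that no field (in particular the user agent) contains a `|` character.

```javascript
// Field order follows the Alternative Log Format table above.
const FIELDS = ['timestamp', 'url', 'userAgent', 'ip', 'language', 'count'];

// Turn a hit object into a single pipe-delimited line.
function formatEntry(hit) {
  return FIELDS.map(function (key) {
    return String(hit[key]);
  }).join('|');
}

// Turn a pipe-delimited line back into an object.
// Assumption: no field contains the "|" delimiter.
function parseEntry(line) {
  const parts = line.split('|');
  const entry = {};
  FIELDS.forEach(function (key, i) {
    entry[key] = parts[i];
  });
  entry.timestamp = Number(entry.timestamp); // numeric fields back to numbers
  entry.count = Number(entry.count);
  return entry;
}
```

Once parsed into an object, an entry can be re-serialised as JSON, CSV or any other format.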
Reducing Storage (Costs)
If a person views multiple pages, three pieces of data are duplicated: User Agent, IP Address and Language. Rather than storing this data multiple times, we hash the data and store the hash as a lookup.
Hash Long Repeating (Identical) Data
If we run the following Browser|IP|Language String:
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US'
through a SHA hash function we get: 8HKg3NB5Cf (always)<sup>1</sup>.
Sample code:
var hash = require('./lib/hash.js');
var user_agent_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US';
var agent_hash = hash(user_agent_string, 10); // 8HKg3NB5Cf
<sup>1</sup>Note: SHA hash is always 40 characters, but we truncate it because 10 alphanumeric characters (selected from a set of 26 letters + 10 digits) means there are 36<sup>10</sup> = 3,656,158,440,062,976 (three and a half Quadrillion) possible strings which we consider "enough" entropy. (if you disagree, tell us why in an issue!)
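For illustration, a truncated hash along these lines can be built with Node's built-in `crypto` module. This is a sketch, not the contents of `lib/hash.js`: the function name `shortHash`, the choice of SHA-1, and the "base64, then strip symbols" encoding are all assumptions, so it will not necessarily reproduce the `8HKg3NB5Cf` value shown above.

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for lib/hash.js: hash the input and keep only the
// first `length` alphanumeric characters of the encoded digest.
function shortHash(input, length) {
  const digest = crypto.createHash('sha1').update(input).digest('base64');
  // Drop the non-alphanumeric base64 symbols (+, / and = padding).
  return digest.replace(/[^A-Za-z0-9]/g, '').slice(0, length);
}

// The footnote's arithmetic, checked with BigInt to avoid rounding:
const possibleStrings = 36n ** 10n; // 3656158440062976n
```

Because base64 output mixes upper and lower case letters, this particular variant draws from 62 symbols rather than 36, which only increases the number of possible 10-character strings.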
Hit Data With Hash
1436570536950|github.com/dwyl/the-book|8HKg3NB5Cf|42
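A minimal sketch of how the hash could double as a lookup key is shown below, so that the long Browser|IP|Language string is stored only once per visitor. Everything here is an assumption made for illustration: the in-memory `agents` map, the `compactEntry` function, and the SHA-1/base64 `shortHash` helper are hypothetical and are not taken from the project's code.

```javascript
const crypto = require('crypto');

// Hypothetical lookup table: hash -> "User Agent|IP|Language" string.
const agents = new Map();

// Assumed hashing scheme (SHA-1, base64, alphanumerics only, truncated).
function shortHash(input, length) {
  const digest = crypto.createHash('sha1').update(input).digest('base64');
  return digest.replace(/[^A-Za-z0-9]/g, '').slice(0, length);
}

// Build the compact "timestamp|url|hash|count" entry and remember the
// full agent string under its hash the first time it is seen.
function compactEntry(timestamp, url, userAgent, ip, language, count) {
  const agentString = [userAgent, ip, language].join('|');
  const agentHash = shortHash(agentString, 10);
  if (!agents.has(agentHash)) {
    agents.set(agentHash, agentString); // stored once, however many hits
  }
  return [timestamp, url, agentHash, count].join('|');
}
```

Two visits from the same browser, IP address and language produce entries that share one hash, and the full string can be recovered later with `agents.get(hash)`.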
How?
Place a badge (image) in your repo README.md
so others
can see how popular the page is and you can track it.
Run it Yourself!
Download (clone) the code to your local machine:
git clone https://github.com/dwyl/hits.git && cd hits
Note: you will need to have Node.js installed on your local machine.
Install dependencies:
npm install
Run locally:
npm run dev
Visit: http://localhost:8000/any/url/count.svg
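To give a feel for what happens when a URL like that is requested, here is a minimal sketch of a hit counter that answers every request with an SVG image. It is not the project's implementation: the names `recordHit`, `renderBadge` and `createHitServer` are hypothetical, counts live in memory only, and the SVG markup is a bare placeholder rather than the real badge design.

```javascript
const http = require('http');

// url -> number of hits (in memory only, so lost on restart).
const counts = new Map();

// Increment and return the hit count for a URL.
function recordHit(url) {
  const next = (counts.get(url) || 0) + 1;
  counts.set(url, next);
  return next;
}

// Render a very plain SVG "badge" showing the count.
function renderBadge(count) {
  return '<svg xmlns="http://www.w3.org/2000/svg" width="80" height="20">' +
    '<rect width="80" height="20" fill="#555"/>' +
    '<text x="40" y="14" fill="#fff" font-size="11" text-anchor="middle">' +
    'hits ' + count + '</text></svg>';
}

// Create (but do not start) a server that counts every request it sees.
function createHitServer() {
  return http.createServer(function (req, res) {
    const count = recordHit(req.url);
    res.writeHead(200, {
      'Content-Type': 'image/svg+xml',
      'Cache-Control': 'no-cache' // ask clients not to reuse a stale count
    });
    res.end(renderBadge(count));
  });
}

// To try it: createHitServer().listen(8000);
```

Each distinct URL path gets its own counter, so reloading the same address in a browser should show the number going up.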
Data Storage
Recording the "hit" data is essential for this app to work and be useful.
We have built it to work with two "data stores": Filesystem and Redis <!-- and PostgreSQL. --> <br />
Note: you only need one storage option to be available.
Filesystem
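One simple way a filesystem store could work is to append each log entry as a line in a text file and count matching lines when needed. The sketch below assumes that approach. The file location `LOG_FILE` and the functions `saveHit` and `countHits` are hypothetical and may differ from how the project stores its data.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical log location; the real project may store data elsewhere.
const LOG_FILE = path.join(os.tmpdir(), 'hits-example.log');

// Append one pipe-delimited entry as a new line in the log file.
function saveHit(entry, file) {
  fs.appendFileSync(file || LOG_FILE, entry + '\n');
}

// Count the logged entries whose second field (the URL) matches `url`.
function countHits(url, file) {
  const target = file || LOG_FILE;
  if (!fs.existsSync(target)) {
    return 0; // nothing logged yet
  }
  return fs.readFileSync(target, 'utf8')
    .split('\n')
    .filter(function (line) {
      return line.split('|')[1] === url;
    })
    .length;
}
```

Re-reading the whole file on every count is fine for a demo but would get slow as the log grows; it is shown here only to keep the example short.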
Research
User Agents
How many user agents (web browsers + crawlers) are there? There appear to be fewer than a couple of thousand user agents: http://www.useragentstring.com/pages/useragentstring.php which means we could store them using a numeric index, e.g. 1 - 3000.
But, storing the user agents using a numeric index means we
need to perform a lookup on each hit which requires network IO ...
(expensive!)
What if there was a way of deriving a String
representation of
the user-agent string ... oh, that's right, here's one I made earlier...
https://github.com/dwyl/aguid
Log Formats
- Apache Log Sample: http://www.monitorware.com/en/logsamples/apache.php
Node.js http module headers
https://nodejs.org/api/http.html#http_message_rawheaders
Running the Test Suite locally
The test suite includes tests for 3 databases,
therefore running the tests on your localhost
requires all 3 to be running.
Deploying and using the app only requires one of the databases to be available.