Awesome
hits
Why?
We have a few projects on GitHub ... <br />
We want to instantly see the popularity of each of our repos to know what people are finding useful and help us decide where we need to be investing our time.
While GitHub has a basic "traffic" tab which displays page view stats, GitHub only records the data for the past 14 days and then it gets reset. The data is not relayed to the "owner" in "real time" and you would need to use the API and "poll" for data ... Manually checking who has viewed a project is exceptionally tedious when you have more than a handful of projects.
What?
A simple & easy way to see how many people have viewed your GitHub Repository.
There are already many "badges" that people use in their repos. See: github.com/dwyl/repo-badges <br /> But we haven't seen one that gives a "hit counter" of the number of times a GitHub page has been viewed ... <br /> So, in today's mini project we're going to create a basic Web Counter.
https://en.wikipedia.org/wiki/Web_counter
How?
If you simply want to display a "hit count badge" in your project's GitHub page, visit: https://hits.dwyl.io to get the Markdown!
Want to Run it Yourself?!
To run the code on your localhost in 3 easy steps:
1. Download the Code:
Download (clone) the code to your local machine:
git clone https://github.com/dwyl/hits-nodejs.git && cd hits-nodejs
Note: you will need to have Node.js installed on your local machine.
2. Install the Dependencies
Install dependencies:
npm install
3. Run the Server
Run locally:
npm run dev
Now open two web browser windows/tabs:
- first tab: http://localhost:8000/ (this is the hits "home page")
- second tab: http://localhost:8000/any/url/count.svg
Implementation Detail
In case anyone wants to know the thought process that went into building this...
What Data to Capture/Store?
The first question we asked ourselves was: What is the minimum possible amount of (useful/unique) info we can store per visit (to one of our projects)?
- date + time (timestamp) when the person visited the site/page. <br /> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now
- url being visited, i.e. which project was viewed.
- user-agent: the browser/device (or "crawler") visiting the site/page. https://en.wikipedia.org/wiki/User_agent
- IP Address of the client (for checking uniqueness).
- Language of the person's web browser. Note: While not "essential", we added Browser Language as the 5th piece of data (when it is set/sent by the browser/device) because it's insightful to know what language people are using so that we can determine if we should be translating/"localising" our content.
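The five fields listed above can be read from a Node.js request object. The snippet below is a minimal sketch, not the project's actual code: `extractHit` is a hypothetical helper name, and it assumes the standard `http.IncomingMessage` shape (`req.url`, lower-cased `req.headers`, `req.socket.remoteAddress`).

```javascript
// Sketch: collect the five per-visit fields from a Node.js request.
// `extractHit` is a hypothetical name, not part of this project.
function extractHit(req) {
  const headers = req.headers || {};
  // accept-language looks like "en-GB,en;q=0.9"; keep only the first tag
  const language = (headers['accept-language'] || '').split(',')[0].trim();
  return {
    timestamp: Date.now(),                          // date + time of the visit
    url: req.url,                                   // which project was viewed
    userAgent: headers['user-agent'] || '',         // browser/device/crawler
    ip: req.socket ? req.socket.remoteAddress : '', // client IP address
    language: language                              // empty when not sent
  };
}

// Usage with a plain object standing in for http.IncomingMessage:
const hit = extractHit({
  url: '/dwyl/the-book',
  headers: {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)',
    'accept-language': 'en-GB,en;q=0.9'
  },
  socket: { remoteAddress: '84.91.136.21' }
});
console.log(hit);
```

Behind a proxy or load balancer the socket address is usually the proxy's, so a real deployment would probably need to consult a forwarding header instead.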
"Common Log Format" (CLF) ?
We initially considered using the "Common Log Format" (CLF) because it's well-known/understood. see: https://en.wikipedia.org/wiki/Common_Log_Format
An example log entry:
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Real example:
84.91.136.21 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) 007 [05/Aug/2017:16:50:51 -0000] "GET github.com/dwyl/phase-two HTTP/1.0" 200 42247
The data makes sense when viewed as a table:
IP Address of Client | User Identifier | User ID | Date+Time of Request | Request "Verb" and URL of Request | HTTP Status Code | Size of Response |
---|---|---|---|---|---|---|
84.91.136.21 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 007 | [05/Aug/2017:16:50:51 -0000] | "GET github.com/dwyl/phase-two HTTP/1.0" | 200 | 42247 |
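To make the column layout concrete, here is a rough sketch of splitting a strict CLF line into the seven fields in the table. `parseCLF` and `CLF_PATTERN` are hypothetical names. The pattern assumes the identifier and user ID contain no spaces, which holds for the Wikipedia-style example but not for the "real example", where a multi-word user agent sits in the identifier slot.

```javascript
// Matches: host ident authuser [date] "request" status bytes
const CLF_PATTERN = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$/;

// Hypothetical helper: returns an object of fields, or null if the line
// does not follow the strict format.
function parseCLF(line) {
  const match = CLF_PATTERN.exec(line);
  if (!match) return null;
  const [, host, ident, authuser, date, request, status, bytes] = match;
  return {
    host, ident, authuser, date, request,
    status: Number(status),
    bytes: bytes === '-' ? 0 : Number(bytes) // "-" means no body was sent
  };
}

const entry = parseCLF(
  '127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
);
console.log(entry);
```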
On further reflection, we think the "Common Log Format" is inefficient as it contains a lot of duplicate and some useless data.
We can do better.
Alternative Log Format ("ALF")
From the CLF we can remove:
- IP Address, User Identifier and User ID can be condensed into a single hash (see below).
- "GET" - the word is implied by the service we are running (we only accept GET requests)
- Response size is irrelevant and will be the same for most requests.
Timestamp | URL | User Agent | IP Address | Language | Hit Count |
---|---|---|---|---|---|
1436570536950 | github.com/dwyl/the-book | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 84.91.136.21 | EN-GB | 42 |
In the log entry (example) described above the first 3 pieces of data will identify the "user" requesting the page/resource, so rather than duplicating the data in an inefficient string, we can hash it!
Any repeating user-identifying data should be concatenated.
Log entries are stored as a ("pipe" delimited) String which can be parsed and re-formatted into any other format:
1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42
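As an illustration of "parsed and re-formatted", the sketch below round-trips the entry above. `FIELDS`, `formatHit` and `parseHit` are hypothetical names following the column order of the table, and the sketch assumes no field ever contains a `|` character.

```javascript
// Column order taken from the "ALF" table; names are illustrative only.
const FIELDS = ['timestamp', 'url', 'userAgent', 'ip', 'language', 'count'];

// Join the fields of a hit object into one pipe-delimited line.
function formatHit(hit) {
  return FIELDS.map((name) => String(hit[name])).join('|');
}

// Split a pipe-delimited line back into a hit object.
function parseHit(line) {
  const parts = line.split('|');
  const hit = {};
  FIELDS.forEach((name, i) => { hit[name] = parts[i]; });
  hit.timestamp = Number(hit.timestamp); // numeric fields back to numbers
  hit.count = Number(hit.count);
  return hit;
}

const line = '1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42';
const parsed = parseHit(line);
console.log(parsed);
console.log(JSON.stringify(parsed)); // the same entry re-formatted as JSON
```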
Reducing Storage (Costs)
If a person views multiple pages, three pieces of data are duplicated: User Agent, IP Address and Language for each request/log. Rather than storing this data multiple times, we hash the data and store the hash as a lookup.
Hash Long Repeating (Identical) Data
If we run the following Browser|IP|Language String:
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US'
through a SHA hash function we get: 8HKg3NB5Cf (always)<sup>1</sup>.
Sample code:
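// hash(string, length) is this project's helper in lib/hash.js
const hash = require('./lib/hash.js');
// Browser|IP|Language joined into a single "pipe" delimited String
const user_agent_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US';
const agent_hash = hash(user_agent_string, 10); // 8HKg3NB5Cf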
<sup>1</sup>Note: SHA hash is always 40 characters, but we truncate it because 10 alphanumeric characters (selected from a set of 26 letters + 10 digits) means there are 36<sup>10</sup> = 3,656,158,440,062,976 (roughly 3.66 quadrillion) possible strings, which we consider "enough" entropy. (If you disagree, tell us why in an issue!)
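For readers without the repository to hand, the following is one possible way to get a short, repeatable identifier out of a SHA digest. It is only a sketch and almost certainly differs from `lib/hash.js`: `shortHash` is a hypothetical name, SHA-1 is assumed because its hex digest is 40 characters, and the base-36 re-encoding is an assumption chosen to match the "26 letters + 10 digits" alphabet (so its output is lower-case and will not reproduce the sample hash shown above).

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for lib/hash.js: SHA-1 digest, re-encoded in
// base 36 (0-9, a-z) and cut down to `length` characters.
function shortHash(input, length) {
  const hex = crypto.createHash('sha1').update(input).digest('hex'); // 40 chars
  const base36 = BigInt('0x' + hex).toString(36);
  // Keep the low-order digits; pad in the (very unlikely) short case.
  return base36.padStart(length, '0').slice(-length);
}

const agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US';
const first = shortHash(agent, 10);
const second = shortHash(agent, 10);
console.log(first, first === second); // same input, same hash

// Size of the space of 10-character strings over 36 symbols:
const possibleStrings = 36n ** 10n;
console.log(possibleStrings.toString());
```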
Hit Data With Hash
1436570536950|github.com/dwyl/the-book|8HKg3NB5Cf|42
We're sure you will agree this is considerably more compact.
Note: our log also strips the github.com/ from the url so it's:
1436570536950|dwyl/the-book|8HKg3NB5Cf|42
This is a considerable saving over "CLF" (see above).
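A quick sketch of how such a compact entry could be assembled and compared with the CLF line from earlier. `compactHit` is a hypothetical helper, and the hash value is simply copied from the example rather than computed.

```javascript
// Hypothetical helper: build "timestamp|owner/repo|hash|count".
function compactHit(timestamp, url, agentHash, count) {
  const path = url.replace(/^github\.com\//, ''); // drop the implied host
  return [timestamp, path, agentHash, count].join('|');
}

const clfLine = '84.91.136.21 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) 007 [05/Aug/2017:16:50:51 -0000] "GET github.com/dwyl/phase-two HTTP/1.0" 200 42247';
const compact = compactHit(1436570536950, 'github.com/dwyl/the-book', '8HKg3NB5Cf', 42);

console.log(compact); // 1436570536950|dwyl/the-book|8HKg3NB5Cf|42
console.log('compact:', compact.length, 'chars; CLF:', clfLine.length, 'chars');
```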
Data Storage
We aren't using a "Database", rather we are using the filesystem.
Filesystem
For implementation see: /lib/db_filesystem.js
Yes, we know Heroku does not give access to the Filesystem... If you want to run this on Heroku see: https://github.com/dwyl/hits-nodejs/issues/54
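To give a feel for filesystem-backed storage, here is a deliberately simple sketch. It is not the code in `/lib/db_filesystem.js`: the one-file-per-URL layout and the `logFileFor` and `recordHit` names are assumptions made for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical layout: one append-only log file per URL.
function logFileFor(dir, url) {
  return path.join(dir, encodeURIComponent(url) + '.log');
}

// Append one hit line and return the new hit count for that URL.
function recordHit(dir, url, agentHash) {
  const file = logFileFor(dir, url);
  // Count existing lines; re-reading the file is fine for a sketch,
  // but a real service would probably cache the count.
  const previous = fs.existsSync(file)
    ? fs.readFileSync(file, 'utf8').split('\n').filter(Boolean).length
    : 0;
  const count = previous + 1;
  fs.appendFileSync(file, [Date.now(), url, agentHash, count].join('|') + '\n');
  return count;
}

// Usage: write into a throwaway temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'hits-'));
console.log(recordHit(dir, 'dwyl/the-book', '8HKg3NB5Cf'));
console.log(recordHit(dir, 'dwyl/the-book', '8HKg3NB5Cf'));
```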
Research
User Agents
How many user agents (web browsers + crawlers) are there? There appear to be fewer than a couple of thousand (see: https://www.useragentstring.com/pages/useragentstring.php), which means we could store them using a numeric index: 1 - 3000.
But storing the user agents using a numeric index means we need to perform a lookup on each hit, which requires network IO ... (expensive!)
What if there was a way of deriving a String representation of the user-agent string ... oh, that's right, here's one I made earlier...
https://github.com/dwyl/aguid
Log Formats
- Apache Log Sample: https://www.monitorware.com/en/logsamples/apache.php (looked at the existing log formats, all were too verbose/wasteful for us!)
Node.js http module headers
https://nodejs.org/api/http.html#http_message_rawheaders
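Node.js exposes `message.rawHeaders` as a flat array of alternating names and values, exactly as received. The sketch below shows one way to pair them up and look up a header regardless of case. `pairRawHeaders` and `findHeader` are hypothetical helper names, and the sample array is made up for illustration.

```javascript
// rawHeaders is a flat array: [name1, value1, name2, value2, ...]
function pairRawHeaders(rawHeaders) {
  const pairs = [];
  for (let i = 0; i + 1 < rawHeaders.length; i += 2) {
    pairs.push([rawHeaders[i], rawHeaders[i + 1]]);
  }
  return pairs;
}

// Case-insensitive lookup; returns undefined when the header is absent.
function findHeader(rawHeaders, name) {
  const wanted = name.toLowerCase();
  const pair = pairRawHeaders(rawHeaders)
    .find(([key]) => key.toLowerCase() === wanted);
  return pair ? pair[1] : undefined;
}

const sample = [
  'User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)',
  'Accept-Language', 'en-GB,en;q=0.9',
  'Host', 'localhost:8000'
];
console.log(findHeader(sample, 'accept-language')); // en-GB,en;q=0.9
```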
Running the Test Suite locally
The test suite includes tests for 3 databases, therefore running the tests on your localhost requires all 3 to be running.
Deploying and using the app only requires one of the databases to be available.