Awesome
<div align="center">excid
cid
("content id") is a human-friendly
unique ID function
used by mobile-first/distributed apps.
Why?
We needed a way of running our App(s) across multiple servers/devices without fear/risk of collisions in the IDs of records.
How?
- Add
excid
to your list ofdeps
in yourmix.exs
file:
def deps do
[
{:excid, "~> 1.0.1"}
]
end
then run mix deps.get
to download.
- Define in your mix configuration which base you want to use to create the cid.
The value of the base should be the either
:base32
or:base58
. If the base is not defined thenexcid
will use thebase32
by default.
config :excid, base: :base32
- Call the
cid/1
function with aString
,Map
, orStruct
as the single argument:
Cid.cid("hello")
"zb2rhZfjRh2FHHB2RkHVEvL2vJnCTcu7kwRqgVsf9gpkLgteo"
Cid.cid(%{key: "value"})
"zb2rhn1C6ZDoX6rdgiqkqsaeK7RPKTBgEi8scchkf3xdsi8Bj"
If the argument is not one of the ones listed above
then the function will return "invalid data type"
.
Cid.cid([])
"invalid data type"
What?
The goal is to allow records to be created offline (e.g: on a mobile device) with the ID generated on the client and when the record is synched/saved to the server (database) it can be verified and the ID does not need to change. Furthermore because the ID is based on a cryptographic hash there is virtually zero chance of collision.
We could have used "Ye olde" UUID (random/nondeterministic) as the IDs and achieve the desired collision-avoidance, but we felt we could do much better and allow us to build a "checksum" mechanism into our app that will verify records created by any device to allow an effortless distributed/decentralised system.
This may all sound "esoteric" if you have not yet built
an "offline-first" application. Our hope is that this README.md
will clarify our reasoning for "reinventing the wheel".
Context
In a "basic" Relational Database
(e.g: PostgreSQL, MySQL, MariaDB, etc.)
the default way of identifying records
is using an auto-incrementing integer: 1, 2, 3, etc.
This is the optimal way of referencing records
in a single (database) server setup
where the counter
for the number of records
in a given table
only needs to be known by one machine
and there is no chance of conflict.
e.g. if there are currently 100 blog posts in the database,
the server will assign the next blog post the id 101;
there is no chance that it will create a new
post with id 99
which would conflict with the post you wrote yesterday.
Centralised = Single Point of Failure
As users of software or beginners building apps, we don't often think about failure. The reality however, is that systems fail all-the-time. All the most popular systems/services fail even Google who have thousands of world-class engineers dedicated to keeping the system up.
Most Apps are designed to be centralised because it's much easier to get started and for the most part there won't be any issues.
The problem comes when something in the system (e.g: the database) fails.
If the single databases server handling all the requests fails, all users are affected by the outage for as long as it takes the team to revive it.
Having witnessed Distributed Denial of Service ("DDoS") attacks first-hand, we have experienced the pain of having to recover crashed servers with corrupted mission-critical customer data; it's bleak. Yes, there are services like CloudFlare or Imperva which promise to mitigate against DDoS attacks, however the reality is that they are only providing "frontend" protection; if for any reason your single-server database was to crash, your app will still be out-of-action regardless of having CloudFlare.
Why Decentralise?
If you have never had the experience of being offline or the service you are using being interrupted, then you either live in hyper-connected Paolo Alto (with backup/redundant networks and city-wide WiFi) or simply don't use the Internet "enough" to notice the outages.
All the work we do depends on having access to the Internet.
We need to systematically reduce
that dependency.
Network and hardware fault-tolerance is essential for many apps and enables a whole new "class" of apps to be created.
Specifically applications that are "federated". see: https://en.wikipedia.org/wiki/Federated_architecture
The Apps that we (@dwyl) are creating must be decentralised; there cannot be a single point of failure.
Decentralisation is not just "philosophical" argument, as creative technologists we are directly responsible for the technology we create. The lives of billions of people are at stake if we continue to allow the centralised control of our communication networks.
If you believe in the universal human right to privacy [Article 12] freedom from oppression and the Golden Rule, then logically this is the only thing to do.
Who?
Anyone who is techno-curious about the future of the Internet and wants to understand the way decentralised applications derive the IDs for content.
We feel that most apps can benefit
from being decentralised/distributed by default
because it means they work "offline" when any element fails
and data can easily be "synched" (and verified)
when connection is re-established.
If you want to build a mobile/offline-first progressive mobile web app (PWA) that feels native on both Android and iOS, then understanding CIDs is a good place to start.
If you are building apps that will use a single database instance for whatever reason (e.g: they aren't very "complex" or don't need to be distributed or work offline-first) keep enjoying the simplicity and maybe come back to this later when you feel you need this functionality.
What?
In a distributed database, we need a way of creating IDs for the records without any risk of "collision". We also need a consistent way of creating IDs both on the server and on the client (to allow for offline-first distributed apps).
Why Not Use UUIDs?
There are many ways of creating unique IDs, the most popular has historically been UUID (Universally Unique Identifier) https://en.wikipedia.org/wiki/Universally_unique_identifier
A UUID is a 128-bit number usually represented as base16 (hexadecimal) for example:
85594564-5be7-465f-b007-0fada384ed44
(via https://www.uuidgenerator.net )
Consider the following URL
(featuring a UUID as the id
of a record):
location-app.com/venues/85594564-5be7-465f-b007-0fada384ed44
It doesn't exactly roll off the tongue. 🙄
append-only log.
One of our ... readable/typeable<sup>1</sup>
123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz
With a 56 characters in the set, we can create IDs of the following lengths:<br />
- 2: 56^2 = 3,136
- 3: 56^3 = 175,616
- 4: 56^4 = 9,834,496
- 5: 56^5 = 550,731,776 ...
- 22:
Crucially, we can create a system where the IDs start out at 2 characters and increase gradually as required (similar to how Instagram grew their IDs as they scaled, see below).
Real world usage
cid
from a String
A real world use case for
wanting a cid
based on a String
is a URL shortener.
For example you have a long URL:
https://github.com/dwyl/phoenix-ecto-append-only-log-example
the cid
of this url is:
require Cid
Cid.cid("https://github.com/dwyl/phoenix-ecto-append-only-log-example") # > "zb2rhjLfmD2trAmpEi3ZJSpFtuNokCrpfFido7QHCQRVnGoZW"
We can then create a URLs table in our URL shortening app/service with the following entry:
inserted_at | URL (PK) | cid | short |
---|---|---|---|
1541609554 | https://github.com/dwyl/phoenix-ecto-append-only-log-example | zb2rhjLfmD2trAmpEi3ZJSpFtuNokCrpfFido7QHCQRVnGoZW | zb2 |
So the "short" url would be dwyl.co/zb2
This is a relatively "boring" but still perfect valid use case. If someone attempts to create a short URL for this (same) long URL, the URL shortening app will simply return dwyl.co/zb2 the same short URL each time.
The reason we can abbreviate the URL to just zb2
is because our SHORT URL service has a centralised Database/store.
If we wanted to run a decentralised content addressing system,
we would simply link to the full cid
:
dwyl.co/zb2rhjLfmD2trAmpEi3ZJSpFtuNokCrpfFido7QHCQRVnGoZW
Where the chance of cid
collision
is less than 1 in "the number of
atoms in the Universe".
If we generated 1 Billion CIDs per second
for the next Trillion years there would
still be less than a 0.001% chance of collision.<sup>3</sup>
Tests
The tests for this module are a combination of doctests, unit tests and property based tests.
To run the property based tests you will need to install IPFS. See https://github.com/dwyl/learn-ipfs#how for details.
The property based tests will run by default. These tests are more comprehensive
when compared to the "regular" tests. They will run the Cid.cid
function on
100 randomly generated strings and maps, comparing the results of these to the
IPFS generated cid, ensuring our function is correct in its implementation.
These tests take a little longer to run that the "regular" tests, so if you wish
you can skip them with mix test --exclude ipfs
Research, Background & Relevant Reading
- Real World examples of services that use Strings as IDs instead of Integers. Real World Examples
- (Please) Stop Using Unsafe Characters in URLs https://perishablepress.com/stop-using-unsafe-characters-in-urls/
- Safe characters for friendly URLs: https://stackoverflow.com/questions/695438/safe-characters-for-friendly-url
- When to use non-sequential Strings as IDs instead of Integers: https://softwareengineering.stackexchange.com/questions/361395/when-use-a-long-string-id-instead-of-a-integer
- Running out of numeric IDs in JavaScript! https://asana.com/developers/news/string-ids
- Clean URL: https://en.wikipedia.org/wiki/Clean_URL
- URL Slugs: https://seo-hacker.com/url-seo-tutorial
- Raft consensus: https://en.wikipedia.org/wiki/Raft_(computer_science)
- What are the odds of collisions for a hash function with 256-bit output? https://crypto.stackexchange.com/questions/39641/what-are-the-odds-of-collisions-for-a-hash-function-with-256-bit-output
- Collision (computer science): https://en.wikipedia.org/wiki/Collision_(computer_science)
- Hash Collision Probabilities: https://preshing.com/20110504/hash-collision-probabilities
- UUID collisions: https://softwareengineering.stackexchange.com/questions/130261/uuid-collisions
<br /> <hr /> <br />
Notes
<sup>1</sup> The quality of being "typeable" specifically on a mobile device, means that a human being can type an ID in a reasonable amount of time (e.g: fewer than 7 seconds) and
<sup>2</sup> The list of Discontinued Google services continues to grow https://en.wikipedia.org/wiki/Category:Discontinued_Google_services
<sup>3</sup> How to calculate collision probability in an ID system?
https://en.wikipedia.org/wiki/Universally_unique_identifier
<!-- ``` # consider the following iex> {:ok, buff} = "C56A4180-65AA-42EC-A945-5FD21DEC0538" |> String.replace("-", "") |> Base.decode16 iex> buff |> Base.encode64 # "xWpBgGWqQuypRV/SHewFOA==" (22 characters + 2 "=" signs as "padding") ``` -->With a Base16 character set and 32 character of ID length,
FAQs
How can we guarantee the the CID generated will be unique?
We will be using the SHA256 hash which to date has not incurred a single hash collision. 256 bits of data is used by most crypto currencies. Enough "smart people" have done the homework on this for us not to worry about it.
This is a good video on 256 bit hash collision probability: https://youtu.be/S9JGmA5_unY
This video does not cover the "Birthday Paradox" see: https://github.com/nelsonic/nelsonic.github.io/issues/576. But again, for the purposes of this answer and indeed any project we are likely to work on in our lifetime, when dealing with 256 bit hashes, the chance of a "birthday attack" creating a collision is "ignorable".
If all unique content creates a unique CID, how do we update content in our database?
The update version of content would be linked to the previous version using a
prev
field the way it happens in IPFS, Etherium and Bitcoin (so it will
be familiar to people)
prev: previous_cid
address example:
inserted | cid (PK)<sup>1</sup> | name | address | prev |
---|---|---|---|---|
1541609554 | gVSTedHFGBetxy | Bruce Wane | 1007 Mountain Drive, Gotham | null |
1541618643 | smnELuCmEaX42 | Bruce Wane | Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil | gVSTedHFGBetxy |
When a row does not have a prev
value then we know it is the first
time that content has been inserted into the database. When a prev
value
is defined in a row we know this is a new version of a previously inserted
content and we can "traverse the tree" to see all previous versions.