Distributed ID Generation

Distributed ID Generation Scheme#

Content comes from

Those Amazing Algorithms (Part Four) — Unique ID Generator Snowflake

Distributed ID Generator Solutions

Distributed Global ID Under High Concurrency

Problem Description#

In a single-machine system (for example, a MySQL instance), generating a unique ID is very simple, as it can be achieved directly using MySQL's built-in auto-increment ID feature.

However, in a distributed system with multiple shards (for example, a cluster composed of multiple MySQL instances where data is inserted), this problem becomes complex. The globally unique ID generated must meet the following requirements:

Ensure that the generated ID is globally unique
Future data migration between multiple shards should not be limited by the ID generation method
It is preferable for the generated ID to include time information, for example, the first k bits of the ID are a Timestamp, allowing data to be sorted by time directly through sorting the first k bits of the ID
The generated ID should ideally not exceed 64 bits
There are speed requirements for ID generation. For example, in a high-throughput scenario, tens of thousands of IDs need to be generated per second (Twitter's latest peak reached 143,199 Tweets/s, which is over 100,000 per second)
The entire service should ideally have no single point of failure

Timestamp#

Using time to create a unique ID is basically unfeasible in high concurrency or distributed environments, as IDs generated at the same time will be duplicated, failing to meet global uniqueness.

Using Database Auto-Increment#

Still utilizing the database to generate auto-increment IDs to ensure uniqueness. However, it should be noted that without database sharding, a fixed number of database tables can be used solely for generating auto-increment IDs, unrelated to business logic, and no further sharding will occur.
The advantage is that it is simple to implement and easy to understand. The downside is that the ID generation rate is affected by database performance and the network connecting to the database.

Using Redis Atomic Operations#

Does not rely on a database, is flexible and convenient, and performs better than a database.
Numeric IDs are naturally sortable, which is very helpful for pagination or results that require sorting.

Note:

In the case of a Redis cluster, different increment steps need to be set.
If the generated ID exceeds the auto-increment growth, a prefix + auto-increment + setting an expiration time can be used.

Using UUID/GUID#

Basic Concept#

UUID refers to a number generated on a single machine, which guarantees uniqueness across all machines in the same space-time.
UUID components: current date and time + clock sequence + random number + globally unique IEEE machine identifier.

Advantages and Disadvantages#

Advantages:

Simple, easy to code
ID generation performance is very good, basically no performance issues
Globally unique, can handle data migration, system data merging, or database changes with ease.
Disadvantages:
No sorting, cannot guarantee trend increment
UUIDs are often stored as strings, resulting in lower query efficiency
Storage space is relatively large; if it is a massive database, storage capacity must be considered.
Too long, 128 bits, not suitable for database primary keys.

Twitter Snowflake#

Snowflake is Twitter's open-source distributed ID generation algorithm, resulting in a long-type ID.
Its core idea is: high bits random + milliseconds + machine code (data center + machine ID) + 10-bit sequence number

ID Generation Process#

10 bits for the machine number, obtained from a Zookeeper cluster when the ID allocation worker starts (ensuring that all workers do not have duplicate machine numbers).
41 bits for the Timestamp: Each time a new ID is to be generated, the current Timestamp is obtained, and then the sequence number is generated in two scenarios:
1. If the current Timestamp is the same as the Timestamp of the previously generated ID (within the same millisecond), the sequence number of the previous ID + 1 is used as the new sequence number (12 bits); if all IDs in this millisecond are used up, wait for the next millisecond to continue (during this wait, no new IDs can be allocated).
2. If the current Timestamp is greater than the Timestamp of the previous ID, a random initial sequence number (12 bits) is generated as the first sequence number for this millisecond. Throughout the process, there is only an external dependency when the worker starts (it needs to obtain the worker number from Zookeeper), after which it can work independently, achieving decentralization.