Constant Identity

An attempt to implement identity capabilities in a flat-file content management system.

Table of Contents

Identity of a Page
Is an Index Necessary?
1. File Format
  1. CSV
  2. TSV
2. File Size
Indexing Methods
Conclusion

The purpose of an identity is to give an entity a fixed reference that stays valid while the value of the entity changes, so that the changes you expect to make do not break references to the entity in other places.

The most direct way to give an entity an identity is to assign it a unique number. A number is constant and does not need to have any particular meaning. That is the point of an identity. The only requirement is that it be unique among the identities within the same scope.

In a relational database, this capability already exists in the form of a primary key. In a common file system, identity comes in the form of a file/folder path. A file/folder path is unique, but is not constant. The path value of a file/folder changes when it gets moved or renamed.
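The difference between a path and an identity can be illustrated with a tiny sketch. The `$pages` array below is a hypothetical in-memory stand-in for the file system, not something the CMS provides; it only shows that a reference by path breaks on rename while a reference by ID survives:

```php
<?php

// Hypothetical registry: ID => current path
$pages = [1 => '/lot/page/lorem-ipsum'];

// Two ways of referring to the same page
$by_path = '/lot/page/lorem-ipsum';
$by_id = 1;

// Rename the page: the path changes, the ID does not
$pages[1] = '/lot/page/dolor-sit-amet';

// The old path no longer points anywhere, the ID still does
$path_still_valid = in_array($by_path, $pages, true);
$id_still_valid = isset($pages[$by_id]);

var_dump($path_still_valid); // bool(false)
var_dump($id_still_valid); // bool(true)
```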

Identity of a Page

There is an ID property on a Page object, whose value can be set by either having an id property in the page header, or by having an id.data file in the folder associated with the page file:

Using File

.\
└── lot\
    └── page\
        ├── lorem-ipsum\
        │   └── id.data
        └── lorem-ipsum.page

Using Property

---
title: Lorem Ipsum
description: Lorem ipsum dolor sit amet.
id: 123
...

<p>Lorem ipsum dolor sit amet.</p>

If neither of the above conditions is met, the ID property defaults to the unsigned value of the file creation time. That value is unique, but it can change when the file is updated or moved from server to server. It is also unique only as long as the author creates the page files by hand. If the author generates a number of pages at once, for example with a PHP script, the creation times end up so close together (possibly only a few milliseconds apart) that several pages may get the same ID value. This makes the value unreliable as an identity.
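The collision is easy to reproduce. The sketch below assumes that the default ID is derived from a timestamp with one-second resolution, such as the one PHP’s `filectime()` returns (on Unix-like systems this is the inode change time rather than a true creation time). The temporary folder and file names are made up for the demonstration:

```php
<?php

// Create a few files in quick succession, as a batch script would
$folder = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'identity-' . uniqid();
mkdir($folder);

$times = [];
foreach (['a', 'b', 'c', 'd', 'e'] as $name) {
    $file = $folder . DIRECTORY_SEPARATOR . $name . '.page';
    file_put_contents($file, "");
    // `filectime()` has a resolution of one second
    $times[$name] = filectime($file);
}

// Fewer distinct values than files means that some files share a value
$distinct = count(array_unique($times));
echo $distinct . ' distinct value(s) for ' . count($times) . ' files' . "\n";

// Clean up
array_map('unlink', glob($folder . DIRECTORY_SEPARATOR . '*.page'));
rmdir($folder);
```

All five files are written well within a second or two, so at least two of them end up with the same value.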

Note that a manually specified page time also affects the ID value, so a time.data file or a page name in the YYYY-MM-DD or YYYY-MM-DD-hh-mm-ss pattern changes the automatic ID value as well. Both are long-established ways of dealing with inconsistent file creation times: by default, the Time property of a page takes its value from the file’s creation time in the file system, which can change as the file is updated and/or moved between servers.

From the Page File

.\
└── lot\
    └── page\
        └── 2023-11-18-18-40-10.page

From the Page’s Time File

.\
└── lot\
    └── page\
        ├── lorem-ipsum\
        │   └── time.data
        └── lorem-ipsum.page
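How a time could be read from such a page name can be sketched as follows. This is not the CMS’s own code: `time_from_name()` is a hypothetical helper, and treating the name as UTC is an assumption made only to keep the result predictable:

```php
<?php

// Hypothetical helper: read a time from a page name, or return `null`
function time_from_name(string $file): ?int {
    $name = pathinfo($file, PATHINFO_FILENAME);
    $zone = new DateTimeZone('UTC'); // Assumption: names are in UTC
    // The `!` resets unparsed fields, so a date-only name means midnight
    foreach (['!Y-m-d-H-i-s', '!Y-m-d'] as $format) {
        $date = DateTimeImmutable::createFromFormat($format, $name, $zone);
        // Reject names that only parse through overflow, such as month `13`
        if ($date && $date->format(substr($format, 1)) === $name) {
            return $date->getTimestamp();
        }
    }
    return null;
}

var_dump(time_from_name('2023-11-18-18-40-10.page')); // int(1700332810)
var_dump(time_from_name('lorem-ipsum.page')); // NULL
```

A name that does not follow either pattern yields `null`, in which case the time would have to come from somewhere else, such as a time.data file or the file system.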

The id.data file and the id property are the two methods that the Tag extension has used from early on to make it easy to reference tags on pages. Because the id value is used as the reference, the reference stays constant, which allows authors to rename a tag at will without having to update all of the tag references that already exist in the pages.

However, obtaining the path of a tag file by its ID will not be as quick as obtaining a row in a database by its primary key, mainly due to the lack of an index. You will have to iterate over all the id.data files in the .\lot\tag folder to get the id value you are looking for:

$id = 123; // The tag ID to search for
$tag = new Tag; // An empty tag object

foreach (glob(LOT . D . 'tag' . D . '*' . D . 'id.data', GLOB_NOSORT) as $v) {
    if (!is_file($file = dirname($v) . '.page')) {
        // Is not associated with a public tag, skip!
        continue;
    }
    // Found a public tag!
    if (is_file($v) && $id === (int) trim((string) fgets(fopen($v, 'r')))) {
        $tag = new Tag($file);
        break;
    }
}

if ($tag->exist) {
    // …
}

This workload is not much of a problem because there are usually not that many tag files. Tag files also have no depth level like regular page files do, so PHP’s glob() function is fast enough to find the tag ID you are looking for.

Another thing that makes the identity feature difficult to implement is how to store the last ID that was created. It is needed to determine the next ID value and to make sure that it is distinct from all previous ID values. If you are using the Tag extension without the Panel, you can probably keep the last ID value in your head, but that stops working once the ID values come from regular page files. The more pages you have, the more likely you are to forget the last ID value you created.

The most practical way to keep track of the last ID value is to store it in a log file every time a new page is created, assuming that you are using a GUI that creates your page file on a POST request:

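// Read the last ID value, or fall back to `0` if the log file does not exist yet
$last = is_file($log = ENGINE . D . 'log' . D . 'last-id.log') ? (int) trim((string) fgets(fopen($log, 'r'))) : 0;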

if ('POST' === $_SERVER['REQUEST_METHOD']) {
    // Store the next ID to a page file…
    $next = $last + 1;
    file_put_contents(LOT . D . 'page' . D . 'lorem-ipsum' . D . 'id.data', (string) $next);
    // Store the next ID as the last ID that was created…
    file_put_contents(ENGINE . D . 'log' . D . 'last-id.log', (string) $next);
}
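If two POST requests arrive at nearly the same time, both may read the same last ID before either of them writes the new one. One possible way to guard against that is to do the read and the write under an exclusive lock. The sketch below is only an illustration of that idea, and `reserve_next_id()` is a hypothetical helper name:

```php
<?php

// Hypothetical helper: reserve the next ID under an exclusive lock
function reserve_next_id(string $file): int {
    // Mode `c+` opens the file for reading and writing and creates it if
    // it is missing, without truncating the existing content
    $h = fopen($file, 'c+');
    if (false === $h) {
        throw new RuntimeException('Unable to open ' . $file);
    }
    try {
        if (!flock($h, LOCK_EX)) {
            throw new RuntimeException('Unable to lock ' . $file);
        }
        // An empty file counts as `0`, so the first ID will be `1`
        $next = (int) trim((string) stream_get_contents($h)) + 1;
        // Replace the old value with the new one
        ftruncate($h, 0);
        rewind($h);
        fwrite($h, (string) $next);
        fflush($h);
        flock($h, LOCK_UN);
    } finally {
        fclose($h);
    }
    return $next;
}
```

Each call returns a value that no other call has returned before, as long as every writer goes through the same function and the file system supports `flock()`.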

However, the last ID value becomes unreliable when the user deletes the log file, whether intentionally or accidentally. To ensure that the last ID value is always valid, it is necessary to scan the ID values of all page files and take the largest one to increment later. This is very expensive:

$last = 0;

foreach (g(LOT . D . 'page', 'data', true) as $k => $v) {
    if ('id.data' !== basename($k)) {
        continue;
    }
    $current = (int) trim((string) fgets(fopen($k, 'r')));
    if ($current > $last) {
        $last = $current;
    }
}

$next = $last + 1;

// …

You can’t do this on every request because it slows down the server. The more visitors your site gets, the more the server slows down due to this file system scan. You will probably need to define a list of events that trigger this task. Ideally, the task runs once and its result is stored in a log file, to be reused for as long as no new pages are created:

function last_page_id(): int {
    // Read last ID value from the log file
    if (is_file($file = ENGINE . D . 'log' . D . 'last-id.log')) {
        return (int) trim((string) fgets(fopen($file, 'r')));
    }
    // Find last ID value from the page files
    $last = 0;
    foreach (g(LOT . D . 'page', 'data', true) as $k => $v) {
        if ('id.data' !== basename($k)) {
            continue;
        }
        $current = (int) trim((string) fgets(fopen($k, 'r')));
        if ($current > $last) {
            $last = $current;
        }
    }
    // Store the last ID value to the log file
    file_put_contents($file, (string) $last);
    return $last;
}

$next = last_page_id() + 1;

Your task is then to refresh the last ID value in the log file, for example by deleting the log file after a new page has been created successfully, so that the next call rebuilds it. Or, even faster, simply increment the last ID value in the log file, which avoids scanning all the page files again just to obtain the largest ID value. This may leave gaps between ID values, but who cares about the density of the numbers? As long as they are unique, it does not matter:

$last = last_page_id();

if ('POST' === $_SERVER['REQUEST_METHOD']) {
    file_put_contents(ENGINE . D . 'log' . D . 'last-id.log', (string) ($last + 1));
}

Is an Index Necessary?

We have found a way to keep the ID unique on each page creation cycle, but now there is a new problem: the speed of getting a file path from an ID reference. For now, only the Tag extension uses the identity feature, to reference tags from pages. Finding tags by ID is not very optimized. It iterates over all of the tag files to find the ID you are looking for, then takes the matching tag name and assembles it into a file path, which is finally turned into a tag object. This is what actually happens behind the scenes:

$kind = (array) ($page->kind ?? []);
$tags = [];

foreach ($kind as $k) {
    if (!is_int($k)) {
        continue;
    }
    foreach (glob(LOT . D . 'tag' . D . '*' . D . 'id.data', GLOB_NOSORT) as $v) {
        if (!is_file($file = dirname($v) . '.page')) {
            continue;
        }
        $id = (int) trim((string) fgets(fopen($v, 'r')));
        if ($id === $k) {
            $tags[] = $file;
            break;
        }
    }
}

$tags = new Tags($tags);

Imagine trying to get the file path of a regular page from the page ID. You may end up having to recursively iterate over the page files just to get an ID value. At this point, it seems important to have an index file.

I don’t know exactly how databases store their indexes, but it’s actually quite easy to simulate this feature. All you need is a list of IDs, each paired with a page file path. To find a page file path by its ID, you scan this index file for the ID you are looking for, then return the page file path stored right next to it.

File Format

Contrary to what you might think, JSON, serialized PHP, and YAML are not good options here. They are more complex to parse, and usually require the entire file contents to be loaded into memory before the data can be constructed. You need a file format that stores each record on a single line. This allows us to stream the index file line by line, so reading it doesn’t take up much memory. In this case, the options are CSV and TSV.

CSV

1,/lot/page/index
2,/lot/page/about
3,/lot/page/article
4,/lot/page/contact

TSV

1	/lot/page/index
2	/lot/page/about
3	/lot/page/article
4	/lot/page/contact
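Reading either format line by line could look like the sketch below. `find_path()` is a hypothetical helper, and it assumes the simple rows shown above, with no quoted fields and no separator inside the path:

```php
<?php

// Hypothetical helper: find a path by ID in a delimited index file
function find_path(string $file, int $id, string $separator): ?string {
    $h = fopen($file, 'r');
    if (false === $h) {
        return null;
    }
    $path = null;
    // Only one line is held in memory at a time
    while (false !== ($row = fgets($h))) {
        // Split into two parts only: the ID and the rest of the row
        $parts = explode($separator, rtrim($row, "\r\n"), 2);
        if (2 === count($parts) && (string) $id === $parts[0]) {
            $path = $parts[1];
            break;
        }
    }
    fclose($h);
    return $path;
}
```

The same function serves both formats: pass `','` for CSV and `"\t"` for TSV.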

File Size

File size is also a consideration because the smaller the file, the faster the data can be extracted from it. To save some extra bytes, we can omit the delimiter. IDs are just numbers, so it won’t be hard to separate them from the file path data. The contents below are no longer valid CSV or TSV, so it is better not to treat this file as having any specific format:

1/lot/page/index
2/lot/page/about
3/lot/page/article
4/lot/page/contact

We also need to determine how strictly an ID must be unique. For example, in the file path data above, they all start with the string '/lot/page/'. We can omit those too, considering that the unique ID only applies to the .\lot\page folder (e.g. we can have a tag file and a regular page file with the same ID value):

1/index
2/about
3/article
4/contact

The context can be specified by where the index file is stored, so you can simply store this index file in the .\lot\page folder, for example, as a .id file:

.\
└── lot\
    ├── page\
    │   ├── …
    │   └── .id
    └── tag\
        ├── …
        └── .id

Indexing Methods

Building the Index

Building an index file involves scanning all of the page files in a folder. This task is expected to be done once or at the specified events, e.g. to correct invalid indices or to re-generate the index file if the file is missing:

function create_page_index(string $folder): void {
    $rows = [];
    foreach (g($folder, 'archive,draft,page', true) as $k => $v) {
        if (!is_file($v = dirname($k) . D . pathinfo($k, PATHINFO_FILENAME) . D . 'id.data')) {
            continue;
        }
        $id = (int) trim((string) fgets(fopen($v, 'r')));
        // Remove folder prefix and file extension, normalize directory separator to `/`
        $route = strtr(substr($k, strlen($folder), -(strlen(pathinfo($k, PATHINFO_EXTENSION)) + 1)), [D => '/']);
        $rows[$id] = $id . $route;
    }
    if ($rows) {
        // Sort by ID in ascending order, so that the last row will have the largest ID value
        ksort($rows);
        file_put_contents($folder . D . '.id', implode("\n", $rows), LOCK_EX);
    }
}

// Perform this task on specific events…
create_page_index(LOT . D . 'page');

Obtaining the File Path from Index

The process of obtaining the file path data by ID is achieved by reading the index file line by line:

function get_page_from_id(int $id, string $folder): Page {
    $page = new Page; // An empty page object
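    if (!is_file($folder . D . '.id')) {
        return $page; // No index file yet, return the empty page object
    }
    $h = fopen($folder . D . '.id', 'r');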
    while (false !== ($row = fgets($h))) {
        if ("" === ($row = trim($row))) {
            continue;
        }
        if (0 !== strpos($row, $id . '/')) {
            continue;
        }
        $route = $folder . D . substr($row, strlen($id . '/'));
        if ($file = exist([
            $route . '.archive',
            $route . '.page'
        ], 1)) {
            $page = new Page($file);
        }
        break;
    }
    fclose($h);
    return $page;
}

$page = get_page_from_id(123, LOT . D . 'page');

if ($page->exist) {
    // …
}
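Since the rows are kept in ascending order of ID, a lookup could also stop as soon as it passes the ID it is looking for, instead of reading to the end of the file when the ID does not exist. The following is only a sketch of that idea: `find_route_sorted()` is a hypothetical helper that works on the raw index file and returns the route rather than a page object:

```php
<?php

// Hypothetical helper: find a route by ID, assuming rows are sorted by ID
function find_route_sorted(string $file, int $id): ?string {
    $h = fopen($file, 'r');
    if (false === $h) {
        return null;
    }
    $route = null;
    while (false !== ($row = fgets($h))) {
        $row = trim($row);
        // Length of the leading run of digits
        $n = strspn($row, '0123456789');
        if (0 === $n) {
            continue; // Empty or invalid row
        }
        $current = (int) substr($row, 0, $n);
        if ($current > $id) {
            break; // All remaining rows have larger IDs
        }
        if ($current === $id) {
            $route = substr($row, $n);
            break;
        }
    }
    fclose($h);
    return $route;
}
```

This only pays off for IDs that are missing from the index, and it relies on the index never being written out of order.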

Obtaining the Last ID from Index

I have to credit an answer on Stack Overflow here. This is an efficient way to read the last line of a file, even a very, very big one, because it reads backwards from the end of the file instead of forwards from the start:

// <https://stackoverflow.com/a/1510248/1163000>
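function get_last_page_index(string $folder): string {
    $row = "";
    if (!is_file($file = $folder . D . '.id') || 0 === filesize($file)) {
        return $row; // No index data yet
    }
    $h = fopen($file, 'r');
    fseek($h, $cursor = -1, SEEK_END);
    $c = fgetc($h);
    // Trim trailing new line of the file
    while ("\n" === $c || "\r" === $c) {
        if (-1 === fseek($h, --$cursor, SEEK_END)) {
            // The file contains nothing but new line characters
            $c = false;
            break;
        }
        $c = fgetc($h);
    }
    // Read until the start of file or first new line character
    while (false !== $c && "\n" !== $c && "\r" !== $c) {
        $row = $c . $row;
        // Seeking past the start of file fails, so stop there. Otherwise,
        // a single-line file would be read forwards again from this point
        if (-1 === fseek($h, --$cursor, SEEK_END)) {
            break;
        }
        $c = fgetc($h);
    }
    fclose($h);
    return $row;
}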

function get_last_page_id(string $folder): int {
    return (int) strstr(get_last_page_index($folder), '/', true);
}

test(get_last_page_index(LOT . D . 'page'), get_last_page_id(LOT . D . 'page'));

Adding New Entry to Index

Adding an entry to the index is done by appending a new row. There is no need to read the entire index file first. We also know that the last row always contains the largest ID value, so each new row must have an even larger ID value:

function set_page_index(string $row, string $folder): void {
    $void = !is_file($folder . D . '.id') || 0 === filesize($folder . D . '.id');
    file_put_contents($folder . D . '.id', ($void ? "" : "\n") . $row, FILE_APPEND | LOCK_EX);
}

$last = get_last_page_id(LOT . D . 'page');
$route = '/article/lorem-ipsum';

set_page_index(($last + 1) . $route, LOT . D . 'page');

// Don’t forget to store the `id.data` file as well
file_put_contents(LOT . D . 'page' . $route . D . 'id.data', (string) ($last + 1));

Conclusion

Don’t worry about the complexity of the concept in this article. This feature will be available as a separate extension in the future. For now, it is not really necessary, because you can obviously identify a file by its path, and that is the most efficient way to obtain a file location 😉

1 Comment

Taufik Nurrohman

Wednesday, January 03, 2024 08:43 PM

Another way to create an index is to store the references as individual files, where each reference file (the file named route, in this case) contains the relative path to the page file, without the extension. This should provide faster, direct reads and writes, but the folder may take longer to copy in the future, due to the way a file system handles a large number of small files:

.\lot\page\
└── .id\
    ├── 1\
    │   └── route
    ├── 2\
    │   └── route
    ├── 3\
    │   └── route
    ├── 4\
    │   └── route
    └── …
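Reading and writing such an index could be sketched as follows. Both functions are hypothetical helpers written only for illustration; they assume the `.id` folder layout shown above:

```php
<?php

// Hypothetical helper: store the route of a page under its ID
function set_route(string $folder, int $id, string $route): void {
    $d = $folder . DIRECTORY_SEPARATOR . '.id' . DIRECTORY_SEPARATOR . $id;
    if (!is_dir($d)) {
        mkdir($d, 0775, true);
    }
    file_put_contents($d . DIRECTORY_SEPARATOR . 'route', $route, LOCK_EX);
}

// Hypothetical helper: read the route of a page from its ID
function get_route(string $folder, int $id): ?string {
    $file = $folder . DIRECTORY_SEPARATOR . '.id' . DIRECTORY_SEPARATOR . $id . DIRECTORY_SEPARATOR . 'route';
    return is_file($file) ? trim((string) file_get_contents($file)) : null;
}
```

A lookup then costs a single file read, whatever the number of pages, and no scanning is involved at all.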