Update client-architecture.md

* Update details
2026-03-14 14:35:46 +01:00 · 2025-10-03 11:50:49 +10:00 · 634c46a4ec
commit 634c46a4ec
parent 1341396da7
1 changed files with 28 additions and 64 deletions
--- a/docs/client-architecture.md
+++ b/docs/client-architecture.md
@@ -104,82 +104,46 @@ This exclusion process can be illustrated by the following activity diagram. A '

 ![Client Side Filtering Determination](./puml/client_side_filtering_rules.png)

-## Understanding the Client Processing Process of Online State
-When the client is processing your online data state, the application will generate the following output:
+## Understanding how the client processes online state
+When you see `Fetching items from the OneDrive API for Drive ID:` or `Generating a /delta response from the OneDrive API for this Drive ID:`, the client isn’t stuck. It is working through paged change sets from Microsoft Graph using your current delta token, reconciling them with the local database, and safely scheduling work. Each Microsoft Graph response carries either `@odata.nextLink` (more pages to fetch) or `@odata.deltaLink` (caught up; keep this token for next time), and the client follows those links until it reaches a stable point. Page sizing and paging behaviour are controlled by the Microsoft Graph API service.
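
The paging behaviour described above can be sketched as a small loop. This is an illustrative sketch only, not the client's implementation: `walk_delta`, `fetch_page` and the in-memory `FAKE_PAGES` table are hypothetical names standing in for real HTTP calls, and only the `value`, `@odata.nextLink` and `@odata.deltaLink` response fields come from Microsoft Graph.

```python
from typing import Any, Callable, Dict, List, Tuple

Page = Dict[str, Any]


def walk_delta(fetch_page: Callable[[str], Page], start_url: str) -> Tuple[List[dict], str]:
    """Follow @odata.nextLink page by page until @odata.deltaLink appears.

    `fetch_page` is a hypothetical callable that returns one decoded page.
    Returns every item seen plus the delta link to store for the next run.
    """
    items: List[dict] = []
    url = start_url
    while True:
        page = fetch_page(url)
        items.extend(page.get("value", []))
        if "@odata.nextLink" in page:
            # More pages remain: keep iterating.
            url = page["@odata.nextLink"]
        elif "@odata.deltaLink" in page:
            # Caught up: this link is the starting point for the next sync.
            return items, page["@odata.deltaLink"]
        else:
            raise ValueError("page carried neither a nextLink nor a deltaLink")


# Hypothetical two-page response set, keyed by a stand-in "URL".
FAKE_PAGES: Dict[str, Page] = {
    "page1": {"value": [{"id": "a"}, {"id": "b"}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"id": "c"}], "@odata.deltaLink": "token-for-next-run"},
}

items, delta_link = walk_delta(FAKE_PAGES.__getitem__, "page1")
```

Because every page must be fetched before the delta link is known, the number of round trips grows with the number of changed items, which is why this phase can take a while on large libraries.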

-#### Fetching a Delta Response Example
-```text
-Fetching items from the OneDrive API for Drive ID: xxxxxx ..............................................
-```
-#### Generating a Delta Response Example
-```text
-Generating a /delta response from the OneDrive API for this Drive ID: xxxxxx and Item ID: xxxxxx .......
-```
-This section explains what’s happening under the hood, why it can take time on large libraries, and what you can do to make it faster and more predictable.
-
-### High Level Explanation
-* Microsoft Graph returns changes in paged bundles (~200 items per bundle). The client must iterate every bundle to reconcile your online state; you’ll therefore see repeated processing dots (`.`) messages while it works through each page. This page size is **set by Microsoft and not configurable.**
-* A full scan (first run, use of `--resync`) will always take longer than an incremental scan because the client must enumerate all items online and locally. Subsequent runs are much faster once the /delta token is established.
-* Overall speed is bounded by a mix of network throughput, CPU, filesystem and disk I/O, and how many items (files + folders) you have. A complex online state with deep trees costs more metadata operations than simple directory structures and file counts.
-* To review exactly what the client is doing, consider adding `--verbose` to your client input options.
-
-### Application Processing Steps: Where the time goes (phase-by-phase)
-The flow diagrams above show the main application decision points. The log lines below correspond to the key phases you’ll see during a typical run (standalone or monitor).
-
-1. **Fetch current changes from Microsoft Graph**
-    * **Application Output:** `Fetching items from the OneDrive API for Drive ID: …`
-    * **What happens:** The client requests change bundles (≈200 items per bundle) using your current delta token. If the token is invalid or it’s a first run, it performs a broader enumeration.
-    * **Why it can be slow:** High item counts, network latency, or Microsoft Graph API throttling.
-2. **Process each bundle of changes**
+### What a typical cycle looks like
+1. **Fetching online state**
+    * **Application Output:** `Fetching items from the OneDrive API for Drive ID: …` or `Generating a /delta response from the OneDrive API for this Drive ID:`
+    * The client requests the next page of changes using your current delta token.
+2. **Processing received items**
    * **Application Output:** `Processing N applicable changes and items received from Microsoft OneDrive`
-    * **What happens:** For each item, we classify (new/changed/deleted/excluded), reconcile with the local database, and queue any work (download, upload, delete, rename).
-    * **Why it can be slow:** Many directories and small files increase metadata churn; each bundle must be applied in order. Bundle size is fixed by Graph.
-3. **Database integrity checks**
+    * Each item received is classified (add/update/delete/excluded), matched against local state, and queued for action.
+3. **Execute required actions**
+    * Download new or modified files, delete local data that has been deleted online, and create new local directories.
+4. **Database Integrity**
    * **Application Output:** `Performing a database consistency and integrity check on locally stored data`
-    * **What happens:** The application will verify local metadata invariants so subsequent actions don’t corrupt state. This is quick on SSDs but can be noticeable on slow disks.
-4. **Local filesystem scan**
+    * An integrity pass over the locally stored data to prevent state corruption
+5. **Local scan for new local data**
    * **Application Output:** `Scanning the local file system '…' for new data to upload`
-    * **What happens:** The application will walk your configured sync root, applying client-side filtering rules and discovering local items to upload.
-    * **Why it can be slow:** Deep folder trees; slow or network filesystems; a complex local state with deep trees and potentially many exclusions and inclusions to filter when determining what needs to be uploaded.
-5. **Final reconciliation & actions**
-    * **Application Output:** `Number of items to download from Microsoft OneDrive: X`
-    * **What happens:** The application will execute the final action queues. On a healthy delta run this step is short; on a first run or after --resync it can be significant.
-    * **Why it can be slow:** Many small files; constrained bandwidth; server-side throttling.
+    * Traverse local filesystem, honouring client side filtering rules
+6. **True-Up**
+    * **Application Output:** `Performing a last examination of the most recent online data within Microsoft OneDrive to complete the reconciliation process`
+    * Final scan of the online state to ensure that everything is in the state it is meant to be
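
Step 2 of the cycle above (classifying each received item) could look roughly like the following. This is a hedged sketch, not the client's code: `classify`, `local_etags` and `is_excluded` are hypothetical names, the `unchanged` outcome is an added assumption, and comparing `eTag` values is just one plausible way to detect a modification. The `deleted` facet on a delta item is a Microsoft Graph convention.

```python
from typing import Callable, Dict


def classify(item: dict, local_etags: Dict[str, str], is_excluded: Callable[[str], bool]) -> str:
    """Return 'excluded', 'delete', 'add', 'update' or 'unchanged' for one item.

    `local_etags` is a hypothetical stand-in for the local database,
    mapping item id -> last known eTag.
    """
    if is_excluded(item.get("name", "")):
        return "excluded"   # filtered out by client-side rules
    if "deleted" in item:
        return "delete"     # Graph marks removed items with a 'deleted' facet
    known = local_etags.get(item["id"])
    if known is None:
        return "add"        # never seen locally
    return "update" if known != item.get("eTag") else "unchanged"


# Hypothetical local state and exclusion rule for illustration.
local_state = {"1": "v1", "2": "v1"}
skip_tmp = lambda name: name.endswith(".tmp")

results = [
    classify({"id": "1", "name": "a.txt", "eTag": "v2"}, local_state, skip_tmp),
    classify({"id": "2", "name": "b.txt", "deleted": {}}, local_state, skip_tmp),
    classify({"id": "3", "name": "c.txt", "eTag": "v1"}, local_state, skip_tmp),
    classify({"id": "4", "name": "d.tmp", "eTag": "v1"}, local_state, skip_tmp),
]
```

Each classification only queues work; the actual downloads, deletions and directory creation happen in the later "execute required actions" step.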

-### Why a --resync is slower (by design)
-A `--resync` discards the known-good delta token and forces a full online + local walk to re-learn state. This is essential after certain errors or configuration change, but using it routinely will always cost more time than an incremental run. After the first successful scan, subsequent syncs drop from hours to minutes because the delta token narrows the change set dramatically.
+### Why first runs or --resync take longer
+A first run (or a deliberate `--resync`) must enumerate the entire tree to establish a known-good baseline; subsequent incremental runs are much faster because the delta token limits work to just the changes since last time.

 ### What affects performance the most
-* **Item count & structure:** Many folders and small files dominate metadata work.
-* **Network quality:** Latency and throughput directly affect how quickly we can iterate Graph pages and transfer content.
+* **Item count & online structure:** Many folders and files mean more metadata work.
+* **Network:** Latency and throughput directly affect how quickly we can iterate Microsoft Graph API responses and transfer content.
 * **Local Disk & filesystem:** SSDs perform metadata and DB work far faster than spinning disks or remote mounts. Your filesystem type (e.g., ext4, XFS, ZFS) matters and should be tuned appropriately.
 * **File Indexing:** Disable file indexing services (Tracker, Baloo, Searchmonkey, Pinot and others), as these add latency and disk I/O to your operations and slow down performance.
 * **CPU & memory:** Classification and hashing are CPU-bound; insufficient RAM or swap can slow DB and traversal work.
-* **First run vs incremental:** First runs / `--resync` must enumerate everything; incremental runs use the delta token and are much faster.
-
-### Practical ways to improve throughput
-1. Avoid unnecessary `--resync`. Only use it when the client explicitly advises you to. It forces a full scan.
-2. Use client-side filtering to skip noise. Prune build artefacts, caches, temp folders, and other churn using skip_dir, skip_file, skip_dotfiles, etc. Reducing item count speeds up every phase.
-3. Prefer SSD/NVMe for the sync root & DB. Faster metadata, faster DB integrity checks, faster local scans.
-4. Stable, low-latency network. Wi-Fi with high packet loss dramatically slows down page iteration and transfers.
-5. Let incremental sync do its thing. After the first complete pass, don’t interrupt; later cycles will be dramatically faster thanks to the delta token.
-6. Right-size system resources. Ensure adequate RAM/swap and avoid filesystem-level encryption that adds significant CPU overhead if your hardware is modest.

 ## Delta Response vs Generated Delta Response
-By default, the client uses Microsoft Graph’s `/delta` to fetch just-the-changes since the last successful sync. This is fast because the server does the heavy lifting and returns paged change bundles.
-
-However, there are specific scenarios where using `/delta` would be incorrect or unsafe. In these cases the client intentionally falls back to generating a “simulated delta” — i.e., it performs a targeted online tree walk and builds the current state itself. That is naturally more time-consuming than consuming a server-computed delta.
-
-### When the client switches to a simulated delta
-The sync engine will generate a simulated delta whenever any of the conditions below are true:
-
-1. National Cloud deployments don’t support `/delta`. National Cloud environments lack `/delta` feature support. To maintain correctness, the client enumerates the relevant paths online and synthesises a change set.
-2. The use of `--single-directory` scope. When you restrict the sync to a single online directory, a naïve /delta against the drive can include changes outside that scope. The simulated delta ensures we only consider the current, in-scope subtree for accurate reconciliation.
-3. The use of `--download-only --cleanup-local-files`. In this mode, consuming raw /delta can replay online delete/replace churn in a way that causes valid local files to be deleted (e.g., user deletes a folder online, then recreates it via the web). The simulated delta captures the present online state and intentionally ignores those intermediate delete/replace events, so local “keep” semantics are preserved.
-4. The use of 'Shared Folders'. Calling `/delta` on a shared folder path often targets the owner’s entire drive, not just the shared subtree you see. With sync_list, this mismatch can mean nothing appears to match (paths are rooted from the owner’s drive, not your shared mount point). The client therefore walks the shared folder itself, normalises paths, and constructs a simulated delta that reflects exactly what’s shared with you.
-
-**Why this is slower:** A simulated delta requires walking the online tree (and, for large or deeply nested shares, that’s work). The trade-off is deliberate: safety and correctness over speed.
+By default, the client uses Microsoft Graph’s `/delta` to retrieve changes efficiently. In a few situations, however, using `/delta` would be wrong or unsafe for your intent. In those cases the client generates a delta by walking the relevant online subtree and synthesising the current state before reconciling it locally. This is intentionally slower but correct.

+### When the client deliberately generates a delta
+* Some national cloud deployments where a needed delta endpoint/feature isn’t available. Capabilities differ by resource and cloud; when a required delta isn’t available, we walk the tree and synthesise the change set.
+* The use of `--single-directory` scope. A naïve drive-level /delta can include changes outside your intended scope. Generating a delta ensures only the in-scope subtree is considered.
+* The use of `--download-only --cleanup-local-files`. Raw /delta may replay online delete/replace churn that would remove valid local files you intend to keep. Generated delta captures the current online state and intentionally ignores those intermediate events to protect local data.
+* The use of 'Shared Folders'. Calling `/delta` on a shared path can be rooted at the owner’s drive, so your filters may not match what you see as “the shared folder”. Generated delta walks the shared subtree and normalises paths so the queue reflects what’s truly shared with you.
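
The four conditions listed above amount to a simple decision rule, sketched below. The `SyncOptions` fields and the `generated_delta_reasons` function are hypothetical names invented for this illustration; they mirror the listed conditions and are not the client's real configuration structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyncOptions:
    """Hypothetical view of the settings that influence the choice."""
    delta_unavailable_in_cloud: bool = False   # national cloud lacks a needed delta feature
    single_directory: bool = False             # --single-directory
    download_only: bool = False                # --download-only
    cleanup_local_files: bool = False          # --cleanup-local-files
    shared_folder: bool = False                # syncing a shared folder


def generated_delta_reasons(opts: SyncOptions) -> List[str]:
    """Return the reasons a generated delta would be used; empty means use /delta."""
    reasons: List[str] = []
    if opts.delta_unavailable_in_cloud:
        reasons.append("delta not available in this national cloud")
    if opts.single_directory:
        reasons.append("--single-directory scope")
    if opts.download_only and opts.cleanup_local_files:
        reasons.append("--download-only --cleanup-local-files")
    if opts.shared_folder:
        reasons.append("shared folder")
    return reasons


default_reasons = generated_delta_reasons(SyncOptions())
scoped_reasons = generated_delta_reasons(SyncOptions(single_directory=True))
```

Note that in this sketch `--download-only` on its own does not trigger a generated delta; only the combination with `--cleanup-local-files` does, matching the list above.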

 ## File conflict handling - default operational modes