diff --git a/.github/actions/build-doc/Dockerfile b/.github/actions/build-doc/Dockerfile
index b3396d2..f241900 100644
--- a/.github/actions/build-doc/Dockerfile
+++ b/.github/actions/build-doc/Dockerfile
@@ -8,6 +8,8 @@ RUN pip install sphinxcontrib-plantuml==0.30
RUN pip install breathe==4.35.0
+RUN pip install myst-parser==3.0.1
+
COPY download_releases.py /usr/local/bin
COPY build.sh /usr/local/bin/build.sh
diff --git a/conf.py b/conf.py
index 295916a..edab58c 100644
--- a/conf.py
+++ b/conf.py
@@ -17,6 +17,7 @@
'sphinx_rtd_theme',
'sphinxcontrib.plantuml',
'breathe',
+ 'myst_parser'
]
html_theme = "sphinx_rtd_theme"
diff --git a/manual/images/preoptimized.svg b/manual/images/preoptimized.svg
new file mode 100644
index 0000000..170c4bc
--- /dev/null
+++ b/manual/images/preoptimized.svg
@@ -0,0 +1,644 @@
+
+
+
+
diff --git a/manual/index.rst b/manual/index.rst
index 7c7487f..6da116f 100644
--- a/manual/index.rst
+++ b/manual/index.rst
@@ -12,3 +12,4 @@ SyNAP Manual
framework_api.rst
npu_operators.rst
java.rst
+ test.md
diff --git a/manual/inference.md b/manual/inference.md
new file mode 100644
index 0000000..bf85537
--- /dev/null
+++ b/manual/inference.md
@@ -0,0 +1,148 @@
+# Inference
+
+## Introduction
+
+1. The easiest way to get started is using the CLI commands.
+2. For application development, a C++ library API is also available (see the SyNAP Framework API).
+
+The simplest way to start experimenting with *SyNAP* is to use the sample precompiled models and applications that come preinstalled on the board.
+
+> **Important**: On Android the sample models can be found in `/vendor/firmware/models/` while on Yocto Linux they are in `/usr/share/synap/models/`. In this document we will refer to this directory as `$MODELS`.
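+
+For example, on Yocto Linux this variable can be set for the current shell session as follows (on Android, point it to `/vendor/firmware/models/` instead):
+
+```sh
+# Convenience variable used by the examples in this section (Yocto Linux path)
+$ export MODELS=/usr/share/synap/models
+```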
+
+The models are organized in broad categories according to the type of data they take as input and the information they generate as output. Within each category, models are organized by topic (for example "imagenet"), and for each topic a set of models and sample input data is provided.
+
+For each category, a corresponding command-line test application is provided.
+
+| **Category** | **Input** | **Output** | **Test App** |
+|-----------------------|-----------|--------------------------------------------------|---------------------|
+| image_classification | image | probabilities (one per class) | synap_cli_ic |
+| object_detection | image | detections (bound.box+class+probability) | synap_cli_od |
+| image_processing | image | image | synap_cli_ip |
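+
+For example, the preinstalled model tree can be browsed directly on the board; the listing below is a sketch based on the categories and paths used in the examples of this section (the exact set of topics and models may vary between releases):
+
+```sh
+$ ls $MODELS
+image_classification  image_processing  object_detection  ...
+$ ls $MODELS/image_classification/imagenet
+model  sample
+```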
+
+In addition to the category-specific applications listed above, `synap_cli` can be used to execute models of all categories. Its purpose is not to provide high-level outputs but to measure inference timings. It is the only sample application that can be used with models requiring secure inputs or outputs.
+
+### `synap_cli_ic` application
+
+This command-line application makes it easy to execute *image_classification* models.
+
+It takes as input:
+- the converted synap model (*.synap* extension)
+- one or more images (*jpeg* or *png* format)
+
+It generates as output:
+- the top 5 most probable classes for each input image provided
+
+> **Note**: The jpeg/png input image(s) are resized in software to the size of the network input tensor. This resizing is not included in the classification time displayed.
+
+Example:
+```sh
+$ cd $MODELS/image_classification/imagenet/model/mobilenet_v2_1.0_224_quant
+$ synap_cli_ic -m model.synap ../../sample/goldfish_224x224.jpg
+Loading network: model.synap
+Input image: ../../sample/goldfish_224x224.jpg
+Classification time: 3.00 ms
+Class Confidence Description
+ 1 18.99 goldfish, Carassius auratus
+ 112 9.30 conch
+ 927 8.70 trifle
+ 29 8.21 axolotl, mud puppy, Ambystoma mexicanum
+ 122 7.71 American lobster, Northern lobster, Maine lobster, Homarus americanus
+```
+
+### `synap_cli_od` application
+
+This command-line application makes it easy to execute *object_detection* models.
+
+It takes as input:
+- the converted synap model (*.synap* extension)
+- optionally, the confidence threshold for detected objects
+- one or more images (*jpeg* or *png* format)
+
+It generates as output:
+- the list of objects detected in each input image, with the following information for each detection:
+ - bounding box
+ - class index
+ - confidence
+
+> **Note**: The jpeg/png input image(s) are resized in software to the size of the network input tensor.
+
+Example:
+```sh
+$ cd $MODELS/object_detection/people/model/mobilenet224_full1/
+$ synap_cli_od -m model.synap ../../sample/sample001_640x480.jpg
+Input image: ../../sample/sample001_640x480.jpg (w = 640, h = 480, c = 3)
+Detection time: 26.94 ms
+# Score Class Position Size Description
+0 0.95 0 94,193 62,143 person
+```
+
+> **Important**: The output of object detection models is not standardized; many different formats exist. The output format used has to be specified when the model is converted, see `model_conversion_tutorial`. If this information is missing or the format is unknown, `synap_cli_od` doesn't know how to interpret the result and fails with the error message *"Failed to initialize detector"*.
+
+### `synap_cli_ip` application
+
+This command-line application executes *image_processing* models. The most common case is the execution of super-resolution models, which take a low-resolution image as input and generate a higher-resolution image as output.
+
+It takes as input:
+- the converted synap model (*.synap* extension)
+- optionally, the region of interest in the image (if supported by the model)
+- one or more raw images with one of the following extensions: *nv12*, *nv21*, *rgb*, *bgr*, *bgra*, *gray* or *bin*
+
+It generates as output:
+- a file containing the processed image for each input file
+
+  The output file is called `outimage<i>_<w>x<h>.<ext>`, where `<i>` is the index of the corresponding input file, `<w>` and `<h>` are the dimensions of the image, and `<ext>` depends on the type of the output image, for example `nv12` or `rgb`. The output files are created in the current directory; this can be changed with the `--out-dir` option.
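+
+For instance, two input files processed by a model with a 3840x2160 output (as in the super-resolution example below) would produce files named like this:
+
+```sh
+$ ls
+outimage0_3840x2160.nv12  outimage1_3840x2160.nv12
+```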
+
+> **Note**: The input image(s) are automatically resized to the size of the network input tensor. This is not supported for `nv12`: if the network takes an `nv12` image as input, the file provided must have the same format, and its *WxH* dimensions must match those of the network input tensor.
+
+> **Note**: Any `png` or `jpeg` image can be converted to `nv12` and rescaled to the required size using the `image_to_raw` command available in the *SyNAP* `toolkit` (for more info see `using-docker-label`). In the same way, the generated raw `nv12` or `rgb` images can be converted to `png` or `jpeg` format using the `image_from_raw` command.
+
+Example:
+```sh
+$ cd $MODELS/image_processing/super_resolution/model/sr_qdeo_y_uv_1920x1080_3840x2160
+$ synap_cli_ip -m model.synap ../../sample/ref_1920x1080.nv12
+Input buffer: input_0 size: 1036800
+Input buffer: input_1 size: 2073600
+Output buffer: output_13 size: 4147200
+Output buffer: output_14 size: 8294400
+
+Input image: ../../sample/ref_1920x1080.nv12
+Inference time: 30.91 ms
+Writing output to file: outimage0_3840x2160.nv12
+```
+
+### `synap_cli_ic2` application
+
+This application executes two models in sequence: the input image is fed to the first model, and its output is then fed to the second model, which performs classification as in `synap_cli_ic`. It provides an easy way to experiment with 2-stage inference, where for example the first model is a *preprocessing* model for downscaling and/or format conversion and the second is an *image_classification* model.
+
+It takes as input:
+- the converted synap *preprocessing* model (*.synap* extension)
+- the converted synap *classification* model (*.synap* extension)
+- one or more images (*jpeg* or *png* format)
+
+It generates as output:
+- the top 5 most probable classes for each input image provided
+
+> **Note**: The shape of the output tensor of the first model must match that of the input of the second model.
+
+Example:
+```sh
+$ pp=$MODELS/image_processing/preprocess/model/convert_nv12@1920x1080_rgb@224x224
+$ cd $MODELS/image_classification/imagenet/model/mobilenet_v2_1.0_224_quant
+$ synap_cli_ic2 -m $pp/model.synap -m2 model.synap ../../sample/goldfish_1920x1080.nv12
+
+Inference time: 4.34 ms
+Class Confidence Description
+ 1 19.48 goldfish, Carassius auratus
+ 122 10.68 American lobster, Northern lobster, Maine lobster, Homarus americanus
+ 927 9.69 trifle
+ 124 9.69 crayfish, crawfish, crawdad, crawdaddy
+ 314 9.10 cockroach, roach
+```
+
+The classification output is very close to what we get with `synap_cli_ic`; the minor difference is due to the image having been rescaled from NV12. The higher overall inference time is due to the processing required to rescale and convert the 1920x1080 input image.
+
+### `synap_cli` application
+
+This command-line application can be used to run models of all categories. The purpose of `synap_cli` is not to show inference results but to benchmark network execution times, so it provides additional options to run inference multiple times and collect statistics.
+
+An additional feature is that `synap_cli` can automatically generate input images with random content. This
\ No newline at end of file
diff --git a/manual/introduction.md b/manual/introduction.md
new file mode 100644
index 0000000..0edcbc0
--- /dev/null
+++ b/manual/introduction.md
@@ -0,0 +1,61 @@
+Introduction
+============
+
+SyNAP is a software tool that optimizes neural network models for on-device inference by targeting *NPU* or *GPU* hardware accelerators in [Synaptics Astra Embedded Processors](https://www.synaptics.com/products/embedded-processors). To do this, it takes models in their original representation (e.g., Tensorflow Lite, PyTorch, or ONNX) and compiles them to a binary network graph `.synap` format specific to the target hardware, ready for inference.
+
+Optimizing models for NPU
+-------------------------
+
+Optimization of models for embedded applications using ahead-of-time compilation can usually be done with a [single command](optimizing_models.md). Optimization options (e.g. [mixed quantization](tutorials/model_import), [heterogeneous inference](heterogeneous_inference)) can also be passed at compile time using a [YAML metafile](conversion-metafile), and the model can be signed and encrypted to support Synaptics SyKURE™ secure inference technology.
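+
+As a sketch, the conversion command looks like this (the model file, target, and output directory are the ones used in the Optimizing Models chapter):
+
+```sh
+$ synap convert --model mobilenet_v1_quant.tflite --target VS680 --out-dir mnv1
+```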
+
+
+
+> [!NOTE]
+> While optimal for the target hardware, a pre-optimized model is target specific and will fail to execute on different hardware.
+
+Running inference
+-----------------
+
+There are a number of ways you can run [inference](inference.md) using compiled `.synap` models on Synaptics Astra hardware:
+
+- Image classification, object detection, and image processing using `synap_cli` commands.
+- Gstreamer plugin and Python examples for streaming media (e.g., webcam object detection).
+- Embedded applications developed in C++ or Python can use the [SyNAP Framework API](./framework_api.rst).
+
+> [!IMPORTANT]
+> The simplest way to start experimenting with *SyNAP* is to use the sample precompiled models and applications that come preinstalled on the Synaptics Astra board.
+
+JIT compilation
+---------------
+
+For portable apps (e.g., targeting Android) you might consider the [JIT compilation](jit_compilation.md) approach instead. This approach uses a Tensorflow Lite external delegate to run inference using the original `.tflite` model directly.
+
+This offers the greatest hardware portability, but it has a few disadvantages. Any hardware-specific optimizations must be done in the TensorFlow training or TFLite model export stages, which is much more involved than post-training quantization with SyNAP. Additionally, initialization can take a few seconds on the first inference, and secure media paths are not available.
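+
+As a quick illustration, the JIT path can be exercised on an Android board with the preinstalled `benchmark_model` tool and the `timvx` external delegate (commands taken from the JIT Compilation chapter; the model file is just an example):
+
+```sh
+$ adb push mobilenet_v1_0.25_224_quant.tflite /data/local/tmp
+$ adb shell benchmark_model --graph=/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite --external_delegate_path=libvx_delegate.so
+```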
+
+Model Profiling & Benchmarks
+----------------------------
+
+SyNAP provides [analysis tools](sysfs-inference-counter) in order to identify bottlenecks and optimize models. These include:
+
+- Overall model inference timing
+- NPU runtime statistics (e.g., overall layer and I/O buffer utilization)
+- Model profiling (e.g., per-layer operator type, execution time, memory usage)
+
+You can also find a [comprehensive list of reference models and benchmarks](benchmark).
+
+NPU Hardware
+------------
+
+SyNAP aims to make best use of supported [neural network operators](npu_operators) in order to accelerate on-device inference using the available NPU or GPU hardware. The NPUs themselves consist of several distinct types of functional unit:
+
+- **Convolutional Core**: Optimized to only execute convolutions (int8, int16, float16).
+- **Tensor Processor**: Optimized to execute highly parallel operations (int8, int16, float16).
+- **Parallel Processing Unit**: 128-bit SIMD execution unit (slower, but more flexible).
+- **Internal RAM**: Used to cache data and weights.
+
+
+| Chip | Neural Network Core | Tensor Processor | Parallel Processing Unit |
+|--------------|---------------------|--------------------|--------------------------|
+| VS640, SL1640| 4 | 2 Full + 4 Lite | 1 |
+| VS680, SL1680| 22 | 8 Full | 1 |
+
diff --git a/manual/java.md b/manual/java.md
new file mode 100644
index 0000000..c2d155b
--- /dev/null
+++ b/manual/java.md
@@ -0,0 +1,103 @@
+# Direct Access in Android Applications
+
+On Android, in addition to NNAPI, SyNAP can be accessed directly by applications. The main benefits of direct access are zero-copy input/output and the execution of optimized models compiled ahead of time with the SyNAP toolkit.
+
+Access to SyNAP can be performed via custom JNI C++ code using the `synapnb` library. The library can be used as usual; the only constraint is to use the SyNAP allocator, which can be obtained with `synap_allocator()`.
+
+Another option is to use custom JNI C code with the `synap_device` library. In this case, there are no constraints. The library allows creating new I/O buffers with the function `synap_allocate_io_buffer`. It is also possible to use existing DMABUF handles (obtained, for instance, from gralloc) with `synap_create_io_buffer`. The DMABUF can be accessed with the standard Linux DMABUF APIs (i.e., `mmap`/`munmap`/`ioctl`).
+
+SyNAP provides a sample JNI library that shows how to use the `synap_device` library in a Java application. The code is located in `java` and can be included in an existing Android application by adding the following lines to the `settings.gradle` of the application:
+
+```groovy
+include ':synap'
+project(':synap').projectDir = file("[absolute path to synap]/java")
+```
+
+The code can then be used as follows:
+
+```java
+package com.synaptics.synap;
+
+public class InferenceEngine {
+
+ /**
+ * Perform inference using the model passed in data
+ *
+ * @param model EBG model
+ * @param inputs arrays containing model input data, one byte array per network input,
+ * of the size expected by the network
+ * @param outputs arrays where to store output of the network, one byte array per network
+ * output, of the size expected by the network
+ */
+ public static void infer(byte[] model, byte[][] inputs, byte[][] outputs) {
+
+ Synap synap = Synap.getInstance();
+
+ // load the network
+ Network network = synap.createNetwork(model);
+
+ // create input buffers and attach them to the network
+ IoBuffer[] inputBuffers = new IoBuffer[inputs.length];
+ Attachment[] inputAttachments = new Attachment[inputs.length];
+
+ for (int i = 0; i < inputs.length; i++) {
+ // create the input buffer of the desired length
+ inputBuffers[i] = synap.createIoBuffer(inputs[i].length);
+
+ // attach the buffer to the network (make sure you keep a reference to the
+            // attachment to prevent it from being garbage collected and destroyed)
+ inputAttachments[i] = network.attachIoBuffer(inputBuffers[i]);
+
+ // set the buffer as the i-th input of the network
+ inputAttachments[i].useAsInput(i);
+
+ // copy the input data to the buffer
+ inputBuffers[i].copyFromBuffer(inputs[i], 0, 0, inputs[i].length);
+ }
+
+ // create the output buffers and attach them to the network
+ IoBuffer[] outputBuffers = new IoBuffer[outputs.length];
+        Attachment[] outputAttachments = new Attachment[outputs.length];
+
+ for (int i = 0; i < outputs.length; i++) {
+ // create the output buffer of the desired length
+ outputBuffers[i] = synap.createIoBuffer(outputs[i].length);
+
+ // attach the buffer to the network (make sure you keep a reference to the
+            // attachment to prevent it from being garbage collected and destroyed)
+ outputAttachments[i] = network.attachIoBuffer(outputBuffers[i]);
+
+ // set the buffer as the i-th output of the network
+ outputAttachments[i].useAsOutput(i);
+ }
+
+ // run the network
+ network.run();
+
+ // copy the result data to the output buffers
+ for (int i = 0; i < outputs.length; i++) {
+ outputBuffers[i].copyToBuffer(outputs[i], 0, 0, outputs[i].length);
+ }
+
+ // release resources (it will be done automatically when the objects are garbage
+ // collected but this may take some time so it is better to release them explicitly
+ // as soon as possible)
+
+ network.release(); // this will automatically release the attachments
+
+ for (int i = 0 ; i < inputs.length; i++) {
+ inputBuffers[i].release();
+ }
+
+ for (int i = 0 ; i < outputs.length; i++) {
+ outputBuffers[i].release();
+ }
+
+ }
+
+}
+```
+
+> **Note**:
+>
+> To simplify application development, by default VSSDK allows untrusted applications (such as applications sideloaded or downloaded from the Google Play Store) to use the SyNAP API. Since the API uses limited hardware resources, this can lead to situations in which a third-party application interferes with platform processes. To restrict access to SyNAP to platform applications only, remove the file `vendor/vsi/sepolicy/synap_device/untrusted_app.te`.
\ No newline at end of file
diff --git a/manual/jit_compilation.md b/manual/jit_compilation.md
new file mode 100644
index 0000000..8c58ce9
--- /dev/null
+++ b/manual/jit_compilation.md
@@ -0,0 +1,230 @@
+# JIT Compilation
+
+## Introduction
+
+Just-in-time compilation enables the execution of TensorFlow Lite models directly. For applications that require portability (e.g., must be able to run on an Astra embedded board or an Android phone), the JIT compilation approach offers flexibility at the cost of performance.
+
+For embedded applications, it is recommended you use ahead-of-time compilation instead.
+
+> **Note:**
+> JIT compilation is flexible, but initialization time can take a few seconds, and additional optimization and secure media paths are not available.
+
+## Online Inference with NNAPI
+
+
+
+When a model is loaded and executed via NNAPI, it is automatically converted to the internal representation suitable for execution on the NPU. This conversion doesn't take place when the model is loaded but when the first inference is executed. This is because the size of the input(s) is needed to perform the conversion, and with some models, this information is available only at inference time. If the input size is specified in the model, then the provided input(s) must match this size. In any case, it is not possible to change the size of the input(s) after the first inference.
+
+The model compilation has been heavily optimized, but even so it can take from several milliseconds up to a few seconds for typical models, so it is suggested to execute an inference once just after the model has been loaded and prepared. One of the techniques used to speed up model compilation is caching: some results of the computations performed to compile a model are cached in a file so that they don't have to be executed again the next time the same model is compiled.
+
+On Android, the cache file is saved by default to `/data/vendor/synap/nnhal.cache` and will contain up to 10,000 entries, which corresponds to a good setting for NNAPI utilization on an average system. The cache path and size can be changed by setting the properties `vendor.SYNAP_CACHE_PATH` and `vendor.SYNAP_CACHE_CAPACITY`. Setting the capacity to 0 will disable the cache. An additional possibility to speed up model compilation is to use the NNAPI cache, see [nnapi-caching](#nnapi-caching).
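+
+For example, the cache capacity can be reduced or the cache disabled entirely by setting the corresponding property (the value 1000 below is arbitrary):
+
+```
+$ adb shell setprop vendor.SYNAP_CACHE_CAPACITY 1000
+$ adb shell setprop vendor.SYNAP_CACHE_CAPACITY 0   # 0 disables the cache
+```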
+
+On Yocto Linux, there is no NNAPI cache, but there are still smaller per-process cache files (`synap-cache.*`) in the `/tmp/` directory.
+
+## Benchmarking Models with NNAPI
+
+It is possible to benchmark the execution of a model with online conversion using the standard Android NNAPI tool `android_arm_benchmark_model` from [TensorFlow Performance Measurement](https://www.tensorflow.org/lite/performance/measurement).
+
+A custom version of this tool optimized for SyNAP platforms called `benchmark_model` is already preinstalled on the board in `/vendor/bin`.
+
+Benchmarking a model is quite simple:
+
+1. Download the tflite model to be benchmarked, for example:
+ ```
+ https://storage.googleapis.com/download.tensorflow.org/models/mobilenet_v1_2018_08_02/mobilenet_v1_0.25_224_quant.tgz
+ ```
+2. Copy the model to the board, for example in the `/data/local/tmp` directory:
+ ```
+ $ adb push mobilenet_v1_0.25_224_quant.tflite /data/local/tmp
+ ```
+3. Benchmark the model execution on the NPU with NNAPI (Android only):
+ ```
+ $ adb shell benchmark_model --graph=/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite --use_nnapi=true --nnapi_accelerator_name=synap-npu
+
+ INFO: STARTING!
+ INFO: Tensorflow Version : 2.15.0
+ INFO: Log parameter values verbosely: [0]
+ INFO: Graph: [/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite]
+ INFO: Use NNAPI: [1]
+ INFO: NNAPI accelerator name: [synap-npu]
+ INFO: NNAPI accelerators available: [synap-npu,nnapi-reference]
+ INFO: Loaded model /data/local/tmp/mobilenet_v1_0.25_224_quant.tflite
+ INFO: Initialized TensorFlow Lite runtime.
+ INFO: Created TensorFlow Lite delegate for NNAPI.
+ INFO: NNAPI delegate created.
+ WARNING: NNAPI SL driver did not implement SL_ANeuralNetworksDiagnostic_registerCallbacks!
+ VERBOSE: Replacing 31 out of 31 node(s) with delegate (TfLiteNnapiDelegate) node, yielding 1 partitions for the whole graph.
+ INFO: Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
+ INFO: The input model file size (MB): 0.497264
+ INFO: Initialized session in 66.002ms.
+ INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
+ INFO: count=1 curr=637079
+
+ INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
+ INFO: count=520 first=2531 curr=2793 min=1171 max=9925 avg=1885.74 std=870
+
+ INFO: Inference timings in us: Init: 66002, First inference: 637079, Warmup (avg): 637079, Inference (avg): 1885.74
+ INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
+ INFO: Memory footprint delta from the start of the tool (MB): init=7.40234 overall=7.83203
+ ```
+
+> **Important:**
+> NNAPI is the standard way to perform online inference on the NPU in Android, but it isn't the most efficient or the most flexible one. The suggested way to perform online inference on Synaptics platforms is via the `timvx` delegate. For more information see section [online_benchmarking_timvx](#online_benchmarking_timvx).
+
+If for any reason some of the layers in the model cannot be executed on the NPU, they will automatically fall back to CPU execution. This can occur, for example, with specific layer types, options, or data types not supported by NNAPI or SyNAP. In this case, the network graph will be partitioned into multiple delegate kernels, as indicated in the output messages from `benchmark_model`, for example:
+```
+$ adb shell benchmark_model ...
+...
+INFO: Initialized TensorFlow Lite runtime.
+INFO: Created TensorFlow Lite delegate for NNAPI.
+Explicitly applied NNAPI delegate, and the model graph will be partially executed by the delegate w/ 2 delegate kernels.
+...
+```
+
+Executing part of the network on the CPU will increase inference times, sometimes considerably. To better understand which layers are problematic and where the time is spent, it can be useful to run `benchmark_model` with the option `--enable_op_profiling=true`. This option generates a detailed report of the layers executed on the CPU and the time spent executing them. For example, in the run below, the network contains a `RESIZE_NEAREST_NEIGHBOR` layer which falls back to CPU execution:
+```
+$ adb shell benchmark_model ... --enable_op_profiling=true
+...
+Operator-wise Profiling Info for Regular Benchmark Runs:
+============================== Run Order ==============================
+ [node type] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
+ TfLiteNnapiDelegate 3.826 4.011 62.037% 62.037% 0.000 1 []:64
+RESIZE_NEAREST_NEIGHBOR 0.052 0.058 0.899% 62.936% 0.000 1 []:38
+ TfLiteNnapiDelegate 2.244 2.396 37.064% 100.000% 0.000 1 []:65
+```
+
+Execution of the model (or part of it) on the NPU can also be confirmed by looking at the SyNAP `inference_count` file in `sysfs` (see section [sysfs-inference-counter](#sysfs-inference-counter)).
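+
+For example, the counter (Android path shown; the same file is used in the examples below) can be read before and after a benchmark run, and it only increases when the NPU is actually used:
+
+```
+$ adb shell cat /sys/class/misc/synap/device/misc/synap/statistics/inference_count
+```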
+
+For an even more in-depth analysis, it is possible to obtain detailed layer-by-layer inference timing by setting the profiling property before running `benchmark_model`:
+```
+$ adb shell setprop vendor.NNAPI_SYNAP_PROFILE 1
+$ adb shell benchmark_model --graph=/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite --use_nnapi=true --nnapi_accelerator_name=synap-npu
+```
+On Android, the profiling information will be available in `/sys/class/misc/synap/device/misc/synap/statistics/network_profile` while `benchmark_model` is running. On Yocto Linux, the same information is in `/sys/class/misc/synap/statistics/network_profile`.
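+
+For example, on Android the profile can be dumped from a separate shell while `benchmark_model` is running:
+
+```
+$ adb shell cat /sys/class/misc/synap/device/misc/synap/statistics/network_profile
+```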
+
+> **Note:**
+> When `vendor.NNAPI_SYNAP_PROFILE` is enabled, the network is executed step-by-step, so the overall inference time becomes meaningless and should be ignored.
+
+## NNAPI Compilation Caching
+
+NNAPI compilation caching provides even greater speedup than the default SyNAP cache by caching entire compiled models, but it requires some support from the application (see [Android Neural Networks API Compilation Caching](https://source.android.com/devices/neural-networks/compilation-caching)) and requires more disk space.
+
+NNAPI caching support must be enabled by setting the corresponding Android property:
+```
+$ adb shell setprop vendor.npu.cache.model 1
+```
+
+As explained in the official Android documentation, for NNAPI compilation cache to work, the user has to provide a directory to store the cached model and a unique key for each model. The unique key is normally determined by computing some hash on the entire model.
+
+This can be tested using `benchmark_model`:
+```
+$ adb shell benchmark_model --graph=/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite --use_nnapi=true --nnapi_accelerator_name=synap-npu --delegate_serialize_dir=/data/local/tmp/nnapiacache --delegate_serialize_token='`md5sum -b /data/local/tmp/mobilenet_v1_0.25_224_quant.tflite`'
+```
+
+During the first execution of the above command, NNAPI will compile the model and add it to the cache:
+```
+INFO: Initialized TensorFlow Lite runtime.
+INFO: Created TensorFlow Lite delegate for NNAPI.
+NNAPI delegate created.
+ERROR: File /data/local/tmp/nnapiacache/a67461dd306cfd2ff0761cb21dedffe2_6183748634035649777.bin couldn't be opened for reading: No such file or directory
+INFO: Replacing 31 node(s) with delegate (TfLiteNnapiDelegate) node, yielding 1 partitions.
+...
+Inference timings in us: Init: 34075, First inference: 1599062, Warmup (avg): 1.59906e+06, Inference (avg): 1380.86
+```
+
+In all the following executions, NNAPI will load the compiled model directly from the cache, so the first inference will be faster:
+```
+INFO: Initialized TensorFlow Lite runtime.
+INFO: Created TensorFlow Lite delegate for NNAPI.
+NNAPI delegate created.
+INFO: Replacing 31 node(s) with delegate (TfLiteNnapiDelegate) node, yielding 1 partitions.
+...
+Inference timings in us: Init: 21330, First inference: 90853, Warmup (avg): 1734.13, Inference (avg): 1374.59
+```
+
+## Disabling NPU Usage from NNAPI
+
+It is possible to make the NPU inaccessible from NNAPI by setting the property `vendor.NNAPI_SYNAP_DISABLE` to 1. In this case, any attempt to run a model via NNAPI will always fall back to CPU.
+
+NNAPI execution with NPU enabled:
+```
+$ adb shell setprop vendor.NNAPI_SYNAP_DISABLE 0
+$ adb shell 'echo > /sys/class/misc/synap/device/misc/synap/statistics/inference_count'
+$ adb shell benchmark_model --graph=/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite --use_nnapi=true --nnapi_accelerator_name=synap-npu
+Inference timings in us: Init: 24699, First inference: 1474732, Warmup (avg): 1.47473e+06, Inference (avg): 1674.03
+$ adb shell cat /sys/class/misc/synap/device/misc/synap/statistics/inference_count
+1004
+```
+
+NNAPI execution with NPU disabled:
+```
+$ adb shell setprop vendor.NNAPI_SYNAP_DISABLE 1
+$ adb shell 'echo > /sys/class/misc/synap/device/misc/synap/statistics/inference_count'
+$ adb shell benchmark_model --graph=/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite --use_nnapi=true --nnapi_accelerator_name=synap-npu
+Inference timings in us: Init: 7205, First inference: 15693, Warmup (avg): 14598.5, Inference (avg): 14640.3
+$ adb shell cat /sys/class/misc/synap/device/misc/synap/statistics/inference_count
+0
+```
+
+> **Note:**
+> It will still be possible to perform online inference on the NPU using the *timvx* tflite delegate.
+
+## Online Inference with *TimVx* Delegate
+
+NNAPI is not the only way to perform online inference on the NPU. It is possible to run a model without using NNAPI by loading it with the standard TensorFlow Lite API and then using the *timvx* tflite delegate. This delegate has been optimized to call the SyNAP API directly, so it can most often provide better performance and fewer limitations than standard NNAPI.
+
+Another advantage of the `timvx` delegate is that it is also available on Yocto Linux platforms, which don't support NNAPI. The only limitation of this approach is that, being a delegate for the standard TensorFlow Lite runtime, it doesn't support other model formats such as ONNX.
+
+The *timvx* tflite delegate's internal workflow is similar to that of NNAPI: when a tflite model is loaded, it is automatically converted to the internal representation suitable for execution on the NPU. This conversion doesn't take place when the model is loaded but when the first inference is executed.
+
+## Benchmarking Models with *TimVx* Delegate
+
+The Synaptics `benchmark_model` tool provides built-in support for both the standard `nnapi` delegate and the optimized `timvx` delegate.
+
+Benchmarking a model with the `timvx` delegate is as simple as with NNAPI:
+
+1. Download the tflite model to be benchmarked, for example:
+ ```
+ https://storage.googleapis.com/download.tensorflow.org/models/mobilenet_v1_2018_08_02/mobilenet_v1_0.25_224_quant.tgz
+ ```
+2. Copy the model to the board, for example in the `/data/local/tmp` directory:
+ ```
+ $ adb push mobilenet_v1_0.25_224_quant.tflite /data/local/tmp
+ ```
+3. Benchmark the model execution on the NPU with the `timvx` delegate (both Android and Linux):
+ ```
+ $ adb shell benchmark_model --graph=/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite --external_delegate_path=libvx_delegate.so
+
+ INFO: STARTING!
+ INFO: Tensorflow Version : 2.15.0
+ INFO: Log parameter values verbosely: [0]
+ INFO: Graph: [/data/local/tmp/mobilenet_v1_0.25_224_quant.tflite]
+ INFO: External delegate path: [/vendor/lib64/libvx_delegate.so]
+ INFO: Loaded model /data/local/tmp/mobilenet_v1_0.25_224_quant.tflite
+ INFO: Initialized TensorFlow Lite runtime.
+ INFO: Vx delegate: allowed_cache_mode set to 0.
+ INFO: Vx delegate: device num set to 0.
+ INFO: Vx delegate: allowed_builtin_code set to 0.
+ INFO: Vx delegate: error_during_init set to 0.
+ INFO: Vx delegate: error_during_prepare set to 0.
+ INFO: Vx delegate: error_during_invoke set to 0.
+ INFO: EXTERNAL delegate created.
+ VERBOSE: Replacing 31 out of 31 node(s) with delegate (Vx Delegate) node, yielding 1 partitions for the whole graph.
+ INFO: Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
+ INFO: The input model file size (MB): 0.497264
+ INFO: Initialized session in 25.573ms.
+ INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
+ type 54 str SoftmaxAxis0
+ INFO: count=277 first=201009 curr=863 min=811 max=201009 avg=1760.78 std=11997
+
+ INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
+ INFO: count=876 first=1272 curr=1730 min=810 max=6334 avg=1096.48 std=476
+
+ INFO: Inference timings in us: Init: 25573, First inference: 201009, Warmup (avg): 1760.78, Inference (avg): 1096.48
+ INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
+ INFO: Memory footprint delta from the start of the tool (MB): init=15.4688 overall=43.2852
+ ```
+
+Comparing these timings with those in section [online_benchmarking_nnapi](#online_benchmarking_nnapi), we can see that even for this simple model the `timvx` delegate provides better performance than NNAPI (average inference time 1096 us vs. 1885 us).
\ No newline at end of file
diff --git a/manual/npu_operators.md b/manual/npu_operators.md
new file mode 100644
index 0000000..bd70fde
--- /dev/null
+++ b/manual/npu_operators.md
@@ -0,0 +1,500 @@
+# NPU Operators
+
+This section summarizes neural network operators supported by the SyNAP VS6x0/SL16x0 class of NPUs and accompanying software stack. For each operator type, the supported tensor types and execution engines are also documented. Designing networks that maximize the use of operators executed in the NN core will provide the best performance.
+
+
+### Execution Engines
+
+| Acronym | Description |
+|---------|----------------------------------|
+| NN | Neural Network Engine |
+| PPU | Parallel Processing Unit |
+| TP | Tensor Processor |
+
+### Tensor Types
+
+| Acronym | Description |
+|---------|----------------------------------|
+| asym-u8 | asymmetric affine uint8 |
+| asym-i8 | asymmetric affine int8 |
+| pc-sym-i8 | per channel symmetric int8 |
+| fp32 | floating point 32 bits |
+| fp16 | floating point 16 bits |
+| h | half |
+| int16 | int16 |
+| int32 | int32 |
+
+> **Note:**
+> int16 dynamic fixed-point convolution is supported by the NN Engine. Other layers follow the tables: if asym-u8 is not available in the NN column, int16 is also not available.
+
+## Basic Operations
+
+
+| Operator | Input | Kernel | Output | NN | TP | PPU |
+|---------------------|-------------|-------------|-----------|-------------|-------------|-------------|
+| CONV2D | asym-u8 | asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8 | pc-sym-i8 | asym-i8 | ✔ | | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | | ✔ |
+| CONV1D | asym-u8 | asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8 | pc-sym-i8 | asym-i8 | ✔ | | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | | ✔ |
+| DECONVOLUTION | asym-u8 | asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8 | pc-sym-i8 | asym-i8 | ✔ | | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | | ✔ |
+| DECONVOLUTION1D | asym-u8 | asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8 | pc-sym-i8 | asym-i8 | ✔ | | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | | ✔ |
+| GROUPED_CONV2D | asym-u8 | asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8 | pc-sym-i8 | asym-i8 | ✔ | | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | | ✔ |
+| FULLY_CONNECTED | asym-u8 | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | pc-sym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | fp32 | | | ✔ |
+
+
+> **Note:**
+> Convolutions are executed in the NN engine only if they satisfy the following conditions: **stride == 1**, **kernel_size <= 15x15**, **dilation size + kernel size <= 15x15**. If any of these conditions is not satisfied, the convolution will require support from the TP core and will run considerably slower.
+
+## Activation Operations
+
+
+| Operator | Input | Output | NN | TP | PPU |
+|---------------|----------|-----------|---------------|---------------|---------------|
+| ELU | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | ✔ | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| HARD_SIGMOID | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | ✔ | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| SWISH | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| LEAKY_RELU | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| PRELU | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| RELU | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| RELUN | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| RSQRT | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SIGMOID | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| SOFTRELU | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SQRT | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| TANH | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| ABS | asym-u8 | asym-u8 | | ✔ | |
+| | asym-i8 | asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| CLIP | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| EXP | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| LOG | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| NEG | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| MISH | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| SOFTMAX | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| LOG_SOFTMAX | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SQUARE | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SIN | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| LINEAR | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| ERF | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| GELU | asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8 | asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+
+## Elementwise Operations
+
+| Operator | Input | Output | NN | TP | PPU |
+|---------------|----------|-----------|-----------------|-----------------|-----------------|
+| ADD | asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8 | asym-i8 | ✔ | | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SUBTRACT | asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8 | asym-i8 | ✔ | | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| MULTIPLY | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| DIVIDE | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| MAXIMUM | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| MINIMUM | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| POW | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| FLOORDIV | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| MATRIXMUL | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| RELATIONAL_OPS| asym-u8 | bool8 | | | ✔ |
+| | asym-i8 | bool8 | | | ✔ |
+| | fp32 | bool8 | | | ✔ |
+| | fp16 | bool8 | | | ✔ |
+| | bool8 | bool8 | | | ✔ |
+| LOGICAL_OPS | bool8 | bool8 | | | ✔ |
+| LOGICAL_NOT | bool8 | bool8 | | | ✔ |
+| SELECT | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| | bool8 | bool8 | | | ✔ |
+| ADDN | asym-u8 | asym-u8 | | | ✔ |
+| | asym-i8 | asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+
+## Normalization Operations
+
+| Operator | Input | Output | NN | TP | PPU |
+|-------------------|--------|-----------|-------------------|-------------------|-------------------|
+| BATCH_NORM | asym-u8| asym-u8 | | | ✔ |
+| | asym-i8| asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| LRN2 | asym-u8| asym-u8 | | ✔ | |
+| | asym-i8| asym-i8 | | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| L2_NORMALIZE | asym-u8| asym-u8 | | | ✔ |
+| | asym-i8| asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| LAYER_NORM | asym-u8| asym-u8 | | | ✔ |
+| | asym-i8| asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| INSTANCE_NORM | asym-u8| asym-u8 | | | ✔ |
+| | asym-i8| asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| BATCHNORM_SINGLE | asym-u8| asym-u8 | | | ✔ |
+| | asym-i8| asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| MOMENTS | asym-u8| asym-u8 | | | ✔ |
+| | asym-i8| asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| GROUP_NORM | asym-u8| asym-u8 | | | ✔ |
+| | asym-i8| asym-i8 | | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+
+## Reshape Operations
+
+
+| Operator | Input | Output | NN | TP | PPU |
+|--------------------------|-------|---------|--------------------------|--------------------------|--------------------------|
+| EXPAND_BROADCAST | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SLICE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| SPLIT | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| CONCAT | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| STACK | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| UNSTACK | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| RESHAPE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| SQUEEZE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| PERMUTE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| REORG | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| SPACE2DEPTH | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| DEPTH2SPACE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| | bool8 | bool8 | | | |
+| BATCH2SPACE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| SPACE2BATCH | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| PAD | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| REVERSE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| STRIDED_SLICE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| REDUCE | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| ARGMAX | asym-u8| asym-u8 / int16 / int32| | | ✔ |
+| | asym-i8| asym-u8 / int16 / int32| | | ✔ |
+| | fp32 | int32 | | | ✔ |
+| | fp16 | asym-u8 / int16 / int32| | | ✔ |
+| ARGMIN | asym-u8| asym-u8 / int16 / int32| | | ✔ |
+| | asym-i8| asym-u8 / int16 / int32| | | ✔ |
+| | fp32 | int32 | | | ✔ |
+| | fp16 | asym-u8 / int16 / int32| | | ✔ |
+| SHUFFLECHANNEL | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+
+## RNN Operations
+
+
+| Operator | Input | Kernel | Output | NN | TP | PPU |
+|-------------------|--------|----------|-----------|-------------------|-------------------|-------------------|
+| LSTMUNIT_OVXLIB | asym-u8| asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8| pc-sym-i8| asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | ✔ | ✔ |
+| CONV2D_LSTM | asym-u8| asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8| pc-sym-i8| asym-i8 | ✔ | | |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | | ✔ |
+| CONV2D_LSTM_CELL | asym-u8| asym-u8 | asym-u8 | ✔ | | |
+| | asym-i8| pc-sym-i8| asym-i8 | ✔ | | |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | | ✔ |
+| LSTM_OVXLIB | asym-u8| asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8| pc-sym-i8| asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | ✔ | ✔ |
+| GRUCELL_OVXLIB | asym-u8| asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8| pc-sym-i8| asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | ✔ | ✔ |
+| GRU_OVXLIB | asym-u8| asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8| pc-sym-i8| asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | ✔ | ✔ |
+| SVDF | asym-u8| asym-u8 | asym-u8 | | ✔ | ✔ |
+| | asym-i8| pc-sym-i8| asym-i8 | | ✔ | ✔ |
+| | fp32 | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | fp16 | | ✔ | ✔ |
+
+## Pooling Operations
+
+| Operator | Input | Output | NN | TP | PPU |
+|-----------------|--------|--------|-----------------|-----------------|-----------------|
+| POOL | asym-u8| asym-u8| ✔ | ✔ | |
+| | asym-i8| asym-i8| ✔ | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| ROI_POOL | asym-u8| asym-u8| | ✔ | ✔ |
+| | asym-i8| asym-i8| | ✔ | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | ✔ |
+| POOLWITHARGMAX | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| UPSAMPLE | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+
+## Miscellaneous Operations
+
+| Operator | Input| Output| NN | TP | PPU |
+|-------------------|------|-------|-------------------|-------------------|-------------------|
+| PROPOSAL | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| VARIABLE | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| DROPOUT | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| RESIZE | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| DATACONVERT | asym-u8| asym-u8| | ✔ | |
+| | asym-i8| asym-i8| | ✔ | |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | ✔ | |
+| FLOOR | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| EMBEDDING_LOOKUP | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| GATHER | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| GATHER_ND | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SCATTER_ND | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| GATHER_ND_UPDATE | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| TILE | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| ELTWISEMAX | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SIGNAL_FRAME | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| CONCATSHIFT | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| UPSAMPLESCALE | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| ROUND | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| CEIL | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| SEQUENCE_MASK | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| REPEAT | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| ONE_HOT | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
+| CAST | asym-u8| asym-u8| | | ✔ |
+| | asym-i8| asym-i8| | | ✔ |
+| | fp32 | fp32 | | | ✔ |
+| | fp16 | fp16 | | | ✔ |
\ No newline at end of file
diff --git a/manual/optimizing_models.md b/manual/optimizing_models.md
new file mode 100644
index 0000000..53b913b
--- /dev/null
+++ b/manual/optimizing_models.md
@@ -0,0 +1,1354 @@
+Optimizing Models
+=================
+
+
+Model Conversion
+----------------
+
+The ``SyNAP`` toolkit converts a model from its original format to
+an internal representation optimized for the target hardware.
+The conversion tool and utilities can run on Linux, macOS, or Windows hosts inside a *Docker* container.
+Only *Docker* and the ``toolkit`` image are required; no additional dependencies have to be installed.
+
+## Supported formats
+
+* Tensorflow Lite (``.tflite`` extension)
+* ONNX (``.onnx`` extension)
+* TorchScript (``.torchscript``, ``.pt``, ``.pth`` extensions)
+  - TorchScript format only. See {ref}`working-with-pytorch-models-label` for more information.
+* Tensorflow (``.pb`` extension)
+* Caffe (``.prototxt`` extension)
+ - Caffe 1.x only. Caffe2, Caffe-SSD and Caffe-LSTM not supported
+
+
+> **Note:** Support for ``.pb`` and ``.prototxt`` formats is deprecated.
+
+
+Running SyNAP Tools
+-------------------
+
+First you must [install the SyNAP tools](synap_installation.md) via Docker container or pip package.
+
+The toolkit provides a number of tools to convert and manipulate models and images.
+
+Model conversion can be performed using the ``convert`` command.
+It takes as input:
+
+- a network model
+- the target HW for which to convert the model (e.g., VS680 or VS640)
+- the name of the directory where to generate the converted model
+- an optional yaml metafile that can be used to specify customized conversion options (mandatory for ``.pb`` models)
+
+It generates three output files:
+
+- **model.synap**: the converted network model
+- **model_info.txt**: additional information about the generated model for user reference, including:
+  - input/output tensor attributes
+  - subgraph splitting
+  - layer table
+  - operation table
+  - memory usage
+- **quantization_info.txt**: additional quantization information (only generated if the model is quantized using the toolkit)
+
+An additional ``cache`` directory is also generated to speed up future compilations of the same model.
+
+Example:
+
+```sh
+$ synap convert --model mobilenet_v1_quant.tflite --target VS680 --out-dir mnv1
+$ ls mnv1
+model_info.txt model.synap cache
+```
+
+
+
+In the case of ``Caffe`` models the weights are not in the ``.prototxt`` file but are
+stored in a separate file, generally with the ``.caffemodel`` extension. This file has to be provided
+to the converter tool as well. Example:
+
+```sh
+$ synap convert --model mnist.prototxt --weights mnist.caffemodel --target VS680 --out-dir out
+```
+
+> **Important:**
+> The model file and the output directory specified must be inside or below a directory mounted
+> inside the Docker container (see the ``-v`` option in the ``synap`` alias above).
+
+
+
+(conversion-metafile)=
+
+Conversion Metafile
+-------------------
+
+When converting a model, it is possible to provide a yaml metafile to customize the generated model.
+For example, it is possible to specify:
+
+ - the data representation in memory (nhwc or nchw)
+ - model quantization options
+ - output dequantization
+ - input preprocessing options
+ - delegate to be used for inference (npu, gpu, cpu)
+
+Example:
+
+```sh
+$ synap convert --model mobilenet_v1_quant.tflite --meta mobilenet.yaml \
+      --target VS680 --out-dir mnv1
+```
+
+This metafile is mandatory when converting a Tensorflow ``.pb`` model. It can be completely
+omitted when converting a quantized ``.tflite`` model.
+
+The best way to understand the content of a metafile is to first look at an example.
+Below is the metafile for a typical *mobilenet_v1* model, followed by a detailed description of each
+field. Most of the fields are optional; mandatory fields are explicitly marked.
+
+
+```yaml
+ delegate: npu
+
+ data_layout: nhwc
+
+ security:
+ secure: true
+ file: ../security.yaml
+
+ inputs:
+ - name: input
+ shape: [1, 224, 224, 3]
+ means: [128, 128, 128]
+ scale: 128
+ format: rgb
+ security: any
+ preprocess:
+ type: nv21
+ size: [1920, 1080]
+ crop: true
+
+ outputs:
+ - name: MobilenetV1/Predictions/Reshape_1
+ dequantize: false
+ format: confidence_array
+
+ quantization:
+ data_type: uint8
+ scheme: default
+ mode: standard
+ algorithm: standard
+ options:
+ dataset:
+ - ../../sample/*_224x224.jpg
+```
+
+
+
+- ``delegate``
+
+ Select the delegate to use for inference. Available delegates are:
+
+  ``default`` (the default: automatically selects the delegate according to the target HW)
+
+ ``npu``
+
+ ``gpu``
+
+ ``cpu``
+
+  If not specified, the default delegate for the target hardware is used.
+  It is also possible to specify the delegate on a layer-by-layer basis;
+  see section {ref}`heterogeneous_inference`.
+
+- ``data_layout``
+
+ The data layout in memory, allowed values are: ``default``, ``nchw`` and ``nhwc``.
+
+ For Tensorflow and Tensorflow Lite models the default is ``nhwc``. Forcing the converted
+ model to be ``nchw`` might provide some performance advantage when the input data is already
+ in this format since no additional data reorganization is needed.
+
+ For Caffe and ONNX models the default is ``nchw``. In this case it is not possible to force to
+ ``nhwc``.
+
+- ``input_format``
+
+ Format of the input tensors. This is an optional string that will be attached as an attribute
+ to all the network input tensors for which a "format" field has not been specified.
+
+- ``output_format``
+
+  Format of the output tensors. This is an optional string that will be attached as an attribute
+  to all the network output tensors for which a "format" field has not been specified.
+
+- ``security``
+
+ This section contains security configuration for the model.
+ If this section is not present, security is disabled.
+ Security is only supported with the ``npu`` delegate.
+
+ - ``secure``
+
+ If true, enables security for the model.
+ For secure models it is also possible to specify the security policy for each input and output.
+ A secure model is encrypted and signed at conversion time so that its structure and weights are
+ not accessible and its authenticity can be verified. This is done with a set of key and
+ certificate files whose paths are listed in a security file.
+
+ - ``file``
+
+ Path to the security file. This is a ``yaml`` file with the following fields::
+
+     encryption_key: <path to encryption key>
+     signature_key: <path to signature key>
+     model_certificate: <path to model certificate>
+     vendor_certificate: <path to vendor certificate>
+
+ Both relative and absolute paths can be used.
+ Relative paths are considered relative to the location of the security file itself.
+ The same fields can also be specified directly in the model metafile in place of the ``file`` field.
+ For detailed information on the security policies and on how to generate and authenticate a
+ secure model please refer to SyNAP_SyKURE.pdf.
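+
+ For example, the same information can be embedded directly in the metafile instead of
+ referencing a separate security file (a sketch; the key and certificate paths are placeholders):
+
+ .. code-block:: yaml
+
+     security:
+         secure: true
+         encryption_key: keys/model_encryption.key
+         signature_key: keys/model_signature.key
+         model_certificate: certs/model.cert
+         vendor_certificate: certs/vendor.cert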
+
+
+- ``inputs``
+ :sup:`(pb)`
+
+ Must contain one entry for each input of the network. Each entry has the following fields:
+
+ - ``name``
+ :sup:`(pb)`
+
+ Name of the input in the network graph. For ``tflite`` and ``onnx`` models this field is not
+ required but can still be used to specify a different input layer than the default input of the
+ network. This feature makes it possible to convert just a subset of a network without having to
+ manually edit the source model. For ``.pb`` models or when ``name`` is not specified
+ the inputs must be in the same order as they appear in the model.
+ When this field is specified the ``shape`` field is mandatory.
+
+ - ``shape``
+ :sup:`(pb)`
+
+ Shape of the input tensor. This is a list of dimensions, the order is given by the layout
+ of the input tensor in the model (even if a different layout is selected for the compiled model).
+ By convention the first dimension must represent the number of samples *N* (also known as
+ "batch size"); it is ignored in the generated model, which always works with a batch size of 1.
+ When this field is specified the ``name`` field is mandatory.
+
+ - ``means``
+
+ Used to normalize the range of input values.
+ A list of mean values, one for each channel in the corresponding input.
+ If a single value is specified instead of a list, it will be used for all
+ the channels. If not specified a mean of ``0`` is assumed.
+
+ The *i-th* channel of each input is normalized as: ``norm = (in - means[i]) / scale``
+
+ Normalization is necessary to bring the input values into the range used when the model was
+ trained. SyNAP performs this computation on three occasions:
+
+ - to normalize data from *image* quantization files when the network is quantized
+ (note that this doesn't apply to *numpy* quantization files, in this case it is assumed that
+ the numpy files have already been normalized)
+ - to normalize input data at inference time in the NPU when the network is compiled with
+ preprocessing enabled (see the ``preprocess`` option here below)
+ - to normalize input data in SW when the network is compiled *without* preprocessing
+ and input data is assigned using the ``Tensor assign()`` method in the SyNAP library
+
+ Note: when converting an 8-bit pre-quantized model and no ``means`` and ``scale``
+ are specified, they are automatically inferred from the quantization information under
+ the assumption that the input is an 8-bit image.
+ This makes it possible to convert a pre-quantized model without having to explicitly specify the
+ preprocessing information.
+ In this case an unspecified mean and scale are not equivalent to specifying a scale of 1 and a mean of 0.
+ To avoid any ambiguity it's suggested to always specify both means and scale explicitly.
+
+
+ - ``scale``
+
+ Used to normalize the range of input values.
+ The scale is a single value for all the channels in the corresponding input.
+ If not specified a scale of ``1`` is assumed.
+ More details on normalization in the description of the ``means`` field here above.
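+
+ For example, an input trained on 8-bit images normalized to the range [-1, 1] is typically
+ described as in the sketch below (illustrative values, the same used in the example metafile
+ above), so that each input value ``in`` is mapped to approximately ``(in - 128) / 128``:
+
+ .. code-block:: yaml
+
+     inputs:
+       - name: input
+         shape: [1, 224, 224, 3]
+         means: [128, 128, 128]
+         scale: 128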
+
+
+ - ``format``
+
+ Information about the type and organization of the data in the tensor.
+ The content and meaning of this string is custom-defined; however SyNAP Toolkit and
+ SyNAP ``Preprocessor`` recognize by convention an initial format type optionally followed
+ by one or more named attributes:
+
+ ``<type> [<attribute>=<value>]...``
+
+ Recognised types are:
+
+ ``rgb`` (default): 8-bit RGB, RGBA or grayscale image
+
+ ``bgr``: 8-bit BGR, BGRA or grayscale image
+
+ Recognised attributes are:
+
+ ``keep_proportions=1`` (default): preserve aspect-ratio when resizing an image using ``Preprocessor`` or during quantization
+
+ ``keep_proportions=0``: don't preserve aspect-ratio when resizing an image using ``Preprocessor`` or during quantization
+
+ Any additional attribute, if present, is ignored by SyNAP.
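+
+ For example, a BGR input that should be resized without preserving the aspect ratio could be
+ declared as follows (an illustrative sketch):
+
+ .. code-block:: yaml
+
+     inputs:
+       - name: input
+         format: bgr keep_proportions=0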
+
+ - ``preprocess``
+
+ Input preprocessing options for this input tensor. It can contain the following fields:
+
+ - ``type``: format of the input data (e.g. ``rgb``, ``nv12``) see the table below
+
+ - ``size``: size of the input image as a list [H, W]
+
+ - ``crop``: enable runtime cropping of the input image
+
+ The meaning of each field is explained in detail in the preprocessing section here below.
+ Preprocessing is only supported with the ``npu`` delegate.
+
+ - ``security``
+
+ Security policy for this input tensor. This field is only considered for secure models and
+ can have the following values:
+
+ ``any`` (default): the input can be either in secure or non-secure memory
+
+ ``secure``: the input must be in secure memory
+
+ ``non-secure``: the input must be in non-secure memory
+
+
+- ``outputs``
+ :sup:`(pb)`
+
+ Must contain one entry for each output of the network. Each entry has the following fields:
+
+ - ``name``
+ :sup:`(pb)`
+
+ Name of the output in the network graph. For ``tflite`` and ``onnx`` models this field is not
+ required but can still be used to specify a different output layer than the default output of the
+ network. This feature makes it possible to convert just a subset of a network without having to
+ manually edit the source model. For ``.pb`` and ``.onnx`` models or when ``name`` is not specified
+ the outputs must be in the same order as they appear in the model.
+
+ - ``dequantize``
+
+ If enabled, the output of the network is internally dequantized and converted to ``float``.
+ This is more efficient than performing the conversion in software.
+
+ - ``format``
+
+ Information about the type and organization of the data in the tensor.
+ The content and meaning of this string is custom-defined, however SyNAP ``Classifier`` and
+ ``Detector`` postprocessors recognize by convention an initial format type optionally followed
+ by one or more named attributes:
+
+ ``<type> [<attribute>=<value>]...``
+
+ All fields are separated by one or more spaces. No spaces are allowed between the key and the value.
+ Example:
+
+ ``confidence_array class_index_base=0``
+
+ See the ``Classifier`` and ``Detector`` classes for a description of the specific attributes supported.
+
+ - ``security``
+
+ Security policy for this output tensor. This field is only considered for secure models and
+ can have the following values:
+
+ ``secure-if-input-secure`` (default): the output buffer must be in secure memory if at least one input is in secure memory
+
+ ``any``: the output can be either in secure or non-secure memory
+
+
+- ``quantization``
+ :sup:`(q)`
+
+ Quantization options are required when quantizing a model during conversion; they are
+ not needed when importing a model which is already quantized.
+ Quantization is only supported with the ``npu`` delegate.
+
+ - ``data_type``
+
+ Data type used to quantize the network. The same data type is used for both activation data
+ and weights. Available data types are:
+
+ ``uint8`` (default)
+
+ ``int8``
+
+ ``int16``
+
+ ``float16``
+
+ Quantizing to 8 bits provides the best performance in terms of inference speed.
+ Quantizing to ``int16`` can provide higher inference accuracy at the price of higher inference
+ times. Interesting tradeoffs between speed and accuracy can be achieved using *mixed quantization*,
+ that is specifying the data type on a layer-by-layer basis. See section :ref:`mixed_quantization`.
+
+ - ``scheme``
+
+ Select the quantization scheme.
+ Available schemes are:
+
+ ``default`` (default)
+
+ ``asymmetric_affine``
+
+ ``dynamic_fixed_point``
+
+ ``perchannel_symmetric_affine``
+
+ Scheme ``asymmetric_affine`` is only supported for data types ``int8`` and ``uint8``.
+ Scheme ``dynamic_fixed_point`` is only supported for data types ``int8`` and ``int16``.
+ Scheme ``perchannel_symmetric_affine`` is only supported for data type ``int8``.
+ If the scheme is not specified or set to ``default``, it will be automatically selected according to the
+ data type: ``asymmetric_affine`` will be used for ``uint8``, ``dynamic_fixed_point`` for signed
+ types ``int8`` and ``int16``.
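+
+ For example, a 16-bit quantization explicitly selecting the scheme that ``default`` would pick
+ for ``int16`` can be written as follows (an illustrative sketch):
+
+ .. code-block:: yaml
+
+     quantization:
+         data_type: int16
+         scheme: dynamic_fixed_point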
+
+ - ``mode``
+
+ Select the quantization mode.
+ Available modes are:
+
+ ``standard`` (default)
+
+ ``full``
+
+ The ``standard`` mode should be used most of the time. In this mode only the layer types for
+ which this makes sense are quantized. Other layer types where quantization is not helpful
+ are left unchanged (e.g. layers which just change the layout of the data).
+ The ``full`` mode forces the quantization of all layers. This can in some cases reduce the
+ inference accuracy, so it should be used only when needed. One case where this is useful is,
+ for example, when the standard quantization doesn't quantize the initial layer, so that the input
+ remains in float16, which would require a data type conversion in software.
+
+
+ - ``algorithm``
+
+ Select the quantization algorithm.
+ Available algorithms are:
+
+ ``standard`` (default)
+
+ ``kl_divergence``
+
+ ``moving_average``
+
+ - ``options``
+
+ Special options for fine tuning the quantization in specific cases. Normally not needed.
+
+ - ``dataset``
+ :sup:`(q)`
+
+ Quantization dataset(s), that is the set of input files to be used to quantize the model.
+ In case of multi-input networks, it is necessary to specify one dataset per input.
+ Each dataset consists of the sample files to be applied to the corresponding input during
+ quantization.
+
+ A sample file can be provided in one of two forms:
+
+ 1. as an image file (``.jpg`` or ``.png``)
+
+ 2. as a NumPy file (``.npy``)
+
+ Image files are suitable when the network inputs are images, that is 4-dimensional tensors
+ (NCHW or NHWC). In this case the ``means`` and ``scale`` values specified for the corresponding
+ input are applied to each input image before it is used to quantize the model. Furthermore
+ each image is resized to fit the input tensor.
+
+ NumPy files can be used for all kinds of network inputs.
+ A NumPy file shall contain an array of data with the same shape as the corresponding network input.
+ In this case it is not possible to specify a ``means`` and ``scale`` for the input;
+ any preprocessing, if needed, has to be done when the NumPy file is generated.
+
+ To avoid having to manually list the files in the quantization dataset for each input, the
+ quantization dataset is instead specified with a list of *glob expressions*, one glob
+ expression for each input. This makes it very easy to use the entire content of a directory,
+ or a subset of it, as the quantization dataset for one input.
+ For example all the *jpeg* files in directory *samples* can be indicated with:
+
+ ``samples/*.jpg``
+
+ Both relative and absolute paths can be used. Relative paths are considered relative to
+ the location of the metafile itself. It is not possible to specify a mix of image and ``.npy``
+ files for the same input.
+ For more information on the glob specification syntax, please refer to the python
+ documentation: https://docs.python.org/3/library/glob.html
+
+ If the special keyword ``random`` is specified, a random data file will be automatically generated
+ for this input. This option is useful for preliminary timing tests, but not for actual quantization.
+
+ If this field is not specified, quantization is disabled.
+
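+
+ For example, the quantization section for a hypothetical network with two inputs, the first fed
+ with images and the second with pre-normalized NumPy data, could look like the sketch below
+ (the directory names are illustrative):
+
+ .. code-block:: yaml
+
+     quantization:
+         data_type: uint8
+         dataset:
+             - ../samples/images/*.jpg
+             - ../samples/sensor/*.npy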
+
+.. note::
+
+ The fields marked :sup:`(pb)` are mandatory when converting ``.pb`` models.
+ The fields marked :sup:`(q)` are mandatory when quantizing models.
+
+.. note::
+
+ The metafile also supports limited variable expansion: ``${ENV:name}`` anywhere in the metafile
+ is replaced with the content of the environment variable *name* (or with the empty string if the
+ variable doesn't exist). ``${FILE:name}`` in a format string is replaced with the content of the
+ corresponding file (the file path is relative to that of the conversion metafile itself).
+ This feature should be used sparingly as it makes the metafile not self-contained.
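+
+For example, variable expansion could be used to avoid hard-coding long values in the metafile.
+In the sketch below ``SAMPLES_DIR`` is a hypothetical environment variable and ``anchors.txt`` a
+hypothetical file located next to the metafile:
+
+.. code-block:: yaml
+
+    quantization:
+        dataset:
+            - ${ENV:SAMPLES_DIR}/*.jpg
+
+    outputs:
+      - name: Squeeze
+        format: tflite_detection_boxes anchors=${FILE:anchors.txt}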
+
+
+.. _preprocessing:
+
+Preprocessing
+-------------
+
+The size, layout, format and range of the data to be provided in the input tensor(s) of a network
+is defined when the network model is created and trained.
+For example a typical `mobilenet-v1` `.tflite` model will expect an input image of size 224x224,
+with NHWC layout and channels organized in RGB order, with each pixel value normalized (rescaled)
+in the range [-1, 1].
+
+Unfortunately, in real world usage, the image to be processed is rarely available in this exact format.
+For example the image may come from a camera in 1920x1080 YUV format. This image must then be converted
+to RGB, resized and normalized to match the expected input.
+Many libraries exist to perform this kind of conversion, but the problem is that these computations
+are quite compute-intensive, so even if deeply optimized, doing this on the CPU will often require
+more time than that required by the inference itself.
+
+Another option is to retrain the network to accept as input the same data format that will be available
+at runtime. This option, while sometimes a good idea, also presents its own problems. For example
+it might not always be possible or practical to retrain a network, especially if the task has to
+be repeated for several input sizes and formats.
+
+To simplify and speed up this task, SyNAP Toolkit can automatically insert input preprocessing
+code when a model is converted. This code is executed directly in the NPU and in some cases can be an order
+of magnitude faster than the equivalent operation on the CPU. An alternative to adding the preprocessing
+to the original model is to create a separate "preprocessing model" whose only purpose is to convert
+the input image to the desired format and size, and then execute the two models in sequence without
+any additional data copy, see :ref:`buffer_sharing`.
+This can be convenient if the original model is large and the input can come in a variety of possible
+formats. Preprocessing models for the most common cases already come preinstalled.
+
+The available preprocessing options are designed for images and support 5 kinds of transformations:
+
+- format conversion (e.g. YUV to RGB, or RGB to BGR)
+- cropping
+- resize and downscale (without preserving proportions)
+- normalization to the required value range (e.g. normalize [0, 255] to [-1, 1])
+- data-type conversion (from uint8 to the data type of the network input layer, e.g. float16 or int16)
+
+Preprocessing is enabled by specifying the ``preprocess`` section in the input specification
+in the `.yaml` file. This section contains the following fields (the fields marked :sup:`(*)` are mandatory).
+Note that the *mean* and *scale* used to normalize the input values don't appear here because they are
+the same used to quantize the model (see ``means`` and ``scale`` fields in the input specification).
+
+
+``type``:sup:`(*)`
+~~~~~~~~~~~~~~~~~~
+
+This field specifies the format of the input data that will be provided to the network.
+Only image formats are supported at the moment. The SyNAP toolkit will add the required operations to
+convert the input data to the ``format`` and layout expected by the network input tensor.
+If the ``format`` of the network input tensor is not specified, it is assumed to be ``rgb`` by default.
+If this field is set to the empty string or to "``none``", no preprocessing is applied.
+
+Not all conversions are supported: ``gray`` input can only be used if the input tensor has 1 channel.
+All the other input formats except ``float32`` can only be used if the input tensor has 3 channels.
+
+Some input formats generate multiple data inputs for one network tensor. For example if ``nv12``
+is specified the converted network will have two inputs: the first for the ``y`` channel,
+the second for the ``uv`` channels. The preprocessing code will combine the data from these two
+inputs to feed the single ``rgb`` or ``bgr`` input tensor of the network.
+
+The following table contains a summary of all the supported input formats and for each the properties
+and meaning of each generated input tensor.
+Note that the layout of the input data is always ``NHWC`` except for the ``rgb888-planar``
+and ``float32`` formats.
+In all cases `H` and `W` represent the height and width of the input image.
+If the size of the input image is not explicitly specified these are taken from the ``H`` and ``W``
+of the network input tensor. In all cases each pixel component is represented with 8 bits.
+
+The ``float32`` type is a bit special in the sense that in this case the input is not considered
+to be an 8-bit image but raw 32-bit floating point values which are converted to the actual data type
+of the tensor. For this reason any tensor shape is allowed and resizing via the ``size`` field is not supported.
+
+..
+ Original json output from Acuity:
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | Preprocessing Type | Input# | Layout | Format | Input Description |
+ +==============================+===========+=============+===========+=============================+
+ | yuv444 | 0 | N1HW | y8 | Y component |
+ | +-----------+-------------+-----------+-----------------------------+
+ | | 1 | N1HW | u8 | U component |
+ | +-----------+-------------+-----------+-----------------------------+
+ | | 2 | N1HW | v8 | V component |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | yuv420 | 0 | N1HW | y8 | Y component |
+ | +-----------+-------------+-----------+-----------------------------+
+ | | 1 | N1HW | u8 | U component |
+ | +-----------+-------------+-----------+-----------------------------+
+ | | 2 | N1HW | v8 | V component |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | nv12 | 0 | N1HW | y8 | Y component |
+ | +-----------+-------------+-----------+-----------------------------+
+ | | 1 | N1H(Wx2) | uv8 | UV components interleaved |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | gray | 0 | N1HW | y8 | Y component |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | rgb | 0 | N1H(Wx3) | rgb | RGB components interleaved |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | bgra | 0 | N1H(Wx4) | bgra | BGRA components interleaved |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | rgb888p | 0 | N3HW | rgb | RGB components planar |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+ | rgb888p3 | 0 | N1HW | r8 | Red component |
+ | +-----------+-------------+-----------+-----------------------------+
+ | | 1 | N1HW | g8 | Green component |
+ | +-----------+-------------+-----------+-----------------------------+
+ | | 2 | N1HW | b8 | Blue component |
+ +------------------------------+-----------+-------------+-----------+-----------------------------+
+
+
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| Preprocessing Type | Input# | Shape | Format | Input Description |
++==============================+===========+=============+===========+=============================+
+| yuv444 | 0 | NHW1 | y8 | Y component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 1 | NHW1 | u8 | U component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 2 | NHW1 | v8 | V component |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| yuv420 | 0 | NHW1 | y8 | Y component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 1 | N(H/2)(W/2)1| u8 | U component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 2 | N(H/2)(W/2)1| v8 | V component |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| nv12 | 0 | NHW1 | y8 | Y component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 1 | N(H/2)(W/2)2| uv8 | UV components interleaved |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| nv21 | 0 | NHW1 | y8 | Y component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 1 | N(H/2)(W/2)2| vu8 | VU components interleaved |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| gray | 0 | NHW1 | y8 | Y component |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| rgb | 0 | NHW3 | rgb | RGB components interleaved |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| bgra | 0 | NHW4 | bgra | BGRA components interleaved |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| rgb888p | 0 | N3HW | rgb | RGB components planar |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| rgb888p3 | 0 | NHW1 | r8 | Red component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 1 | NHW1 | g8 | Green component |
+| +-----------+-------------+-----------+-----------------------------+
+| | 2 | NHW1 | b8 | Blue component |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+| float32 | 0 | any | | Floating point data |
++------------------------------+-----------+-------------+-----------+-----------------------------+
+
+
+.. note::
+
+ Specifying a *dummy* preprocessing (for example from ``rgb`` input to ``rgb`` tensor) can be
+ a way to implement normalization and data-type conversion using the NPU HW instead of doing the
+ same operations in SW.
+
+
+``size``
+~~~~~~~~
+
+This optional field specifies the size of the input image as a list containing the H and W
+dimensions in this order. Preprocessing will rescale the input image to the size of the corresponding
+input tensor of the network. The proportions of the input image are not preserved.
+If this field is not specified, the W and H dimensions of the input image are taken from the
+W and H of the network tensor.
+
+
+``crop``
+~~~~~~~~~
+
+Enable cropping. If specified, 4 additional scalar input tensors are added to the model (they can be
+seen in the generated ``model_info.txt``).
+These inputs contain a single 32-bit integer each and are used to specify at runtime
+the dimension and origin of the cropping rectangle inside the input image.
+If security is enabled these additional inputs will have security attribute "any" so that
+it is always possible to specify the cropping coordinates from the user application even if
+the model and the other input / output tensors are secure.
+The cropping inputs are added after the original model input in the following order:
+
+ - width of the cropping rectangle
+ - height of the cropping rectangle
+ - left coordinate of the cropping rectangle
+ - top coordinate of the cropping rectangle
+
+These inputs should be written using the ``Tensor`` scalar ``assign()`` method which accepts
+a value in pixels and converts it to the internal representation.
+Preprocessing will rescale the specified cropping rectangle to the size of the corresponding
+input tensor of the network. The proportions of the input image are not preserved.
+The area of the image outside the cropping rectangle is ignored.
+The cropping coordinates must be inside the dimensions of the input image, otherwise the content
+of the resulting image is undefined.
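+
+As a concrete sketch, mirroring the example metafile shown earlier, an input section enabling
+``nv21`` preprocessing with runtime cropping could be written as follows:
+
+.. code-block:: yaml
+
+    inputs:
+      - name: input
+        shape: [1, 224, 224, 3]
+        means: [128, 128, 128]
+        scale: 128
+        format: rgb
+        preprocess:
+            type: nv21
+            size: [1920, 1080]
+            crop: true
+
+At runtime the four additional cropping inputs are then set (in pixels) with the ``Tensor``
+scalar ``assign()`` method as described above.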
+
+
+Model Quantization
+------------------
+
+In order to efficiently run a model on the NPU HW it has to be *quantized*.
+Quantization consists of reducing the precision of the weights and activations of the model, so that
+computations can be done using 8-bit or 16-bit integer values instead of the much more computationally
+intensive 32-bit floating point.
+A common side-effect of quantization is a reduction in the accuracy of the results, so it must be done
+with care.
+
+There are three ways in which a model can be quantized:
+
+ - during training, using quantization-aware training features available in recent training
+ frameworks such as TensorFlow and PyTorch. These techniques compensate for the
+ reduced precision induced by quantization during the training phase itself, thus providing
+ in principle better results.
+
+ - after training, using the same training framework, to convert a trained floating point model
+ into a quantized one (e.g. convert the model to a quantized ``uint8`` ``.tflite`` model).
+ The advantage of both these methods is that they benefit from advances
+ in the quantization techniques in these frameworks and the generated model is still a standard
+ model, so the effect of quantization can be tested and evaluated using standard tools.
+
+ - when converting the model using the SyNAP toolkit. This is the most convenient way to quantize
+ models outside any training framework and to take advantage of specific features of the SyNAP
+ NPU and toolkit (e.g. 16-bit or mixed-type quantization).
+
+
+In order to quantize a model it is necessary to determine an estimate of the range
+of the output values of each layer. This can be done by running the model on a set of sample
+input data and analyzing the resulting activations for each layer.
+To achieve a good quantization these sample inputs should be as representative as possible of
+the entire set of expected inputs. For example for a classification network the quantization
+dataset should contain at least one sample for each class. This is the bare minimum;
+better quantization results can be achieved by providing multiple samples for each class,
+for example in different conditions of size, color and orientation. In case of multi-input
+networks, each input must be fed with an appropriate sample at each inference.
+
+
+Quantization Images Resize
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The image files in the quantization dataset don't have to match the size of the input tensor.
+SyNAP toolkit automatically resizes each image to fit the input tensor. Starting from SyNAP 2.6.0
+this transformation is done by preserving the aspect-ratio of the image content. If the image and
+the tensor have different aspect ratios, gray bands are added to the input
+image so that the actual content is not distorted.
+This corresponds to what is normally done at runtime and is important in order to achieve a
+reliable quantization. The aspect ratio is not preserved if the ``format`` string of the
+corresponding input contains the ``keep_proportions=0`` attribute: in this case the image is simply
+resized to fill the entire input tensor.
+
+
+Data Normalization
+~~~~~~~~~~~~~~~~~~
+
+When a model is trained the input data are often normalized in order to bring them to a range
+more suitable for training. It's quite common to bring them into the range [-1, 1] by subtracting the mean
+of the data distribution and dividing by the range (or standard deviation).
+A different mean value can be used for each channel.
+
+In order to perform quantization correctly it is important to apply the same transformation to the
+input images or input samples used. If this is not done, the model will be quantized using
+a data distribution that is not the same as that used during training (and during inference)
+with poor results. This information has to be specified in the ``means`` and ``scale`` fields
+in the conversion metafile and will be applied to all input *image* files in the quantization
+dataset for the corresponding input using the formula::
+
+ norm = (in - means[channel]) / scale
+
+
+For *data* (``.npy``) files this is not done; it is assumed that they are already normalized.
+
+In addition, the same transformation must also be applied at runtime on the input data when doing
+inference. If the model has been compiled with preprocessing enabled, data normalization is
+embedded in the model and will take place during inference inside the NPU.
+Otherwise data has to be normalized in SW. The ``Tensor`` class provides an ``assign()`` method
+that does exactly this, using the same ``means`` and ``scale`` fields specified
+in the conversion metafile (this method is smart enough to skip SW normalization when normalization
+is embedded in the model).
+
+HW and SW normalization can be used interchangeably, and provide the same result.
+NPU normalization is generally somewhat faster, but this has to be checked case by case.
+In case of SW normalization, using the same mean for all the channels or using a mean of 0
+and scale of 1 can in some cases improve performance: for example if affine quantization is used
+the normalization and quantization formula (``qval = (normalized_in + zero_point) * qscale``)
+can become one the inverse of the other, thus resulting in a very efficient direct data copy.
+
+The ``Tensor::assign()`` method is optimized to handle each case in the most efficient way possible.
+If needed this could be further improved by the customer by taking advantage of the
+ARM NEON SIMD instructions.
+
+
+Quantization and Accuracy
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As already noted, quantizing a model, even if done correctly, will often result in some
+accuracy loss when compared to the original floating point model.
+This effect can be reduced by quantizing the model to 16 bits, but the inference time will be higher.
+As a rule of thumb quantizing a model to 16 bits will double the inference time compared to the same
+model quantized to 8 bits.
+
+The quantization errors introduced are not uniform across all the layers: they might be small for
+some layers and significant for others. The *Quantization Entropy* is a measure of the error introduced
+in each layer.
+
+A ``quantization_entropy.txt`` file can be generated by quantizing a model with the ``kl_divergence``
+algorithm. This file will contain the quantization entropy for each weight and activation tensor
+in the network. It can be used as a guide to understand where errors are introduced in the network.
+Each entropy value is in the range [0, 1], the closer to 1 the higher the quantization
+error introduced. The ``kl_divergence`` algorithm is an iterative algorithm based on
+https://arxiv.org/pdf/1501.07681v1.pdf and tries to minimize the Kullback-Leibler divergence
+between the original and quantized outputs. It is slower than the standard algorithm but
+can produce more accurate results.
+
+The quantization error for problematic layers can be reduced by keeping them in float16 or
+quantizing them to 16 bits integer using mixed quantization.
+
+
+Per-Channel Quantization
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+SyNAP supports per-channel quantization by specifying the ``perchannel_symmetric_affine`` quantization scheme.
+With this scheme weight scales are computed per-channel (each channel has its own scale),
+while activations still have a single scale and bias for the entire tensor as in ``asymmetric_affine`` quantization.
+When the distribution of the weight values changes a lot from one channel to another, having a separate scale
+for each channel can provide a more accurate approximation of the original weights and so an improved
+inference accuracy.
+
+
+.. _mixed_quantization:
+
+
+Mixed Quantization
+~~~~~~~~~~~~~~~~~~
+
+Mixed quantization is a feature of the SyNAP toolkit that makes it possible to choose the data type
+to be used for each layer when a network is quantized during conversion.
+This allows a custom balance to be achieved between inference speed and accuracy.
+
+Different approaches are possible:
+
+ - quantize the entire network to 16 bits and keep just the input in 8 bits.
+ This provides the best accuracy possible and can be convenient when the input is an 8-bit image
+ since it avoids the need to perform the 8-to-16 bit conversion in SW (note that this is not
+ needed if preprocessing is used as it will also take care of the type conversion)
+
+ - quantize most of the network in 8 bits and just the *problematic* layers with ``int16`` or
+ even ``float16``.
+ The quantization entropy can provide a guide to select the layers which would get
+ more benefit from 16 bits. Note however that each change in data-type requires a conversion
+ layer before and after it, so it is normally a good idea to avoid changing data-type too
+ many times
+
+ - quantize the initial part (*backbone*) of the network in ``uint8`` and switch to ``int16`` for the
+ last part (*head*). This is often a good choice when the input of the network is an 8-bit
+ image, as networks should not in general be too sensitive to small noise in the input.
+ Using 16-bit processing in the head makes it possible to compute the final results (e.g. bounding boxes)
+ with much greater precision without adding too much in terms of inference time
+
+
+To see how this is done let's consider the very simple model in :ref:`quant_sample_model`.
+
+.. _quant_sample_model:
+.. uml::
+ :scale: 50%
+ :caption: Sample Model
+
+ skinparam monochrome true
+ skinparam handwritten false
+ hide members
+ hide methods
+ hide fields
+ interface input1
+ class conv1
+ class conv2
+ class conv3
+ class conv4
+ class conv5
+ class conv6
+
+ input1 --> conv1
+ conv1 --> conv2
+ conv2 --> conv3
+ conv3 --> conv4
+ conv2 --> conv5
+ conv5 --> conv6
+
+This model has one input and six convolutions.
+We've already seen how to compile it with uniform quantization, for example using 16 bits integers:
+
+.. code-block:: yaml
+
+ quantization:
+ data_type: int16
+
+
+Instead of a single type, the ``data_type`` field can contain an association map between
+layer-names and layer-types. Layer names are those that appear in the model to be converted; it's
+easy to see them using free tools such as *Netron*. So, the previous example is equivalent to:
+
+.. code-block:: yaml
+
+ quantization:
+ data_type:
+ input1: int16
+ conv1: int16
+ conv2: int16
+ conv3: int16
+ conv4: int16
+ conv5: int16
+ conv6: int16
+
+
+To perform mixed-type quantization just select the desired type for each layer. The only limitation
+is that ``uint8`` and ``int8`` types can't be both present at the same time. For example we can
+choose to quantize the input and first convolution to 8 bits, the internal convolutions to 16 bits,
+and to keep the final convolutions in floating point:
+
+.. code-block:: yaml
+
+ quantization:
+ data_type:
+ input1: uint8
+ conv1: uint8
+ conv2: int16
+ conv3: int16
+ conv4: float16
+ conv5: int16
+ conv6: float16
+
+Real models can often have well above one hundred layers, so writing an exhaustive list of all the layers
+can become confusing and error-prone. To keep the type specification simpler there are a few
+shortcuts that can be used. First of all, layers can be omitted: layers not explicitly
+listed will be quantized by default to ``uint8``. Furthermore, some special conventions in the layer
+name specification can help:
+
+ - INPUTS : this special name is automatically expanded to the names of all the inputs of the network
+ - '*@layerId*' : a name preceded by the '@' prefix is interpreted as a *layerID* (see note below)
+ - *layername...* : a name followed by three dots is expanded to the names of all the layers that
+ *follow* the layer specified in the model (in execution order). Useful when for example
+ we want to use the same data type for the head of the network or an entire branch.
+ - ``'*'`` : expanded to the names of all the layers that haven't been explicitly specified
+
+The type specifications are applied in the order they are declared (except for '*') so it is possible
+to further override the type of layers already specified.
+
+.. note::
+
+ During the compilation of a model several optimizations are applied and some layers
+ in the original network may be fused together or optimized away completely.
+ For optimized away layers it is of course not possible to specify the data type.
+ For fused layers the issue is that they will not have the same name as the original layers.
+ In this case it is possible to identify them by *layerId*: a *layerId* is a unique identifier
+ assigned to each compiled layer. This is also a convenient way to identify layers in case the
+ original model has layers with ambiguous or empty names. It is possible to see the list of all
+ layerIDs for a compiled model in the generated ``quantization_info.yaml``
+ or ``quantization_entropy.txt`` file.
+
+
+Let's see a few examples applied to our sample network.
+
+.. code-block:: yaml
+
+ # Quantize input1 as int8, everything else as int16
+ quantization:
+ data_type:
+ INPUTS: int8
+ '*': int16
+
+
+.. code-block:: yaml
+
+ # Quantize as uint8 but use int16 for conv3, conv4, conv5, conv6
+ quantization:
+ data_type:
+ '*': uint8
+ conv2...: int16
+
+
+
+.. code-block:: yaml
+
+ # Quantize as uint8 but use int16 for conv3, conv4, conv6 but float16 for conv5
+ quantization:
+ data_type:
+ '*': uint8
+ conv2...: int16
+ conv5: float16
+
+In the two examples above the specification ``'*': uint8`` could have been omitted since ``uint8``
+is already the default, but it helps make the intention more explicit.
+
+If we specify the data type for a layer that has been fused, we will get a "*Layer name*" error at conversion time.
+In this case we have to look for the *layerId* of the corresponding fused layer in ``quantization_info.yaml``
+and use the "@" syntax as explained above. For example if in our sample model ``conv5`` and ``conv6``
+have been fused, we will get an error if we specify the type for ``conv5`` alone.
+Looking in ``quantization_info.yaml`` we can find the ID of the fused layer, as in:
+``'@Conv_Conv_5_200_Conv_Conv_6_185:weight':``
+
+
+We can then use this layer ID in the metafile to specify the data type of the fused layers:
+
+.. code-block:: yaml
+
+ # Quantize as uint8 but use int16 for conv3 and conv4, and float16 for the fused conv5+conv6
+ quantization:
+ data_type:
+ '*': uint8
+ conv2...: int16
+ '@Conv_Conv_5_200_Conv_Conv_6_185': float16
+
+
+
+.. raw:: latex
+
+ \clearpage
+
+
+.. _heterogeneous_inference:
+
+Heterogeneous Inference
+-----------------------
+
+In some cases it can be useful to execute different parts of a network on different hardware.
+For example consider an object detection network, where the initial part contains a bunch of convolutions
+and the final part some postprocessing layer such as `TFLite_Detection_PostProcess`.
+The NPU is heavily optimized for executing convolutions, but doesn't support the postprocessing layer,
+so the best approach would be to execute the initial part of the network on the NPU
+and the postprocessing on the CPU.
+
+This can be achieved by specifying the delegate to be used on a per-layer basis, using the same syntax
+as we've seen for mixed quantization in section :ref:`mixed_quantization`.
+For example, considering again the Model in :ref:`quant_sample_model`, we can specify that
+all layers should be executed on the NPU, except ``conv5`` and the layers that follow it
+which we want to execute on the GPU:
+
+.. code-block:: yaml
+
+ # Execute the entire model on the NPU, except conv5 and conv6
+ delegate:
+ '*': npu
+ conv5: gpu
+ conv5...: gpu
+
+Another advantage of distributing processing to different hardware delegates is that
+when the model is organized in multiple independent branches (so that a branch can be executed
+without having to wait for the result of another branch) and each branch is executed on a different
+HW unit, the branches can be executed in parallel.
+
+In this way the overall inference time can be reduced to the time it takes to execute the slowest branch.
+Branch parallelization is always done automatically whenever possible.
+
+.. note::
+
+ Branch parallelization should not be confused with in-layer parallelization, which is also
+ always active when possible. In the example above the two branches `(conv3,conv4)` and `(conv5,conv6)`
+ are executed in parallel, the former on the NPU and the latter on the GPU.
+ In addition, each convolution layer is parallelized internally by taking advantage
+ of the parallelism available in the NPU and GPU HW.
+
+.. raw:: latex
+
+ \clearpage
+
+.. _model_conversion_tutorial:
+
+Model Conversion Tutorial
+-------------------------
+Let's see how to convert and run a typical object-detection model.
+
+ 1. Download the sample `ssd_mobilenet_v1_1_default_1.tflite` object-detection model:
+
+ https://tfhub.dev/tensorflow/lite-model/ssd_mobilenet_v1/1/default/1
+
+ 2. Create a conversion metafile ``ssd_mobilenet.yaml`` with the content here below
+ (Important: newlines and indentation must be respected, but they may be lost
+ when copy-pasting from a pdf)::
+
+ outputs:
+ - name: Squeeze
+ dequantize: true
+ format: tflite_detection_boxes y_scale=10 x_scale=10 h_scale=5 w_scale=5 anchors=${ANCHORS}
+ - name: convert_scores
+ dequantize: true
+ format: per_class_confidence class_index_base=-1
+
+ A few notes on the content of this file:
+
+ "``name: Squeeze``" and "``name: convert_scores``"
+ explicitly specify the output tensors
+ where we want model conversion to stop. The last layer (``TFLite_Detection_PostProcess``)
+ is a custom layer not suitable for NPU acceleration, so it is implemented in software
+ in the ``Detector`` postprocessor class.
+
+ "``dequantize: true``"
+ performs conversion from quantized to float directly in the NPU.
+ This is much faster than doing conversion in software.
+
+ "``tflite_detection_boxes``" and "``convert_scores``"
+ represent the content and data organization in these tensors
+
+ "``y_scale=10``" "``x_scale=10``" "``h_scale=5``" "``w_scale=5``"
+ correspond to the parameters in the ``TFLite_Detection_PostProcess`` layer in the network
+
+ "``${ANCHORS}``"
+ is replaced at conversion time with the ``anchor`` tensor from the
+ ``TFLite_Detection_PostProcess`` layer. This is needed to be able to compute the bounding
+ boxes during postprocessing.
+
+ "``class_index_base=-1``"
+ this model has been trained with an additional background class
+ as index 0, so we subtract 1 from the class index during postprocessing to conform to the
+ standard `coco` dataset labels.
+
+
+ 3. Convert the model (be sure that the model, meta and output dir are in a directory visible
+ in the container, see ``-v`` option in :ref:`running-toolkit-label`)::
+
+ $ synap convert --model ssd_mobilenet_v1_1_default_1.tflite --meta ssd_mobilenet.yaml --target VS680 --out-dir compiled
+
+ 4. Push the model to the board::
+
+ $ adb root
+ $ adb remount
+ $ adb shell mkdir /data/local/tmp/test
+ $ adb push compiled/model.synap /data/local/tmp/test
+
+
+ 5. Execute the model::
+
+ $ adb shell
+ # cd /data/local/tmp/test
+ # synap_cli_od -m model.synap $MODELS/object_detection/coco/sample/sample001_640x480.jpg
+
+ Input image: /vendor/firmware/.../sample/sample001_640x480.jpg (w = 640, h = 480, c = 3)
+ Detection time: 5.69 ms
+ # Score Class Position Size Description
+ 0 0.70 2 395,103 69, 34 car
+ 1 0.68 2 156, 96 71, 43 car
+ 2 0.64 1 195, 26 287,445 bicycle
+ 3 0.64 2 96,102 18, 16 car
+ 4 0.61 2 76,100 16, 17 car
+ 5 0.53 2 471, 22 167,145 car
+
+
+.. _model-profiling-label:
+
+Model Profiling
+---------------
+
+When developing and optimizing a model it can be useful to understand how the execution time is
+distributed among the layers of the network. This provides an indication of which layers are executed
+efficiently and which instead represent bottlenecks.
+
+In order to obtain this information the network has to be executed step by step so that
+each single timing can be measured. For this to be possible the network must be generated with
+additional profiling instructions by calling ``synap convert`` with the ``--profiling`` option,
+for example::
+
+ $ synap convert --model mobilenet_v2_1.0_224_quant.tflite --target VS680 --profiling --out-dir mobilenet_profiling
+
+.. note::
+
+ Even if the execution time of each layer doesn't change between *normal* and *profiling* mode,
+ the overall execution time of a network compiled with profiling enabled will be noticeably
+ higher than that of the same network compiled without profiling, due to the fact that NPU
+ execution has to be started and suspended several times to collect the profiling data.
+ For this reason profiling should normally be disabled, and enabled only when needed for
+ debugging purposes.
+
+.. note::
+
+ When a model is converted using SyNAP toolkit, layers can be fused, replaced with equivalent
+ operations and/or optimized away, hence it is generally not possible to find a one-to-one
+ correspondence between the items in the profiling information and the nodes in the original network.
+ For example adjacent convolution, ReLU and pooling layers are fused together in a single
+ *ConvolutionReluPoolingLayer* layer whenever possible.
+ Despite these optimizations the correspondence is normally not too difficult to find.
+ The layers shown in the profiling correspond to those listed in the `model_info.txt` file
+ generated when the model is converted.
+
+After each execution of a model compiled in profiling mode, the profiling information will be
+available in `sysfs`, see :ref:`sysfs-networks`. Since this information is not persistent
+but goes away when the network is destroyed, the easiest way to collect it is by using the `synap_cli`
+program. The ``--profiling`` option allows saving a copy of the `sysfs` `network_profile` file
+to a specified file before the network is destroyed::
+
+ $ adb push mobilenet_profiling $MODELS/image_classification/imagenet/model/
+ $ adb shell
+ # cd $MODELS/image_classification/imagenet/model/mobilenet_profiling
+ # synap_cli -m model.synap --profiling mobilenet_profiling.txt random
+
+ # cat mobilenet_profiling.txt
+ pid: 21756, nid: 1, inference_count: 78, inference_time: 272430, inference_last: 3108, iobuf_count: 2, iobuf_size: 151529, layers: 34
+ | lyr | cycle | time_us | byte_rd | byte_wr | type
+ | 0 | 152005 | 202 | 151344 | 0 | TensorTranspose
+ | 1 | 181703 | 460 | 6912 | 0 | ConvolutionReluPoolingLayer2
+ | 2 | 9319 | 51 | 1392 | 0 | ConvolutionReluPoolingLayer2
+ | 3 | 17426 | 51 | 1904 | 0 | ConvolutionReluPoolingLayer2
+ | 4 | 19701 | 51 | 1904 | 0 | ConvolutionReluPoolingLayer2
+ ...
+ | 28 | 16157 | 52 | 7472 | 0 | ConvolutionReluPoolingLayer2
+ | 29 | 114557 | 410 | 110480 | 0 | FullyConnectedReluLayer
+ | 30 | 137091 | 201 | 2864 | 1024 | Softmax2Layer
+ | 31 | 0 | 0 | 0 | 0 | ConvolutionReluPoolingLayer2
+ | 32 | 0 | 0 | 0 | 0 | ConvolutionReluPoolingLayer2
+ | 33 | 670 | 52 | 1008 | 0 | ConvolutionReluPoolingLayer2
+
+
+Compatibility with SyNAP 2.x
+----------------------------
+
+SyNAP 3.x is fully backward compatible with SyNAP 2.x.
+
+ - It is possible to execute models compiled with SyNAP 3.x toolkit with SyNAP 2.x runtime.
+ The only limitation is that in this case heterogeneous compilation is not available and the
+ entire model will be executed on the NPU. This can be done by specifying the ``--out-format nb``
+ option when converting the model. In this case the toolkit will generate in output the legacy
+ ``model.nb`` and ``model.json`` files instead of the ``model.synap`` file::
+
+ $ synap convert --model mobilenet_v2_1.0_224_quant.tflite --target VS680 --out-format nb --out-dir mobilenet_legacy
+
+ - It is possible to execute models compiled with SyNAP 2.x toolkit with SyNAP 3.x runtime
+
+ - SyNAP 3.x API is an extension of SyNAP 2.x API, so all the existing applications can be used
+ without any modification
+
+
+.. _working-with-pytorch-models-label:
+
+
+Working with PyTorch Models
+---------------------------
+
+PyTorch framework supports very flexible models where the architecture and behaviour of the network
+is defined using Python classes instead of fixed graph layers as for example in `TFLite`.
+When saving a model, normally only the ``state_dict``, that is the learnable parameters, is saved and not
+the model structure itself (https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference).
+The original Python code used to define the model is needed to reload the model
+and execute it. For this reason there is no way for the toolkit to directly import a PyTorch model
+from a `.pt` file containing only the learnable parameters.
+
+When saving a torch model in a `.pt` file it is also possible to include references to the Python classes
+defining the model, but even in this case it's impossible to recreate the model from just the `.pt` file
+without the exact Python source tree used to generate it.
+
+A third possibility is to save the model in `TorchScript` format. In this case the saved model
+contains both the learnable parameters `and` the model structure.
+
+This format can be imported directly using the SyNAP toolkit.
+
+For more info on how to save a model in the `TorchScript` format see:
+https://pytorch.org/tutorials/beginner/saving_loading_models.html#export-load-model-in-torchscript-format
+
+An alternative way to save a model in TorchScript format is to use `tracing`.
+Tracing records the operations that are executed when a model is run and is a good way to convert
+a model when exporting with ``torch.jit.script`` is problematic, for example when the model
+has a dynamic structure.
+In both cases the generated file will have the same format, so models saved with tracing can also be imported directly.
+A detailed comparison of the two techniques is available online searching for "pytorch tracing vs scripting".
+
+Here below is an example of saving a torch model with scripting or tracing:
+
+.. code-block:: python
+
+ import torch
+ import torchvision
+
+ # An instance of your model
+ model = torchvision.models.mobilenet_v2(pretrained=True)
+
+ # Switch the model to eval mode
+ model.eval()
+
+ # Generate a torch.jit.ScriptModule via scripting
+ mobilenet_scripted = torch.jit.script(model)
+
+ # Save the scripted model in TorchScript format
+ mobilenet_scripted.save("mobilenet_scripted.torchscript")
+
+
+ # An example input you would normally provide to your model's forward() method.
+ example = torch.rand(1, 3, 224, 224)
+
+ # Generate a torch.jit.ScriptModule via tracing
+ mobilenet_traced = torch.jit.trace(model, example)
+
+ # Save the traced model in TorchScript format
+ mobilenet_traced.save("mobilenet_traced.torchscript")
+
+
+.. important::
+
+ Even if there exist multiple possible ways to save a PyTorch model to a file, there is no
+ agreed convention for the extension used in the different cases, and the `.pt` or `.pth` extension is commonly used
+ no matter the format of the file. Only `TorchScript` models can be imported with the SyNAP toolkit;
+ if the model is in a different format the import will fail with an error message.
+
+.. note::
+
+ Working with `TorchScript` models is not very convenient when performing mixed quantization or
+ heterogeneous inference, as the model layers sometimes don't have names or the name is modified during the
+ import process and/or there is not a one-to-one correspondence between the layers in the original
+ model and the layers in the imported one. The suggestion in this case is to compile the model
+ with the ``--preserve`` option and then look at the intermediate ``build/model.onnx`` file
+ inside the output directory.
+
+
+An even more portable alternative to exporting a model to TorchScript is to export it to ONNX format.
+The required code is very similar to the one used to trace the model:
+
+.. code-block:: python
+
+ import torch
+ import torchvision
+
+ # An instance of your model
+ model = torchvision.models.mobilenet_v2(pretrained=True)
+
+ # Switch the model to eval mode
+ model.eval()
+
+ # Export the model in ONNX format
+ torch.onnx.export(model, torch.rand(1, 3, 224, 224), "mobilenet.onnx")
+
+
+
+Importing YOLO PyTorch Models
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The popular YOLO library from `ultralytics` provides pretrained `.pt` models on their website.
+These models are not in `TorchScript` format and so can't be imported directly with the SyNAP toolkit.
+Nevertheless it's very easy to export them to `ONNX` or `TorchScript` so that they can be imported:
+
+.. code-block:: python
+
+ from ultralytics import YOLO
+
+ # Load an official YOLO model
+ model = YOLO("yolov8s.pt")
+
+ # Export the model in TorchScript format
+ model.export(format="torchscript", imgsz=(480, 640))
+
+ # Export the model in ONNX format
+ model.export(format="onnx", imgsz=(480, 640))
+
+
+More information on exporting YOLO models to ONNX is available at https://docs.ultralytics.com/modes/export/.
+Most public-domain machine learning packages provide similar export functions for their PyTorch models.
diff --git a/manual/statistics.md b/manual/statistics.md
new file mode 100644
index 0000000..28a4a3c
--- /dev/null
+++ b/manual/statistics.md
@@ -0,0 +1,130 @@
+Statistics and Usage
+====================
+
+SyNAP provides NPU usage information and statistics via the standard Linux `/sysfs` interface. `/sysfs` exposes information about system devices and resources through a pseudo file-system, where each piece of information appears as a file that can be read and written by the user with standard tools.
+
+On Android, statistics are available in `/sys/class/misc/synap/device/misc/synap/statistics/`:
+
+```bash
+$ SYNAP_STAT_DIR=/sys/class/misc/synap/device/misc/synap/statistics
+```
+
+On Yocto Linux, they are in `/sys/class/misc/synap/statistics/`:
+
+```bash
+$ SYNAP_STAT_DIR=/sys/class/misc/synap/statistics
+```
+
+```bash
+$ ls $SYNAP_STAT_DIR
+inference_count inference_time network_profile networks
+```
+
+**Important**: The content of the statistics files is only available from the **root** user.
+
+**Note**: There are no statistics regarding inference performed on the CPU or the GPU. CPU inference can occur at the user-space level and it's not possible to track it inside the SyNAP driver.
+
+## `inference_count`
+
+This file contains the total number of inferences performed on the NPU since system startup. Example:
+
+```bash
+# cat $SYNAP_STAT_DIR/inference_count
+1538
+```
+
+## `inference_time`
+
+This file contains the total time spent doing NPU inferences since system startup. It is a 64-bit integer expressed in microseconds. Example:
+
+```bash
+# cat $SYNAP_STAT_DIR/inference_time
+32233264
+```
+
+## `networks`
+
+This file contains detailed information for each network *currently* loaded in the NPU driver with a line per network. Each line contains the following information:
+
+- **pid**: process that created the network
+- **nid**: unique network ID
+- **inference_count**: number of inferences for this network
+- **inference_time**: total inference time for this network in µs
+- **inference_last**: last inference time for this network in µs
+- **iobuf_count**: number of I/O buffers currently registered to the network
+- **iobuf_size**: total size of I/O buffers currently registered to the network
+- **layers**: number of layers in the network
+
+Example:
+
+```bash
+# cat $SYNAP_STAT_DIR/networks
+pid: 3628, nid: 38, inference_count: 22, inference_time: 40048, inference_last: 1843, iobuf_count: 2, iobuf_size: 151529, layers: 34
+pid: 3155, nid: 4, inference_count: 3, inference_time: 5922, inference_last: 1843, iobuf_count: 2, iobuf_size: 451630, layers: 12
+```
+
+**Important**: This file will be empty if there is no network currently loaded.
+
+It's easy to show in real-time the information for all the networks currently loaded with the standard `watch` command:
+
+```bash
+# watch -n 1 cat $SYNAP_STAT_DIR/networks
+```
+
+## `network_profile`
+
+This file contains detailed information for each network *currently* loaded in the NPU driver, with a line per network. The information in each line is the same as in the `networks` file. Additionally, if a model has been compiled offline with profiling enabled or executed online with profiling enabled, the corresponding line will be followed by detailed layer-by-layer information:
+
+- **lyr**: index of the layer (or group of layers)
+- **cycle**: number of execution cycles
+- **time_us**: execution time in µs
+- **byte_rd**: number of bytes read
+- **byte_wr**: number of bytes written
+- **ot**: operation type (NN: Neural Network core, SH: Shader, TP: TensorProcessor)
+- **name**: operation name
+
+Example:
+
+```bash
+# cat $SYNAP_STAT_DIR/network_profile
+pid: 21756, nid: 1, inference_count: 78, inference_time: 272430, inference_last: 3108, iobuf_count: 2, iobuf_size: 151529, layers: 34
+| lyr | cycle | time_us | byte_rd | byte_wr | ot | name
+| 0 | 153811 | 202 | 151344 | 0 | TP | TensorTranspose
+| 1 | 181903 | 461 | 6912 | 0 | NN | ConvolutionReluPoolingLayer2
+| 2 | 9321 | 52 | 1392 | 0 | NN | ConvolutionReluPoolingLayer2
+| 3 | 17430 | 51 | 1904 | 0 | NN | ConvolutionReluPoolingLayer2
+| 4 | 19878 | 51 | 1904 | 0 | NN | ConvolutionReluPoolingLayer2
+...
+| 28 | 16248 | 51 | 7472 | 0 | NN | ConvolutionReluPoolingLayer2
+| 29 | 125706 | 408 | 120720 | 0 | TP | FullyConnectedReluLayer
+| 30 | 137129 | 196 | 2848 | 1024 | SH | Softmax2Layer
+| 31 | 0 | 0 | 0 | 0 | -- | ConvolutionReluPoolingLayer2
+| 32 | 0 | 0 | 0 | 0 | -- | ConvolutionReluPoolingLayer2
+| 33 | 671 | 51 | 1008 | 0 | NN | ConvolutionReluPoolingLayer2
+```
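+
+The per-layer data can be post-processed in the same way. For example, a small `awk` sketch (again relying on the exact column layout shown above, and assuming a single profiled network is loaded) that totals the execution time per core type:
+
+```bash
+# Sum per-layer execution time (us) by operation type (NN, SH, TP)
+awk -F'|' '/^\|/ && $2 !~ /lyr/ && $7 !~ /--/ { type = $7; gsub(/ /, "", type); t[type] += $4 }
+           END { for (k in t) printf "%-2s: %d us\n", k, t[k] }' \
+    $SYNAP_STAT_DIR/network_profile
+```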
+
+## Clearing Statistics
+
+Statistics can be cleared by writing to either the `inference_count` or `inference_time` file. Example:
+
+```bash
+# cat $SYNAP_STAT_DIR/inference_time
+32233264
+# echo > $SYNAP_STAT_DIR/inference_time
+# cat $SYNAP_STAT_DIR/inference_time
+0
+# cat $SYNAP_STAT_DIR/inference_count
+0
+```
+
+## Using `/sysfs` Information
+
+The information available from `/sysfs` can easily be used by scripts or tools. For example, to compute the average NPU utilization over a 5-second period:
+
+```bash
+# Measurement period in microseconds (5 seconds)
+us=5000000
+# Reset the inference time counter, wait, then read back the NPU busy time
+echo > $SYNAP_STAT_DIR/inference_time
+usleep $us
+npu_usage=$((`cat $SYNAP_STAT_DIR/inference_time` * 100 / us))
+echo "Average NPU usage: $npu_usage%"
+```
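+
+The same measurement can be wrapped in a loop to monitor NPU utilization continuously; a simple sketch using a 5-second refresh period:
+
+```bash
+# Print the average NPU utilization every 5 seconds until interrupted
+while true; do
+    echo > $SYNAP_STAT_DIR/inference_time
+    sleep 5
+    echo "NPU usage: $(( $(cat $SYNAP_STAT_DIR/inference_time) / 50000 ))%"
+done
+```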
diff --git a/manual/synap_installation.md b/manual/synap_installation.md
new file mode 100644
index 0000000..38d6ae8
--- /dev/null
+++ b/manual/synap_installation.md
@@ -0,0 +1,120 @@
+# Installing SyNAP
+
+## Docker
+
+Please follow these guidelines for Docker installation, but always consult the official [Docker documentation](https://docs.docker.com/get-docker/) for more details.
+
+### Linux/Ubuntu
+
+Install Docker with the following command:
+
+```shell
+sudo apt-get install docker.io
+```
+
+To run Docker as a non-root user, execute these commands once after installing Docker (for more details, see the [Docker post-install guide](https://docs.docker.com/engine/install/linux-postinstall/)):
+
+```shell
+# Create the docker group if it doesn't already exist
+sudo groupadd docker
+# Add the current user "$USER" to the docker group
+sudo usermod -aG docker $USER
+# Log out and log back in (or run "newgrp docker") for the group change to take effect
+```
+
+### macOS - Docker
+
+Install Docker on macOS using the `brew` package manager. If `brew` is not installed, follow the instructions on the [Homebrew website](https://brew.sh/).
+
+```shell
+brew install docker
+```
+
+**Important**: Docker Desktop (the Docker GUI) is not free for commercial use on macOS. A free alternative is `Colima`.
+
+### macOS - Colima
+
+`Colima` is a free alternative to Docker on macOS, suitable for container runtimes without a GUI. Install `Colima` and necessary tools with the following commands:
+
+```shell
+brew install colima
+mkdir -p ~/.docker/cli-plugins
+brew install docker-buildx
+ln -sfn $(brew --prefix)/opt/docker-buildx/bin/docker-buildx ~/.docker/cli-plugins/docker-buildx
+colima start --vm-type vz --mount-type virtiofs --cpu 4 --memory 8 --disk 80
+```
+
+To start Colima after a system restart:
+
+```shell
+colima start
+```
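+
+An optional quick check that the virtual machine is running and that the Docker CLI can reach it:
+
+```shell
+# Check the Colima VM status and confirm the Docker daemon responds
+colima status
+docker ps
+```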
+
+For more information on using Colima, see this [guide](https://smallsharpsoftwaretools.com/tutorials/use-colima-to-run-docker-containers-on-macos/).
+
+### Windows
+
+On Windows, install Docker inside a Linux virtual machine using WSL2; Docker running directly on Windows is not compatible with other VMs.
+
+#### WSL2 Installation Steps
+
+1. Open the *Windows PowerShell* as Administrator and run the command to install WSL2:
+
+ ```shell
+ > wsl --install
+ ```
+
+ Restart your computer when the installation is complete.
+
+2. Install *Ubuntu-22.04* using *Windows PowerShell*:
+
+ ```shell
+ > wsl --install -d Ubuntu-22.04
+ ```
+
+3. Open *Windows Terminal*, select the *Ubuntu-22.04* distribution, and follow the Linux/Ubuntu installation instructions for Docker above.
+
+For further information on WSL2, refer to Microsoft’s [WSL install guide](https://learn.microsoft.com/en-us/windows/wsl/install) and [WSL setup guide](https://learn.microsoft.com/en-us/windows/wsl/setup/environment).
+
+## Installing SyNAP Tools
+
+Before installing the SyNAP toolkit, ensure Docker is functioning by running the `hello-world` image:
+
+```shell
+$ docker run hello-world
+```
+
+If you see a welcome message from Docker, proceed with installing the toolkit. If not, check the Docker installation steps.
+
+Download the SyNAP toolkit Docker image from the GitHub repository:
+
+```shell
+docker pull ghcr.io/synaptics-synap/toolkit:#SyNAP_Version#
+```
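+
+To confirm that the image has been downloaded, it can be listed with:
+
+```shell
+# List the downloaded toolkit image and its tag
+docker images ghcr.io/synaptics-synap/toolkit
+```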
+
+The toolkit’s latest version is available [here](https://github.com/synaptics-synap/toolkit/pkgs/container/toolkit).
+
+## Running SyNAP Tools
+
+After installing Docker and the SyNAP toolkit, you can run the model conversion tool directly in a Docker container. To simplify repeated commands, create an alias:
+
+```shell
+alias synap='docker run -i --rm -u $(id -u):$(id -g) -v $HOME:$HOME -w $(pwd) ghcr.io/synaptics-synap/toolkit:#SyNAP_Version#'
+```
+
+This setup ensures that:
+
+- The container runs interactively.
+- The container is removed after exiting.
+- The container runs with your user ID and group ID.
+- Your home directory is mounted to the container.
+- The working directory is set to your current directory.
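+
+To keep the alias across shell sessions, it can be appended to your shell startup file; a minimal sketch assuming `bash`:
+
+```shell
+# Persist the alias in ~/.bashrc (quoted heredoc prevents premature expansion)
+cat >> ~/.bashrc <<'EOF'
+alias synap='docker run -i --rm -u $(id -u):$(id -g) -v $HOME:$HOME -w $(pwd) ghcr.io/synaptics-synap/toolkit:#SyNAP_Version#'
+EOF
+```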
+
+To use the tool, type:
+
+```shell
+$ synap help
+```
+
+This command provides help and usage information for the toolkit. For detailed commands, run `synap COMMAND --help`.
+
+**Important**: If you receive a *Permission Denied* error, revisit the Docker setup for non-root users as described above.
\ No newline at end of file
diff --git a/manual/test.md b/manual/test.md
new file mode 100644
index 0000000..6748d01
--- /dev/null
+++ b/manual/test.md
@@ -0,0 +1,3 @@
+# Sample file
+
+This is a sample Markdown file.
\ No newline at end of file