Debugging Numerical Errors

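When an inference executable compiled by onnx-mlir produces numerical results that differ from those produced by the training framework, use the utils/RunONNXModel.py Python script to debug the numerical error. The script runs the model through onnx-mlir and through a reference backend, and compares the intermediate results produced by the two backends layer by layer.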

Prerequisites

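Set the ONNX_MLIR_HOME environment variable to the path of the onnx-mlir HOME directory. The HOME directory is the parent folder that contains the bin, lib, and similar subfolders in which the ONNX-MLIR executables and libraries can be found.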
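
A quick sanity check of this setting can be sketched as follows. The helper name missing_subfolders is hypothetical and not part of onnx-mlir; it only checks that the bin and lib subfolders mentioned above exist under the given directory.

```python
import os


def missing_subfolders(home, expected=("bin", "lib")):
    """Return the expected subfolders that do not exist under `home`.

    Hypothetical helper for illustration; not part of onnx-mlir.
    """
    return [name for name in expected
            if not os.path.isdir(os.path.join(home, name))]


if __name__ == "__main__":
    home = os.environ.get("ONNX_MLIR_HOME")
    if home is None:
        print("ONNX_MLIR_HOME is not set")
    else:
        missing = missing_subfolders(home)
        if missing:
            print(f"ONNX_MLIR_HOME={home} is missing subfolders: {missing}")
        else:
            print(f"ONNX_MLIR_HOME={home} contains bin and lib")
```

An empty result only means the folder layout looks plausible; it does not verify that the executables inside are usable.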

Reference backend

Outputs produced by onnx-mlir can be verified either against a reference ONNX backend or against reference inputs and outputs stored in protobuf files.

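To verify against a reference backend, install onnxruntime by running pip install onnxruntime. To use a different testing backend, replace the code that imports onnxruntime with another ONNX-compliant backend.
To verify against reference outputs, use --verify=ref --load-ref=data_folder, where data_folder is the path to a folder containing protobuf files for the inputs and outputs. This guide explains how to create protobuf files from numpy arrays.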

Usage

utils/RunONNXModel.py supports the following command-line options:

$ python ../utils/RunONNXModel.py  --help
usage: RunONNXModel.py [-h] [--log-to-file [LOG_TO_FILE]] [--model MODEL] [--compile-args COMPILE_ARGS] [--compile-only] [--compile-using-input-shape] [--print-input]
                       [--print-output] [--save-onnx PATH] [--verify {onnxruntime,ref}] [--verify-all-ops] [--verify-with-softmax] [--verify-every-value] [--rtol RTOL]
                       [--atol ATOL] [--save-so PATH | --load-so PATH] [--save-ref PATH] [--load-ref PATH | --shape-info SHAPE_INFO] [--lower-bound LOWER_BOUND]
                       [--upper-bound UPPER_BOUND]

optional arguments:
  -h, --help                  show this help message and exit
  --log-to-file [LOG_TO_FILE] Output compilation messages to file, default compilation.log
  --model MODEL               Path to an ONNX model (.onnx or .mlir)
  --compile-args COMPILE_ARGS Arguments passed directly to onnx-mlir command. See bin/onnx-mlir --help
  --compile-only              Only compile the input model
  --compile-using-input-shape Compile the model by using the shape info getting from the inputs in the reference folder set by --load-ref
  --print-input               Print out inputs
  --print-output              Print out inference outputs produced by onnx-mlir
  --save-onnx PATH            File path to save the onnx model. Only effective if --verify=onnxruntime
  --verify {onnxruntime,ref}  Verify the output by using onnxruntime or reference inputs/outputs. By default, no verification. When being enabled, --verify-with-softmax or --verify-every-value must be used to specify verification mode.
  --verify-all-ops            Verify all operation outputs when using onnxruntime
  --verify-with-softmax       Verify the result obtained by applying softmax to the output
  --verify-every-value        Verify every value of the output using atol and rtol
  --rtol RTOL                 Relative tolerance for verification
  --atol ATOL                 Absolute tolerance for verification
  --save-so PATH              File path to save the generated shared library of the model
  --load-so PATH              File path to load a generated shared library for inference, and the ONNX model will not be re-compiled
  --save-ref PATH             Path to a folder to save the inputs and outputs in protobuf
  --load-ref PATH             Path to a folder containing reference inputs and outputs stored in protobuf. If --verify=ref, inputs and outputs are reference data for verification
  --shape-info SHAPE_INFO     Shape for each dynamic input of the model, e.g. 0:1x10x20,1:7x5x3. Used to generate random inputs for the model if --load-ref is not set
  --lower-bound LOWER_BOUND   Lower bound values for each data type. Used inputs. E.g. --lower-bound=int64:-10,float32:-0.2,uint8:1. Supported types are bool, uint8, int8, uint16, int16, uint32, int32, uint64, int64,float16, float32, float64
  --upper-bound UPPER_BOUND   Upper bound values for each data type. Used to generate random inputs. E.g. --upper-bound=int64:10,float32:0.2,uint8:9. Supported types are bool, uint8, int8, uint16, int16, uint32, int32, uint64, int64, float16, float32, float64
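
To make the roles of --rtol and --atol concrete, here is a minimal sketch of an element-wise tolerance check. It uses the common criterion |actual - expected| <= atol + rtol * |expected|, the same form that numpy.isclose applies. That RunONNXModel.py uses exactly this formula is an assumption, and the function names are hypothetical.

```python
def within_tolerance(actual, expected, rtol, atol):
    """True if |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def first_mismatch(actual_values, expected_values, rtol, atol):
    """Return the index of the first value outside tolerance, or None."""
    for i, (a, e) in enumerate(zip(actual_values, expected_values)):
        if not within_tolerance(a, e, rtol, atol):
            return i
    return None


if __name__ == "__main__":
    onnx_mlir_out = [1.0, 2.0, 3.5]     # made-up values for illustration
    reference_out = [1.0, 2.001, 3.0]
    idx = first_mismatch(onnx_mlir_out, reference_out, rtol=0.01, atol=0.01)
    print("all values match" if idx is None else f"first mismatch at index {idx}")
```

With this criterion the relative term dominates for large reference values and the absolute term dominates near zero, which is why the two tolerances are usually set together.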

Helper script to compare a model under two distinct sets of compile options

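Built on top of utils/RunONNXModel.py, utils/CheckONNXModel.py runs a given model twice, under two distinct sets of compile options, and compares the results. This makes it easy to test a new option by comparing a safe version of the compiler (e.g. -O0 or -O3) with a more advanced version (e.g. -O3 or -O3 --march=x86-64). Specify the compile options with the --ref-compile-args and --test-compile-args flags and the model with the --model flag, and add --shape-info if the model has inputs with dynamic shapes. The full list of options is available under the --help flag.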
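
As a sketch of how such an invocation might be assembled, the helper below builds the command line from the four flags named above. The helper name, the model file model.onnx, the shape 0:1x10x20, and the assumption that the script is launched from the repository root are all illustrative. Compile arguments are attached with = so that a value starting with a dash is not mistaken for a separate option.

```python
import shlex


def check_model_command(model, ref_args, test_args, shape_info=None):
    """Build an argument list for utils/CheckONNXModel.py (hypothetical helper)."""
    cmd = [
        "python",
        "utils/CheckONNXModel.py",
        f"--model={model}",
        f"--ref-compile-args={ref_args}",
        f"--test-compile-args={test_args}",
    ]
    if shape_info is not None:
        # Only needed when the model has inputs with dynamic shapes.
        cmd.append(f"--shape-info={shape_info}")
    return cmd


if __name__ == "__main__":
    cmd = check_model_command("model.onnx", "-O0", "-O3", shape_info="0:1x10x20")
    # Print a shell-safe form of the command; run it with subprocess.run(cmd).
    print(shlex.join(cmd))
```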

Tracing the compiler steps

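To get deeper insight into what the compiler is doing, you may want to print the compiler's output after each transformation pass. The relevant flags are documented in detail in the MLIR documentation, and the utils/onnx-mlir-print.sh script adds them automatically. To use it, pass all the desired compiler options followed by one last argument naming the file in which to save the compiler output log.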

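Such a compiler log typically contains 10 to 100 passes, and a tool is available to isolate a given pass. Use utils/IsolatePass.py -m <log-file-name> -l to list the names of all the passes, then choose which pass's compiler output to investigate. For example, if you are interested in the lowering from ONNX to the Krnl dialect, add the -p "convert-onnx-to-krnl" option, where convert-onnx-to-krnl is the actual name of the pass. The -p option takes a regular expression that must match a pass listed by the -l option. Alternatively, the option -n 34 isolates the 34th pass listed by the -l option.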

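If you are interested in the pass just before or just after convert-onnx-to-krnl, add the option -a -1 or -a 1 respectively. The -a option also accepts a regular expression, in which case it prints the next pass that matches that expression.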

The full list of options is available under the --help flag.
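
The idea behind listing and isolating passes can be illustrated with a minimal sketch. This is not the implementation of utils/IsolatePass.py. It assumes that each dump in the log starts with a header line of the form "// -----// IR Dump After SomePass (pass-name) //----- //", as printed by MLIR's IR-printing flags; the function names are hypothetical.

```python
import re

# Assumed header format of each dump in the compiler log:
#   // -----// IR Dump After SomePass (pass-name) //----- //
HEADER = re.compile(r"^// -----// IR Dump (Before|After) (\S+)(?: \(([^)]*)\))?")


def list_passes(log_text):
    """Return (index, name) pairs for every dump header, numbered from 1."""
    names = []
    for line in log_text.splitlines():
        m = HEADER.match(line)
        if m:
            # Prefer the name in parentheses, fall back to the class name.
            names.append(m.group(3) or m.group(2))
    return list(enumerate(names, start=1))


def isolate_pass(log_text, pattern):
    """Return the text of the first dump whose name matches `pattern`, or None."""
    wanted = re.compile(pattern)
    collected = None
    for line in log_text.splitlines():
        m = HEADER.match(line)
        if m:
            if collected is not None:
                break  # reached the next dump, so the wanted one is complete
            if wanted.search(m.group(3) or m.group(2)):
                collected = [line]
        elif collected is not None:
            collected.append(line)
    return None if collected is None else "\n".join(collected)
```

Selecting a pass by position, or stepping to a neighbouring pass, can be built on the same list of headers by indexing into the result of list_passes.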

Debugging the code generated for an operator

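If you know or suspect that a particular ONNX-MLIR operator produces an incorrect result and want to narrow down the problem, several Krnl operations are available that print, at run time, the value of a tensor or of a value with a primitive data type.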

To print the value of a tensor at a particular program point, inject the following code (where X is the tensor to print):

create.krnl.printTensor("Tensor X: ", X);

Note: currently the content of the tensor is printed only when the tensor rank is less than four.

To print a message followed by a single value, inject the following code (where val is the value to print and valType is its type):

create.krnl.printf("inputElem: ", val, valType);

Finding memory errors

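If you know or suspect that an onnx-mlir-compiled inference executable has a problem related to memory allocation, the valgrind framework or the mtrace memory tool can help with debugging. These tools trace the memory allocation and deallocation APIs and can detect memory issues such as memory leaks.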

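However, if the problem is related to memory access, in particular a buffer overrun, it is notoriously difficult to debug because the run-time error occurs away from the point where the problem originates. The Electric Fence library can be used to debug such problems. It helps detect two common programming errors: software that overruns the boundaries of a malloc() memory allocation, and software that touches a memory allocation that has already been released by free(). Unlike other memory debuggers, Electric Fence detects read accesses as well as writes, and it pinpoints the exact instruction that causes the error.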

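Because the Electric Fence library is not officially supported by RedHat, you need to download, build, and install it from source yourself. Once it is installed, link the library by passing the "-lefence" option when generating the inference executable. Then simply run the executable: it will raise a run-time error and stop at the place that causes the memory access problem. You can identify that place with a debugger or with the debugging print functions described in the previous section.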
