kaori
A C++ library for barcode extraction and matching
kaori is a header-only C++ library for counting the frequency of barcodes in FASTQ files. We support a variety of barcode designs, including single, combinatorial and dual barcodes, in either single- or paired-end data. Users can specify a maximum number of mismatches for identification of the target sequence (i.e., across both the constant and variable regions). Gzipped FASTQ files can be processed, provided Zlib is available.
kaori is a header-only library, so it can be used by simply #include-ing the relevant source files:
Check out the reference documentation for more details.
We define the "barcoding element" as the full sequence to be matched by kaori. The barcoding element is parametrized as a template that consists of constant and variable regions. Each variable region is associated with a pool of possible barcode sequences; the template can be realized into a specific sequence of a barcoding element by replacing its variable regions with valid barcode sequences.
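To make the idea of realizing a template concrete, here is a minimal standalone sketch. It does not use kaori's API: the function name realize_template is hypothetical, and marking each variable-region position with a '-' character is an assumption made for this illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of kaori: fills each run of '-' in the
// template with the next barcode, in order of appearance.
std::string realize_template(const std::string& tmpl, const std::vector<std::string>& barcodes) {
    std::string out;
    std::size_t next = 0; // index of the next barcode to consume
    std::size_t i = 0;
    while (i < tmpl.size()) {
        if (tmpl[i] != '-') {
            out += tmpl[i]; // constant region: copy as-is
            ++i;
            continue;
        }
        // Measure the length of this variable region.
        std::size_t start = i;
        while (i < tmpl.size() && tmpl[i] == '-') {
            ++i;
        }
        if (next >= barcodes.size() || barcodes[next].size() != i - start) {
            throw std::invalid_argument("barcode does not fit the variable region");
        }
        out += barcodes[next];
        ++next;
    }
    if (next != barcodes.size()) {
        throw std::invalid_argument("more barcodes than variable regions");
    }
    return out;
}
```

Under these assumptions, a template with two variable regions would take one barcode from each of two pools, which corresponds to a combinatorial design.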
Given a template sequence and the associated barcode pools, kaori will scan each read for the template using bitwise comparisons. If a suitable match to the template is found, the sequence of the read at each variable region is extracted and searched against the pool of known barcodes. Imperfect barcode matches are identified through a trie-based search; the total number of mismatches is summed across the constant and variable regions. kaori also caches information about imperfect matches to avoid a redundant look-up when the same sequence is encountered in later reads.
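The scan-then-search logic can be sketched in a few lines of standalone C++. This is an illustration under stated assumptions, not kaori's implementation: it uses plain character comparisons instead of bitwise comparisons, a linear search over the pool instead of a trie, a single variable region marked by '-', and it caches every search result rather than only the imperfect ones. The names BarcodeHit, LinearBarcodeSearch and scan_read are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct BarcodeHit {
    int index = -1;     // position in the pool, or -1 if no unambiguous match
    int mismatches = 0; // mismatches to the matched barcode
};

// Hypothetical stand-in for a trie-based search: a linear scan with a cache.
class LinearBarcodeSearch {
public:
    LinearBarcodeSearch(std::vector<std::string> pool, int max_mismatches)
        : pool_(std::move(pool)), max_mm_(max_mismatches) {}

    BarcodeHit find(const std::string& seq) {
        auto it = cache_.find(seq);
        if (it != cache_.end()) return it->second; // seen in an earlier read

        BarcodeHit best;
        int best_mm = max_mm_ + 1;
        bool tied = false;
        for (std::size_t b = 0; b < pool_.size(); ++b) {
            if (pool_[b].size() != seq.size()) continue;
            int mm = 0;
            for (std::size_t i = 0; i < seq.size(); ++i) mm += (pool_[b][i] != seq[i]);
            if (mm < best_mm) {
                best_mm = mm;
                best.index = static_cast<int>(b);
                best.mismatches = mm;
                tied = false;
            } else if (mm == best_mm) {
                tied = true; // two barcodes are equally close
            }
        }
        if (tied) best = BarcodeHit{}; // ambiguous matches are rejected
        cache_[seq] = best;
        return best;
    }

private:
    std::vector<std::string> pool_;
    int max_mm_;
    std::unordered_map<std::string, BarcodeHit> cache_;
};

// Returns the pool index of the first match in the read, or -1 if none.
// Mismatches are summed across the constant and variable regions.
int scan_read(const std::string& read, const std::string& tmpl,
              LinearBarcodeSearch& search, int max_mismatches) {
    std::size_t var_start = tmpl.find('-');
    if (var_start == std::string::npos || read.size() < tmpl.size()) return -1;
    std::size_t var_len = tmpl.find_last_of('-') + 1 - var_start;

    for (std::size_t pos = 0; pos + tmpl.size() <= read.size(); ++pos) {
        int mm = 0;
        for (std::size_t i = 0; i < tmpl.size(); ++i) {
            if (tmpl[i] != '-' && tmpl[i] != read[pos + i]) ++mm;
        }
        if (mm > max_mismatches) continue; // constant regions already over budget

        BarcodeHit hit = search.find(read.substr(pos + var_start, var_len));
        if (hit.index >= 0 && mm + hit.mismatches <= max_mismatches) return hit.index;
    }
    return -1;
}
```

Because the sketch only ever compares positions at fixed offsets, a read with an inserted or deleted base inside the barcoding element will not match, which mirrors the limitation on indels.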
Our approach is fast and relatively easy to implement compared to full-blown sequence aligners. Any number of mismatches is supported, and the framework can be easily adapted to new barcoding configurations. The downside is that indels are not supported in the search process. We consider this limitation to be acceptable, as indels are quite rare in (Illumina) sequencing data.
Each barcoding configuration is processed by a different kaori handler. The code shown above for SingleBarcodeSingleEnd can be re-used for different handlers:
The library exports a number of utilities to easily construct a new handler - see the process_data.hpp documentation for the handler expectations. This can be used to quickly extend kaori to handle other configurations. If you have a configuration that is not supported here, create an issue and we'll see what we can do. (Or even better, a pull request.)
The bitwise comparison for the constant template requires a compile-time specification of the maximum template length. In our applications, we use templating to dispatch across a set of possible template lengths. This improves efficiency for shorter templates while still retaining support for larger templates. For example:
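A minimal standalone sketch of this dispatch pattern follows. It does not use kaori's classes: count_constant_bases and dispatch are hypothetical names, the set of supported lengths (32, 64, 128, 256) is chosen for illustration, and the encoding of 4 bits per base with '-' marking variable positions is an assumption of this sketch.

```cpp
#include <bitset>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a handler whose capacity is fixed at compile time.
// The bitset size must be known when compiling, hence the template parameter.
template<std::size_t max_length>
std::size_t count_constant_bases(const std::string& tmpl) {
    std::bitset<4 * max_length> mask; // assumed encoding: 4 bits per base
    for (std::size_t i = 0; i < tmpl.size(); ++i) {
        if (tmpl[i] != '-') {
            // Mark all 4 bits of each constant position.
            for (std::size_t k = 0; k < 4; ++k) mask.set(4 * i + k);
        }
    }
    return mask.count() / 4;
}

// Picks the smallest supported capacity that fits the template at run time.
std::size_t dispatch(const std::string& tmpl) {
    const std::size_t len = tmpl.size();
    if (len <= 32) {
        return count_constant_bases<32>(tmpl);
    } else if (len <= 64) {
        return count_constant_bases<64>(tmpl);
    } else if (len <= 128) {
        return count_constant_bases<128>(tmpl);
    } else if (len <= 256) {
        return count_constant_bases<256>(tmpl);
    }
    throw std::runtime_error("template is longer than the maximum supported length");
}
```

Each branch instantiates the function with a different bitset size, so short templates operate on small bitsets while longer templates are still accepted, at the cost of compiling one instantiation per supported length.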
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to kaori to make the headers available during compilation:
find_package()
You can install the library by cloning a suitable version of this repository and running the following commands:
Then you can use find_package() as usual:
If you're not using CMake, the simple approach is to just copy the files - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. You'll also need to make the byteme header-only library available on the include path and link to Zlib.