GerHobbelt/owemdjee

owemdjee

Data Science & Image Processing amalgam library in C/C++.

This place is a gathering spot & integration workplace for the C & C++ libraries we choose to use. Think "Façade Pattern" and you're getting warm. 😉 The heavy data lifting will be done in the referenced libraries, while this lib will provide some glue and common ground for them to work in/with.

Reason for this repo

git submodules haven't been the most, ah, "user-friendly" method to track and manage a set of libraries that you wish to track at source level.

A few problems have been repeatedly observed over our lifetime with git:

  • when it so happens that the importance & interest in a submoduled library is perhaps waning and you want to migrate to another, you can of course invoke git to ditch the old sow and bring in the shiny new one, but that stuff gets quite finicky when you are pedalling back & forth through your commit tree when, e.g. bughunting or maintenance work on a release branch which isn't up to snuff with the fashion kids yet.

    Yup, that's been much less of a problem since about 2018, but old scars need more than a pat on the arm to heal, if you get my drift.

  • folks haven't always been the happy campers they were supposed to be when they're facing a set of submodules and want to feel safe and sure in their "knowledge" that each library X is at commit Y, when the top of the module tree is itself at commit Z, for we are busy producing a production release, perhaps? That's a wee bit stressful and there have been enough "flukes" with git to make that a not-so-ironclad-as-we-would-like position.

    Over time, I've created several bash shell scripts to help with that buzzin' feelin' of absolute certainty. Useful perhaps, but the cuteness of those wears off pretty darn quickly when many nodes in the submodule tree start cluttering their git repo with those.

And?

This repo is made to ensure we have a single point of reference for all the data munching stuff, at least.

We don't need to git submodule add all those data processing libs in our applications this way, as this is a single submodule to bother that project with. The scripts and other material in here will provide the means to ensure your build and test tools can quickly and easily ensure that everyone in here is at the commit spot they're supposed to be.
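
As a rough illustration of the "is everyone at the commit spot they're supposed to be?" check: `git submodule status` prints one line per submodule, prefixed with a space when the checked-out commit matches the recorded one, `-` when the submodule is not initialized, `+` when a different commit is checked out and `U` on merge conflicts. The sketch below parses such output from a string. The names `SubmoduleState`, `parse_submodule_status` and `out_of_sync` are made up for this example and are not part of the scripts in this repo; it also assumes submodule paths without spaces.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One parsed line of `git submodule status` output (hypothetical helper type).
struct SubmoduleState {
    char flag;           // ' ' in sync, '-' not initialized, '+' other commit, 'U' conflict
    std::string commit;  // commit hash as printed by git
    std::string path;    // submodule path (assumed to contain no spaces)
};

// Parse the text printed by `git submodule status`; malformed lines are skipped.
std::vector<SubmoduleState> parse_submodule_status(const std::string& text) {
    std::vector<SubmoduleState> result;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.size() < 3) continue;
        SubmoduleState s;
        s.flag = line[0];
        std::istringstream fields(line.substr(1));
        fields >> s.commit >> s.path;  // a trailing "(describe)" part is ignored
        if (!s.commit.empty() && !s.path.empty()) result.push_back(s);
    }
    return result;
}

// Collect the paths of all submodules that are not at their recorded commit.
std::vector<std::string> out_of_sync(const std::vector<SubmoduleState>& states) {
    std::vector<std::string> paths;
    for (const SubmoduleState& s : states) {
        assert(!s.path.empty());  // guaranteed by parse_submodule_status
        if (s.flag != ' ') paths.push_back(s.path);
    }
    return paths;
}
```

A build or test tool could feed the captured command output into `parse_submodule_status` and refuse to continue when `out_of_sync` returns a non-empty list.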

And when we want to add another lib about data/image processing, we do that in here, so the application-level git repo sees a very stable singular submodule all the time: this repo/lib, not the stuff that will change over time as external libs gain and lose momentum. (We're talking multiyear timespans here!)

Critique?

It's not the most brilliant solution to our problems, as this, of course, becomes a single point of failure that way, but experience in the past with similar "solutions" has shown that it's maybe not always fun, but at least we keep track of the management crap in one place and that was worth it, every time.

And why not do away with git submodule entirely and use packages instead? Because this stuff is important enough that other, quite painful experience has shown us that (binary & source) packages are a wonder and a hassle too: I'd rather have my code tracked and tagged at source level all the way because that has reduced several bug situations from man-weeks to man-hours: like Gentoo, compile it all, one compiler only. Doesn't matter if the bug is in your own code or elsewhere, there are enough moments like that where one is helped enormously by the ability to step through and possibly tweak a bit of code here or there temporarily to help the debugging process that I, at least, prefer full source code.

And that's what this repo is here to provide: the source code gathered and ready for use on our machines.

Why is this repo a solution? And does it scale?

The worst bit first: it scales like rotten eggs. The problem there is two-fold: first, there are (relatively) few people who want to track progress at the bleeding edge, so tooling is consequently limited in power and availability, compared to conservative approaches (count the number of package managers lately?).

Meanwhile, I'm in a spot where I want to ride the bleeding edge, at least most of the time, and I happen to like it that way: my world is much more R&D than product maintenance, so having a means to track, relatively easily, the latest developments in subjects and materiel of interest is a boon to me. Sure, I'll moan and rant about it once in a while, but if I wanted to really get rid of the need to be flexible and adapt to changes, sometimes often, I'd have gone with the conservative stability of package managers and LTS releases already. Which I've done for other parts of my environment, but do not intend to do for the part which is largely covered by this repo: source libraries which I intend to use or am using already in research tools I'm developing for others and myself.

For that purpose, this repo is a solution, though -- granted -- a sub-optimal one in that it doesn't scale very well. I don't think there's any automated process available to make this significantly faster and more scalable anyway: the fact that I'm riding the bleeding edge and wish to be able to backpedal at will when the latest change of direction or state of affairs of a component is off the rails (from my perspective at least), requires me to be flexible and adaptable to the full gamut of change. There are alternative approaches, also within the git world, but they haven't shown real appeal vs. old skool git submodules -- which is cranky at times and a pain in the neck when you want to ditch something but still need it in another dev branch, moan moan moan, but anyway... -- so here we are.

Side note: submodules which have been picked up for experimentation and inspection but have been deleted from this A list later on are struck through in the overview below: the rationale there is that we can thus still observe why we struck it off the list, plus never make the mistake of re-introducing it after a long time, forgetting that we once had a look already, without running into the struck-through entry and having to re-evaluate the reason at least, before we re-introduce an item.


Intent

Inter-process communications (IPC)

Lowest possible run-time cost, a.k.a. "run-time overhead": the aim is to have IPC which does not noticeably impact UX (User Experience of the application: responsiveness / UI) on reasonably powered machines. (Users are not expected to have the latest or fastest hardware.)

As at least large images will be transferred (PDF page renders) we need to have a binary-capable protocol.
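
To make "binary-capable" concrete, here is a minimal sketch of length-prefixed framing, one common way to ship raw bytes (such as a rendered page image) without escaping or text encoding. The frame layout, the `Frame` struct and the function names are assumptions made up for this example; they are not the protocol this project has settled on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical frame layout (an assumption for illustration only):
//   4 bytes  payload length, little-endian
//   1 byte   message type
//   N bytes  raw payload, e.g. the pixels of a rendered PDF page
struct Frame {
    std::uint8_t type;
    std::vector<std::uint8_t> payload;
};

constexpr std::size_t kHeaderSize = 5;

// Serialize a frame into one contiguous byte buffer.
std::vector<std::uint8_t> encode_frame(const Frame& f) {
    assert(f.payload.size() <= 0xFFFFFFFFu);  // length must fit in 32 bits
    const std::uint32_t len = static_cast<std::uint32_t>(f.payload.size());
    std::vector<std::uint8_t> out;
    out.reserve(kHeaderSize + f.payload.size());
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>((len >> shift) & 0xFF));
    out.push_back(f.type);
    out.insert(out.end(), f.payload.begin(), f.payload.end());
    return out;
}

// Try to decode one frame from the front of `buf`.
// Returns false when the buffer does not hold a complete frame yet.
bool decode_frame(const std::vector<std::uint8_t>& buf, Frame& out) {
    if (buf.size() < kHeaderSize) return false;
    std::uint32_t len = 0;
    for (int i = 0; i < 4; ++i)
        len |= static_cast<std::uint32_t>(buf[i]) << (8 * i);
    if (buf.size() - kHeaderSize < len) return false;
    out.type = buf[4];
    out.payload.assign(buf.begin() + kHeaderSize,
                       buf.begin() + kHeaderSize + len);
    return true;
}
```

Because the payload length is known up front, the receiver can read large images straight into a buffer; zero bytes and other non-text values in the payload need no special treatment.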

Programming Languages used: intent and purposes

We expect to use these languages in processes which require this type of IPC:

  • C / C++ (backend No.1)

    • PDF renderer (mupdf)
    • metadata & annotations extractor (mupdf et al)
    • very probably also the database interface (SQLite)
    • [page] image processing (leptonica, openCV, ImageMagick?, what-ever turns out to be useful and reasonable to integrate, particularly between PDF page renderer and OCR engine, to help us provide a user-tunable PDF text+metadata extractor)
    • OCR (tesseract)
    • "A.I."-assisted tooling to help process and clean PDFs: cover pages, abstract/summary extraction for meta-research, etc. (think ngrams, xdelta, SVM, tensors, author identification, document categorization, document similarity / [near-]duplicate / revision detection, tagging, ...)
    • document identifier key generator a.k.a. content hasher for creating unique key for each document, which can be used as database record index, etc.
      • old: Qiqqa SHA1B
      • new: BLAKE3+Base36
  • C# ("business logic" / "middleware": the glue logic)

  • Java (SOLR / Lucene: our choice for the "full text search database" ~ backend No.2)

  • JavaScript (UI, mostly. Think electron, web browser, Chromely, WebView2, that sort of thing)

For SOLR we intend to use the regular SOLR APIs, which do not require specialized binary IPC.

We will probably choose a web-centric UI approach where images are compressed and cached in the backend, while being provided as <picture> or <img> tag references (URLs) in the HTML generated by the backend. However, we keep our options open ATM as further testing is expected to hit a few obstacles there (smart caching required as we will be processing lots of documents in "background bulk processes" alongside the browsing and other more direct user activity) so a websocket or similar push technology may be employed: there we may benefit from dedicated IPC for large binary and text data transfers.
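
The "BLAKE3+Base36" document key listed under the C / C++ backend above combines a content hash with a compact text encoding. The sketch below shows only the Base36 half: it treats the digest bytes as one big-endian unsigned integer and converts it by repeated division. The lowercase alphabet, the dropping of leading zero bytes and the function name `base36_encode` are assumptions for illustration; the actual Qiqqa key format may differ, and the BLAKE3 digest itself would come from a BLAKE3 library, which is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode a byte string, read as one big-endian unsigned integer, in Base36.
// The byte vector is taken by value because it is used as scratch space.
std::string base36_encode(std::vector<std::uint8_t> bytes) {
    static const char kAlphabet[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    std::string out;
    std::size_t start = 0;
    while (start < bytes.size() && bytes[start] == 0) ++start;  // skip leading zeros
    while (start < bytes.size()) {
        // Long division of the remaining number by 36, one byte at a time.
        unsigned remainder = 0;
        for (std::size_t i = start; i < bytes.size(); ++i) {
            const unsigned acc = remainder * 256u + bytes[i];
            bytes[i] = static_cast<std::uint8_t>(acc / 36u);
            remainder = acc % 36u;
        }
        assert(remainder < 36u);
        out.push_back(kAlphabet[remainder]);  // least significant digit first
        while (start < bytes.size() && bytes[start] == 0) ++start;
    }
    if (out.empty()) out = "0";
    std::reverse(out.begin(), out.end());
    return out;
}
```

For use as a database record index a fixed-width key is usually preferable, so a real implementation would probably left-pad the result to a constant length.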

Scripting the System: Languages Considered for Scripting by Users

Python has been considered. Given its loud presence in the AI communities, we still may integrate it one day. However, personally I'm not a big fan of the language and don't use it unless it's prudent to do, e.g. when extending or tweaking previous works produced by others. Also, it turns out, it's not exactly easy to integrate (CPython) and I don't see a need for it beyond this one project / product: Qiqqa.

I've looked at Lua for a scripting language suitable for users (used quite a lot in the gaming industries and elsewhere); initial trials to get something going did not uncover major obstacles, but the question "how do I debug Lua scripts?" does not produce any viable project / product that goes beyond the old skool printf-style debugging method. Not a prime candidate therefore, as we expect that users will pick this up, when they like it, and grow the user scripts to unanticipated size and complexity: I've seen this happen multiple times in my career. Lua does not provide a scalable growth path from my perspective due to the lack of a decent, customizable debugger.

Third candidate is JavaScript. While Artifex/mupdf comes with mujs, which is a simple engine, it suffers from two drawbacks: it's ES5 and it does not provide a debugger mechanism beyond old skool print. Nice for nerds, but this is user-facing and thus not a viable option.

The other JavaScript engines considered are of varying size, performance and complexity. Some of them offer ways to integrate them with the [F12] Chrome browser Developer Tools debugger, which would be very nice to have available. The road traveled there, along the various JavaScript engines is this:

  • cel-cpp 📁 🌐 -- C++ Implementations of the Common Expression Language. For background on the Common Expression Language see the cel-spec repo. Common Expression Language specification: the Common Expression Language (CEL) implements common semantics for expression evaluation, enabling different applications to more easily interoperate. Key Applications are (1) Security policy: organizations have complex infrastructure and need common tooling to reason about the system as a whole and (2) Protocols: expressions are a useful data type and require interoperability across programming languages and platforms.

  • cel-spec 📁 🌐 -- Common Expression Language specification: the Common Expression Language (CEL) implements common semantics for expression evaluation, enabling different applications to more easily interoperate. Key Applications are (1) Security policy: organizations have complex infrastructure and need common tooling to reason about the system as a whole and (2) Protocols: expressions are a useful data type and require interoperability across programming languages and platforms.

  • chibi-scheme 📁 🌐 -- Chibi-Scheme is a very small library intended for use as an extension and scripting language in C programs. In addition to support for lightweight VM-based threads, each VM itself runs in an isolated heap allowing multiple VMs to run simultaneously in different OS threads.

  • cppdap 📁 🌐 -- a C++11 library ("SDK") implementation of the Debug Adapter Protocol, providing an API for implementing a DAP client or server. cppdap provides C++ type-safe structures for the full DAP specification, and provides a simple way to add custom protocol messages.

  • cpython 📁 🌐 -- Python version 3. Note: Building a complete Python installation requires the use of various additional third-party libraries, depending on your build platform and configure options. Not all standard library modules are buildable or useable on all platforms.

  • duktape 📁 🌐 -- Duktape is an embeddable Javascript engine, with a focus on portability and compact footprint. Duktape is ECMAScript E5/E5.1 compliant, with some semantics updated from ES2015+, with partial support for ECMAScript 2015 (E6) and ECMAScript 2016 (E7), ES2015 TypedArray, Node.js Buffer bindings and comes with a built-in debugger.

  • ECMA262 📁 🌐 -- ECMAScript :: the source for the current draft of ECMA-262, the ECMAScript® Language Specification.

  • exprtk 📁 🌐 -- C++ Mathematical Expression Toolkit Library is a simple to use, easy to integrate and extremely efficient run-time mathematical expression parsing and evaluation engine. The parsing engine supports numerous forms of functional and logic processing semantics and is easily extensible.

  • guile 📁 🌐 -- Guile is Project GNU's extension language library. Guile is an implementation of the Scheme programming language, packaged as a library that can be linked into applications to give them their own extension language. Guile supports other languages as well, giving users of Guile-based applications a choice of languages.

  • harbour-core 📁 🌐 -- Harbour is the free software implementation of a multi-platform, multi-threading, object-oriented, scriptable programming language, backward compatible with Clipper/xBase. Harbour consists of a compiler and runtime libraries with multiple UI and database backends, its own make system and a large collection of libraries and interfaces to many popular APIs.

  • itcl 📁 🌐 -- Itcl is an object oriented extension for Tcl.

  • janet 📁 🌐 -- Janet is a (lispy) programming language for system scripting, expressive automation, and extending programs written in C or C++ with user scripting capabilities. Janet makes a good system scripting language, or a language to embed in other programs. It's like Lua and GNU Guile in that regard. It has more built-in functionality and a richer core language than Lua, but smaller than GNU Guile or Python. However, it is much easier to embed and port than Python or Guile.

  • jerryscript 📁 🌐 -- JerryScript is a lightweight JavaScript engine for resource-constrained devices such as microcontrollers. It can run on devices with less than 64 KB of RAM and less than 200 KB of flash memory.

    Key characteristics of JerryScript:

    • Full ECMAScript 5.1 standard compliance
    • 160K binary size when compiled for ARM Thumb-2
    • Heavily optimized for low memory consumption
    • Written in C99 for maximum portability
    • Snapshot support for precompiling JavaScript source code to byte code
    • Mature C API, easy to embed in applications

    Additional information can be found at the project page and Wiki.

  • jimtcl 📁 🌐 -- the Jim Interpreter is a small-footprint implementation of the Tcl programming language written from scratch. Currently Jim Tcl is very feature complete with an extensive test suite (see the tests directory). There are some Tcl commands and features which are not implemented (and likely never will be), including traces and Tk. However, Jim Tcl offers a number of both Tcl8.5 and Tcl8.6 features ({*}, dict, lassign, tailcall and optional UTF-8 support) and some unique features. These unique features include [lambda] with garbage collection, a general GC/references system, arrays as syntax sugar for [dict]tionaries, object-based I/O and more. Other common features of the Tcl programming language are present, like the "everything is a string" behaviour, implemented internally as dual ported objects to ensure that the execution time does not reflect the semantic of the language :)

  • miniscript 📁 🌐 -- the MiniScript scripting language.

  • mujs 📁 🌐 -- a lightweight ES5 Javascript interpreter designed for embedding in other software to extend them with scripting capabilities.

  • newlisp 📁 🌐 -- newLISP is a LISP-like scripting language for doing things you typically do with scripting languages: programming for the internet, system administration, text processing, gluing other programs together, etc. newLISP is a scripting LISP for people who are fascinated by LISP's beauty and power of expression, but who need it stripped down to easy-to-learn essentials. newLISP is LISP reborn as a scripting language: pragmatic and casual, simple to learn without requiring you to know advanced computer science concepts. Like any good scripting language, newLISP is quick to get into and gets the job done without fuss. newLISP has a very fast startup time, is small on resources like disk space and memory and has a deep, practical API with functions for networking, statistics, machine learning, regular expressions, multiprocessing and distributed computing built right into it, not added as a second thought in external modules.

  • owl 📁 🌐 -- Owl Lisp is a functional dialect of the Scheme programming language. It is mainly based on the applicative subset of the R7RS standard.

  • picoc 📁 🌐 -- PicoC is a very small C interpreter for scripting. It was originally written as a script language for a UAV's on-board flight system. It's also very suitable for other robotic, embedded and non-embedded applications. The core C source code is around 3500 lines of code. It's not intended to be a complete implementation of ISO C but it has all the essentials.

  • QuickJS 📁 🌐 -- a small and embeddable Javascript engine. It supports the ES2020 specification including modules, asynchronous generators, proxies and BigInt. It optionally supports mathematical extensions such as big decimal floating point numbers (BigDecimal), big binary floating point numbers (BigFloat) and operator overloading.

    • libbf 📁 🌐 -- a small library to handle arbitrary precision binary or decimal floating point numbers
    • QuickJS-C++-Wrapper 📁 🌐 -- quickjscpp is a header-only wrapper around the quickjs JavaScript engine, which allows easy integration into C++11 code. This wrapper also automatically tracks the lifetime of values and objects, is exception-safe, and automates clean-up.
    • QuickJS-C++-Wrapper2 📁 🌐 -- QuickJSPP is QuickJS wrapper for C++. It allows you to easily embed Javascript engine into your program.
    • txiki 📁 🌐 -- uses QuickJS as its kernel
  • sbcl 📁 🌐 -- SBCL is an implementation of ANSI Common Lisp, featuring a high-performance native compiler, native threads on several platforms, a socket interface, a source-level debugger, a statistical profiler, and much more.

  • ScriptX 📁 🌐 -- Tencent's ScriptX is a script engine abstraction layer. A variety of script engines are encapsulated on the bottom and a unified API is exposed on the top, so that the upper-layer caller can completely isolate the underlying engine implementation (back-end).

    ScriptX not only isolates several JavaScript engines (e.g. V8 and QuickJS), but can even isolate different scripting languages, so that the upper layer can seamlessly switch between scripting engine and scripting language without changing the code.

  • tcl 📁 🌐 -- the latest Tcl source distribution. Tcl provides a powerful platform for creating integration applications that tie together diverse applications, protocols, devices, and frameworks.

  • tclclockmod 📁 🌐 -- TclClockMod is the fastest, most powerful Tcl clock engine written in C. This Tcl clock extension is the faster Tcl-module for the replacement of the standard "clock" ensemble of tcl.

  • txiki 📁 🌐 -- a small and powerful JavaScript runtime. It's built on the shoulders of giants: it uses [QuickJS] as its JavaScript engine, [libuv] as the platform layer, [wasm3] as the WebAssembly engine and [curl] as the HTTP / WebSocket client.

  • VisualScriptEngine 📁 🌐 -- A visual scripting engine designed for embedding. The engine is written in modern C++ and compiles on several platforms with no external dependencies.

  • wxVisualScriptEngine 📁 🌐 -- a utility module for VisualScriptEngine which provides helper classes for embedding the engine in a wxWidgets application.

  • Facebook's Hermes, Samsung's Escargot and XS/moddable (also here), which led me to a webpage where various embeddable JS engines are compared size- and performance-wise.

  • Google's V8 (here too), as available in NodeJS, is deemed too complex for integration: when we go there, we could spend the same amount of effort on CPython integration -- though there again is the ever-present "how to debug this visually?!" question...

  • JerryScript: ES2017/2020 (good!), there are noises about Chrome Developer Tools on the Net for this one. Small, designed for embedded devices. I like that.

  • mujs: ES5, no visual debugger. Out.

  • QuickJS: ES2020, DevTools or VS Code debugging seems to be available. Also comes with an interesting runtime: txiki, which we still need to take a good look at.

UPDATE 2021/June: JerryScript, duktape, XS/moddable, escargot: these have been dropped as we picked QuickJS. After some initial hassle with that codebase, we picked a different branch to test, which was cleaner and compiled out of the box (CMake > MSVC), which is always a good omen for a codebase when you have cross-platform portability in mind.


Libraries we're looking at for this intent:

IPC: flatbuffer et al for protocol design

  • arrow 📁 🌐 -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • avro 📁 🌐 -- Apache Avro™ is a data serialization system.

  • bebop 📁 🌐 -- an extremely simple, fast, efficient, cross-platform serialization format. Bebop is a schema-based binary serialization technology, similar to Protocol Buffers or MessagePack. In particular, Bebop tries to be a good fit for client–server or distributed web apps that need something faster, more concise, and more type-safe than JSON or MessagePack, while also avoiding some of the complexity of Protocol Buffers, FlatBuffers and the like.

  • bitsery 📁 🌐 -- header only C++ binary serialization library, designed around the networking requirements for real-time data delivery, especially for games. All cross-platform requirements are enforced at compile time, so serialized data do not store any meta-data information and is as small as possible.

  • capnproto 📁 🌐 -- Cap'n Proto is an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster.

  • cereal 📁 🌐 -- C++11 serialization library

  • flatbuffers 📁 🌐 -- a cross platform serialization library architected for maximum memory efficiency. It allows you to directly access serialized data without parsing/unpacking it first, while still having great forwards/backwards compatibility.

  • GoldFish-CBOR 📁 🌐 -- a fast JSON and CBOR streaming library, without using memory. GoldFish can parse and generate very large JSON or CBOR documents. It has some similarities to a SAX parser, but doesn't use an event driven API, instead the user of the GoldFish interface is in control. GoldFish intends to be the easiest and one of the fastest JSON and CBOR streaming parser and serializer to use.

  • ion-c 📁 🌐 -- a C implementation of the Ion data notation. Amazon Ion is a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of data which can survive multiple generations of software evolution.

  • libbson 📁 🌐 -- a library providing useful routines related to building, parsing, and iterating BSON documents.

  • libnop 📁 🌐 -- libnop (C++ Native Object Protocols) is a header-only library for serializing and deserializing C++ data types without external code generators or runtime support libraries. The only mandatory requirement is a compiler that supports the C++14 standard.

  • libsmile 📁 🌐 -- C implementation of the Smile binary format (https://github.com/FasterXML/smile-format-specification).

    • discouraged; reason: for binary format record serialization we will be using bebop or reflect-cpp exclusively. All other communications will be JSON/JSON5/XML based.
  • mosquitto 📁 🌐 -- Eclipse Mosquitto is an open source implementation of a server for version 5.0, 3.1.1, and 3.1 of the MQTT protocol. It also includes a C and C++ client library, and the mosquitto_pub and mosquitto_sub utilities for publishing and subscribing.

  • msgpack-c 📁 🌐 -- MessagePack (a.k.a. msgpack) for C/C++ is an efficient binary serialization format, which lets you exchange data among multiple languages like JSON, except that it's faster and smaller. Small integers are encoded into a single byte and short strings require only one extra byte in addition to the strings themselves.

  • msgpack-cpp 📁 🌐 -- msgpack for C++: MessagePack is an efficient binary serialization format, which lets you exchange data among multiple languages like JSON, except that it's faster and smaller. Small integers are encoded into a single byte and short strings require only one extra byte in addition to the strings themselves.

  • protobuf 📁 🌐 -- Protocol Buffers - Google's data interchange format that is a language-neutral, platform-neutral, extensible mechanism for serializing structured data.

    • ☹discouraged🤧; reason: relatively slow run-time and (in my opinion) rather ugly & convoluted approach at build time. Has too much of a Java/CorporateProgramming smell, which has not lessened over the years, unfortunately.
  • reflect 📁 🌐 -- a C++20 Static Reflection library with optimized run-time execution and binary size, fast compilation times and platform agnostic, minimal API. The library only provides basic reflection primitives and is not a full-fledged, heavy, implementation for https://wg21.link/P2996 which is a language proposal with many more features and capabilities.

  • reflect-cpp 📁 🌐 -- a C++-20 library for fast serialization, deserialization and validation using reflection, similar to pydantic in Python, serde in Rust, encoding in Go or aeson in Haskell. As the aforementioned libraries are among the most widely used in the respective languages, reflect-cpp fills an important gap in C++ development. It reduces boilerplate code and increases code safety.

  • serde-cpp 📁 🌐 -- serialization framework for C++17, inspired by Rust serde project.

  • serdepp 📁 🌐 -- a C++17 low cost serialize deserialize adaptor library like Rust serde project.

  • swig 📁 🌐 -- SWIG (Simplified Wrapper and Interface Generator) is a software development tool (code generator) that connects programs written in C and C++ with a variety of high-level programming languages. It is used for building scripting language interfaces to C and C++ programs. SWIG simplifies development by largely automating the task of scripting language integration, allowing developers and users to focus on more important problems.

    SWIG 🌐 was not considered initially; more suitable for RPC than what we have in mind, which is purely data messages exchange. MAY be of use for transitional applications which are mixed-(programming-)language based, e.g. where we want to mix C/C++ and C# in a single Test Application.

  • thrift 📁 🌐 -- Apache Thrift is a lightweight, language-independent software stack for point-to-point RPC implementation. Thrift provides clean abstractions and implementations for data transport, data serialization, and application level processing. The code generation system takes a simple definition language as input and generates code across programming languages that uses the abstracted stack to build interoperable RPC clients and servers.

  • velocypack 📁 🌐 -- a fast and compact format for serialization and storage. These days, JSON (JavaScript Object Notation, see ECMA-404) is used in many cases where data has to be exchanged. Lots of protocols between different services use it, databases store JSON (document stores naturally, but others increasingly as well). It is popular, because it is simple, human-readable, and yet surprisingly versatile, despite its limitations. At the same time there is a plethora of alternatives ranging from XML over Universal Binary JSON, MongoDB's BSON, MessagePack, BJSON (binary JSON), Apache Thrift till Google's protocol buffers and ArangoDB's shaped JSON. When looking into this, we were surprised to find that none of these formats manages to combine compactness, platform independence, fast access to sub-objects and rapid conversion from and to JSON.

  • zpp_bits 📁 🌐 -- A modern, fast, C++20 binary serialization and RPC library, with just one header file. See also the benchmark.

  • ZeroMQ a.k.a. ØMQ:

  • FastBinaryEncoding 🌐

    • removed; reason: for binary format record serialization we will be using bebop exclusively. All other communications will be JSON/JSON5/XML based.
  • flatbuffers 🌐

    • removed; reason: see protobuf: same smell rising. Faster at run time, but still a bit hairy to my tastes while bebop et al are on to something potentially nice.
  • flatcc 🌐

    • removed; reason: see flatbuffers. When we don't dig flatbuffers, then flatcc is automatically pretty useless to us. Let's rephrase that professionally: "flatcc has moved out of scope for our project."
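
The msgpack entries above state that small integers take a single byte and short strings only one extra byte. In the MessagePack format this corresponds to the "positive fixint" type (values 0 to 127 are stored as the byte itself) and the "fixstr" type (strings of up to 31 bytes get one header byte, `0xA0` OR-ed with the length). The sketch below hand-rolls just those two cases to show the byte layout; it is not the msgpack-c API, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Positive fixint: values 0..127 are encoded as the value byte itself.
std::vector<std::uint8_t> pack_small_uint(std::uint8_t value) {
    assert(value <= 0x7F);  // larger values need a different MessagePack type
    return std::vector<std::uint8_t>(1, value);
}

// fixstr: strings of up to 31 bytes get a one-byte header (0xA0 | length),
// followed by the raw string bytes.
std::vector<std::uint8_t> pack_short_str(const std::string& s) {
    assert(s.size() <= 31);  // longer strings need str8/str16/str32
    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>(0xA0u | s.size()));
    out.insert(out.end(), s.begin(), s.end());
    return out;
}
```

For comparison, the JSON text `"hi"` takes four bytes including the quotes, while `pack_short_str("hi")` yields three.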

IPC: websockets, etc.: all communication means

  • blazingmq 📁 🌐 -- BlazingMQ is a modern, High-Performance Message Queue, which focuses on efficiency, reliability, and a rich feature set for modern-day workflows. At its core, BlazingMQ provides durable, fault-tolerant, highly performant, and highly available queues, along with features like various message routing strategies (e.g., work queues, priority, fan-out, broadcast, etc.), compression, strong consistency, poison pill detection, etc. Message queues generally provide a loosely-coupled, asynchronous communication channel ("queue") between application services (producers and consumers) that send messages to one another. You can think about it like a mailbox for communication between application programs, where 'producer' drops a message in a mailbox and 'consumer' picks it up at its own leisure. Messages placed into the queue are stored until the recipient retrieves and processes them. In other words, producer and consumer applications can temporally and spatially isolate themselves from each other by using a message queue to facilitate communication.

  • boringssl 📁 🌐 -- BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

  • cpp-httplib 📁 🌐 -- an extremely easy to setup C++11 cross platform HTTP/HTTPS library.

    NOTE: This library uses 'blocking' socket I/O. If you are looking for a library with 'non-blocking' socket I/O, this is not the one that you want.

  • cpp-ipc 📁 🌐 -- a high-performance inter-process communication using shared memory on Linux/Windows.

  • cpp-netlib 📁 🌐 -- modern C++ network programming library: cpp-netlib is a collection of network-related routines/implementations geared towards providing a robust cross-platform networking library.

  • cpp_rest_sdk 📁 🌐 -- the C++ REST SDK is a Microsoft project for cloud-based client-server communication in native code using a modern asynchronous C++ API design. This project aims to help C++ developers connect to and interact with services.

  • crow 📁 🌐 -- IPC / server framework. Crow is a very fast and easy to use C++ micro web framework (inspired by Python Flask).

    Interface looks nicer than oatpp...

  • ecal 📁 🌐 -- the enhanced Communication Abstraction Layer (eCAL) is a middleware that enables scalable, high performance interprocess communication on a single computer node or between different nodes in a computer network. eCAL uses a publish-subscribe pattern to automatically connect different nodes in the network. eCAL automatically chooses the best available data transport mechanism for each link:

    • Shared memory for local communication (incredibly fast!)
    • UDP for network communication
  • iceoryx 📁 🌐 -- true zero-copy inter-process-communication. iceoryx is an inter-process-communication (IPC) middleware for various operating systems (currently we support Linux, macOS, QNX, FreeBSD and Windows 10). It has its origins in the automotive industry, where large amounts of data have to be transferred between different processes when it comes to driver assistance or automated driving systems. However, the efficient communication mechanisms can also be applied to a wider range of use cases, e.g. in the field of robotics or game development.

  • libetpan 📁 🌐 -- this mail library provides a portable, efficient framework for different kinds of mail access: IMAP, SMTP, POP and NNTP.

  • libwebsocketpp 📁 🌐 -- WebSocket++ is a header only C++ library that implements RFC6455 The WebSocket Protocol.

  • libwebsockets 📁 🌐 -- a simple-to-use C library providing client and server for HTTP/1, HTTP/2, WebSockets, MQTT and other protocols. It supports a lot of lightweight ancillary implementations for things like JSON, CBOR, JOSE, COSE. It's very gregarious when it comes to event loop sharing, supporting libuv, libevent, libev, sdevent, glib and uloop, as well as custom event libs.

  • MPMCQueue 📁 🌐 -- a bounded multi-producer multi-consumer concurrent queue written in C++11.

  • MultipartEncoder 📁 🌐 -- a C++ implementation of encoding multipart/form-data. You may find the asynchronous http-client, i.e. cpprestsdk, does not support posting a multipart/form-data request. This MultipartEncoder is a work around to generate the body content of multipart/form-data format, so that then you can use a cpp HTTP-client, which is not limited to cpprestsdk, to post a multipart/form-data request by setting the encoded body content.

  • nanomsg-nng 📁 🌐 -- a rewrite of the Scalability Protocols library known as libnanomsg (https://github.com/nanomsg/nanomsg), which adds significant new capabilities, while retaining compatibility with the original. NNG is a lightweight, broker-less library, offering a simple API to solve common recurring messaging problems, such as publish/subscribe, RPC-style request/reply, or service discovery.

  • nghttp3 📁 🌐 -- an implementation of RFC 9114 (https://datatracker.ietf.org/doc/html/rfc9114) HTTP/3 mapping over QUIC and RFC 9204 (https://datatracker.ietf.org/doc/html/rfc9204) QPACK in C.

  • ngtcp2 📁 🌐 -- the ngtcp2 project is an effort to implement the RFC 9000 (https://datatracker.ietf.org/doc/html/rfc9000) QUIC protocol.

  • OpenSSL 📁 🌐 -- OpenSSL is a robust, commercial-grade, full-featured Open Source Toolkit for the Transport Layer Security (TLS) protocol formerly known as the Secure Sockets Layer (SSL) protocol. The protocol implementation is based on a full-strength general purpose cryptographic library, which can also be used stand-alone.

  • readerwriterqueue 📁 🌐 -- a single-producer, single-consumer lock-free queue for C++.

  • restc-cpp 📁 🌐 -- a modern C++ REST Client library. The magic that takes the pain out of accessing JSON API's from C++. The design goal of this project is to make external REST API's simple and safe to use in C++ projects, but still fast and memory efficient.

  • restclient-cpp 📁 🌐 -- a simple REST client for C++, which wraps libcurl for HTTP requests.

  • shadesmar 📁 🌐 -- an IPC library that uses the system's shared memory to pass messages. Supports publish-subscribe and RPC.

  • sharedhashfile 📁 🌐 -- share hash tables with stable key hints stored in memory mapped files between arbitrary processes.

  • shmdata 📁 🌐 -- shares streams of framed data between processes (1 writer to many readers) via shared memory. It supports any kind of data stream: it has been used with multichannel audio, video frames, 3D models, OSC messages, and various other types of data. Shmdata is very fast and allows processes to access data streams without the need for extra copies.

  • SPSCQueue 📁 🌐 -- a single producer single consumer wait-free and lock-free fixed size queue written in C++11.

  • tcp_pubsub 📁 🌐 -- a minimal publish-subscribe library that transports data via TCP. tcp_pubsub does not define a message format but only transports binary blobs. It does however define a protocol around that, which is kept as lightweight as possible.

  • tcpshm 📁 🌐 -- a connection-oriented persistent message queue framework based on TCP or SHM IPC for Linux. TCPSHM provides a reliable and efficient solution based on a sequence number and acknowledge mechanism: every sent message is persisted in a send queue until the sender receives an ack that it has been consumed by the receiver, so that disconnects/crashes are tolerated and the recovery process is purely automatic.

  • telegram-bot-api 📁 🌐 -- the Telegram Bot API provides an HTTP API for creating Telegram Bots.

  • telegram-td 📁 🌐 -- TDLib (Telegram Database library) is a cross-platform library for building Telegram clients. It can be easily used from almost any programming language.

  • ucx 📁 🌐 -- Unified Communication X (UCX) is an optimized, production-proven communication framework for modern, high-bandwidth and low-latency networks. UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.

  • userver 📁 🌐 -- an open source asynchronous framework with a rich set of abstractions for fast and comfortable creation of C++ microservices, services and utilities. The framework solves the problem of efficient I/O interactions transparently for the developers. Operations that would typically suspend the thread of execution do not suspend it. Instead of that, the thread processes other requests and tasks and returns to the handling of the operation only when it is guaranteed to execute immediately. As a result you get straightforward source code and avoid CPU-consuming context switches from OS, efficiently utilizing the CPU with a small amount of execution threads.

  • uvw 📁 🌐 -- libuv wrapper in modern C++. uvw started as a header-only, event based, tiny and easy to use wrapper for libuv written in modern C++. Now it's finally available also as a compilable static library. The basic idea is to wrap the C-ish interface of libuv behind a graceful C++ API.

  • websocket-sharp 📁 🌐 -- a C# implementation of the WebSocket protocol client and server.

  • WinHttpPAL 📁 🌐 -- implements WinHttp API Platform Abstraction Layer for POSIX systems using libcurl

  • ice 🌐 -- Comprehensive RPC Framework: helps you network your software with minimal effort.

    • removed; reason: has a strong focus on the remote, i.e. R in RPC (thus a focus on things such as encryption, authentication, firewalling, etc.), which we don't want or need: all services are supposed to run on a single machine and comms go through localhost only. When folks find they need to distribute the workload across multiple machines, then we'll be entering a new era in Qiqqa usage, and that will be soon enough to (re-)investigate the usefulness of this package.

Also, we are currently more interested in fast data serialization than RPC per se as we aim for a solution that's more akin to a REST API interface style.

  • corosync 📁 🌐 -- the Corosync Cluster Engine. The synchronization algorithm is used for every service in corosync to synchronize state of the system. The checkpoint synchronization algorithm is to synchronize checkpoints after a partition or merge of two or more partitions.

  • oatpp 🌐 -- IPC / server framework

    • removed; reason: see crow. We have picked crow as the preferred way forward, so any similar/competing product is out of scope unless crow throws a tantrum on our test bench after all, the chances of that being very slim.
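
To make the queue entries above more concrete: readerwriterqueue and SPSCQueue are both single-producer, single-consumer queues. The sketch below shows the general idea behind a bounded ring buffer of that kind, using two atomic indices. It is not the code or the API of either library; `SpscRing`, `try_push` and `try_pop` are hypothetical names, and the sketch assumes C++17 and a default-constructible element type.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Illustrative bounded single-producer/single-consumer ring buffer.
// SpscRing is a hypothetical name, not the API of any listed library.
template <typename T>
class SpscRing {
public:
    // One slot is kept empty to tell "full" apart from "empty".
    explicit SpscRing(std::size_t capacity) : slots_(capacity + 1) {
        assert(capacity > 0);
    }

    // Called by the producer thread only. Returns false when the ring is full.
    bool try_push(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % slots_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = std::move(value);
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    // Called by the consumer thread only. Returns nullopt when the ring is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return std::nullopt;
        std::optional<T> value(std::move(slots_[head]));
        head_.store((head + 1) % slots_.size(), std::memory_order_release);
        return value;
    }

private:
    std::vector<T> slots_;
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

With exactly one thread calling `try_push` and one calling `try_pop`, neither side takes a lock; a full or empty queue is reported to the caller instead of blocking.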

IPC: ZeroMQ a.k.a. ØMQ

IPC: memory mapping

  • arrow 📁 🌐 -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • fmem 📁 🌐 -- a cross-platform library for opening memory-backed libc streams (a la UNIX fmemopen()).

  • fmemopen_windows 📁 🌐 -- provides a FILE* handle with a memory backend for fread, fwrite, etc., just like fmemopen on Linux, but now on MS Windows.

  • libmio 📁 🌐 -- An easy to use header-only cross-platform C++11 memory mapping library. mio has been created with the goal to be easily includable (i.e. no dependencies) in any C++ project that needs memory mapped file IO without the need to pull in Boost.

  • libvrb 📁 🌐 -- implements a ring buffer, also known as a character FIFO or circular buffer, with a special property that any data present in the buffer, as well as any empty space, are always seen as a single contiguous extent by the calling program. This is implemented with virtual memory mapping by creating a mirror image of the buffer contents at the memory location in the virtual address space immediately after the main buffer location. This allows the mirror image to always be seen without doing any copying of data.

  • portable-memory-mapping 📁 🌐 -- portable Memory Mapping C++ Class (Windows/Linux)

  • shadesmar 📁 🌐 -- an IPC library that uses the system's shared memory to pass messages. Supports publish-subscribe and RPC.

  • sharedhashfile 📁 🌐 -- share hash tables with stable key hints stored in memory mapped files between arbitrary processes.

  • shmdata 📁 🌐 -- shares streams of framed data between processes (1 writer to many readers) via shared memory. It supports any kind of data stream: it has been used with multichannel audio, video frames, 3D models, OSC messages, and various other types of data. Shmdata is very fast and allows processes to access data streams without the need for extra copies.

  • stxxl 📁 🌐 -- STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i. e. STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks.

  • tcpshm 📁 🌐 -- a connection-oriented persistent message queue framework based on TCP or SHM IPC for Linux. TCPSHM provides a reliable and efficient solution based on a sequence number and acknowledge mechanism: every sent message is persisted in a send queue until the sender receives an ack that it has been consumed by the receiver, so that disconnects/crashes are tolerated and the recovery process is purely automatic.

  • thrill 📁 🌐 -- an EXPERIMENTAL C++ framework for algorithmic distributed Big Data batch computations on a cluster of machines. More information at http://project-thrill.org.
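
The sequence-number-and-acknowledge mechanism mentioned in the tcpshm entry above can be sketched as follows. This is a minimal in-memory illustration only: `AckedSendQueue` and its members are hypothetical names, not tcpshm's API, and real persistence (a memory-mapped file or shared memory segment) is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Illustrative sender-side queue: messages stay stored until acknowledged.
// AckedSendQueue is a hypothetical name, not part of tcpshm.
class AckedSendQueue {
public:
    // Store a message and return the sequence number assigned to it.
    std::uint64_t push(std::string payload) {
        pending_.emplace_back(next_seq_, std::move(payload));
        return next_seq_++;
    }

    // The receiver confirms it consumed everything up to and including `seq`.
    void ack(std::uint64_t seq) {
        assert(seq < next_seq_);  // cannot acknowledge a message never sent
        while (!pending_.empty() && pending_.front().first <= seq)
            pending_.pop_front();
    }

    // After a reconnect, everything still pending is sent again, in order.
    std::vector<std::string> unacked() const {
        std::vector<std::string> out;
        for (const auto& entry : pending_) out.push_back(entry.second);
        return out;
    }

private:
    std::uint64_t next_seq_ = 1;
    std::deque<std::pair<std::uint64_t, std::string>> pending_;
};
```

Because nothing is dropped before its ack arrives, a crash or disconnect on either side only costs a replay of `unacked()` once the connection is back.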

IPC: JSON for protocol design

  • cJSON 📁 🌐 -- ultra-lightweight JSON parser in ANSI C.

  • glaze 📁 🌐 -- one of the fastest JSON libraries in the world. Glaze reads and writes from object memory, simplifying interfaces and offering incredible performance. Glaze also supports BEVE (binary efficient versatile encoding), CSV (comma separated value) and Binary data through the same API for maximum performance

  • GoldFish-CBOR 📁 🌐 -- a fast JSON and CBOR streaming library, without using memory. GoldFish can parse and generate very large JSON or CBOR documents. It has some similarities to a SAX parser, but doesn't use an event driven API, instead the user of the GoldFish interface is in control. GoldFish intends to be the easiest and one of the fastest JSON and CBOR streaming parser and serializer to use.

  • json 📁 🌐 -- N. Lohmann's JSON for Modern C++.

  • jsoncons 📁 🌐 -- a C++, header-only library for constructing JSON and JSON-like data formats such as CBOR. Compared to other JSON libraries, jsoncons has been designed to handle very large JSON texts. At its heart are SAX-style parsers and serializers. It supports reading an entire JSON text in memory in a variant-like structure. But it also supports efficient access to the underlying data using StAX-style pull parsing and push serializing. It supports incremental parsing into a user's preferred form, using information about user types provided by specializations of json_type_traits.

  • jsoncpp 📁 🌐 -- JsonCpp is a C++ library that allows manipulating JSON values, including serialization and deserialization to and from strings. It can also preserve existing comments across deserialization/serialization steps, making it a convenient format to store user input files.

  • json-jansson 📁 🌐 -- Jansson is a C library for encoding, decoding and manipulating JSON data.

  • rapidJSON 📁 🌐 -- TenCent's fast JSON parser/generator for C++ with both SAX & DOM style APIs.

  • simdjson 📁 🌐 -- simdjson: parsing gigabytes of JSON per second. For NDJSON files, we can exceed 3 GB/s with our multithreaded parsing functions (https://github.com/simdjson/simdjson/blob/master/doc/parse_many.md).

  • tao-json 📁 🌐 -- taoJSON is a C++ header-only JSON library that provides a generic Value Class, uses Type Traits to interoperate with C++ types, uses an Events Interface to convert from and to JSON, JAXN, CBOR, MsgPack and UBJSON, and much more...

  • yyjson 📁 🌐 -- allegedly the fastest JSON library in C.

  • libsmile 🌐 -- "Smile" format, i.e. a compact binary JSON format

    • discouraged; reason: for binary format record serialization we will be using bebop or reflect-cpp exclusively. All other communications will be JSON/JSON5/XML based. I think we'd better standardize on using one or more of these:

      • custom binary exchange formats for those interchanges that demand highest performance and MAY carry large transfer loads.
      • JSON
      • TOML
      • XML
      • YAML

IPC: CBOR for protocol design

  • glaze 📁 🌐 -- one of the fastest JSON libraries in the world. Glaze reads and writes from object memory, simplifying interfaces and offering incredible performance. Glaze also supports BEVE (binary efficient versatile encoding), CSV (comma separated value) and Binary data through the same API for maximum performance

  • GoldFish-CBOR 📁 🌐 -- a fast JSON and CBOR streaming library, without using memory. GoldFish can parse and generate very large JSON or CBOR documents. It has some similarities to a SAX parser, but doesn't use an event driven API, instead the user of the GoldFish interface is in control. GoldFish intends to be the easiest and one of the fastest JSON and CBOR streaming parser and serializer to use.

  • jsoncons 📁 🌐 -- a C++, header-only library for constructing JSON and JSON-like data formats such as CBOR. Compared to other JSON libraries, jsoncons has been designed to handle very large JSON texts. At its heart are SAX-style parsers and serializers. It supports reading an entire JSON text in memory in a variant-like structure. But it also supports efficient access to the underlying data using StAX-style pull parsing and push serializing. It supports incremental parsing into a user's preferred form, using information about user types provided by specializations of json_type_traits.

  • libcbor 📁 🌐 -- a C library for parsing and generating CBOR, the general-purpose schema-less binary data format.

  • QCBOR 📁 🌐 -- a powerful, commercial-quality CBOR encoder/decoder that implements these RFCs:

    • RFC7049 The previous CBOR standard. Replaced by RFC 8949.
    • RFC8742 CBOR Sequences
    • RFC8943 CBOR Dates
    • RFC8949 The CBOR Standard. (Everything except sorting of encoded maps)
  • tinycbor 📁 🌐 -- Concise Binary Object Representation (CBOR) library for serializing data to disk or message channel.
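
For a feel of what CBOR looks like on the wire, here is a small sketch of how RFC 8949 encodes an unsigned integer (major type 0): values below 24 fit in the initial byte, larger values get a 1, 2, 4 or 8 byte big-endian argument. `encode_cbor_uint` is a hypothetical helper written for this README, not a function from any of the libraries listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode an unsigned integer as CBOR major type 0 (RFC 8949).
// encode_cbor_uint is a hypothetical helper, not part of any listed library.
std::vector<std::uint8_t> encode_cbor_uint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    int extra_bytes = 0;
    if (value < 24) {
        out.push_back(static_cast<std::uint8_t>(value));  // value in initial byte
    } else if (value <= 0xFF) {
        out.push_back(0x18); extra_bytes = 1;
    } else if (value <= 0xFFFF) {
        out.push_back(0x19); extra_bytes = 2;
    } else if (value <= 0xFFFFFFFF) {
        out.push_back(0x1A); extra_bytes = 4;
    } else {
        out.push_back(0x1B); extra_bytes = 8;
    }
    // The argument follows in network byte order (big-endian).
    for (int shift = 8 * (extra_bytes - 1); shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>((value >> shift) & 0xFF));
    assert(out.size() == static_cast<std::size_t>(1 + extra_bytes));
    return out;
}
```

So 10 becomes the single byte `0x0a`, while 1000 becomes `0x19 0x03 0xe8`, matching the examples in the RFC's appendix.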

IPC: YAML, TOML, etc. for protocol design

Not considered: reason: when we want the IPC protocol to be "human readable" in any form/approximation, we've decided to stick with JSON or XML (if we cannot help it -- I particularly dislike the verbosity and tag redundancy (open+close) in XML and consider it a lousy design choice for any purpose).

The more human readable formats (YAML, TOML, ...) are intended for human to machine communications, e.g. for feeding configurations into applications, and SHOULD NOT be used for IPC anywhere. (Though I must say I'm on the fence when it comes to using YAML as an alternative IPC format where it replaces JSON; other contenders there are the JSON5/JSON6 formats.)

Content Hashing (cryptographic strength i.e. "guaranteed" collision-free)

The bit about "guaranteed" collision-free is to be read as: hash algorithms in this section must come with strong statistical guarantees that any chance at a hash collision is negligible, even for extremely large collections. In practice this means: use cryptographic hash algorithms with a strength of 128 bits or more. (Qiqqa has used a b0rked version of SHA1 thus far, which is considered too weak, as we already have sample PDFs which cause a hash collision for the official SHA1 algo (and thus also collide in our b0rked SHA1 variant): while those can still be argued to be fringe cases, I don't want to be bothered with this at all and thus choose to err on the side of 'better than SHA1B' here.) Meanwhile, any library in here may contain weaker cryptographic hashes alongside: we won't be using those for content hashing.
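
To put a number on "negligible": a rough back-of-the-envelope sketch using the birthday-bound approximation, which estimates the chance of at least one accidental collision among n uniformly random b-bit hashes as 1 - exp(-n² / 2^(b+1)). `collision_probability` is a hypothetical helper for illustration, and the model assumes ideal, uniformly distributed hash values; it says nothing about deliberately crafted collisions, which is where the cryptographic-strength requirement comes in.

```cpp
#include <cassert>
#include <cmath>

// Birthday-bound approximation for accidental collisions among `items`
// uniformly random hashes of `bits` bits: 1 - exp(-items^2 / 2^(bits+1)).
// collision_probability is a hypothetical helper for illustration only.
double collision_probability(double items, int bits) {
    assert(items >= 0.0 && bits > 0);
    // ldexp(x, e) computes x * 2^e without overflowing an integer type.
    const double exponent = -std::ldexp(items * items, -(bits + 1));
    // expm1 keeps precision when the probability is extremely small.
    return -std::expm1(exponent);
}
```

For a billion documents, a 64-bit hash already lands in the percent range under this model, while a 128-bit hash stays far below anything worth worrying about.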

  • BLAKE3 📁 🌐 -- cryptographic hash

  • boringssl 📁 🌐 -- BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

  • botan 📁 🌐 -- Botan (Japanese for peony flower) is a C++ cryptography library whose goal is to be the best option for cryptography in C++ by offering the tools necessary to implement a range of practical systems, such as TLS protocol, X.509 certificates, modern AEAD ciphers, PKCS#11 and TPM hardware support, password hashing, and post quantum crypto schemes.

  • cryptopp 📁 🌐 -- crypto library

  • libsodium 📁 🌐 -- an easy-to-use software library for encryption, decryption, signatures, password hashing, and more. It is a portable, cross-compilable, installable, packageable fork of NaCl, with a compatible API, and an extended API to improve usability even further.

  • md5-optimisation 📁 🌐 -- MD5 Optimisation Tricks: Beating OpenSSL’s Hand-tuned Assembly. Putting aside the security concerns with using MD5 as a cryptographic hash, there have been few developments on the performance front for many years, possibly due to maturity of implementations and existing techniques considered to be optimal. Several new tricks are employed which I’ve not seen used elsewhere, ultimately beating OpenSSL’s hand-optimized MD5 implementation by roughly 5% in the general case, and 23% for processors with AVX512 support.

  • OpenSSL 📁 🌐 -- its crypto library part, more specifically.

  • prvhash 📁 🌐 -- PRVHASH is a hash function that generates a uniform pseudo-random number sequence derived from the message. PRVHASH is conceptually similar (in the sense of using a pseudo-random number sequence as a hash) to keccak and RadioGatun schemes, but is a completely different implementation of such concept. PRVHASH is both a "randomness extractor" and an "extendable-output function" (XOF).

  • SipHash 📁 🌐 -- SipHash is a family of pseudorandom functions (PRFs) optimized for speed on short messages. This is the reference C code of SipHash: portable, simple, optimized for clarity and debugging. SipHash was designed in 2012 by Jean-Philippe Aumasson and Daniel J. Bernstein as a defense against hash-flooding DoS attacks.

    It is:

      • simpler and faster on short messages than previous cryptographic algorithms, such as MACs based on universal hashing;
      • competitive in performance with insecure non-cryptographic algorithms, such as fhhash;
      • cryptographically secure, with no sign of weakness despite multiple cryptanalysis projects by leading cryptographers;
      • battle-tested, with successful integration in OSs (Linux kernel, OpenBSD, FreeBSD, FreeRTOS), languages (Perl, Python, Ruby, etc.), libraries (OpenSSL libcrypto, Sodium, etc.) and applications (Wireguard, Redis, etc.).

    As a secure pseudorandom function (a.k.a. keyed hash function), SipHash can also be used as a secure message authentication code (MAC). But SipHash is not a hash in the sense of general-purpose key-less hash function such as BLAKE3 or SHA-3. SipHash should therefore always be used with a secret key in order to be secure.

  • tink 📁 🌐 -- A multi-language, cross-platform library that provides cryptographic APIs that are secure, easy to use correctly, and hard(er) to misuse.

  • tink-cc 📁 🌐 -- Tink C++: Using crypto in your application shouldn't feel like juggling chainsaws in the dark. Tink is a crypto library written by a group of cryptographers and security engineers at Google. It was born out of our extensive experience working with Google's product teams, fixing weaknesses in implementations, and providing simple APIs that can be used safely without needing a crypto background. Tink provides secure APIs that are easy to use correctly and hard(er) to misuse. It reduces common crypto pitfalls with user-centered design, careful implementation and code reviews, and extensive testing. At Google, Tink is one of the standard crypto libraries, and has been deployed in hundreds of products and systems.

Hash-like Filters & Fast Hashing for Hash Tables et al (64 bits and less, mostly)

These hashes are for other purposes, e.g. fast lookup in dictionaries, fast approximate hit testing and set reduction through fast filtering (think bloom filter). These may be machine specific (and some of them are): these are never supposed to be used for encoding in storage or other means which crosses machine boundaries: if you want to use them for a database index, that is fine as long as you don't expect that database index to be readable by any other machine than the one which produced and uses these hash numbers.

As you can see from the list below, I went on a shopping spree, having fun with all the latest, including some possibly insane stuff that's only really useful for particular edge cases -- which we hope to avoid ourselves, for a while at least. Anyway, I'd say we've got the motherlode here. Simple fun for those days when your brain-flag is at half-mast. Enjoy.
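
For readers who have not met a bloom filter before, here is a minimal sketch of the "fast approximate hit testing" idea: a bit array plus a few hash probes per key, where a miss is definite and a hit is only probable. `TinyBloom` is a hypothetical name, not taken from any library below. It deliberately leans on `std::hash`, whose values are implementation-specific, which is exactly the kind of machine-bound hashing this section is about.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter sketch. TinyBloom is a hypothetical name; hashing
// relies on std::hash, so the bit pattern is not portable across machines.
class TinyBloom {
public:
    TinyBloom(std::size_t bit_count, unsigned hash_count)
        : bits_(bit_count, false), hash_count_(hash_count) {
        assert(bit_count > 0 && hash_count > 0);
    }

    void insert(const std::string& key) {
        for (unsigned i = 0; i < hash_count_; ++i) bits_[index(key, i)] = true;
    }

    // false: definitely absent. true: probably present (false positives happen).
    bool maybe_contains(const std::string& key) const {
        for (unsigned i = 0; i < hash_count_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }

private:
    // Double hashing: derive the i-th probe position from two base hashes.
    std::size_t index(const std::string& key, unsigned i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + '#') | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    unsigned hash_count_;
};
```

An inserted key is never reported as absent; what you trade away is the occasional false positive, whose rate depends on the bit count, the number of probes and how many keys went in.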

  • adaptiveqf 📁 🌐 -- Adaptive Quotient Filter (AQF) supports approximate membership testing and counting the occurrences of items in a data set. Like other AMQs, the AQF has a chance for false positives during queries. However, the AQF has the ability to adapt to false positives after they have occurred so they are not repeated. At the same time, the AQF maintains the benefits of a quotient filter, as it is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting, resizing, merging, and highly concurrent access.

  • adaptive-radix-tree 📁 🌐 -- implements the Adaptive Radix Tree (ART), as proposed by Leis et al. ART, which is a trie based data structure, achieves its performance, and space efficiency, by compressing the tree both vertically, i.e., if a node has no siblings it is merged with its parent, and horizontally, i.e., uses an array which grows as the number of children increases. Vertical compression reduces the tree height and horizontal compression decreases a node’s size.

  • BBHash 📁 🌐 -- Bloom-filter based minimal perfect hash function library.

    • left-for-dead; reason: has some GCC + Linux specific coding constructs; code isn't clean, which doesn't make my porting effort 'trustworthy'. Overall, if this is the alternative, we'll stick with gperf.
  • BCF-cuckoo-index 📁 🌐 -- Better Choice Cuckoo Filter (BCF) is an efficient approximate set representation data structure. Different from the standard Cuckoo Filter (CF), BCF leverages the principle of the power of two choices to select the better candidate bucket during insertion. BCF reduces the average number of relocations of the state-of-the-art CF by 35%.

    • left-for-dead; reason: has some GCC + Linux specific coding constructs: intrinsics + Linux-only API calls, which increase the cost of porting.
  • bitrush-index 📁 🌐 -- provides a serializable bitmap index able to index millions of values/sec on a single thread. By default this library uses ozbcbitmap, but if you want you can also use another compressed/uncompressed bitmap. Only equality queries (A = X) are supported.

  • bloom 📁 🌐 -- C++ Bloom Filter Library, which offers optimal parameter selection based on expected false positive rate, union, intersection and difference operations between bloom filters and compression of in-use table (increase of false positive probability vs space).

  • blurhash 📁 🌐 -- generates a compact representation of a placeholder for an image. You can also see nice examples and try it out yourself at blurha.sh! BlurHash takes an image and gives you a short string (only 20-30 characters!) that represents the placeholder for this image, which can be quickly decoded by any web client into an image that it shows while the real image is loading over the network.

  • cmph-hasher 📁 🌐 -- C Minimal Perfect Hashing Library for both small and (very) large hash sets.

  • cqf 📁 🌐 -- A General-Purpose Counting Filter: Counting Quotient Filter (CQF) supports approximate membership testing and counting the occurrences of items in a data set. This general-purpose AMQ is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting (even on skewed data sets), resizing, merging, and highly concurrent access.

  • crc32 📁 🌐 -- fast CRC32 library from https://create.stephan-brumme.com/crc32/

  • crc32c 📁 🌐 -- a few CRC32C implementations under an umbrella that dispatches to a suitable implementation based on the host computer's hardware capabilities. CRC32C is specified as the CRC that uses the iSCSI polynomial in RFC 3720. The polynomial was introduced by G. Castagnoli, S. Braeuer and M. Herrmann. CRC32C is used in software such as Btrfs, ext4, Ceph and leveldb.

  • CRoaring 📁 🌐 -- portable Roaring bitmaps in C (and C++). Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps. Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. They are used by several major systems such as Apache Lucene and derivative systems such as Solr and Elasticsearch, etc. The CRoaring library is used in several systems such as Apache Doris.

  • Cuckoo_Filter 📁 🌐 -- a key-value filter using cuckoo hashing, substituting for bloom filter.

  • cuckoo-index 📁 🌐 -- Cuckoo Index (CI) is a lightweight secondary index structure that represents the many-to-many relationship between keys and partitions of columns in a highly space-efficient way. CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions. The problem of finding all partitions that possibly contain a given lookup key is traditionally solved by maintaining one filter (e.g., a Bloom filter) per partition that indexes all unique key values contained in this partition. To identify all partitions containing a key, we would need to probe all per-partition filters (which could be many). Depending on the storage medium, a false positive there can be very expensive. Furthermore, secondary columns typically contain many duplicates (also across partitions). Cuckoo Index (CI) addresses these drawbacks of per-partition filters. (It must know all keys at build time, though.)

  • dablooms 📁 🌐 -- a Scalable, Counting, Bloom Filter demonstrating a novel Bloom filter implementation that can scale, and provide not only the addition of new members, but reliable removal of existing members.

  • DCF-cuckoo-index 📁 🌐 -- the Dynamic Cuckoo Filter (DCF) is an efficient approximate membership test data structure. Different from the classic Bloom filter and its variants, DCF is especially designed for highly dynamic datasets and supports extending and reducing its capacity. The DCF design is the first to achieve both reliable item deletion and flexibly extending/reducing for approximate set representation and membership testing. DCF outperforms the state-of-the-art DBF designs in both speed and memory consumption.

  • dense_hash_map 📁 🌐 -- jg::dense_hash_map: a simple replacement for std::unordered_map with better performance, but it loses stable addressing as a trade-off.

  • EASTL 📁 🌐 -- EASTL (Electronic Arts Standard Template Library) is a C++ template library of containers, algorithms, and iterators useful for runtime and tool development across multiple platforms. It is a fairly extensive and robust implementation of such a library and has an emphasis on high performance above all other considerations.

  • emhash 📁 🌐 -- fast and memory efficient open addressing C++ flat hash table/map.

  • emphf-hash 📁 🌐 -- an efficient external-memory algorithm for the construction of minimal perfect hash functions for large-scale key sets, focusing on speed and low memory usage (2.61 N bits plus a small constant factor).

  • EWAHBoolArray 📁 🌐 -- a C++ compressed bitset data structure (also called bitset or bit vector). It supports several word sizes by a template parameter (16-bit, 32-bit, 64-bit). You should expect the 64-bit word-size to provide better performance, but higher memory usage, while a 32-bit word-size might compress a bit better, at the expense of some performance.

  • eytzinger 📁 🌐 -- fixed_eytzinger_map is a free implementation of Eytzinger’s layout, in the form of an STL-like generic associative container, broadly compatible with well-established access patterns. An Eytzinger map, or BFS (breadth-first search) map, places elements in lookup order, which leads to better memory locality. In practice, such a container can outperform searching in sorted arrays, like boost::flat_map, due to fewer cache misses during lookup. Compared with RB-based trees, like std::map, lookup in an Eytzinger map can be multiple times faster. Some comparison graphs are given here.

  • farmhash 📁 🌐 -- FarmHash, a family of hash functions. FarmHash provides hash functions for strings and other data. The functions mix the input bits thoroughly but are not suitable for cryptography.

  • fastfilter_cpp 📁 🌐 -- Fast Filter: Fast approximate membership filter implementations (C++, research library)

  • fasthashing 📁 🌐 -- a few very fast (almost) strongly universal hash functions over 32-bit strings, as described by the paper: Owen Kaser and Daniel Lemire, Strongly universal string hashing is fast, Computer Journal (2014) 57 (11): 1624-1638. http://arxiv.org/abs/1202.4961

  • fifo_map 📁 🌐 -- a FIFO-ordered associative container for C++. It has the same interface as std::map and can be used as a drop-in replacement.

  • flat_hash_map 📁 🌐 -- a very fast hashtable.

  • flat.hpp 📁 🌐 -- a library of flat vector-like based associative containers.

  • fph-table 📁 🌐 -- the Flash Perfect Hash (FPH) library is a modern C++17 implementation of a dynamic perfect hash table (no collisions for the hash), which makes the hash map/set extremely fast for lookup operations. It provides four container classes: fph::DynamicFphSet, fph::DynamicFphMap, fph::MetaFphSet and fph::MetaFphMap. The APIs of these four classes are almost the same as those of std::unordered_set and std::unordered_map.

  • fsst 📁 🌐 -- Fast Static Symbol Table (FSST): fast text compression that allows random access. See also the PVLDB paper https://github.com/cwida/fsst/raw/master/fsstcompression.pdf. FSST is a compression scheme focused on string/text data: it can compress strings from distributions with many different values (i.e. where dictionary compression will not work well). It allows random-access to compressed data: it is not block-based, so individual strings can be decompressed without touching the surrounding data in a compressed block. When compared to e.g. LZ4 (which is block-based), FSST further achieves similar decompression speed and compression speed, and better compression ratio. FSST encodes strings using a symbol table -- but it works on pieces of the string, as it maps "symbols" (1-8 byte sequences) onto "codes" (single-bytes). FSST can also represent a byte as an exception (255 followed by the original byte). Hence, compression transforms a sequence of bytes into a (supposedly shorter) sequence of codes or escaped bytes. These shorter byte-sequences could be seen as strings again and fit in whatever your program is that manipulates strings. An optional 0-terminated mode (like, C-strings) is also supported.

  • gperf-hash 📁 🌐 -- This is GNU gperf, a program that generates C/C++ perfect hash functions for sets of key words.

  • gtl 📁 🌐 -- Greg's Template Library of useful classes, including a set of excellent hash map implementations, as well as a btree alternative to std::map and std::set. These are drop-in replacements for the standard C++ classes and provide the same API, but are significantly faster and use less memory. We also have a fast bit_vector implementation, which is an alternative to std::vector<bool> or std::bitset, providing both dynamic resizing and a good assortment of bit manipulation primitives, as well as a novel bit_view class that operates on subsets of the bit_vector. We have lru_cache and memoize classes, both with very fast multi-thread versions relying on the mutex sharding of the parallel hashmap classes. We also offer an intrusive_ptr class, which uses less memory than std::shared_ptr, and is simpler to construct.

  • HashMap 📁 🌐 -- a hash table mostly compatible with the C++11 std::unordered_map interface, but with much higher performance for many workloads. This hash table uses open addressing with linear probing and backshift deletion. Open addressing and linear probing minimize memory allocations and achieve high cache efficiency. Backshift deletion keeps performance high for delete-heavy workloads by not cluttering the hash table with tombstones.

  • highwayhash 📁 🌐 -- Fast strong hash functions: SipHash/HighwayHash

  • hopscotch-map 📁 🌐 -- a C++ implementation of a fast hash map and hash set using hopscotch hashing and open-addressing to resolve collisions. It is a cache-friendly data structure offering better performance than std::unordered_map in most cases and is closely similar to google::dense_hash_map while using less memory and providing more functionality.

  • iceberghashtable 📁 🌐 -- IcebergDB: High Performance Hash Tables Through Stability and Low Associativity is a fast, concurrent, and resizeable hash table implementation. It supports insertions, deletions and queries for 64-bit keys and values.

  • LDCF-hash 📁 🌐 -- The Logarithmic Dynamic Cuckoo Filter (LDCF) is an efficient approximate membership test data structure for dynamic big data sets. LDCF uses a novel multi-level tree structure and reduces the worst insertion and membership testing time from O(N) to O(1), while simultaneously reducing the memory cost of DCF as the cardinality of the set increases.

  • libart 📁 🌐 -- provides the Adaptive Radix Tree or ART. The ART operates similar to a traditional radix tree but avoids the wasted space of internal nodes by changing the node size. It makes use of 4 node sizes (4, 16, 48, 256), and can guarantee that the overhead is no more than 52 bytes per key, though in practice it is much lower.

  • libbloom 📁 🌐 -- a high-performance C server, exposing bloom filters and operations over them. The rate of false positives can be tuned to meet application demands, but reducing the error rate rapidly increases the amount of memory required for the representation. Example: Bloom filters enable you to represent 1MM items with a false positive rate of 0.1% in 2.4MB of RAM.

  • libbloomfilters 📁 🌐 -- libbf is a C++11 library which implements various Bloom filters, including:

    • A^2
    • Basic
    • Bitwise
    • Counting
    • Spectral MI
    • Spectral RM
    • Stable
  • libCRCpp 📁 🌐 -- easy to use and fast C++ CRC library.

  • libCSD 📁 🌐 -- a C++ library providing some different techniques for managing string dictionaries in compressed space. These approaches are inspired on the paper: "Compressed String Dictionaries", Nieves R. Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro, 10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011.

  • libcuckoo 📁 🌐 -- provides a high-performance, compact hash table that allows multiple concurrent reader and writer threads.

  • lshbox 📁 🌐 -- a C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval. Locality-Sensitive Hashing (LSH) is an efficient method for large scale image retrieval, and it achieves great performance in approximate nearest neighbor search.

    LSHBOX is a simple but robust C++ toolbox that provides several LSH algorithms and can be integrated into Python and MATLAB. The following LSH algorithms have been implemented in LSHBOX:

    • LSH Based on Random Bits Sampling
    • Random Hyperplane Hashing
    • LSH Based on Thresholding
    • LSH Based on p-Stable Distributions
    • Spectral Hashing (SH)
    • Iterative Quantization (ITQ)
    • Double-Bit Quantization Hashing (DBQ)
    • K-means Based Double-Bit Quantization Hashing (KDBQ)
  • map_benchmark 📁 🌐 -- comprehensive benchmarks of C++ maps.

  • morton_filter 📁 🌐 -- a Morton filter -- a new approximate set membership data structure. A Morton filter is a modified cuckoo filter that is optimized for bandwidth-constrained systems. Morton filters use additional computation in order to reduce their off-chip memory traffic. Like a cuckoo filter, a Morton filter supports insertions, deletions, and lookup operations. It additionally adds high-throughput self-resizing, a feature of quotient filters, which allows a Morton filter to increase its capacity solely by leveraging its internal representation. This capability is in contrast to existing vanilla cuckoo filter implementations, which are static and thus require using a backing data structure that contains the full set of items to resize the filter. Morton filters can also be configured to use less memory than a cuckoo filter for the same error rate while simultaneously delivering insertion, deletion, and lookup throughputs that are, respectively, up to 15.5x, 1.3x, and 2.5x higher than a cuckoo filter. Morton filters in contrast to vanilla cuckoo filters do not require a power of two number of buckets but rather only a number that is a multiple of two. They also use fewer bits per item than a Bloom filter when the target false positive rate is less than around 1% to 3%.

  • mutable_rank_select 📁 🌐 -- Rank/Select Queries over Mutable Bitmaps. Given a mutable bitmap B[0..u) where n bits are set, the rank/select problem asks for a data structure built from B that supports rank(i) (the number of bits set in B[0..i], for 0 ≤ i < u), select(i) (the position of the i-th bit set, for 0 ≤ i < n), flip(i) (toggles B[i], for 0 ≤ i < u) and access(i) (return B[i], for 0 ≤ i < u). The input bitmap is partitioned into blocks and a tree index is built over them. The tree index implemented in the library is an optimized b-ary Segment-Tree with SIMD AVX2/AVX-512 instructions. You can test a block size of 256 or 512 bits, and various rank/select algorithms for the blocks such as broadword techniques, CPU intrinsics, and SIMD instructions.

  • nedtries 📁 🌐 -- an in-place bitwise binary Fredkin trie algorithm which allows for near constant time insertions, deletions, finds, closest fit finds and iteration. On modern hardware it is approximately 50-100% faster than red-black binary trees, it handily beats even the venerable O(1) hash table for less than 3000 objects and it is barely slower than the hash table for 10000 objects. Past 10000 objects you probably ought to use a hash table though, and if you need nearest fit rather than close fit then red-black trees are still optimal.

  • OZBCBitmap 📁 🌐 -- OZBC provides an efficient compressed bitmap to create bitmap indexes on high-cardinality columns. Bitmap indexes have traditionally been considered to work well for low-cardinality columns, which have a modest number of distinct values. The simplest and most common method of bitmap indexing on an attribute A with cardinality K associates a bitmap with every attribute value V, so that the V-th bitmap represents the predicate A=V. This approach makes searches efficient, but on high-cardinality attributes the size of the bitmap index increases dramatically. OZBC is a run-length-encoded hybrid compressed bitmap designed exclusively to create bitmap indexes on attributes of cardinality L, where L>=16, and provides bitwise logical operations with running time proportional to the compressed bitmap size.

  • parallel-hashmap 📁 🌐 -- a set of hash map implementations, as well as a btree alternative to std::map and std::set

  • phf-hash 📁 🌐 -- a simple implementation of the CHD perfect hash algorithm. CHD can generate perfect hash functions for very large key sets -- on the order of millions of keys -- in a very short time.

  • poplar-trie 📁 🌐 -- a C++17 library of a memory-efficient associative array whose keys are strings. The data structure is based on a dynamic path-decomposed trie (DynPDT) described in the paper, Shunsuke Kanda, Dominik Köppl, Yasuo Tabei, Kazuhiro Morita, and Masao Fuketa: Dynamic Path-decomposed Tries, ACM Journal of Experimental Algorithmics (JEA), 25(1): 1–28, 2020. Poplar-trie is a memory-efficient updatable associative array implementation which maps key strings to values of any type like std::map<std::string,anytype>. DynPDT is composed of two structures: dynamic trie and node label map (NLM) structures.

  • PruningRadixTrie 📁 🌐 -- a 1000x faster Radix trie for prefix search & auto-complete, the PruningRadixTrie is a novel data structure, derived from a radix trie - but 3 orders of magnitude faster. A Pruning Radix trie is a novel Radix trie algorithm, that allows pruning of the Radix trie and early termination of the lookup. In many cases, we are not interested in a complete set of all children for a given prefix, but only in the top-k most relevant terms. Especially for short prefixes, this results in a massive reduction of lookup time for the top-10 results. On the other hand, a complete result set of millions of suggestions wouldn't be helpful at all for autocompletion. The lookup acceleration is achieved by storing in each node the maximum rank of all its children. By comparing this maximum child rank with the lowest rank of the results retrieved so far, we can heavily prune the trie and do early termination of the lookup for non-promising branches with low child ranks.

  • prvhash 📁 🌐 -- PRVHASH is a hash function that generates a uniform pseudo-random number sequence derived from the message. PRVHASH is conceptually similar (in the sense of using a pseudo-random number sequence as a hash) to keccak and RadioGatun schemes, but is a completely different implementation of such concept. PRVHASH is both a "randomness extractor" and an "extendable-output function" (XOF).

  • QALSH 📁 🌐 -- QALSH: Query-Aware Locality-Sensitive Hashing, is a package for the problem of Nearest Neighbor Search (NNS) over high-dimensional Euclidean spaces. Given a set of data points and a query, the problem of NNS aims to find the nearest data point to the query. It is a very fundamental problem and has wide applications in many data mining and machine learning tasks. This package provides the external memory implementations (disk-based) of QALSH and QALSH+ for c-Approximate Nearest Neighbor Search (c-ANNS) under lp norm, where 0 < p ⩽ 2. The internal memory version can be found here.

  • QALSH_Mem 📁 🌐 -- Memory Version of QALSH: QALSH_Mem is a package for the problem of Nearest Neighbor Search (NNS). Given a set of data points and a query, the problem of NNS aims to find the nearest data point to the query. It has wide applications in many data mining and machine learning tasks. This package provides the internal memory implementations of two LSH schemes QALSH and QALSH+ for c-Approximate Nearest Neighbor Search (c-ANNS) under lp norm, where 0 < p ⩽ 2. The external version of QALSH and QALSH+ can be found here.

  • radix_tree 📁 🌐 -- STL like container of radix tree in C++.

  • rankselect 📁 🌐 -- space-efficient, high-performance rank & select structures on uncompressed bit sequences.

  • rapidhash 📁 🌐 -- very fast, high quality, platform-independent, this is the fastest recommended hash function by SMHasher. The fastest passing hash in SMHasher3. rapidhash is wyhash' official successor, with improved speed, quality and compatibility.

  • rax 📁 🌐 -- an ANSI C radix tree implementation initially written to be used in a specific place of Redis in order to solve a performance problem, but immediately converted into a stand alone project to make it reusable for Redis itself, outside the initial intended application, and for other projects as well. The primary goal was to find a suitable balance between performances and memory usage, while providing a fully featured implementation of radix trees that can cope with many different requirements.

  • RectangleBinPack 📁 🌐 -- the source code used in "A Thousand Ways to Pack the Bin - A Practical Approach to Two-Dimensional Rectangle Bin Packing." The code can be

  • RoaringBitmap 📁 🌐 -- Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, roaring bitmaps can be hundreds of times faster and they often offer significantly better compression. They can even be faster than uncompressed bitmaps.

  • robin-hood-hashing 📁 🌐 -- robin_hood unordered map & set.

  • robin-map 📁 🌐 -- a C++ implementation of a fast hash map and hash set using open-addressing and linear robin hood hashing with backward shift deletion to resolve collisions.

  • rollinghashcpp 📁 🌐 -- randomized rolling hash functions in C++. This is a set of C++ classes implementing various recursive n-gram hashing techniques, also called rolling hashing (http://en.wikipedia.org/wiki/Rolling_hash), including Randomized Karp-Rabin (sometimes called Rabin-Karp), Hashing by Cyclic Polynomials (also known as Buzhash) and Hashing by Irreducible Polynomials.

  • RTree 📁 🌐 -- R-Tree: a Dynamic Index Structure for Spatial Searching, implemented as a C++ template, generally compatible with the STL and Boost C++ libraries.

  • semimap 📁 🌐 -- semi::static_map and semi::map: associative map containers with compile-time lookup! Normally, associative containers require some runtime overhead when looking up their values from a key. However, when the key is known at compile-time (for example, when the key is a literal) then this run-time lookup could technically be avoided. This is exactly what the goal of semi::static_map and semi::map is.

  • SipHash 📁 🌐 -- SipHash is a family of pseudorandom functions (PRFs) optimized for speed on short messages. This is the reference C code of SipHash: portable, simple, optimized for clarity and debugging. SipHash was designed in 2012 by Jean-Philippe Aumasson and Daniel J. Bernstein as a defense against hash-flooding DoS attacks.

    It is:

    • simpler and faster on short messages than previous cryptographic algorithms, such as MACs based on universal hashing;
    • competitive in performance with insecure non-cryptographic algorithms, such as fhhash;
    • cryptographically secure, with no sign of weakness despite multiple cryptanalysis projects by leading cryptographers;
    • battle-tested, with successful integration in OSs (Linux kernel, OpenBSD, FreeBSD, FreeRTOS), languages (Perl, Python, Ruby, etc.), libraries (OpenSSL libcrypto, Sodium, etc.) and applications (Wireguard, Redis, etc.).

    As a secure pseudorandom function (a.k.a. keyed hash function), SipHash can also be used as a secure message authentication code (MAC). But SipHash is not a hash in the sense of general-purpose key-less hash function such as BLAKE3 or SHA-3. SipHash should therefore always be used with a secret key in order to be secure.

  • slot_map 📁 🌐 -- a Slot Map is a high-performance associative container with persistent unique keys to access stored values. Upon insertion, a key is returned that can be used to later access or remove the values. Insertion, removal, and access are all guaranteed to take O(1) time (best, worst, and average case). Great for storing collections of objects that need stable, safe references but have no clear ownership. The difference between a std::unordered_map and a dod::slot_map is that the slot map generates and returns the key when inserting a value. A key is always unique and will only refer to the value that was inserted.

  • smhasher 📁 🌐 -- benchmark and collection of fast hash functions for symbol tables or hash tables.

  • sparsehash 📁 🌐 -- Google's sparsehash: memory-efficient (sparse) and fast (dense) hash map and hash set implementations.

  • sparse-map 📁 🌐 -- a C++ implementation of a memory efficient hash map and hash set. It uses open-addressing with sparse quadratic probing. The goal of the library is to be the most memory efficient possible, even at low load factor, while keeping reasonable performances.

  • sparsepp 📁 🌐 -- a fast, memory efficient hash map for C++. Sparsepp is derived from Google's excellent sparsehash implementation.

  • spookyhash 📁 🌐 -- a very fast non cryptographic hash function, designed by Bob Jenkins. It produces well-distributed 128-bit hash values for byte arrays of any length. It can produce 64-bit and 32-bit hash values too, at the same speed.

  • StronglyUniversalStringHashing 📁 🌐 -- very fast universal hash families on strings.

  • tensorstore 📁 🌐 -- TensorStore is an open-source C++ and Python software library designed for storage and manipulation of large multi-dimensional arrays.

  • thumbhash 📁 🌐 -- ThumbHash: a very compact representation of a placeholder for an image. Store it inline with your data and show it while the real image is loading for a smoother loading experience. It's similar to BlurHash but encodes more detail in the same space, also encodes the aspect ratio and gives more accurate colors.

  • unordered_dense 📁 🌐 -- ankerl::unordered_dense::{map, set} is a fast & densely stored hashmap and hashset based on robin-hood backward shift deletion for C++17 and later. The classes ankerl::unordered_dense::map and ankerl::unordered_dense::set are (almost) drop-in replacements of std::unordered_map and std::unordered_set. While they don't have as strong iterator / reference stability guarantees, they are typically much faster. Additionally, there are ankerl::unordered_dense::segmented_map and ankerl::unordered_dense::segmented_set with lower peak memory usage and stable iterators/references on insert.

  • vqf 📁 🌐 -- Vector Quotient Filters: Overcoming the Time/Space Trade-Off in Filter Design. The VQF supports approximate membership testing of items in a data set. The VQF is based on Robin Hood hashing, like the quotient filter, but uses power-of-two-choices hashing to reduce the variance of runs, and thus offers consistent, high throughput across load factors. Power-of-two-choices hashing also makes it more amenable to concurrent updates.

  • wyhash 📁 🌐 -- No hash function is perfect, but some are useful. wyhash and wyrand are the ideal 64-bit hash function and PRNG respectively: solid, portable, fastest (especially for short keys), salted (using a dynamic secret to avoid intended attack).

  • xor-and-binary-fuse-filter 📁 🌐 -- XOR and Binary Fuse Filter library: Bloom filters are used to quickly check whether an element is part of a set. Xor filters and binary fuse filters are faster and more concise alternatives to Bloom filters. They are also smaller than cuckoo filters. They are used in production systems.

  • xsg 📁 🌐 -- XOR BST implementations are related to the XOR linked list, a doubly linked list variant, from where we borrow the idea about how links between nodes are to be implemented. Modest resource requirements and simplicity make XOR scapegoat trees stand out of the BST crowd. All iterators (except end() iterators), but not references and pointers, are invalidated, after inserting or erasing from this XOR scapegoat tree implementation. You can dereference invalidated iterators, if they were not erased, but you cannot iterate with them. end() iterators are constant and always valid, but dereferencing them results in undefined behavior.

  • xxHash 📁 🌐 -- fast (non-cryptographic) hash algorithm

  • circlehash 📁 🌐 -- a family of non-cryptographic hash functions that pass every test in SMHasher.

    • removed; reason: written in Go; port to C/C++ is easy but just too much effort for too little gain; when we're looking for fast non-cryptographic hashes like this, we don't appreciate it to include 128-bit / 64-bit multiplications as those are generally slower than shift, add, xor. While this will surely be a nice hash, it doesn't fit our purposes.
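
To make the Eytzinger (BFS) layout from the eytzinger entry above more concrete, here is a minimal C++ sketch. It is not the fixed_eytzinger_map API: the names eytzinger_fill, eytzinger_build and eytzinger_contains are hypothetical and only illustrate the idea. A sorted array is rearranged so that the element at slot k has its children at slots 2k+1 and 2k+2, and a lookup then walks the array from the root downwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy `sorted` into `out` in Eytzinger (BFS) order by
// doing an in-order walk over the implicit tree whose root is slot `k`.
// `pos` is the index of the next unplaced element of `sorted`.
inline std::size_t eytzinger_fill(const std::vector<int>& sorted,
                                  std::vector<int>& out,
                                  std::size_t pos, std::size_t k) {
    if (k < out.size()) {
        pos = eytzinger_fill(sorted, out, pos, 2 * k + 1);  // smaller keys go left
        out[k] = sorted[pos++];
        pos = eytzinger_fill(sorted, out, pos, 2 * k + 2);  // larger keys go right
    }
    return pos;
}

// Build the BFS-ordered array from an already sorted input.
inline std::vector<int> eytzinger_build(const std::vector<int>& sorted) {
    std::vector<int> out(sorted.size());
    const std::size_t placed = eytzinger_fill(sorted, out, 0, 0);
    assert(placed == sorted.size());  // every element was placed exactly once
    (void)placed;
    return out;
}

// Membership test: start at slot 0 and descend to child 2k+1 or 2k+2.
inline bool eytzinger_contains(const std::vector<int>& tree, int key) {
    std::size_t k = 0;
    while (k < tree.size()) {
        if (tree[k] == key) return true;
        k = 2 * k + (key < tree[k] ? 1 : 2);
    }
    return false;
}
```

For the sorted input 1..7 this yields the layout 4, 2, 6, 1, 3, 5, 7. The first few levels of the tree sit next to each other at the front of the array, which is where the better memory locality comes from.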

Intermediate Data Storage / Caching / Hierarchical Data Stores (binary hOCR; document text revisions; ...)

  • CacheLib 📁 🌐 -- provides an in-process high performance caching mechanism, thread-safe API to build high throughput, low overhead caching services, with built-in ability to leverage DRAM and SSD caching transparently.
  • cachelot 📁 🌐 -- is an LRU cache that works at the speed of light. The library works with a fixed pre-allocated memory. You tell the memory size and LRU cache is ready. Small metadata, up to 98% memory utilization.
  • caches 📁 🌐 -- implements a simple thread-safe cache with several page replacement policies: LRU (Least Recently Used), FIFO (First-In/First-Out), LFU (Least Frequently Used)
  • c-blosc2 📁 🌐 -- a high performance compressor optimized for binary data (i.e. floating point numbers, integers and booleans), designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call.
  • localmemcache 📁 🌐 -- a key-value database and library that provides an interface similar to memcached but for accessing local data instead of remote data. It's based on mmap()'ed shared memory for maximum speed. It supports persistence, also making it a fast alternative to GDBM and Berkeley DB.
  • lru_cache 📁 🌐 -- LRU cache is a fast, header-only, generic C++17 LRU cache library, with customizable backend.
  • lrucache11 📁 🌐 -- A header-only C++11 LRU Cache template class that allows you to define key, value and optionally the Map type. It uses a doubly linked list and a std::unordered_map style container to provide fast insert, delete and update. No dependencies other than the C++ standard library.
  • pelikan 📁 🌐 -- Pelikan is Twitter's unified cache backend.
  • stlcache 📁 🌐 -- STL::Cache is an in-memory cache for C++ applications. It is a simple wrapper over the standard map that implements some cache algorithms, allowing you to limit the storage size and automatically remove unused items. It is intended for keeping any key/value data, especially when the data is too big to simply put into a map and keep in its entirety. You can feed an enormous (practically unlimited) amount of data into STL::Cache, but it will store only a small part of it: frequently reused data stays close to your code, while less popular data does not occupy expensive memory. STL::Cache uses configurable policies to decide whether data should be kept in the cache or thrown away. It ships with 8 policies and you are free to implement your own.
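
The "doubly linked list plus std::unordered_map" design that the lrucache11 entry describes can be sketched as follows. This is an illustrative sketch assuming C++17 (for std::optional), not the API of any library listed above: LruCacheSketch and its members are hypothetical names. The list keeps entries in recency order, and the map points each key at its list node so that both lookup and reordering take constant time on average.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Hypothetical minimal LRU cache: most recently used entry is at the front.
template <typename K, typename V>
class LruCacheSketch {
public:
    explicit LruCacheSketch(std::size_t capacity) : capacity_(capacity) {}

    // Insert or update a value and mark it as most recently used.
    void put(const K& key, V value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);  // move node to front
            return;
        }
        if (capacity_ == 0) return;  // a zero-capacity cache stores nothing
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);  // evict the least recently used entry
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
        assert(order_.size() == index_.size());  // list and map stay in sync
    }

    // Return a copy of the value and mark it as most recently used.
    std::optional<V> get(const K& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        order_.splice(order_.begin(), order_, it->second);
        return it->second->second;
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<K, V>;
    std::size_t capacity_;
    std::list<Entry> order_;  // recency order, front = newest
    std::unordered_map<K, typename std::list<Entry>::iterator> index_;
};
```

std::list::splice relinks a node without invalidating iterators, so the iterators stored in the map remain valid when an entry is moved to the front.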

RAM-/disk-based large queues and stores: B+tree, LSM-tree, ...

  • arrow 📁 🌐 -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • cpp-btree 📁 🌐 -- in-memory B+-tree: an alternative for the priority queue as we expect the queue to grow huge, given past experience with Qiqqa.

  • ejdb 📁 🌐 -- an embeddable JSON database engine published under MIT license, offering a single file database, online backups support, a simple but powerful query language (JQL), based on the TokyoCabinet-inspired KV store iowow.

  • FASTER 📁 🌐 -- helps manage large application state easily, resiliently, and with high performance by offering (1) FASTER Log, which is a high-performance concurrent persistent recoverable log, iterator, and random reader library, and (2) FASTER KV as a concurrent key-value store + cache that is designed for point lookups and heavy updates. FASTER supports data larger than memory, by leveraging fast external storage (local or cloud). It also supports consistent recovery using a fast non-blocking checkpointing technique that lets applications trade-off performance for commit latency. Both FASTER KV and FASTER Log offer orders-of-magnitude higher performance than comparable solutions, on standard workloads.

  • iowow 📁 🌐 -- a C11 file storage utility library and persistent key/value storage engine, supporting multiple key-value databases within a single file, online database backups and Write Ahead Logging (WAL) support. Good performance compared to its main competitors: lmdb, leveldb, kyoto cabinet.

  • libmdbx 📁 🌐 -- one of the fastest embeddable key-value ACID databases without WAL. libmdbx surpasses the legendary LMDB in terms of reliability, features and performance.

  • libpmemobj-cpp 📁 🌐 -- a C++ binding for libpmemobj (a library which is a part of PMDK collection).

  • libshmcache 📁 🌐 -- a local shared memory cache for multiple processes. It is a high performance library because its read mechanism is lockless. libshmcache is 100+ times faster than a remote interface such as redis.

  • Lightning.NET 📁 🌐 -- .NET library for OpenLDAP's LMDB key-value store

  • ligra-graph 📁 🌐 -- LIGRA: a Lightweight Graph Processing Framework for Shared Memory; works on both uncompressed and compressed graphs and hypergraphs.

  • lmdb 📁 🌐 -- OpenLDAP LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness when used correctly.

  • lmdb-safe 📁 🌐 -- A safe, modern & performant C++ wrapper of LMDB. LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness... when used correctly. The design of LMDB is elegant and simple, which aids both the performance and stability. The downside of this elegant design is a nontrivial set of rules that need to be followed to not break things. In other words, LMDB delivers great things but only if you use it exactly right. This is by conscious design. The lmdb-safe library aims to deliver the full LMDB performance while programmatically making sure the LMDB semantics are adhered to, with very limited overhead.

  • lmdb.spreads.net 📁 🌐 -- Low-level zero-overhead and the fastest LMDB .NET wrapper with some additional native methods useful for Spreads.

  • lmdb-store 📁 🌐 -- an ultra-fast NodeJS interface to LMDB; probably the fastest and most efficient NodeJS key-value/database interface that exists for full storage and retrieval of structured JS data (objects, arrays, etc.) in a true persisted, scalable, ACID compliant database. It provides a simple interface for interacting with LMDB.

  • lmdbxx 📁 🌐 -- lmdb++: a comprehensive C++11 wrapper for the LMDB embedded database library, offering both an error-checked procedural interface and an object-oriented resource interface with RAII semantics.

  • palmtree 📁 🌐 -- concurrent lock free B+Tree

  • parallel-hashmap 📁 🌐 -- a set of hash map implementations, as well as a btree alternative to std::map and std::set

  • pmdk 📁 🌐 -- the Persistent Memory Development Kit (PMDK) is a collection of libraries and tools for System Administrators and Application Developers to simplify managing and accessing persistent memory devices.

  • pmdk-tests 📁 🌐 -- tests for Persistent Memory Development Kit

  • pmemkv 📁 🌐 -- pmemkv is a local/embedded key-value datastore optimized for persistent memory. Rather than being tied to a single language or backing implementation, pmemkv provides different options for language bindings and storage engines.

  • pmemkv-bench 📁 🌐 -- benchmark for libpmemkv and its underlying libraries, based on leveldb's db_bench. The pmemkv_bench utility provides some standard read, write & remove benchmarks. It's based on the db_bench utility included with LevelDB and RocksDB, although the list of supported parameters is slightly different.

  • riegeli 📁 🌐 -- Riegeli/records is a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding.

  • tlx-btree 📁 🌐 -- in-memory B+-tree: an alternative for the priority queue as we expect the queue to grow huge, given past experience with Qiqqa.

  • vmem 📁 🌐 -- libvmem and libvmmalloc are a couple of libraries for using persistent memory for malloc-like volatile uses. They have historically been a part of PMDK despite being solely for volatile uses. You may want to consider using memkind instead in code that benefits from extra features like NUMA awareness.

  • vmemcache 📁 🌐 -- libvmemcache is an embeddable and lightweight in-memory buffered LRU caching solution. It's designed to fully take advantage of large capacity memory, such as Persistent Memory with DAX, through memory mapping in an efficient and scalable way.

HDF5 file format

  • h5cpp 📁 🌐 -- easy to use HDF5 C++ templates for Serial and Parallel HDF5. Hierarchical Data Format HDF5 is prevalent in high performance scientific computing, sits directly on top of sequential or parallel file systems, providing block and stream operations on standardized or custom binary/text objects. Scientific computing platforms come with the necessary libraries to read and write HDF5 datasets. H5CPP simplifies interactions with popular linear algebra libraries, provides compiler assisted seamless object persistence, Standard Template Library support and comes equipped with a novel error handling architecture.

    • in-purgatory; reason: see the HDF5 entry below. However, it also advertises itself as an interface between OpenCV, Eigen, etc. at the same time...
  • HDF5 🌐

    • removed; reason: HDF5 is a nice concept but considered overkill right now; where we need disk stores, we'll be using SQLite or LMDB-like key-value stores instead. Such stores are not meant to be interchangeable with other software in their raw shape and we'll provide public access APIs instead, where applicable.
  • HighFive-HDF5 🌐

    • removed; reason: see the HDF5 entry above.

Data Storage / Caching / IPC: loss-less data compression

  • 7zip 📁 🌐 -- 7-Zip: 7-zip.org

  • 7-Zip-zstd 📁 🌐 -- 7-Zip ZS with support of additional Codecs: Zstandard, Brotli, LZ4, LZ5, Lizard, Fast LZMA2

  • bit7z 📁 🌐 -- a library offering a clean and simple interface to the 7-zip shared libraries.

  • brotli 📁 🌐 -- compression

  • bxzstr 📁 🌐 -- a header-only library for using standard c++ iostreams to access streams compressed with ZLib, libBZ2, libLZMA, or libZstd (.gz, .bz2, .xz, and .zst files). For decompression, the format is automatically detected. For compression, the only parameter exposed is the compression algorithm.

  • bzip2 📁 🌐 -- bzip2 with minor modifications to original sources.

  • bzip3 📁 🌐 -- a better, faster and stronger spiritual successor to BZip2. Features higher compression ratios and better performance thanks to an order-0 context mixing entropy coder, a fast Burrows-Wheeler transform code making use of suffix arrays and an RLE with Lempel Ziv+Prediction pass based on LZ77-style string matching and PPM-style context modeling. Like its ancestor, BZip3 excels at compressing text or code.

  • c-blosc2 📁 🌐 -- a high performance compressor optimized for binary data (i.e. floating point numbers, integers and booleans), designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call.

  • density 📁 🌐 -- a superfast compression library. It is focused on high-speed compression, at the best ratio possible. All three of DENSITY's algorithms are currently at the pareto frontier of compression speed vs ratio (cf. here for an independent benchmark).

  • densityxx 📁 🌐 -- the C++ version of density, which is a superfast compression library.

  • easylzma 📁 🌐 -- a C library and command line tools for LZMA compression and decompression. It uses Igor Pavlov's reference implementation and SDK written in C.

  • fast-lzma2 📁 🌐 -- the Fast LZMA2 Library is a lossless high-ratio data compression library based on Igor Pavlov's LZMA2 codec from 7-zip. Binaries of 7-Zip forks which use the algorithm are available in the 7-Zip-FL2 project, the 7-Zip-zstd project, and the active fork of p7zip. The library is also embedded in a fork of XZ Utils, named FXZ Utils.

  • fast_pfor 📁 🌐 -- a research library with integer compression schemes. It is broadly applicable to the compression of arrays of 32-bit integers where most integers are small. The library seeks to exploit SIMD instructions (SSE) whenever possible.

  • fsst 📁 🌐 -- Fast Static Symbol Table (FSST): fast text compression that allows random access. See also the PVLDB paper https://github.com/cwida/fsst/raw/master/fsstcompression.pdf. FSST is a compression scheme focused on string/text data: it can compress strings from distributions with many different values (i.e. where dictionary compression will not work well). It allows random-access to compressed data: it is not block-based, so individual strings can be decompressed without touching the surrounding data in a compressed block. When compared to e.g. LZ4 (which is block-based), FSST further achieves similar decompression speed and compression speed, and better compression ratio. FSST encodes strings using a symbol table -- but it works on pieces of the string, as it maps "symbols" (1-8 byte sequences) onto "codes" (single-bytes). FSST can also represent a byte as an exception (255 followed by the original byte). Hence, compression transforms a sequence of bytes into a (supposedly shorter) sequence of codes or escaped bytes. These shorter byte-sequences could be seen as strings again and fit in whatever your program is that manipulates strings. An optional 0-terminated mode (like, C-strings) is also supported.

  • libbsc 📁 🌐 -- a library for lossless, block-sorting data compression. bsc is a high performance file compressor based on lossless, block-sorting data compression algorithms.

  • libCSD 📁 🌐 -- a C++ library providing some different techniques for managing string dictionaries in compressed space. These approaches are inspired by the paper: "Compressed String Dictionaries", Nieves R. Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro, 10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011.

  • libdeflate 📁 🌐 -- heavily optimized library for DEFLATE/zlib/gzip compression and decompression.

  • libsais 📁 🌐 -- a library for fast linear time suffix array, longest common prefix array and Burrows-Wheeler transform construction based on the induced sorting algorithm described in the following papers:

    • Ge Nong, Sen Zhang, Wai Hong Chan Two Efficient Algorithms for Linear Suffix Array Construction, 2009
    • Juha Karkkainen, Giovanni Manzini, Simon J. Puglisi Permuted Longest-Common-Prefix Array, 2009
    • Nataliya Timoshevskaya, Wu-chun Feng SAIS-OPT: On the characterization and optimization of the SA-IS algorithm for suffix array construction, 2014
    • Jing Yi Xie, Ge Nong, Bin Lao, Wentao Xu Scalable Suffix Sorting on a Multicore Machine, 2020

    libsais is inspired by the libdivsufsort and sais libraries by Yuta Mori and by msufsort by Michael Maniscalco.

  • libzip 📁 🌐 -- a C library for reading, creating, and modifying zip and zip64 archives.

  • libzopfli 📁 🌐 -- Zopfli Compression Algorithm is a compression library programmed in C to perform very good, but slow, deflate or zlib compression.

  • lizard 📁 🌐 -- efficient compression with very fast decompression. Lizard (formerly LZ5) is a lossless compression algorithm which contains 4 compression methods:

    • fastLZ4 : compression levels -10...-19 are designed to give better decompression speed than LZ4, i.e. over 2000 MB/s
    • fastLZ4 + Huffman : compression levels -30...-39 add Huffman coding to fastLZ4
    • LIZv1 : compression levels -20...-29 are designed to give better ratio than LZ4 while keeping 75% of its decompression speed
    • LIZv1 + Huffman : compression levels -40...-49 give the best ratio (comparable to zlib and low levels of zstd/brotli) at decompression speed of 1000 MB/s
  • lz4 📁 🌐 -- LZ4 is a lossless compression algorithm, providing compression speed > 500 MB/s per core, scalable with multi-core CPUs. It features an extremely fast decoder, with speed in multiple GB/s per core, typically reaching RAM speed limits on multi-core systems.

  • lzbench 📁 🌐 -- an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors. It joins all compressors into a single exe.

  • lzham_codec 📁 🌐 -- LZHAM is a lossless data compression codec, with a compression ratio similar to LZMA but with 1.5x-8x faster decompression speed.

  • lzma 📁 🌐 -- LZMA Utils is an attempt to provide LZMA compression to POSIX-like systems. The idea is to have a gzip-like command line tool and a zlib-like library, which would make it easy to adapt the new compression technology to existing applications.

  • p7zip 📁 🌐 -- p7zip-zstd = 7zip with extensions, including major modern codecs such as Brotli, Fast LZMA2, LZ4, LZ5, Lizard and Zstd.

  • shoco 📁 🌐 -- a fast compressor for short strings

  • snappy 📁 🌐 -- an up-to-date fork of google/snappy, a fast compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression.

  • squash 📁 🌐 -- an abstraction library which provides a single API to access many compression libraries, allowing applications a great deal of flexibility when choosing a compression algorithm, or allowing a choice between several of them.

  • Turbo-Range-Coder 📁 🌐 -- TurboRC: Turbo Range Coder + rANS Asymmetric Numeral Systems is a very fast (branchless) Range Coder / Arithmetic Coder.

  • xz 📁 🌐 -- XZ Utils provide a general-purpose data-compression library plus command-line tools. The native file format is the .xz format, but also the legacy .lzma format is supported. The .xz format supports multiple compression algorithms, which are called "filters" in the context of XZ Utils. The primary filter is currently LZMA2. With typical files, XZ Utils create about 30 % smaller files than gzip.

  • zfp-compressed-arrays 📁 🌐 -- zfp is a compressed format for representing multidimensional floating-point and integer arrays. zfp provides compressed-array classes that support high throughput read and write random access to individual array elements. zfp also supports serial and parallel (OpenMP and CUDA) compression of whole arrays, e.g., for applications that read and write large data sets to and from disk.

  • zstd 📁 🌐 -- Zstandard, a.k.a. zstd, is a fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios.

  • lzo 🌐

    • removed; reason: gone as part of the first round of compression libraries' cleanup: we intend to support lz4 for fast work, plus zstd and maybe brotli for higher compression ratios, while we won't bother with anything else: the rest can be dealt with through Apache Tika or other thirdparty pipelines when we need to read (or write) them. See also: 7zip-Zstd, which is what I use for accessing almost all compressed material anywhere.
  • lzsse 🌐

    • removed; reason: see lzo above. LZ4 either overtakes this one or is on par (anno 2022 AD) and I don't see a lot happening here, so the coolness factor is slowly fading...
  • pithy 🌐

    • removed; reason: see lzo above. LZ4 either overtakes this one or is on par (anno 2022 AD) and I don't see a lot happening here, so the coolness factor is slowly fading...
  • xz-utils 🌐

    • removed; reason: see lzo above. When we want this, we can go through Apache Tika or other thirdparty pipelines.
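
The fsst entry above describes its encoding as a symbol table that maps "symbols" (1-8 byte sequences) onto single-byte "codes", with 255 acting as an escape for a literal byte. A minimal decoding sketch under exactly those assumptions could look like this; `toy_decode` and the tiny symbol table in the usage note are hypothetical, and this is not the FSST library's API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Toy decoder for an FSST-style encoding (illustration only):
// each code byte indexes a symbol table; code 255 is an escape
// meaning "the next byte is a literal byte".
std::string toy_decode(const std::vector<std::string>& symbols,
                       const std::vector<std::uint8_t>& codes) {
    assert(symbols.size() <= 255);  // code 255 is reserved for the escape
    std::string out;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        const std::uint8_t c = codes[i];
        if (c == 255) {
            // Escaped literal: copy the following byte verbatim.
            if (++i == codes.size()) throw std::runtime_error("truncated escape");
            out.push_back(static_cast<char>(codes[i]));
        } else {
            // Regular code: expand it to its (multi-byte) symbol.
            if (c >= symbols.size()) throw std::runtime_error("unknown code");
            out += symbols[c];
        }
    }
    return out;
}
```

With the made-up table `{"http://", "www.", ".com"}`, the code sequence `0, 1, 255, 'x', 2` expands to `http://www.x.com`: each string is decoded on its own, which is what makes random access to individual strings possible.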

File / Directory Tree Synchronization (local and remote)

  • cdc-file-transfer 📁 🌐 -- CDC File Transfer contains tools for syncing and streaming files from Windows to Windows or Linux. The tools are based on Content Defined Chunking (CDC), in particular FastCDC, to split up files into chunks.
  • CryptSync 📁 🌐 -- a small utility that synchronizes two folders while encrypting the contents in one folder. That means one of the two folders has all files unencrypted (the files you work with) and the other folder has all the files encrypted. This is best used together with cloud storage tools like OneDrive, DropBox or Google Drive.
  • csync2 📁 🌐 -- a cluster synchronization tool. It can be used to keep files on multiple hosts in a cluster in sync. Csync2 can handle complex setups with much more than just 2 hosts, handle file deletions and can detect conflicts.
  • filecopyex3 📁 🌐 -- a FAR plugin designed to bring to life all kinds of perverted fantasies on the topic of file copying, each of which will speed up the process by 5% 😄. At the moment, it has implemented the main features that are sometimes quite lacking in standard copiers.
  • FreeFileSync 📁 🌐 -- a folder comparison and synchronization application that creates and manages backup copies of all your important files. Instead of copying every file every time, FreeFileSync determines the differences between a source and a target folder and transfers only the minimum amount of data needed. FreeFileSync is available for Windows, macOS, and Linux.
  • lib_nas_lockfile 📁 🌐 -- lockfile management on NAS and other disparate network filesystem storage. To be combined with SQLite to create a proper Qiqqa Sync operation.
  • librsync 📁 🌐 -- a library for calculating and applying network deltas. librsync encapsulates the core algorithms of the rsync protocol, which help with efficient calculation of the differences between two files. The rsync algorithm is different from most differencing algorithms because it does not require the presence of the two files to calculate the delta. Instead, it requires a set of checksums of each block of one file, which together form a signature for that file. Blocks at any position in the other file which have the same checksum are likely to be identical, and whatever remains is the difference. This algorithm transfers the differences between two files without needing both files on the same system.
  • rclone 📁 🌐 -- Rclone ("rsync for cloud storage") is a command-line program to sync files and directories to and from different cloud storage providers. See the full list of all storage providers and their features.
  • rsync 📁 🌐 -- Rsync is a fast and extraordinarily versatile file copying tool for both remote and local files. Rsync uses a delta-transfer algorithm which provides a very fast method for bringing remote files into sync.
  • vcopy 📁 🌐 -- tool to safely copy files across various (local) hardware under circumstances where there may be another file writer active at the same time and/or the (USB?) connection is sometimes flakey or system I/O drivers buggered.
  • zsync2 📁 🌐 -- the advanced file download/sync tool zsync. zsync is a well-known tool for downloading and updating local files from HTTP servers, using the same well-known algorithms that rsync uses for diffing binary files. This makes it possible to synchronize modifications by exchanging only the changed blocks, using Range: requests. The system is based on meta files called .zsync files. They contain hash sums for every block of data. The file is generated from and stored along with the actual file it refers to. Due to how the system works, nothing but a "dumb" HTTP server is required to make use of zsync2. This makes it easy to integrate zsync2 into existing systems.
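
The librsync entry above describes matching blocks through per-block checksums ("a signature") rather than by comparing both files side by side. A minimal sketch of that idea follows, assuming a simple rsync-style weak rolling checksum; the names `Rolling`, `make_signature` and `find_matches` are made up for illustration and are not librsync's API. A real implementation would also confirm each weak match with a strong hash before trusting it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Weak rolling checksum: a = sum of bytes, b = sum of the running values of a.
struct Rolling {
    std::uint32_t a = 0, b = 0, len = 0;
    void init(const std::string& s, std::size_t pos, std::size_t n) {
        a = b = 0;
        len = static_cast<std::uint32_t>(n);
        for (std::size_t i = 0; i < n; ++i) {
            a += static_cast<unsigned char>(s[pos + i]);
            b += a;
        }
    }
    // Slide the window by one byte: drop `out`, append `in`.
    void roll(unsigned char out, unsigned char in) {
        a = a - out + in;
        b = b - len * out + a;
    }
    std::uint32_t digest() const { return (a & 0xFFFFu) | (b << 16); }
};

// Signature of the basis file: weak checksum of each full block -> block index.
std::unordered_map<std::uint32_t, std::size_t>
make_signature(const std::string& basis, std::size_t block) {
    assert(block > 0);
    std::unordered_map<std::uint32_t, std::size_t> sig;
    Rolling r;
    for (std::size_t i = 0; i + block <= basis.size(); i += block) {
        r.init(basis, i, block);
        sig.emplace(r.digest(), i / block);
    }
    return sig;
}

// Scan the other file at every byte offset; report (offset, basis block index)
// wherever the weak checksum matches a block listed in the signature.
std::vector<std::pair<std::size_t, std::size_t>>
find_matches(const std::unordered_map<std::uint32_t, std::size_t>& sig,
             const std::string& target, std::size_t block) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    if (block == 0 || target.size() < block) return out;
    Rolling r;
    r.init(target, 0, block);
    for (std::size_t pos = 0;; ++pos) {
        auto it = sig.find(r.digest());
        if (it != sig.end()) out.emplace_back(pos, it->second);
        if (pos + block >= target.size()) break;
        r.roll(static_cast<unsigned char>(target[pos]),
               static_cast<unsigned char>(target[pos + block]));
    }
    return out;
}
```

Only the signature has to travel to the side that holds the other file; everything not covered by a match is what remains to be sent as the delta.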

OCR: hOCR output format, other output formats? (dedicated binary?)

  • archive-hocr-tools 📁 🌐 -- a python package to ease hOCR parsing in a streaming manner.

  • hocr-fileformat 📁 🌐 -- tools to validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

  • hocr-spec 📁 🌐 -- the hOCR Embedded OCR Workflow and Output Format specification originally written by Thomas Breuel.

  • hocr-tools 📁 🌐 -- a Public Specification and tools for the hOCR Format.

    hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.

Pattern Recognition

"A.I." for cover pages, image/page segmentation, including abstract & summary demarcation, "figure" and "table" detection & extraction from documents, ...

BLAS, LAPACK, ...

  • amd-fftw 📁 🌐 -- AOCL-FFTW is AMD's optimized version of the FFTW implementation, targeted at AMD EPYC CPUs. It is developed on top of FFTW (version fftw-3.3.10). AOCL-FFTW achieves high performance as a result of its various optimizations involving improved SIMD Kernel functions, improved copy functions (cpy2d and cpy2d_pair used in rank-0 transform and buffering plan), improved 256-bit kernels selection by Planner and an optional in-place transpose for large problem sizes. AOCL-FFTW improves the performance of in-place MPI FFTs by employing a faster in-place MPI transpose function.

  • armadillo 📁 🌐 -- C++ library for linear algebra & scientific computing

  • autodiff 📁 🌐 -- a C++17 library that uses modern and advanced programming techniques to enable automatic computation of derivatives in an efficient, easy, and intuitive way.

  • BaseMatrixOps 📁 🌐 -- wrappers to C++ linear algebra libraries. No guarantees made about APIs or functionality.

  • blis 📁 🌐 -- BLIS is an award-winning portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. BLIS is written in ISO C99 and available under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like API, it also includes a BLAS compatibility layer which gives application developers access to BLIS implementations via traditional BLAS routine calls. An object-based API unique to BLIS is also available.

  • clBLAS 📁 🌐 -- the OpenCL™ BLAS portion of OpenCL's clMath. The complete set of BLAS level 1, 2 & 3 routines is implemented. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming. The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces neither hide nor wrap OpenCL interfaces, but rather leave OpenCL state management to the control of the user to allow for maximum performance and flexibility. The clBLAS library does generate and enqueue optimized OpenCL kernels, relieving the user from the task of writing, optimizing and maintaining kernel code themselves.

  • CLBlast 📁 🌐 -- the tuned OpenCL BLAS library. CLBlast is a modern, lightweight, performant and tunable OpenCL BLAS library written in C++11. It is designed to leverage the full performance potential of a wide variety of OpenCL devices from different vendors, including desktop and laptop GPUs, embedded GPUs, and other accelerators. CLBlast implements BLAS routines: basic linear algebra subprograms operating on vectors and matrices.

  • CLBlast-database 📁 🌐 -- the full database of tuning results for the CLBlast OpenCL BLAS library. Tuning results are obtained using CLBlast and the CLTune auto-tuner.

  • CLTune 📁 🌐 -- automatic OpenCL kernel tuning for CLBlast: CLTune is a C++ library which can be used to automatically tune your OpenCL and CUDA kernels. The only thing you'll need to provide is a tuneable kernel and a list of allowed parameters and values.

  • Cmathtuts 📁 🌐 -- a collection of linear algebra math tutorials in C for BLAS, LAPACK and other fundamental APIs. These include samples for BLAS, LAPACK, CLAPACK, LAPACKE, ATLAS, OpenBLAS ...

  • efftw 📁 🌐 -- Eigen-FFTW is a modern C++20 wrapper library around FFTW for Eigen.

  • ensmallen 📁 🌐 -- a high-quality C++ library for non-linear numerical optimization. ensmallen provides many types of optimizers that can be used for virtually any numerical optimization task. This includes gradient descent techniques, gradient-free optimizers, and constrained optimization. Examples include L-BFGS, SGD, CMAES and Simulated Annealing.

  • fastapprox 📁 🌐 -- approximate and vectorized versions of common mathematical functions (e.g. exponential, logarithm, and power, lgamma and digamma, cosh, sinh, tanh, cos, sin, tan, sigmoid and erf, Lambert W)

  • fastrange 📁 🌐 -- a fast alternative to the modulo reduction. It has accelerated some operations in Google's Tensorflow by 10% to 20%. Further reading: http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ See also: Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, January 2019, Article No. 3, https://doi.org/10.1145/3230636

  • ffts 📁 🌐 -- FFTS -- The Fastest Fourier Transform in the South.

  • gcem 📁 🌐 -- GCE-Math (Generalized Constant Expression Math) is a templated C++ library enabling compile-time computation of mathematical functions.

  • gemmlowp 📁 🌐 -- a small self-contained low-precision GEMM library. gemmlowp is not a full linear algebra library, only a GEMM library: it only does general matrix multiplication ("GEMM").

  • GraphBLAS 📁 🌐 -- SuiteSparse:GraphBLAS is a complete implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring.

  • h5cpp 📁 🌐 -- easy to use HDF5 C++ templates for Serial and Parallel HDF5. Hierarchical Data Format HDF5 is prevalent in high performance scientific computing, sits directly on top of sequential or parallel file systems, providing block and stream operations on standardized or custom binary/text objects. Scientific computing platforms come with the necessary libraries to read and write HDF5 datasets. H5CPP simplifies interactions with popular linear algebra libraries, provides compiler assisted seamless object persistence, Standard Template Library support and comes equipped with a novel error handling architecture.

  • Imath 📁 🌐 -- a basic, light-weight, and efficient C++ representation of 2D and 3D vectors and matrices and other simple but useful mathematical objects, functions, and data types common in computer graphics applications, including the “half” 16-bit floating-point type.

  • itpp 📁 🌐 -- IT++ is a C++ library of mathematical, signal processing and communication classes and functions. Its main use is in simulation of communication systems and for performing research in the area of communications. The kernel of the library consists of generic vector and matrix classes, and a set of accompanying routines. Such a kernel makes IT++ similar to MATLAB or GNU Octave. The IT++ library originates from the former department of Information Theory at the Chalmers University of Technology, Gothenburg, Sweden.

  • kalman-cpp 📁 🌐 -- Kalman filter and extended Kalman filter implementation in C++. Implements Kalman, Extended Kalman, Second-order extended Kalman and Unscented Kalman filters.

  • kissfft 📁 🌐 -- KISS FFT - a mixed-radix Fast Fourier Transform based upon the principle, "Keep It Simple, Stupid."

  • lapack 📁 🌐 -- CBLAS + LAPACK optimized linear algebra libs

  • libalg 📁 🌐 -- the mathematical ALGLIB library for C++.

  • libbf 📁 🌐 -- a small library to handle arbitrary precision binary or decimal floating point numbers

  • libcnl 📁 🌐 -- The Compositional Numeric Library (CNL) is a C++ library of fixed-precision numeric classes which enhance integers to deliver safer, simpler, cheaper arithmetic types. CNL is particularly well-suited to: (1) compute on energy-constrained environments where FPUs are absent or costly; (2) compute on energy-intensive environments where arithmetic is the bottleneck such as simulations, machine learning applications and DSPs; and (3) domains such as finance where precision is essential.

  • libeigen 📁 🌐 -- a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.

  • math-atlas 📁 🌐 -- The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance, delivering an efficient BLAS implementation, as well as a few routines from LAPACK.

  • mipp 📁 🌐 -- MyIntrinsics++ (MIPP): a portable wrapper for vector intrinsic functions (SIMD) written in C++11. It works for SSE, AVX, AVX-512 and ARM NEON (32-bit and 64-bit) instructions. The MIPP wrapper supports single/double precision floating-point numbers and also signed integer arithmetic (64-bit, 32-bit, 16-bit and 8-bit). With the MIPP wrapper you do not need to write architecture-specific intrinsic code anymore. Just use the provided functions and the wrapper will automatically generate the right intrinsic calls for your specific architecture.

  • mlpack 📁 🌐 -- an intuitive, fast, and flexible C++ machine learning library, meant to be a machine learning analog to LAPACK, aiming to implement a wide array of machine learning methods and functions as a "swiss army knife" for machine learning researchers.

  • nfft 📁 🌐 -- Nonequispaced FFT (NFFT) is a software library, written in C, for computing non-equispaced fast Fourier transforms and related variations.

  • OpenBLAS 📁 🌐 -- an optimized BLAS (Basic Linear Algebra Subprograms) library based on GotoBLAS2 1.13 BSD version.

  • OpenCL-CLHPP 📁 🌐 -- Khronos OpenCL C++ Bindings: the interface is contained within a single C++ header file opencl.hpp and all definitions are contained within the namespace cl. There is no additional requirement to include cl.h and to use either the C++ or original C bindings; it is enough to simply include opencl.hpp. The bindings themselves are lightweight and correspond closely to the underlying C API. Using the C++ bindings introduces no additional execution overhead.

  • OpenCL-CTS 📁 🌐 -- the OpenCL Conformance Test Suite (CTS) for all versions of the Khronos OpenCL standard.

  • OpenCL-Guide 📁 🌐 -- OpenCL Guide: written to help developers get up and running quickly with the Khronos® Group's OpenCL™ programming framework. It is an introductory read that covers the background and key concepts of OpenCL, but also contains links to more detailed materials that developers can use to explore the capabilities of OpenCL that interest them most.

  • OpenCL-Headers 📁 🌐 -- C language headers for the OpenCL API.

  • OpenCL-SDK 📁 🌐 -- the Khronos OpenCL SDK. It brings together all the components needed to develop OpenCL applications.

  • OpenCL-Wrapper 📁 🌐 -- OpenCL is the most powerful programming language ever created, yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. Lightweight C++17 OpenCL-Wrapper greatly simplifies OpenCL software development with C++ while keeping functionality and performance.

  • optim 📁 🌐 -- OptimLib is a lightweight C++ library of numerical optimization methods for nonlinear functions. Features a C++11/14/17 library of local and global optimization algorithms, as well as root finding techniques, derivative-free optimization using advanced, parallelized metaheuristic methods and constrained optimization routines to handle simple box constraints, as well as systems of nonlinear constraints.

  • QuantLib 📁 🌐 -- the free/open-source library for quantitative finance, providing a comprehensive software framework for quantitative finance. QuantLib is a free/open-source library for modeling, trading, and risk management in real-life.

  • sdsl-lite 📁 🌐 -- The Succinct Data Structure Library (SDSL) is a powerful and flexible C++11 library implementing succinct data structures. In total, the library contains the highlights of 40 research publications. Succinct data structures can represent an object (such as a bitvector or a tree) in space close to the information-theoretic lower bound of the object while supporting operations of the original object efficiently. The theoretical time complexity of an operation performed on the classical data structure and the equivalent succinct data structure are (most of the time) identical.

  • stan-math 📁 🌐 -- the Stan Math Library is a C++, reverse-mode automatic differentiation library designed to be usable, extensive and extensible, efficient, scalable, stable, portable, and redistributable in order to facilitate the construction and utilization of algorithms that utilize derivatives.

  • stats 📁 🌐 -- StatsLib is a templated C++ library of statistical distribution functions, featuring unique compile-time computing capabilities and seamless integration with several popular linear algebra libraries. Features a header-only library of probability density functions, cumulative distribution functions, quantile functions, and random sampling methods. Functions are written in C++11 constexpr format, enabling the library to operate as both a compile-time and run-time computation engine. Functions to compute the cdf, pdf and quantile, as well as random sampling methods, are available for the following distributions: Bernoulli, Beta, Binomial, Cauchy, Chi-squared, Exponential, F, Gamma, Inverse-Gamma, Inverse-Gaussian, Laplace, Logistic, Log-Normal, Normal (Gaussian), Poisson, Rademacher, Student's t, Uniform and Weibull. In addition, pdf and random sampling functions are available for several multivariate distributions: inverse-Wishart, Multivariate Normal and Wishart.

  • SuiteSparse 📁 🌐 -- a set of sparse-matrix-related packages written or co-authored by Tim Davis, available at https://github.com/DrTimothyAldenDavis/SuiteSparse . Packages:

    • AMD - approximate minimum degree ordering. This is the built-in AMD function in MATLAB.
    • BTF - permutation to block triangular form
    • CAMD - constrained approximate minimum degree ordering
    • CCOLAMD - constrained column approximate minimum degree ordering
    • CHOLMOD - sparse Cholesky factorization. Requires AMD, COLAMD, CCOLAMD, the BLAS, and LAPACK. Optionally uses METIS. This is chol and x=A\b in MATLAB.
    • COLAMD - column approximate minimum degree ordering. This is the built-in COLAMD function in MATLAB.
    • CSparse - a concise sparse matrix package, developed for Tim Davis' book, "Direct Methods for Sparse Linear Systems", published by SIAM. Intended primarily for teaching. For production, use CXSparse instead.
    • CXSparse - CSparse Extended. Includes support for complex matrices and both int and long integers. Use this instead of CSparse for production use; it creates a libcsparse.so (or dylib on the Mac) with the same name as CSparse. It is a superset of CSparse.
    • GraphBLAS - graph algorithms in the language of linear algebra. https://graphblas.org
    • KLU - sparse LU factorization, primarily for circuit simulation. Requires AMD, COLAMD, and BTF. Optionally uses CHOLMOD, CAMD, CCOLAMD, and METIS.
    • LAGraph - a graph algorithms library based on GraphBLAS. See also https://github.com/GraphBLAS/LAGraph
    • LDL - a very concise LDL' factorization package
  • tindicators 📁 🌐 -- a library of technical analysis indicators. It provides over 160 indicators and is blazing fast.

  • tinyexpr 📁 🌐 -- a very small recursive descent parser and evaluation engine for math expressions.

  • universal-numbers 📁 🌐 -- a header-only C++ template library for universal number arithmetic. The goal of the Universal Numbers Library is to offer applications alternatives to IEEE floating-point that are more efficient and mathematically robust. The Universal library is a ready-to-use header-only library that provides plug-in replacement for native types, and provides a low-friction environment to start exploring alternatives to IEEE floating-point in your own algorithms.

  • xsimd 📁 🌐 -- SIMD (Single Instruction, Multiple Data) instructions differ between microprocessor vendors and compilers. xsimd provides a unified means for using these features for library authors. It enables manipulation of batches of numbers with the same arithmetic operators as for single values. It also provides accelerated implementation of common mathematical functions operating on batches.

delta features & other feature extraction (see Qiqqa research notes)

  • diffutils 📁 🌐 -- the GNU diff, diff3, sdiff, and cmp utilities. Their features are a superset of the Unix features and they are significantly faster.

  • dtl-diff-template-library 📁 🌐 -- dtl is the diff template library written in C++.

  • google-diff-match-patch 📁 🌐 -- Diff, Match and Patch offers robust algorithms to perform the operations required for synchronizing plain text.

    1. Diff:
      • Compare two blocks of plain text and efficiently return a list of differences.
    2. Match:
      • Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.
    3. Patch:
      • Apply a list of patches onto plain text. Use best-effort to apply patch even when the underlying text doesn't match.

    Originally built in 2006 to power Google Docs.

  • HDiffPatch 📁 🌐 -- a library and command-line tools for diff & patch between binary files or directories (folders); cross-platform, fast, produces small deltas, and supports large files while limiting memory use during both diff and patch.

  • libdist 📁 🌐 -- string distance related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring & subsequence) implemented as SQLite run-time loadable extension, with UTF-8 support.

  • libharry 📁 🌐 -- Harry - A Tool for Measuring String Similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance, the Jaro-Winkler distance or the spectrum kernel.
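
    As a concrete example of one such measure, here is a minimal sketch of the Levenshtein distance using the textbook dynamic-programming recurrence. It is not taken from Harry or libdist; the function name `levenshtein` is an assumption for this example, and it compares raw bytes rather than UTF-8 code points.

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Number of single-byte insertions, deletions and substitutions needed
    // to turn `a` into `b`. Only two rows of the DP table are kept.
    std::size_t levenshtein(const std::string& a, const std::string& b) {
        std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
        for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // "" -> b[0..j)
        for (std::size_t i = 1; i <= a.size(); ++i) {
            cur[0] = i;  // a[0..i) -> ""
            for (std::size_t j = 1; j <= b.size(); ++j) {
                const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
            }
            std::swap(prev, cur);
        }
        return prev[b.size()];
    }
    ```

    For instance, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion).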

  • open-vcdiff 📁 🌐 -- an encoder and decoder for the VCDIFF format, as described in RFC 3284: The VCDIFF Generic Differencing and Compression Data Format.

  • rollinghashcpp 📁 🌐 -- randomized rolling hash functions in C++. This is a set of C++ classes implementing various recursive n-gram hashing techniques, also called rolling hashing (http://en.wikipedia.org/wiki/Rolling_hash), including Randomized Karp-Rabin (sometimes called Rabin-Karp), Hashing by Cyclic Polynomials (also known as Buzhash) and Hashing by Irreducible Polynomials.
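
    The core idea of rolling hashing can be sketched as follows. This is a plain (non-randomized) polynomial Karp-Rabin hash written for illustration only, not rollinghashcpp's classes: the function name `rolling_hashes` and the base 257 are assumptions, and arithmetic simply wraps modulo 2^64.

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hash of every n-byte window of `text`, each derived from the previous
    // one in O(1):  H(window) = sum_k c_k * B^(n-1-k)  (mod 2^64).
    std::vector<std::uint64_t> rolling_hashes(const std::string& text, std::size_t n) {
        const std::uint64_t B = 257;  // arbitrary odd base (assumption)
        std::vector<std::uint64_t> out;
        if (n == 0 || text.size() < n) return out;

        std::uint64_t top = 1;  // B^(n-1): weight of the byte leaving the window
        for (std::size_t i = 1; i < n; ++i) top *= B;

        std::uint64_t h = 0;
        for (std::size_t i = 0; i < n; ++i)
            h = h * B + static_cast<unsigned char>(text[i]);
        out.push_back(h);

        for (std::size_t i = n; i < text.size(); ++i) {
            h -= top * static_cast<unsigned char>(text[i - n]);  // drop oldest byte
            h = h * B + static_cast<unsigned char>(text[i]);     // shift in newest byte
            out.push_back(h);
        }
        return out;
    }
    ```

    With `rolling_hashes("abcabc", 3)` the first and last of the four window hashes are equal, since both windows contain "abc".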

  • ssdeep 📁 🌐 -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • xdelta 📁 🌐 -- a C library and command-line tool for delta compression using VCDIFF/RFC 3284 streams.

  • yara-pattern-matcher 📁 🌐 -- for automated and user-specified pattern recognition in custom document & metadata cleaning / processing tasks

fuzzy matching

  • FM-fast-match 📁 🌐 -- FAsT-Match: a port of the Fast Affine Template Matching algorithm (Simon Korman, Daniel Reichman, Gilad Tsur, Shai Avidan, CVPR 2013, Portland)

  • fuzzy-match 📁 🌐 -- FuzzyMatch-cli is a command-line utility for compiling FuzzyMatch indexes and using them to look up fuzzy matches. Okapi BM25 prefiltering is available on branch bm25.

  • libdist 📁 🌐 -- string distance related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring & subsequence) implemented as SQLite run-time loadable extension, with UTF-8 support.

  • lshbox 📁 🌐 -- a C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval. Locality-Sensitive Hashing (LSH) is an efficient method for large scale image retrieval, and it achieves great performance in approximate nearest neighbor search.

    LSHBOX is a simple but robust C++ toolbox that provides several LSH algorithms; in addition, it can be integrated with Python and MATLAB. The following LSH algorithms have been implemented in LSHBOX:

    • LSH Based on Random Bits Sampling
    • Random Hyperplane Hashing
    • LSH Based on Thresholding
    • LSH Based on p-Stable Distributions
    • Spectral Hashing (SH)
    • Iterative Quantization (ITQ)
    • Double-Bit Quantization Hashing (DBQ)
    • K-means Based Double-Bit Quantization Hashing (KDBQ)
  • pdiff 📁 🌐 -- perceptualdiff (pdiff): a program that compares two images using a perceptually based image metric.

  • rollinghashcpp 📁 🌐 -- randomized rolling hash functions in C++. This is a set of C++ classes implementing various recursive n-gram hashing techniques, also called rolling hashing (http://en.wikipedia.org/wiki/Rolling_hash), including Randomized Karp-Rabin (sometimes called Rabin-Karp), Hashing by Cyclic Polynomials (also known as Buzhash) and Hashing by Irreducible Polynomials.

  • sdhash 📁 🌐 -- a tool which allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. It is designed to provide quick results during triage and initial investigation phases.

  • ssdeep 📁 🌐 -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • SSIM 📁 🌐 -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!).

    Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author. Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all (and provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM). Most of the problems come from the original implementation being in MATLAB, which not everyone can use. Running the same code in open source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers.

    The original paper doesn't define how to handle color images and doesn't specify what color space the grayscale values represent (linear? gamma compressed?), which adds to the inconsistencies. The lack of color handling causes some visibly different images to be rated as visually perfect by SSIM as published. The paper demonstrates so many issues when using SSIM with color images that they state "we advise not to use SSIM with color images". All of this is a shame since the underlying concept works well for the given compute complexity.

    A good first step to cleaning up this mess is trying to get widely used implementations to match the author results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices and, most importantly, provides original, MIT licensed, single file C++ header and single file C# implementations; each reproduces the original author code better than any other version he has found.

  • ssimulacra2 📁 🌐 -- Structural SIMilarity Unveiling Local And Compression Related Artifacts metric developed by Jon Sneyers. SSIMULACRA 2 is based on the concept of the multi-scale structural similarity index measure (MS-SSIM), computed in a perceptually relevant color space, adding two other (asymmetric) error maps, and aggregating using two different norms.

  • VQMT 📁 🌐 -- VQMT (Video Quality Measurement Tool) provides fast implementations of the following objective metrics:

    • MS-SSIM: Multi-Scale Structural Similarity,
    • PSNR: Peak Signal-to-Noise Ratio,
    • PSNR-HVS: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF),
    • PSNR-HVS-M: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF) and between-coefficient contrast masking of DCT basis functions.
    • SSIM: Structural Similarity,
    • VIFp: Visual Information Fidelity, pixel domain version

    The above metrics are implemented in C++ with the help of OpenCV and are based on the original Matlab implementations provided by their developers.
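
    The simplest metric in this list, PSNR, follows directly from the mean squared error. The sketch below is not VQMT's code: the function name `psnr_8bit` and the flat 8-bit buffer layout are assumptions made for this example.

    ```cpp
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <vector>

    // PSNR in dB between two equally sized 8-bit images stored as flat arrays:
    //   MSE  = mean((a_i - b_i)^2)
    //   PSNR = 10 * log10(255^2 / MSE)
    double psnr_8bit(const std::vector<std::uint8_t>& a,
                     const std::vector<std::uint8_t>& b) {
        if (a.size() != b.size() || a.empty())
            return std::numeric_limits<double>::quiet_NaN();  // not comparable
        double sse = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
            sse += d * d;
        }
        const double mse = sse / static_cast<double>(a.size());
        if (mse == 0.0)
            return std::numeric_limits<double>::infinity();   // identical images
        return 10.0 * std::log10(255.0 * 255.0 / mse);
    }
    ```

    Identical images give infinite PSNR, while an all-black image compared against an all-white one gives 0 dB, the lowest value possible for 8-bit data.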

  • xor-and-binary-fuse-filter 📁 🌐 -- XOR and Binary Fuse Filter library: Bloom filters are used to quickly check whether an element is part of a set. Xor filters and binary fuse filters are faster and more concise alternatives to Bloom filters. They are also smaller than cuckoo filters. They are used in production systems.

decision trees

  • catboost 📁 🌐 -- a fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks. Supports computation on CPU and GPU.
  • decision-tree 📁 🌐 -- a decision tree classifier. Decision trees are a simple machine learning algorithm that uses a series of features of an observation to create a prediction of a target outcome class.
  • random-forest 📁 🌐 -- a Fast C++ Implementation of Random Forests as described in: Leo Breiman. Random Forests. Machine Learning 45(1):5-32, 2001.
  • Sherwood 📁 🌐 -- Sherwood: a library for decision forest inference, which was written by Duncan Robertson to accompany the book "A. Criminisi and J. Shotton. Decision Forests: for Computer Vision and Medical Image Analysis. Springer, 2013." The Sherwood library comprises a general purpose, object-oriented software framework for applying decision forests to a wide range of inference problems.
  • treelite 📁 🌐 -- Treelite is a universal model exchange and serialization format for decision tree forests. Treelite aims to be a small library that enables other C++ applications to exchange and store decision trees on the disk as well as the network.
  • yggdrasil-decision-forests 📁 🌐 -- Yggdrasil Decision Forests (YDF) is a production-grade collection of algorithms for the training, serving, and interpretation of decision forest models. YDF is open-source and is available in C++, command-line interface (CLI), TensorFlow (under the name TensorFlow Decision Forests; TF-DF), JavaScript (inference only), and Go (inference only).

GMM/HMM/kM

Gaussian Mixture Models / Hidden Markov Models / k-Means: fit patterns, e.g. match & transform a point cloud or image onto a template --> help matching pages against banner templates, etc. as part of the OCR/recognition task.

  • GMM-HMM-kMeans 📁 🌐 -- HMM based on KMeans and GMM
  • GMMreg 📁 🌐 -- implementations of the robust point set registration framework described in the paper "Robust Point Set Registration Using Gaussian Mixture Models", Bing Jian and Baba C. Vemuri, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(8), pp. 1633-1645. An earlier conference version of this work, "A Robust Algorithm for Point Set Registration Using Mixture of Gaussians, Bing Jian and Baba C. Vemuri.", appeared in the proceedings of ICCV'05.
  • hmm-scalable 📁 🌐 -- a tool for fitting Hidden Markov Models at scale. In particular, it targets a specific kind of HMM used in education, the Bayesian Knowledge Tracing (BKT) model.
  • hmm-stoch 📁 🌐 -- StochHMM - A Flexible hidden Markov model application and C++ library that implements HMM from simple text files. It implements traditional HMM algorithms in addition to providing additional flexibility. The additional flexibility is achieved by allowing researchers to integrate additional data sources and application code into the HMM framework.
  • liblinear 📁 🌐 -- a simple package for solving large-scale regularized linear classification, regression and outlier detection.

graph analysis, graph databases

  • arangodb 📁 🌐 -- a scalable open-source multi-model database natively supporting graph, document and search. All supported data models & access patterns can be combined in queries allowing for maximal flexibility.

  • g2o 📁 🌐 -- General Graph Optimization (G2O) is a C++ framework for optimizing graph-based nonlinear error functions. g2o has been designed to be easily extensible to a wide range of problems and a new problem typically can be specified in a few lines of code. The current implementation provides solutions to several variants of SLAM and BA.

  • GraphBLAS 📁 🌐 -- SuiteSparse:GraphBLAS is a complete implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring.

  • graph-coloring 📁 🌐 -- a C++ Graph Coloring Package. This project has two primary uses:

    • As an executable for finding the chromatic number for an input graph (in edge list or edge matrix format)
    • As a library for finding the particular coloring of an input graph (represented as a map<string,vector<string>> edge list)
  • graphit 📁 🌐 -- a High-Performance Domain Specific Language for Graph Analytics.

  • kahypar 📁 🌐 -- KaHyPar (Karlsruhe Hypergraph Partitioning) is a multilevel hypergraph partitioning framework providing direct k-way and recursive bisection based partitioning algorithms that compute solutions of very high quality.

  • libgrape-lite 📁 🌐 -- a C++ library from Alibaba for parallel graph processing (GRAPE). It differs from prior systems in its ability to parallelize sequential graph algorithms as a whole by following the PIE programming model from GRAPE. Sequential algorithms can be easily "plugged into" libgrape-lite with only minor changes and get parallelized to handle large graphs efficiently. libgrape-lite is designed to be highly efficient and flexible, to cope with the scale, variety and complexity of real-life graph applications.

  • midas 📁 🌐 -- C++ implementation of:

  • ogdf 📁 🌐 -- OGDF stands both for Open Graph Drawing Framework (the original name) and Open Graph algorithms and Data structures Framework. OGDF is a self-contained C++ library for graph algorithms, in particular for (but not restricted to) automatic graph drawing. It offers sophisticated algorithms and data structures to use within your own applications or scientific projects.

  • snap 📁 🌐 -- Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. SNAP scales to massive graphs with hundreds of millions of nodes and billions of edges.

NN, ...

  • aho_corasick 📁 🌐 -- a header-only implementation of the Aho-Corasick pattern search algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a very efficient dictionary matching algorithm that can locate all search patterns in an input text simultaneously in O(n + m), with space complexity O(m) (where n is the length of the input text, and m is the combined length of the search patterns).
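
    The automaton behind this algorithm can be sketched as below. This is an illustrative toy, not the aho_corasick library's interface: the class name `AhoCorasick` and its `find_all` method are hypothetical, and the dense 256-entry transition table per node trades memory for simplicity.

    ```cpp
    #include <array>
    #include <cstddef>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    class AhoCorasick {
    public:
        explicit AhoCorasick(const std::vector<std::string>& patterns) {
            nodes_.emplace_back();  // node 0 is the root
            // 1. Build a trie of all (non-empty) patterns.
            for (std::size_t id = 0; id < patterns.size(); ++id) {
                if (patterns[id].empty()) continue;
                int cur = 0;
                for (unsigned char c : patterns[id]) {
                    if (nodes_[cur].next[c] < 0) {
                        nodes_[cur].next[c] = static_cast<int>(nodes_.size());
                        nodes_.emplace_back();
                    }
                    cur = nodes_[cur].next[c];
                }
                nodes_[cur].out.push_back(id);
            }
            // 2. Breadth-first pass: set failure links and fill in missing edges.
            std::queue<int> bfs;
            for (int c = 0; c < 256; ++c) {
                int& child = nodes_[0].next[c];
                if (child < 0) child = 0;  // missing root edges loop back to root
                else bfs.push(child);      // depth-1 nodes keep fail == 0
            }
            while (!bfs.empty()) {
                const int u = bfs.front();
                bfs.pop();
                // Patterns ending at the failure state also end here.
                const std::vector<std::size_t>& inherited = nodes_[nodes_[u].fail].out;
                nodes_[u].out.insert(nodes_[u].out.end(), inherited.begin(), inherited.end());
                for (int c = 0; c < 256; ++c) {
                    const int v = nodes_[u].next[c];
                    const int via_fail = nodes_[nodes_[u].fail].next[c];
                    if (v < 0) nodes_[u].next[c] = via_fail;
                    else { nodes_[v].fail = via_fail; bfs.push(v); }
                }
            }
        }

        // Returns (end offset, pattern index) for every occurrence in `text`.
        std::vector<std::pair<std::size_t, std::size_t>> find_all(const std::string& text) const {
            std::vector<std::pair<std::size_t, std::size_t>> hits;
            int state = 0;
            for (std::size_t i = 0; i < text.size(); ++i) {
                state = nodes_[state].next[static_cast<unsigned char>(text[i])];
                for (std::size_t id : nodes_[state].out) hits.emplace_back(i + 1, id);
            }
            return hits;
        }

    private:
        struct Node {
            std::array<int, 256> next;
            int fail = 0;
            std::vector<std::size_t> out;  // indices of patterns ending here
            Node() { next.fill(-1); }
        };
        std::vector<Node> nodes_;
    };
    ```

    Searching "ushers" with the patterns "he", "she", "his" and "hers" reports three occurrences in a single pass over the text: "she" and "he" both ending at offset 4, and "hers" ending at offset 6.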

  • A-MNS_TemplateMatching 📁 🌐 -- the official code for the PatternRecognition2020 paper: Fast and robust template matching with majority neighbour similarity and annulus projection transformation.

  • arrayfire 📁 🌐 -- a general-purpose tensor library that simplifies the process of software development for the parallel architectures found in CPUs, GPUs, and other hardware acceleration devices. The library serves users in every technical computing market.

  • Awesome-Document-Image-Rectification 📁 🌐 -- a comprehensive list of awesome document image rectification methods based on deep learning.

  • BehaviorTree.CPP 📁 🌐 -- this C++17 library provides a framework to create BehaviorTrees. It was designed to be flexible, easy to use, reactive and fast. Even if our main use-case is robotics, you can use this library to build AI for games, or to replace Finite State Machines. BehaviorTree.CPP features asynchronous Actions, reactive behaviors, concurrent execution of multiple Actions (orthogonality), and XML-based DSL scripts which can be loaded at run-time, i.e. the morphology of the Trees is not hard-coded.

  • bhtsne--Barnes-Hut-t-SNE 📁 🌐 -- Barnes-Hut t-SNE

  • blis 📁 🌐 -- BLIS is an award-winning portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. BLIS is written in ISO C99 and available under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like API, it also includes a BLAS compatibility layer which gives application developers access to BLIS implementations via traditional BLAS routine calls. An object-based API unique to BLIS is also available.

  • bolt 📁 🌐 -- a deep learning library with high performance and heterogeneous flexibility.

  • brown-cluster 📁 🌐 -- the Brown hierarchical word clustering algorithm. Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ is the number of clusters. Algorithm by Brown, et al.: Class-Based n-gram Models of Natural Language, http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf

  • caffe 📁 🌐 -- a fast deep learning framework made with expression and modularity in mind, developed by Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC).

    • ho-hum; reason: uses google protobuffers, CUDA SDK for the GPU access (at least that's how it looks from the header files reported missing by my compiler). Needs more effort before this can be used in the monolithic production builds.
  • catboost 📁 🌐 -- a fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks. Supports computation on CPU and GPU.

  • CNTK 📁 🌐 -- the Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes neural networks as a series of computational steps via a directed graph. In this directed graph, leaf nodes represent input values or network parameters, while other nodes represent matrix operations upon their inputs. CNTK allows users to easily realize and combine popular model types such as feed-forward DNNs, convolutional nets (CNNs), and recurrent networks (RNNs/LSTMs). It implements stochastic gradient descent (SGD, error backpropagation) learning with automatic differentiation and parallelization across multiple GPUs and servers. CNTK has been available under an open-source license since April 2015. It is our hope that the community will take advantage of CNTK to share ideas more quickly through the exchange of open source working code.

  • cppflow 📁 🌐 -- run TensorFlow models in C++ without Bazel, without a TensorFlow installation and without compiling TensorFlow.

  • CRFpp 📁 🌐 -- CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for general-purpose use and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.

  • crfsuite 📁 🌐 -- an implementation of Conditional Random Fields (CRFs) for labeling sequential data.

  • CRFsuite-extended 📁 🌐 -- a fork of Naoaki Okazaki's implementation of conditional random fields (CRFs).

  • CurvatureFilter 📁 🌐 -- Curvature filters are efficient solvers for variational models. Traditional solvers, such as gradient descent or the Euler-Lagrange equation, start at the total energy and use a diffusion scheme to carry out the minimization. When the initial condition is the original image, the data fitting energy always increases while the regularization energy always decreases during the optimization. Thus, regularization energy must be the dominant part since the total energy has to decrease. Therefore, curvature filters focus on minimizing the regularization term, whose minimizers are already known. For example, if the regularization is Gaussian curvature, the developable surfaces minimize this energy. Therefore, in curvature filters, developable surfaces are used to approximate the data. As long as the decrease in the regularization part is larger than the increase in the data fitting energy, the total energy is reduced.

  • darknet 📁 🌐 -- Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

  • DBoW2 📁 🌐 -- a C++ library for indexing and converting images into a bag-of-word representation. It implements a hierarchical tree for approximating nearest neighbours in the image feature space and creating a visual vocabulary. DBoW2 also implements an image database with inverted and direct files to index images and enable quick queries and feature comparisons.

  • DBow3 📁 🌐 -- DBoW3 is an improved version of the DBoW2 library, an open source C++ library for indexing and converting images into a bag-of-word representation. It implements a hierarchical tree for approximating nearest neighbours in the image feature space and creating a visual vocabulary. DBoW3 also implements an image database with inverted and direct files to index images and enable quick queries and feature comparisons.

  • deepdetect 📁 🌐 -- DeepDetect (https://www.deepdetect.com/) is a machine learning API and server written in C++11. It makes state of the art machine learning easy to work with and integrate into existing applications. It has support for both training and inference, with automatic conversion to embedded platforms with TensorRT (NVidia GPU) and NCNN (ARM CPU). It implements support for supervised and unsupervised deep learning of images, text, time series and other data, with focus on simplicity and ease of use, test and connection into existing applications. It supports classification, object detection, segmentation, regression, autoencoders, ... and it relies on external machine learning libraries through a very generic and flexible API.

  • DGM-CRF 📁 🌐 -- DGM (Direct Graphical Models) is a cross-platform C++ library implementing various tasks in probabilistic graphical models with pairwise and complete (dense) dependencies. The library aims to be used for the Markov and Conditional Random Fields (MRF / CRF), Markov Chains, Bayesian Networks, etc.

  • DiskANN 📁 🌐 -- DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that support real-time changes and simple filters.

  • dkm 📁 🌐 -- a generic C++11 k-means clustering implementation. The algorithm is based on Lloyd's algorithm and uses the kmeans++ initialization method.
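
    For readers unfamiliar with Lloyd's algorithm, here is a minimal sketch of its two alternating steps on 1-D data. It is not dkm's API: the function name `lloyd_1d` is an assumption, the caller supplies the starting centers, and the kmeans++ initialization is deliberately left out.

    ```cpp
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Refines `centers` by alternating assignment and update steps until
    // no center moves (or max_iter is reached). Returns the final centers.
    std::vector<double> lloyd_1d(const std::vector<double>& pts,
                                 std::vector<double> centers,
                                 int max_iter = 100) {
        if (centers.empty()) return centers;
        for (int it = 0; it < max_iter; ++it) {
            std::vector<double> sum(centers.size(), 0.0);
            std::vector<std::size_t> count(centers.size(), 0);
            // Assignment step: each point joins its nearest center.
            for (double p : pts) {
                std::size_t best = 0;
                for (std::size_t k = 1; k < centers.size(); ++k)
                    if (std::fabs(p - centers[k]) < std::fabs(p - centers[best])) best = k;
                sum[best] += p;
                ++count[best];
            }
            // Update step: each center moves to the mean of its points.
            bool moved = false;
            for (std::size_t k = 0; k < centers.size(); ++k) {
                if (count[k] == 0) continue;  // an empty cluster keeps its center
                const double updated = sum[k] / static_cast<double>(count[k]);
                if (updated != centers[k]) moved = true;
                centers[k] = updated;
            }
            if (!moved) break;
        }
        return centers;
    }
    ```

    Starting from centers 0 and 5, the points 1, 2, 3, 10, 11, 12 converge to the centers 2 and 11 after three iterations.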

  • dlib 📁 🌐 -- machine learning algorithms

  • DP_means 📁 🌐 -- Dirichlet Process K-means is a Bayesian non-parametric extension of the K-means algorithm based on the small variance asymptotics (SVA) approximation of the Dirichlet Process Mixture Model. B. Kulis and M. Jordan, "Revisiting k-means: New Algorithms via Bayesian Nonparametrics"

  • dynet 📁 🌐 -- The Dynamic Neural Network Toolkit. DyNet is a neural network library developed by Carnegie Mellon University and many others. It is written in C++ (with bindings in Python) and is designed to be efficient when run on either CPU or GPU, and to work well with networks that have dynamic structures that change for every training instance. For example, these kinds of networks are particularly important in natural language processing tasks, and DyNet has been used to build state-of-the-art systems for syntactic parsing, machine translation, morphological inflection, and many other application areas.

  • falconn 📁 🌐 -- FALCONN (FAst Lookups of Cosine and Other Nearest Neighbors) is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures. Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search.
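
    The hyperplane LSH family mentioned above can be illustrated with a short sketch. This is not FALCONN's API or its multi-probe scheme: the class name `HyperplaneHash`, its constructor parameters and the use of `std::mt19937` are assumptions chosen for this example.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Each hash bit records on which side of a random hyperplane a vector lies,
    // so vectors separated by a small angle tend to share most bits.
    class HyperplaneHash {
    public:
        HyperplaneHash(std::size_t dim, std::size_t bits, std::uint32_t seed)
            : planes_(bits, std::vector<double>(dim)) {
            assert(bits <= 64);  // the code must fit into one 64-bit word
            std::mt19937 rng(seed);
            std::normal_distribution<double> gauss(0.0, 1.0);
            for (auto& plane : planes_)
                for (double& w : plane) w = gauss(rng);  // random normal vector
        }

        // Bit i is 1 when x lies on the positive side of hyperplane i.
        std::uint64_t operator()(const std::vector<double>& x) const {
            std::uint64_t code = 0;
            for (std::size_t i = 0; i < planes_.size(); ++i) {
                double dot = 0.0;
                for (std::size_t j = 0; j < x.size() && j < planes_[i].size(); ++j)
                    dot += planes_[i][j] * x[j];
                if (dot > 0.0) code |= (std::uint64_t{1} << i);
            }
            return code;
        }

    private:
        std::vector<std::vector<double>> planes_;
    };
    ```

    Because only the sign of each dot product matters, the code depends on the direction of a vector and not on its length, which is why this family suits cosine similarity.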

  • fast-kmeans 📁 🌐 -- this Fast K-means Clustering Toolkit is a testbed for comparing variants of Lloyd's k-means clustering algorithm. It includes implementations of several algorithms that accelerate the algorithm by avoiding unnecessary distance calculations.

  • fbow 📁 🌐 -- FBOW (Fast Bag of Words) is an extremely optimized version of the DBoW2/DBoW3 libraries. The library is highly optimized to speed up the Bag of Words creation using AVX, SSE and MMX instructions. In loading a vocabulary, fbow is ~80x faster than DBoW2 (see tests directory and try). In transforming an image into a bag of words on machines with AVX instructions, it is ~6.4x faster.

  • ffht 📁 🌐 -- FFHT (Fast Fast Hadamard Transform) is a library that provides a heavily optimized C99 implementation of the Fast Hadamard Transform. FFHT also provides a thin Python wrapper that performs the Fast Hadamard Transform on one-dimensional NumPy arrays. The Hadamard Transform is a linear orthogonal map defined on real vectors whose length is a power of two. For the precise definition, see the Wikipedia entry. The Hadamard Transform has been recently used a lot in various machine learning and numerical algorithms. FFHT uses AVX to speed up the computation.
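
    The transform itself is a simple butterfly network, sketched below in scalar form. This is not FFHT's optimized code or API: the function name `fht` is an assumption, no AVX is used, and the result is left unnormalized (divide by sqrt(n) to obtain the orthogonal map).

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // In-place, unnormalized fast Walsh-Hadamard transform in O(n log n).
    // The length of v must be a power of two.
    void fht(std::vector<double>& v) {
        const std::size_t n = v.size();
        assert(n != 0 && (n & (n - 1)) == 0);
        for (std::size_t h = 1; h < n; h *= 2)            // butterfly half-width
            for (std::size_t i = 0; i < n; i += 2 * h)    // start of each block
                for (std::size_t j = i; j < i + h; ++j) {
                    const double x = v[j], y = v[j + h];
                    v[j] = x + y;
                    v[j + h] = x - y;
                }
    }
    ```

    Applied to (1, 0, 1, 0, 0, 1, 1, 0) this yields (4, 2, 0, -2, 0, 2, 0, 2); applying it a second time returns the input scaled by n = 8.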

  • FFME 📁 🌐 -- key points detection (OpenCV). This method is SIFT-like, but specifically designed for egomotion computation. The key idea is that it skips some of the steps SIFT performs, so that it runs faster, at the cost of being less robust against scaling. The good news is that in egomotion estimation the scaling is not as critical as in registration applications, where SIFT should be selected.

  • flann 📁 🌐 -- FLANN (Fast Library for Approximate Nearest Neighbors) is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. It contains a collection of algorithms we found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset.

  • flashlight 📁 🌐 -- a fast, flexible machine learning library written entirely in C++ from Facebook AI Research and the creators of Torch, TensorFlow, Eigen and Deep Speech, with an emphasis on efficiency and scale.

  • flinng 📁 🌐 -- Filters to Identify Near-Neighbor Groups (FLINNG) is a near neighbor search algorithm outlined in the paper Practical Near Neighbor Search via Group Testing.

  • gtn 📁 🌐 -- GTN (Automatic Differentiation with WFSTs) is a framework for automatic differentiation with weighted finite-state transducers. The goal of GTN is to make adding and experimenting with structure in learning algorithms much simpler. This structure is encoded as weighted automata, either acceptors (WFSAs) or transducers (WFSTs). With gtn you can dynamically construct complex graphs from operations on simpler graphs. Automatic differentiation gives gradients with respect to any input or intermediate graph with a single call to gtn.backward.

  • ikd-Tree 📁 🌐 -- an incremental k-d tree designed for robotic applications. The ikd-Tree incrementally updates a k-d tree with new coming points only, leading to much lower computation time than existing static k-d trees. Besides point-wise operations, the ikd-Tree supports several features such as box-wise operations and down-sampling that are practically useful in robotic applications.

  • InferenceHelper 📁 🌐 -- a wrapper of deep learning frameworks especially for inference. This class provides a common interface to various deep learning frameworks, so that you can use the same application code.

  • InversePerspectiveMapping 📁 🌐 -- C++ class for the computation of plane-to-plane homographies, aka bird's-eye view or IPM, particularly relevant in the field of Advanced Driver Assistance Systems.
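
    A plane-to-plane homography is a 3x3 matrix applied in homogeneous coordinates. The sketch below shows only that mapping step and is not the InversePerspectiveMapping class: the `Mat3` alias and the `apply_homography` function are hypothetical names, and estimating H from camera parameters is out of scope here.

    ```cpp
    #include <array>
    #include <utility>

    using Mat3 = std::array<std::array<double, 3>, 3>;

    // Maps the point (x, y) through H:
    //   (u, v, w)^T = H * (x, y, 1)^T,   result = (u / w, v / w)
    // The caller must ensure w != 0 (the point does not map to infinity).
    std::pair<double, double> apply_homography(const Mat3& H, double x, double y) {
        const double u = H[0][0] * x + H[0][1] * y + H[0][2];
        const double v = H[1][0] * x + H[1][1] * y + H[1][2];
        const double w = H[2][0] * x + H[2][1] * y + H[2][2];
        return {u / w, v / w};
    }
    ```

    The division by w is what produces the perspective effect; for an affine H (last row 0, 0, 1) it has no effect.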

  • jpeg2dct 📁 🌐 -- Faster Neural Networks Straight from JPEG: jpeg2dct subroutines -- this module is useful for reproducing results presented in the paper Faster Neural Networks Straight from JPEG (ICLR workshop 2018).

  • kann 📁 🌐 -- KANN is a standalone and lightweight library in C for constructing and training small to medium artificial neural networks such as multi-layer perceptrons (MLP), convolutional neural networks (CNN) and recurrent neural networks (RNN), including LSTM and GRU. It implements graph-based reverse-mode automatic differentiation and allows building topologically complex neural networks with recurrence, shared weights and multiple inputs/outputs/costs. In comparison to mainstream deep learning frameworks such as TensorFlow, KANN is not as scalable, but it is close in flexibility, has a much smaller code base and only depends on the standard C library. In comparison to other lightweight frameworks such as tiny-dnn, KANN is still smaller, times faster and much more versatile, supporting RNN, VAE and non-standard neural networks that may fail these lightweight frameworks. KANN could be potentially useful when you want to experiment with small to medium neural networks in C/C++, to deploy not-so-large models without worrying about dependency hell, or to learn the internals of deep learning libraries.

  • K-Medoids-Clustering 📁 🌐 -- K-medoids is a clustering algorithm related to K-means. In contrast to the K-means algorithm, K-medoids chooses datapoints as centers of the clusters. There are eight combinations of Initialization, Assignment and Update algorithms to achieve the best results in the given dataset. The CLARA algorithm is also implemented.

  • lapack 📁 🌐 -- CBLAS + LAPACK optimized linear algebra libs

  • libahocorasick 📁 🌐 -- a fast and memory efficient library for exact or approximate multi-pattern string search, meaning that you can find occurrences of multiple key strings at once in some input text. The strings "index" can be built ahead of time and saved (as a pickle) to disk to reload and reuse later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search.

  • libcluster 📁 🌐 -- implements various algorithms with variational Bayes learning procedures and efficient cluster splitting heuristics, including the Variational Dirichlet Process (VDP), the Bayesian Gaussian Mixture Model, the Grouped Mixtures Clustering (GMC) model and more clustering algorithms based on diagonal Gaussian, and Exponential distributions.

  • libclustering_dim_redux 📁 🌐 -- C++ code for dimension reduction, Kohonen maps (SOMs), t-SNE, PCA, kNN, ...

  • libdivsufsort 📁 🌐 -- a software library that implements a lightweight suffix array construction algorithm.

  • libfann 📁 🌐 -- FANN: Fast Artificial Neural Network Library, a free open source neural network library, which implements multilayer artificial neural networks in C with support for both fully connected and sparsely connected networks. Cross-platform execution in both fixed and floating point is supported. It includes a framework for easy handling of training data sets. It is easy to use, versatile, well documented, and fast.

  • libirwls 📁 🌐 -- LIBIRWLS is an integrated library that incorporates a parallel implementation of the Iterative Re-Weighted Least Squares (IRWLS) procedure, an alternative to quadratic programming (QP), for training of Support Vector Machines (SVMs). Although there are several methods for SVM training, very few parallel libraries exist. In particular, this library contains solvers for either full or budgeted SVMs that make use of shared memory parallelization techniques: (1) a parallel SVM training procedure based on the IRWLS algorithm, (2) a parallel budgeted SVM solver based on the IRWLS algorithm.

  • libkdtree 📁 🌐 -- libkdtree++ is a C++ template container implementation of k-dimensional space sorting, using a kd-tree.

  • libmlpp 📁 🌐 -- ML++ :: The intent with this machine-learning library is for it to act as a crossroad between low-level developers and machine learning engineers.

  • libsvm 📁 🌐 -- a simple, easy-to-use, and efficient software for SVM classification and regression. It solves C-SVM classification, nu-SVM classification, one-class-SVM, epsilon-SVM regression, and nu-SVM regression. It also provides an automatic model selection tool for C-SVM classification.

  • LightGBM 📁 🌐 -- LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

    • Better accuracy.
    • Capable of handling large-scale data.
    • Faster training speed and higher efficiency.
    • Lower memory usage.
    • Support of parallel, distributed, and GPU learning.
  • LMW-tree 📁 🌐 -- LMW-tree: learning m-way tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms are primarily focussed on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning and other machine learning applications. The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, bit signatures. See the related PhD thesis for more details on these algorithms and representations.

  • mace 📁 🌐 -- Mobile AI Compute Engine (or MACE for short) is a deep learning inference framework optimized for mobile heterogeneous computing on Android, iOS, Linux and Windows devices.

  • mapreduce 📁 🌐 -- the MapReduce-MPI (MR-MPI) library. MapReduce is the operation popularized by Google for computing on large distributed data sets. See the Wikipedia entry on MapReduce for an overview of what a MapReduce is. The MR-MPI library is a simple, portable implementation of MapReduce that runs on any serial desktop machine or large parallel machine using MPI message passing.

  • marian 📁 🌐 -- an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies.

  • MegEngine 📁 🌐 -- MegEngine is a fast, scalable, and user-friendly deep learning framework with 3 key features: (1) a unified framework for both training and inference, (2) the lowest hardware requirements and (3) efficient inference on all platforms.

  • midas 📁 🌐 -- C++ implementation of:

  • minhash_clustering 📁 🌐 -- this program is for clustering protein conserved regions using MinWise Independent Hashing. The code uses the MR-MPI library for MapReduce in C/C++ and consists of two major parts.

  • MITIE-nlp 📁 🌐 -- provides state-of-the-art information extraction tools. Includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors. Built on top of dlib, a high-performance machine-learning library, MITIE makes use of several state-of-the-art techniques including the use of distributional word embeddings and Structural Support Vector Machines.

  • MNN 📁 🌐 -- a highly efficient and lightweight deep learning framework. It supports inference and training of deep learning models, and has industry leading performance for inference and training on-device. At present, MNN has been integrated in more than 30 apps of Alibaba Inc, such as Taobao, Tmall, Youku, Dingtalk and Xianyu, covering more than 70 usage scenarios such as live broadcast, short video capture, search recommendation, product searching by image, interactive marketing, equity distribution and security risk control. In addition, MNN is also used on embedded devices, such as IoT. Inside Alibaba, MNN works as the basic module of the compute container in the Walle System, the first end-to-end, general-purpose, and large-scale production system for device-cloud collaborative machine learning, which has been published in the top system conference OSDI’22.

  • mrpt 📁 🌐 -- MRPT is a lightweight and easy-to-use library for approximate nearest neighbor search with random projection. The index building has an integrated hyperparameter tuning algorithm, so the only hyperparameter required to construct the index is the target recall level! According to our experiments MRPT is one of the fastest libraries for approximate nearest neighbor search.

    In the offline phase of the algorithm MRPT indexes the data with a collection of random projection trees. In the online phase the index structure allows us to answer queries in superior time. A detailed description of the algorithm with the time and space complexities, and the aforementioned comparisons can be found in our article that was published in IEEE International Conference on Big Data 2016.

    The algorithm for automatic hyperparameter tuning is described in detail in our new article that will be presented in Pacific-Asia Conference on Knowledge Discovery and Data Mining 2019 (arxiv preprint).

  • Multicore-TSNE 📁 🌐 -- Multicore t-SNE is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with Python CFFI-based wrappers. This code also works faster than sklearn.TSNE on 1 core (as of version 0.18).

  • multiverso 📁 🌐 -- a parameter server based framework for training machine learning models on big data across a number of machines. It is currently a standard C++ library and provides a series of friendly programming interfaces. Machine learning researchers and practitioners no longer need to worry about system routine issues such as distributed model storage and operation, inter-process and inter-thread communication, multi-threading management, and so on. Instead, they are able to focus on the core machine learning logic: data, model, and training.

  • mxnet 📁 🌐 -- Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity.

  • nanoflann_dbscan 📁 🌐 -- a fast C++ implementation of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.

  • ncnn 📁 🌐 -- high-performance neural network inference computing framework optimized for mobile platforms (i.e. small footprint)

  • NiuTrans.NMT 📁 🌐 -- a lightweight and efficient Transformer-based neural machine translation system. Its main features are:

    • Few dependencies. It is implemented with pure C++, and all dependencies are optional.
    • Flexible running modes. The system can run with various systems and devices (Linux vs. Windows, CPUs vs. GPUs, and FP32 vs. FP16, etc.).
    • Framework agnostic. It supports various models trained with other tools, e.g., fairseq models.
    • High efficiency. It is heavily optimized for fast decoding, see our WMT paper for more details.
  • oneDNN 📁 🌐 -- oneAPI Deep Neural Network Library (oneDNN) is an open-source cross-platform performance library of basic building blocks for deep learning applications. oneDNN is intended for deep learning applications and framework developers interested in improving application performance on CPUs and GPUs.

  • onnxruntime 📁 🌐 -- a cross-platform inference and training machine-learning accelerator. ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms.

  • onnxruntime-extensions 📁 🌐 -- a library that extends the capability of ONNX models and of inference with ONNX Runtime, via ONNX Runtime Custom Operator ABIs. It includes a set of ONNX Runtime Custom Operators to support the common pre- and post-processing operators for vision, text, and NLP models. The basic workflow is to first enhance an ONNX model and then run model inference with ONNX Runtime and the ONNXRuntime-Extensions package.

  • onnxruntime-genai 📁 🌐 -- generative AI: run Llama, Phi, Gemma, Mistral with ONNX Runtime. This API gives you an easy, flexible and performant way of running LLMs on any device. It implements the generative AI loop for ONNX models, including pre and post processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.

  • OpenBLAS 📁 🌐 -- an optimized BLAS (Basic Linear Algebra Subprograms) library based on GotoBLAS2 1.13 BSD version.

  • OpenCL-CTS 📁 🌐 -- the OpenCL Conformance Test Suite (CTS) for all versions of the Khronos OpenCL standard.

  • OpenCL-Headers 📁 🌐 -- C language headers for the OpenCL API.

  • OpenCL-SDK 📁 🌐 -- the Khronos OpenCL SDK. It brings together all the components needed to develop OpenCL applications.

  • OpenFST 📁 🌐 -- a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The more familiar finite-state acceptor is represented as a transducer with each transition's input and output label equal. Finite-state acceptors are used to represent sets of strings (specifically, regular or rational sets); finite-state transducers are used to represent binary relations between pairs of strings (specifically, rational transductions). The weights can be used to represent the cost of taking a particular transition. FSTs have key applications in speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, information extraction and retrieval among others. Often a weighted transducer is used to represent a probabilistic model (e.g., an n-gram model, pronunciation model). FSTs can be optimized by determinization and minimization, models can be applied to hypothesis sets (also represented as automata) or cascaded by finite-state composition, and the best results can be selected by shortest-path algorithms.

  • OpenFST-utils 📁 🌐 -- a set of useful programs for manipulating Finite State Transducer with the OpenFst library.

  • OpenNN 📁 🌐 -- a software library written in C++ for advanced analytics. It implements neural networks, the most successful machine learning method. The main advantage of OpenNN is its high performance. This library stands out in terms of execution speed and memory allocation. It is constantly optimized and parallelized in order to maximize its efficiency.

  • openvino 📁 🌐 -- OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference, including several components: Model Optimizer, OpenVINO™ Runtime and the Post-Training Optimization Tool, as well as CPU, GPU, GNA, multi-device and heterogeneous plugins to accelerate deep learning inference on Intel® CPUs and Intel® Processor Graphics. It supports pre-trained models from Open Model Zoo, along with 100+ open source and public models in popular formats such as TensorFlow, ONNX, PaddlePaddle, MXNet, Caffe, Kaldi.

  • OTB 📁 🌐 -- Orfeo ToolBox (OTB) is an open-source project for state-of-the-art remote sensing. Built on the shoulders of the open-source geospatial community, it can process high resolution optical, multispectral and radar images at the terabyte scale. A wide variety of applications are available: from ortho-rectification or pansharpening, all the way to classification, SAR processing, and much more!

  • PaddleClas 📁 🌐 -- an image classification and image recognition toolset for industry and academia, helping users train better computer vision models and apply them in real scenarios, based on PaddlePaddle.

  • PaddleDetection 📁 🌐 -- a Highly Efficient Development Toolkit for Object Detection based on PaddlePaddle.

  • Paddle-Lite 📁 🌐 -- an updated version of Paddle-Mobile, an open-source deep learning framework designed to make it easy to perform inference on mobile, embedded, and IoT devices. It is compatible with PaddlePaddle and pre-trained models from other sources.

  • PaddleNLP 📁 🌐 -- an NLP library that is both easy to use and powerful. It aggregates high-quality pretrained models in the industry and provides a plug-and-play development experience, covering a model library for various NLP scenarios. With practical examples from industry practices, PaddleNLP can meet the needs of developers who require flexible customization.

  • PaddleOCR 📁 🌐 -- PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and apply them into practice.

  • PaddlePaddle 📁 🌐 -- the first independent R&D deep learning platform in China. It is an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end development kits, tools & components as well as service platforms. PaddlePaddle originated from industrial practice, with dedication and commitment to industrialization. It has been adopted by a wide range of sectors including manufacturing, agriculture, enterprise service, and so on while serving more than 4.7 million developers, 180,000 companies and generating 560,000 models. With such advantages, PaddlePaddle has helped an increasing number of partners commercialize AI.

  • pagerank 📁 🌐 -- a pagerank implementation in C++ able to handle very big graphs.

  • pecos 📁 🌐 -- PECOS (Predictions for Enormous and Correlated Output Spaces) is a versatile and modular machine learning (ML) framework for fast learning and inference on problems with large output spaces, such as extreme multi-label ranking (XMR) and large-scale retrieval. PECOS' design is intentionally agnostic to the specific nature of the inputs and outputs as it is envisioned to be a general-purpose framework for multiple distinct applications. Given an input, PECOS identifies a small set (10-100) of relevant outputs from amongst an extremely large (~100MM) candidate set and ranks these outputs in terms of relevance.

  • PGM-index 📁 🌐 -- the Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.

  • pico_tree 📁 🌐 -- a C++ header only library for fast nearest neighbor searches and range searches using a KdTree.

  • puffinn 📁 🌐 -- PUFFINN - Parameterless and Universal Fast FInding of Nearest Neighbors - is an easily configurable library for finding the approximate nearest neighbors of arbitrary points. It also supports the identification of the closest pairs in the dataset. The only necessary parameters are the allowed space usage and the recall. Each near neighbor is guaranteed to be found with the probability given by the recall, regardless of the difficulty of the query. Under the hood PUFFINN uses Locality Sensitive Hashing with an adaptive query mechanism. This means that the algorithm works for any similarity measure where a Locality Sensitive Hash family exists. Currently Cosine similarity is supported using SimHash or cross-polytope LSH and Jaccard similarity is supported using MinHash.

  • pyclustering 📁 🌐 -- a Python, C++ data mining library (clustering algorithm, oscillatory networks, neural networks). The library provides Python and C++ implementations (C++ pyclustering library) of each algorithm or model.

  • pyglass 📁 🌐 -- a library for fast inference of graph index for approximate similarity search.

    • High performance.
    • No third-party library dependencies, does not rely on OpenBLAS / MKL or any other computing framework.
    • Sophisticated memory management and data structure design, very low memory footprint.
    • Supports multiple graph algorithms, like HNSW and NSG.
    • Supports multiple hardware platforms, like X86 and ARM. Support for GPU is on the way
  • pytorch 📁 🌐 -- PyTorch library in C++

  • pytorch_cluster 📁 🌐 -- a small extension library of highly optimized graph cluster algorithms for the use in PyTorch, supporting:

  • pytorch_cpp_demo 📁 🌐 -- Deep Learning sample programs of PyTorch written in C++.

  • pyTsetlinMachine 📁 🌐 -- implementation of the Tsetlin Machine (https://arxiv.org/abs/1804.01508), Convolutional Tsetlin Machine (https://arxiv.org/abs/1905.09688), Regression Tsetlin Machine (https://arxiv.org/abs/1905.04206, https://royalsocietypublishing.org/doi/full/10.1098/rsta.2019.0165, https://link.springer.com/chapter/10.1007/978-3-030-30244-3_23), Weighted Tsetlin Machines (https://arxiv.org/abs/1911.12607, https://ieeexplore.ieee.org/document/9316190, https://arxiv.org/abs/2002.01245), and Embedding Tsetlin Machine, with support for continuous features (https://arxiv.org/abs/1905.04199, https://link.springer.com/chapter/10.1007%2F978-3-030-22999-3_49), multigranular clauses (https://arxiv.org/abs/1909.07310, https://link.springer.com/chapter/10.1007/978-3-030-34885-4_11), clause indexing (https://arxiv.org/abs/2004.03188, https://link.springer.com/chapter/10.1007/978-3-030-55789-8_60), drop clause (https://arxiv.org/abs/2105.14506), and literal budget (https://www.ijcai.org/proceedings/2023/378).

  • sod 📁 🌐 -- SOD is an embedded, modern cross-platform computer vision and machine learning software library that exposes a set of APIs for deep-learning, advanced media analysis & processing including real-time, multi-class object detection and model training on embedded systems with limited computational resources and IoT devices. SOD was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in open source as well as commercial products.

  • spherical-k-means 📁 🌐 -- the spherical K-means algorithm in Matlab and C++. The C++ version emphasizes a multithreaded implementation and features three ways of running the algorithm. It can be executed with a single-thread (same as the Matlab implementation), or using OpenMP or Galois (http://iss.ices.utexas.edu/?p=projects/galois). The purpose of this code is to optimize and compare the different parallel paradigms to maximize the efficiency of the algorithm.

  • ssdeep 📁 🌐 -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • SSIM 📁 🌐 -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!). Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author. Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all (and provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM). Most of the problems come from the original implementation being in MATLAB, which not everyone can use. Running the same code in open source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers. The original paper doesn't define how to handle color images and doesn't specify what color space the grayscale values represent (linear? gamma compressed?), adding to the inconsistencies in results. Because color is ignored, images that differ only in color can be rated as visually perfect by SSIM as published. The paper demonstrates so many issues when using SSIM with color images that they state "we advise not to use SSIM with color images". All of this is a shame since the underlying concept works well for the given compute complexity.

    A good first step to cleaning up this mess is trying to get widely used implementations to match the author results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices, and most importantly, provides original, MIT licensed, single file C++ header and single file C# implementations; each reproduces the original author code better than any other version he has found.

  • ssimulacra2 📁 🌐 -- Structural SIMilarity Unveiling Local And Compression Related Artifacts metric developed by Jon Sneyers. SSIMULACRA 2 is based on the concept of the multi-scale structural similarity index measure (MS-SSIM), computed in a perceptually relevant color space, adding two other (asymmetric) error maps, and aggregating using two different norms.

  • stan 📁 🌐 -- Stan is a C++ package providing (1) full Bayesian inference using the No-U-Turn sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC), (2) approximate Bayesian inference using automatic differentiation variational inference (ADVI), and (3) penalized maximum likelihood estimation (MLE) using L-BFGS optimization. It is built on top of the Stan Math library.

  • stan-math 📁 🌐 -- the Stan Math Library is a C++, reverse-mode automatic differentiation library designed to be usable, extensive and extensible, efficient, scalable, stable, portable, and redistributable in order to facilitate the construction and utilization of algorithms that utilize derivatives.

  • StarSpace 📁 🌐 -- a general-purpose neural model for efficient learning of entity embeddings for solving a wide variety of problems.

  • tapkee 📁 🌐 -- a C++ template library for dimensionality reduction with some bias on spectral methods. Tapkee originates from code developed during GSoC 2011 as part of the Shogun machine learning toolbox. The project aims to provide an efficient and flexible standalone library for dimensionality reduction which can be easily integrated into existing codebases. Tapkee leverages the capabilities of the efficient Eigen3 linear algebra library and optionally makes use of the ARPACK eigensolver. The library uses CoverTree and VP-tree data structures to compute nearest neighbors. To achieve greater flexibility, it provides a callback interface which decouples dimension reduction algorithms from the data representation and storage schemes.

  • tensorflow 📁 🌐 -- an end-to-end open source platform for machine learning.

  • tensorflow-docs 📁 🌐 -- TensorFlow documentation

  • tensorflow-io 📁 🌐 -- TensorFlow I/O is a collection of file systems and file formats that are not available in TensorFlow's built-in support. A full list of supported file systems and file formats by TensorFlow I/O can be found here.

  • tensorflow-text 📁 🌐 -- TensorFlow Text provides a collection of text related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models, and includes other features useful for sequence modeling not provided by core TensorFlow.

  • tensorstore 📁 🌐 -- TensorStore is an open-source C++ and Python software library designed for storage and manipulation of large multi-dimensional arrays.

  • thunderSVM 📁 🌐 -- ThunderSVM exploits GPUs and multi-core CPUs to achieve high efficiency, supporting all functionalities of LibSVM such as one-class SVMs, SVC, SVR and probabilistic SVMs.

  • tinn 📁 🌐 -- Tinn (Tiny Neural Network) is a 200 line dependency free neural network library written in C99.

  • tiny-dnn 📁 🌐 -- a C++14 implementation of deep learning. It is suitable for deep learning on limited computational resources, embedded systems and IoT devices.

  • TNN 📁 🌐 -- a high-performance, lightweight neural network inference framework open sourced by Tencent Youtu Lab. It also has many outstanding advantages such as cross-platform, high performance, model compression, and code tailoring. The TNN framework further strengthens the support and performance optimization of mobile devices on the basis of the original Rapidnet and ncnn frameworks. At the same time, it refers to the high performance and good scalability characteristics of the industry's mainstream open source frameworks, and expands the support for X86 and NV GPUs. On the mobile phone, TNN has been used by many applications such as mobile QQ, weishi, and Pitu. As a basic acceleration framework for Tencent Cloud AI, TNN has provided acceleration support for the implementation of many businesses. Everyone is welcome to participate in the collaborative construction to promote the further improvement of the TNN inference framework.

  • vxl 📁 🌐 -- VXL (the Vision-something-Libraries) is a collection of C++ libraries designed for computer vision research and implementation. It was created from TargetJr and the IUE with the aim of making a light, fast and consistent system.

  • waifu2x-ncnn-vulkan 📁 🌐 -- waifu2x ncnn Vulkan: an ncnn project implementation of the waifu2x converter. Runs fast on Intel / AMD / Nvidia / Apple-Silicon with Vulkan API.

  • warp-ctc 📁 🌐 -- A fast parallel implementation of CTC, on both CPU and GPU. Connectionist Temporal Classification (CTC) is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition.

  • xnnpack 📁 🌐 -- a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms. XNNPACK is not intended for direct use by deep learning practitioners and researchers; instead it provides low-level performance primitives for accelerating high-level machine learning frameworks, such as TensorFlow Lite, TensorFlow.js, PyTorch, and MediaPipe.

  • xtensor 📁 🌐 -- C++ tensors with broadcasting and lazy computing. xtensor is a C++ library meant for numerical analysis with multi-dimensional array expressions.

  • xtensor-blas 📁 🌐 -- an extension to the xtensor library, offering bindings to BLAS and LAPACK libraries through cxxblas and cxxlapack.

  • xtensor-io 📁 🌐 -- an xtensor plugin to read and write images, audio files, NumPy (compressed) NPZ and HDF5 files.

  • xtl 📁 🌐 -- xtensor core library

  • yara-pattern-matcher 📁 🌐 -- for automated and user-specified pattern recognition in custom document & metadata cleaning / processing tasks

  • ZQCNN 📁 🌐 -- ZQCNN is an inference framework that can run under Windows, Linux and arm-linux. It also includes some demos related to face detection and recognition.
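
The K-Medoids-Clustering entry in the list above notes that K-medoids, unlike K-means, chooses datapoints as cluster centers. As a rough illustration only (this is not code from that library; `Point`, `euclidean`, `medoid_index` and `mean_point` are hypothetical names made up for this sketch), the difference between a medoid and a mean for a single cluster might look like this in C++:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 2D point type used only for this sketch.
struct Point { double x, y; };

// Euclidean distance between two points.
double euclidean(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Medoid: the index of the member whose summed distance to all other
// members is smallest. The result is always an actual datapoint.
// Assumes the cluster is non-empty; ties go to the lowest index.
std::size_t medoid_index(const std::vector<Point>& cluster) {
    std::size_t best = 0;
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < cluster.size(); ++i) {
        double cost = 0.0;
        for (const Point& p : cluster) cost += euclidean(cluster[i], p);
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;
}

// Mean (the K-means style center): usually NOT one of the datapoints.
// Assumes the cluster is non-empty.
Point mean_point(const std::vector<Point>& cluster) {
    Point m{0.0, 0.0};
    for (const Point& p : cluster) { m.x += p.x; m.y += p.y; }
    m.x /= static_cast<double>(cluster.size());
    m.y /= static_cast<double>(cluster.size());
    return m;
}
```

For a cluster with x-coordinates 0, 1, 2, 3 and 20 (all with y = 0), `medoid_index` picks the point at x = 2, while `mean_point` is pulled towards the outlier and lands at x = 5.2, which is not a datapoint at all. A full K-medoids run would alternate this update step with an assignment step; that part is left out here.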

similarity search

  • aho_corasick 📁 🌐 -- a header only implementation of the Aho-Corasick pattern search algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a very efficient dictionary matching algorithm that can locate all search patterns in an input text simultaneously in O(n + m), with space complexity O(m) (where n is the length of the input text, and m is the combined length of the search patterns).

  • annoy 📁 🌐 -- ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmap-ped into memory so that many processes may share the same data. ANNOY is almost as fast as the fastest libraries, but what really sets ANNOY apart is its ability to use static files as indexes, enabling you to share an index across processes. ANNOY also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. ANNOY tries to minimize its memory footprint: the indexes are quite small. This is useful when you want to find nearest neighbors using multiple CPUs. Spotify uses ANNOY for music recommendations.

  • brown-cluster 📁 🌐 -- the Brown hierarchical word clustering algorithm. Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ is the number of clusters. Algorithm by Brown, et al.: Class-Based n-gram Models of Natural Language, http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf

  • cppsimhash 📁 🌐 -- C++ simhash implementation for documents and an additional (prototype) simhash index for text documents. Simhash is a hashing technique that belongs to the LSH (Locality-Sensitive Hashing) algorithmic family. It was initially developed by Moses S. Charikar in 2002 and is described in detail in his paper.

  • CTCWordBeamSearch 📁 🌐 -- Connectionist Temporal Classification (CTC) decoder with dictionary and Language Model (LM).

  • DiskANN 📁 🌐 -- DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that support real-time changes and simple filters.

  • DP_means 📁 🌐 -- Dirichlet Process K-means is a Bayesian non-parametric extension of the K-means algorithm based on the small variance asymptotics (SVA) approximation of the Dirichlet Process Mixture Model. B. Kulis and M. Jordan, "Revisiting k-means: New Algorithms via Bayesian Nonparametrics"

  • faiss 📁 🌐 -- a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Facebook AI Research.

  • falconn 📁 🌐 -- FALCONN (FAst Lookups of Cosine and Other Nearest Neighbors) is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures. Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search.

  • flann 📁 🌐 -- FLANN (Fast Library for Approximate Nearest Neighbors) is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. It contains a collection of algorithms we found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset.

  • flinng 📁 🌐 -- Filters to Identify Near-Neighbor Groups (FLINNG) is a near neighbor search algorithm outlined in the paper Practical Near Neighbor Search via Group Testing.

  • FM-fast-match 📁 🌐 -- FAsT-Match: a port of the Fast Affine Template Matching algorithm (Simon Korman, Daniel Reichman, Gilad Tsur, Shai Avidan, CVPR 2013, Portland)

  • fuzzy-match 📁 🌐 -- FuzzyMatch-cli is a command-line utility that compiles FuzzyMatch indexes and uses them to look up fuzzy matches. Okapi BM25 prefiltering is available on branch bm25.

  • hnswlib 📁 🌐 -- fast approximate nearest neighbor search. Header-only C++ HNSW implementation with python bindings.

  • ikd-Tree 📁 🌐 -- an incremental k-d tree designed for robotic applications. The ikd-Tree incrementally updates a k-d tree with new coming points only, leading to much lower computation time than existing static k-d trees. Besides point-wise operations, the ikd-Tree supports several features such as box-wise operations and down-sampling that are practically useful in robotic applications.

  • imagehash 📁 🌐 -- an image hashing library written in Python. ImageHash supports Average hashing, Perceptual hashing, Difference hashing, Wavelet hashing, HSV color hashing (colorhash) and Crop-resistant hashing. The image hash algorithms (average, perceptual, difference, wavelet) analyse the image structure on luminance (without color information). The color hash algorithm analyses the color distribution and black & gray fractions (without position information).

  • ivf-hnsw 📁 🌐 -- Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. This is the code for the current state-of-the-art billion-scale nearest neighbor search system presented in the paper: Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors (Dmitry Baranchuk, Artem Babenko, Yury Malkov).

  • kgraph 📁 🌐 -- a library for k-nearest neighbor (k-NN) graph construction and online k-NN search using a k-NN Graph as index. KGraph implements heuristic algorithms that are extremely generic and fast. KGraph works on abstract objects. The only assumption it makes is that a similarity score can be computed on any pair of objects, with a user-provided function.

  • K-Medoids-Clustering 📁 🌐 -- K-medoids is a clustering algorithm related to K-means. In contrast to the K-means algorithm, K-medoids chooses datapoints as centers of the clusters. There are eight combinations of Initialization, Assignment and Update algorithms to achieve the best results in the given dataset. The Clara algorithm approach is also implemented.

  • libahocorasick 📁 🌐 -- a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find multiple key strings occurrences at once in some input text. The strings "index" can be built ahead of time and saved (as a pickle) to disk to reload and reuse later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search.

  • libharry 📁 🌐 -- Harry - A Tool for Measuring String Similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance, the Jaro-Winkler distance or the spectrum kernel.

  • libkdtree 📁 🌐 -- libkdtree++ is a C++ template container implementation of k-dimensional space sorting, using a kd-tree.

  • libngt-ann 📁 🌐 -- Yahoo's Neighborhood Graph and Tree for Indexing High-dimensional Data. NGT provides commands and a library for performing high-speed approximate nearest neighbor searches against a large volume of data (several million to several 10 million items of data) in high dimensional vector data space (several ten to several thousand dimensions).

  • libsptag 📁 🌐 -- a library for fast approximate nearest neighbor search. SPTAG (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenario released by Microsoft Research (MSR) and Microsoft Bing.

  • LMW-tree 📁 🌐 -- LMW-tree: learning m-way tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms are primarily focussed on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning and other machine learning applications. The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, bit signatures. See the related PhD thesis for more details on these algorithms and representations.

  • lshbox 📁 🌐 -- a C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval. Locality-Sensitive Hashing (LSH) is an efficient method for large scale image retrieval, and it achieves great performance in approximate nearest neighborhood searching.

    LSHBOX is a simple but robust C++ toolbox that provides several LSH algorithms; in addition, it can be integrated into Python and MATLAB. The following LSH algorithms have been implemented in LSHBOX:

    • LSH Based on Random Bits Sampling
    • Random Hyperplane Hashing
    • LSH Based on Thresholding
    • LSH Based on p-Stable Distributions
    • Spectral Hashing (SH)
    • Iterative Quantization (ITQ)
    • Double-Bit Quantization Hashing (DBQ)
    • K-means Based Double-Bit Quantization Hashing (KDBQ)
  • mrpt 📁 🌐 -- MRPT is a lightweight and easy-to-use library for approximate nearest neighbor search with random projection. The index building has an integrated hyperparameter tuning algorithm, so the only hyperparameter required to construct the index is the target recall level! According to our experiments MRPT is one of the fastest libraries for approximate nearest neighbor search.

    In the offline phase of the algorithm MRPT indexes the data with a collection of random projection trees. In the online phase the index structure allows us to answer queries in superior time. A detailed description of the algorithm with the time and space complexities, and the aforementioned comparisons can be found in our article that was published in IEEE International Conference on Big Data 2016.

    The algorithm for automatic hyperparameter tuning is described in detail in our new article that will be presented in Pacific-Asia Conference on Knowledge Discovery and Data Mining 2019 (arxiv preprint).

  • n2-kNN 📁 🌐 -- N2: Lightweight approximate Nearest Neighbor algorithm library. N2 stands for two N's, which comes from 'Approximate Nearest Neighbor Algorithm'. Before N2, there have been other great approximate nearest neighbor libraries such as Annoy and NMSLIB. However, each of them had different strengths and weaknesses regarding usability, performance, etc. N2 has been developed aiming to bring the strengths of existing aKNN libraries and supplement their weaknesses.

  • nanoflann 📁 🌐 -- a C++11 header-only library for building KD-Trees of datasets with different topologies: R^2, R^3 (point clouds), SO(2) and SO(3) (2D and 3D rotation groups). No support for approximate NN is provided. This library is a fork of the flann library by Marius Muja and David G. Lowe, and born as a child project of MRPT.

  • nanoflann_dbscan 📁 🌐 -- a fast C++ implementation of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.

  • nmslib 📁 🌐 -- Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core-library does not have any third-party dependencies. It has been gaining popularity recently. In particular, it has become a part of Amazon Elasticsearch Service. The goal of the project is to create an effective and comprehensive toolkit for searching in generic and non-metric spaces. Even though the library contains a variety of metric-space access methods, our main focus is on generic and approximate search methods, in particular, on methods for non-metric spaces. NMSLIB is possibly the first library with a principled support for non-metric space searching.

  • online-hnsw 📁 🌐 -- Online HNSW: an implementation of the HNSW index for approximate nearest neighbors search for C++14, that supports incremental insertion and removal of elements.

  • pagerank 📁 🌐 -- a pagerank implementation in C++ able to handle very big graphs.

  • pHash 📁 🌐 -- the open source perceptual hash library. Potential applications include copyright protection, similarity search for media files, or even digital forensics. For example, YouTube could maintain a database of hashes that have been submitted by the major movie producers of movies to which they hold the copyright. If a user then uploads the same video to YouTube, the hash will be almost identical, and it can be flagged as a possible copyright violation. The audio hash could be used to automatically tag MP3 files with proper ID3 information, while the text hash could be used for plagiarism detection.

  • phash-gpl 📁 🌐 -- pHash™ Perceptual Hashing Library is a collection of perceptual hashing algorithms for image, audio, video and text media.

  • pico_tree 📁 🌐 -- a C++ header only library for fast nearest neighbor searches and range searches using a KdTree.

  • probminhash 📁 🌐 -- a class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity

  • pyglass 📁 🌐 -- a library for fast inference of graph index for approximate similarity search.

    • It's highly performant.
    • No third-party library dependencies, does not rely on OpenBLAS / MKL or any other computing framework.
    • Sophisticated memory management and data structure design, very low memory footprint.
    • Supports multiple graph algorithms, like HNSW and NSG.
    • Supports multiple hardware platforms, like X86 and ARM. Support for GPU is on the way
  • sdhash 📁 🌐 -- a tool which allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. It is designed to provide quick results during triage and initial investigation phases.

  • Shifted-Hamming-Distance 📁 🌐 -- Shifted Hamming Distance (SHD) is an edit-distance based filter that can quickly check whether the minimum number of edits (including insertions, deletions and substitutions) between two strings is smaller than a user defined threshold T (the number of allowed edits between the two strings). Testing if two strings differ by a small amount is a prevalent function that is used in many applications. One of its biggest uses, perhaps, is in DNA or protein mapping, where a short DNA or protein string is compared against an enormous database, in order to find similar matches. In such applications, a query string is usually compared against multiple candidate strings in the database. Only candidates that are similar to the query are considered matches and recorded. SHD expands the basic Hamming distance computation, which only detects substitutions, into a full-fledged edit-distance filter, which counts not only substitutions but insertions and deletions as well.

  • simhash-cpp 📁 🌐 -- Simhash Near-Duplicate Detection enables the identification of all fingerprints that are nearly identical to a query fingerprint. In this context, a fingerprint is an unsigned 64-bit integer. It also comes with an auxiliary function designed to generate a fingerprint given a char* and a length. This fingerprint is generated with a tokenizer and a hash function (both of which may be provided as template parameters). Using a cyclic hash function, it then performs simhash on a moving window of tokens (as defined by the tokenizer).

  • spherical-k-means 📁 🌐 -- the spherical K-means algorithm in Matlab and C++. The C++ version emphasizes a multithreaded implementation and features three ways of running the algorithm. It can be executed with a single-thread (same as the Matlab implementation), or using OpenMP or Galois (http://iss.ices.utexas.edu/?p=projects/galois). The purpose of this code is to optimize and compare the different parallel paradigms to maximize the efficiency of the algorithm.

  • ssdeep 📁 🌐 -- fuzzy hashing library, can be used to assist with identifying almost identical files using context triggered piecewise hashing.

  • SSIM 📁 🌐 -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!). Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author. Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all (and provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM). Most of the problems come from the original implementation being in MATLAB, which not everyone can use. Running the same code in open source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers. The original paper doesn't define how to handle color images, doesn't specify what color space the grayscale values represent (linear? gamma compressed?), adding to the inconsistencies and results. The lack of color causes the following images to be rated as visually perfect by SSIM as published. The paper demonstrates so many issues when using SSIM with color images that they state "we advise not to use SSIM with color images". All of this is a shame since the underlying concept works well for the given compute complexity. 

    A good first step to cleaning up this mess is trying to get widely used implementations to match the author results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices, and most importantly, provides original, MIT licensed, single file C++ header and single file C# implementations; each reproduces the original author code better than any other version I have found.

  • ssimulacra2 📁 🌐 -- Structural SIMilarity Unveiling Local And Compression Related Artifacts metric developed by Jon Sneyers. SSIMULACRA 2 is based on the concept of the multi-scale structural similarity index measure (MS-SSIM), computed in a perceptually relevant color space, adding two other (asymmetric) error maps, and aggregating using two different norms.

  • tiny-dnn 📁 🌐 -- a C++14 implementation of deep learning. It is suitable for deep learning on limited computational resources, embedded systems and IoT devices.

  • tlsh 📁 🌐 -- TLSH - Trend Micro Locality Sensitive Hash - is a fuzzy matching library. Given a byte stream with a minimum length of 50 bytes TLSH generates a hash value which can be used for similarity comparisons. Similar objects will have similar hash values which allows for the detection of similar objects by comparing their hash values. Note that the byte stream should have a sufficient amount of complexity. For example, a byte stream of identical bytes will not generate a hash value.

  • usearch 📁 🌐 -- smaller & faster Single-File Similarity Search Engine for vectors & texts.

  • VQMT 📁 🌐 -- VQMT (Video Quality Measurement Tool) provides fast implementations of the following objective metrics:

    • MS-SSIM: Multi-Scale Structural Similarity,
    • PSNR: Peak Signal-to-Noise Ratio,
    • PSNR-HVS: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF),
    • PSNR-HVS-M: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF) and between-coefficient contrast masking of DCT basis functions.
    • SSIM: Structural Similarity,
    • VIFp: Visual Information Fidelity, pixel domain version

    The above metrics are implemented in C++ with the help of OpenCV and are based on the original Matlab implementations provided by their developers.

  • xgboost 📁 🌐 -- an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.
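
The simhash-cpp entry above treats a fingerprint as an unsigned 64-bit integer and looks for fingerprints that are nearly identical to a query. The sketch below illustrates only that comparison step, assuming "nearly identical" means a small Hamming distance (few differing bits). The names `hamming_distance` and `find_near_duplicates` and the `max_bits` threshold are hypothetical and are not the library's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bit positions in which two 64-bit fingerprints differ.
int hamming_distance(std::uint64_t a, std::uint64_t b) {
    std::uint64_t diff = a ^ b;
    int count = 0;
    while (diff != 0) {
        diff &= diff - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Hypothetical helper: linear scan that returns the indices of all stored
// fingerprints within `max_bits` differing bits of `query`.
std::vector<std::size_t> find_near_duplicates(
        const std::vector<std::uint64_t>& fingerprints,
        std::uint64_t query,
        int max_bits) {
    assert(max_bits >= 0 && max_bits <= 64);
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < fingerprints.size(); ++i) {
        if (hamming_distance(fingerprints[i], query) <= max_bits) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

A linear scan is the simplest possible lookup; for large collections an index over the fingerprints would avoid comparing the query against every entry.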

text tokenization (as a preprocessing step for LDA et al):

i.e. breaking text into words when you receive a textstream without spaces. Also useful for Asian languages, which don't do spaces, e.g. Chinese.

  • Bi-Sent2Vec 📁 🌐 -- provides cross-lingual numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task with applications geared towards cross-lingual word translation, cross-lingual sentence retrieval as well as cross-lingual downstream NLP tasks. The library is a cross-lingual extension of Sent2Vec. Bi-Sent2Vec vectors are also well suited to monolingual tasks as indicated by a marked improvement in the monolingual quality of the word embeddings. (For more details, see paper)

  • BlingFire 📁 🌐 -- we are a team at Microsoft called Bling (Beyond Language Understanding), sharing our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing such as Tokenization, Multi-word expression matching, Unknown word-guessing, Stemming / Lemmatization just to mention a few.

    Fire can also be used to improve FastText: see here.

    Bling Fire Tokenizer provides state of the art performance for Natural Language text tokenization.

  • chewing_text_cud 📁 🌐 -- a text processing / filtering library for use in NLP/search/content analysis research pipelines.

  • cppjieba 📁 🌐 -- the C++ version of the Chinese "Jieba" project:

    • Supports loading a custom user dictionary, using the '|' separator when multipathing or the ';' separator for separate, multiple, dictionaries.
    • Supports 'utf8' encoding.
    • The project comes with a relatively complete unit test, and the stability of the core function Chinese word segmentation (utf8) has been tested by the online environment.
  • fastBPE 📁 🌐 -- text tokenization / ngrams

  • fastText 📁 🌐 -- fastText is a library for efficient learning of word representations and sentence classification. Includes language detection features.

  • fribidi 📁 🌐 -- GNU FriBidi: the Free Implementation of the Unicode Bidirectional Algorithm. One of the missing links stopping the penetration of free software in the Middle East is the lack of support for the Arabic and Hebrew alphabets. In order to have proper Arabic and Hebrew support, the bidi algorithm needs to be implemented. It is our hope that this library will stimulate more free software in the Middle Eastern countries.

  • friso 📁 🌐 -- high performance Chinese tokenizer with both GBK and UTF-8 charset support based on MMSEG algorithm.

  • fxt 📁 🌐 -- a large scale feature extraction tool for text-based machine learning.

  • koan 📁 🌐 -- a word2vec negative sampling implementation with correct CBOW update. kōan only depends on Eigen.

    Although continuous bag of word (CBOW) embeddings can be trained more quickly than skipgram (SG) embeddings, it is a common belief that SG embeddings tend to perform better in practice. This was observed by the original authors of Word2Vec [1] and also in subsequent work [2]. However, we found that popular implementations of word2vec with negative sampling such as word2vec and gensim do not implement the CBOW update correctly, thus potentially leading to misconceptions about the performance of CBOW embeddings when trained correctly.

  • libchewing 📁 🌐 -- The Chewing (酷音) is an intelligent phonetic input method (Zhuyin/Bopomofo) and is one of the most popular choices for Traditional Chinese users. Chewing was inspired by other proprietary intelligent Zhuyin input methods on Microsoft Windows, namely Wang-Xin by Eten, Microsoft New Zhuyin, and Nature Zhuyin (aka Going).

  • libchopshop 📁 🌐 -- NLP/text processing with automated stop word detection and stemmer-based filtering. This library / toolkit is engineered to be able to provide both of the (often more or less disparate) n-gram token streams / vectors required for (1) initializing / training FTS databases, neural nets, etc. and (2) executing effective queries / matches on these engines.

  • libcppjieba 📁 🌐 -- source code extracted from the CppJieba project to form a separate project, making it easier to understand and use.

  • libdtm 📁 🌐 -- LibDTM (Dynamic Topic Models and the Document Influence Model) implements topics that change over time (Dynamic Topic Models) and a model of how individual documents predict that change. This code is the result of work by David M. Blei and Sean M. Gerrish.

  • libpostal 📁 🌐 -- a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere.

  • libtextcat 📁 🌐 -- text language detection

  • many-stop-words 📁 🌐 -- Many Stop Words is a simple Python package that provides a single function for loading sets of stop words for different languages.

  • mecab 📁 🌐 -- MeCab (Yet Another Part-of-Speech and Morphological Analyzer) is a high-performance morphological analysis engine, designed to be independent of languages, dictionaries, and corpora, using Conditional Random Fields (CRF, http://www.cis.upenn.edu/~pereira/papers/crf.pdf) to estimate the parameters.

  • ngrams-weighted 📁 🌐 -- implements the method to compute N-gram IDF weights for all valid word N-grams in the given corpus (document set).

  • open-location-code 📁 🌐 -- Open Location Code is a technology that gives a way of encoding location into a form that is easier to use than latitude and longitude. The codes generated are called plus codes, as their distinguishing attribute is that they include a "+" character. The technology is designed to produce codes that can be used as a replacement for street addresses, especially in places where buildings aren't numbered or streets aren't named. Plus codes represent an area, not a point. As digits are added to a code, the area shrinks, so a long code is more precise than a short code. Codes that are similar are located closer together than codes that are different.

  • sally 📁 🌐 -- a Tool for Embedding Strings in Vector Spaces. This mapping is referred to as embedding and allows for applying techniques of machine learning and data mining for analysis of string data. Sally can be applied to several types of strings, such as text documents, DNA sequences or log files, where it can handle common formats such as directories, archives and text files of string data. Sally implements a standard technique for mapping strings to a vector space that can be referred to as generalized bag-of-words model. The strings are characterized by a set of features, where each feature is associated with one dimension of the vector space. The following types of features are supported by Sally: bytes, tokens (words), n-grams of bytes and n-grams of tokens.

  • scws-chinese-word-segmentation 📁 🌐 -- SCWS (Simple Chinese Word Segmentation System). This is a mechanical Chinese word segmentation engine based on a word frequency dictionary, which can, for the most part, correctly segment a whole paragraph of Chinese text into words. A word is the smallest morpheme unit in Chinese, but written Chinese does not separate words with spaces as English does, so segmenting words accurately and quickly has always been a difficult problem in Chinese word segmentation. SCWS supports Chinese encodings including GBK, UTF-8, etc.

    There are not many innovative elements in the word segmentation algorithm. It uses a word frequency dictionary collected by itself, supplemented by certain proper names, names of people, and place names. Basic word segmentation is achieved by identifying rules such as digital age. After small-scale testing, the accuracy is between 90% and 95%, which can basically satisfy some use in small search engines, keyword extraction and other occasions. The first prototype version was released in late 2005.

  • sent2vec 📁 🌐 -- a tool and pre-trained models related to the Bi-Sent2vec. The cross-lingual extension of Sent2Vec can be found here. This library provides numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task.

  • sentencepiece 📁 🌐 -- text tokenization

  • sentence-tokenizer 📁 🌐 -- text tokenization

  • SheenBidi 📁 🌐 -- implements Unicode Bidirectional Algorithm available at http://www.unicode.org/reports/tr9. It is a sophisticated implementation which provides the developers an easy way to use UBA in their applications.

  • stopwords 📁 🌐 -- default English stop words from different sources.

  • ucto 📁 🌐 -- text tokenization

    • libfolia 📁 🌐 -- working with the Format for Linguistic Annotation (FoLiA). Provides a high-level API to read, manipulate, and create FoLiA documents.
    • uctodata 📁 🌐 -- data for ucto library
  • word2vec 📁 🌐 -- Word2Vec in C++ 11.

  • word2vec-GloVe 📁 🌐 -- an implementation of the GloVe (Global Vectors for Word Representation) model for learning word representations.

  • worde_butcher 📁 🌐 -- a tool for text segmentation, keyword extraction and speech tagging. Butchers any text into prime word / phrase cuts, deboning all incoming based on our definitive set of stopwords for all languages.

  • wordfreq 📁 🌐 -- wordfreq is a Python library for looking up the frequencies of words in many languages, based on many sources of data.

  • wordfrequency 📁 🌐 -- FrequencyWords: Frequency Word List Generator and processed files.

  • you-token-to-me 📁 🌐 -- text tokenization
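
To make "breaking text into words when you receive a textstream without spaces" concrete, here is a minimal sketch of greedy forward maximum matching against a word list. It is a deliberately simplified stand-in, not the algorithm of any library listed above (dictionary-based segmenters such as SCWS or friso add word frequencies and disambiguation rules). The function name `segment_max_match` is hypothetical, and the code works on bytes, so it assumes ASCII input; real CJK text would need to step over whole code points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy forward maximum matching: at each position take the longest
// dictionary word that starts there; fall back to a single byte when
// nothing in the dictionary matches.
std::vector<std::string> segment_max_match(
        const std::string& text,
        const std::unordered_set<std::string>& dictionary,
        std::size_t max_word_len) {
    assert(max_word_len >= 1);
    std::vector<std::string> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = std::min(max_word_len, text.size() - pos);
        // Shrink the candidate until it is a dictionary word or one byte long.
        while (len > 1 && dictionary.count(text.substr(pos, len)) == 0) {
            --len;
        }
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}
```

With the dictionary `{"the", "cat", "sat", "on", "mat"}` and `max_word_len` set to 3, the input `thecatsatonthemat` is split into `the`, `cat`, `sat`, `on`, `the`, `mat`. Being greedy, the method can pick a wrong split when a longer dictionary word overlaps the intended one, which is one reason production segmenters use statistical information as well.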

regex matchers (manual edit - pattern recognition)

  • hyperscan 📁 🌐 -- Hyperscan is a high-performance multiple regex matching library.

  • libfsm 📁 🌐 -- provides core functions for finite state machines: NFA, DFA, regular expressions and lexical analysis. Used by ragel.

  • libgnurx 📁 🌐 -- the POSIX regex functionality from glibc extracted into a separate library, for Win32.

  • libwildmatch 📁 🌐 -- wildmatch is a BSD-licensed C/C++ library for git/rsync-style pattern matching.

  • oniguruma 📁 🌐 -- a modern and flexible regular expressions library. It encompasses features from different regular expression implementations that traditionally exist in different languages. Character encoding can be specified per regular expression object. Supported character encodings include: ASCII, UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, EUC-JP, EUC-TW, EUC-KR, EUC-CN, Shift_JIS, Big5, GB18030, KOI8-R, CP1251, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16

  • pcre 📁 🌐 -- PCRE2 : Perl-Compatible Regular Expressions. The PCRE2 library is a set of C functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. PCRE2 has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. It comes in three forms, for processing 8-bit, 16-bit, or 32-bit code units, in either literal or UTF encoding.

  • pdfgrep 📁 🌐 -- a tool to search text in PDF files. It works similarly to grep.

  • ragel 📁 🌐 -- State Machine Compiler

  • re2 📁 🌐 -- RE2, a regular expression library.

  • re2c 📁 🌐 -- a lexer generator for C/C++, Go and Rust. Its main goal is generating fast lexers: at least as fast as their reasonably optimized hand-coded counterparts. Instead of using traditional table-driven approach, re2c encodes the generated finite state automata directly in the form of conditional jumps and comparisons. The resulting programs are faster and often smaller than their table-driven analogues, and they are much easier to debug and understand. re2c applies quite a few optimizations in order to speed up and compress the generated code. Another distinctive feature is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment.

  • RE-flex 📁 🌐 -- the regex-centric, fast lexical analyzer generator for C++ with full Unicode support. Faster than Flex. Accepts Flex specifications. Generates reusable source code that is easy to understand. Introduces indent/dedent anchors, lazy quantifiers, functions for lex/syntax error reporting and more. Seamlessly integrates with Bison and other parsers.

    The RE/flex matcher tracks line numbers, column numbers, and indentations, whereas Flex does not (option noyylineno) and neither do the other regex matchers (except PCRE2 and Boost.Regex when used with RE/flex). Tracking this information incurs some overhead. RE/flex also automatically decodes UTF-8/16/32 input and accepts std::istream, strings, and wide strings as input.

    RE/flex runs equally fast or slightly faster than the best times of Flex.

  • tre 📁 🌐 -- TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching. The matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression.

  • ugrep 📁 🌐 -- search for anything in everything... ultra fast. "grep for arbitrary binary files."

  • yara-pattern-matcher 📁 🌐 -- for automated and user-specified pattern recognition in custom document & metadata cleaning / processing tasks
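
The TRE entry above mentions approximate (fuzzy) matching. As a rough illustration of the idea for a plain string rather than a regular expression, the sketch below computes the smallest edit distance between a pattern and any substring of a text, using the classic dynamic programme in which a match may begin at any text position at no cost. This is not TRE's algorithm or API; the function names `best_substring_distance` and `fuzzy_contains` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Smallest number of insertions, deletions and substitutions needed to turn
// `pattern` into some substring of `text`.
std::size_t best_substring_distance(const std::string& pattern,
                                    const std::string& text) {
    const std::size_t m = pattern.size();
    // column[i]: cost of matching the first i pattern characters against a
    // substring that ends at the current text position.
    std::vector<std::size_t> column(m + 1);
    for (std::size_t i = 0; i <= m; ++i) {
        column[i] = i;  // empty text: delete i pattern characters
    }
    std::size_t best = column[m];
    for (char c : text) {
        std::size_t diag = column[0];  // previous column, row i-1
        column[0] = 0;                 // a match may start anywhere for free
        for (std::size_t i = 1; i <= m; ++i) {
            const std::size_t prev_same_row = column[i];  // previous column, row i
            const std::size_t subst = diag + (pattern[i - 1] == c ? 0 : 1);
            const std::size_t skip_pattern = column[i - 1] + 1;
            const std::size_t skip_text = prev_same_row + 1;
            column[i] = std::min({subst, skip_pattern, skip_text});
            diag = prev_same_row;
        }
        best = std::min(best, column[m]);
    }
    assert(best <= m);  // never worse than deleting the whole pattern
    return best;
}

// True when `pattern` occurs in `text` with at most `max_errors` edits.
bool fuzzy_contains(const std::string& pattern, const std::string& text,
                    std::size_t max_errors) {
    return best_substring_distance(pattern, text) <= max_errors;
}
```

The running time is proportional to the pattern length times the text length, and only one column of the table is kept in memory at a time.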

OCR: quality improvements, language detect, ...

  • Awesome-Document-Image-Rectification 📁 🌐 -- a comprehensive list of awesome document image rectification methods based on deep learning.

  • Awesome-Image-Quality-Assessment 📁 🌐 -- a comprehensive collection of IQA papers, datasets and codes. We also provide PyTorch implementations of mainstream metrics in IQA-PyTorch

  • Capture2Text 📁 🌐 -- Linux CLI port of Capture2Text v4.5.1 (Ubuntu) - the OCR results from Capture2Text were generally better than standard Tesseract, so it seemed ideal to make this run on Linux.

  • chewing_text_cud 📁 🌐 -- a text processing / filtering library for use in NLP/search/content analysis research pipelines.

  • EasyOCR 📁 🌐 -- ready-to-use OCR with 80+ supported languages and all popular writing scripts including: Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

  • EasyOCR-cpp 📁 🌐 -- custom C++ implementation of EasyOCR. This C++ project implements the pre/post processing to run an OCR pipeline consisting of a text detector CRAFT, and a CRNN based text recognizer. Unlike the EasyOCR python which is API based, this repo provides a set of classes to show how you can integrate OCR in any C++ program for maximum flexibility.

  • fastText 📁 🌐 -- fastText is a library for efficient learning of word representations and sentence classification. Includes language detection features.

  • fribidi 📁 🌐 -- GNU FriBidi: the Free Implementation of the [Unicode Bidirectional Algorithm]. One of the missing links stopping the penetration of free software in Middle East is the lack of support for the Arabic and Hebrew alphabets. In order to have proper Arabic and Hebrew support, the bidi algorithm needs to be implemented. It is our hope that this library will stimulate more free software in the Middle Eastern countries.

  • hunspell 📁 🌐 -- a free spell checker and morphological analyzer library and command-line tool, designed for quick and high quality spell checking and correcting for languages with word-level writing system, including languages with rich morphology, complex word compounding and character encoding.

  • hunspell-dictionaries 📁 🌐 -- Collection of normalized and installable [hunspell][] dictionaries.

  • hunspell-hyphen 📁 🌐 -- hyphenation library to use converted TeX hyphenation patterns with hunspell.

  • IMGUR5K-Handwriting-Dataset 📁 🌐 -- the IMGUR5K Handwriting Dataset for OCR/image preprocessing benchmarks.

  • InversePerspectiveMapping 📁 🌐 -- C++ class for the computation of plane-to-plane homographies, aka bird's-eye view or IPM, particularly relevant in the field of Advanced Driver Assistance Systems.

  • ipa-dict 📁 🌐 -- Monolingual wordlists with pronunciation information in IPA aims to provide a series of dictionaries consisting of wordlists with accompanying phonemic pronunciation information in International Phonetic Alphabet (IPA) transcription for as many words as possible in as many languages / dialects / variants as possible. The dictionary data is available in a number of human- and machine-readable formats, in order to make it as useful as possible for various other applications.

  • JamSpell 📁 🌐 -- a spell checking library which takes a word's surroundings (context) into account for better correction accuracy, and is fast (nearly 5K words per second).

  • libpinyin 📁 🌐 -- the libpinyin project aims to provide the algorithms core for intelligent sentence-based Chinese pinyin input methods.

  • libpostal 📁 🌐 -- a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere.

  • libtextcat 📁 🌐 -- text language detection

  • LSWMS 📁 🌐 -- LSWMS (Line Segment detection using Weighted Mean-Shift): line segment detection with OpenCV, originally published by Marcos Nieto Doncel.

  • marian 📁 🌐 -- an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies.

  • nuspell 📁 🌐 -- a fast and safe spelling checker software program. It is designed for languages with rich morphology and complex word compounding. Nuspell is written in modern C++ and it supports Hunspell dictionaries.

  • ocreval 📁 🌐 -- ocreval contains 17 tools for measuring the performance of and experimenting with OCR output. ocreval is a modern port of the ISRI Analytic Tools for OCR Evaluation, with UTF-8 support and other improvements.

  • ocr-evaluation-tools 📁 🌐 -- 19 tools for measuring the performance and quality of OCR output.

  • open-location-code 📁 🌐 -- Open Location Code is a technology that gives a way of encoding location into a form that is easier to use than latitude and longitude. The codes generated are called plus codes, as their distinguishing attribute is that they include a "+" character. The technology is designed to produce codes that can be used as a replacement for street addresses, especially in places where buildings aren't numbered or streets aren't named. Plus codes represent an area, not a point. As digits are added to a code, the area shrinks, so a long code is more precise than a short code. Codes that are similar are located closer together than codes that are different.
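
    As an illustration of how a plus code narrows down an area digit by digit, the following C++ sketch encodes a latitude/longitude pair into a standard 10-digit code. It is a simplified sketch, not the library's API: the function name `encode_plus_code` is hypothetical, and only the five "pair" digits per axis are handled (no shortened codes, no extra grid-refinement digits).

```cpp
#include <cmath>
#include <string>

// Simplified 10-digit Open Location Code ("plus code") encoder.
// Each pair of digits refines latitude and longitude by a factor of 20.
std::string encode_plus_code(double lat, double lng) {
    static const char kAlphabet[] = "23456789CFGHJMPQRVWX";  // 20 symbols
    // Shift both axes to positive ranges; keep latitude just below the pole.
    lat = std::fmin(std::fmax(lat, -90.0), 90.0 - 1e-9) + 90.0;
    lng = std::fmod(lng + 180.0, 360.0);
    if (lng < 0.0) lng += 360.0;
    // Work in integer units of 1/8000 degree, the cell size of the 5th pair.
    long long lat_units = static_cast<long long>(std::floor(lat * 8000.0));
    long long lng_units = static_cast<long long>(std::floor(lng * 8000.0));
    std::string digits(10, '2');
    for (int pair = 4; pair >= 0; --pair) {  // fill least significant pair first
        digits[2 * pair] = kAlphabet[lat_units % 20];
        digits[2 * pair + 1] = kAlphabet[lng_units % 20];
        lat_units /= 20;
        lng_units /= 20;
    }
    return digits.substr(0, 8) + "+" + digits.substr(8);  // "+" after 8 digits
}
```

    Because every additional pair divides the cell by 20 in each direction, two codes that share a long prefix describe nearby areas, which is the property the description above refers to.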

  • OTB 📁 🌐 -- Orfeo ToolBox (OTB) is an open-source project for state-of-the-art remote sensing. Built on the shoulders of the open-source geospatial community, it can process high resolution optical, multispectral and radar images at the terabyte scale. A wide variety of applications are available: from ortho-rectification or pansharpening, all the way to classification, SAR processing, and much more!

  • pinyin 📁 🌐 -- pīnyīn is a tool for converting Chinese characters to pinyin. It can be used for Chinese phonetic notation, sorting, and retrieval.

  • retinex 📁 🌐 -- the Retinex algorithm for intrinsic image decomposition. The provided code computes image gradients, and assembles a sparse linear "Ax = b" system. The system is solved using Eigen.

  • scws-chinese-word-segmentation 📁 🌐 -- SCWS (Simple Chinese Word Segmentation System). This is a mechanical Chinese word segmentation engine based on a word frequency dictionary, which can segment a whole paragraph of Chinese text into words with reasonable accuracy. A word is the smallest morpheme unit in Chinese, but written Chinese does not separate words with spaces as English does, so segmenting words accurately and quickly has always been a difficult problem. SCWS supports several Chinese encodings, including GBK, UTF-8, etc.

    There are not many innovative elements in the word segmentation algorithm. It uses a word frequency dictionary collected by itself, supplemented by certain proper names, names of people, and place names. Basic word segmentation is achieved by identifying rules such as digital age. After small-scale testing, the accuracy is between 90% and 95%, which can basically satisfy some use in small search engines, keyword extraction and other occasions. The first prototype version was released in late 2005.

  • SheenBidi 📁 🌐 -- implements Unicode Bidirectional Algorithm available at http://www.unicode.org/reports/tr9. It is a sophisticated implementation which provides the developers an easy way to use UBA in their applications.

  • SymSpell 📁 🌐 -- spelling correction & fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm. The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
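
    The core idea of the Symmetric Delete approach can be sketched in a few lines of C++: index every dictionary word under its own deletes, then look up the deletes of the input. The class `DeleteIndex` and the helper `single_deletes` below are hypothetical names for illustration, limited to one deleted character. This is not SymSpell's API; the real algorithm also verifies each candidate with a Damerau-Levenshtein distance check, which is omitted here.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// All strings obtained by deleting exactly one character from `word`.
std::unordered_set<std::string> single_deletes(const std::string& word) {
    std::unordered_set<std::string> out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        out.insert(word.substr(0, i) + word.substr(i + 1));
    }
    return out;
}

// Toy index: maps each dictionary word and each of its single-character
// deletes back to the dictionary words they came from.
class DeleteIndex {
public:
    void add(const std::string& word) {
        index_[word].insert(word);
        for (const auto& d : single_deletes(word)) index_[d].insert(word);
    }

    // Unverified candidates: dictionary words that share a delete with
    // `input`. Some may be two edits away; a distance check would filter them.
    std::unordered_set<std::string> candidates(const std::string& input) const {
        std::unordered_set<std::string> found;
        collect(input, found);
        for (const auto& d : single_deletes(input)) collect(d, found);
        return found;
    }

private:
    void collect(const std::string& key, std::unordered_set<std::string>& found) const {
        const auto it = index_.find(key);
        if (it != index_.end()) found.insert(it->second.begin(), it->second.end());
    }

    std::unordered_map<std::string, std::unordered_set<std::string>> index_;
};
```

    Only deletes are ever generated, on both the dictionary side and the query side, which is why inserts, replaces and transposes never need to be enumerated.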

  • SymspellCPP 📁 🌐 -- a C++ port from https://github.com/wolfgarbe/SymSpell v6.5

  • tesslinesplit 📁 🌐 -- a standalone program for using Tesseract's line segmentation algorithm to split up document images.

  • unpaper 📁 🌐 -- a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. The main purpose is to make scanned book pages more readable on screen after conversion to PDF. The program also tries to detect misaligned centering and rotation of pages and will automatically straighten each page by rotating it to the correct angle (a.k.a. deskewing).

OCR page image preprocessing, [scanner] tooling: getting the pages to the OCR engine

  • Awesome-Document-Image-Rectification 📁 🌐 -- a comprehensive list of awesome document image rectification methods based on deep learning.

  • Awesome-Image-Quality-Assessment 📁 🌐 -- a comprehensive collection of IQA papers, datasets and codes. We also provide PyTorch implementations of mainstream metrics in IQA-PyTorch

  • butteraugli 📁 🌐 -- a tool for measuring perceived differences between images. Butteraugli is a project that estimates the psychovisual similarity of two images. It gives a score for the images that is reliable in the domain of barely noticeable differences. Butteraugli not only gives a scalar score, but also computes a spatial map of the level of differences. One of the main motivations for this project is the statistical differences in location and density of different color receptors, particularly the low density of blue cones in the fovea. Another motivation comes from more accurate modeling of ganglion cells, particularly the frequency space inhibition.

  • Capture2Text 📁 🌐 -- Linux CLI port of Capture2Text v4.5.1 (Ubuntu) - the OCR results from Capture2Text were generally better than standard Tesseract, so it seemed ideal to make this run on Linux.

  • ccv-nnc 📁 🌐 -- C-based/Cached/Core Computer Vision Library. A Modern Computer Vision Library.

  • CImg 📁 🌐 -- a small C++ toolkit for image processing.

  • colorm 📁 🌐 -- ColorM is a C++11 header-only color conversion and manipulation library for CSS colors with an API similar to chroma.js's API.

  • ColorSpace 📁 🌐 -- library for converting between color spaces and comparing colors.

  • color-util 📁 🌐 -- a header-only C++11 library for handling colors, including color space converters between RGB, XYZ, Lab, etc. and color difference calculators such as CIEDE2000.

  • dcmtk 📁 🌐 -- the DICOM toolkit (DCMTK) package consists of source code, documentation and installation instructions for a set of software libraries and applications implementing part of the DICOM/MEDICOM Standard.

  • DocLayNet 📁 🌐 -- DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank, e.g. Human Annotation: DocLayNet is hand-annotated by well-trained experts, providing a gold-standard in layout segmentation through human recognition and interpretation of each page layout.

  • doxa 📁 🌐 -- Δoxa Binarization Framework (ΔBF) is an image binarization framework which focuses primarily on local adaptive thresholding algorithms, aimed at providing the building blocks one might use to advance the state of handwritten manuscript binarization.

    Supported Algorithms:

    • Otsu - "A threshold selection method from gray-level histograms", 1979.
    • Bernsen - "Dynamic thresholding of gray-level images", 1986.
    • Niblack - "An Introduction to Digital Image Processing", 1986.
    • Sauvola - "Adaptive document image binarization", 1999.
    • Wolf - "Extraction and Recognition of Artificial Text in Multimedia Documents", 2003.
    • Gatos - "Adaptive degraded document image binarization", 2005. (Partial)
    • NICK - "Comparison of Niblack inspired Binarization methods for ancient documents", 2009.
    • Su - "Binarization of Historical Document Images Using the Local Maximum and Minimum", 2010.
    • T.R. Singh - "A New local Adaptive Thresholding Technique in Binarization", 2011.
    • Bataineh - "An adaptive local binarization method for document images based on a novel thresholding method and dynamic windows", 2011. (unreproducible)
    • ISauvola - "ISauvola: Improved Sauvola’s Algorithm for Document Image Binarization", 2016.
    • WAN - "Binarization of Document Image Using Optimum Threshold Modification", 2018.

    Optimizations:

    • Shafait - "Efficient Implementation of Local Adaptive Thresholding Techniques Using Integral Images", 2008.
    • Petty - An algorithm for efficiently calculating the min and max of a local window. Unpublished, 2019.
    • Chan - "Memory-efficient and fast implementation of local adaptive binarization methods", 2019.

    Performance Metrics:

    • Overall Accuracy
    • F-Measure
    • Peak Signal-To-Noise Ratio (PSNR)
    • Negative Rate Metric (NRM)
    • Matthews Correlation Coefficient (MCC)
    • Distance-Reciprocal Distortion Measure (DRDM) - "An Objective Distortion Measure for Binary Document Images Based on Human Visual Perception", 2002.

    Native Image Support:

    • Portable Any-Map: PBM (P4), 8-bit PGM (P5), PPM (P6), PAM (P7)
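
    As a point of reference for the algorithm list above, here is a minimal C++ sketch of Otsu's global threshold selection, the oldest entry in that list. The function name `otsu_threshold` is hypothetical and this is not the Δoxa API; it only shows the histogram-based idea of maximising the between-class variance.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Otsu's method: pick the threshold t that maximises the between-class
// variance of the two pixel classes {value <= t} and {value > t}.
int otsu_threshold(const std::vector<std::uint8_t>& pixels) {
    std::array<double, 256> hist{};
    for (std::uint8_t p : pixels) hist[p] += 1.0;

    const double total = static_cast<double>(pixels.size());
    double sum_all = 0.0;
    for (int v = 0; v < 256; ++v) sum_all += v * hist[v];

    double w0 = 0.0, sum0 = 0.0, best_var = -1.0;
    int best_t = 0;
    for (int t = 0; t < 256; ++t) {
        w0 += hist[t];          // pixel count of the dark class
        sum0 += t * hist[t];    // intensity sum of the dark class
        if (w0 == 0.0) continue;
        const double w1 = total - w0;
        if (w1 == 0.0) break;   // everything is in the dark class from here on
        const double mu0 = sum0 / w0;
        const double mu1 = (sum_all - sum0) / w1;
        const double var_between = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
        if (var_between > best_var) {
            best_var = var_between;
            best_t = t;
        }
    }
    return best_t;
}
```

    Pixels with a value above the returned threshold would be treated as background (white) and the rest as ink (black), or the other way round for inverted scans.
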
  • EasyOCR 📁 🌐 -- ready-to-use OCR with 80+ supported languages and all popular writing scripts including: Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

  • EasyOCR-cpp 📁 🌐 -- custom C++ implementation of EasyOCR. This C++ project implements the pre/post processing to run an OCR pipeline consisting of a CRAFT text detector and a CRNN-based text recognizer. Unlike the Python EasyOCR, which is API-based, this repo provides a set of classes to show how you can integrate OCR in any C++ program for maximum flexibility.

  • farver-OKlab 📁 🌐 -- provides very fast, vectorised functions for conversion of colours between different colour spaces, colour comparisons (distance between colours), encoding/decoding, and channel manipulation in colour strings.

  • fCWT 📁 🌐 -- the fast Continuous Wavelet Transform (fCWT) is a highly optimized C++ library for very fast calculation of the CWT in C++, Matlab, and Python. fCWT has been featured on the January 2022 cover of NATURE Computational Science. In this article, fCWT is compared against eight competitor algorithms, tested on noise resistance and validated on synthetic electroencephalography and in vivo extracellular local field potential data.

  • FFmpeg 📁 🌐 -- a collection of libraries and tools to process multimedia content such as audio, video, subtitles and related metadata.

  • gegl 📁 🌐 -- GEGL (Generic Graphics Library) is a data flow based image processing framework, providing floating point processing and non-destructive image processing capabilities to GNU Image Manipulation Program and other projects. With GEGL you chain together processing operations to represent the desired image processing pipeline. GEGL provides operations for image loading and storing, color adjustments, GIMP's artistic filters and more forms of image processing. GEGL can be used on the command-line with the same syntax that can be used for creating processing flows interactively with text from GIMP using gegl-graph.

  • gmic 📁 🌐 -- a Full-Featured Open-Source Framework for Image Processing. It provides several different user interfaces to convert/manipulate/filter/visualize generic image datasets, ranging from 1d scalar signals to 3d+t sequences of multi-spectral volumetric images, hence including 2d color images.

  • gmic-community 📁 🌐 -- community contributions for the GMIC Full-Featured Open-Source Framework for Image Processing. It provides several different user interfaces to convert/manipulate/filter/visualize generic image datasets, ranging from 1d scalar signals to 3d+t sequences of multi-spectral volumetric images, hence including 2d color images.

  • graph-coloring 📁 🌐 -- a C++ Graph Coloring Package. This project has two primary uses:

    • As an executable for finding the chromatic number for an input graph (in edge list or edge matrix format)
    • As a library for finding the particular coloring of an input graph (represented as a map<string,vector<string>> edge list)
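
    To show what colouring a graph stored as a `map<string,vector<string>>` adjacency list can look like, here is a minimal greedy colouring sketch in C++. The function name `greedy_coloring` is hypothetical and this is not the package's API. Greedy colouring yields a valid colouring but not necessarily one that uses the chromatic number of colours.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

// Greedy colouring: visit nodes in map order and give each node the
// smallest colour index not already used by a coloured neighbour.
std::map<std::string, int> greedy_coloring(const Graph& graph) {
    std::map<std::string, int> color;
    for (const auto& entry : graph) {
        std::set<int> used;
        for (const auto& neighbour : entry.second) {
            const auto it = color.find(neighbour);
            if (it != color.end()) used.insert(it->second);
        }
        int c = 0;
        while (used.count(c) != 0) ++c;
        color[entry.first] = c;
    }
    return color;
}
```

    For a triangle `a-b-c` this assigns three different colours; for a path `a-b-c` it reuses the first colour for `c`.
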
  • GraphicsMagick 📁 🌐 -- provides a comprehensive collection of utilities, programming interfaces, and GUIs, to support file format conversion, image processing, and 2D vector rendering. GraphicsMagick is originally based on ImageMagick from ImageMagick Studio (which was originally written by John Cristy at Dupont). The goal of GraphicsMagick is to provide the highest quality product possible while encouraging open and active participation from all interested developers.

  • gtsam 📁 🌐 -- Georgia Tech Smoothing and Mapping Library (GTSAM) is a C++ library that implements smoothing and mapping (SAM) in robotics and vision, using Factor Graphs and Bayes Networks as the underlying computing paradigm rather than sparse matrices.

  • guetzli 📁 🌐 -- a JPEG encoder that aims for excellent compression density at high visual quality. Guetzli-generated images are typically 20-30% smaller than images of equivalent quality generated by libjpeg. Guetzli generates only sequential (nonprogressive) JPEGs due to the faster decompression speeds they offer.

  • hsluv-c 📁 🌐 -- HSLuv (revision 4) is a human-friendly alternative to HSL. HSLuv is very similar to CIELUV, a color space designed for perceptual uniformity based on human experiments. When accessed by polar coordinates, it becomes functionally similar to HSL with a single problem: its chroma component doesn't fit into a specific range. HSLuv extends CIELUV with a new saturation component that allows you to span all the available chroma as a neat percentage.

  • ImageMagick 📁 🌐 -- ImageMagick® can create, edit, compose, or convert digital images. It can read and write images in a variety of formats (over 200) including PNG, JPEG, GIF, WebP, HEIC, SVG, PDF, DPX, EXR, and TIFF. ImageMagick can resize, flip, mirror, rotate, distort, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses, and Bézier curves.

  • Image-Smoothing-Algorithm-Based-on-Gradient-Analysis 📁 🌐 -- the implementation of an image smoothing algorithm that was proposed in this publication. Our algorithm uses filtering and to achieve edge-preserving smoothing it uses two components of gradient vectors: their magnitudes (or lengths) and directions. Our method discriminates between two types of boundaries in given neighborhood: regular and irregular ones.

  • IMGUR5K-Handwriting-Dataset 📁 🌐 -- the IMGUR5K Handwriting Dataset for OCR/image preprocessing benchmarks.

  • InversePerspectiveMapping 📁 🌐 -- C++ class for the computation of plane-to-plane homographies, aka bird's-eye view or IPM, particularly relevant in the field of Advanced Driver Assistance Systems.

  • ITK 📁 🌐 -- The Insight Toolkit (ITK) is an open-source, cross-platform toolkit for N-dimensional scientific image processing, segmentation, and registration. Segmentation is the process of identifying and classifying data found in a digitally sampled representation. Typically the sampled representation is an image acquired from such medical instrumentation as CT or MRI scanners. Registration is the task of aligning or developing correspondences between data. For example, in the medical environment, a CT scan may be aligned with a MRI scan in order to combine the information contained in both.

  • jasper 📁 🌐 -- JasPer Image Processing/Coding Tool Kit

  • jpeg2dct 📁 🌐 -- Faster Neural Networks Straight from JPEG: jpeg2dct subroutines -- this module is useful for reproducing results presented in the paper Faster Neural Networks Straight from JPEG (ICLR workshop 2018).

  • lcms2 📁 🌐 -- lcms2mt is a thread-safe fork of lcms (a.k.a. Little CMS). Little CMS intends to be a small-footprint color management engine, with special focus on accuracy and performance. It uses the International Color Consortium standard (ICC), which is the modern standard for color management. The ICC specification is widely used and is referred to in many International and other de-facto standards. It was approved as an International Standard, ISO 15076-1, in 2005. Little CMS is a full implementation of ICC specification 4.3; it fully supports all kinds of V2 and V4 profiles, including abstract, devicelink and named color profiles.

  • leptonica 📁 🌐 -- supports many operations that are useful on images.

    Features:

    • Rasterop (aka bitblt)
    • Affine transforms (scaling, translation, rotation, shear) on images of arbitrary pixel depth
    • Projective and bilinear transforms
    • Binary and grayscale morphology, rank order filters, and convolution
    • Seedfill and connected components
    • Image transformations with changes in pixel depth, both at the same scale and with scale change
    • Pixelwise masking, blending, enhancement, arithmetic ops, etc.

    Documentation:

    • LeptonicaDocsSite 📁 🌐 -- unofficial Reference Documentation for the Leptonica image processing library (www.leptonica.org).
    • UnofficialLeptDocs 📁 🌐 -- unofficial Sphinx-generated documentation for the Leptonica image processing library.
  • libchiaroscuramente 📁 🌐 -- a collection of C/C++ functions (components) to help improving / enhancing your images for various purposes (e.g. helping an OCR engine detect and recognize the text in the page scan image)

  • libdip 📁 🌐 -- DIPlib is a C++ library for quantitative image analysis.

  • libimagequant 📁 🌐 -- Palette quantization library that powers pngquant and other PNG optimizers. libimagequant converts RGBA images to palette-based 8-bit indexed images, including alpha component. It's ideal for generating tiny PNG images and nice-looking GIFs. Image encoding/decoding isn't handled by the library itself, bring your own encoder.

  • libinsane 📁 🌐 -- the library to access scanners on both Linux and Windows.

  • libjpegqs 📁 🌐 -- JPEG Quant Smooth tries to recreate lost precision of DCT coefficients based on quantization table from jpeg image. You may not notice jpeg artifacts on the screen without zooming in, but you may notice them after printing. Also, when editing compressed images, artifacts can accumulate, but if you use this program before editing - the result will be better.

  • libpano13 📁 🌐 -- the pano13 library, part of the Panorama Tools by Helmut Dersch of the University of Applied Sciences Furtwangen.

  • libpillowfight 📁 🌐 -- simple C Library containing various image processing algorithms.

    Available algorithms:

    • ACE (Automatic Color Equalization; Parallelized implementation)

    • Canny edge detection

    • Compare: Compares two images (grayscale) and makes the pixels that are different really visible (red).

    • Gaussian blur

    • Scan borders: Tries to detect the borders of a page in an image coming from a scanner.

    • Sobel operator

    • SWT (Stroke Width Transformation)

    • Unpaper's algorithms

      • Blackfilter
      • Blurfilter
      • Border
      • Grayfilter
      • Masks
      • Noisefilter
  • libprecog 📁 🌐 -- PRLib - Pre-Recognition Library. The main aim of the library is to prepare images for OCR (text recognition). Image processing can really help to improve recognition quality.

  • libprecog-data 📁 🌐 -- PRLib (a.k.a. libprecog) test data.

  • libprecog-manuals 📁 🌐 -- PRLib (a.k.a. libprecog) related papers.

  • libraqm 📁 🌐 -- a small library that encapsulates the logic for complex text layout and provides a convenient API.

  • libvips 📁 🌐 -- a demand-driven, horizontally threaded image processing library which has around 300 operations covering arithmetic, histograms, convolution, morphological operations, frequency filtering, colour, resampling, statistics and others. It supports a large range of numeric types, from 8-bit int to 128-bit complex. Images can have any number of bands. It supports a good range of image formats, including JPEG, JPEG2000, JPEG-XL, TIFF, PNG, WebP, HEIC, AVIF, FITS, Matlab, OpenEXR, PDF, SVG, HDR, PPM / PGM / PFM, CSV, GIF, Analyze, NIfTI, DeepZoom, and OpenSlide. It can also load images via ImageMagick or GraphicsMagick, letting it work with formats like DICOM.

  • libxbr-standalone 📁 🌐 -- this standalone XBR/hqx Library implements the xBR pixel art scaling filter developed by Hyllian, and now also the hqx filter developed by Maxim Stepin. Original source for the xBR implementation: http://git.videolan.org/gitweb.cgi/ffmpeg.git/?p=ffmpeg.git;a=blob;f=libavfilter/vf_xbr.c;h=5c14565b3a03f66f1e0296623dc91373aeac1ed0;hb=HEAD

  • local_adaptive_binarization 📁 🌐 -- uses an improved contrast maximization version of Niblack/Sauvola et al's method to binarize document images. It is also able to perform the more classical Niblack as well as Sauvola et al. methods. Details can be found in the ICPR 2002 paper.
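
    For readers unfamiliar with the classical methods named above, this C++ sketch shows the Niblack and Sauvola threshold formulas and a naive windowed binarisation built on the latter. All function names are hypothetical, the default parameters (`k`, `R`) are commonly quoted values and should be treated as assumptions, and the improved contrast-maximisation variant from the ICPR 2002 paper is not reproduced here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Niblack: T = m + k * s  (m = local mean, s = local standard deviation).
double niblack_threshold(double mean, double stddev, double k = -0.2) {
    return mean + k * stddev;
}

// Sauvola: T = m * (1 + k * (s / R - 1)), with R the dynamic range of s.
double sauvola_threshold(double mean, double stddev, double k = 0.5, double R = 128.0) {
    return mean * (1.0 + k * (stddev / R - 1.0));
}

// Naive Sauvola binarisation of a row-major 8-bit grayscale image using a
// (2*radius+1)^2 window clipped at the borders. Cost is O(pixels * window).
std::vector<std::uint8_t> binarize_sauvola(const std::vector<std::uint8_t>& img,
                                           int width, int height, int radius,
                                           double k = 0.5, double R = 128.0) {
    std::vector<std::uint8_t> out(img.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double sum = 0.0, sum_sq = 0.0;
            int n = 0;
            for (int wy = std::max(0, y - radius); wy <= std::min(height - 1, y + radius); ++wy) {
                for (int wx = std::max(0, x - radius); wx <= std::min(width - 1, x + radius); ++wx) {
                    const double v = img[static_cast<std::size_t>(wy) * width + wx];
                    sum += v;
                    sum_sq += v * v;
                    ++n;
                }
            }
            const double mean = sum / n;
            const double var = std::max(0.0, sum_sq / n - mean * mean);
            const double t = sauvola_threshold(mean, std::sqrt(var), k, R);
            const std::size_t idx = static_cast<std::size_t>(y) * width + x;
            out[idx] = img[idx] > t ? 255 : 0;
        }
    }
    return out;
}
```

    Production implementations usually replace the inner window loops with integral images so that the per-pixel cost no longer depends on the window size.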

  • LSWMS 📁 🌐 -- LSWMS (Line Segment detection using Weighted Mean-Shift): line segment detection with OpenCV, originally published by Marcos Nieto Doncel.

  • magsac 📁 🌐 -- (MAGSAC++ has been included in OpenCV) the MAGSAC and MAGSAC++ algorithms for robust model fitting without using a single inlier-outlier threshold.

  • oidn-OpenImageDenoise 📁 🌐 -- Intel® Open Image Denoise is an open source library of high-performance, high-quality denoising filters for images rendered with ray tracing.

  • olena 📁 🌐 -- a platform dedicated to image processing. At the moment it is mainly composed of a C++ library: Milena. This library features many tools to easily perform image processing tasks. Its main characteristic is its genericity: it allows you to write an algorithm once and run it over many kinds of images (gray scale, color, 1D, 2D, 3D, ...).

  • OpenColorIO 📁 🌐 -- OpenColorIO (OCIO) is a complete color management solution geared towards motion picture production with an emphasis on visual effects and computer animation. OCIO provides a straightforward and consistent user experience across all supporting applications while allowing for sophisticated back-end configuration options suitable for high-end production usage. OCIO is compatible with the Academy Color Encoding Specification (ACES) and is LUT-format agnostic, supporting many popular formats.

  • OpenCP 📁 🌐 -- a library for computational photography.

  • opencv 📁 🌐 -- OpenCV: Open Source Computer Vision Library

  • opencv_3rdparty 📁 🌐 -- 3rdparty libraries used by OpenCV.

  • opencv_contrib 📁 🌐 -- OpenCV's extra modules. This is where you'll find new, bleeding edge OpenCV module development.

  • opencv_extra 📁 🌐 -- extra data for OpenCV: Open Source Computer Vision Library

  • OTB 📁 🌐 -- Orfeo ToolBox (OTB) is an open-source project for state-of-the-art remote sensing. Built on the shoulders of the open-source geospatial community, it can process high resolution optical, multispectral and radar images at the terabyte scale. A wide variety of applications are available: from ortho-rectification or pansharpening, all the way to classification, SAR processing, and much more!

  • pdiff 📁 🌐 -- perceptualdiff (pdiff): a program that compares two images using a perceptually based image metric.

  • Pillow 📁 🌐 -- the friendly PIL (Python Imaging Library) fork by Jeffrey A. Clark (Alex) and contributors. PIL is the Python Imaging Library by Fredrik Lundh and Contributors. This library provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities.

  • pillow-resize 📁 🌐 -- a C++ port of the resize method from the Pillow Python library. It is written in C++ using OpenCV for matrix support. The main difference with respect to the resize method of OpenCV is the use of an anti-aliasing filter, which is missing in OpenCV; its absence can introduce artifacts, in particular with strong down-sampling.

  • pixman 📁 🌐 -- a library that provides low-level pixel manipulation features such as image compositing and trapezoid rasterization.

  • poisson_blend 📁 🌐 -- a simple, readable implementation of Poisson Blending, that demonstrates the concepts explained in my article, seamlessly blending a source image and a target image, at some specified pixel location.

  • pylene 📁 🌐 -- Pylene is a fork of Olena/Milena, an image processing library targeting genericity and efficiency. It provides mostly Mathematical Morphology building blocks for image processing pipelines.

  • radon-tf 📁 🌐 -- simple implementation of the radon transform. Faster when using more than one thread to execute it. No inverse function is provided. CPU implementation only.

  • RandomizedRedundantDCTDenoising 📁 🌐 -- demonstrates the paper S. Fujita, N. Fukushima, M. Kimura, and Y. Ishibashi, "Randomized redundant DCT: Efficient denoising by using random subsampling of DCT patches," Proc. Siggraph Asia, Technical Brief, Nov. 2015. In this paper, the DCT-based denoising is accelerated by using a randomized algorithm. The DCT is based on the fastest algorithm and is SIMD vectorized by using SSE. Some modifications improve denoising performance in terms of PSNR. The code is 100x faster than OpenCV's implementation (cv::xphoto::dctDenoising) for the paper. Optionally, we can use DHT (discrete Walsh–Hadamard transform) for fast computation instead of using DCT.

  • retinex 📁 🌐 -- the Retinex algorithm for intrinsic image decomposition. The provided code computes image gradients, and assembles a sparse linear "Ax = b" system. The system is solved using Eigen.

  • rotate 📁 🌐 -- provides several classic, commonly used and novel rotation algorithms (aka block swaps), which were documented since around 1981 up to 2021: three novel rotation algorithms were introduced in 2021, notably the trinity rotation.
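
    For context, the best-known classic rotation (block swap) technique is the triple-reversal rotation, sketched below in C++. The function name `rotate_left_by_reversal` is hypothetical; the sketch is not taken from the rotate repository and does not represent its novel algorithms such as the trinity rotation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Rotate `v` left by `k` positions in place using three reversals:
// reverse the first k elements, reverse the rest, then reverse everything.
template <typename T>
void rotate_left_by_reversal(std::vector<T>& v, std::size_t k) {
    if (v.empty()) return;
    k %= v.size();
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(k);
    std::reverse(v.begin(), mid);
    std::reverse(mid, v.end());
    std::reverse(v.begin(), v.end());
}
```

    Rotating `{1, 2, 3, 4, 5}` left by 2 yields `{3, 4, 5, 1, 2}`. The standard library offers the same operation as `std::rotate`.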

  • rotate_detection 📁 🌐 -- angle rotation detection on scanned documents. Designed for embedding in systems using tesseract OCR. The detection algorithm is based on Rényi entropy.

  • scantailor 📁 🌐 -- scantailor_advanced is the ScanTailor version that merges the features of the ScanTailor Featured and ScanTailor Enhanced versions, brings new ones and fixes. ScanTailor is an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, selecting content, ... and many others.

  • scilab 📁 🌐 -- Scilab includes hundreds of mathematical functions. It has a high-level programming language allowing access to advanced data structures, 2-D and 3-D graphical functions.

  • simd-imgproc 📁 🌐 -- the Simd Library is an image processing and machine learning library designed for C and C++ programmers. It provides many useful high performance algorithms for image processing such as: pixel format conversion, image scaling and filtration, extraction of statistic information from images, motion detection, object detection (HAAR and LBP classifier cascades) and classification, neural network.

    The algorithms are optimized, using different SIMD CPU extensions where available. The library supports the following CPU extensions: SSE, AVX, AVX-512 and AMX for x86/x64, VMX(Altivec) and VSX(Power7) for PowerPC (big-endian), NEON for ARM.

  • SSIM 📁 🌐 -- the structural similarity index measure (SSIM) is a popular method to predict perceived image quality. Published in April 2004, with over 46,000 Google Scholar citations, it has been re-implemented hundreds, perhaps thousands, of times, and is widely used as a measurement of image quality for image processing algorithms (even in places where it does not make sense, leading to even worse outcomes!).

    Unfortunately, if you try to reproduce results in papers, or simply grab a few SSIM implementations and compare results, you will soon find that it is (nearly?) impossible to find two implementations that agree, and even harder to find one that agrees with the original from the author. Chris Lomont ran into this issue many times, so he finally decided to write it up once and for all (and provide clear code that matches the original results, hoping to help reverse the mess that is current SSIM). Most of the problems come from the original implementation being in MATLAB, which not everyone can use. Running the same code in open source Octave, which claims to be MATLAB compatible, even returns wrong results! This large and inconsistent variation among SSIM implementations makes it hard to trust or compare published numbers between papers.

    The original paper doesn't define how to handle color images and doesn't specify what color space the grayscale values represent (linear? gamma compressed?), adding to the inconsistencies in results. The lack of color handling means that images which differ only in color can be rated as visually perfect by SSIM as published. The paper demonstrates so many issues when using SSIM with color images that they state "we advise not to use SSIM with color images". All of this is a shame since the underlying concept works well for the given compute complexity.

    A good first step to cleaning up this mess is trying to get widely used implementations to match the author results for their published test values, and this requires clearly specifying the algorithm at the computational level, which the authors did not. Chris Lomont explains some of these choices, and most importantly, provides original, MIT licensed, single file C++ header and single file C# implementations; each reproduces the original author code better than any other version he has found.

  • tesslinesplit 📁 🌐 -- a standalone program for using Tesseract's line segmentation algorithm to split up document images.

  • twain_library 📁 🌐 -- the DTWAIN Library, Version 5.x, from Dynarithmic Software. DTWAIN is an open source programmer's library that will allow applications to acquire images from TWAIN-enabled devices using a simple Application Programmer's Interface (API).

  • unblending 📁 🌐 -- a C++ library for decomposing a target image into a set of semi-transparent layers associated with advanced color-blend modes (e.g., "multiply" and "color-dodge"). Output layers can be imported to Adobe Photoshop, Adobe After Effects, GIMP, Krita, etc. and are useful for performing complex edits that are otherwise difficult.

  • unpaper 📁 🌐 -- a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. The main purpose is to make scanned book pages more readable on screen after conversion to PDF. The program also tries to detect misaligned centering and rotation of pages and will automatically straighten each page by rotating it to the correct angle (a.k.a. deskewing).

  • vivid 📁 🌐 -- vivid 🌈 is a simple-to-use C++ color library.

  • VQMT 📁 🌐 -- VQMT (Video Quality Measurement Tool) provides fast implementations of the following objective metrics:

    • MS-SSIM: Multi-Scale Structural Similarity,
    • PSNR: Peak Signal-to-Noise Ratio,
    • PSNR-HVS: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF),
    • PSNR-HVS-M: Peak Signal-to-Noise Ratio taking into account Contrast Sensitivity Function (CSF) and between-coefficient contrast masking of DCT basis functions,
    • SSIM: Structural Similarity,
    • VIFp: Visual Information Fidelity, pixel domain version.

    The above metrics are implemented in C++ with the help of OpenCV and are based on the original Matlab implementations provided by their developers.

  • wavelib 📁 🌐 -- C implementation of Discrete Wavelet Transform (DWT, SWT and MODWT), Continuous Wavelet Transform (CWT) and Discrete Wavelet Packet Transform (Full Tree Decomposition and Best Basis DWPT).

  • wdenoise 📁 🌐 -- Wavelet Denoising in ANSI C using empirical bayes thresholding and a host of other thresholding methods.

  • xbrzscale 📁 🌐 -- xBRZ upscaling commandline tool. This tool allows you to scale your graphics with xBRZ algorithm, see https://en.wikipedia.org/wiki/Pixel-art_scaling_algorithms#xBR_family

  • zimg 📁 🌐 -- the "z" library implements the commonly required image processing basics of scaling, colorspace conversion, and depth conversion. A simple API enables conversion between any supported formats to operate with minimal knowledge from the programmer. All library routines were designed from the ground-up with correctness, flexibility, and thread-safety as first priorities.
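
Of the metrics listed under VQMT above, PSNR is the simplest to state: it is `10 * log10(MAX^2 / MSE)`, where `MSE` is the mean squared error between the two images and `MAX` is the largest possible sample value (255 for 8-bit data). The following is a minimal sketch of that formula for flat 8-bit sample buffers. It is not VQMT's code, and the names `mse_8bit` and `psnr_8bit` are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Mean squared error between two equally sized 8-bit sample buffers.
// (Hypothetical helper name, not part of any library listed here.)
inline double mse_8bit(const std::vector<std::uint8_t>& a,
                       const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return sum / static_cast<double>(a.size());
}

// PSNR in dB for 8-bit samples (MAX = 255).
// Identical inputs have zero error, so the result is +infinity.
inline double psnr_8bit(const std::vector<std::uint8_t>& a,
                        const std::vector<std::uint8_t>& b) {
    const double mse = mse_8bit(a, b);
    if (mse == 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```

Note that PSNR only measures per-sample error; like SSIM, how color channels and color spaces are handled is a choice the caller has to make and document.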

image export, image / [scanned] document import

  • avir 📁 🌐 -- an image resizing / scaling library which has reached a production level of quality, and is ready to be incorporated into any project. This library features routines for both down- and upsizing of 8- and 16-bit, 1 to 4-channel images. Image resizing routines were implemented in portable, cross-platform, header-only C++ code, and have a high level of optimality. Besides resizing, this library offers a sub-pixel shift operation. Built-in sRGB gamma correction is available.

  • brunsli 📁 🌐 -- a lossless JPEG repacking library. Brunsli allows for a 22% decrease in file size while allowing the original JPEG to be recovered byte-by-byte.

  • CImg 📁 🌐 -- a small C++ toolkit for image processing.

  • CxImage 📁 🌐 -- venerated library for reading and creating many image file formats.

  • dcmtk 📁 🌐 -- the DICOM toolkit (DCMTK) package consists of source code, documentation and installation instructions for a set of software libraries and applications implementing part of the DICOM/MEDICOM Standard.

  • FFmpeg 📁 🌐 -- a collection of libraries and tools to process multimedia content such as audio, video, subtitles and related metadata.

  • fpng 📁 🌐 -- a very fast C++ .PNG image reader/writer for 24/32bpp images. fpng was written to see just how fast you can write .PNG's without sacrificing too much compression. The files written by fpng conform to the PNG standard, are readable using any PNG decoder, and load or validate successfully using libpng, wuffs, lodepng, stb_image, and pngcheck. PNG files written using fpng can also be read using fpng faster than other PNG libraries, due to its explicit use of Length-Limited Prefix Codes and an optimized decoder that exploits the properties of these codes.

  • fpnge 📁 🌐 -- fast PNG Encoder: a proof-of-concept fast PNG encoder that uses AVX2 and a special Huffman table to encode images faster. Speed on a single core is anywhere from 180 to 800 MP/s on a Threadripper 3970x, depending on compile time settings and content. It supports 8 and 16 bit content, 1 to 4 channels; it can also emit cICP chunks for signaling that the content should be interpreted as HDR.

  • freeimage 📁 🌐 -- a library supporting popular graphics image formats like PNG, BMP, JPEG, TIFF and others as needed by today's multimedia applications, providing an ANSI C interface.

  • giflib-turbo 📁 🌐 -- GIFLIB-Turbo is a faster drop-in replacement for GIFLIB. The original GIF codecs were written for a much different world and took great pains to use as little memory as possible and to accommodate a slow and unreliable input stream of data. Those constraints are no longer a problem for the vast majority of users and they were hurting the performance. Another feature holding back the performance of the original GIFLIB was that the original codec was designed to work with image data a line at a time and used a separate LZW dictionary to manage the strings of repeating symbols. My codec uses the output image as the dictionary; this allows much faster 'unwinding' of the codes since they are all stored in the right direction to just be copied to the new location.

  • grok-jpeg2000 📁 🌐 -- World's Leading Open Source JPEG 2000 Codec

    Features:

    • support for new High Throughput JPEG 2000 (HTJ2K) standard
    • fast random-access sub-image decoding using TLM and PLT markers
    • full encode/decode support for ICC colour profiles
    • full encode/decode support for XML, IPTC, XMP and EXIF meta-data
    • full encode/decode support for monochrome, sRGB, palette, YCC, extended YCC, CIELab and CMYK colour spaces
    • full encode/decode support for JPEG, PNG, BMP, TIFF, RAW, PNM and PAM image formats
    • full encode/decode support for 1-16 bit precision images
  • guetzli 📁 🌐 -- a JPEG encoder that aims for excellent compression density at high visual quality. Guetzli-generated images are typically 20-30% smaller than images of equivalent quality generated by libjpeg. Guetzli generates only sequential (nonprogressive) JPEGs due to the faster decompression speeds they offer.

  • icer_compression 📁 🌐 -- implements the NASA ICER image compression algorithm as a C library. Said compression algorithm is a progressive, wavelet-based image compression algorithm designed to be resistant to data loss, making it suitable for use as the image compression algorithm when encoding images to be transmitted over unreliable delivery channels, such as those in satellite radio communications.

  • Image-Compression-Benchmark 📁 🌐 -- Lossless Image Compression Benchmark: a comparison of 20+ lossless image compression formats on several datasets.

  • jbig2dec 📁 🌐 -- a decoder library and example utility implementing the JBIG2 bi-level image compression spec. Also known as ITU T.88 and ISO IEC 14492, and included by reference in Adobe's PDF version 1.4 and later.

  • jbig2enc 📁 🌐 -- an encoder for JBIG2. JBIG2 encodes bi-level (1 bpp) images using a number of clever tricks to get better compression than G4. This encoder can:

    • Generate JBIG2 files, or fragments for embedding in PDFs
    • Perform generic region encoding
    • Perform symbol extraction, classification and text region coding
    • Perform refinement coding, and
    • Compress multipage documents

    It uses the Leptonica library.

  • jbigkit 📁 🌐 -- JBIG-KIT lossless image compression library, which implements a highly effective data compression algorithm for bi-level high-resolution images such as fax pages or scanned documents. JBIG-KIT implements the specification: International Standard ISO/IEC 11544:1993 and ITU-T Recommendation T.82(1993), "Information technology - Coded representation of picture and audio information - progressive bi-level image compression", http://www.itu.int/rec/T-REC-T.82, a.k.a. the "JBIG1 standard".

  • jpeginfo 📁 🌐 -- prints information and tests integrity of JPEG/JFIF files.

  • JPEG-XL 📁 🌐 -- JPEG XL reference implementation (encoder and decoder), called libjxl. JPEG XL was standardized in 2022 as ISO/IEC 18181. The core codestream is specified in 18181-1, the file format in 18181-2. Decoder conformance is defined in 18181-3, and 18181-4 is the reference software.

  • knusperli 📁 🌐 -- Knusperli reduces blocking artifacts in decoded JPEG images by interpreting quantized DCT coefficients in the image data as an interval, rather than a fixed value, and choosing the value from that interval that minimizes discontinuities at block boundaries.

  • lerc 📁 🌐 -- LERC (Limited Error Raster Compression) is an open-source image or raster format which supports rapid encoding and decoding for any pixel type (not just RGB or Byte). Users set the maximum compression error per pixel while encoding, so the precision of the original input image is preserved (within user defined error bounds).

  • libaom 📁 🌐 -- AV1 Codec Library

  • libavif 📁 🌐 -- a friendly, portable C implementation of the AV1 Image File Format, as described here: https://aomediacodec.github.io/av1-avif/

  • libde265 📁 🌐 -- libde265 is an open source implementation of the h.265 video codec. It is written from scratch and has a plain C API to enable a simple integration into other software. libde265 supports WPP and tile-based multithreading and includes SSE optimizations. The decoder includes all features of the Main profile and correctly decodes almost all conformance streams (see [wiki page]).

  • libgd 📁 🌐 -- GD is a library for the dynamic creation of images by programmers. GD has support for: WebP, JPEG, PNG, AVIF, HEIF, TIFF, BMP, GIF, TGA, WBMP, XPM.

  • libgif 📁 🌐 -- a library for manipulating GIF files.

  • libheif 📁 🌐 -- High Efficiency Image File Format (HEIF) :: a visual media container format standardized by the Moving Picture Experts Group (MPEG) for storage and sharing of images and image sequences. It is based on the well-known ISO Base Media File Format (ISOBMFF) standard. HEIF Reader/Writer Engine is an implementation of HEIF standard in order to demonstrate its powerful features and capabilities.

  • libheif-alt 📁 🌐 -- an ISO/IEC 23008-12:2017 HEIF and AVIF (AV1 Image File Format) file format decoder and encoder. HEIF and AVIF are new image file formats employing HEVC (h.265) or AV1 image coding, respectively, for the best compression ratios currently possible.

  • libjpeg 📁 🌐 -- the Independent JPEG Group's JPEG software

  • libjpeg-turbo 📁 🌐 -- a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Arm systems. On such systems, libjpeg-turbo is generally 2-6x as fast as libjpeg, all else being equal. On other types of systems, libjpeg-turbo can still outperform libjpeg by a significant amount, by virtue of its highly-optimized Huffman coding routines. In many cases, the performance of libjpeg-turbo rivals that of proprietary high-speed JPEG codecs.

  • libkra 📁 🌐 -- a C++ library for importing Krita's KRA & KRZ formatted documents.

  • libpng 📁 🌐 -- LIBPNG: Portable Network Graphics support, official libpng repository.

  • LibRaw 📁 🌐 -- a library for reading and processing of RAW files generated by digital photo cameras.

  • libtiff 📁 🌐 -- TIFF Software Distribution

  • libultrahdr 📁 🌐 -- libultrahdr is an image compression library that uses gain map technology to store and distribute HDR images. Conceptually on the encoding side, the library accepts SDR and HDR rendition of an image and from these a Gain Map (quotient between the two renditions) is computed. The library then uses backward compatible means to store the base image (SDR), gain map image and some associated metadata.

  • libwebp 📁 🌐 -- a library to encode and decode images in WebP format.

  • lunasvg 📁 🌐 -- LunaSVG is a standalone SVG rendering library in C++.

  • mozjpeg 📁 🌐 -- the Mozilla JPEG Encoder Project improves JPEG compression efficiency achieving higher visual quality and smaller file sizes at the same time. It is compatible with the JPEG standard, and the vast majority of the world's deployed JPEG decoders. MozJPEG is a patch for libjpeg-turbo.

  • NBLI 📁 🌐 -- NBLI (New-Bee Lossless Image) is a fast, better lossless compression algorithm, which supports both RGB 24-bit and Gray 8-bit image formats.

  • OpenEXR 📁 🌐 -- a high dynamic-range (HDR) image file format developed by Industrial Light & Magic (ILM) for use in computer imaging applications. OpenEXR is a lossless format for multi-layered images. Professional use. (I've used it before; nice file format.)

  • openexr-images 📁 🌐 -- collection of images associated with the OpenEXR distribution.

  • OpenImageIO 📁 🌐 -- Reading, writing, and processing images in a wide variety of file formats, using a format-agnostic API, aimed at VFX applications.

    Also includes:

    • an ImageCache class that transparently manages a cache so that it can access truly vast amounts of image data (tens of thousands of image files totaling multiple TB) very efficiently using only a tiny amount (tens of megabytes at most) of runtime memory.
    • ImageBuf and ImageBufAlgo functions, which constitute a simple class for storing and manipulating whole images in memory, plus a collection of the most useful computations you might want to do involving those images, including many image processing operations.

    The primary target audience for OIIO is VFX studios and developers of tools such as renderers, compositors, viewers, and other image-related software you'd find in a production pipeline.

  • openjpeg 📁 🌐 -- OPENJPEG Library and Applications -- OpenJPEG is an open-source JPEG 2000 codec written in C language. It has been developed in order to promote the use of JPEG 2000, a still-image compression standard from the Joint Photographic Experts Group (JPEG). Since April 2015, it is officially recognized by ISO/IEC and ITU-T as a JPEG 2000 Reference Software.

  • pdiff 📁 🌐 -- perceptualdiff (pdiff): a program that compares two images using a perceptually based image metric.

  • pmt-png-tools 📁 🌐 -- pngcrush and other PNG and MNG tools

  • psd_sdk 📁 🌐 -- a C++ library that directly reads Photoshop PSD files. The library supports:

    • Groups
    • Nested layers
    • Smart Objects
    • User and vector masks
    • Transparency masks and additional alpha channels
    • 8-bit, 16-bit, and 32-bit data in grayscale and RGB color mode
    • All compression types known to Photoshop

    Additionally, limited export functionality is also supported.

  • qoi 📁 🌐 -- QOI: the “Quite OK Image Format” for fast, lossless image compression, single-file MIT licensed library for C/C++. Compared to stb_image and stb_image_write QOI offers 20x-50x faster encoding, 3x-4x faster decoding and 20% better compression. It's also stupidly simple and fits in about 300 lines of C.

  • qoir 📁 🌐 -- QOIR (pronounced like "choir") is a simple, lossless image file format that is very fast to encode and decode while achieving compression ratios roughly comparable to PNG. It was inspired by the QOI image file format.

  • rawspeed 📁 🌐 -- a library for decoding various images in RAW file format, while providing the fastest decoding speed possible. Supports the most common DSLR and similar class brands.

  • SFML 📁 🌐 -- Simple and Fast Multimedia Library (SFML) is a simple, fast, cross-platform and object-oriented multimedia API. It provides access to windowing, graphics, audio and network.

  • tinyexr 📁 🌐 -- Tiny OpenEXR: tinyexr is a small, single header-only library to load and save OpenEXR (.exr) images.

  • twain_library 📁 🌐 -- the DTWAIN Library, Version 5.x, from Dynarithmic Software. DTWAIN is an open source programmer's library that will allow applications to acquire images from TWAIN-enabled devices using a simple Application Programmer's Interface (API).

  • vpp 📁 🌐 -- Video++ is a video and image processing library taking advantage of the C++14 standard to ease the writing of fast video and image processing applications. The idea behind Video++ performance is to generate via meta-programming code that the compiler can easily optimize. Its main features are generic N-dimensional image containers, a growing set of image processing algorithms, zero-cost abstractions to easily write image processing algorithms for multicore SIMD processors and an embedded language to evaluate image expressions.

  • Imath 🌐 -- float16 support lib for OpenEXR format

    • optional; reason: considered overkill for the projects I'm currently involved in, including Qiqqa. Those can use Apache Tika, ImageMagick or other thirdparty pipelines to convert to & from supported formats.
  • OpenImageIO 🌐 -- a library for reading, writing, and processing images in a wide variety of file formats, using a format-agnostic API, aimed at VFX applications.

    • tentative/pending; reason: considered nice & cool but still overkill. Qiqqa tooling can use Apache Tika, ImageMagick or other thirdparty pipelines to convert to & from supported formats.
  • cgohlke::imagecodecs 🌐 (not included; see also DICOM slot above)

  • DICOM to NIfTI (not included; see also DICOM slot above)

  • GDCM-Grassroots-DICOM 🌐

    • removed; reason: not a frequently used format; the filter codes can be found in other libraries. Overkill. Qiqqa tooling can use Apache Tika, ImageMagick or other thirdparty pipelines to convert to & from supported formats.
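
The knusperli entry above rests on one observation: a quantized DCT coefficient stands for a whole range of possible original values, not a single one. The sketch below shows only that interval, assuming round-to-nearest quantization with step `q`, so that a stored integer `k` corresponds to originals in `[(k - 0.5) * q, (k + 0.5) * q]`. The names `Interval`, `coefficient_interval` and `clamp_to_interval` are hypothetical and are not taken from knusperli; the search for the value that minimizes block-boundary discontinuities is not shown.

```cpp
#include <algorithm>
#include <cassert>

// Closed range of original coefficient values consistent with one quantized value.
struct Interval {
    double lo;
    double hi;
};

// Assumes the encoder used round-to-nearest: k = round(original / q).
// (Illustrative helper, not knusperli's API.)
inline Interval coefficient_interval(int k, int q) {
    assert(q > 0);
    return Interval{(k - 0.5) * q, (k + 0.5) * q};
}

// A decoder that adjusts a coefficient (for example, to smooth a block edge)
// can stay faithful to the file by clamping its candidate into the interval.
inline double clamp_to_interval(double value, const Interval& iv) {
    return std::max(iv.lo, std::min(value, iv.hi));
}
```

A conventional decoder would simply use the midpoint `k * q`; the interval view gives a deblocking step some freedom without contradicting the encoded data.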

Monte Carlo simulations, LDA, keyword inference/extraction, etc.

  • ceres-solver 📁 🌐 -- a library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems: (1) Non-linear Least Squares problems with bounds constraints, and (2) General unconstrained optimization problems.

  • gibbs-lda 📁 🌐 -- modified GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation by Xuan-Hieu Phan and Cam-Tu Nguyen.

  • lda 📁 🌐 -- variational EM for latent Dirichlet allocation (LDA), David Blei et al

  • lda-3-variants 📁 🌐 -- three modified open source versions of LDA with collapsed Gibbs Sampling: GibbsLDA++, ompi-lda and online_twitter_lda.

  • lda-bigartm 📁 🌐 -- BigARTM is a powerful tool for topic modeling based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding the weighted sums of regularizers to the optimization criterion. BigARTM is known to combine well very different objectives, including sparsing, smoothing, topics decorrelation and many others. Such combination of regularizers significantly improves several quality measures at once almost without any loss of the perplexity.

  • lda-Familia 📁 🌐 -- Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering (Di Jiang and Yuanfeng Song and Rongzhong Lian and Siqi Bao and Jinhua Peng and Huang He and Hua Wu) (2018)

  • LightLDA 📁 🌐 -- a distributed system for large scale topic modeling. It implements a distributed sampler that enables very large data sizes and models. LightLDA improves sampling throughput and convergence speed via a fast O(1) Metropolis-Hastings algorithm, and allows a small cluster to tackle very large data and model sizes through a model scheduling and data parallelism architecture. LightLDA is implemented in C++ for performance reasons.

  • mcmc 📁 🌐 -- Monte Carlo

  • mmc 📁 🌐 -- Monte Carlo

  • multiverso 📁 🌐 -- a parameter server based framework for training machine learning models on big data across a number of machines. It is currently a standard C++ library and provides a series of friendly programming interfaces. With it, machine learning researchers and practitioners do not need to worry about system routine issues such as distributed model storage and operation, inter-process and inter-thread communication, multi-threading management, and so on. Instead, they are able to focus on the core machine learning logic: data, model, and training.

  • ncnn 📁 🌐 -- high-performance neural network inference computing framework optimized for mobile platforms (i.e. small footprint)

  • OptimizationTemplateLibrary 📁 🌐 -- Optimization Template Library (OTL)

  • pke 📁 🌐 -- python keyphrase extraction (PKE) is an open source python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset.

  • RAKE 📁 🌐 -- the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.

  • SDLP 📁 🌐 -- Seidel's LP Algorithm: Linear-Complexity Linear Programming (LP) for Small-Dimensions: this solver is very efficient for small-dimensional LP with any number of constraints, as mostly encountered in computational geometry. Its complexity is linear in the number of constraints.

  • stan 📁 🌐 -- Stan is a C++ package providing (1) full Bayesian inference using the No-U-Turn sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC), (2) approximate Bayesian inference using automatic differentiation variational inference (ADVI), and (3) penalized maximum likelihood estimation (MLE) using L-BFGS optimization. It is built on top of the Stan Math library.

  • stateline 📁 🌐 -- a framework for distributed Markov Chain Monte Carlo (MCMC) sampling written in C++. It implements random walk Metropolis-Hastings with parallel tempering to improve chain mixing, provides an adaptive proposal distribution to speed up convergence, and allows the user to factorise their likelihoods (eg. over sensors or data).

  • waifu2x-ncnn-vulkan 📁 🌐 -- waifu2x ncnn Vulkan: an ncnn project implementation of the waifu2x converter. Runs fast on Intel / AMD / Nvidia / Apple-Silicon with Vulkan API.

  • warpLDA 📁 🌐 -- a cache efficient implementation for Latent Dirichlet Allocation.

  • worde_butcher 📁 🌐 -- a tool for text segmentation, keyword extraction and speech tagging. Butchers any text into prime word / phrase cuts, deboning all incoming based on our definitive set of stopwords for all languages.

  • yake 📁 🌐 -- Yet Another Keyword Extractor (Yake) is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, size of the text, language or domain.

  • other topic modeling code on the Net:

Random generators & all things random

  • EigenRand 📁 🌐 -- EigenRand: The Fastest C++11-compatible random distribution generator for Eigen. EigenRand is a header-only library for Eigen, providing vectorized random number engines and vectorized random distribution generators. Since the classic Random functions of Eigen rely on the old C function rand(), there is no way to control random numbers and no guarantee for quality of generated numbers. In addition, Eigen's Random is slow because rand() is hard to vectorize. EigenRand provides a variety of random distribution functions similar to C++11 standard's random functions, which can be vectorized and easily integrated into Eigen's expressions of Matrix and Array. You can get a 5~10 times speedup by just replacing Eigen's old Random or unvectorizable C++11 random number generators with EigenRand.
  • fastPRNG 📁 🌐 -- a single header-only FAST 32/64 bit PRNG (pseudo-random generator), highly optimized to obtain faster code from compilers, it's based on xoshiro / xoroshiro (Blackman/Vigna), xorshift and other Marsaglia algorithms.
  • libchaos 📁 🌐 -- Advanced library for randomization, hashing and statistical analysis (devoted to chaos machines) written to help with the development of software for scientific research. Project goal is to implement & analyze various algorithms for randomization and hashing, while maintaining simplicity and security, making them suitable for use in your own code. Popular tools like TestU01, Dieharder and Hashdeep are obsolete or their development has been stopped. Libchaos aims to replace them.
  • libprng 📁 🌐 -- a collection of C/C++ PRNGs (pseudo-random number generators) + supporting code.
  • pcg-cpp-random 📁 🌐 -- a C++ implementation of the PCG family of random number generators, which are fast, statistically excellent, and offer a number of useful features.
  • pcg-c-random 📁 🌐 -- a C implementation of the PCG family of random number generators, which are fast, statistically excellent, and offer a number of useful features.
  • prvhash 📁 🌐 -- PRVHASH is a hash function that generates a uniform pseudo-random number sequence derived from the message. PRVHASH is conceptually similar (in the sense of using a pseudo-random number sequence as a hash) to keccak and RadioGatun schemes, but is a completely different implementation of such concept. PRVHASH is both a "randomness extractor" and an "extendable-output function" (XOF).
  • randen 📁 🌐 -- What if we could default to attack-resistant random generators without excessive CPU cost? We introduce 'Randen', a new generator with security guarantees; it outperforms MT19937, pcg64_c32, Philox, ISAAC and ChaCha8 in real-world benchmarks. This is made possible by AES hardware acceleration and a large Feistel permutation.
  • random 📁 🌐 -- random for modern C++ with a convenient API.
  • RNGSobol 📁 🌐 -- Sobol quasi-random number generator (C++). Note that unlike pseudo-random numbers, quasi-random numbers care about dimensionality of points.
  • trng4 📁 🌐 -- Tina’s Random Number Generator Library (TRNG) is a state of the art C++ pseudo-random number generator library for sequential and parallel Monte Carlo simulations. Its design principles are based on the extensible random number generator facility that was introduced in the C++11 standard. The TRNG library features an object oriented design, is easy to use and has been speed optimized. Its implementation does not depend on any communication library or hardware architecture.
  • Xoshiro-cpp 📁 🌐 -- a header-only pseudorandom number generator library for modern C++. Based on David Blackman and Sebastiano Vigna's xoshiro/xoroshiro generators.
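
Several entries above (fastPRNG, Xoshiro-cpp) build on Blackman and Vigna's xoshiro family. For orientation, here is a minimal sketch of the xoshiro256** step function, seeded through splitmix64, which is commonly used for that purpose. This is an independent illustration, not code from either library: the names `Xoshiro256ss` and `make_xoshiro` are made up here, and the generator is not suitable for cryptographic use.

```cpp
#include <cassert>
#include <cstdint>

// splitmix64: expands a single 64-bit seed into a stream of well-mixed words.
inline std::uint64_t splitmix64(std::uint64_t& x) {
    std::uint64_t z = (x += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// xoshiro256**: 256 bits of state, 64-bit output per step.
struct Xoshiro256ss {
    std::uint64_t s[4];  // must not be all zero

    static std::uint64_t rotl(std::uint64_t x, int k) {
        return (x << k) | (x >> (64 - k));
    }

    std::uint64_t next() {
        // Output scrambler ("**"): multiply, rotate, multiply.
        const std::uint64_t result = rotl(s[1] * 5, 7) * 9;
        // Linear state update (xor / shift / rotate).
        const std::uint64_t t = s[1] << 17;
        s[2] ^= s[0];
        s[3] ^= s[1];
        s[1] ^= s[2];
        s[0] ^= s[3];
        s[2] ^= t;
        s[3] = rotl(s[3], 45);
        return result;
    }
};

// Hypothetical convenience: fill the state from one seed via splitmix64.
inline Xoshiro256ss make_xoshiro(std::uint64_t seed) {
    Xoshiro256ss g;
    for (auto& word : g.s) {
        word = splitmix64(seed);
    }
    assert((g.s[0] | g.s[1] | g.s[2] | g.s[3]) != 0);
    return g;
}
```

Two generators created with the same seed produce the same sequence, which is what makes simulations reproducible. For production code, prefer one of the maintained libraries listed above.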

Regression, curve fitting, polynomials, splines, geometrics, interpolation, math

  • baobzi 📁 🌐 -- an adaptive fast function approximator based on tree search. Word salad aside, baobzi is a tool to convert very CPU intensive function calculations into relatively cheap ones (at the cost of memory). This is similar to functions like chebeval in MATLAB, but can be significantly faster since the order of the polynomial fit can be much much lower to meet similar tolerances. It also isn't constrained for use only in MATLAB. Internally, baobzi represents your function by a grid of binary/quad/oct/N trees, where the leaves represent the function in some small sub-box of the function's domain with chebyshev polynomials. When you evaluate your function at a point with baobzi, it searches the tree for the box containing your point and evaluates using this approximant.
  • blaze 📁 🌐 -- a high-performance C++ math library for dense and sparse arithmetic. With its state-of-the-art Smart Expression Template implementation Blaze combines the elegance and ease of use of a domain-specific language with HPC-grade performance, making it one of the most intuitive and fastest C++ math libraries available.
  • CDT 📁 🌐 -- a numerically robust library for generating constrained or conforming Delaunay triangulations, while properly handling the corner-cases.
  • Clipper2 📁 🌐 -- a Polygon Clipping and Offsetting library.
  • delaunator-cpp 📁 🌐 -- a really fast C++ library for Delaunay triangulation of 2D points.
  • fastops 📁 🌐 -- vector operations library, which enables acceleration of bulk calls of certain math functions on AVX and AVX2 hardware. Currently supported operations are exp, log, sigmoid and tanh. The library itself implements operations using AVX and AVX2, but will work on any hardware with at least SSE2 support.
  • fastrange 📁 🌐 -- a fast alternative to the modulo reduction. It has accelerated some operations in Google's Tensorflow by 10% to 20%. Further reading : http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ See also: Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, January 2019 Article No. 3 https://doi.org/10.1145/3230636
  • figtree 📁 🌐 -- FIGTree is a library that provides a C/C++ and MATLAB interface for speeding up the computation of the Gauss Transform.
  • fityk 📁 🌐 -- a program for nonlinear fitting of analytical functions (especially peak-shaped) to data (usually experimental data). To put it differently, it is primarily peak fitting software, but can handle other types of functions as well. Apart from the actual fitting, the program helps with data processing and provides an ergonomic graphical interface (and also command line interface and scripting API -- but if the program is popular in some fields, it's thanks to its graphical interface). It is reportedly used in crystallography, chromatography, photoluminescence and photoelectron spectroscopy, infrared and Raman spectroscopy, to name but a few. Fityk offers various nonlinear fitting methods, simple background subtraction and other manipulations to the dataset, easy placement of peaks and changing of peak parameters, support for analysis of series of datasets, automation of common tasks with scripts, and much more.
  • float_compare 📁 🌐 -- C++ header providing floating point value comparators with user-specifiable tolerances and behaviour.
  • fmath 📁 🌐 -- fast approximate implementations of the exponential and logarithm functions: includes fmath::log, fmath::exp, fmath::expd.
  • gmt 📁 🌐 -- GMT (Generic Mapping Tools) is an open source collection of about 100 command-line tools for manipulating geographic and Cartesian data sets (including filtering, trend fitting, gridding, projecting, etc.) and producing high-quality illustrations ranging from simple x-y plots via contour maps to artificially illuminated surfaces, 3D perspective views and animations. The GMT supplements add another 50 more specialized and discipline-specific tools. GMT supports over 30 map projections and transformations and requires support data such as GSHHG coastlines, rivers, and political boundaries and optionally DCW country polygons.
  • half 📁 🌐 -- IEEE 754-based half-precision floating point library forked from http://half.sourceforge.net/. This is a C++ header-only library to provide an IEEE 754 conformant 16-bit half-precision floating-point type along with corresponding arithmetic operators, type conversions and common mathematical functions. It aims for both efficiency and ease of use, trying to accurately mimic the behaviour of the built-in floating-point types at the best performance possible.
  • hilbert_curves 📁 🌐 -- the world's fastest implementations of 2D and 3D hilbert curve functions.
  • hilbert_hpp 📁 🌐 -- contains two implementations of the hilbert curve encoding & decoding algorithm described by John Skilling in his paper "Programming the Hilbert Curve".
  • ifopt 📁 🌐 -- a modern, light-weight, Eigen-based C++ interface to Nonlinear Programming solvers, such as Ipopt and Snopt.
  • ink-stroke-modeler 📁 🌐 -- smoothes raw freehand input and predicts the input's motion to minimize display latency. It turns noisy pointer input from touch/stylus/etc. into the beautiful stroke patterns of brushes/markers/pens/etc. Be advised that this library was designed to model handwriting, and as such, prioritizes smooth, good-looking curves over precise recreation of the input.
  • Ipopt 📁 🌐 -- Ipopt (Interior Point OPTimizer, pronounced eye-pea-Opt) is a software package for large-scale nonlinear optimization. It is designed to find (local) solutions of mathematical optimization problems.
  • libhilbert 📁 🌐 -- an implementation of the Chenyang, Hong, Nengchao 2008 IEEE N-dimensional Hilbert mapping algorithm. The Hilbert generating genes are statically compiled into the library, thus producing a rather large executable size. This library supports the forward and backward mapping algorithms from R_N -> R_1 and R_1 -> R_N. The library is straightforward to use; for guidance and documentation, see hilbertKey.h.
  • libInterpolate 📁 🌐 -- a C++ interpolation library, which provides classes to perform various types of 1D and 2D function interpolation (linear, spline, etc.).
  • libMultiRobotPlanning 📁 🌐 -- a library with search algorithms primarily for task and path planning for multi robot/agent systems. It is written in C++(14), highly templated for good performance, and comes with useful examples. The following algorithms are currently supported: A*, A* epsilon (also known as focal search), SIPP (Safe Interval Path Planning), Conflict-Based Search (CBS), Enhanced Conflict-Based Search (ECBS), Conflict-Based Search with Optimal Task Assignment (CBS-TA), Enhanced Conflict-Based Search with Optimal Task Assignment (ECBS-TA), Prioritized Planning using SIPP (example code for SIPP), Minimum sum-of-cost (flow-based; integer costs; any number of agents/tasks) and Best Next Assignment (series of optimal solutions)
  • lmfit 📁 🌐 -- Levenberg-Marquardt least squares minimization and curve fitting. Use it to minimize arbitrary user-provided functions, or to fit user-provided data. No need to provide derivatives.
  • lol 📁 🌐 -- the header-only part of the Lol (Math) Engine framework.
  • lolremez 📁 🌐 -- LolRemez is a Remez algorithm implementation to approximate functions using polynomials.
  • magsac 📁 🌐 -- (MAGSAC++ has been included in OpenCV) the MAGSAC and MAGSAC++ algorithms for robust model fitting without using a single inlier-outlier threshold.
  • mathtoolbox 📁 🌐 -- mathematical tools (interpolation, dimensionality reduction, optimization, etc.) written in C++11 and Eigen.
  • mlinterp 📁 🌐 -- a fast C++ routine for linear interpolation in arbitrary dimensions (i.e., multilinear interpolation).
  • nlopt-util 📁 🌐 -- a single-header utility library for calling NLopt optimization in a single line using Eigen::VectorXd.
  • openlibm 📁 🌐 -- OpenLibm is an effort to have a high quality, portable, standalone C mathematical library (libm). The project was born out of a need to have a good libm for the Julia programming language that worked consistently across compilers and operating systems, and in 32-bit and 64-bit environments.
  • PointCloudSegmentation 📁 🌐 -- three algorithms for point cloud segmentation used in the following paper: Pairwise Linkage for Point Cloud Segmentation, Xiaohu Lu et al., ISPRS 2016. https://github.com/xiaohulugo/xiaohulugo.github.com/blob/master/papers/PLinkage_Point_Segmentation_ISPRS2016.pdf
  • polatory 📁 🌐 -- a fast and memory-efficient framework for RBF (radial basis function) interpolation. Polatory can perform kriging prediction via RBF interpolation (dual kriging). Although different terminologies are used, both methods produce the same results.
  • poly2tri 📁 🌐 -- Sweep‐line algorithm for constrained Delaunay triangulation, Domiter V. and Zalik B. (2008). Note: there is no input validation of the data given for triangulation, so you need to validate it yourself. Poly2Tri does not support repeated points within epsilon.
  • qHilbert 📁 🌐 -- a vectorized speedup of Hilbert curve generation using SIMD intrinsics. A Hilbert curve is a space-filling, self-similar curve that provides a mapping from 2D space to 1D, and from 1D to 2D space, while preserving locality between mappings. Hilbert curves split a finite 2D space into recursive quadrants (similar to a full quad-tree) and traverse each quadrant in recursive "U" shapes at each iteration such that every quadrant gets fully visited before moving onto the next one. qHilbert is an attempt at a vectorized speedup of mapping multiple linear 1D indices into planar 2D points in parallel that is based on the Butz Algorithm's utilization of Gray code.
  • radon-tf 📁 🌐 -- simple implementation of the radon transform. Faster when using more than one thread to execute it. No inverse function is provided. CPU implementation only.
  • RNGSobol 📁 🌐 -- Sobol quasi-random number generator (C++). Note that unlike pseudo-random numbers, quasi-random numbers care about dimensionality of points.
  • rotate 📁 🌐 -- provides several classic, commonly used and novel rotation algorithms (aka block swaps), which were documented since around 1981 up to 2021: three novel rotation algorithms were introduced in 2021, notably the trinity rotation.
  • RRD 📁 🌐 -- RRD: Rotation-Sensitive Regression for Oriented Scene Text Detection
  • RRPN 📁 🌐 -- RRPN: Arbitrary-Oriented Scene Text Detection via Rotation Proposals (https://arxiv.org/abs/1703.01086)
  • rtl 📁 🌐 -- RANSAC Template Library (RTL) is an open-source robust regression tool, especially for the RANSAC family. RTL aims to provide fast, accurate, and easy ways to estimate any model parameters with data contaminated with outliers (incorrect data). RTL includes recent RANSAC variants with their performance evaluation with several models with synthetic and real data. RANdom SAmple Consensus (RANSAC) is an iterative method to make any parameter estimator robust against outliers. In line fitting, for example, RANSAC makes it possible to estimate the line parameters even though the data points include wrong observations far from the true line.
  • ruy 📁 🌐 -- a matrix multiplication library. Its focus is to cover the matrix multiplication needs of neural network inference engines. Its initial user has been TensorFlow Lite, where it is used by default on the ARM CPU architecture. ruy supports both floating-point and 8-bit-integer-quantized matrices.
  • scilab 📁 🌐 -- Scilab includes hundreds of mathematical functions. It has a high-level programming language allowing access to advanced data structures, 2-D and 3-D graphical functions.
  • sequential-line-search 📁 🌐 -- a C++ library for performing the sequential line search method (which is a human-in-the-loop variant of Bayesian optimization), following the paper "Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. 2017. Sequential Line Search for Efficient Visual Design Optimization by Crowds. ACM Trans. Graph. 36, 4, pp.48:1--48:11 (2017). (a.k.a. Proceedings of SIGGRAPH 2017), DOI: https://doi.org/10.1145/3072959.3073598"
  • sod 📁 🌐 -- SOD is an embedded, modern cross-platform computer vision and machine learning software library that exposes a set of APIs for deep-learning, advanced media analysis & processing including real-time, multi-class object detection and model training on embedded systems with limited computational resources and IoT devices. SOD was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in open source as well as commercial products.
  • Sophus 📁 🌐 -- a C++ implementation of Lie groups commonly used for 2d and 3d geometric problems (i.e. for Computer Vision or Robotics applications). Among others, this package includes the special orthogonal groups SO(2) and SO(3) to represent rotations in 2d and 3d as well as the special Euclidean groups SE(2) and SE(3) to represent rigid body transformations (i.e. rotations and translations) in 2d and 3d.
  • spline 📁 🌐 -- a lightweight C++ cubic spline interpolation library.
  • splinter 📁 🌐 -- SPLINTER (SPLine INTERpolation) is a library for multivariate function approximation with splines. The library can be used for function approximation, regression, data smoothing, data reduction, and much more. Spline approximations are represented by a speedy C++ implementation of the tensor product B-spline. The B-spline consists of piecewise polynomial basis functions, offering a high flexibility and smoothness. The B-spline can be fitted to data using ordinary least squares (OLS), possibly with regularization. The library also offers construction of penalized splines (P-splines).
  • sse2neon 📁 🌐 -- converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics, shortening the time needed to get an Arm working program that then can be used to extract profiles and to identify hot paths in the code.
  • sse-popcount 📁 🌐 -- SIMD popcount; sample programs for my article http://0x80.pl/articles/sse-popcount.html / Faster Population Counts using AVX2 Instructions (https://arxiv.org/abs/1611.07612)
  • theoretica 📁 🌐 -- a numerical and automatic math library for scientific research and graphical applications. Theoretica is a header-only mathematical library which provides algorithms for systems simulation, statistical analysis of lab data and numerical approximation, using a functional oriented paradigm to mimic mathematical notation and formulas. The aim of the library is to provide simple access to powerful algorithms while keeping an elegant and transparent interface, enabling the user to focus on the problem at hand.
  • tindicators 📁 🌐 -- a library of technical analysis indicators. It provides over 160 indicators and is blazing fast.
  • tinynurbs 📁 🌐 -- a lightweight header-only C++14 library for Non-Uniform Rational B-Spline curves and surfaces. The API is simple to use and the code is readable while being efficient.
  • tinyspline 📁 🌐 -- TinySpline is a small, yet powerful library for interpolating, transforming, and querying arbitrary NURBS, B-Splines, and Bézier curves.
  • TrianglePP 📁 🌐 -- Triangle++ is a C++ wrapper for J.R. Shewchuk's original 2005 C-language Triangle package. The library can create standard Delaunay triangulations and their duals, i.e. Voronoi diagrams (aka Dirichlet tessellations). Additionally it can generate quality Delaunay triangulations (where we can set bounds on the areas and angles of the resulting triangles) and constrained Delaunay triangulations (where we can connect some points with an edge and require that this edge will be part of the result).
  • tweeny 📁 🌐 -- an inbetweening library designed for the creation of complex animations for games and other beautiful interactive software. It leverages features of modern C++ to empower developers with an intuitive API for declaring tweenings of any type of value, as long as they support arithmetic operations. The goal of Tweeny is to provide means to create fluid interpolations when animating position, scale, rotation, frames or other values of screen objects, by setting their values as the tween starting point and then, after each tween step, plugging back the result.
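
The qHilbert entry above describes the Hilbert curve as a locality-preserving mapping between 1D indices and 2D points. As a rough illustration of that mapping only, here is a minimal scalar sketch using the classic iterative rotate-and-flip formulation. It is not qHilbert's vectorized Butz/Gray-code approach and it is not the API of any library listed here: the names `hilbert_rotate`, `hilbert_xy2d` and `hilbert_d2xy` are hypothetical, and the grid side `n` is assumed to be a power of two.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: rotate/flip a quadrant so consecutive sub-curves join up.
void hilbert_rotate(int n, int& x, int& y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) {
            x = n - 1 - x;
            y = n - 1 - y;
        }
        std::swap(x, y);
    }
}

// Map a cell (x, y) of an n-by-n grid to its distance d along the curve.
// Assumption: n is a power of two and 0 <= x, y < n.
int hilbert_xy2d(int n, int x, int y) {
    assert(n > 0 && (n & (n - 1)) == 0);
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        const int rx = (x & s) ? 1 : 0;
        const int ry = (y & s) ? 1 : 0;
        d += s * s * ((3 * rx) ^ ry);   // which quadrant, in curve order
        hilbert_rotate(n, x, y, rx, ry);
    }
    return d;
}

// Inverse mapping: distance d along the curve back to the cell (x, y).
// Assumption: n is a power of two and 0 <= d < n * n.
std::pair<int, int> hilbert_d2xy(int n, int d) {
    assert(n > 0 && (n & (n - 1)) == 0);
    int x = 0;
    int y = 0;
    int t = d;
    for (int s = 1; s < n; s *= 2) {
        const int rx = 1 & (t / 2);
        const int ry = 1 & (t ^ rx);
        hilbert_rotate(s, x, y, rx, ry);
        x += s * rx;
        y += s * ry;
        t /= 4;
    }
    return std::make_pair(x, y);
}
```

On a 2x2 grid this visits the cells in the "U" order (0,0), (0,1), (1,1), (1,0); larger grids repeat that shape recursively, which is where the locality-preserving property comes from.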

Solvers, Clustering, Monte Carlo, Decision Trees

  • adaptive_clustering 📁 🌐 -- a lightweight and accurate point cloud clustering method from the paper Online learning for 3D LiDAR-based human detection: Experimental analysis of point cloud clustering and classification methods, Zhi Yan and Tom Duckett and Nicola Bellotto, 2019.
  • agglomerative-hierarchical-clustering 📁 🌐 -- implements the Agglomerative Hierarchical Clustering algorithm.
  • ArborX 📁 🌐 -- a library designed to provide performance portable algorithms for geometric search, similarly to nanoflann and Boost Geometry.
  • baobzi 📁 🌐 -- an adaptive fast function approximator based on tree search. Word salad aside, baobzi is a tool to convert very CPU intensive function calculations into relatively cheap ones (at the cost of memory). This is similar to functions like chebeval in MATLAB, but can be significantly faster since the order of the polynomial fit can be much much lower to meet similar tolerances. It also isn't constrained for use only in MATLAB. Internally, baobzi represents your function by a grid of binary/quad/oct/N trees, where the leaves represent the function in some small sub-box of the function's domain with chebyshev polynomials. When you evaluate your function at a point with baobzi, it searches the tree for the box containing your point and evaluates using this approximant.
  • brown-cluster 📁 🌐 -- the Brown hierarchical word clustering algorithm. Runs in $O(N C^2)$, where $N$ is the number of word types and $C$ is the number of clusters. Algorithm by Brown, et al.: Class-Based n-gram Models of Natural Language, http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf
  • clustercat 📁 🌐 -- a fast, flexible word clustering software, ClusterCat induces word classes from unannotated text. Word classes are unsupervised part-of-speech tags, requiring no manually-annotated corpus. Words are grouped together that share syntactic/semantic similarities. They are used in many dozens of applications within natural language processing, machine translation, neural net training, and related fields.
  • CppNumericalSolvers 📁 🌐 -- a header-only C++17 BFGS / L-BFGS-B optimization library.
  • dbscan 📁 🌐 -- Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms: a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes: DBSCAN, HDBSCAN, OPTICS/OPTICSXi, FOSC, Jarvis-Patrick clustering, LOF (Local outlier factor), GLOSH (Global-Local Outlier Score from Hierarchies), kd-tree based kNN search, Fixed-radius NN search
  • dbscan_kdtree 📁 🌐 -- fast Implementation of DBSCAN using Kd-tree for acceleration. The use case is clustering point cloud (PCL library used).
  • depth_clustering 📁 🌐 -- a fast and robust algorithm to segment point clouds taken with a Velodyne sensor into objects. It works with all available Velodyne sensors, i.e. 16, 32 and 64 beam ones.
  • FIt-SNE 📁 🌐 -- FFT-accelerated implementation of t-Stochastic Neighborhood Embedding (t-SNE), which is a highly successful method for dimensionality reduction and visualization of high dimensional datasets. A popular implementation of t-SNE uses the Barnes-Hut algorithm to approximate the gradient at each iteration of gradient descent.
  • fityk 📁 🌐 -- a program for nonlinear fitting of analytical functions (especially peak-shaped) to data (usually experimental data). To put it differently, it is primarily peak fitting software, but can handle other types of functions as well. Apart from the actual fitting, the program helps with data processing and provides an ergonomic graphical interface (and also a command line interface and scripting API -- but if the program is popular in some fields, it's thanks to its graphical interface). It is reportedly used in crystallography, chromatography, photoluminescence and photoelectron spectroscopy, infrared and Raman spectroscopy, to name but a few. Fityk offers various nonlinear fitting methods, simple background subtraction and other manipulations to the dataset, easy placement of peaks and changing of peak parameters, support for analysis of series of datasets, automation of common tasks with scripts, and much more.
  • GALGO-2.0 📁 🌐 -- Genetic Algorithm in C++ with template metaprogramming and abstraction for constrained optimization. GALGO is a C++ template library, header only, designed to solve a problem under constraints (or not) by maximizing or minimizing an objective function on given boundaries. GALGO can also achieve multi-objective optimization. GALGO is based on chromosomes represented as a binary string of 0 and 1 containing the encoded parameters to be estimated. The user is free to choose the number of bits N to encode each one of them within the interval [1,64].
  • galib 📁 🌐 -- modern GAlib: a (modernized) C++ Genetic Algorithm Library. With GAlib you can add evolutionary algorithm optimization to almost any program using any data representation and standard or custom selection, crossover, mutation, scaling, and termination methods.
  • genieclust 📁 🌐 -- a faster and more powerful version of Genie (Fast and Robust Hierarchical Clustering with Noise Point Detection) – a robust and outlier resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016).
  • gram_savitzky_golay 📁 🌐 -- Savitzky-Golay filtering based on Gram polynomials, as described in General Least-Squares Smoothing and Differentiation by the Convolution (Savitzky-Golay) Method
  • hdbscan 📁 🌐 -- a fast parallel implementation for HDBSCAN* [1] (hierarchical DBSCAN). The implementation stems from our parallel algorithms [2] developed at MIT, and presented at SIGMOD 2021. Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram, which is used in visualizing clusters of different scales that arise for HDBSCAN*.
  • hdbscan-cpp 📁 🌐 -- Fast and Efficient Implementation of HDBSCAN in C++ using STL. HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection. In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select. HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).
  • ifopt 📁 🌐 -- a modern, light-weight, Eigen-based C++ interface to Nonlinear Programming solvers, such as Ipopt and Snopt.
  • Ipopt 📁 🌐 -- Ipopt (Interior Point OPTimizer, pronounced eye-pea-Opt) is a software package for large-scale nonlinear optimization. It is designed to find (local) solutions of mathematical optimization problems.
  • kiwi 📁 🌐 -- Kiwi is an efficient C++ implementation of the Cassowary constraint solving algorithm. Kiwi is an implementation of the algorithm based on the seminal Cassowary paper (https://constraints.cs.washington.edu/solvers/cassowary-tochi.pdf). It is not a refactoring of the original C++ solver. Kiwi has been designed from the ground up to be lightweight and fast. Kiwi ranges from 10x to 500x faster than the original Cassowary solver with typical use cases gaining a 40x improvement. Memory savings are consistently > 5x.
  • LBFGS-Lite 📁 🌐 -- a header-only L-BFGS unconstrained optimizer.
  • libclustering_dim_redux 📁 🌐 -- C++ code for dimension reduction, Kohonen maps (SOMs), t-SNE, PCA, kNN, ...
  • liblbfgs 📁 🌐 -- libLBFGS: C library of limited-memory BFGS (L-BFGS), a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal. The original FORTRAN source code is available at: http://www.ece.northwestern.edu/~nocedal/lbfgs.html
  • libMultiRobotPlanning 📁 🌐 -- a library with search algorithms primarily for task and path planning for multi robot/agent systems. It is written in C++(14), highly templated for good performance, and comes with useful examples. The following algorithms are currently supported: A*, A* epsilon (also known as focal search), SIPP (Safe Interval Path Planning), Conflict-Based Search (CBS), Enhanced Conflict-Based Search (ECBS), Conflict-Based Search with Optimal Task Assignment (CBS-TA), Enhanced Conflict-Based Search with Optimal Task Assignment (ECBS-TA), Prioritized Planning using SIPP (example code for SIPP), Minimum sum-of-cost (flow-based; integer costs; any number of agents/tasks) and Best Next Assignment (series of optimal solutions)
  • lmfit 📁 🌐 -- Levenberg-Marquardt least squares minimization and curve fitting. Use it to minimize arbitrary user-provided functions, or to fit user-provided data. No need to provide derivatives.
  • LMW-tree 📁 🌐 -- LMW-tree: learning m-way tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. See the related PhD thesis for more details on m-way nn trees. The algorithms are primarily focussed on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It places similar items into clusters and dissimilar items into different clusters. The data structures and algorithms can also be used for nearest neighbor search, supervised learning and other machine learning applications. The package includes EM-tree, K-tree, k-means, TSVQ, repeated k-means, clustering, random projections, random indexing, hashing, bit signatures. See the related PhD thesis for more details on these algorithms and representations.
  • mapreduce 📁 🌐 -- the MapReduce-MPI (MR-MPI) library. MapReduce is the operation popularized by Google for computing on large distributed data sets. See the Wikipedia entry on MapReduce for an overview of what a MapReduce is. The MR-MPI library is a simple, portable implementation of MapReduce that runs on any serial desktop machine or large parallel machine using MPI message passing.
  • mathtoolbox 📁 🌐 -- mathematical tools (interpolation, dimensionality reduction, optimization, etc.) written in C++11 and Eigen.
  • mcl 📁 🌐 -- MCL: Markov CLustering or the Markov CLuster algorithm, MCL is a method for clustering weighted or simple networks, a.k.a. graphs. It is accompanied in this source code by other network-related programs, one of which is RCL (restricted contingency linkage) for fast multi-resolution consensus clustering. The algorithm was conceived in 1998 and first published in a technical report in 1998. A PhD thesis and three more technical reports followed in 2000. The paper "van Dongen, Stijn: Graph clustering via a discrete uncoupling process, Siam Journal on Matrix Analysis and Applications 30-1, p121-141, 2008" https://doi.org/10.1137/040608635 is the result of a long-winded review process that started in 2000 and lay dormant for a long time, for reasons not entirely untypical within the realms of scientific publishing. This MCL implementation is fast, threaded, and uses sparse matrices. It runs on a single machine and can use multiple CPUs.
  • mcmc-jags 📁 🌐 -- JAGS (Just Another Gibbs Sampler), a program for analysis of Bayesian Graphical models by Gibbs Sampling.
  • MicroPather 📁 🌐 -- a path finder and A* solver (astar or a-star) written in platform independent C++ that can be easily integrated into existing code. MicroPather focuses on being a path finding engine for video games but is a generic A* solver.
  • Multicore-TSNE 📁 🌐 -- Multicore t-SNE is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with Python CFFI-based wrappers. This code also works faster than sklearn.TSNE on 1 core (as of version 0.18).
  • nlopt 📁 🌐 -- a library for nonlinear local and global optimization, for functions with and without gradient information. It is designed as a simple, unified interface and packaging of several free/open-source nonlinear optimization libraries.
  • nlopt-util 📁 🌐 -- a single-header utility library for calling NLopt optimization in a single line using Eigen::VectorXd.
  • openGA 📁 🌐 -- a free C++ Genetic Algorithm library.
  • openlibm 📁 🌐 -- OpenLibm is an effort to have a high quality, portable, standalone C mathematical library (libm). The project was born out of a need to have a good libm for the Julia programming language that worked consistently across compilers and operating systems, and in 32-bit and 64-bit environments.
  • optframe 📁 🌐 -- OptFrame: a C++ Optimization Framework from the paper: OptFrame: a computational framework for combinatorial optimization problems. Coelho, I.M., Ribas, S., Perché, M.H.P., Munhoz, P., Souza, M.J.F., Ochi, L.S. (2010), in Anais do XLII Simpósio Brasileiro de Pesquisa Operacional (SBPO). Bento Gonçalves-RS, pp 1887-1898.
  • or-tools 📁 🌐 -- Google Optimization Tools (a.k.a., OR-Tools) is an open-source, fast and portable software suite for solving combinatorial optimization problems. The suite includes a constraint programming solver, a linear programming solver and various graph algorithms.
  • osqp 📁 🌐 -- the Operator Splitting Quadratic Program Solver.
  • osqp-cpp 📁 🌐 -- a C++ wrapper for OSQP, an ADMM-based solver for quadratic programming. Compared with OSQP's native C interface, the wrapper provides a more convenient input format using Eigen sparse matrices and handles the lifetime of the OSQPWorkspace struct. This package has similar functionality to osqp-eigen.
  • osqp-eigen 📁 🌐 -- a simple C++ wrapper for osqp library.
  • paramonte 📁 🌐 -- ParaMonte (Plain Powerful Parallel Monte Carlo Library) is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions, in particular, the posterior distributions of Bayesian models in data science, Machine Learning, and scientific inference, with the design goal of unifying the automation (of Monte Carlo simulations), user-friendliness (of the library), accessibility (from multiple programming environments), high-performance (at runtime), and scalability (across many parallel processors).
  • quile 📁 🌐 -- Quilë is a C++20 header-only general purpose genetic algorithms library with no external dependencies supporting floating-point, integer, binary and permutation representations.
  • rgf 📁 🌐 -- Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better results than gradient boosted decision trees (GBDT) on a number of datasets and it has been used to win a few Kaggle competitions. Unlike the traditional boosted decision tree approach, RGF works directly with the underlying forest structure. RGF integrates two ideas: one is to include tree-structured regularization into the learning formulation; and the other is to employ the fully-corrective regularized greedy algorithm.
  • RNGSobol 📁 🌐 -- Sobol quasi-random number generator (C++). Note that unlike pseudo-random numbers, quasi-random numbers care about dimensionality of points.
  • scilab 📁 🌐 -- Scilab includes hundreds of mathematical functions. It has a high-level programming language allowing access to advanced data structures, 2-D and 3-D graphical functions.
  • SDLP 📁 🌐 -- Seidel's LP Algorithm: Linear-Complexity Linear Programming (LP) for Small-Dimensions: this solver is very efficient for small-dimensional LP with any number of constraints, as mostly encountered in computational geometry. Its complexity is linear in the number of constraints.
  • sequential-line-search 📁 🌐 -- a C++ library for performing the sequential line search method (which is a human-in-the-loop variant of Bayesian optimization), following the paper "Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. 2017. Sequential Line Search for Efficient Visual Design Optimization by Crowds. ACM Trans. Graph. 36, 4, pp.48:1--48:11 (2017). (a.k.a. Proceedings of SIGGRAPH 2017), DOI: https://doi.org/10.1145/3072959.3073598"
  • somoclu 📁 🌐 -- a massively parallel implementation of self-organizing maps. It exploits multicore CPUs, it is able to rely on MPI for distributing the workload in a cluster, and it can be accelerated by CUDA. A sparse kernel is also included, which is useful for training maps on vector spaces generated in text mining processes.
  • sundials 📁 🌐 -- SUNDIALS (SUite of Nonlinear and DIfferential/ALgebraic equation Solvers) is a family of software packages providing robust and efficient time integrators and nonlinear solvers that can easily be incorporated into existing simulation codes. The packages are designed to require minimal information from the user, allow users to supply their own data structures underneath the packages, and enable interfacing with user-supplied or third-party algebraic solvers and preconditioners.
  • theoretica 📁 🌐 -- a numerical and automatic math library for scientific research and graphical applications. Theoretica is a header-only mathematical library which provides algorithms for systems simulation, statistical analysis of lab data and numerical approximation, using a functional oriented paradigm to mimic mathematical notation and formulas. The aim of the library is to provide simple access to powerful algorithms while keeping an elegant and transparent interface, enabling the user to focus on the problem at hand.
  • thrill 📁 🌐 -- an EXPERIMENTAL C++ framework for algorithmic distributed Big Data batch computations on a cluster of machines. More information at http://project-thrill.org.
  • uno-solver 📁 🌐 -- a modern, modular solver for nonlinearly constrained nonconvex optimization.
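
Several entries above (dbscan, dbscan_kdtree, hdbscan, hdbscan-cpp) build on density-based clustering. To make the basic idea concrete, here is a minimal sketch of plain DBSCAN on 2D points. It assumes a brute-force O(n²) neighbour search rather than the kd-tree acceleration those libraries use, and every name in it (`Point2D`, `region_query`, `dbscan`, `kNoise`, `kUnvisited`) is hypothetical rather than taken from any of the listed projects.

```cpp
#include <cstddef>
#include <vector>

struct Point2D {
    double x;
    double y;
};

constexpr int kUnvisited = -2;  // point not examined yet
constexpr int kNoise = -1;      // point belongs to no cluster

// Indices of all points within distance eps of points[i] (i itself included).
std::vector<std::size_t> region_query(const std::vector<Point2D>& points,
                                      std::size_t i, double eps) {
    std::vector<std::size_t> neighbours;
    for (std::size_t j = 0; j < points.size(); ++j) {
        const double dx = points[i].x - points[j].x;
        const double dy = points[i].y - points[j].y;
        if (dx * dx + dy * dy <= eps * eps) {
            neighbours.push_back(j);
        }
    }
    return neighbours;
}

// Returns one label per point: 0, 1, 2, ... for clusters, kNoise for outliers.
std::vector<int> dbscan(const std::vector<Point2D>& points, double eps,
                        std::size_t min_pts) {
    std::vector<int> labels(points.size(), kUnvisited);
    int cluster = 0;
    for (std::size_t i = 0; i < points.size(); ++i) {
        if (labels[i] != kUnvisited) continue;
        std::vector<std::size_t> seeds = region_query(points, i, eps);
        if (seeds.size() < min_pts) {
            labels[i] = kNoise;  // may still become a border point later
            continue;
        }
        labels[i] = cluster;
        // Grow the cluster; seeds may be extended while we walk it by index.
        for (std::size_t k = 0; k < seeds.size(); ++k) {
            const std::size_t q = seeds[k];
            if (labels[q] == kNoise) labels[q] = cluster;  // border point
            if (labels[q] != kUnvisited) continue;
            labels[q] = cluster;
            const std::vector<std::size_t> nb = region_query(points, q, eps);
            if (nb.size() >= min_pts) {  // q is a core point: expand through it
                seeds.insert(seeds.end(), nb.begin(), nb.end());
            }
        }
        ++cluster;
    }
    return labels;
}
```

The two parameters `eps` and `min_pts` are the ones the hdbscan-cpp entry alludes to: HDBSCAN effectively sweeps over `eps` instead of asking the caller to pick a single value.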

Distance Metrics, Image Quality Metrics, Image Comparison

  • edit-distance 📁 🌐 -- a fast implementation of the edit distance (Levenshtein distance). The algorithm used in this library is proposed by Heikki Hyyrö, "Explaining and extending the bit-parallel approximate string matching algorithm of Myers", (2001) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7158&rep=rep1&type=pdf.

  • figtree 📁 🌐 -- FIGTree is a library that provides a C/C++ and MATLAB interface for speeding up the computation of the Gauss Transform.

  • flip 📁 🌐 -- ꟻLIP: A Tool for Visualizing and Communicating Errors in Rendered Images, implements the LDR-ꟻLIP and HDR-ꟻLIP image error metrics.

  • glfw 📁 🌐 -- GLFW is an Open Source, multi-platform library for OpenGL, OpenGL ES and Vulkan application development. It provides a simple, platform-independent API for creating windows, contexts and surfaces, reading input, handling events, etc.

  • imagedistance 📁 🌐 -- given two images, calculate their distance in several criteria.

  • iqa 📁 🌐 -- an Image Quality Analysis library.

  • libdip 📁 🌐 -- DIPlib is a C++ library for quantitative image analysis.

  • libxcam 📁 🌐 -- libXCam is a project for extended camera features that focuses on image quality improvement and video analysis. Many features are supported in image pre-processing, image post-processing and smart analysis. This library makes GPU/CPU/ISP work together to improve image quality. OpenCL is used to improve performance on different platforms.

  • magsac 📁 🌐 -- (MAGSAC++ has been included in OpenCV) the MAGSAC and MAGSAC++ algorithms for robust model fitting without using a single inlier-outlier threshold.

  • mecab 📁 🌐 -- MeCab (Yet Another Part-of-Speech and Morphological Analyzer) is a high-performance morphological analysis engine, designed to be independent of languages, dictionaries, and corpora, using Conditional Random Fields (CRF, http://www.cis.upenn.edu/~pereira/papers/crf.pdf) to estimate the parameters.

  • pg_similarity 📁 🌐 -- pg_similarity is an extension to support similarity queries on PostgreSQL. The implementation is tightly integrated in the RDBMS in the sense that it defines operators so instead of the traditional operators (= and <>) you can use ~~~ and ~!~ (any of these operators represents a similarity function).

  • poisson_blend 📁 🌐 -- a simple, readable implementation of Poisson Blending, that demonstrates the concepts explained in my article, seamlessly blending a source image and a target image, at some specified pixel location.

  • polatory 📁 🌐 -- a fast and memory-efficient framework for RBF (radial basis function) interpolation. Polatory can perform kriging prediction via RBF interpolation (dual kriging). Although different terminologies are used, both methods produce the same results.

  • radon-tf 📁 🌐 -- simple implementation of the radon transform. Faster when using more than one thread to execute it. No inverse function is provided. CPU implementation only.

  • RapidFuzz 📁 🌐 -- rapid fuzzy string matching in Python and C++ using the Levenshtein Distance.

  • rotate 📁 🌐 -- provides several classic, commonly used and novel rotation algorithms (a.k.a. block swaps), documented from around 1981 up to 2021. Three novel rotation algorithms were introduced in 2021, notably the trinity rotation.
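
    As a point of reference for what a rotation (block swap) does, here is a sketch of the classic triple-reversal rotation. It is one of the textbook algorithms, not the trinity rotation and not code from the rotate library; the helper name `rotate_left_by_reversal` is hypothetical.

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Rotate v left by k positions in place using three reversals:
    // reverse the first k items, reverse the rest, then reverse everything.
    template <typename T>
    void rotate_left_by_reversal(std::vector<T>& v, std::size_t k) {
        if (v.empty()) return;
        k %= v.size();
        const auto mid = v.begin() + static_cast<std::ptrdiff_t>(k);
        std::reverse(v.begin(), mid);
        std::reverse(mid, v.end());
        std::reverse(v.begin(), v.end());
    }
    ```

    The reversal approach needs no extra buffer but touches every element twice, which is the kind of trade-off the different rotation algorithms try to improve on.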

  • Shifted-Hamming-Distance 📁 🌐 -- Shifted Hamming Distance (SHD) is an edit-distance based filter that can quickly check whether the minimum number of edits (including insertions, deletions and substitutions) between two strings is smaller than a user defined threshold T (the number of allowed edits between the two strings). Testing if two strings differ by a small amount is a prevalent function that is used in many applications. One of its biggest uses, perhaps, is in DNA or protein mapping, where a short DNA or protein string is compared against an enormous database, in order to find similar matches. In such applications, a query string is usually compared against multiple candidate strings in the database. Only candidates that are similar to the query are considered matches and recorded. SHD expands the basic Hamming distance computation, which only detects substitutions, into a full-fledged edit-distance filter, which counts not only substitutions but insertions and deletions as well.
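
    The difference between plain and shifted Hamming comparison can be illustrated with a small, heavily simplified sketch. Both function names are hypothetical and nothing here is taken from the SHD sources: the real filter works on packed bit-vectors and is considerably more refined. The sketch only shows why looking at shifted alignments lets a Hamming-style test tolerate insertions and deletions.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Plain Hamming distance: counts differing positions (substitutions only).
    std::size_t hamming(const std::string& a, const std::string& b) {
        assert(a.size() == b.size());
        std::size_t diff = 0;
        for (std::size_t i = 0; i < a.size(); ++i) diff += (a[i] != b[i]);
        return diff;
    }

    // Simplified shifted filter: position i of `read` counts as explained when it
    // matches `ref` at offset i + s for some shift s in [-e, e]. The pair passes
    // when at most e positions remain unexplained. Like any filter of this kind,
    // it may let some dissimilar pairs through; it is a pre-check, not a distance.
    bool shifted_hamming_filter(const std::string& read, const std::string& ref, int e) {
        assert(e >= 0);
        std::size_t unexplained = 0;
        for (std::size_t i = 0; i < read.size(); ++i) {
            bool matched = false;
            for (int s = -e; s <= e && !matched; ++s) {
                const std::ptrdiff_t j = static_cast<std::ptrdiff_t>(i) + s;
                if (j >= 0 && static_cast<std::size_t>(j) < ref.size() &&
                    read[i] == ref[static_cast<std::size_t>(j)]) {
                    matched = true;
                }
            }
            if (!matched) ++unexplained;
        }
        return unexplained <= static_cast<std::size_t>(e);
    }
    ```

    For example, `ACGTACGT` and `CGTACGTA` differ by a single shift, yet their plain Hamming distance is 8; the shifted test with e = 1 accepts the pair.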

  • vmaf 📁 🌐 -- VMAF (Video Multi-Method Assessment Fusion) is an Emmy-winning perceptual video quality assessment algorithm developed by Netflix. It also provides a set of tools that allows a user to train and test a custom VMAF model.

  • ZLMediaKit 📁 🌐 -- a high-performance operational-level streaming media service framework based on C++11, supporting multiple protocols (RTSP/RTMP/HLS/HTTP-FLV/WebSocket-FLV/GB28181/HTTP-TS/WebSocket-TS/HTTP-fMP4/WebSocket-fMP4/MP4/WebRTC) and protocol conversion.

    The pg_similarity extension supports a set of similarity algorithms and covers the best-known ones. Be aware that each algorithm is suited to a specific domain. The following algorithms are provided:

    • Cosine Distance;
    • Dice Coefficient;
    • Euclidean Distance;
    • Hamming Distance;
    • Jaccard Coefficient;
    • Jaro Distance;
    • Jaro-Winkler Distance;
    • L1 Distance (also known as City Block or Manhattan Distance);
    • Levenshtein Distance;
    • Matching Coefficient;
    • Monge-Elkan Coefficient;
    • Needleman-Wunsch Coefficient;
    • Overlap Coefficient;
    • Q-Gram Distance;
    • Smith-Waterman Coefficient;
    • Smith-Waterman-Gotoh Coefficient;
    • Soundex Distance.
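
    To make one of these measures concrete, here is a minimal sketch of the Jaccard coefficient over whitespace-separated tokens. It is written from the textbook definition (shared tokens divided by all distinct tokens); the function names and the whitespace tokenization are assumptions of this example and do not describe how the extension itself tokenizes its input.

    ```cpp
    #include <cstddef>
    #include <set>
    #include <sstream>
    #include <string>

    // Split a string into its set of distinct whitespace-separated tokens.
    std::set<std::string> tokens(const std::string& text) {
        std::istringstream in(text);
        std::set<std::string> out;
        for (std::string word; in >> word;) out.insert(word);
        return out;
    }

    // Jaccard coefficient: |A intersect B| / |A union B|, in the range [0, 1].
    double jaccard(const std::string& a, const std::string& b) {
        const std::set<std::string> ta = tokens(a);
        const std::set<std::string> tb = tokens(b);
        if (ta.empty() && tb.empty()) return 1.0;  // convention chosen for this sketch
        std::size_t shared = 0;
        for (const std::string& t : ta) shared += tb.count(t);
        return static_cast<double>(shared) /
               static_cast<double>(ta.size() + tb.size() - shared);
    }
    ```

    Token-set measures like this one suit longer, word-oriented text, whereas the character-level measures in the list (Levenshtein, Jaro-Winkler) are usually the better fit for short strings such as names.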

database "backend storage"

  • arangodb 📁 🌐 -- a scalable open-source multi-model database natively supporting graph, document and search. All supported data models & access patterns can be combined in queries allowing for maximal flexibility.

  • arrow 📁 🌐 -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • BitFunnel 📁 🌐 -- the BitFunnel index used by Bing's super-fresh, news, and media indexes. The algorithm is described in BitFunnel: Revisiting Signatures for Search.

  • csv-parser 📁 🌐 -- Vince's CSV Parser: there are plenty of other CSV parsers in the wild, but I had a hard time finding what I wanted. Inspired by Python's csv module, I wanted a library with simple, intuitive syntax. Furthermore, I wanted support for special use cases such as calculating statistics on very large files. Thus, this library was created with these goals in mind.

  • csvquote 📁 🌐 -- smart and simple CSV processing on the command line. This program can be used at the start and end of a text processing pipeline so that regular unix command line tools can properly handle CSV data that contain commas and newlines inside quoted data fields. Without this program, embedded special characters would be incorrectly interpreted as separators when they are inside quoted data fields.
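
    The sanitize/restore round trip described above can be sketched in a few lines. This is an illustration of the idea only, not csvquote's code: the function names are hypothetical, and the choice of the ASCII unit and record separators (0x1F, 0x1E) as placeholders is an assumption of this example.

    ```cpp
    #include <algorithm>
    #include <string>

    // Placeholder bytes (an assumption of this sketch): non-printing characters
    // that are unlikely to occur in ordinary CSV text.
    constexpr char kFieldPlaceholder = '\x1f';
    constexpr char kRecordPlaceholder = '\x1e';

    // Replace commas and newlines that sit inside double-quoted fields, so that
    // line- and comma-oriented tools no longer see them as separators.
    std::string sanitize_quoted(const std::string& csv) {
        std::string out = csv;
        bool in_quotes = false;
        for (char& c : out) {
            if (c == '"') in_quotes = !in_quotes;  // a doubled "" toggles twice
            else if (in_quotes && c == ',') c = kFieldPlaceholder;
            else if (in_quotes && c == '\n') c = kRecordPlaceholder;
        }
        return out;
    }

    // Undo the substitution at the end of the pipeline.
    std::string restore_quoted(const std::string& text) {
        std::string out = text;
        std::replace(out.begin(), out.end(), kFieldPlaceholder, ',');
        std::replace(out.begin(), out.end(), kRecordPlaceholder, '\n');
        return out;
    }
    ```

    Between the two calls, every remaining comma is a real field separator and every remaining newline a real record separator, so ordinary line-based tools can be applied safely.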

  • datamash 📁 🌐 -- GNU Datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files. It is designed to be portable and reliable, and to help researchers easily automate analysis pipelines, without writing code or even short scripts.

  • duckdb 📁 🌐 -- DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more.

  • Extensible-Storage-Engine 📁 🌐 -- ESE is an embedded / ISAM-based database engine, that provides rudimentary table and indexed access. However the library provides many other strongly layered and thus reusable sub-facilities as well: A Synchronization / Locking library, a Data-structures / STL-like library, an OS-abstraction layer, and a Cache Manager, as well as the full-blown database engine itself.

  • fast-cpp-csv-parser 📁 🌐 -- a small, easy-to-use and fast header-only library for reading comma separated value (CSV) files.

  • groonga 📁 🌐 -- an open-source fulltext search engine and column store.

  • harbour-core 📁 🌐 -- Harbour is the free software implementation of a multi-platform, multi-threading, object-oriented, scriptable programming language, backward compatible with Clipper/xBase. Harbour consists of a compiler and runtime libraries with multiple UI and database backends, its own make system and a large collection of libraries and interfaces to many popular APIs.

  • IdGenerator 📁 🌐 -- a digital ID generator using the snowflake algorithm, developed in response to the performance problems that often occur. Example use is when you, as an architecture designer, want to solve the problem of unique database primary keys, especially in multi-database distributed systems. You want the primary key of the data table to use the least storage space, while the index speed and the Select, Insert, and Update queries are fast. Meanwhile there may be more than 50 application instances, and the concurrent request rate can reach 100,000 per second. You do not want to rely on the auto-increment operation of redis to obtain continuous primary key IDs, because continuous IDs pose business data security risks.
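
    A snowflake-style ID packs a timestamp, a worker number and a per-tick sequence counter into one 64-bit integer. The sketch below uses the widely known 41/10/12-bit layout as an assumption; IdGenerator's own bit layout and API may well differ, and the class name `SnowflakeSketch` is hypothetical. The timestamp is passed in by the caller to keep the example deterministic.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Snowflake-style ID: [timestamp | worker id (10 bits) | sequence (12 bits)].
    // Assumed layout for illustration; not thread-safe.
    class SnowflakeSketch {
    public:
        explicit SnowflakeSketch(std::uint64_t worker_id) : worker_id_(worker_id) {
            assert(worker_id < (std::uint64_t{1} << kWorkerBits));
        }

        // now_ms: milliseconds since an epoch chosen by the caller.
        std::uint64_t next(std::uint64_t now_ms) {
            if (started_ && now_ms <= last_ms_) {
                now_ms = last_ms_;               // same tick, or the clock stepped back
                seq_ = (seq_ + 1) & kSeqMask;
                if (seq_ == 0) ++now_ms;         // sequence exhausted: borrow the next tick
            } else {
                seq_ = 0;
            }
            started_ = true;
            last_ms_ = now_ms;
            return (now_ms << (kWorkerBits + kSeqBits)) | (worker_id_ << kSeqBits) | seq_;
        }

    private:
        static constexpr unsigned kWorkerBits = 10;
        static constexpr unsigned kSeqBits = 12;
        static constexpr std::uint64_t kSeqMask = (std::uint64_t{1} << kSeqBits) - 1;

        std::uint64_t worker_id_;
        std::uint64_t last_ms_ = 0;
        std::uint64_t seq_ = 0;
        bool started_ = false;
    };
    ```

    IDs from one worker are strictly increasing, and IDs from different workers cannot collide because the worker bits differ, so no central counter (such as a redis auto-increment) is required.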

  • iODBC 📁 🌐 -- the iODBC Driver Manager provides you with everything you need to develop ODBC-compliant applications under Unix without having to pay royalties to other parties. An ODBC driver is still needed to effect your connection architecture. You may build a driver with the iODBC components or obtain an ODBC driver from a commercial vendor.

  • lazycsv 📁 🌐 -- a C++17, POSIX-compliant, single-header library for reading and parsing CSV files. It's fast and lightweight and does not allocate any memory in the constructor or while parsing. It parses each row and cell only on demand during iteration, which is why it's called lazy.

  • libcsv2 📁 🌐 -- CSV file format reader/writer library.

  • lib_nas_lockfile 📁 🌐 -- lockfile management on NAS and other disparate network filesystem storage. To be combined with SQLite to create a proper Qiqqa Sync operation.

  • libsiridb 📁 🌐 -- SiriDB Connector C (libsiridb) is a library which can be used to communicate with SiriDB using the C programming language. This library contains useful functions but does not handle the connection itself.

  • libsl3 📁 🌐 -- a C++ interface for SQLite 3.x. libsl3 is designed to enable comfortable and efficient communication with a SQLite database based on its natural language, which is SQL.

  • libsql 📁 🌐 -- libSQL is an open source, open contribution fork of SQLite. We aim to evolve it to suit many more use cases than SQLite was originally designed for, and plan to use third-party OSS code wherever it makes sense.

    SQLite has solidified its place in modern technology stacks, embedded in nearly any computing device you can think of. Its open source nature and public domain availability make it a popular choice for modification to meet specific use cases. But despite having its code available, SQLite famously doesn't accept external contributors, so community improvements cannot be widely enjoyed. There have been other forks in the past, but they all focus on a specific technical difference. We aim to be a community where people can contribute from many different angles and motivations. We want to see a world where everyone can benefit from all of the great ideas and hard work that the SQLite community contributes back to the codebase.

  • libsqlfs 📁 🌐 -- a POSIX style file system on top of an SQLite database. It allows applications to have access to a full read/write file system in a single file, complete with its own file hierarchy and name space. This is useful for applications which need structured storage, such as embedding documents within documents, or management of configuration data or preferences.

  • ligra-graph 📁 🌐 -- LIGRA: a Lightweight Graph Processing Framework for Shared Memory; works on both uncompressed and compressed graphs and hypergraphs.

  • mcmd 📁 🌐 -- MCMD (M-Command): a set of commands developed for high-speed processing of large-scale structured tabular data in CSV format. It can efficiently process large-scale data with hundreds of millions of records on a standard PC.

  • mydumper 📁 🌐 -- a MySQL Logical Backup Tool. It consists of 2 tools:

    • mydumper, which is responsible for exporting a consistent backup of MySQL databases;
    • myloader, which reads the backup from mydumper, connects to the destination database and imports the backup.

  • mysql-connector-cpp 📁 🌐 -- MySQL Connector/C++ is the C++ interface for communicating with MySQL servers.

  • nanodbc 📁 🌐 -- a small C++ wrapper for the native C ODBC API.

  • ormpp 📁 🌐 -- a modern C++17 ORM which supports MySQL, PostgreSQL and SQLite.

  • otl 📁 🌐 -- Oracle Template Library (STL-like wrapper for SQL DB queries; supports many databases besides Oracle)

  • percona-server 📁 🌐 -- Percona Server for MySQL is a free, fully compatible, enhanced, and open source drop-in replacement for any MySQL database. It provides superior performance, scalability, and instrumentation.

  • qlever 📁 🌐 -- a SPARQL engine that can efficiently index and query very large knowledge graphs with up to 100 billion triples on a single standard PC or server. In particular, QLever is fast for queries that involve large intermediate or final results, which are notoriously hard for engines like Blazegraph or Virtuoso.

  • rapidcsv 📁 🌐 -- an easy-to-use C++ CSV parser library. It supports C++11 (and later), is header-only and comes with a basic test suite. The library was featured in the book C++20 for Programmers.

  • siridb-server 📁 🌐 -- SiriDB Server is a highly-scalable, robust and super fast time series database. SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB’s unique query language includes dynamic grouping of time series for easy analysis over large amounts of time series. SiriDB is scalable on the fly and has no downtime while updating or expanding your database. The scalable possibilities enable you to enlarge the database time after time without losing speed. SiriDB is developed to give an unprecedented performance without downtime. A SiriDB cluster distributes time series across multiple pools. Each pool supports active replicas for load balancing and redundancy. When one of the replicas is not available the database is still accessible.

  • sqawk 📁 🌐 -- apply SQL on CSV files in the shell: sqawk imports CSV files into an on-the-fly SQLite database, and runs a user-supplied query on the data.

  • sqlcipher 📁 🌐 -- SQLCipher is a standalone fork of the SQLite database library that adds 256 bit AES encryption of database files and other security features.

  • sqlean 📁 🌐 -- The ultimate set of SQLite extensions: SQLite has few functions compared to other database management systems. SQLite authors see this as a feature rather than a problem, because SQLite has an extension mechanism in place. There are a lot of SQLite extensions out there, but they are incomplete, inconsistent and scattered across the internet. sqlean brings them together, neatly packaged into domain modules, documented, tested, and built for Linux, Windows and macOS.

  • sqleet 📁 🌐 -- an encryption extension for SQLite3. The encryption is transparent (on-the-fly) and based on modern cryptographic algorithms designed for high performance in software and robust side-channel resistance.

  • sqlite 📁 🌐 -- the complete SQLite database engine.

  • sqlite3-compression-encryption-vfs 📁 🌐 -- CEVFS: Compression & Encryption VFS for SQLite 3 is a SQLite 3 Virtual File System for compressing and encrypting data at the pager level. Once set up, you use SQLite as you normally would and the compression and encryption is transparently handled during database read/write operations via the SQLite pager.

  • sqlite3pp 📁 🌐 -- a minimal ORM wrapper for SQLite et al.

  • sqlite-amalgamation 📁 🌐 -- the SQLite amalgamation, which is the recommended method of building SQLite into larger projects.

  • SQLiteCpp 📁 🌐 -- a smart and easy to use C++ SQLite3 wrapper. SQLiteC++ offers an encapsulation around the native C APIs of SQLite, with a few intuitive and well documented C++ classes.

  • sqlite-fts5-snowball 📁 🌐 -- a simple extension for use with FTS5 within SQLite. It allows FTS5 to use Martin Porter's Snowball stemmers (libstemmer), which are available in several languages. Check http://snowballstem.org/ for more information about them.

  • sqlite_fts_tokenizer_chinese_simple 📁 🌐 -- an extension of sqlite3 fts5 that supports Chinese and Pinyin. It fully implements solution 4 from the article on the multi-phonetic word problem of full-text retrieval on the WeChat mobile terminal, giving very simple and efficient support for Chinese and Pinyin searches.

    On this basis we also support more accurate phrase matching through cppjieba. See the introduction article at https://www.wangfenjin.com/posts/simple-jieba-tokenizer/

  • SQLiteHistograms 📁 🌐 -- an SQLite extension library for creating histogram tables, tables of ratio between histograms and interpolation tables of scatter point tables.

  • sqliteodbc 📁 🌐 -- SQLite ODBC Driver for the wonderful SQLite 2.8.* and SQLite 3.* Database Engine/Library.

  • sqlite-parquet-vtable 📁 🌐 -- an SQLite virtual table extension to expose Parquet files as SQL tables. You may also find csv2parquet useful. This blog post provides some context on why you might use this.

  • sqlite-stats 📁 🌐 -- provides common statistical functions for SQLite.

  • sqlite_wrapper 📁 🌐 -- an easy-to-use, lightweight and concurrency-friendly SQLite wrapper written in C++17.

  • sqlite_zstd_vfs 📁 🌐 -- SQLite VFS extension providing streaming storage compression using Zstandard (Zstd), transparently compressing pages of the main database file as they're written out and later decompressing them as they're read in. It runs page de/compression on background threads and occasionally generates dictionaries to improve subsequent compression.

  • sqlpp11 📁 🌐 -- a type safe embedded domain specific language for SQL queries and results in C++.

  • ssp 📁 🌐 -- a header only CSV parser which is fast and versatile with modern C++ API. Requires compiler with C++17 support. Can also be used to efficiently convert strings to specific types. Conversion for floating point values invoked using fast-float.

  • unixODBC 📁 🌐 -- an Open Source ODBC sub-system and an ODBC SDK for Linux, Mac OSX, and UNIX.

  • unqlite 📁 🌐 -- UnQLite is a Transactional Embedded Database Engine, an in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB, Redis, CouchDB etc. as well as a standard Key/Value store similar to BerkeleyDB, LevelDB, etc.

    Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections is contained in a single disk file. The database file format is cross-platform; you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.

    • BSD licensed product.
    • Built with a powerful disk storage engine which supports O(1) lookup.
    • Cross-platform file format.
    • Document store (JSON) database via Jx9.
    • Pluggable run-time interchangeable storage engine.
    • Serverless, NoSQL database engine.
    • Simple, Clean and easy to use API.
    • Single database file, does not use temporary files.
    • Standard Key/Value store.
    • Support cursors for linear records traversal.
    • Support for on-disk as well as in-memory databases.
    • Support Terabyte sized databases.
    • Thread safe and fully reentrant.
    • Transactional (ACID) database.
    • UnQLite is a Self-Contained C library without dependency.
    • Zero configuration.
  • upscaledb 📁 🌐 -- a.k.a. hamsterdb: a thread-safe key/value database engine. It supports a B+Tree index structure, uses memory mapped I/O (if available), fast Cursors and variable length keys and can create In-Memory Databases.

  • zsv 📁 🌐 -- the world's fastest (SIMD) CSV parser, with an extensible CLI for SQL querying, format conversion and more.

LMDB, NoSQL and key/value stores

  • arrow 📁 🌐 -- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The reference Arrow libraries contain many distinct software components:

    • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types

    • Conversions to and from other in-memory data structures

    • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)

    • IO interfaces to local and remote filesystems

    • Readers and writers for various widely-used file formats (such as Parquet, CSV)

    • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files

    • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)

  • comdb2-bdb 📁 🌐 -- a clustered RDBMS built on Optimistic Concurrency Control techniques. It provides multiple isolation levels, including Snapshot and Serializable Isolation.

  • ctsa 📁 🌐 -- a Univariate Time Series Analysis and ARIMA Modeling Package in ANSI C: CTSA is a C software package for univariate time series analysis. ARIMA and Seasonal ARIMA models have been added as of 10/30/2014. 07/24/2020 Update: SARIMAX and Auto ARIMA added. Documentation will be added in the coming days. Software is still in beta stage and older ARIMA and SARIMA functions are now superseded by SARIMAX.

  • ejdb 📁 🌐 -- an embeddable JSON database engine published under MIT license, offering a single file database, online backups support, a simple but powerful query language (JQL), based on the TokyoCabinet-inspired KV store iowow.

  • FASTER 📁 🌐 -- helps manage large application state easily, resiliently, and with high performance by offering (1) FASTER Log, which is a high-performance concurrent persistent recoverable log, iterator, and random reader library, and (2) FASTER KV as a concurrent key-value store + cache that is designed for point lookups and heavy updates. FASTER supports data larger than memory, by leveraging fast external storage (local or cloud). It also supports consistent recovery using a fast non-blocking checkpointing technique that lets applications trade-off performance for commit latency. Both FASTER KV and FASTER Log offer orders-of-magnitude higher performance than comparable solutions, on standard workloads.

  • gdbm 📁 🌐 -- GNU dbm is a set of NoSQL database routines that use extendable hashing and work similarly to the standard UNIX dbm routines.

  • iowow 📁 🌐 -- a C11 file storage utility library and persistent key/value storage engine, supporting multiple key-value databases within a single file, online database backups and Write Ahead Logging (WAL). Good performance compared with its main competitors: lmdb, leveldb, kyoto cabinet.

  • libmdbx 📁 🌐 -- one of the fastest embeddable key-value ACID databases without WAL. libmdbx surpasses the legendary LMDB in terms of reliability, features and performance.

  • libsiridb 📁 🌐 -- SiriDB Connector C (libsiridb) is a library which can be used to communicate with SiriDB using the C programming language. This library contains useful functions but does not handle the connection itself.

  • Lightning.NET 📁 🌐 -- .NET library for OpenLDAP's LMDB key-value store

  • lmdb 📁 🌐 -- OpenLDAP LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness when used correctly.

  • lmdb-safe 📁 🌐 -- A safe modern & performant C++ wrapper of LMDB. LMDB is an outrageously fast key/value store with semantics that make it highly interesting for many applications. Of specific note, besides speed, is the full support for transactions and good read/write concurrency. LMDB is also famed for its robustness... when used correctly. The design of LMDB is elegant and simple, which aids both the performance and stability. The downside of this elegant design is a nontrivial set of rules that need to be followed to not break things. In other words, LMDB delivers great things but only if you use it exactly right. This is by conscious design. The lmdb-safe library aims to deliver the full LMDB performance while programmatically making sure the LMDB semantics are adhered to, with very limited overhead.

  • lmdb.spreads.net 📁 🌐 -- Low-level zero-overhead and the fastest LMDB .NET wrapper with some additional native methods useful for Spreads.

  • lmdb-store 📁 🌐 -- an ultra-fast NodeJS interface to LMDB; probably the fastest and most efficient NodeJS key-value/database interface that exists for full storage and retrieval of structured JS data (objects, arrays, etc.) in a true persisted, scalable, ACID compliant database. It provides a simple interface for interacting with LMDB.

  • lmdbxx 📁 🌐 -- lmdb++: a comprehensive C++11 wrapper for the LMDB embedded database library, offering both an error-checked procedural interface and an object-oriented resource interface with RAII semantics.

  • mmkv 📁 🌐 -- an efficient, small, easy-to-use mobile key-value storage framework used in the WeChat application. It's currently available on Android, iOS/macOS, Win32 and POSIX.

  • PGM-index 📁 🌐 -- the Piecewise Geometric Model index (PGM-index) is a data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes while providing the same worst-case query time guarantees.
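
    The core trick of a learned index (predict a position from the key, then search only a small window around the prediction) can be sketched with a single linear segment. This is a deliberate simplification and not the PGM-index API: the real structure uses many segments built for a chosen error bound, whereas the hypothetical `OneSegmentIndex` below fits one line through the first and last key and measures its own worst-case error.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

    // One linear model over a sorted key array, plus its measured maximum error.
    class OneSegmentIndex {
    public:
        explicit OneSegmentIndex(std::vector<std::uint64_t> sorted_keys)
            : keys_(std::move(sorted_keys)) {
            assert(!keys_.empty() && std::is_sorted(keys_.begin(), keys_.end()));
            const double span = static_cast<double>(keys_.back() - keys_.front());
            slope_ = span > 0.0 ? static_cast<double>(keys_.size() - 1) / span : 0.0;
            for (std::size_t i = 0; i < keys_.size(); ++i) {
                const std::size_t p = predict(keys_[i]);
                eps_ = std::max(eps_, p > i ? p - i : i - p);  // worst prediction error
            }
        }

        // Returns the position of `key`, or kNotFound when it is absent.
        std::size_t find(std::uint64_t key) const {
            if (key < keys_.front() || key > keys_.back()) return kNotFound;
            const std::size_t p = predict(key);
            const std::size_t lo = p > eps_ ? p - eps_ : 0;
            const std::size_t hi = std::min(keys_.size(), p + eps_ + 1);
            const auto first = keys_.begin() + static_cast<std::ptrdiff_t>(lo);
            const auto last = keys_.begin() + static_cast<std::ptrdiff_t>(hi);
            const auto it = std::lower_bound(first, last, key);  // search the window only
            if (it == last || *it != key) return kNotFound;
            return static_cast<std::size_t>(it - keys_.begin());
        }

        std::size_t epsilon() const { return eps_; }

    private:
        std::size_t predict(std::uint64_t key) const {
            const double pos = slope_ * static_cast<double>(key - keys_.front());
            return std::min(keys_.size() - 1, static_cast<std::size_t>(pos));
        }

        std::vector<std::uint64_t> keys_;
        double slope_ = 0.0;
        std::size_t eps_ = 0;
    };
    ```

    The model itself is just two numbers, and the binary search runs over at most 2 * epsilon + 1 entries instead of the whole array; keeping that window small on skewed data is what the piecewise construction is for.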

  • pmemkv 📁 🌐 -- pmemkv is a local/embedded key-value datastore optimized for persistent memory. Rather than being tied to a single language or backing implementation, pmemkv provides different options for language bindings and storage engines.

  • pmemkv-bench 📁 🌐 -- benchmark for libpmemkv and its underlying libraries, based on leveldb's db_bench. The pmemkv_bench utility provides some standard read, write & remove benchmarks. It's based on the db_bench utility included with LevelDB and RocksDB, although the list of supported parameters is slightly different.

  • qlever 📁 🌐 -- a SPARQL engine that can efficiently index and query very large knowledge graphs with up to 100 billion triples on a single standard PC or server. In particular, QLever is fast for queries that involve large intermediate or final results, which are notoriously hard for engines like Blazegraph or Virtuoso.

  • sdsl-lite 📁 🌐 -- The Succinct Data Structure Library (SDSL) is a powerful and flexible C++11 library implementing succinct data structures. In total, the library contains the highlights of 40 research publications. Succinct data structures can represent an object (such as a bitvector or a tree) in space close to the information-theoretic lower bound of the object while supporting operations of the original object efficiently. The theoretical time complexity of an operation performed on the classical data structure and the equivalent succinct data structure are (most of the time) identical.
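
    A typical operation on such a structure is rank: counting the 1-bits before a given position in constant time. The sketch below shows the principle with one precomputed counter per 64-bit word. It does not use the SDSL API, the class name `RankBitVector` is hypothetical, and storing a full counter per word is far from succinct; real implementations use multi-level counters to keep the overhead to a small fraction of the bitvector.

    ```cpp
    #include <bitset>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Bitvector with constant-time rank support (illustrative, not space-optimal).
    class RankBitVector {
    public:
        explicit RankBitVector(const std::vector<bool>& bits) : size_(bits.size()) {
            words_.assign((size_ + 63) / 64, 0);
            for (std::size_t i = 0; i < size_; ++i)
                if (bits[i]) words_[i / 64] |= std::uint64_t{1} << (i % 64);
            // prefix_[w] = number of 1-bits in all words before word w.
            prefix_.assign(words_.size() + 1, 0);
            for (std::size_t w = 0; w < words_.size(); ++w)
                prefix_[w + 1] = prefix_[w] + popcount(words_[w]);
        }

        // Number of 1-bits in positions [0, i).
        std::size_t rank1(std::size_t i) const {
            assert(i <= size_);
            const std::size_t w = i / 64;
            const std::size_t r = i % 64;
            std::size_t count = prefix_[w];
            if (r != 0) count += popcount(words_[w] & ((std::uint64_t{1} << r) - 1));
            return count;
        }

    private:
        static std::size_t popcount(std::uint64_t x) { return std::bitset<64>(x).count(); }

        std::size_t size_;
        std::vector<std::uint64_t> words_;
        std::vector<std::size_t> prefix_;
    };
    ```

    A query is one table lookup plus one masked popcount, independent of the length of the bitvector; this is the building block on which succinct trees and compressed text indexes are layered.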

  • siridb-server 📁 🌐 -- SiriDB Server is a highly-scalable, robust and super fast time series database. SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB’s unique query language includes dynamic grouping of time series for easy analysis over large amounts of time series. SiriDB is scalable on the fly and has no downtime while updating or expanding your database. The scalable possibilities enable you to enlarge the database time after time without losing speed. SiriDB is developed to give an unprecedented performance without downtime. A SiriDB cluster distributes time series across multiple pools. Each pool supports active replicas for load balancing and redundancy. When one of the replicas is not available the database is still accessible.

  • unqlite 📁 🌐 -- UnQLite is a Transactional Embedded Database Engine, an in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine. UnQLite is a document store database similar to MongoDB, Redis, CouchDB etc. as well as a standard Key/Value store similar to BerkeleyDB, LevelDB, etc.

    Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections is contained in a single disk file. The database file format is cross-platform; you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.

    • BSD licensed product.
    • Built with a powerful disk storage engine which supports O(1) lookup.
    • Cross-platform file format.
    • Document store (JSON) database via Jx9.
    • Pluggable run-time interchangeable storage engine.
    • Serverless, NoSQL database engine.
    • Simple, Clean and easy to use API.
    • Single database file, does not use temporary files.
    • Standard Key/Value store.
    • Support cursors for linear records traversal.
    • Support for on-disk as well as in-memory databases.
    • Support Terabyte sized databases.
    • Thread safe and fully reentrant.
    • Transactional (ACID) database.
    • UnQLite is a Self-Contained C library without dependency.
    • Zero configuration.

SQLite specific modules & related materials

  • duckdb 📁 🌐 -- DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more.

  • libdist 📁 🌐 -- string distance related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring & subsequence) implemented as SQLite run-time loadable extension, with UTF-8 support.

  • lib_nas_lockfile 📁 🌐 -- lockfile management on NAS and other disparate network filesystem storage. To be combined with SQLite to create a proper Qiqqa Sync operation.

  • libsl3 📁 🌐 -- a C++ interface for SQLite 3.x. libsl3 is designed to enable comfortable and efficient communication with a SQLite database based on its natural language, which is SQL.

  • libsql 📁 🌐 -- libSQL is an open source, open contribution fork of SQLite. We aim to evolve it to suit many more use cases than SQLite was originally designed for, and plan to use third-party OSS code wherever it makes sense.

    SQLite has solidified its place in modern technology stacks, embedded in nearly any computing device you can think of. Its open source nature and public domain availability make it a popular choice for modification to meet specific use cases. But despite having its code available, SQLite famously doesn't accept external contributors, so community improvements cannot be widely enjoyed. There have been other forks in the past, but they all focus on a specific technical difference. We aim to be a community where people can contribute from many different angles and motivations. We want to see a world where everyone can benefit from all of the great ideas and hard work that the SQLite community contributes back to the codebase.

  • libsqlfs 📁 🌐 -- a POSIX style file system on top of an SQLite database. It allows applications to have access to a full read/write file system in a single file, complete with its own file hierarchy and name space. This is useful for applications which need structured storage, such as embedding documents within documents, or management of configuration data or preferences.

  • sqlcipher 📁 🌐 -- SQLCipher is a standalone fork of the SQLite database library that adds 256 bit AES encryption of database files and other security features.

  • sqlean 📁 🌐 -- The ultimate set of SQLite extensions: SQLite has few functions compared to other database management systems. SQLite authors see this as a feature rather than a problem, because SQLite has an extension mechanism in place. There are a lot of SQLite extensions out there, but they are incomplete, inconsistent and scattered across the internet. sqlean brings them together, neatly packaged into domain modules, documented, tested, and built for Linux, Windows and macOS.

  • sqleet 📁 🌐 -- an encryption extension for SQLite3. The encryption is transparent (on-the-fly) and based on modern cryptographic algorithms designed for high performance in software and robust side-channel resistance.

  • sqlite 📁 🌐 -- the complete SQLite database engine.

  • sqlite3-compression-encryption-vfs 📁 🌐 -- CEVFS: Compression & Encryption VFS for SQLite 3 is a SQLite 3 Virtual File System for compressing and encrypting data at the pager level. Once set up, you use SQLite as you normally would and the compression and encryption is transparently handled during database read/write operations via the SQLite pager.

  • sqlite3pp 📁 🌐 -- a minimal ORM wrapper for SQLite et al.

  • sqlite-amalgamation 📁 🌐 -- the SQLite amalgamation, which is the recommended method of building SQLite into larger projects.

  • SQLiteCpp 📁 🌐 -- a smart and easy to use C++ SQLite3 wrapper. SQLiteC++ offers an encapsulation around the native C APIs of SQLite, with a few intuitive and well documented C++ classes.

  • sqlite-fts5-snowball 📁 🌐 -- a simple extension for use with FTS5 within SQLite. It allows FTS5 to use Martin Porter's Snowball stemmers (libstemmer), which are available in several languages. Check http://snowballstem.org/ for more information about them.

  • sqlite_fts_tokenizer_chinese_simple 📁 🌐 -- an extension of sqlite3 fts5 that supports Chinese and Pinyin. It fully implements solution 4 from the WeChat article on the multi-phonetic (polyphonic) word problem in full-text retrieval on mobile clients, giving simple and efficient support for Chinese and Pinyin searches.

    On this basis we also support more accurate phrase matching through cppjieba. See the introduction article at https://www.wangfenjin.com/posts/simple-jieba-tokenizer/

  • SQLiteHistograms 📁 🌐 -- an SQLite extension library for creating histogram tables, tables of ratio between histograms and interpolation tables of scatter point tables.

  • sqliteodbc 📁 🌐 -- SQLite ODBC Driver for the wonderful SQLite 2.8.* and SQLite 3.* Database Engine/Library.

  • sqlite-parquet-vtable 📁 🌐 -- an SQLite virtual table extension to expose Parquet files as SQL tables. You may also find csv2parquet useful. This blog post provides some context on why you might use this.

  • sqlite-stats 📁 🌐 -- provides common statistical functions for SQLite.

  • sqlite_wrapper 📁 🌐 -- an easy-to-use, lightweight and concurrency-friendly SQLite wrapper written in C++17.

  • sqlite_zstd_vfs 📁 🌐 -- SQLite VFS extension providing streaming storage compression using Zstandard (Zstd), transparently compressing pages of the main database file as they're written out and later decompressing them as they're read in. It runs page de/compression on background threads and occasionally generates dictionaries to improve subsequent compression.

metadata & text (OCR et al) -- language detect, suggesting fixes, ...

  • chewing_text_cud 📁 🌐 -- a text processing / filtering library for use in NLP/search/content analysis research pipelines.

  • cld1-language-detect 📁 🌐 -- the CLD (Compact Language Detection) library, extracted from the source code for Google's Chromium library. CLD1 probabilistically detects languages in Unicode UTF-8 text.

  • cld2-language-detect 📁 🌐 -- CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes. Optionally, it also returns a vector of text spans with the language of each identified. The design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text.

  • cld3-language-detect 📁 🌐 -- CLD3 is a neural network model for language identification. The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. The model outputs BCP-47-style language codes. For some languages, output is differentiated by script. Language and script names are taken from Unicode CLDR.

  • compact_enc_det 📁 🌐 -- Compact Encoding Detection (CED for short) is a library written in C++ that scans given raw bytes and detects the most likely text encoding.

  • cppjieba 📁 🌐 -- the C++ version of the Chinese "Jieba" project:

    • Supports loading a custom user dictionary, using the '|' separator when multipathing or the ';' separator for separate, multiple, dictionaries.
    • Supports 'utf8' encoding.
    • The project comes with a relatively complete unit test, and the stability of the core function Chinese word segmentation (utf8) has been tested by the online environment.
  • cpp-unicodelib 📁 🌐 -- a C++17 single-file header-only Unicode library.

  • detect-character-encoding 📁 🌐 -- detect character encoding using ICU. Tip: If you don’t need ICU in particular, consider using ced, which is based on Google’s lighter compact_enc_det library.

  • enca 📁 🌐 -- Enca (Extremely Naive Charset Analyser) consists of two main components: libenca, an encoding detection library, and enca, a command line frontend, integrating libenca and several charset conversion libraries and tools (GNU recode, UNIX98 iconv, perl Unicode::Map, cstocs).

  • fastBPE 📁 🌐 -- text tokenization / ngrams

  • fastText 📁 🌐 -- fastText is a library for efficient learning of word representations and sentence classification. Includes language detection features.

  • glyph_name 📁 🌐 -- a library for computing Unicode sequences from glyph names according to the Adobe Glyph Naming convention: https://github.com/adobe-type-tools/agl-specification

  • libchardet 📁 🌐 -- is based on the Mozilla Universal Charset Detector library and detects the character set used to encode data.

  • libchopshop 📁 🌐 -- NLP/text processing with automated stop word detection and stemmer-based filtering. This library / toolkit is engineered to be able to provide both of the (often more or less disparate) n-gram token streams / vectors required for (1) initializing / training FTS databases, neural nets, etc. and (2) executing effective queries / matches on these engines.

  • libcppjieba 📁 🌐 -- source code extracted from the CppJieba project to form a separate project, making it easier to understand and use.

  • libiconv 📁 🌐 -- provides conversion between many platform, language or country dependent character encodings to & from Unicode. This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode. It provides support for the encodings: European languages (ASCII, ISO-8859-{1,2,3,4,5,7,9,10,13,14,15,16}, KOI8-R, KOI8-U, KOI8-RU, CP{1250,1251,1252,1253,1254,1257}, CP{850,866,1131}, Mac{Roman,CentralEurope,Iceland,Croatian,Romania}, Mac{Cyrillic,Ukraine,Greek,Turkish}, Macintosh), Semitic languages (ISO-8859-{6,8}, CP{1255,1256}, CP862, Mac{Hebrew,Arabic}), Japanese (EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-1, ISO-2022-JP-MS), Chinese (EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, ISO-2022-CN, ISO-2022-CN-EXT), Korean (EUC-KR, CP949, ISO-2022-KR, JOHAB), Armenian (ARMSCII-8), Georgian (Georgian-Academy, Georgian-PS), Tajik (KOI8-T), Kazakh (PT154, RK1048), Thai (ISO-8859-11, TIS-620, CP874, MacThai), Laotian (MuleLao-1, CP1133), Vietnamese (VISCII, TCVN, CP1258), Platform specifics (HP-ROMAN8, NEXTSTEP), Full Unicode (UTF-8, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, C99, JAVA, UCS-2-INTERNAL, UCS-4-INTERNAL). It also provides support for a few extra encodings: European languages (CP{437,737,775,852,853,855,857,858,860,861,863,865,869,1125}), Semitic languages (CP864), Japanese (EUC-JISX0213, Shift_JISX0213, ISO-2022-JP-3), Chinese (BIG5-2003), Turkmen (TDS565), Platform specifics (ATARIST, RISCOS-LATIN1). It has also some limited support for transliteration, i.e. when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

  • libnatspec 📁 🌐 -- The Nation Specifity Library is designed to smooth out national peculiarities when using software. Its primary objectives are: (1) Addressing encoding issues in most popular scenarios, (2) Providing various auxiliary tools that facilitate software localization.

  • libpinyin 📁 🌐 -- the libpinyin project aims to provide the algorithms core for intelligent sentence-based Chinese pinyin input methods.

  • libpostal 📁 🌐 -- a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere.

  • libtextcat 📁 🌐 -- text language detection

  • libunibreak 📁 🌐 -- an implementation of the line breaking and word breaking algorithms as described in Unicode Standard Annex 14 (http://www.unicode.org/reports/tr14/) and Unicode Standard Annex 29 (http://www.unicode.org/reports/tr29/).

  • line_detector 📁 🌐 -- line segment detector (LSD), edge drawing line detector (EDL) and Hough line detector (standard & probabilistic) for line detection.

  • marian 📁 🌐 -- an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies.

  • open-location-code 📁 🌐 -- Open Location Code is a technology that gives a way of encoding location into a form that is easier to use than latitude and longitude. The codes generated are called plus codes, as their distinguishing attribute is that they include a "+" character. The technology is designed to produce codes that can be used as a replacement for street addresses, especially in places where buildings aren't numbered or streets aren't named. Plus codes represent an area, not a point. As digits are added to a code, the area shrinks, so a long code is more precise than a short code. Codes that are similar are located closer together than codes that are different.

  • pinyin 📁 🌐 -- pīnyīn is a tool for converting Chinese characters to pinyin. It can be used for Chinese phonetic notation, sorting, and retrieval.

  • sentencepiece 📁 🌐 -- text tokenization

  • sentence-tokenizer 📁 🌐 -- text tokenization

  • simdutf 📁 🌐 -- delivers Unicode validation and transcoding at billions of characters per second, providing fast Unicode functions such as ASCII, UTF-8, UTF-16LE/BE and UTF-32 validation, with and without error identification, Latin1 to UTF-8 transcoding and vice versa, etc. The functions are accelerated using SIMD instructions (e.g., ARM NEON, SSE, AVX, AVX-512, RISC-V Vector Extension, etc.). When your strings contain hundreds of characters, we can often transcode them at speeds exceeding a billion characters per second. You should expect high speeds not only with English strings (ASCII) but also Chinese, Japanese, Arabic, and so forth. We handle the full character range (including, for example, emojis).

  • uchardet 📁 🌐 -- uchardet is an encoding and language detector library, which attempts to determine the encoding of the text. It can reliably detect many charsets. Moreover it also works as a very good and fast language detector.

  • ucto 📁 🌐 -- text tokenization

    • libfolia 📁 🌐 -- working with the Format for Linguistic Annotation (FoLiA). Provides a high-level API to read, manipulate, and create FoLiA documents.
    • uctodata 📁 🌐 -- data for ucto library
  • uni-algo 📁 🌐 -- this library handles all Unicode conversion/processing problems (ill-formed sequences are not the only ones) properly and always according to The Unicode Standard: in C/C++ there is no safe type for UTF-8/UTF-16 that guarantees that the data will be well-formed; this makes the problem even worse. There are plenty of Unicode libraries for C/C++ out there that implement Unicode algorithms of varying quality, but many of them don't handle ill-formed UTF sequences at all. In the best-case scenario, you'll get an exception/error; in the worst case, undefined behavior. The biggest problem is that in 99% of cases everything will be fine. This is inappropriate for security reasons. This library doesn't work with types/strings/files/streams; it works with the data inside them and makes it safe when it's needed. Check this article if you want more information about ill-formed sequences: https://hsivonen.fi/broken-utf-8 . It is a bit outdated because ICU (International Components for Unicode) now uses a W3C conformant implementation too, but the information in the article is very useful. This library also uses a W3C conformant implementation.

  • unicode-cldr 📁 🌐 -- Unicode CLDR Project: provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks.

  • unicode-cldr-data 📁 🌐 -- the JSON distribution of CLDR locale data for internationalization. While XML (not JSON) is the "official" format for all CLDR data, this data is programmatically generated from the corresponding XML, using the CLDR tooling. This JSON data is generated using only data that has achieved draft="contributed" or draft="approved" status in the CLDR. This is the same threshold as is used by the ICU (International Components for Unicode).

  • unicode-icu 📁 🌐 -- the International Components for Unicode.

  • unicode-icu-data 📁 🌐 -- International Components for Unicode: Data Repository. This is an auxiliary repository for the International Components for Unicode.

  • unicode-icu-demos 📁 🌐 -- ICU Demos contains sample applications built using the International Components for Unicode (ICU) C++ library ICU4C.

  • unilib 📁 🌐 -- an embeddable C++17 Unicode library.

  • utfcpp 📁 🌐 -- UTF-8 with C++ in a Portable Way

  • win-iconv 📁 🌐 -- an iconv implementation using Win32 API to convert.

  • worde_butcher 📁 🌐 -- a tool for text segmentation, keyword extraction and speech tagging. Butchers any text into prime word / phrase cuts, deboning all incoming based on our definitive set of stopwords for all languages.

  • xmunch 📁 🌐 -- xmunch essentially does what the 'munch' command of, for example, hunspell does, but it is not compatible with hunspell affix definition files. So why use it then? What makes xmunch different from the other tools is the ability to extend incomplete word-lists. For hunspell's munch to identify a stem and add an affix mark, every word formed by the affix with the stem has to be in the original word-list. This makes sense for a compression tool. However, if your word list is incomplete, you have to add all possible word-forms of a word first, before any compression is done. Using xmunch instead, you can define a subset of forms which are required to be in the word-list to allow the word to be used as a stem. This way, you can extend the word-list.

  • you-token-to-me 📁 🌐 -- text tokenization

  • ztd.text 📁 🌐 -- an implementation of an up and coming proposal percolating through SG16, P1629 - Standard Text Encoding. It will also include implementations of some downstream ideas covered in Previous Work in this area, including Zach Laine's Boost.Text (proposed), rmf's libogonek, and Tom Honermann's text_view.
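
The cld3-language-detect entry above mentions extracting character n-grams and computing the fraction of times each one appears. As a rough illustration of that feature-extraction step only, here is a minimal C++17 sketch. The function name `ngram_fractions` is hypothetical and is not part of CLD3 or any library listed here, and the sketch slices raw bytes rather than Unicode characters, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper, NOT the CLD3 API: for every n-gram found in `text`,
// return the fraction of all n-grams in `text` that it accounts for.
// Assumption: n-grams are taken over bytes, not over Unicode code points.
std::map<std::string, double> ngram_fractions(const std::string& text, std::size_t n) {
    assert(n > 0);  // a zero-length n-gram is meaningless
    std::map<std::string, double> fractions;
    if (text.size() < n) {
        return fractions;  // input too short to contain any n-gram
    }
    const std::size_t total = text.size() - n + 1;  // number of sliding windows
    for (std::size_t i = 0; i < total; ++i) {
        fractions[text.substr(i, n)] += 1.0;  // accumulate raw counts first
    }
    for (auto& entry : fractions) {
        entry.second /= static_cast<double>(total);  // turn counts into fractions
    }
    return fractions;
}
```

For the input "banana" with n = 2 there are five windows (ba, an, na, an, na), so "an" and "na" each get 0.4 and "ba" gets 0.2. A real detector would feed such fractions into its model; that part is out of scope here.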

PDF (XML) metadata editing

for round-trip annotation and other "external application editing" of known documents; metadata embedding / export

  • PDFGen 📁 🌐 -- a simple PDF Creation/Generation library, contained in a single C-file with header and no external library dependencies. Useful for embedding into other programs that require rudimentary PDF output.
  • pdfgrep 📁 🌐 -- a tool to search text in PDF files. It works similarly to grep.
  • pdfium 📁 🌐 -- the PDF library used by the Chromium project.
  • podofo 📁 🌐 -- a library to work with the PDF file format, which also includes a few tools. The name comes from the first two letters of each word in PDF (Portable Document Format). The PoDoFo library is a free portable C++ library which includes classes to parse a PDF file and modify its contents in memory. The changes can be written back to disk easily. PoDoFo is designed to avoid loading large PDF objects into memory until they are required and can write large streams immediately to disk, so it is possible to manipulate quite large files with it.
  • poppler 📁 🌐 -- Poppler is a library for rendering PDF files, and examining or modifying their structure. Poppler originally came from the XPDF sources.
  • qpdf 📁 🌐 -- QPDF is a command-line tool and C++ library that performs content-preserving transformations on PDF files. It supports linearization, encryption, and numerous other features. It can also be used for splitting and merging files, creating PDF files, and inspecting files for study or analysis. QPDF does not render PDFs or perform text extraction, and it does not contain higher-level interfaces for working with page contents. It is a low-level tool for working with the structure of PDF files and can be a valuable tool for anyone who wants to do programmatic or command-line-based manipulation of PDF files.
  • sioyek 📁 🌐 -- a PDF viewer with a focus on textbooks and research papers.
  • sumatrapdf 📁 🌐 -- SumatraPDF is a multi-format (PDF, EPUB, MOBI, CBZ, CBR, FB2, CHM, XPS, DjVu) reader for Windows.
  • XMP-Toolkit-SDK 📁 🌐 -- the XMP Toolkit allows you to integrate XMP functionality into your product, supplying an API for locating, adding, or updating the XMP metadata in a file.
  • xpdf 📁 🌐 -- Xpdf is an open source viewer for Portable Document Format (PDF) files.

web scraping (document extraction, cleaning, metadata extraction, BibTeX, ...)

(see also investigation notes in Qiqqa docs)

  • boost-url 📁 🌐 -- a library for manipulating (RFC3986) Uniform Resource Identifiers (URIs) and Locators (URLs).

  • cURL 📁 🌐 -- the ubiquitous libcurl.

  • curl-impersonate 📁 🌐 -- a special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.

  • curlpp 📁 🌐 -- cURLpp is a C++ wrapper for libcURL.

  • curl-www 📁 🌐 -- the curl.se web site contents.

  • easyexif 📁 🌐 -- EasyEXIF is a tiny, lightweight C++ library that parses basic (EXIF) information out of JPEG files. It uses only the std::string library and is otherwise pure C++. You pass it the binary contents of a JPEG file, and it parses several of the most important EXIF fields for you.

  • everything-curl 📁 🌐 -- Everything curl is an extensive guide for all things curl. The project, the command-line tool, the library, how everything started and how it came to be the useful tool it is today. It explains how we work on developing it further, what it takes to use it, how you can contribute with code or bug reports and why millions of existing users use it.

  • exif 📁 🌐 -- a small command-line utility to show EXIF information hidden in JPEG files, demonstrating the power of libexif.

  • exiv2 📁 🌐 -- a C++ library and a command-line utility to read, write, delete and modify Exif, IPTC, XMP and ICC image metadata.

  • extract 📁 🌐 -- clone of git://git.ghostscript.com/extract.git

  • faup 📁 🌐 -- Faup stands for Finally An Url Parser and is a library and command line tool to parse URLs and normalize fields with two constraints: (1) work with real-life urls (resilient to badly formatted ones), and (2) be fast: no allocation for string parsing and read characters only once.

  • GQ-gumbo-css-selectors 📁 🌐 -- GQ is a CSS Selector Engine for Gumbo Parser written in C++11. Using Gumbo Parser as a backend, GQ can parse input HTML and allow users to select and modify elements in the parsed document with CSS Selectors and the provided simple, but powerful mutation API.

  • gumbo-libxml 📁 🌐 -- LibXML2 bindings for the Gumbo HTML5 parser: this provides a libxml2 API on top of the Gumbo parser. It lets you use a modern parser - Gumbo now passes all html5lib tests, including the template tag, and should be fully conformant with the HTML5 spec - with the full ecosystem of libxml tools, including XPath, tree modification, DTD validation, etc.

  • gumbo-parser 📁 🌐 -- HTML parser

  • gumbo_pp 📁 🌐 -- a C++ wrapper over Gumbo that provides a higher level query mechanism.

  • gumbo-query 📁 🌐 -- HTML DOM access in C/C++

  • hescape 📁 🌐 -- a C library for fast HTML escape using SSE instruction, pcmpestri. Hescape provides only one API, hesc_escape_html().

  • houdini 📁 🌐 -- Houdini - The Escapist: is zero-dependency and modular. Houdini is a simple API for escaping text for the web. And unescaping it. HTML escaping follows the OWASP suggestion. All other entities are left as-is. HTML unescaping is fully RFC-compliant. Yes, that's the 253 different entities for you, and decimal/hex code point specifiers. URI escaping and unescaping is fully RFC-compliant. URL escaping and unescaping is the same as generic URIs, but spaces are changed to +.

  • html5-parser 📁 🌐 -- a fast, standards compliant, C based, HTML 5 parser for python. Over thirty times as fast as pure python based parsers, such as html5lib.

  • htmlstreamparser 📁 🌐 -- used in a demo of zsync2

  • http-parser 📁 🌐 -- a parser for HTTP messages written in C. It parses both requests and responses. The parser is designed to be used in performance HTTP applications. It does not make any syscalls nor allocations, it does not buffer data, it can be interrupted at any time. Depending on your architecture, it only requires about 40 bytes of data per message stream (in a web server that is per connection).

  • lexbor 📁 🌐 -- fast HTML5 fully-conformant HTML + CSS parser.

  • libcpr 📁 🌐 -- wrapper library for cURL. C++ Requests is a simple wrapper around libcurl inspired by the excellent Python Requests project. Despite its name, libcurl's easy interface is anything but, and making mistakes misusing it is a common source of error and frustration. Using the more expressive language facilities of C++11, this library captures the essence of making network calls into a few concise idioms.

  • libexif 📁 🌐 -- a library for parsing, editing, and saving EXIF data. In addition, it has gettext support. All EXIF tags described in EXIF standard 2.1 (and most from 2.2) are supported. Many maker notes from Canon, Casio, Epson, Fuji, Nikon, Olympus, Pentax and Sanyo cameras are also supported.

  • libexpat 📁 🌐 -- XML read/write

  • libhog 📁 🌐 -- hog a.k.a. hound - fetch the (PDF,EPUB,HTML) document you seek using maximum effort: hog is a tool for fetching files from the internet, specifically PDFs. Intended to be used where you browse the 'Net and decide you want to download a given PDF from any site: this can be done through the browser itself, but is sometimes convoluted or nigh impossible (ftp links require another tool, PDFs stored at servers which report as having their SSL certificates expired are a hassle to get through for the user-in-a-hurry, etc. etc.) and hog is meant to cope with all these.

  • libidn2 📁 🌐 -- international domain name parsing

  • libpsl 📁 🌐 -- handles the Public Suffix List (a collection of Top Level Domain (TLD) suffixes, e.g. .com, .net, Country Top Level Domains (ccTLDs) like .de and .cn and Brand Top Level Domains like .apple and .google). Can be used to:

    • avoid privacy-leaking "super domain" certificates (see post from Jeffry Walton)
    • avoid privacy-leaking "supercookies"
    • highlight parts of the domain in a user interface
    • sort domain lists by site
  • libxml2 📁 🌐 -- libxml: XML read/write

  • LLhttp-parser 📁 🌐 -- a port and replacement of http_parser to TypeScript. llparse is used to generate the output C source file, which could be compiled and linked with the embedder's program (like Node.js).

  • picohttpparser 📁 🌐 -- PicoHTTPParser is a tiny, primitive, fast HTTP request/response parser. Unlike most parsers, it is stateless and does not allocate memory by itself. All it does is accept a pointer to the buffer and the output structure, and set up the pointers in the latter to point at the necessary portions of the buffer.

  • qs_parse 📁 🌐 -- a set of simple and easy functions for parsing URL query strings, such as those generated in an HTTP GET form submission.

  • robotstxt 📁 🌐 -- Google robots.txt Parser and Matcher Library. The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate. Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

  • sist2 📁 🌐 -- sist2 (Simple incremental search tool) is a fast, low memory usage, multi-threaded application, which scans drives and directory trees, extracts text and metadata from common file types, generates thumbnails and comes with OCR support (with tesseract) and Named-Entity Recognition (using pre-trained client-side tensorflow models).

  • tidy-html5 📁 🌐 -- clean up HTML documents before archiving/processing

  • URI-Encode-C 📁 🌐 -- an optimized C library for percent encoding/decoding text, i.e. a URI encoder/decoder written in C based on RFC3986.

  • url 📁 🌐 -- URI parsing and other utility functions

  • URL-Detector 📁 🌐 -- Url Detector is a library created by the Linkedin Security Team to detect and extract urls in a long piece of text. Keep in mind that for security purposes, it's better to overdetect urls: instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), we try to detect based on browser behavior, optimizing detection for urls that are visitable through the address bar of Chrome, Firefox, Internet Explorer, and Safari. It is also able to identify the parts of the identified urls.

  • url-parser 📁 🌐 -- parse URLs much like Node's url module.

  • wget2 📁 🌐 -- GNU Wget2 is the successor of GNU Wget, a file and recursive website downloader. Designed and written from scratch, it wraps around libwget, which provides the basic functions needed by a web client. Wget2 works multi-threaded and uses many features to allow fast operation. In many cases Wget2 downloads much faster than Wget1.x due to HTTP2, HTTP compression, parallel connections and use of the If-Modified-Since HTTP header.

  • xml-pugixml 📁 🌐 -- light-weight, simple and fast XML parser for C++ with XPath support.
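
Several entries above (houdini, URI-Encode-C) deal with percent encoding per RFC 3986, and the houdini entry notes that URL escaping is the same as URI escaping except that spaces become "+". The following C++ sketch illustrates that difference under a simplifying assumption: every byte outside the RFC 3986 "unreserved" set is escaped. The names `percent_encode`, `is_unreserved` and `hex_digit` are hypothetical and are not the API of houdini, URI-Encode-C or any other library in this list.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers for illustration only; not taken from any listed library.

// Map a 4-bit value to its uppercase hexadecimal digit.
static char hex_digit(unsigned nibble) {
    assert(nibble < 16);
    return "0123456789ABCDEF"[nibble];
}

// RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~" are never escaped.
static bool is_unreserved(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
}

// Escape every byte that is not unreserved as %XX.
// With space_as_plus = true, a space becomes '+' instead of %20
// (the form-style "URL escaping" variant described in the houdini entry).
std::string percent_encode(const std::string& in, bool space_as_plus = false) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (is_unreserved(c)) {
            out += static_cast<char>(c);
        } else if (c == ' ' && space_as_plus) {
            out += '+';
        } else {
            out += '%';
            out += hex_digit(c >> 4);    // high nibble
            out += hex_digit(c & 0x0F);  // low nibble
        }
    }
    return out;
}
```

With this sketch, `percent_encode("a b&c")` yields `a%20b%26c`, while `percent_encode("a b&c", true)` yields `a+b%26c`. Real libraries typically offer finer control over which reserved characters are kept as-is, so treat this as a picture of the idea rather than a drop-in replacement.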

audio files & processing

Not just speech processing & speech recognition, but sometimes data is easier "visualized" as audio (sound).

  • AudioFile 📁 🌐 -- a simple header-only C++ library for reading and writing audio files. (WAV, AIFF)
  • dr_libs 📁 🌐 -- single file audio decoding libraries for C and C++ (FLAC, MP3, WAV)
  • flac 📁 🌐 -- a software that can reduce the amount of storage space needed to store digital audio signals without needing to remove information in doing so. The files read and produced by this software are called FLAC files. As these files (which follow the FLAC format) can be read from and written to by other software as well, this software is often referred to as the FLAC reference implementation.
  • libsndfile 📁 🌐 -- a C library for reading and writing files containing sampled audio data, e.g. Ogg, Vorbis and FLAC.
  • minimp3 📁 🌐 -- a minimalistic, single-header library for decoding MP3. minimp3 is designed to be small, fast (with SSE and NEON support), and accurate (ISO conformant).
  • opus 📁 🌐 -- an audio codec for interactive speech and audio transmission over the Internet. Opus can handle a wide range of interactive audio applications, including Voice over IP, videoconferencing, in-game chat, and even remote live music performances. It can scale from low bit-rate narrowband speech to very high quality stereo music.
  • qoa 📁 🌐 -- QOA - The “Quite OK Audio Format” for fast, lossy audio compression - is a single-file library for C/C++. More info at: https://qoaformat.org
  • r8brain-free-src 📁 🌐 -- high-quality professional audio sample rate converter (SRC) / resampler C++ library. Features routines for SRC, both up- and downsampling, to/from any sample rate, including non-integer sample rates: it can be also used for conversion to/from SACD/DSD sample rates, and even go beyond that. Also suitable for fast general-purpose 1D time-series resampling / interpolation (with relaxed filter parameters).
  • sac 📁 🌐 -- a state-of-the-art lossless audio compression model. Lossless audio compression is a complex problem, because PCM data is highly non-stationary and uses high sample resolution (typically >=16bit). That's why classic context modelling suffers from context dilution problems. Sac employs a simple OLS-NLMS predictor per frame including bias correction. Prediction residuals are encoded using a sophisticated bitplane coder including SSE and various forms of probability estimations. Meta-parameters of the predictor are optimized via binary search (or DDS) on a by-frame basis. This results in a highly asymmetric codec design. We throw a lot of muscle at the problem and achieve only little gains - by practically predicting noise.
  • silk-codec 📁 🌐 -- a library to convert PCM to TenCent Silk files and vice versa.
  • silk-v3-decoder 📁 🌐 -- decodes Silk v3 audio files (like WeChat amr, aud files, qq slk files) and converts to other formats (like mp3).
  • Solo 📁 🌐 -- Agora SOLO is a speech codec, developed based on Silk with BWE(Bandwidth Extension) and MDC(Multi Description Coding). With these technologies, SOLO is able to resist weak networks at low bitrates. The main reason for SOLO to use bandwidth expansion is to reduce the computational complexity.
  • speex 📁 🌐 -- a patent-free voice codec. Unlike other codecs like MP3 and Ogg Vorbis, Speex is designed to compress voice at bitrates in the 2-45 kbps range. Possible applications include VoIP, internet audio streaming, archiving of speech data (e.g. voice mail), and audio books.

file format support

  • AudioFile 📁 🌐 -- a simple header-only C++ library for reading and writing audio files. (WAV, AIFF)

  • basez 📁 🌐 -- encode data into/decode data from base16, base32, base32hex, base64 or base64url stream per RFC 4648; MIME base64 Content-Transfer-Encoding per RFC 2045; or PEM Printable Encoding per RFC 1421.

  • CHM-lib 📁 🌐 -- CHM file support, as I have several HTML pages stored in this format. See also MHTML: mht-rip

  • cpp-base64 📁 🌐 -- base64 encoding and decoding with C++

  • csv-parser 📁 🌐 -- Vince's CSV Parser: there's plenty of other CSV parsers in the wild, but I had a hard time finding what I wanted. Inspired by Python's csv module, I wanted a library with simple, intuitive syntax. Furthermore, I wanted support for special use cases such as calculating statistics on very large files. Thus, this library was created with those goals in mind.

  • cvmatio 📁 🌐 -- an open source Matlab v7 MAT file parser written in C++, giving users the ability to interact with binary MAT files in their own projects.

  • datamash 📁 🌐 -- GNU Datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files. It is designed to be portable and reliable, and aid researchers to easily automate analysis pipelines, without writing code or even short scripts.

  • dcmtk 📁 🌐 -- the DICOM toolkit (DCMTK) package consists of source code, documentation and installation instructions for a set of software libraries and applications implementing part of the DICOM/MEDICOM Standard.

  • djvulibre 📁 🌐 -- DjVu (pronounced "déjà vu") is a set of compression technologies, a file format, and a software platform for the delivery over the Web of digital documents, scanned documents, and high resolution images.

  • extract 📁 🌐 -- clone of git://git.ghostscript.com/extract.git

  • fast-cpp-csv-parser 📁 🌐 -- a small, easy-to-use and fast header-only library for reading comma separated value (CSV) files.

  • fastgron 📁 🌐 -- fastgron makes JSON greppable super fast! fastgron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but lack documentation.

  • FFmpeg 📁 🌐 -- a collection of libraries and tools to process multimedia content such as audio, video, subtitles and related metadata.

  • file 📁 🌐 -- file filetype recognizer tool & mimemagic

  • flac 📁 🌐 -- a software that can reduce the amount of storage space needed to store digital audio signals without needing to remove information in doing so. The files read and produced by this software are called FLAC files. As these files (which follow the FLAC format) can be read from and written to by other software as well, this software is often referred to as the FLAC reference implementation.

  • gmt 📁 🌐 -- GMT (Generic Mapping Tools) is an open source collection of about 100 command-line tools for manipulating geographic and Cartesian data sets (including filtering, trend fitting, gridding, projecting, etc.) and producing high-quality illustrations ranging from simple x-y plots via contour maps to artificially illuminated surfaces, 3D perspective views and animations. The GMT supplements add another 50 more specialized and discipline-specific tools. GMT supports over 30 map projections and transformations and requires support data such as GSHHG coastlines, rivers, and political boundaries and optionally DCW country polygons.

  • gumbo-libxml 📁 🌐 -- LibXML2 bindings for the Gumbo HTML5 parser: this provides a libxml2 API on top of the Gumbo parser. It lets you use a modern parser - Gumbo now passes all html5lib tests, including the template tag, and should be fully conformant with the HTML5 spec - with the full ecosystem of libxml tools, including XPath, tree modification, DTD validation, etc.

  • gumbo-parser 📁 🌐 -- HTML parser

  • gumbo_pp 📁 🌐 -- a C++ wrapper over Gumbo that provides a higher level query mechanism.

  • gumbo-query 📁 🌐 -- HTML DOM access in C/C++

  • html5-parser 📁 🌐 -- a fast, standards compliant, C based, HTML 5 parser for python. Over thirty times as fast as pure python based parsers, such as html5lib.

  • http-parser 📁 🌐 -- a parser for HTTP messages written in C. It parses both requests and responses. The parser is designed to be used in high-performance HTTP applications. It does not make any syscalls or allocations, it does not buffer data, and it can be interrupted at any time. Depending on your architecture, it only requires about 40 bytes of data per message stream (in a web server that is per connection).

  • id3-tagparser 📁 🌐 -- a C++ library for reading and writing MP4 (iTunes), ID3, Vorbis, Opus, FLAC and Matroska tags.

  • jq 📁 🌐 -- a lightweight and flexible command-line JSON processor.

  • jtc 📁 🌐 -- jtc stands for JSON transformational chains (it used to be JSON test console) and is a CLI tool to extract, manipulate and transform source JSON, offering powerful ways to select one or multiple elements from a source JSON and apply various actions on the selected elements at once (wrap selected elements into a new JSON, filter in/out, sort elements, update elements, insert new elements, remove, copy, move, compare, transform, swap around and many other operations).

  • lazycsv 📁 🌐 -- a c++17, posix-compliant, single-header library for reading and parsing csv files. It's fast and lightweight and does not allocate any memory in the constructor or while parsing. It parses each row and cell just on demand on each iteration, that's why it's called lazy.

  • lexbor 📁 🌐 -- fast HTML5 fully-conformant HTML + CSS parser.

  • libaom 📁 🌐 -- AV1 Codec Library

  • libarchive 📁 🌐 -- a portable, efficient C library that can read and write streaming archives in a variety of formats. It also includes implementations of the common tar, cpio, and zcat command-line tools that use the libarchive library.

  • libase 📁 🌐 -- a tiny library for interpreting the Adobe Swatch Exchange (.ase) file format for color palettes since Adobe Creative Suite 3.

  • libass 📁 🌐 -- libass is a portable subtitle renderer for the ASS/SSA (Advanced Substation Alpha/Substation Alpha) subtitle format.

  • libavif 📁 🌐 -- a friendly, portable C implementation of the AV1 Image File Format, as described here: https://aomediacodec.github.io/av1-avif/

  • libcmime 📁 🌐 -- MIME extract/insert/encode/decode: use for MHTML support

  • libcsv2 📁 🌐 -- CSV file format reader/writer library.

  • libde265 📁 🌐 -- libde265 is an open source implementation of the h.265 video codec. It is written from scratch and has a plain C API to enable a simple integration into other software. libde265 supports WPP and tile-based multithreading and includes SSE optimizations. The decoder includes all features of the Main profile and correctly decodes almost all conformance streams (see [wiki page]).

  • libexpat 📁 🌐 -- XML read/write

  • libheif 📁 🌐 -- High Efficiency Image File Format (HEIF) :: a visual media container format standardized by the Moving Picture Experts Group (MPEG) for storage and sharing of images and image sequences. It is based on the well-known ISO Base Media File Format (ISOBMFF) standard. HEIF Reader/Writer Engine is an implementation of HEIF standard in order to demonstrate its powerful features and capabilities.

  • libheif-alt 📁 🌐 -- an ISO/IEC 23008-12:2017 HEIF and AVIF (AV1 Image File Format) file format decoder and encoder. HEIF and AVIF are new image file formats employing HEVC (h.265) or AV1 image coding, respectively, for the best compression ratios currently possible.

  • libics 📁 🌐 -- the reference library for ICS (Image Cytometry Standard), an open standard for writing images of any dimensionality and data type to file, together with associated information regarding the recording equipment or recorded subject.

    ICS stands for Image Cytometry Standard, and was first proposed in: P. Dean, L. Mascio, D. Ow, D. Sudar, J. Mullikin, "Proposed standard for image cytometry data files", Cytometry, n.11, pp.561-569, 1990.

    The original standard writes two files: the header, with an '.ics' extension, and the actual image data, with an '.ids' extension.

    ICS version 2.0 extends this standard to allow for a more versatile placement of the image data. It can now be placed either in the same '.ics' file or embedded in any other file, by specifying the file name and the byte offset for the data.

    The advantage of ICS over other open standards such as TIFF is that it allows data of any type and dimensionality to be stored. A TIFF file can contain a collection of 2D images; it's up to the user to determine how these relate to each other. An ICS file can contain, for example, a 5D image in which the 4th dimension is the light frequency and the 5th is time. Also, all of the information regarding the microscope settings (or whatever instrument was used to acquire the image) and the sample preparation can be included in the file.

  • libmetalink 📁 🌐 -- a library to read Metalink XML download description format. It supports both Metalink version 3 and Metalink version 4 (RFC 5854).

  • libmobi 📁 🌐 -- a library for handling Mobipocket/Kindle (MOBI) ebook format documents.

  • libpsd 📁 🌐 -- a library for decoding and rendering Adobe Photoshop .psd files.

  • LibRaw 📁 🌐 -- a library for reading and processing of RAW files generated by digital photo cameras.

  • libsndfile 📁 🌐 -- a C library for reading and writing files containing sampled audio data, e.g. Ogg, Vorbis and FLAC.

  • libultrahdr 📁 🌐 -- libultrahdr is an image compression library that uses gain map technology to store and distribute HDR images. Conceptually on the encoding side, the library accepts SDR and HDR rendition of an image and from these a Gain Map (quotient between the two renditions) is computed. The library then uses backward compatible means to store the base image (SDR), gain map image and some associated metadata.

  • libwarc 📁 🌐 -- C++ library to parse WARC files. WARC is the official storage format of the Internet Archive for storing scraped content. WARC format used: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

  • libxml2 📁 🌐 -- libxml: XML read/write

  • libzip 📁 🌐 -- a C library for reading, creating, and modifying zip and zip64 archives.

  • LLhttp-parser 📁 🌐 -- a port and replacement of http_parser to TypeScript. llparse is used to generate the output C source file, which could be compiled and linked with the embedder's program (like Node.js).

  • mcmd 📁 🌐 -- MCMD (M-Command): a set of commands developed for high-speed processing of large-scale structured tabular data in CSV format. It can efficiently process large-scale data with hundreds of millions of records on a standard PC.

  • metalink-cli 📁 🌐 -- a small program which generates a metalink record on stdout for every file given on the commandline and using the mirror list from stdin.

  • metalink-mini-downloader 📁 🌐 -- a small metalink downloader written in C++, using boost, libcurl and expat. It can either be compiled so that it downloads a specific file and then (optionally) launches it or be compiled into a "downloader template", which can later be used to create a custom downloader by replacing text strings inside the executable (they are marked in a special way, to make this easy).

  • mht-rip 📁 🌐 -- as I have several HTML pages stored in this MHTML format. See also CHM: CHM-lib

  • mime-mega 📁 🌐 -- MIME extract/insert/encode/decode: use for MHTML support

  • mimetic 📁 🌐 -- MIME: use for MHTML support

  • minizip-ng 📁 🌐 -- a zip manipulation library written in C that is supported on Windows, macOS, and Linux. Minizip was originally developed by Gilles Vollant in 1998. It was first included in the zlib distribution as an additional code contribution starting in zlib 1.1.2. Since that time, it has been continually improved upon and contributed to by many people. The original project can still be found in the zlib distribution that is maintained by Mark Adler.

  • netpbm 📁 🌐 -- a toolkit for manipulation of graphic images, including conversion of images between a variety of different formats. There are over 300 separate tools in the package including converters for about 100 graphics formats. Examples of the sort of image manipulation we're talking about are: Shrinking an image by 10%; Cutting the top half off of an image; Making a mirror image; Creating a sequence of images that fade from one image to another, etc.

  • OpenEXR 📁 🌐 -- a high dynamic-range (HDR) image file format developed by Industrial Light & Magic (ILM) for use in computer imaging applications.

  • openexr-images 📁 🌐 -- collection of images associated with the OpenEXR distribution.

  • OpenImageIO 📁 🌐 -- Reading, writing, and processing images in a wide variety of file formats, using a format-agnostic API, aimed at VFX applications.

    Also includes:

    • an ImageCache class that transparently manages a cache so that it can access truly vast amounts of image data (tens of thousands of image files totaling multiple TB) very efficiently using only a tiny amount (tens of megabytes at most) of runtime memory.
    • ImageBuf and ImageBufAlgo functions, which constitute a simple class for storing and manipulating whole images in memory, plus a collection of the most useful computations you might want to do involving those images, including many image processing operations.

    The primary target audience for OIIO is VFX studios and developers of tools such as renderers, compositors, viewers, and other image-related software you'd find in a production pipeline.

  • pdf2htmlEX 📁 🌐 -- convert PDF to HTML without losing text or format.

  • picohttpparser 📁 [🌐](https
