When people hear “machine learning,” they often imagine large models running on powerful GPUs in data centers. That picture breaks down quickly once we move to edge devices. Cameras, LiDAR sensors, vehicles, and especially drones operate under tight constraints: limited power and thermal budget, strict latency requirements, and often no access to cloud resources. In these environments, the question is not how powerful a model can be, but how efficiently the necessary computation can be performed.
Image matching in perception is a good example. We start with raw red, green, and blue (RGB) values for every pixel. Even a modest image contains millions of such values. Processing all of them directly is neither necessary nor efficient. Instead, perception systems select small regions of interest and reduce them to compact numeric representations that can be compared across images. These representations allow a system to recognize the same physical point despite camera motion, rotation, or slight changes in lighting.
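To make that comparison step concrete, here is a minimal sketch of descriptor matching, assuming the descriptors for two images have already been computed as short NumPy vectors (the function name and array shapes are illustrative, not taken from any particular library):

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """Match each descriptor from image A to its nearest neighbour in image B.

    desc_a: (N, D) array, desc_b: (M, D) array, one D-dimensional descriptor
    per detected point. Returns, for each row of desc_a, an index into desc_b.
    """
    # Pairwise squared Euclidean distances between all descriptor pairs.
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.sum(diff * diff, axis=-1)   # shape (N, M)
    # The best match is simply the closest descriptor in the other image.
    return np.argmin(dist, axis=1)
```

This nearest-neighbour test is the operation the rest of the pipeline exists to make cheap: everything before it is about producing vectors small and stable enough that a simple distance is meaningful.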
A common approach is to extract one small image patch (a local window) at a time. For example, an 8×8 grayscale patch captures local structure without excessive data. The goal is not to preserve every pixel, but to capture enough information so that the patch can be reliably matched to a corresponding patch in another image. This step already reflects a core edge-computing idea: reduce data early, before it becomes expensive to move or process.
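To ground this, here is a small sketch of the extraction step, assuming a grayscale image stored as a 2-D NumPy array and an interest point (x, y) found by some earlier detector (the function name and the border handling are illustrative choices):

```python
import numpy as np

def extract_patch(gray, x, y, size=8):
    """Cut a size x size patch centred near (x, y) from a grayscale image.

    gray: 2-D uint8 array; (x, y) is a previously detected interest point.
    Returns the patch normalised to [0, 1], or None if it falls off the image.
    """
    half = size // 2
    top, left = y - half, x - half
    if top < 0 or left < 0 or top + size > gray.shape[0] or left + size > gray.shape[1]:
        return None  # skip points too close to the image border
    patch = gray[top:top + size, left:left + size]
    return patch.astype(np.float32) / 255.0
```

An 8×8 patch is only 64 numbers, which is the kind of early data reduction that keeps the rest of the pipeline affordable.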
Traditionally, this reduction has been done with hand-designed feature descriptors. These methods rely on fixed mathematical operations, such as gradient comparisons or intensity differences, to produce a compact signature. While effective, they are rigid. Small neural networks provide an alternative that is still lightweight but more adaptable. Instead of hard-coding how pixel values should be combined, the network learns those combinations from data.
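As a point of reference, a hand-designed descriptor in the spirit of BRIEF can be sketched in a few lines; the comparison pattern below is random but fixed, an illustrative simplification rather than any particular published method:

```python
import numpy as np

# One fixed set of pixel-pair positions, shared by every patch, so that
# descriptors computed in different images remain directly comparable.
_PAIRS = np.random.default_rng(0).integers(0, 8, size=(32, 4))  # y1, x1, y2, x2

def handcrafted_descriptor(patch):
    """Toy intensity-comparison descriptor for an 8x8 patch: one bit per pair,
    set when the first sampled pixel is brighter than the second."""
    y1, x1, y2, x2 = _PAIRS.T
    return (patch[y1, x1] > patch[y2, x2]).astype(np.uint8)
```

The comparisons are fixed at design time, which is exactly the rigidity mentioned above: nothing in this function can adapt to the data it will actually see.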
In this post, I start with a minimal deep neural network (DNN) that performs this role. The network operates on a small set of numeric inputs derived from an image patch and produces a short descriptor vector. It is intentionally simple, consisting of two fully connected layers with a thresholding step between them. There is no recurrence, no attention, and no dynamic behavior. Every input produces an output in a fixed number of operations.
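A minimal sketch of such a network in NumPy might look like the following; the layer sizes (64 inputs, 32 hidden units, a 16-value descriptor) and the random weights are assumptions for illustration, since the real values would come from training:

```python
import numpy as np

class TinyDescriptorNet:
    """Two fully connected layers with a thresholding step between them."""

    def __init__(self, n_in=64, n_hidden=32, n_out=16, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights keep the sketch runnable; in practice they are learned.
        self.w1 = 0.1 * rng.standard_normal((n_in, n_hidden)).astype(np.float32)
        self.b1 = np.zeros(n_hidden, dtype=np.float32)
        self.w2 = 0.1 * rng.standard_normal((n_hidden, n_out)).astype(np.float32)
        self.b2 = np.zeros(n_out, dtype=np.float32)

    def forward(self, x):
        """x: flattened patch of length n_in. Returns a short descriptor vector."""
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # first dense layer + threshold
        return h @ self.w2 + self.b2                # second dense layer
```

Producing a descriptor is then one call, `TinyDescriptorNet().forward(patch.ravel())`: two matrix-vector products and one elementwise threshold, always the same number of operations per input.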
This simplicity is not a limitation. It is a design choice driven by edge constraints. A network of this size can be evaluated with predictable latency and minimal memory access. It can be quantized to fixed-point arithmetic without complex error behavior. Most importantly, it maps naturally onto hardware.
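The quantization step can also be sketched simply; the scheme below (symmetric, per-tensor, 8-bit) is one common choice among several, not a prescription:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8.

    Returns the quantized weights plus the scale needed to interpret them."""
    scale = np.max(np.abs(w)) / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def dense_int8(x_q, w_q, x_scale, w_scale):
    """Integer dense layer: multiply-accumulate in int32, rescale once at the end."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc * (x_scale * w_scale)
```

Because the network is a fixed sequence of dense layers, quantization error can only enter through the weights, the activations, and the final rescale, which is what makes its behavior easy to reason about.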
A Python implementation is a practical starting point because it allows the algorithm to be expressed clearly and tested quickly. However, Python is not the target execution environment. Running this computation on a CPU means executing many small arithmetic operations sequentially, with overhead that dominates the actual math. GPUs improve throughput, but at a cost in power and system complexity that is often unacceptable in embedded platforms.
On a system on a chip (SoC) like TI’s TDA4, this kind of workload can already be accelerated without a GPU. The device includes a C7x DSP paired with a Matrix Multiply Accelerator (MMA) designed specifically for dense linear algebra and neural network inference. Dense layers map naturally onto this hardware, while simple activation functions and distance calculations run efficiently on the DSP itself. This makes the TDA4 a strong example of how edge processors are evolving toward integrated ML acceleration. At the same time, devices in this class typically fall in the $16 to $20 range in volume, which is entirely reasonable for automotive or industrial systems but still too expensive for many low-cost or highly specialized designs.
This cost boundary is one of the motivations for exploring FPGA and ASIC-style implementations, where a narrowly focused accelerator can deliver the required functionality at lower power and potentially lower unit cost.
On an FPGA or later on an ASIC, the computation becomes the structure of the circuit. Multiplications, additions, and comparisons happen in parallel, every clock cycle. Data flows through the network in a fixed pattern, producing one descriptor after another with deterministic timing. For workloads such as image matching, this approach aligns much better with edge requirements.
In the posts that follow, I will expand on these ideas step by step.