Abstract—With advances in Neural Networks and Deep
Learning technologies, problems such as face recognition
and live scene detection have come into the limelight.
From simple ANNs to GANs, neural networks can be
used to solve problems ranging from simple to complex. One of them is an always-on
Haar-like face detector, implemented in wearable devices.
Because wearable devices are small and cannot withstand the large
power consumption and large chip designs needed to support advanced
CNN-based face detection techniques, this paper introduces an
ultra-low-power CNN FR processor and a CIS-integrated always-on
Haar-like face detector, which can be implemented
in smart wearable devices. Earlier work on the same problem
produced less efficient results, achieving less than 10 hours of
operation time with a 190 mAh coin battery.
Index Terms—face recognition; feedforward neural nets; learning
(artificial intelligence); low-power electronics; microprocessor
chips; power aware computing; CIS; CNN processors; always-on
Haar-like face detector; convolutional neural network; deep
learning; device intelligence; face recognition; power demand;
program processors
Recent trends in Artificial Intelligence and Deep Learning
have focused on training machines to make decisions
themselves, without any human interference. This remains
an open problem, even as neural network models such as CNNs
(Convolutional Neural Networks), RNNs (Recurrent Neural
Networks), ANNs (Artificial Neural Networks), and GANs
continue to advance. To implement these models in wearable devices (such as
smart watches and smart glasses), power efficiency and
processing speed play vital roles. Wearable devices have
limited battery capacity, yet high recognition accuracy is
required along with low power consumption. To overcome
these hurdles, an ultra-low-power CNN Face Recognition (FR)
processor and a CIS integrated with an always-on Haar-like
face detector have been introduced for smart wearable devices.
The system works in three steps. In the first step, the
recognition algorithm is run and the Haar cascade learning XML file
is generated, to produce efficient and accurate results.
The Haar features needed for face recognition are shown in Fig. 1,
while the results of a simple application of a Haar cascade
classifier are shown in Fig. 2.
Fig. 1. “Haar Features”
Fig. 2. “Output of Haar Classifier”
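The core primitive behind the Haar features in Fig. 1 is a sum-of-rectangles computation over an integral image. The following is a minimal illustrative sketch (not the paper's implementation) of a two-rectangle "edge" feature of the kind a Haar cascade evaluates:

```python
# Illustrative sketch: computing a two-rectangle Haar-like feature
# with an integral image (summed-area table), the building block of
# a Haar cascade face detector. Names here are my own, not the paper's.

def integral_image(img):
    """ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), size w x h,
    in O(1) time using four corner lookups."""
    a = ii[y + h - 1][x + w - 1]
    b = ii[y - 1][x + w - 1] if y > 0 else 0
    c = ii[y + h - 1][x - 1] if x > 0 else 0
    d = ii[y - 1][x - 1] if x > 0 and y > 0 else 0
    return a - b - c + d

def two_rect_feature(ii, x, y, w, h):
    """Edge feature: left half minus right half (w must be even)."""
    left = rect_sum(ii, x, y, w // 2, h)
    right = rect_sum(ii, x + w // 2, y, w // 2, h)
    return left - right

# A 4x4 patch with a bright left half and dark right half produces a
# strong edge response.
img = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0]]
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 4, 4))  # 8
```

Because every rectangle sum costs only four lookups regardless of its size, a cascade can evaluate thousands of such features per window cheaply, which is what makes an always-on detector feasible.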
The other steps include an ultra-low-power CNNP with wide-I/O
local distributed memory (DM-CNNP), a separable
filter approximation for convolutional layers (SF-CONV), and
a transpose-read SRAM (T-SRAM) for low-power CNN processing.
Fig. 3 shows the overall proposed FR system,
consisting of two chips: the Face Image Sensor (FIS) and the CNNP.
Once face detection is done using the Haar cascade classifier, the FIS
transmits only the face image to the CNNP, and the CNNP then
completes the face recognition task.
Fig. 3. “Overall Architecture as proposed.”
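The idea behind SF-CONV can be sketched numerically: a 2-D convolution kernel is approximated by a pair of 1-D filters, cutting the MACs per output pixel from k×k to 2k. The snippet below is my own illustration of the principle using a rank-1 SVD approximation, not the paper's circuit:

```python
# Sketch of separable filter approximation: factor a k x k kernel into
# a vertical and a horizontal 1-D filter via rank-1 SVD. For an exactly
# separable kernel (like Sobel-x) the reconstruction is error-free.
import numpy as np

def separable_approx(kernel):
    """Return (col, row) vectors whose outer product is the best
    rank-1 (least-squares) approximation of `kernel`."""
    u, s, vt = np.linalg.svd(kernel)
    col = u[:, 0] * np.sqrt(s[0])   # k x 1 vertical filter
    row = vt[0, :] * np.sqrt(s[0])  # 1 x k horizontal filter
    return col, row

sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
col, row = separable_approx(sobel_x)
err = np.abs(np.outer(col, row) - sobel_x).max()
print(f"max reconstruction error: {err:.2e}")  # near zero
```

Trained CNN kernels are generally not exactly separable, so SF-CONV trades a small approximation error for a large reduction in multiply-accumulate work and memory traffic.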
It is very important to appreciate hardware acceleration,
as many models may give higher accuracy but
may not be able to run on wearable devices. The given
architecture therefore takes care of both power efficiency and accuracy
for face recognition. The outputs of the
face recognition task are shown in Fig. 4.
Fig. 4. “The result of the proposed architecture.”
The architecture of the DM-CNNP, containing the components necessary to
speed up MAC operations, is shown in Fig. 5. The process is
followed in these steps: each PE fetches 32 words
per cycle from its local T-SRAM to support 4 convolution units,
where each unit has a 64-MAC array. Thus, the CNNP with 16
PEs (4 × 4) can fetch 512 words per cycle (16 × 32)
from the wide-I/O local distributed memory and
execute 1,024 MAC operations per cycle simultaneously. This
much wider memory bandwidth and massively parallel MAC operation
per cycle enable high-throughput operation at a low
clock frequency (5 MHz) and a near-threshold voltage (NTV) of
0.46 V. When a convolution operation is performed,
the MAC input registers shift the words by one column
each cycle to accumulate the partial sums in the accumulation
registers. PEs connected to the same row can transfer data
to other PEs to reduce the processing-cycle overhead of
inter-PE data communication. Also, the MAC units can be
clock-gated with mask bits to reduce unnecessary power consumption.
Fig. 5. “CNN Architecture”
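The per-cycle figures above can be sanity-checked with a short calculation. Reading the text as 64 MACs per PE in total (spread across its 4 convolution units, which is my interpretation of the wording), the numbers are consistent:

```python
# Back-of-the-envelope check (my arithmetic, using the figures quoted
# in the text) of the CNNP's per-cycle parallelism and peak throughput.

PES = 16              # processing elements in the 4 x 4 array
WORDS_PER_PE = 32     # words fetched from local T-SRAM per PE per cycle
MACS_PER_PE = 64      # MACs across a PE's 4 convolution units (assumed split)
CLOCK_HZ = 5_000_000  # 5 MHz near-threshold clock

words_per_cycle = PES * WORDS_PER_PE          # 512 words/cycle
macs_per_cycle = PES * MACS_PER_PE            # 1,024 MAC ops/cycle
peak_gmacs = macs_per_cycle * CLOCK_HZ / 1e9  # peak GMAC/s at 5 MHz

print(words_per_cycle, macs_per_cycle, peak_gmacs)  # 512 1024 5.12
```

The point of the design is visible in the last line: even at a very slow 5 MHz clock, the wide memory and parallel MAC array still deliver multi-GMAC/s throughput.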
We can conclude from the literature survey that, along with
algorithms, power efficiency and hardware acceleration play
major roles. From the 1980s to 2017 there have been major
advances in hardware circuitry, and this progress enables the
technology to advance further. The CNN processor, the Face Image
Sensor, and circuits like the analog Haar-like filtering circuit are
new to us, and give us good insight into the trending