How Have Neural Networks Changed the Industry?

What Is Artificial Intelligence (AI)?

Yugal Choubisa
9 min read · Mar 18, 2021


Artificial Intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions to solve problems.

What do you mean by Machine Learning?

In simple words, machine learning is the study of computer algorithms that train a machine on a set of data, or experience, so that it can make predictions or decisions without being explicitly programmed to do so, and can improve automatically through experience and the use of data.

What do you mean by Neural Network?

An artificial neural network, usually just called a neural network, is a network or circuit of artificial neurons, or nodes. It is loosely modeled on the biological neural networks of the brain and is used to solve artificial intelligence problems.
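
At the lowest level, each node simply weights its inputs, adds a bias, and applies a non-linear activation. A minimal illustration in Python (the numbers here are arbitrary):

```python
import numpy as np

# A single artificial neuron (node): weight the inputs, add a bias,
# and squash the result with a sigmoid activation.
def neuron(inputs, weights, bias):
    return 1.0 / (1.0 + np.exp(-(np.dot(inputs, weights) + bias)))

# Example with two inputs and arbitrary weights.
output = neuron(np.array([0.5, -1.2]), np.array([0.8, 0.3]), bias=0.1)
print(output)  # a value between 0 and 1
```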

How has Apple used neural networks, and how did this become one of its major features?

An On-device Deep Neural Network for Face Detection

Apple started using deep learning for face detection in iOS 10. With the release of the Vision framework, developers can now use this technology and many other computer vision algorithms in their apps. Apple faced significant challenges in developing the framework so that it could preserve user privacy and run efficiently on-device. This article discusses these challenges and describes the face detection algorithm.

Introduction:

Apple first released face detection in a public API in the Core Image framework, through the CIDetector class. This API was also used internally by Apple apps such as Photos. The earliest release of CIDetector used a method based on the Viola-Jones detection algorithm, and subsequent improvements to CIDetector were based on advances in traditional computer vision.

Apple’s iCloud Photo Library is a cloud-based solution for photo and video storage. However, due to Apple’s strong commitment to user privacy, iCloud servers could not be used for computer vision computations. Every photo and video sent to iCloud Photo Library is encrypted on the device before it is sent to cloud storage, and can only be decrypted by devices that are registered with the iCloud account. Therefore, to bring deep learning-based computer vision solutions to its customers, Apple had to directly address the challenges of getting deep learning algorithms running on the iPhone.

Unlike cloud-based services, whose resources can be dedicated solely to a vision problem, on-device computation must share system resources with other running applications. Finally, the computation must be efficient enough to process a large Photos library in a reasonably short amount of time, without significant power usage or thermal increase.

Moving From Viola-Jones to Deep Learning:

In 2014, when Apple began working on a deep learning approach to detecting faces in images, deep convolutional networks (DCNs) were just beginning to yield promising results on object detection tasks. Most prominent among these was an approach called “OverFeat,” which popularized some simple ideas that showed DCNs to be quite efficient at scanning an image for an object.

OverFeat drew an equivalence between the fully connected layers of a neural network and convolutional layers with valid convolutions of filters of the same spatial dimensions as the input. This work made it clear that a binary classification network with a fixed receptive field could be applied efficiently to an arbitrarily sized image to produce an appropriately sized output map. The OverFeat paper also provided clever recipes for producing denser output maps by effectively reducing the network stride.
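
The equivalence is easy to demonstrate. The following PyTorch sketch uses invented layer sizes (not Apple's network) to show that a fully connected classifier over a fixed receptive field can be rewritten as a convolution whose kernel covers that field, and that the convolutional form then slides over a larger image to produce an output map:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not Apple's actual network.
C, H, W = 8, 32, 32          # channels and spatial size of the fixed receptive field

# A fully connected classifier over a fixed-size input...
fc = nn.Linear(C * H * W, 1)

# ...is equivalent to a convolution whose kernel covers the whole input.
conv = nn.Conv2d(C, 1, kernel_size=(H, W))
conv.weight.data = fc.weight.data.view(1, C, H, W)
conv.bias.data = fc.bias.data

x = torch.randn(1, C, H, W)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5))  # True

# Applied to a larger image, the convolutional form produces an output map:
# each spatial position scores one receptive-field-sized tile of the input.
big = torch.randn(1, C, 64, 96)
print(conv(big).shape)  # torch.Size([1, 1, 33, 65])
```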

Apple built its initial architecture based on some of the insights from the OverFeat paper, resulting in a fully convolutional network with a multitask objective (a minimal sketch follows the list below) comprising:

  • a binary classification to predict the presence or absence of a face in the input, and
  • a regression to predict the bounding box parameters that best localized the face in the input.
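
This two-headed design can be captured in a few lines. Below is a minimal PyTorch sketch: the backbone, layer widths, and sizes are invented for illustration and are not Apple's architecture; only the structure of a classification head plus a box-regression head reflects the description above.

```python
import torch
import torch.nn as nn

class TinyFaceFCN(nn.Module):
    """Illustrative fully convolutional face detector with a multitask head.

    Layer sizes are made up for this sketch; they are not Apple's network.
    """
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Head 1: binary classification (face / no face) per output location.
        self.cls_head = nn.Conv2d(32, 1, 1)
        # Head 2: bounding-box regression (x, y, w, h) per output location.
        self.box_head = nn.Conv2d(32, 4, 1)

    def forward(self, x):
        features = self.backbone(x)
        return self.cls_head(features), self.box_head(features)

model = TinyFaceFCN()
cls_logits, boxes = model(torch.randn(1, 3, 64, 64))
print(cls_logits.shape, boxes.shape)  # [1, 1, 16, 16] and [1, 4, 16, 16]
```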

They experimented with several ways of training such a network. For example, a simple procedure for training is to create a large dataset of image tiles of a fixed size corresponding to the smallest valid input to the network such that each tile produces a single output from the network. The training dataset is ideally balanced, so that half of the tiles contain a face (positive class) and the other half do not contain a face (negative class). For each positive tile, they provide the true location (x, y, w, h) of the face. They train the network to optimize the multitask objective described previously. Once trained, the network is able to predict whether a tile contains a face, and if so, it also provides the coordinates and scale of the face in the tile.
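
Continuing the sketch, one training step on a balanced batch of fixed-size tiles might look like the following. TinyFaceFCN is the hypothetical model from the previous sketch, the tile size is chosen so that this toy backbone yields exactly one output per tile, and the data is random placeholder rather than a real dataset.

```python
import torch
import torch.nn.functional as F

# Balanced batch: 4 "face" tiles and 4 "background" tiles (placeholder data).
tiles = torch.randn(8, 3, 4, 4)
has_face = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])
true_boxes = torch.rand(8, 4)          # (x, y, w, h), only used for positive tiles

model = TinyFaceFCN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

cls_logits, boxes = model(tiles)       # one output per tile at this input size
cls_loss = F.binary_cross_entropy_with_logits(cls_logits.view(8), has_face)
pos = has_face.bool()
box_loss = F.smooth_l1_loss(boxes.view(8, 4)[pos], true_boxes[pos])
loss = cls_loss + box_loss             # the multitask objective

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```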

A revised DCN architecture for face detection

Since the network is fully convolutional, it can efficiently process an arbitrarily sized image and produce a 2D output map. Each point on the map corresponds to a tile in the input image and contains the network's prediction about the presence or absence of a face in that tile, along with its location and scale within the tile.

Given such a network, they could then build a fairly standard processing pipeline to perform face detection, consisting of a multi-scale image pyramid, the face detector network, and a post-processing module. They needed a multi-scale pyramid to handle faces across a wide range of sizes. The network is applied to each level of the pyramid, and candidate detections are collected from each level. The post-processing module then combines these candidate detections across scales to produce a list of bounding boxes that corresponds to the network's final prediction of the faces in the image.
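
The shape of that pipeline can be sketched as follows. This is not Apple's implementation: the scale factors, score threshold, box decoding, and the use of non-maximum suppression as the post-processing step are illustrative assumptions, and TinyFaceFCN is the hypothetical detector from the earlier sketch.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import nms

def detect_faces(image, model, scales=(1.0, 0.75, 0.5, 0.35, 0.25), threshold=0.5):
    all_boxes, all_scores = [], []
    for s in scales:
        # One level of the image pyramid.
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        cls_logits, boxes = model(scaled)           # maps: [1,1,h,w] and [1,4,h,w]
        scores = torch.sigmoid(cls_logits)[0, 0]    # face probability per location
        keep = scores > threshold
        if keep.any():
            # Interpret the regression output as (cx, cy, w, h); convert to
            # corner form and map back to original-image coordinates.
            cx, cy, w, h = boxes[0, :, keep]
            w, h = w.abs(), h.abs()                 # keep widths/heights valid in this sketch
            xyxy = torch.stack([cx - w / 2, cy - h / 2,
                                cx + w / 2, cy + h / 2], dim=1) / s
            all_boxes.append(xyxy)
            all_scores.append(scores[keep])
    if not all_boxes:
        return torch.empty(0, 4), torch.empty(0)
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_threshold=0.3)    # merge candidates across scales
    return boxes[keep], scores[keep]

boxes, scores = detect_faces(torch.randn(1, 3, 256, 256), TinyFaceFCN())
print(boxes.shape, scores.shape)
```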

Face detection workflow

This strategy brought them closer to running a deep convolutional network on-device to exhaustively scan an image. But network complexity and size remained key bottlenecks to performance. Overcoming this challenge meant not only limiting the network to a simple topology, but also restricting the number of layers, the number of channels per layer, and the kernel size of the convolutional filters. These restrictions raised a crucial problem: the networks that produced acceptable accuracy were anything but simple, most going over 20 layers and consisting of several network-in-network modules. Using such networks in the image-scanning framework described previously would have been completely infeasible: they led to unacceptable performance and power usage, and could not even be loaded into memory. The challenge, then, was how to train a simple and compact network that could mimic the behavior of the accurate but highly complex networks.

They decided to leverage an approach informally called “teacher-student” training. This approach provided a mechanism to train a second, thin-and-deep network in such a way that it matched very closely the outputs of the big, complex network they had trained as described previously. The student network was composed of a simple repeating structure of 3x3 convolutions and pooling layers, and its architecture was heavily tailored to best leverage Apple's neural network inference engine.
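
In generic terms, one teacher-student (knowledge distillation) training step looks something like the sketch below. The loss terms and both networks are placeholders; the only point is that the small student is optimized to match the frozen teacher's outputs rather than hand-labeled data alone.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, images):
    # Targets come from the large, accurate teacher, which is kept frozen.
    with torch.no_grad():
        t_cls, t_box = teacher(images)
    s_cls, s_box = student(images)
    # Train the small student to match the teacher's face probabilities
    # and box predictions (illustrative choice of loss terms).
    loss = (F.binary_cross_entropy_with_logits(s_cls, torch.sigmoid(t_cls))
            + F.smooth_l1_loss(s_box, t_box))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```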

Now, finally, they had an algorithm for a deep neural network for face detection that was feasible for on-device execution. They iterated through several rounds of training to obtain a network model that was accurate enough to enable the desired applications. While this network was accurate and feasible, a tremendous amount of work still remained to make it practical for deploying on millions of user devices.

Optimizing the Image Pipeline:

Practical considerations around deep learning factored heavily into Apple's design choices for an easy-to-use framework for developers, which they call Vision. It quickly became apparent that great algorithms are not enough for creating a great framework; they also needed a highly optimized image pipeline.

Face detection should work well whether used in live camera capture streams, video processing, or processing of images from disk or the web. It should work regardless of image representation and format.

They were concerned with power consumption and memory usage, especially for streaming and image capture. They also worried about memory footprint, such as the large footprint needed for a 64-megapixel panorama. They addressed these concerns by using partial subsampled decoding and automatic tiling to perform computer vision tasks on large images, even those with non-typical aspect ratios.
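
A rough idea of automatic tiling, with made-up tile and overlap sizes (this is not the Vision framework's implementation): break a very large image into overlapping pieces that each fit in a bounded amount of memory, process them independently, and keep each piece's offset so results can be mapped back to full-image coordinates.

```python
import torch

def tiled_apply(image, process_tile, tile=1024, overlap=64):
    _, _, height, width = image.shape
    results = []
    step = tile - overlap
    for top in range(0, height, step):
        for left in range(0, width, step):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            # Record the tile's offset alongside its result.
            results.append(((top, left),
                            process_tile(image[:, :, top:bottom, left:right])))
    return results

# Usage sketch: run a cheap per-tile computation over a panorama-sized image.
panorama = torch.randn(1, 3, 2048, 8192)
out = tiled_apply(panorama, lambda t: t.mean())
print(len(out))
```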

Another challenge was color space matching. Apple has a broad set of color space APIs, but they did not want to burden developers with the task of color matching. The Vision framework handles color matching, lowering the threshold for successful adoption of computer vision into any app.

Vision also optimizes by efficient handling and reuse of intermediates. Face detection, face landmark detection, and a few other computer vision tasks work from the same scaled intermediate image. By abstracting the interface to the algorithms and finding a place of ownership for the image or buffer to be processed, Vision can create and cache intermediate images to improve performance for multiple computer vision tasks without the need for the developer to do any work.
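
Conceptually, this reuse of intermediates resembles a small cache keyed by the requested representation, so that several tasks asking for the same scaled image share one copy. The class and keying scheme below are illustrative only, not Vision's actual design.

```python
import torch
import torch.nn.functional as F

class IntermediateCache:
    """Toy cache of scaled intermediate images shared across tasks."""
    def __init__(self):
        self._cache = {}

    def scaled(self, image, size):
        key = (id(image), size)                 # naive key, for illustration only
        if key not in self._cache:
            self._cache[key] = F.interpolate(image, size=size, mode="bilinear",
                                             align_corners=False)
        return self._cache[key]

cache = IntermediateCache()
image = torch.randn(1, 3, 3024, 4032)
a = cache.scaled(image, (512, 683))   # computed once...
b = cache.scaled(image, (512, 683))   # ...reused by a second task
print(a is b)  # True
```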

The flip side was also true: from the central interface perspective, they could drive algorithm development in directions that allow for better reuse or sharing of intermediates. Vision hosts several different, independent computer vision algorithms. For the various algorithms to work well together, implementations use input resolutions and color spaces that are shared across as many algorithms as possible.

Optimizing for On-device Performance:

The joy of ease of use would quickly dissipate if the face detection API could not be used both in real-time apps and in background system processes. Users want face detection to run smoothly when processing their photo libraries for face recognition, or when analyzing a picture immediately after a shot. They don't want the battery to drain or the performance of the system to slow to a crawl. Apple's mobile devices are multitasking devices, so background computer vision processing shouldn't significantly impact the rest of the system's features.

They implemented several strategies to minimize memory footprint and GPU usage. To reduce memory footprint, they allocate the intermediate layers of the neural networks by analyzing the compute graph, which allows multiple layers to be aliased to the same buffer. While being fully deterministic, this technique reduces memory footprint without impacting performance or causing allocation fragmentation, and can be used on either the CPU or GPU.
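
The idea behind liveness-based buffer aliasing can be shown with a toy allocator (not Apple's): each intermediate tensor is live from the layer that produces it to the last layer that reads it, and tensors whose live ranges never overlap can share one buffer.

```python
def assign_buffers(live_ranges):
    """live_ranges: list of (start_layer, end_layer) per intermediate tensor,
    ordered by start. Returns the buffer index assigned to each tensor."""
    buffers = []                              # each entry: last layer that buffer is busy until
    assignment = []
    for start, end in live_ranges:
        for buf_id, busy_until in enumerate(buffers):
            if start > busy_until:            # buffer is free again: alias onto it
                buffers[buf_id] = end
                assignment.append(buf_id)
                break
        else:                                 # no free buffer: allocate a new one
            buffers.append(end)
            assignment.append(len(buffers) - 1)
    return assignment

# Five intermediate tensors, but only two buffers are ever needed.
print(assign_buffers([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]))  # [0, 1, 0, 1, 0]
```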

For Vision, the detector runs five networks, one for each image pyramid scale. These five networks share the same weights and parameters but have different shapes for their input, output, and intermediate layers. To reduce the footprint even further, they run the liveness-based memory optimization algorithm on the joint graph composed of those five networks, significantly reducing the footprint. Also, the multiple networks reuse the same weight and parameter buffers, further reducing memory needs.

To achieve better performance, they exploit the fully convolutional nature of the network: all the scales are dynamically resized to match the resolution of the input image. Compared to fitting the image into square network retinas (padded by void bands), fitting the network to the size of the image drastically reduces the total number of operations. Because the topology of the operation is not changed by the reshape, and because of the high performance of the rest of the allocator, dynamic reshaping does not introduce performance overhead related to allocation.
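
A back-of-the-envelope comparison, with made-up frame sizes, shows why fitting the network to the image helps: the work of a fully convolutional network grows roughly with the number of input pixels, so padding a wide frame into a square retina means paying for the void bands too.

```python
def relative_cost(height, width, square_side):
    """Approximate cost of a fitted input relative to a padded square retina,
    assuming work is roughly proportional to the number of pixels processed."""
    fitted = height * width
    padded = square_side * square_side
    return fitted / padded

# A 1280x720 frame padded into a 1280x1280 square wastes roughly 44% of the work.
print(relative_cost(720, 1280, 1280))  # 0.5625
```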

To ensure UI responsiveness and fluidity while deep neural networks run in the background, they split the GPU work items for each layer of the network until each individual item takes less than a millisecond. This allows the driver to switch contexts to higher-priority tasks, such as UI animations, in a timely manner, reducing and sometimes eliminating frame drops.

Combined, all these strategies ensure that users can enjoy local, low-latency, private deep learning inference without being aware that their phone is running neural networks at several hundred gigaflops per second.

In closing, face detection has become a major feature of the iPhone, and this move to on-device deep learning has shaped how such features will be built in future generations of devices.

Thanks, Regards and Stay Safe

Yugal Choubisa
