Computer vision technology

Pattern recognition and deep learning technologies are widely used in modern society to find and store static or moving images; collectively they are called computer vision or machine vision. These technologies support image processing, analysis and search, and make it possible to capture a static image or video in real time and interpret its content. They are under active development. According to research by The Insight Partners, the computer and machine vision systems market will reach revenues of 14.48 billion by 2025.

Image recognition is a term for computer technologies that can recognise specific people, animals, objects or other targets using machine learning algorithms and concepts. The term “image recognition” is related to “computer vision”, an umbrella term for the process of teaching computers to ‘see’ like humans, and to “image processing”, a general term for computers performing intensive work on image data.

Image recognition is carried out in various ways, but many of the best methods use convolutional neural networks, which filter images through a series of artificial neuron layers. Convolutional neural networks were created specifically for image recognition and similar image processing tasks. Through a combination of techniques such as max pooling, stride configuration and padding, convolutional filters work on images to help machine learning programs detect objects more reliably.
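The interplay of padding, stride and max pooling can be shown with a minimal sketch in plain Python. The function names (pad2d, conv2d, max_pool2d) and the toy 4x4 image are illustrative assumptions and do not come from any particular library; real frameworks implement the same ideas far more efficiently.

```python
def pad2d(image, pad):
    """Surround a 2D list with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0.0] * width for _ in range(pad)]
    for row in image:
        padded.append([0.0] * pad + list(row) + [0.0] * pad)
    padded.extend([0.0] * width for _ in range(pad))
    return padded


def conv2d(image, kernel, stride=1, pad=0):
    """Slide a square kernel over the image (cross-correlation, as in most CNN libraries)."""
    img = pad2d(image, pad)
    k = len(kernel)
    out_h = (len(img) - k) // stride + 1
    out_w = (len(img[0]) - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(k):
                for b in range(k):
                    total += img[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


def max_pool2d(fmap, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 4x4 "image" holding the values 0..15, and an identity kernel (illustrative only).
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

same_size = conv2d(image, identity, stride=1, pad=1)  # padding keeps the 4x4 size
strided = conv2d(image, identity, stride=2, pad=1)    # stride 2 halves each dimension
pooled = max_pool2d(same_size)                        # 2x2 max pooling

print(strided)  # [[0.0, 2.0], [8.0, 10.0]]
print(pooled)   # [[5.0, 7.0], [13.0, 15.0]]
```

With a padding of one pixel the 3x3 filter produces an output of the same size as the input, a stride of two skips every other position, and pooling summarises each 2x2 neighbourhood by its strongest response.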

Let us consider the concepts of “computer vision” and “machine vision” in more detail. Machine vision and computer vision are similar concepts, but there are certain differences between them.

Computer vision is the technology that allows machines to ‘see’ the outside world, analyse visual data and then make decisions based on it or gain awareness of the environment and current situation. One of the driving factors of computer vision is the amount of data generated today, which is used to train and improve computer vision systems. Computer vision makes it possible not only to find and recognise objects, but also to track and classify them. It is used to identify objects, perform video analytics, generate descriptions of image content, recognise gestures and process the results.

Computer vision enables computers and systems to obtain meaningful information from digital images, videos and other visual data, and to perform actions or make recommendations on the basis of that information. If artificial intelligence enables computers to think, computer vision enables them to see, observe and understand.

Computer vision works much the same way as human vision, except that humans have a head start: human sight has the advantage of a lifetime of context in which to learn how to tell objects apart, how far away they are, whether they are moving and whether there is something wrong in an image.

Computer vision teaches machines to perform these functions, but it has to do so in much less time, using cameras, data and algorithms rather than retinas, optic nerves and a visual cortex. Because a system trained to inspect objects or observe their motion can analyse thousands of objects or processes per minute and notice defects or problems imperceptible to the human eye, it can quickly exceed human capabilities.

Machine vision is the segment of computer vision used in manufacturing. Computer vision is based on a set of general technologies, while machine vision applies image analysis to industrial problems.

History of computer vision technology development

Research in the field of computer vision began in the 1950s, when some of the first neural networks were used to detect the edges of objects and to sort simple objects into categories such as circles and squares. In the 1970s, the first commercial use of computer vision interpreted printed or handwritten text by means of optical character recognition. This technology was used to interpret written text for the blind.

With the development of the Internet in the 1990s, vast image collections became available online for analysis, which stimulated the development of facial recognition programs. These growing datasets helped machines to identify specific people in photos and videos.

These days several factors contribute to the development of computer vision technology:

  • Mobile technologies with built-in cameras fill the Internet with photos and videos.
  • The computational power of data processing tools has increased, while the tools themselves have become more affordable.
  • Tools and technologies designed for processing static and moving images are constantly being improved.
  • Algorithms, particularly convolutional neural networks, take advantage of these hardware and software capabilities.

The effect of these developments on computer vision has been impressive. Accuracy rates for object identification and classification have risen from 50 to 99 percent in less than a decade, and current systems are more accurate than humans at quickly detecting and reacting to visual input.

The development of computer vision technology has accelerated since 2012, when the convolutional neural network AlexNet spurred a surge of interest in machine learning, and in computer vision algorithms in particular. AlexNet won the ImageNet LSVRC-2012 image recognition competition by a significant margin, with an error rate of 15.3%. The runner-up finished with an error rate of 26.2%. ImageNet itself is a massive annotated image database that is created and maintained for developing and testing pattern recognition and machine vision methods.

Scientists the world over have spent years trying to find ways of extracting data from visual resources. In 1959, the Harvard neurophysiologists David Hubel and Torsten Wiesel published one of the most influential articles in the field, “Receptive fields of single neurones in the cat's striate cortex”, in which they described the core response properties of visual cortical neurons and how a cat’s visual experience shapes its cortical architecture. The researchers conducted several rather elaborate experiments. They placed electrodes into the primary visual cortex of an anaesthetised cat and observed, or at least attempted to observe, the neuronal activity in that area while showing the animal various images. Their first efforts were in vain: they could not make the nerve cells respond. However, a few months into the research, they noticed that one neuron fired as they were slipping a new slide into the projector. After some initial confusion, Hubel and Wiesel realised that what excited the neuron was the movement of the line created by the shadow of the sharp edge of the glass slide. Through further experiments, the researchers established that the primary visual cortex contains both simple and complex neurons, and that visual processing always starts with simple structures, such as oriented edges. This idea is a core principle of deep learning.

Another significant milestone in the history of computer vision was the invention of the first digital image scanner. In 1957, Russell Kirsch and his colleagues developed an apparatus that could transform images into grids of numbers, the binary language that machines could understand. Thanks to their work, digital images can nowadays be processed in diverse ways.

One of the first digitally scanned photographs was an image of Kirsch’s infant son. It was just a grainy photo of five by five centimetres captured as 30,976 pixels (a 176×176 array), but it has become so famous that the original image is now preserved at the Portland Art Museum.

In 1963, Lawrence Roberts published “Machine Perception Of Three-Dimensional Solids”, which is widely regarded as a precursor of modern computer vision. In it, Roberts described the process of deriving three-dimensional information about solid objects from 2D photographs. He proposed reducing the visual world to simple geometric shapes.

The primary purpose of the program that Roberts developed and described was to convert 2D photographs into line drawings, build up 3D representations from those lines and, finally, display the 3D structures of objects with all hidden lines removed. Roberts noted that the processes of 2D to 3D construction, followed by 3D to 2D display, were an essential starting point for future research into computer-aided 3D systems. It is noteworthy that Roberts did not stay in computer vision for long. He soon joined DARPA and is nowadays known as one of the inventors of the Internet.

In the 1960s, computer vision became part of academic programmes, and some researchers, who were extremely optimistic about the future of the field, believed that it would take no more than 25 years to create a powerful computer. Seymour Papert, a professor at the artificial intelligence laboratory of the Massachusetts Institute of Technology, went further: he launched the Summer Vision Project, intending to solve the problem of machine vision in a few months. He believed that a small group of MIT students could develop a significant part of a vision system in one summer. The students, coordinated by Papert himself and Gerald Sussman, were supposed to develop a platform that could automatically perform background/foreground segmentation and extract non-overlapping objects from real-world photos. The project did not succeed. Nevertheless, according to many scientists, it marked the official birth of computer vision as a scientific field.

In 1974, optical character recognition technology was introduced, which could recognise the text printed in any font or typeface. Likewise, intelligent character recognition could decipher handwritten text with the help of neural networks. From then on, these technologies have found their way into document and invoice processing, vehicle licence plate recognition, mobile payments, machine translation, and other common applications.

In 1982, David Marr, a British neuroscientist, published another landmark work: “Vision: A Computational Investigation into the Human Representation and Processing of Visual Information”. Building on Hubel and Wiesel's idea that visual processing does not begin with holistic objects, Marr established that vision is hierarchical. The main function of the visual system, he argued, is to create 3D representations of the environment so that we can interact with it.

He introduced a framework for vision, where low-level algorithms that detect edges, curves, corners, etc., are used as stepping stones towards a high-level understanding of visual data.

David Marr’s representational framework for vision includes:

  • Primal sketch of an image, where edges, bars, boundaries etc., are represented; 
  • 2.5D sketch representation, where surfaces and information about depth and discontinuities in an image are pieced together;
  • 3D model that is hierarchically organised in terms of surface and volumetric primitives.
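The lowest level of such a framework, finding edges, can be illustrated with a minimal sketch. The forward-difference gradient and the threshold value below are assumptions chosen for brevity; Marr's framework itself does not prescribe this particular operator, and the function name edge_map is hypothetical.

```python
def edge_map(image, threshold=0.5):
    """Mark pixels where brightness changes sharply towards the right or downwards."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Forward differences; pixels on the last row/column have no neighbour.
            dx = image[y][x + 1] - image[y][x] if x + 1 < width else 0.0
            dy = image[y + 1][x] - image[y][x] if y + 1 < height else 0.0
            if (dx * dx + dy * dy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


# Toy image: dark left half, bright right half (illustrative data).
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
edges = edge_map(image)

for row in edges:
    print(row)  # each row prints [0, 1, 0, 0]: the vertical boundary is found
```

Higher levels of the hierarchy would then combine such edge maps into surfaces and, eventually, object models.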

David Marr’s work was groundbreaking at the time, but it was very abstract. It contained no information about the kinds of mathematical modelling that could be used in an artificial visual system, nor did it address any kind of learning process.

Around the same time, a Japanese computer scientist, Kunihiko Fukushima, built a self-organising artificial network of simple and complex cells that could recognise patterns and was unaffected by position shifts. The network, the Neocognitron, included several convolutional layers whose (typically rectangular) receptive fields had weight vectors (known as filters).

These filters’ function was to slide across 2D arrays of input values (such as image pixels) and, after performing certain calculations, produce activation events (2D arrays) that were to be used as inputs for subsequent layers of the network.
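Why a sliding filter tolerates position shifts can be seen in a small sketch. It is a drastic simplification and not the Neocognitron itself: a single hand-written filter is slid over two toy images, and taking the global maximum of the activation map stands in for the network's pooling stages. All names and data are illustrative assumptions.

```python
def slide_filter(image, kernel):
    """Slide a square filter across a 2D array and return the activation map."""
    k = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(len(image[0]) - k + 1)
        ]
        for i in range(len(image) - k + 1)
    ]


def place_pattern(size, pattern, top, left):
    """Draw `pattern` on an empty size x size canvas at the given offset."""
    canvas = [[0] * size for _ in range(size)]
    for a, row in enumerate(pattern):
        for b, value in enumerate(row):
            canvas[top + a][left + b] = value
    return canvas


cross = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]

# The same cross drawn at two different positions of a 6x6 canvas.
map_a = slide_filter(place_pattern(6, cross, 0, 0), cross)
map_b = slide_filter(place_pattern(6, cross, 2, 3), cross)

peak_a = max(max(row) for row in map_a)
peak_b = max(max(row) for row in map_b)
print(peak_a, peak_b)  # 5 5: the strongest response is the same wherever the cross is
```

The activation maps differ, because the peak moves with the pattern, but the strength of the strongest response does not, which is the property later layers can rely on.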

Fukushima’s Neocognitron is arguably the first neural network to deserve the name deep; it is a prototype of today’s convnets, which are deep learning networks with a tripartite structure.

A few years later, in 1989, a young French scientist, Yann LeCun, applied a backpropagation-style learning algorithm to Fukushima’s convolutional neural network architecture. After working on the project for a few years, LeCun released LeNet-5, the first modern convnet, which introduced some of the essential ingredients still used today in CNNs (a specific type of artificial neural network that uses perceptrons, a machine learning algorithm, to analyse data). Like Fukushima, LeCun applied his invention to character recognition and even released a commercial product for reading zip codes. His work also resulted in the creation of the Modified National Institute of Standards and Technology (MNIST) dataset, a large database of handwritten digit samples derived from data collected by the US National Institute of Standards and Technology. It serves as a standard for calibrating and comparing machine-learning-based image recognition methods, in particular artificial neural networks. The database contains 60,000 images for training and 10,000 images for testing, and it is the best-known reference dataset in the field.

In 1997, a Berkeley professor, Jitendra Malik, and his student Jianbo Shi released a paper describing their attempts to tackle perceptual grouping. The researchers tried to get machines to carve images into sensible parts, that is, to determine automatically which pixels in an image belong together, and to distinguish objects from their surroundings using a graph theory algorithm. They did not get very far, however, and perceptual grouping remains an open problem.

In the late 1990s, computer vision, as a field, largely changed its focus. Around 1999, a lot of researchers stopped trying to reconstruct objects by creating 3D models of them (the method proposed by Marr) and directed their efforts towards feature-based object recognition instead. David Lowe’s work “Object Recognition from Local Scale-Invariant Features” was particularly interesting. The paper describes a visual recognition system that uses local features that are invariant to rotation, location, and, partially, changes in illumination. These features, according to Lowe, are somewhat similar to the properties of neurons found in the inferior temporal cortex that are involved in object detection processes.

In 2001, Paul Viola and Michael Jones introduced the first real-time face detection framework. Although the algorithm was not based on deep learning, it had a flavour of it: while processing images, it learned which features could help localise faces. The Viola-Jones face detector is still widely used. It is a strong binary classifier built out of several weak classifiers; during the learning phase, which is quite time-consuming, the cascade of weak classifiers is trained using AdaBoost. To find an object of interest, such as a face, the model partitions the input image into rectangular patches and submits them all to the cascade of weak detectors. If a patch makes it through every stage of the cascade, it is classified as positive; otherwise the algorithm rejects it immediately. This process is repeated many times at various scales. Five years after the paper was published, Fujifilm released a camera with a real-time face detection feature that relied on the Viola-Jones algorithm. As the field of computer vision kept advancing, the community felt an acute need for a benchmark image dataset and standard evaluation metrics for comparison.
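The early-rejection logic of such a cascade can be sketched as follows. This is only an illustration of the control flow: the real detector uses Haar-like features and thresholds learned with AdaBoost, whereas the two features, the thresholds and the toy image below are hand-picked, hypothetical stand-ins.

```python
def mean_brightness(patch):
    """Average pixel value of a rectangular patch."""
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels)


def top_bottom_contrast(patch):
    """How much brighter the lower part of the patch is than the upper part."""
    half = len(patch) // 2
    return mean_brightness(patch[half:]) - mean_brightness(patch[:half])


# Hypothetical cascade: (feature function, minimum value needed to pass the stage).
STAGES = [
    (mean_brightness, 0.2),      # cheap first test rejects mostly empty patches
    (top_bottom_contrast, 0.5),  # stricter test applied only to the survivors
]


def passes_cascade(patch, stages=STAGES):
    for feature, threshold in stages:
        if feature(patch) < threshold:
            return False  # rejected immediately, later stages are never evaluated
    return True


def detect(image, window, step=1):
    """Submit every window x window patch to the cascade; return top-left corners of hits."""
    hits = []
    for top in range(0, len(image) - window + 1, step):
        for left in range(0, len(image[0]) - window + 1, step):
            patch = [row[left:left + window] for row in image[top:top + window]]
            if passes_cascade(patch):
                hits.append((top, left))
    return hits


# Toy image with one "dark above bright" target whose top-left corner is (1, 1).
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.2, 0.0],
    [0.0, 0.9, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Repeat the scan at several window sizes, as the text describes.
for window in (2, 3):
    print(window, detect(image, window))
# prints: 2 [(1, 1)]  then  3 []
```

Because most patches fail the cheap first stage, the expensive later stages run on only a small fraction of the image, which is what makes real-time detection feasible.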

Nowadays artificial intelligence is used extensively in computer vision, helping computers to improve from their own experience through deep learning. This differs from the traditional approaches used in machine vision. For instance, artificial intelligence can analyse images that standard programs cannot recognise accurately and unambiguously, and then train those programs.

Current state of research in the field

In the modern world of digital transformation, countless images and videos come from the built-in cameras of mobile devices alone, in addition to data from thermal or infrared sensors and other sources. The amount of data is constantly increasing (more than 3 billion images are published on the Internet every day), while the computing power needed to analyse it is becoming more and more affordable. As the field of computer vision expands thanks to new hardware and algorithms, the accuracy of object identification increases as well. In less than a decade, systems have gone from 50% to 99% recognition accuracy, which makes them more accurate than humans at responding quickly to visual data.

In order to imitate human vision, computer programs must acquire, process, analyse and understand images. The tremendous growth of this field has been achieved through an iterative learning process made possible by advances in neural network research. It begins with extracting, from unstructured information, data that is helpful for learning a certain topic. Each image should be tagged with metadata that indicates the correct answer. As the neural network works through the data and signals that it has found the target image, the feedback it receives on whether it was correct is what improves its recognition. Neural networks use pattern recognition to distinguish between the parts of an image. Instead of a programmer defining the attributes manually, computer programs learn them from millions of uploaded images. A vast number of real applications for computer vision have been developed, even though the technology is still quite new. As humans and machines work together more and more, people will be freed to focus on more valuable tasks, because computer programs can automate processes that rely on image recognition.

Two major technologies are used for this: a type of machine learning called deep learning, and convolutional neural networks. Machine learning uses algorithmic models that enable a computer to learn the context of visual data on its own. If enough data is fed through the model, the computer will "look" at the data and learn to distinguish one image from another. The algorithms let the machine learn by itself, rather than someone programming it to recognise an image. A convolutional neural network helps a machine learning or deep learning model to "look" by breaking an image down into pixels that are given tags or labels. It uses the labels to perform convolutions (a mathematical operation on two functions that produces a third function) and makes predictions about what it "sees". The neural network runs convolutions and checks the accuracy of its predictions in a series of iterations until the predictions start to come true. At that point it recognises, or sees, images in a way similar to humans.
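The loop of predicting, checking against the label and correcting can be sketched with a single perceptron, which is far simpler than a convolutional network but follows the same feedback principle. The four flattened 2x2 "images", their labels, the learning rate and the function names are illustrative assumptions.

```python
def predict(weights, bias, pixels):
    """Return 1 if the weighted sum of the pixels is positive, else 0."""
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1 if score > 0 else 0


def train(samples, epochs=50, learning_rate=0.1):
    """Repeat predict-and-correct passes until a full pass makes no mistakes."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for pixels, label in samples:
            error = label - predict(weights, bias, pixels)  # feedback signal
            if error:
                mistakes += 1
                weights = [w + learning_rate * error * p
                           for w, p in zip(weights, pixels)]
                bias += learning_rate * error
        if mistakes == 0:  # the predictions have "started to come true"
            break
    return weights, bias


# Flattened 2x2 images in the order [top-left, top-right, bottom-left, bottom-right].
# Label 1 means the bright pixels are in the left column, 0 means the right column.
samples = [
    ([1, 0, 1, 0], 1),
    ([0, 1, 0, 1], 0),
    ([1, 0, 0, 0], 1),
    ([0, 0, 0, 1], 0),
]

weights, bias = train(samples)

# Images that were not part of the training data.
print(predict(weights, bias, [0, 0, 1, 0]))  # 1 (bright pixel on the left)
print(predict(weights, bias, [0, 1, 0, 0]))  # 0 (bright pixel on the right)
```

Nobody told the program which pixels matter; the weights that separate the two classes emerged from the labelled examples and the error feedback alone.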

Much like a human making out an image at a distance, a convolutional neural network first discerns hard edges and simple shapes, and then fills in information as it runs iterations of its predictions. Convolutional neural networks are used to understand individual images. A recurrent neural network is used in a similar way for video applications, to help computers understand how the images in a series of frames relate to one another.
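How a recurrent network relates frames to one another can be hinted at with a minimal sketch of a single recurrent unit. The weights are fixed, untrained, hypothetical values, and each frame is reduced to one number (for example, the horizontal position of an object scaled to the range 0 to 1); a real system would feed in learned per-frame features.

```python
import math


def run_recurrent(frames, w_input=1.0, w_state=0.5):
    """Fold a sequence of per-frame features into one hidden state."""
    state = 0.0
    for feature in frames:
        # The new state depends on the current frame AND on what came before it.
        state = math.tanh(w_input * feature + w_state * state)
    return state


# The same three frames shown in opposite orders (illustrative data).
moving_right = [0.1, 0.5, 0.9]
moving_left = [0.9, 0.5, 0.1]

print(run_recurrent(moving_right))
print(run_recurrent(moving_left))
```

The two sequences contain exactly the same frames, yet they end in different hidden states, because the state carries a memory of the order in which the frames arrived. A model that looks at each frame in isolation could not tell the two clips apart.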

Machine vision’s components and objectives

The machine vision system contains a set of components. It mainly consists of digital cameras that obtain images and smart cameras that process data. The system also includes a computer with a powerful processor, artificial intelligence software, input-output hardware, light sources, and synchronisation sensors. Among the major tasks solved by machine vision, the following are distinguished:

  • object and text recognition
  • reconstruction of an object's shape from an image
  • identification of individuals
  • motion estimation
  • object detection
  • reconstruction of what is happening in the image

The technology solves many tasks in various fields:

  • In medicine, it helps to make diagnoses
  • In manufacturing, it guides production robots
  • In the automotive industry, it is responsible for the navigation of autonomous vehicles
  • In other industries, it counts visitors and reads barcodes

Thus, machine vision enables the automation of manual work, which speeds up the execution of work processes. Computer vision is used in a variety of industries, from energy and utilities to manufacturing and automotive, and the market is constantly growing. Several examples of computer vision usage:

Autonomous vehicles

Computer vision is significant for self-driving vehicles. Manufacturers such as Tesla, BMW, Volvo, and Audi make use of several cameras, lidars, radars, and ultrasonic sensors to capture images of the environment in order for their self-driving cars to detect objects, lane markings, signs, and traffic lights for safe driving.

Google Translate application

This application provides real-time translation of a foreign-language text document into the desired language using computer vision. It also uses optical character recognition to "see" the image and augmented reality in order to provide an accurate translation.

Face recognition

China is at the forefront of facial recognition technology. It is used there for police work, payment portals and airport security checkpoints, as well as to prevent theft.

Health care

Since 90 percent of all medical data is based on images, computer vision has many uses in medicine. Medical facilities, professionals and patients all benefit from it: computer vision enables new diagnostic methods for analysing X-rays, mammograms and other patient monitoring scans in order to detect problems early and to assist in surgical intervention.

Real-time sports tracking

Ball and puck tracking has become commonplace in various sports, but computer vision also makes it possible to analyse the game, the chosen strategies, players' performance and ratings, and to track the visibility of sponsors' brands in sports broadcasts.


Agriculture

At CES 2019, John Deere unveiled a semi-autonomous combine harvester that uses artificial intelligence and computer vision to analyse grain quality during harvesting and to find the optimal route through the crops. Computer vision also has great potential for detecting weeds, so that herbicides can be sprayed directly on them rather than on crops. This is expected to reduce the amount of herbicide needed by 90 percent.


Manufacturing

Computer vision helps manufacturers to operate more safely, intelligently and efficiently in a variety of ways. Predictive maintenance is just one example: equipment is monitored with computer vision so that staff can intervene before a breakdown results in costly downtime. Computer vision is also used to monitor packaging and product quality and to reduce the number of defective products.

Formation of information resources

IBM has introduced a computer vision platform that addresses both technology development and information resources. IBM Maximo Visual Inspection includes tools that enable specialists to label, train and deploy deep learning computer vision models without programming or deep learning expertise. The models can be deployed in local data centres, in the cloud and on edge devices.

Computer vision is currently one of the most popular research fields. It is at the intersection of many academic subjects, such as computer science (graphics, algorithms, theories, systems, architecture), mathematics (information retrieval, machine learning), engineering (robotics, speech, NLP, image processing), physics (optics), biology (neurology) and psychology (cognitive science).

Given that computer vision is a relative understanding of visual environments and their contexts, many scientists assert that the field is paving the way for general artificial intelligence due to its interdisciplinary nature.