Edited by: Feihu Zhang, Northwestern Polytechnical University, China
Reviewed by: Xiaosu Hu, University of Michigan, United States; Hong Zhang, Indiana University, Purdue University Indianapolis, United States
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
We consider the fundamental question of how a legacy “student” Artificial Intelligence (AI) system could learn from a legacy “teacher” AI system or a human expert without re-training and, most importantly, without requiring significant computational resources. Here “learning” is understood broadly as the ability of one system to mimic the responses of the other to incoming stimulation, and vice versa. We call such learning Artificial Intelligence knowledge transfer. We show that if the internal variables of the “student” Artificial Intelligence system have the structure of an
The explosive development of neuroinformatics and Artificial Intelligence (AI) in recent years gives rise to new fundamental scientific and societal challenges. Developing technologies, professions, vocations, and corresponding educational environments for the sustained generation of an ever-growing number of AI systems is currently recognized as among the most crucial of these (Hall and Pesenti,
Knowledge transfer between Artificial Intelligence systems has been the subject of extensive discussion in the literature for more than two decades (Gilev et al.,
In this contribution we provide a new framework for an automated, fast, and non-destructive process of knowledge spreading across AI systems of varying architectures. In this framework, knowledge transfer is accomplished by means of Knowledge Transfer Units comprising mere linear functionals and/or small cascades thereof. The main mathematical ideas are rooted in measure concentration (Gibbs,
The paper is organized as follows. In section 2 we introduce a general framework for computationally efficient non-iterative AI Knowledge Transfer and present two algorithms for transferring knowledge between a pair of AI systems in which one operates as a teacher and the other functions as a student. These results are based on Stochastic Separation Theorems (Gorban and Tyukin,
Consider two AI systems, a student AI, denoted as AI_{s}, and a teacher AI, denoted as AI_{t}. These legacy AI systems process some
Over a period of activity, the system AI_{s} generates a set
and
A diagram schematically representing the process is shown in Figure
AI_{s} does not make such errors
existing competencies of AI_{s} on the set of inputs corresponding to internal states
knowledge transfer from AI_{t} to AI_{s} is reversible in the sense that AI_{s} can “unlearn” new knowledge by modifying just a fraction of its parameters, if required.
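The reversibility property suggests a simple architectural reading: corrections live in detachable units outside the legacy system. A minimal sketch follows; the class names and the flip-the-label correction rule are our assumptions for illustration, not the paper's prescription.

```python
import numpy as np

class KnowledgeTransferUnit:
    """A single linear-functional corrector: "fires" when w.x - b > 0
    for the student's internal representation x."""
    def __init__(self, w, b):
        self.w = np.asarray(w, dtype=float)
        self.b = float(b)

    def fires(self, x):
        return float(np.dot(self.w, x)) - self.b > 0.0

class AugmentedStudent:
    """A legacy student AI plus a removable stack of transfer units.
    `student` is any callable mapping a feature vector to a boolean label
    (an assumption of this sketch). Units override known error cases
    without touching the legacy system's own parameters."""
    def __init__(self, student):
        self.student = student
        self.units = []

    def attach(self, unit):
        self.units.append(unit)

    def detach_all(self):
        # "Unlearning": removing the units restores legacy behavior exactly.
        self.units = []

    def predict(self, x):
        y = bool(self.student(x))
        if any(u.fires(x) for u in self.units):
            y = not y  # flip the decision on inputs flagged as known errors
        return y
```

A unit trained on a cluster of the student's false positives would flip those detections off; calling `detach_all()` reverts the system, which is the reversibility discussed above.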
AI Knowledge transfer diagram.
Before proceeding with the proposed solution to the above AI Knowledge Transfer problem, understanding basic yet fundamental properties of the sets
Let the set
be an i.i.d. sample from a distribution in ℝ^{n}. Pick another set
from the same distribution at random. What is the probability that there is a linear functional separating
Below we provide three
Consider the case when the underlying probability distribution is an equidistribution in the unit ball
The proof of the theorem is provided in the
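The separation phenomenon behind the theorem is easy to probe numerically. The following Monte Carlo sketch (sample sizes and dimensions are our own choices for illustration) estimates how often a fresh point x drawn from the equidistribution in the unit ball is separated from an i.i.d. sample by the explicit linear functional ℓ(z) = ⟨x, z⟩ with threshold ⟨x, x⟩:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ball(m, n):
    """Draw m points uniformly from the unit ball in R^n:
    uniform direction, radius distributed as U^(1/n)."""
    g = rng.standard_normal((m, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * (rng.random((m, 1)) ** (1.0 / n))

def separable_fraction(n, m=1000, trials=200):
    """Fraction of trials in which a fresh point x is separated from an
    m-point i.i.d. sample Y by the functional l(z) = <x, z> with
    threshold <x, x>, i.e., <x, y> < <x, x> for every y in Y."""
    hits = 0
    for _ in range(trials):
        Y = sample_ball(m, n)
        x = sample_ball(1, n)[0]
        hits += np.all(Y @ x < x @ x)
    return hits / trials

# Separation is unlikely in the plane but near-certain in high dimension.
lo, hi = separable_fraction(2), separable_fraction(200)
```

In low dimension (n = 2) separation of a single point from a thousand others frequently fails, whereas for n = 200 it is nearly certain, matching the qualitative message of the concentration-based estimates.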
Figure
Estimate (1) of
The proof of the theorem is provided in
Examples of estimates (3) for various parameter settings are shown in Figure
Estimate (2) of
Remark 1. Estimates (1), (3) for the probability
Remark 2. Note that Theorems 1 and 2 not only provide estimates from below of the probability that two random i.i.d. samples drawn from
Whilst having explicit separation functionals as well as thresholds is an obvious advantage from a practical viewpoint, the estimates associated with such functionals do not account for more flexible alternatives. In what follows we present a generalization of the above results that accounts for such a possibility and extends the applicability of the approach to samples from product distributions. The results are provided in Theorem 3.
The proof of the theorem is provided in
Having introduced Theorems 1–3, we are now ready to formulate our main results: algorithms for non-iterative AI Knowledge Transfer.
Our first algorithm, Algorithm 1, considers cases when
Single-functional AI Knowledge Transfer
Two-functional AI Knowledge Transfer
The algorithms comprise two general stages: a pre-processing stage and a knowledge transfer stage. The purpose of the pre-processing stage is to regularize and “sphere” the data. This operation brings the setup close to the one considered in the statements of Theorems 1 and 2. The knowledge transfer stage constructs Auxiliary Knowledge Transfer Units in a way that closely follows the argument presented in the proofs of Theorems 1 and 2. Indeed, if
Remark 3. Note that the regularization step in the pre-processing stage ensures that the matrix
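The regularize-and-“sphere” pre-processing admits a compact sketch. The ridge term `eps` and the component-truncation rule below are assumptions of this sketch; the ridge plays the role of the regularization in Remark 3, keeping the covariance matrix invertible.

```python
import numpy as np

def sphere(X, eps=1e-6, n_components=None):
    """Regularize and "sphere" (whiten) the data matrix X (rows = samples):
    center, rotate onto principal axes, and rescale to unit variance.
    The ridge term eps keeps the covariance invertible even for
    degenerate samples."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / len(X) + eps * np.eye(X.shape[1])  # regularized covariance
    w, V = np.linalg.eigh(C)                           # ascending eigenvalues
    order = np.argsort(w)[::-1]
    if n_components is not None:
        order = order[:n_components]                   # keep leading components
    W = V[:, order] / np.sqrt(w[order])                # whitening map
    return Xc @ W, mu, W

# Demo: anisotropic data becomes (approximately) isotropic after sphering.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Z, mu, W = sphere(X)
```

After this step the whitened vectors are approximately isotropic, as assumed by the separation theorems.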
Denoting
we obtain that
Remark 4. Clustering at Step 2.a can be achieved by classical
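The clustering step in Remark 4 (the truncated reference presumably names a classical method such as k-means) can be sketched with a minimal Lloyd-style routine. The deterministic farthest-point initialization here is our simplification; any standard clustering routine could be substituted.

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Minimal Lloyd-style k-means for partitioning error feature vectors
    into k clusters (one Knowledge Transfer Unit per cluster)."""
    # Farthest-point initialization: deterministic and adequate for a sketch.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute means.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Demo: two well-separated blobs of "error" vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(10.0, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```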
Remark 5. Auxiliary Knowledge Transfer Units in Step 2.b of Algorithm 1 are derived in accordance with the standard Fisher linear discriminant formalism. This, however, need not be the case, and other methods, e.g., support vector machines (Vapnik,
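Following Remark 5, each Auxiliary Knowledge Transfer Unit can be built by the Fisher linear discriminant recipe: a weight vector obtained from the (regularized) within-class scatter, plus a decision threshold. The ridge term and the midpoint threshold rule below are assumptions of this sketch.

```python
import numpy as np

def fisher_unit(X_err, X_ok, reg=1e-6):
    """Fisher linear discriminant separating a cluster of error vectors
    X_err from correctly handled vectors X_ok (rows = whitened states).
    Returns (w, theta); the unit "fires" on x when w.x > theta."""
    mu_e, mu_o = X_err.mean(axis=0), X_ok.mean(axis=0)
    # Within-class scatter, regularized to guarantee invertibility.
    Sw = np.cov(X_err.T, bias=True) + np.cov(X_ok.T, bias=True)
    Sw += reg * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, mu_e - mu_o)
    theta = 0.5 * (w @ mu_e + w @ mu_o)  # midpoint of projected class means
    return w, theta

# Demo on two synthetic clouds standing in for error / non-error states.
rng = np.random.default_rng(2)
X_err = rng.normal(0.0, 0.5, (100, 2)) + np.array([3.0, 0.0])
X_ok = rng.normal(0.0, 0.5, (100, 2))
w, theta = fisher_unit(X_err, X_ok)
```

In practice the threshold would be tuned on held-out data rather than fixed at the midpoint.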
Furthermore, instead of the sets
Depending on the configuration of the samples
In what follows we illustrate the approach, as well as the application of the proposed Knowledge Transfer algorithms, on a relevant problem: the design of a computer vision system for pedestrian detection in live video streams.
Let
In this section we illustrate the application of the proposed AI Knowledge Transfer technology and demonstrate that it can be successfully employed to compensate for the limited power of an edge-based device. In particular, we suggest that the edge-based system is “taught” by a state-of-the-art teacher in a non-iterative, near-real-time way. Since our building blocks are linear functionals, such learning does not incur significant computational overheads. At the same time, as we show later, the proposed AI Knowledge Transfer results in a major boost to the system's performance under the conditions of the experiment.
In our experiments, the teacher AI,
We note that for both
In order to make the experiment more realistic, we assumed that the internal states of both systems are inaccessible to direct observation. To generate sets
Knowledge transfer diagram between a teacher AI and a student AI augmented with HOG-based feature generator.
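For orientation, a heavily simplified HOG-style descriptor is sketched below. The deployed system presumably uses a full HOG pipeline (block normalization, multi-scale windows), which this toy version omits; parameter names are ours.

```python
import numpy as np

def hog_descriptor(img, cell=8, nbins=9):
    """Heavily simplified HOG-style descriptor: per-cell histograms of
    unsigned gradient orientations, weighted by gradient magnitude.
    img: 2-D grayscale array whose sides are divisible by `cell`."""
    gy, gx = np.gradient(img.astype(float))          # gradients along rows, cols
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * nbins).astype(int), nbins - 1)
    feats = []
    for i in range(0, img.shape[0], cell):
        for j in range(0, img.shape[1], cell):
            hist = np.zeros(nbins)
            b = bins[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            np.add.at(hist, b, m)                    # magnitude-weighted votes
            feats.append(hist / (np.linalg.norm(hist) + 1e-9))  # per-cell L2 norm
    return np.concatenate(feats)
```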
In this experiment we consider and address two types of errors: false positives (original Type I errors) and false negatives (original Type II errors). The error types were determined as follows. An error is deemed as
Our main focus was to replicate a deployment scenario in which
It is worthwhile to mention that output labels of the chosen teacher AI,
Examples of False positives generated by the teacher AI,
To test the approach we used NOTTINGHAM video (Burton,
For the purposes of training and testing Knowledge Transfer Units, the video has been passed through
Sets 1 and 2 constitute different
We note that labeling of false positives involved outputs of
Finally, to quantify performance of the proposed knowledge transfer approach, it is important to distinguish between definitions of error types (Type I and Type II) for the original system and error types characterizing
Definition of the error types in knowledge transfer experiments.
Yes | Yes | True positive |
Yes | No | False negative |
No | Yes | False positive |
No | No | True negative^{*} |
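The bookkeeping in the table reduces to a small helper; the function and argument names here are ours, but the semantics follow the standard confusion matrix used for the original system's error types.

```python
def outcome(pedestrian_present, detected):
    """Classify one detection event: ground truth vs. the system's report."""
    if pedestrian_present:
        return "true positive" if detected else "false negative"
    return "false positive" if detected else "true negative"
```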
Results of the application of Algorithms 1, 2 as well as the analysis of their performance on the testing sets are provided below.
We generated 10 different realizations of Sets 1 and 2, resulting in 10 different samples of the training and testing sets. The algorithms were applied to all these combinations. A single run of the pre-processing step, Step 1, took, on average, 23.5 s to complete on an Apple laptop with a 3.5 GHz A7 processor. After the pre-processing step, only 164 principal components were retained, significantly reducing the dimensionality of the feature vectors. In our experiments, pre-processing also included normalization of the whitened vectors so that their
Prior to running Step 2 of the algorithms, we checked whether the feature vectors corresponding to errors (false positives) in the training set are correlated. This allows an informed choice of the number-of-clusters parameter,
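A simple way to perform such a correlation check is to look at the mean pairwise cosine similarity of the error vectors: if it is high, a single cluster (hence a single transfer unit) may suffice. This statistic and its interpretation are an assumed heuristic for this sketch.

```python
import numpy as np

def mean_pairwise_cosine(E):
    """Mean off-diagonal cosine similarity of the rows of E (error feature
    vectors). Values near 1 suggest a single tight cluster; values near 0
    suggest several clusters (or no cluster structure at all)."""
    U = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    G = U @ U.T
    m = len(E)
    return (G.sum() - np.trace(G)) / (m * (m - 1))

# Demo: strongly correlated errors vs. unstructured ones.
rng = np.random.default_rng(3)
base = rng.standard_normal(100)
c_corr = mean_pairwise_cosine(base + 0.1 * rng.standard_normal((50, 100)))
c_rand = mean_pairwise_cosine(rng.standard_normal((50, 100)))
```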
Correlations within clusters after Step 2.
Performance of Algorithms 1, 2 on the Testing sets generated from NOTTINGHAM video is summarized in Figures
as functions of decision-making threshold in
Performance of the up-trained
Performance of the up-trained
As Figure
As for results shown in Figure
Our experiments showed how the approach can be used to filter Type I errors of the original system. The technology, however, could also be used to recover Type II errors (false negatives of the original system), should the data be available. Several strategies might be invoked to obtain these data. The first approach is to use background subtraction to detect a moving object and pass the object through both
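The background-subtraction strategy can be prototyped with plain frame differencing; a deployed system would use a statistical background model, but the sketch below conveys the idea (all names and the threshold are ours).

```python
import numpy as np

def moving_object_mask(prev_frame, frame, thresh=25):
    """Crude frame-differencing step standing in for background
    subtraction: marks pixels whose intensity changed by more than
    `thresh`."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    return diff > thresh

def bounding_box(mask):
    """Bounding box (top, left, bottom, right) of the changed region,
    or None if nothing moved; the box would then be passed to both the
    teacher and the student for labeling."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1
```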
Recovering Type II errors of the original
137 | 169 | 0.05 |
87 | 86 | 0.1 |
45 | 35 | 0.15 |
10 | 14 | 0.2 |
8 | 2 | 0.25 |
1 | 0 | 0.3 |
In this work we proposed a framework for instantaneous knowledge transfer between AI systems whose internal state used for decision-making can be described by elements of a high-dimensional vector space. The framework enables development of non-iterative algorithms for knowledge spreading between legacy AI systems with heterogeneous non-identical architectures and varying computing capabilities. Feasibility of the framework was illustrated with an example of knowledge transfer between two AI systems for automated pedestrian detection in video streams.
At the core of the proposed knowledge transfer framework are separation theorems (Theorems 1–3) establishing peculiar properties of large but finite random samples in high dimension. According to these results,
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.
KS was employed by ARM Holding, and IR was employed by Spectral Edge Ltd. At the time of preparing the manuscript IR was employed by ARM Holding. All data used in experiments have been provided by ARM Holding. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
According to (4), the probability that
Pick
Illustration to the proof of Theorem 1.
According to (4) and Figure
Suppose that event
and
Consider the vector
Equation (5) implies that
and, consequently,
Finally, consider
It is clear that if ||
and its complement in
Hence the probability that ℓ_{0}(
and
This is a lower bound for the probability that the
and evaluate the following inner products
According to assumption (2),
and, respectively,
Let
It is clear that ℓ_{0}(
where
Estimate (3) now follows. □
are vectors whose coordinates coincide with those of the quotient representation of
with probability
for all
separates