Effective training of deep convolutional neural networks for hyperspectral image classification through artificial labeling

Wojciech Masarczyk, Przemysław Głomb, Bartosz Grabowski, Mateusz Ostaszewski
Institute of Theoretical and Applied Informatics, Polish Academy of Sciences
Bałtycka 5, 44-100 Gliwice, Poland
Email: {wmasarczyk,przemg,bgrabowski,mostaszewski}@iitis.pl
Telephone: +48 32 2317319

Abstract

Hyperspectral imaging is a rich source of data, allowing for a multitude of effective applications. However, such imaging remains challenging because of the large data dimension and, typically, the small pool of available training examples. While deep learning approaches have been shown to be successful in providing effective classification solutions, especially for high-dimensional problems, they unfortunately work best when many labelled examples are available. To alleviate the latter requirement for a particular dataset, the transfer learning approach can be used: first the network is pre-trained on some dataset with a large amount of training labels available, then the actual dataset is used to fine-tune the network. This strategy is not straightforward to apply to hyperspectral images, as it is often the case that only one particular image of some type or characteristic is available. In this paper, we propose and investigate a simple and effective strategy of transfer learning that uses an unsupervised pre-training step without label information. This approach can be applied to many hyperspectral classification problems. The performed experiments show that it is very effective at improving the classification accuracy without being restricted to a particular image type or neural network architecture. The experiments were carried out on several deep neural network architectures and various sizes of labelled training sets.
The greatest improvement in overall accuracy on the Indian Pines and Pavia University datasets is over 21 and 13 percentage points, respectively. An additional advantage of the proposed approach is the unsupervised nature of the pre-training step, which can be done immediately after image acquisition, without the need for the potentially costly expert's time.

Keywords: hyperspectral image classification; deep learning; convolutional neural networks; transfer learning; unsupervised training sample selection

1 Introduction

Classification of hyperspectral images (HSI) has many potential applications, e.g. land cover segmentation [1], mineral identification [2], or anomaly detection [3]. The classification algorithms used include both general models, e.g. the SVM [4], and dedicated approaches, taking into account spectral properties or spatial class distribution [5]. Recently there have been attempts to use Deep Learning Neural Networks (DLNN) for HSI classification. The motivation is that such methods have gained attention after achieving state-of-the-art results in natural image processing tasks [6]. Their unique ability to process an image using a hierarchical composition of simple features learned during training makes them a powerful tool in areas where manipulation of high-dimensional data is needed. While DLNN can achieve very good accuracy scores, they have the drawback of requiring a large amount of training data for estimation of model parameters. Such data is not always available, as it is common to have a single HSI with just a handful of training labels available. To bridge the gap between this realistic scenario and DLNN requirements, we propose an approach that trains the DLNN in two stages, with the first (pre-training) stage using artificial labels.
In the remainder of this section, we discuss the relevant related work, introduce the motivation of our approach, and state the hypothesis that is the base of our method.

A number of DLNN architectures have been proposed, inspired by mathematical derivations and/or neuroscience studies. Convolutional Neural Networks (CNN) [7] are a special case of deep neural networks which were originally developed to process images, but are also used for other types of data such as audio. They combine traditional neural networks with a biologically inspired structure into a very effective learning algorithm. They scan multidimensional input piece by piece with a convolutional window, which is a set of neurons with common weights. The convolutional window processes local dependencies (features) in the input data. The output corresponding to one convolutional window is called a feature map, and it can be interpreted as a map of the activity of the given feature over the whole input. The CNN remains one of the most popular architectures for DLNN classification in use today.

Other approaches include generative architectures, e.g. the Restricted Boltzmann Machine (RBM) [8, 9], Autoencoders (AE) [10] or the Deep Belief Network (DBN) [11, 12]. Yet another popular architecture is the Recurrent Neural Network (RNN) which, through directed cycles between units, has the potential of representing the state of a processed sequence. RNNs are applicable e.g. to time series prediction or outlier detection. The most popular types of RNN are Long Short-Term Memory (LSTM) networks [13] and Gated Recurrent Units (GRUs) [14]. They improve the original RNN architecture by dealing with the exploding and vanishing gradient problem.

For classification of HSI data, the CNN is the most popular architecture chosen.
In [15] a simple CNN architecture is adapted to HSI classification; the lack of training labels is mitigated by adding geometric transformations of the available training data points. In [16] the authors use three kinds of convolutional windows: two of them are 3D convolutions, which analyse spatial and spectral dependencies in the input picture, while the third is a 1D kernel. The feature maps from these three types of convolutions are then stacked one after the other to create the joint output of this first part of the network. The following layers consist only of one-dimensional convolutional kernels and residual connections. The authors of [17] introduce a parallel stream of processing with an original approach for spatial enhancement of hyperspectral data. The authors of [18] design a deep network that reduces the effect of the Hughes phenomenon (curse of dimensionality) and use an additional unlabelled sample pool to improve performance. In [19] the authors propose an alternative architecture called RPNet, based on prefixed convolutional kernels; it combines shallow and deep features for classification. Another architecture (MugNet) is proposed in [20], with a focus on simplicity of processing for classification of hyperspectral data with few training samples and a reduced number of hyperparameters. Yet another architectural approach is used in [21], where a multi-branch fusion network is introduced, which merges multiple branches on an ordinary CNN. An additional L2 regularization step is introduced to improve the generalization ability with a limited number of training samples. The work [22] proposes a strategy based on the fusion of multiple convolutional layers. Two distinct networks, composed of similar modules but with different organization, are examined.

Other architectures are also used.
For example, in [23] the authors utilize the sequential nature of hyperspectral pixels and use variations of recurrent neural networks: Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) networks. Moreover, in [24] one-dimensional convolutional layers followed by LSTM units were used. Chen et al. [25] use artificial neural networks for feature extraction. They utilize stacked autoencoders (SAE) for feature extraction from pixels, and PCA for reduction of the spectral dimensionality of the training segments taken from the picture. Next, logistic regression is performed on this spectral (SAE) and spatial (PCA) extracted information. Another approach [26] uses SAE for an application study: detection of a rice-eating insect. RNN architectures are also employed, as they are suitable for processing the spectral vector data. The work [27] applies sequential spectral processing of hyperspectral data, using an RNN supported by a guided filter. In [28] the authors use multi-scale hierarchical recurrent neural networks (MHRNNs) to learn the spatial dependency of non-adjacent image patches in the two-dimensional (2D) spatial domain. Another idea for analysing HSI is the spatial-spectral method, in which the network takes information not only from the spectral bands but also from the spatial dependencies of the image [16].

A significant problem in practical hyperspectral classification is the small number of training samples. It is related to the difficulty of obtaining verified labels [1], as often each pixel must be individually evaluated before labelling. Therefore, a reference hyperspectral classification experiment may assume as few as 1% of available samples per class [2]. A number of approaches have been exploited to deal with this difficulty, e.g.
combining spatial and spectral features [29], additional training sample generation [30], extending the classification algorithm with segmentation [31], or employing Active Learning [32].

For DLNN classification, the lack of a high volume of training data is a serious complication, as these networks typically require a lot of data to achieve high efficiency. Optimal use of DLNN in HSI classification would require training them with just a few labelled samples. This may be achieved by searching for an architecture well tailored to the specific task [15]; however, such an approach requires a relatively big validation set to obtain meaningful results. Another approach is to expand the available training set. This may be achieved either by artificially augmenting the training set or by using a different dataset as a source for pre-training [33]. Another approach is to add a regularization step to improve the generalization ability with a limited number of training samples [21]. A simplification of the network architecture for classification with few training samples is employed in the MugNet network [20]. Finally, where possible, the transfer learning approach is used, e.g. [34].

Transfer learning [35] uses training samples from two domains which share common characteristics. A network is first pre-trained on the first domain, which has a plentiful supply of training samples but does not solve the problem at hand. Then, the training is updated with the second domain, which adapts the weights to the actual problem. Transfer learning is simple to apply in the case of convolutional neural networks (CNNs). In [36], the authors compared different versions of transfer learning for CNNs in the case of natural image classification. They studied its effects depending on the number of transferred layers, on whether they were fine-tuned or not, and on the differences between the considered datasets.
In [37], the authors used transfer learning on CNNs to recognize emotions from pictures of faces. Other uses include evaluating the level of poverty in a region given its remote sensing images [38] and computer-aided detection using CT scans [39].

There have been applications of transfer learning to general remote sensing (non-hyperspectral) images. In [40] deep learned features are transferred for effective target detection; negative bootstrapping is used for improving the convergence of the detector. A similar approach is applied in [41], where an RNN trained on multispectral city images is used to derive features for studying urban dynamics across seasonal, spatial and annual variance. The authors of [42] study the performance of transfer learning in two remote sensing scene classifications. The results show that features generalize well to high-resolution remote sensing images. As the work [43] shows, transfer learning can also be applied in remote sensing using RNN architectures.

Recently, transfer learning has also been applied to HSI data. In [34], the authors applied it to CNNs originally used for classifying well-known remote sensing hyperspectral images, in order to classify images acquired from field-based platforms and regarding a different domain. The authors of [44] use an intermediate step of supervised similarity learning for anomaly detection in an unlabelled hyperspectral image. A different approach to transfer learning is proposed in [45], which explores the high-level feature correlation of two HSI. A new training principle simultaneously processes both images to estimate a common feature space for them. Yet another approach is proposed in [46], where HSI superresolution is achieved with the support of a high-resolution natural image. This natural image is used as a training reference, which is later adapted to the HSI domain.
In [47], an iterative process combines training and updating the currently used training label set. Two specialized architectures (for spatial and spectral processing) are used. The training iteratively extends the current label set, starting from the initial expert's labels.

The above approaches do not apply to the arguably most popular practical scenario, where only a single HSI with a handful of labels is available. Moreover, getting the training labels often requires additional resources (e.g. expert consultation and/or a site visit). It is thus desirable to have unsupervised methods for realization of the pre-training step. The authors of [48] use outlier detection and segmentation to provide candidates for training of a target detector in HSI. This information is used to construct a subspace for target detection by transfer learning theory. This shows the potential of using an unsupervised approach, although limited to separation of target/anomaly points from the background. In the work of [33], a separate clustering step is used for generation of pseudo-labels, using a Dirichlet process mixture model. The network is trained on the pseudo-labels, then all but the last layers are extracted, and the final network is trained on the originally provided training labels. While this scheme is shown to be effective in the presented results, it relies on complex non-neural preprocessing and on tailoring the DLNN configuration to each dataset separately. Also, the effect of the size of label areas and the effects on different architectures are not investigated. We show that similar gains can be made with simpler preprocessing, independent of the DLNN architecture chosen. The authors of [49] propose to use sparse coding to estimate high-level features from unlabelled data from different sources. This approach does not require training data, but is tailored to the case where multiple inputs are available, preferably with diverse contents.
To close the gap between data-inefficient deep learning models and practical applications of HSI, we propose a method which takes advantage of the abundant unlabelled data points present in hyperspectral images. Precisely, we state a hypothesis: spatial similarity of unlabelled data points can be utilized to gain accuracy in hyperspectral classification. To corroborate our hypothesis, we construct a simple clustering method that assigns an artificial label to each pixel of the image based on its spatial location. This artificial dataset is used to pre-train a deep learning classifier. Next, the model is fine-tuned with the original dataset. Through a series of experiments we show the superiority of the proposed approach over the standard learning procedure. Our approach is motivated by two known phenomena: the cluster assumption [50] and the regularization effect of noise in classes [51, 52, 53]. We note that many remote sensing images share common properties, most notably the 'cluster assumption': pixels that are close to one another, or form a distinct cluster or group, frequently share the class label. Additionally, due to the simplistic form of our clustering method, we purposefully introduce noise in the labels used during the pre-training phase; however, as shown in [51], this label noise has little to no effect on the final accuracy, as long as the number of properly labelled examples scales proportionally, which is our case.

2 Materials & Methods

Our method is to be applied in the following case:

1. Classification of pixels from a remote sensing hyperspectral image;
2. Neural networks used as a classifier;
3. Few training labels available.

In such a situation, we propose to augment the training with a pre-training step that uses artificial labels, which are independent of the training labels. Inclusion of this pre-training step can be viewed as a modification of a transfer learning approach.
Conventional transfer learning in this case would use a related dataset (source domain) with an abundance of labels to pre-train, then the current dataset (target domain) to fine-tune. In our case, the source domain consists of every point in the hyperspectral image, while the target domain is composed of only the labelled samples.

In the remainder of this Section we discuss the spatial structure of hyperspectral images and the characteristics of neural networks that make this approach feasible, and the details of its application. We also describe the experiments used to test the proposed approach.

2.1 Spatial structure of hyperspectral images

It is well known that remote sensing hyperspectral images contain spatial structure that can be exploited to improve classification scores when only a few training samples are available [54, 31, 5, 55, 30]. A segmentation can be applied to propose candidate pixels for labelling with high confidence [54] or to identify connected components for label assignment [31]. Class training samples can be extended through moderated region growing [5] or spatial filtering combined with spatial-spectral Label Propagation [55]. Finally, disagreement between spatial and spectral classifiers can be used to propose new samples [30]. A qualitative investigation of this phenomenon shows that hyperspectral pixels close to one another, whether spatially or spectrally, are likely to have the same class label, thus fulfilling the 'cluster assumption' [50]. This effect often leads to a blob-like structure of a hyperspectral dataset, observed in many hyperspectral classification problems (e.g. land cover labelling in remote sensing, paint identification in heritage science, scene analysis in forensics). A single class with samples in different parts of an image can be made of a number of blobs, which differ from each other because of, e.g., non-uniformity in class structure (e.g.
the same class can contain differing crop types), spectral variations (e.g. the same crop in two areas can have differing properties due to sunlight exposure or soil type) or acquisition conditions (e.g. level of lighting, shadows).

2.2 Emergence of data-dependent filters in neural network training

During training, subsequent layers of a deep neural network form a representation of a local input data structure [56]. Given a data source, this representation, especially in the lower layers, can be remarkably similar across different datasets. For example, in the problem of natural image classification the learned kernels resemble a Gabor filter bank [6, 57], independent of the class set. This form of filter can be shown to arise independently when independent components [58] or an effective sparse code [59] for natural images is estimated. Another case where data-dependent filters emerge is the pretext task approach, e.g. [60, 61], where the network first learns to predict the input sequence without class labels, which are introduced at the fine-tuning stage to obtain the final classification model. Apparently deep neural networks are able, at least in part, to extract an efficient class-independent data representation. This phenomenon has not been studied for hyperspectral images; however, it can be argued that a similar class-independent but data-dependent representation is being learned in training for hyperspectral image classification.

2.3 Methods used for proposed artificial labelling approach

Our method for creating artificial labels for the pre-training step is a simple segmentation algorithm which assumes the local homogeneity of the samples' spectral characteristics. It works by dividing the considered image into k rectangles, where each of these rectangles has its own label. For an image of height h and width w, we divide its height into m roughly equal parts and its width into n roughly equal parts, so that k = m · n.
We then get k rectangles, each of height approximately h/m and width approximately w/n. Each of these rectangles defines a different artificial class with a different label. A schematic is presented in Figure 1.

The function of the artificial labels is for the network to learn class-independent blob patterns present in the data. This focuses the network training in the fine-tuning stage on the actual training labels, with the network 'oriented' towards the features of the current image. It can also be of advantage in situations when a class is composed of multiple blobs, and not all of them have samples in the training set. In that case a sufficiently correct labelling is unlikely to be obtained [62] with just the training samples, but the proposed grid structure forces the network to estimate features for the whole image. An additional advantage of this approach is to shift the potentially time-consuming pre-training from the expert labelling moment to the acquisition moment. In other words, network training does not need to be held back until the expert's labels are available, but can commence right after the image is recorded.

2.4 Selected network architectures

In our experiments three architectures were tested, based on [16, 15, 63]. All three share a common approach of exploiting the local homogeneity of hyperspectral images; however, each one has its unique strengths and weaknesses, making them an interesting testbed for the universality of the proposed method. The first architecture [16] features a relatively high number of convolutional layers, which might be helpful in a transfer learning application. The second architecture [15] is, to the best of the authors' knowledge, one of the best networks that are trained on a limited number of samples per class. However, due to its constrained capacity, it may not benefit as much from the pre-training phase.
The last of the considered convolutional neural networks [63] is conceptually the simplest of the three, which allows us to test our approach using a more conventional convolutional architecture.

2.5 Experiments

This subsection describes the experiments evaluating the proposed approach. We investigate the performance of the artificial label pre-training in the following four experiments:

1. Experiment 1 evaluates the accuracy improvement achieved by using the method.
2. Experiment 2 investigates the variability introduced by the size and shape of the patches used to define artificial classes, using one of the datasets from Experiment 1.
3. Experiment 3 is an additional investigation of the observed phenomenon that big patches (larger area, smaller total number of classes) perform worse than little ones (smaller area, larger total number of classes), done using a different dataset.
4. Experiment 4 is an examination of a claim about the emergence of data-dependent representations during neural network training using the proposed artificial labelling scheme.

In the following subsections, detailed descriptions of the conducted experiments are given, while in Section 3 we present the results of the experiments.

Figure 1: Overview of the unsupervised pre-training algorithm proposed in this work. First, the network is pre-trained on a grid-based scheme of artificially assigned labels. The network weights are then fine-tuned on a limited set of training samples selected from the true labels, consistent with the typical hyperspectral classification scenario. (Diagram elements: hyperspectral data; network architecture and hyperparameters; artificial labels; pre-training; pre-training result; training labels subset; fine-tuning; true labels; evaluation; final result.)
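The fine-tuning stage in Figure 1 relies on a limited subset of training samples drawn from the true labels. The following is a minimal sketch of such a draw, not the authors' implementation: the function name `sample_training_pixels`, the convention that label 0 marks unlabelled background, and the returned hold-out set are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_training_pixels(ground_truth, per_class, seed=0, background=0):
    """Pick `per_class` labelled pixel coordinates for every class.

    `ground_truth` is a 2-D list of integer class ids. Pixels equal to
    `background` are treated as unlabelled (an assumption of this sketch).
    Returns (train, rest): two dicts mapping class id -> list of (row, col).
    """
    # Group the coordinates of all labelled pixels by class.
    by_class = defaultdict(list)
    for r, row in enumerate(ground_truth):
        for c, label in enumerate(row):
            if label != background:
                by_class[label].append((r, c))

    rng = random.Random(seed)  # a different seed gives an independent draw
    train, rest = {}, {}
    for label, coords in sorted(by_class.items()):
        if len(coords) < per_class:
            raise ValueError(
                f"class {label} has only {len(coords)} labelled pixels"
            )
        shuffled = coords[:]
        rng.shuffle(shuffled)
        train[label] = shuffled[:per_class]   # used for fine-tuning
        rest[label] = shuffled[per_class:]    # hypothetical hold-out pixels
    return train, rest
```

Calling the function with a different `seed` for each repetition gives independent training subsets of the same size, which is one way to obtain the spread of results over repeated runs.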
2.5.1 Experiment 1

In this experiment, the proposed approach is evaluated using different hyperspectral images and neural network architectures to demonstrate its robustness. For the experiment, we have used two well-known hyperspectral datasets: Indian Pines and Pavia University.

The Indian Pines dataset was collected by the AVIRIS sensor over the Northwest Indiana area. The image consists of 145 × 145 pixels. Each pixel has 220 spectral bands in the wavelength range $0.4$–$2.5 \times 10^{-6}$ m. Channels affected by noise and/or water absorption were removed (i.e. [104–108], [150–163], 220), bringing the total image dimension to 200 bands. The reference ground truth contains 16 classes representing mostly different types of crops. To be consistent with the experiments performed in [16], we choose only 8 classes.

The Pavia University dataset was collected by the ROSIS sensor over the urban area of the University of Pavia in Italy. This image consists of 610 × 340 pixels. It has 115 spectral bands in the wavelength range from $0.43$ to $0.86 \times 10^{-6}$ m. The noisiest 12 bands were removed, and the remaining 103 were utilized in the experiments. The ground truth includes 9 classes, corresponding mostly to different building materials.

The two datasets were subjected to a feature transformation. For each dataset, the mean $m_b$ of each hyperspectral band $b$ was calculated and, for each pixel $x$ and band $b$, the corresponding mean was subtracted: $x(b) := x(b) - m_b$.

For this experiment, all three of the previously introduced neural network architectures were used. As discussed previously, the training was divided into pre-training and fine-tuning stages. In pre-training, the data was labelled by assigning an artificial class to each block within a grid of dimensions 5 × 5. No ground truth data was used at this stage.
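The band-wise centring and the grid-based artificial labelling described above can be sketched as follows. This is a pure-Python illustration under stated assumptions rather than the authors' code: the names `subtract_band_means` and `grid_labels`, the nested-list layout `cube[row][col][band]`, and the integer-division rule used to obtain "roughly equal" parts are all hypothetical choices.

```python
def subtract_band_means(cube):
    """Centre every band: x(b) := x(b) - m_b.

    `cube` is a nested list indexed as cube[row][col][band]
    (layout assumed for this sketch). Returns (centred_cube, band_means).
    """
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    n_pixels = height * width
    # m_b: mean of band b over all pixels of the image.
    means = [
        sum(cube[r][c][b] for r in range(height) for c in range(width)) / n_pixels
        for b in range(bands)
    ]
    centred = [
        [[cube[r][c][b] - means[b] for b in range(bands)] for c in range(width)]
        for r in range(height)
    ]
    return centred, means


def grid_labels(height, width, m, n):
    """Assign one artificial label per cell of an m x n grid.

    The height is split into m roughly equal parts and the width into n,
    giving k = m * n artificial classes numbered 0 .. k - 1 (row-major).
    """
    if not (1 <= m <= height and 1 <= n <= width):
        raise ValueError("grid must have between 1 and height/width parts")
    return [
        [(r * m // height) * n + (c * n // width) for c in range(width)]
        for r in range(height)
    ]
```

For a 145 × 145 image and a 5 × 5 grid, `grid_labels(145, 145, 5, 5)` yields 25 artificial classes, each covering a 29 × 29 block. No ground truth enters either function, so both steps can run as soon as the image is acquired.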
In the fine-tuning stage, a selected number of ground truth labels was used. The number of training samples from each class was set at n = 5, 15, 50. This allowed us to observe the performance both in typical hyperspectral scenarios (small number of samples per class) and in deep network scenarios (larger number of samples per class available). Because the classification accuracy depends on the training set used in fine-tuning, each experiment was repeated 15 times for error reporting. The performance is reported as Overall Accuracy (OA) after fine-tuning. Additionally, Average Accuracy (AA) and the κ coefficient were inspected, and improvements were verified with statistical tests.

2.5.2 Experiment 2

The second experiment investigates the variability introduced by the size and shape of the patches used in artificial labelling. For this experiment only the Indian Pines image introduced in Experiment 1 was used, as it is the more challenging of the two introduced datasets. As was the case with the previous experiment, the mean was subtracted. The network investigated is the architecture based on [16], chosen because it has the most potential to be affected by the transfer learning process.

In this experiment, first the grid size was investigated. The grid dimensions vary from 2 × 2, which equals 4 artificial classes, up to 72 × 72, which equals 5184 artificial classes. Furthermore, another way of creating artificial labels is considered: the image is divided into a given number of vertical stripes. A visualisation of different artificial labellings is presented in Figure 2. The investigated patches were created by dividing the horizontal and vertical sides of the image into w = 2, 3, 5, 7, 9, 15, 19, 25, 29, 36, 39, 48, 72 equal parts.
The vertical stripes were created by dividing the horizontal side of the image into s = 2, 5, 9, 16, 25, 36, 49, 81 equal parts (so in the case of s = 2, there are only 2 classes, located to the right and left of a single vertical line). The vertical stripes were included to observe whether the pixel distance affects the performance: for patches, all the pixels share a similar neighbourhood; for stripes, the top and bottom pixels have a notable spatial separation and, arguably, distant pixels should not be marked with the same class label without prior knowledge of the spatial class distribution. Note that in the case of patches made by dividing each side of the image into w = 29, 36, 39, 48, 72 equal parts, the size of a square patch is smaller than the size of the 5 × 5 processing window in the tested architecture. That means no sample fed to the network during the pre-training phase has a coherent class representation (i.e. a single class present in the window). This experiment was performed with 5 training samples per class and 50 experiment runs for each grid density and each number of stripes.

Figure 2: Scheme of creating artificial classes on the Indian Pines dataset. From left: (a) 3 × 3 grid with 9 artificial classes, (b) vertical stripes with 5 artificial classes. Artificial classes for the Pavia University dataset were created analogously.

2.5.3 Experiment 3

In this experiment, we test the hypothesis that a division into more numerous patches produces a better pre-training set than a division into less numerous ones. We investigate this using a specially designed hyperspectral test image. In this experiment, we use an image of paints from a museum's collection. This dataset [64] was collected by the SPECIM hyperspectral system in the Laboratory of Analysis and Nondestructive Investigation of Heritage Objects (LANBOZ) in the National Museum in Kraków. This image consists of 455 × 310 pixels.
Each pixel has 256 spectral bands in the wavelength range from 1000 to 2500 nm. The ground truth consists of manual annotations of different green pigments used in the mixture of paints for various painting regions. The image of oil paints on paper was used, selected from the four available, as it was considered one of the more challenging of the images.

The layout of classes present in this image was especially designed to verify hyperspectral classifiers. The different chemical compositions of the paints used introduce variations of class spectra, yet at the same time all paints are variations of the green pigment with a more or less greenish hue. The classification problem is thus difficult, but not exceedingly so. The regular grid layout, with different thicknesses of paint and fragments where one pigment overpaints another, introduces spatial diversity in the spectra. Since the image is artificially created, the ground truth can be precisely marked. The original purpose of the image was to evaluate identification of copper pigments, which are difficult to differentiate with other (non-hyperspectral) sensors. Here we take advantage of its regularity by complementing the original ground truth ($n_{GT} = 5$ classes) with a joined set ($n_{GT-2} = 2$ classes) and a split set ($n_{GT-10} = 10$ classes). Those two sets of modified ground truth allow us to compare the proposed grid scheme, as tested in Experiments 1 and 2, with ground-truth-based pre-training with more and fewer classes than the original set. We argue that the regular layout of this image is more suited for this experiment than e.g. the Indian Pines or Pavia University images; the use of an additional dataset also allows us to further verify the generalization potential of our approach.

In the case of this dataset, the mean was subtracted as in the case of the previous experiments.
Additionally, the standard deviation $\sigma_b$ of each hyperspectral band $b$ was calculated and all pixels were divided by the corresponding value, $x(b) := x(b) / \sigma_b$. In this experiment, as in the previous one, the neural network based on [16] was used. The training set contained 5 samples per class and there were 50 experiment runs for each examined case.

Figure 3: Scheme of creating artificial classes on the Pigment dataset. From left: (a) false-colour RGB image of the painting (bands 50, 27, 17), (b) original class labels (GT), (c) classes artificially joined into 2 sets (GT-2), (d) classes artificially split into 10 sets (GT-10). Dark rectangles denote background, excluded from the experiment.

The following cases were investigated:

1. The performance of DLNN with pre-training performed with 2 classes prepared by joining the ground truth classes (GT-2).
2. The performance of DLNN with pre-training performed with 10 classes prepared by splitting the ground truth classes (GT-10).
3. The performance of DLNN without pre-training (GT).
4. The performance of DLNN with pre-training with artificial patches of size 5 × 5, 20 × 20, 30 × 30.

2.5.4 Experiment 4

In this experiment, we examine the claims from subsection 2.2 about the emergence of data-dependent representations during neural network training using the proposed artificial labelling scheme with noisy labels. To this end, we visualised internal network parameters resulting from network training using the t-SNE algorithm [65]. In the experiment, we used the neural network architecture based on [16] and the Indian Pines dataset described in subsection 2.5.1. We trained the network on the dataset using the following scenarios:

1. The network was trained using 1600 labelled samples, with 200 samples per class.
This scenario represents a network trained with abundant information about the data: unrealistic, but convenient from the point of view of the network's requirements.
2. The network was trained using 40 labelled samples, with 5 samples per class. This scenario represents a network trained with very limited information about the data: a realistic, but difficult learning problem.
3. The network was trained using only the artificial labels created as explained in subsection 2.3. The network therefore did not 'see' the true labels and could build its internal representations only from the noisy labels provided for training.
4. The network was trained using the complete pre-training and fine-tuning scheme introduced in this section. That is, it was first pre-trained using artificial labels as in point 3, and then all layers except the last were fine-tuned using a training set analogous to the one from point 2. This scenario was introduced to help explain the impact of the fine-tuning step in our approach.

As a result, we obtain 4 trained neural networks. As a next step, using the validation dataset, we extract the activations of the next-to-last layers of the considered networks and apply the t-SNE algorithm, a method for visualising high-dimensional data, to learn whether the layers right before the classification layers learned useful data representations.

3 Results

This section presents the results of the experiments introduced in subsection 2.5.

3.1 Experiment 1

The results of the first experiment are presented in Table 1. Each column presents the result for one type of network, each row for a given dataset and number of training examples. Each table cell presents the results with and without pre-training, as Overall Accuracy in percent, including the standard deviation of the result. The results in Table 1 were computed from a batch of n = 15 independent runs for each case.
The specific value of n was chosen to provide a robust result, after a set of preliminary runs with different values of n. A Mann–Whitney U test was performed on the results to confirm the statistical significance of the improvement gained with the proposed method. As Overall Accuracy can be sensitive to class imbalances, Average Accuracy and the κ coefficient were computed for additional verification and were inspected for negative performance.

The presented results show that applying the proposed method leads to a definite and consistent improvement in accuracy across different images, numbers of ground truth labels used and network architectures. In all but one case the improvement is statistically significant, and in some cases it reaches about 20 percentage points. The most challenging scenario is the one with 5 training samples per class. Even the average overall accuracy achieved by the architecture originally examined on a small training set [15] does not exceed 67% on the Indian Pines dataset. After applying the proposed method, its performance improves to 72.8% OA. The largest improvement is seen for architecture [16]: on the IP dataset with only 5 training samples per class in the fine-tuning procedure, it improves from an average of 52.62% OA to 74.04% OA. This is to be expected, as this architecture has the most potential to benefit from additional training samples. Considering these improvements, the results of the experiment support the stated hypothesis and the validity of the proposed approach. A qualitative evaluation of selected realizations (corresponding to the median score) is presented in Figures 4 and 5.

3.2 Experiment 2

The results of the experiment are presented in Table 2. For each grid size or number of stripes, the overall accuracy and the standard deviation are given. These statistics are based on 50 experiment runs for each artificial labelling scheme.
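As an illustration of how the scores reported in Tables 1 and 2 are defined, the following minimal Python sketch computes Overall Accuracy, Average Accuracy and Cohen's κ from true and predicted labels, and summarises repeated runs as mean ± standard deviation. The function names are hypothetical and this is not the code used in the experiments; in particular, the use of the sample (n − 1) standard deviation is an assumption, as the text does not state which estimator was used.

```python
from collections import Counter
from statistics import mean, stdev


def classification_scores(y_true, y_pred):
    """Return (OA, AA, kappa) for two equal-length label sequences."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Number of correctly classified samples per class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Overall Accuracy: fraction of all samples classified correctly.
    oa = sum(hits.values()) / n
    # Average Accuracy: mean of per-class accuracies (classes present in y_true).
    aa = mean(hits[c] / true_counts[c] for c in classes if true_counts[c] > 0)
    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    pe = sum(true_counts[c] * pred_counts[c] for c in classes) / (n * n)
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    return oa, aa, kappa


def summarise_runs(values):
    """Mean and sample standard deviation over repeated runs (assumed estimator)."""
    return mean(values), stdev(values)
```

Because Average Accuracy weights every class equally, it can differ noticeably from Overall Accuracy when class sizes are unbalanced, which is why both are reported alongside κ.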
It can be seen that the score rises sharply until the number of artificial classes reaches approximately the number of original classes (at 5 × 5; note that the original IP ground truth leaves a sizeable portion of the background unmarked, which would most probably contribute some additional classes if marked). After that value, there is a declining trend. It can be noted that the scores are higher with smaller patches. It seems viable to conclude that when the original number of classes is unknown, it is better to overestimate than to underestimate it. In the latter case, it is possible that even a chance guess would provide satisfactory performance. The stripes do not form as good a training set as the rectangular grid, which confirms the initial supposition that artificial classes should be confined to local areas. Some improvement, however, is still seen, which supports our overall proposition that general artificial labelling can be used to improve DLNN performance without a precise estimation of the artificial class patch size.

Table 1: The results of the first experiment. Each row presents Overall Accuracy (OA), Average Accuracy (AA) and Cohen's kappa (κ) for a given scenario. IP denotes the Indian Pines dataset, PU the Pavia University dataset; further differentiation is by the number of samples per class in fine-tuning. Accuracies are given as averages with standard deviations, with and without pretraining, for the three investigated network architectures.

Scenario      Metric  Architecture [16]                Architecture [15]                Architecture [63]
                      no pretraining  pretraining      no pretraining  pretraining      no pretraining  pretraining
IP 5/class    OA      52.62 ± 4.4     74.04 ± 4.1 †    66.15 ± 4.5     72.80 ± 3.2 †    50.05 ± 5.1     63.52 ± 4.2 †
              AA      58.15 ± 2.9     78.83 ± 2.6 †    71.42 ± 3.9     78.60 ± 3.0 †    53.66 ± 3.0     65.86 ± 3.6 †
              κ       0.45 ± 0.04     0.69 ± 0.05 †    0.60 ± 0.05     0.68 ± 0.04 †    0.41 ± 0.05     0.57 ± 0.05 †
IP 15/class   OA      67.58 ± 3.2     87.04 ± 2.4 †    82.61 ± 2.8     87.04 ± 2.1 †    64.18 ± 2.8     75.30 ± 1.7 †
              AA      73.82 ± 2.7     90.41 ± 1.5 †    87.07 ± 1.8     90.97 ± 1.5 †    67.54 ± 2.5     78.40 ± 2.1 †
              κ       0.62 ± 0.03     0.85 ± 0.03 †    0.79 ± 0.03     0.85 ± 0.02 †    0.58 ± 0.03     0.71 ± 0.02 †
IP 50/class   OA      80.51 ± 4.8     93.66 ± 1.3 †    93.75 ± 1.2     94.65 ± 1.0      81.39 ± 1.1     87.06 ± 0.9 †
              AA      87.48 ± 2.6     95.81 ± 0.8 †    95.86 ± 0.7     96.62 ± 0.8 †    85.10 ± 0.9     90.38 ± 1.1 †
              κ       0.77 ± 0.05     0.92 ± 0.02 †    0.92 ± 0.01     0.94 ± 0.01      0.78 ± 0.01     0.85 ± 0.01 †
PU 5/class    OA      67.47 ± 6.5     80.08 ± 7.0 †    73.31 ± 4.1     80.33 ± 5.2 †    65.55 ± 3.8     74.34 ± 7.0 †
              AA      76.56 ± 3.1     87.66 ± 2.9 †    84.67 ± 2.6     88.86 ± 3.2 †    64.39 ± 2.4     76.92 ± 3.7 †
              κ       0.60 ± 0.07     0.75 ± 0.08 †    0.67 ± 0.05     0.76 ± 0.06 †    0.56 ± 0.04     0.68 ± 0.08 †
PU 15/class   OA      83.63 ± 2.7     91.87 ± 3.3 †    88.21 ± 2.9     91.96 ± 2.6 †    75.50 ± 2.4     89.33 ± 3.4 †
              AA      89.48 ± 1.1     94.65 ± 1.0 †    93.40 ± 1.1     95.01 ± 0.8 †    77.59 ± 1.2     89.95 ± 1.8 †
              κ       0.79 ± 0.03     0.90 ± 0.04 †    0.85 ± 0.04     0.90 ± 0.03 †    0.69 ± 0.03     0.86 ± 0.04 †
PU 50/class   OA      93.40 ± 1.4     97.86 ± 0.5 †    96.08 ± 0.9     96.84 ± 1.2 ‡    87.79 ± 1.7     96.55 ± 0.5 †
              AA      95.47 ± 0.7     98.13 ± 0.3 †    97.09 ± 0.5     97.90 ± 0.4 †    89.05 ± 0.9     96.37 ± 0.3 †
              κ       0.91 ± 0.02     0.97 ± 0.01 †    0.95 ± 0.01     0.96 ± 0.02 ‡    0.84 ± 0.02     0.95 ± 0.01 †

†, ‡ Statistically significant improvement, evaluated with the Mann–Whitney U test, with P < 0.01 (†) or P < 0.05 (‡).

3.3 Experiment 3

Table 3 presents the results of the third experiment. The overall accuracy was calculated based on n = 50 runs for each examined scenario.
Here, the original performance (GT) can be significantly improved by the grid-based artificial labelling (see the results for 5 × 5, 20 × 20, 30 × 30). In this case, however, the performance gain can be compared with that of a label set created from ground truth data (GT-2, GT-10). As can be expected, the ground truth data provides higher performance; however, the artificial labelling provides about half of that gain with no prior information needed. The ground truth experiments GT-2 and GT-10 also confirm the observation that splitting classes is a better option than joining them. This observation provides additional support for the conclusion that many small classes (dense grid) are preferable to a few large ones (sparse grid).

Table 2: The results of the second experiment. Grid density describes the number of rectangular patches that represent artificial labels for the pre-training phase. Num of stripes denotes the number of vertical stripes that represent artificial labels for the pre-training phase. Accuracies are given as Overall Accuracy (OA) for the network based on [16] trained with transfer learning on the Indian Pines dataset.

Grid density   OA             Num of stripes   OA
2x2            61.88 ± 4.5    2                58.53 ± 4.9
3x3            64.45 ± 4.1    5                67.68 ± 4.4
5x5            75.05 ± 4.3    9                68.58 ± 3.9
7x7            72.33 ± 4.4    16               69.25 ± 3.7
9x9            74.06 ± 3.6    25               69.09 ± 3.6
14x14          74.13 ± 3.7    36               69.23 ± 3.3
19x19          73.24 ± 3.9    49               70.08 ± 4.0
24x24          73.43 ± 4.0    81               68.41 ± 4.3
29x29          70.19 ± 4.6
36x36          69.88 ± 4.3
39x39          68.69 ± 4.4
48x48          69.25 ± 3.2
72x72          66.70 ± 3.3

Table 3: The results of the third experiment. Evaluation of pretraining on the Pigments dataset using the proposed approach and classes created from ground truth. The objective was to compare the performance of artificial labels of different sizes with those created through splitting or joining the ground truth.
Experiment setting   GT (a)   5x5 (b)   20x20 (b)   30x30 (b)   GT-2 (c)   GT-10 (c)
OA                   61.15%   68.35%    75.70%      73.99%      75.70%     85.40%

(a) No pretraining. (b) Pretraining with artificial classes (proposed method). (c) Pretraining with modified ground truth classes (verification).

3.4 Experiment 4

The results of the experiment are presented in Figure 6. As expected, the network trained using 1600 true-labelled samples generated good internal representations, as shown by the good separability of the classes. In contrast, the network trained using only 5 samples per class did not generate representations that allow samples of different classes to be separated. In scenario 3, the classes are clearly better separated than in scenario 2, though not as well as in scenario 1. Moreover, we did not observe any noticeable differences between scenarios 3 and 4. We argue that these results suggest that useful data-related representations emerge during network training with the proposed artificial labelling scheme even before the fine-tuning step.

4 Discussion

Our results confirm the validity of our proposition: a simple artificial labelling through grouping of the samples based on a local neighbourhood provides an efficient transfer learning scheme. It brings significant improvements in accuracy across datasets and DLNN configurations. The results for different datasets, which have distinctive ground truth layouts, suggest that the improvement is not due to a chance alignment of the grid with the regularity of a particular ground truth pattern. The local structure is also important, as shown by the advantage of grid division over stripe division.
The generally better performance of a higher over a lower number of artificial classes suggests an explanation: for transfer learning, it is less important to find the exact number of classes than to isolate and learn their components, perhaps because this yields a better internal feature representation.

We view the main advantage of the proposed method as enhancing the training of a neural network for hyperspectral remote sensing classification. The proposed pre-training offers a number of benefits:

1. It enhances the training of neural networks in the hyperspectral classification scenario. With the low number of training samples in typical scenarios (e.g. 5–15 per class, sometimes even fewer), the number of free network parameters can be several orders of magnitude higher than the amount of training data, which poses a risk of overtraining.
2. By splitting the training into two phases, it can shift some of the computational burden of network training to the time before an expert is called in to perform labelling, and so make more effective use of the expert's time.
3. The larger number of available training samples can be of use when different network architectures are compared for the same problem, or when searching the hyperparameter space.

An open question is whether a clustering algorithm like [33] or outlier segmentation [48] could be adapted here, leading to greater efficiency. It is probable that a more complex artificial labelling algorithm could outperform the proposed solution; however, even in that case, a simple, generally applicable heuristic that improves performance can be of value. Our approach shares its motivation with self-taught learning [49], where we want the classifier to derive a high-level input representation from unlabelled data; however, we use the same data for both training stages and instead change the label set.
It also avoids combining neural and non-neural approaches, and prevents introducing additional assumptions through the manual selection of the latter.

A qualitative examination of the pre-training results shows that some class structure is visible after pre-training (see examples in Figure 7). We found no identifiable features of this structure in the pre-training images that were associated with better or worse final (after fine-tuning) results. However, the general level of structure visible after pre-training relates to the final performance. The network architecture based on [63] is the best at learning the grid of artificial classes and also the worst at the final classification. The other two networks, based on [16, 15], have more complex pre-training results and correspondingly better final results. This suggests that the training scheme and/or network architecture functions as a form of regularization that prevents overtraining, and that the pre-training classification result can possibly be used to control pre-training and avoid overtraining too. The emergence of partial class structure in the pre-training phase, which does not use ground truth and hence can be viewed as unsupervised processing, also suggests that this approach can be adapted to solve unsupervised tasks, e.g. clustering or anomaly detection.

To provide additional verification, we analysed per-class classification scores for both datasets, using the data from experiment one and the same Mann–Whitney U test with P < 0.05. As could be expected, the performance gains are unequal, as classes differ in their overlap and general difficulty of classification. However, the individual classes showed improvement in most cases.
Across 198 tests¹, in 104 cases the improvement was statistically significant; of the remaining cases, in 39 an accuracy of 100% was achieved irrespective of pre-training and in 32 pre-training improved the mean class score. In the remaining cases, where the mean score with pre-training was lower than the reference, the average difference was below two percentage points. The proposed method can thus be viewed as 'not damaging' to individual class scores.

Additionally, a batch of experiments was performed to analyse sensitivity to small variations of the hyperparameter settings; the results were very similar to those presented. A separate experiment was conducted to analyse the time requirements of training the networks. Its results are presented in Figure 8. They show that it is more important to train the network during the pre-training stage than during the fine-tuning stage (the results clearly improve when moving vertically within the grid in Figure 8, as opposed to moving horizontally). For lower numbers of pre-training iterations (10k-50k), even a moderate increase leads to a definite improvement in classification accuracy. The results also suggest that it could be possible to reduce the training time in both stages without sacrificing the effectiveness of classification. Moreover, it can be presumed that choosing a different number of iterations for the pre-training and fine-tuning stages could lead to even better results than the ones presented in this work (for example, when training the network for 90k iterations in the pre-training stage and 20k iterations in the fine-tuning stage, it was possible to achieve an accuracy of 81.34%).

Analysing the results in Table 1, one can notice that pre-training improves accuracy in some networks more than in others.
We suspect that an important factor determining such differences is the capacity of the neural networks. Our argument is that an artificial neural network with a greater number of parameters is better able to process the information contained in the entire image, which we utilize in the pre-training phase. Therefore, architecture [15], with the smallest number of parameters, achieves a smaller accuracy gain than the other two networks. However, one must be aware that a number of other factors affect network performance. In particular, architectures [16, 15] were designed for the task of HSI classification, and architecture [15] in particular has been studied on small training sets and therefore has competitive accuracy even without pre-training. On the other hand, architecture [63] was designed for a slightly different training regimen, which may explain why it achieves worse results than the other two.

Our approach could be used in semi-automatic systems like [66], which use only a part of the annotation, and could be made fully unsupervised. Furthermore, we believe this is one approach to self-taught learning [49] that can be helpful in diverse applications of deep learning models. We note, however, that optimization would require further studies to address the question of which layers benefit most from this scheme, i.e. similar to [36]. Our experiments show that the proposed scheme is largely resistant to incorrect estimation of the number of classes, hence its parametrization can be considered low-cost. It can also be viewed as a confirmation of the traditional software development principle of 'divide and conquer', as well as of the even older proverb, 'divide et impera'.

¹ Ten classes for Indian Pines, 11 for Pavia University, each repeated across 9 pairs of network architecture and number of samples per class.
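The per-class significance testing described in this section can be illustrated with the following minimal Python sketch of a Mann–Whitney U test applied to two batches of accuracy scores (without and with pre-training). It is written under stated assumptions rather than taken from the analysis reported here: the function name is hypothetical, the test is one-sided, and the P value uses the normal approximation with tie correction; the text does not specify which variant of the test was used.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """One-sided Mann-Whitney U test that values in y tend to exceed values in x.

    Returns (U, p), where U is the statistic of sample y and p comes from the
    normal approximation with tie correction (an assumed variant of the test).
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering the group: 0 for x, 1 for y.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_y = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values pooled[i:j] and give it the mid-rank.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_y += midrank * sum(group for _, group in pooled[i:j])
        i = j

    u_y = rank_sum_y - n2 * (n2 + 1) / 2
    n = n1 + n2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # all observations identical: no evidence either way
        return u_y, 1.0
    z = (u_y - mu) / sqrt(var)
    p = 0.5 * erfc(z / sqrt(2))  # upper-tail probability P(Z >= z)
    return u_y, p
```

With batches of 15 or 50 runs per case, as used in the experiments, the normal approximation is usually considered adequate; for very small batches an exact test would be the safer choice.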
5 Conclusions

We have presented and verified a simple method of pre-training DLNNs for hyperspectral classification, based on the hypothesis that the spatial similarity of unlabelled data points can be utilized to gain accuracy in hyperspectral classification. In the first experiment, we showed that for all three neural network architectures tested, and for both reference datasets, the proposed procedure leads to an improvement in classification accuracy for a small number of training samples. In the second and third experiments, we analysed the properties of the proposed method; the obtained results suggest that the number and shape of the pixel blobs have an impact on the effectiveness of the method. Specifically, we conclude from the second experiment that it is safer to underestimate the size of a label cluster than to overestimate it, as this also reduces the chance of joining separate classes. This conclusion is in line with the results of the third experiment, from which we also conclude that it is better to split ground truth classes than to join them.

The absence of a requirement for training labels provides an important advantage: it shifts the need for expert participation and data labelling from the start of the data analysis process to its late stages. This allows the potentially long time between acquisition and the start of the data interpretation stage to be used for pre-training the network, and decreases the delay between the expert's labelling and obtaining the classification result. Considering the length of time required to train deep neural networks, this is a significant advantage for their applications. An additional benefit is that multiple unannotated images can be used in the pre-training stage, potentially increasing the robustness of the result.
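The two artificial labelling layouts compared in the experiments, the rectangular grid and the vertical stripes, can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the function names are hypothetical, and the use of integer floor division to handle image sides that are not divisible by the number of parts is an assumption.

```python
def grid_labels(height, width, w):
    """Label map for a w x w grid: each image side is split into w parts.

    Returns a height x width list of lists with labels 0 .. w*w - 1.
    Floor division assigns pixels to parts (an assumed convention).
    """
    return [
        [(r * w // height) * w + (c * w // width) for c in range(width)]
        for r in range(height)
    ]


def stripe_labels(height, width, s):
    """Label map for s vertical stripes, with labels 0 .. s - 1."""
    return [[c * s // width for c in range(width)] for _ in range(height)]


# Example: a 145 x 145 image with a 3 x 3 grid and with 5 stripes.
grid = grid_labels(145, 145, 3)
stripes = stripe_labels(145, 145, 5)
n_grid_classes = len({label for row in grid for label in row})        # 9
n_stripe_classes = len({label for row in stripes for label in row})   # 5
```

Each pixel's artificial label depends only on its position, so the label maps can be prepared before any ground truth is available, which is the property that allows pre-training to start ahead of expert labelling.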
Acknowledgements

This work has been partially supported by the projects: 'Representation of dynamic 3D scenes using the Atomic Shapes Network model' financed by the National Science Centre, decision DEC-2011/03/D/ST6/03753, and 'Application of transfer learning methods in the problem of hyperspectral images classification using convolutional neural networks' funded from the Polish budget funds for science in the years 2018-2022, as a scientific project under the "Diamond Grant" program, no. DI2017 013847. M.O. acknowledges support from Polish National Science Center scholarship 2018/28/T/ST6/00429. This research was supported in part by PLGrid Infrastructure.

The authors would like to thank the Laboratory of Analysis and Nondestructive Investigation of Heritage Objects (LANBOZ) in the National Museum in Kraków for providing the pigments dataset, in particular Janna Simone Mostert for her help in the preparation of the paintings and Agata Mendys for the acquisition of the dataset. The authors also thank Zbigniew Puchała for help in carrying out the statistical analysis of the results. Additionally, the authors thank Yu et al. [15] for sharing the code.

References

[1] J.M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N.M. Nasrabadi, and J. Chanussot. Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and Remote Sensing Magazine, 1(2):6–36, 2013.
[2] P. Ghamisi, J. Plaza, Y. Chen, J. Li, and A. J. Plaza. Advanced spectral classifiers for hyperspectral images: A review. IEEE Geoscience and Remote Sensing Magazine, 5(1):8–32, 2017.
[3] Chein-I Chang. Hyperspectral Imaging: Techniques for Spectral Detection and Classification. Springer, Boston, MA, 2003.
[4] F. Melgani and L. Bruzzone. Classification of hyperspectral remote sensing images with Support Vector Machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8):1778–1790, 2004.
[5] Michał Romaszewski, Przemysław Głomb, and Michał Cholewa. Semi-supervised hyperspectral classification from a small number of training samples using a co-training approach. ISPRS Journal of Photogrammetry and Remote Sensing, 121:60–76, 2016.
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[7] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.
[8] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
[9] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
[10] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[11] Geoffrey E Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009.
[12] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] Rahul Dey and Fathi M Salem. Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), pages 1597–1600. IEEE, 2017.
[15] Shiqi Yu, Sen Jia, and Chunyan Xu. Convolutional neural networks for hyperspectral image classification. Neurocomputing, 219:88–98, 2017.
[16] Hyungtae Lee and Heesung Kwon.
Going deeper with contextual CNN for hyperspectral image classification. IEEE Transactions on Image Processing, 26(10):4843–4855, 2017.
[17] Mengxin Han, Runmin Cong, Xinyu Li, Huazhu Fu, and Jianjun Lei. Joint spatial-spectral hyperspectral image classification based on convolutional neural network. Pattern Recognition Letters, 2018.
[18] Xichuan Zhou, Nian Liu, Fang Tang, Yingjun Zhao, Kai Qin, Lei Zhang, and Dong Li. A deep manifold learning approach for spatial-spectral classification with limited labeled training samples. Neurocomputing, 331:138–149, 2019.
[19] Yonghao Xu, Bo Du, Fan Zhang, and Liangpei Zhang. Hyperspectral image classification via a random patches network. ISPRS Journal of Photogrammetry and Remote Sensing, 142:344–357, 2018.
[20] Bin Pan, Zhenwei Shi, and Xia Xu. MugNet: Deep learning for hyperspectral image classification using limited samples. ISPRS Journal of Photogrammetry and Remote Sensing, 145:108–119, 2018. Deep Learning RS Data.
[21] Hongmin Gao, Yao Yang, Sheng Lei, Chenming Li, Hui Zhou, and Xiaoyu Qu. Multi-branch fusion network for hyperspectral image classification. Knowledge-Based Systems, 167:11–25, 2019.
[22] Guangzhe Zhao, Guangyun Liu, Leyuan Fang, Bing Tu, and Pedram Ghamisi. Multiple convolutional layers fusion framework for hyperspectral image classification. Neurocomputing, 2019.
[23] Lichao Mou, Pedram Ghamisi, and Xiao Xiang Zhu. Deep recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 2017.
[24] Hao Wu and Saurabh Prasad. Convolutional recurrent neural networks for hyperspectral data classification. Remote Sensing, 9(3):298, 2017.
[25] Yushi Chen, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu. Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6):2094–2107, 2014.
[26] Yangyang Fan, Chu Zhang, Ziyi Liu, Zhengjun Qiu, and Yong He. Cost-sensitive stacked sparse auto-encoder models to detect striped stem borer infestation on rice based on hyperspectral imaging. Knowledge-Based Systems, 168:49–58, 2019.
[27] Yanhui Guo, Siming Han, Han Cao, Yu Zhang, and Qian Wang. Guided filter based deep recurrent neural networks for hyperspectral image classification. Procedia Computer Science, 129:219–223, 2018. 2017 International Conference on Identification, Information and Knowledge in the Internet of Things.
[28] Cheng Shi and Chi-Man Pun. Multi-scale hierarchical recurrent neural networks for hyperspectral image classification. Neurocomputing, 294:82–93, 2018.
[29] Antonio Plaza, Jon Atli Benediktsson, Joseph W Boardman, Jason Brazile, Lorenzo Bruzzone, Gustavo Camps-Valls, Jocelyn Chanussot, Mathieu Fauvel, Paolo Gamba, Anthony Gualtieri, et al. Recent advances in techniques for hyperspectral image processing. Remote Sensing of Environment, 113:S110–S122, 2009.
[30] M. Cholewa, P. Głomb, and M. Romaszewski. A spatial-spectral disagreement-based sample selection with an application to hyperspectral data classification. IEEE Geoscience and Remote Sensing Letters, 16(3):467–471, 2019.
[31] Yuliya Tarabalka, Jocelyn Chanussot, and Jón Atli Benediktsson. Segmentation and classification of hyperspectral images using minimum spanning forest grown from automatically selected markers. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 40(5):1267–1279, 2010.
[32] Inmaculada Dópido, Jun Li, Antonio Plaza, and Paolo Gamba. Semi-supervised classification of urban hyperspectral data using spectral unmixing concepts. In Urban Remote Sensing Event (JURSE), 2013 Joint, pages 186–189. IEEE, 2013.
[33] H. Wu and S. Prasad. Semi-supervised deep learning using pseudo labels for hyperspectral image classification.
IEEE Transactions on Image Processing, 27(3):1259–1270, 2018.
[34] L. Windrim, A. Melkumyan, R. J. Murphy, A. Chlingaryan, and R. Ramakrishnan. Pretraining for hyperspectral convolutional neural network classification. IEEE Transactions on Geoscience and Remote Sensing, 56(5):2798–2810, 2018.
[35] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[36] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3320–3328, Cambridge, MA, USA, 2014. MIT Press.
[37] Hong-Wei Ng, Viet Dung Nguyen, Vassilios Vonikakis, and Stefan Winkler. Deep learning for emotion recognition on small datasets using transfer learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI '15, pages 443–449. ACM, 2015.
[38] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon. Transfer Learning from Deep Features for Remote Sensing and Poverty Mapping. ArXiv e-prints, 2015.
[39] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. ArXiv e-prints, 2016.
[40] Peicheng Zhou, Gong Cheng, Zhenbao Liu, Shuhui Bu, and Xintao Hu. Weakly supervised target detection in remote sensing images based on transferred deep features and negative bootstrapping. Multidimensional Systems and Signal Processing, 27(4):925–944, 2016.
[41] H. Lyu and H. Lu. A deep information based transfer learning method to detect annual urban dynamics of Beijing and New York from 1984–2016. In 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 1958–1961, 2017.
[42] Fan Hu, Gui-Song Xia, Jingwen Hu, and Liangpei Zhang. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 7(11):14680–14707, 2015.
[43] Haobo Lyu, Hui Lu, and Lichao Mou. Learning a transferable change rule from a recurrent neural network for land cover change detection. Remote Sensing, 8(6), 2016.
[44] W. Li, G. Wu, and Q. Du. Transferred deep learning for anomaly detection in hyperspectral imagery. IEEE Geoscience and Remote Sensing Letters, 14(5):597–601, 2017.
[45] J. Lin, R. Ward, and Z. J. Wang. Deep transfer learning for hyperspectral image classification. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pages 1–5, 2018.
[46] Y. Yuan, X. Zheng, and X. Lu. Hyperspectral image superresolution by transfer learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(5):1963–1974, 2017.
[47] Bei Fang, Ying Li, Haokui Zhang, and Jonathan Cheung-Wai Chan. Semi-supervised deep learning classification for hyperspectral image based on dual-strategy sample selection. Remote Sensing, 10(4), 2018.
[48] Bo Du, Liangpei Zhang, Dacheng Tao, and Dengyi Zhang. Unsupervised transfer learning for target detection from hyperspectral images. Neurocomputing, 120:72–82, 2013. Image Feature Detection and Description.
[49] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 759–766. ACM, 2007.
[50] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 2006.
[51] David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise, 2018.
[52] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.
Distilling the knowledge in a neural network, 2015.
[53] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
[54] Kun Tan, Erzhu Li, Qian Du, and Peijun Du. An efficient semi-supervised classification approach for hyperspectral imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 97:36–45, 2014.
[55] Liguo Wang, Siyuan Hao, Qunming Wang, and Ying Wang. Semi-supervised classification for hyperspectral imagery based on spatial-spectral Label Propagation. ISPRS Journal of Photogrammetry and Remote Sensing, 97:123–137, 2014.
[56] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[57] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 609–616, New York, NY, USA, 2009. Association for Computing Machinery.
[58] Anthony J. Bell and Terrence J. Sejnowski. The “independent components” of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[59] Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, Jun 1996.
[60] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3079–3087. Curran Associates, Inc., 2015.
[61] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification.
In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 328–339. Association for Computational Linguistics, 2018.
[62] Wei Wang and Zhi-Hua Zhou. A new analysis of co-training. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1135–1142, 2010.
[63] B. Liu, X. Yu, P. Zhang, A. Yu, Q. Fu, and X. Wei. Supervised deep feature extraction for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(4):1909–1921, April 2018.
[64] Bartosz Grabowski, Wojciech Masarczyk, Przemysław Głomb, and Agata Mendys. Automatic pigment identification from hyperspectral data. Journal of Cultural Heritage, 31:1–12, 2018.
[65] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[66] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 580–587. IEEE Computer Society, 2014.

(a) 5s/A9: OA 76.1%, AA 79.5%, κ 0.71
(b) 15s/A9: OA 88.3%, AA 91.1%, κ 0.86
(c) 50s/A9: OA 94.2%, AA 96.0%, κ 0.93
(d) 5s/A3: OA 66.3%, AA 71.2%, κ 0.61
(e) 15s/A3: OA 87.0%, AA 91.5%, κ 0.84
(f) 50s/A3: OA 94.8%, AA 96.9%, κ 0.94
(g) 5s/A5: OA 65.8%, AA 68.9%, κ 0.59
(h) 15s/A5: OA 75.4%, AA 76.7%, κ 0.71
(i) 50s/A5: OA 87.3%, AA 90.1%, κ 0.84
Figure 4: Sample results from experiment one, Indian Pines dataset. Rows present the three examined architectures, where A9, A3 and A5 correspond to architectures [16], [15] and [63] respectively. Columns present the three cases of the number of true training samples per class in fine-tuning (5s, 15s and 50s).
For each result, the Overall Accuracy (OA), Average Accuracy (AA) and κ coefficient are reported. Isolated grey points mark locations of the training samples, and are excluded from the evaluation.

(a) 5s/A9: OA 79.7%, AA 88.4%, κ 0.75
(b) 15s/A9: OA 91.3.3%, AA 93.8%, κ 0.89
(c) 50s/A9: OA 97.8%, AA 98.2%, κ 0.97
(d) 5s/A3: OA 81.7%, AA 91.0%, κ 0.77
(e) 15s/A3: OA 92.3%, AA 94.7%, κ 0.90
(f) 50s/A3: OA 97.4%, AA 97.3%, κ 0.97
(g) 5s/A5: OA 77.1%, AA 79.8%, κ 0.71
(h) 15s/A5: OA 90.4%, AA 88.6%, κ 0.88
(i) 50s/A5: OA 96.6%, AA 96.4%, κ 0.96
Figure 5: Sample results from experiment one, Pavia University dataset. The scheme is identical to that of Figure 4.

(a) Activations from network trained using 200 samples/class
(b) Activations from network trained using 5 samples/class
(c) Activations from network trained using artificial labels only
(d) Activations from network trained with fine-tuning (artificial followed by training labels)
Figure 6: The visualisation of the learned parameters for the four networks introduced in subsection 2.5.4. Each point represents a given sample's activations transformed to the 2-dimensional space using the t-SNE algorithm. Different colors represent different classes present in the image.

(a) dataset: IP/architecture: [16]
(b) dataset: IP/architecture: [15]
(c) dataset: IP/architecture: [63]
(d) dataset: PU/architecture: [16]
(e) dataset: PU/architecture: [15]
(f) dataset: PU/architecture: [63]
Figure 7: Sample pre-training results. Top row: Indian Pines dataset; bottom row: Pavia University dataset. Columns present the three architectures studied (based on the works [16, 15, 63]). Some class structure is visible depending on the dataset and network selected.

Rows: pre-training iterations. Columns: fine-tuning iterations.

        10k    20k    30k    40k    50k    60k    70k    80k    90k    100k
10k     62.96  63.0   66.51  64.77  62.54  62.87  64.93  65.73  65.37  66.69
20k     71.91  70.31  67.78  71.35  73.8   69.92  70.4   70.88  73.11  68.03
30k     74.74  76.34  74.41  71.95  69.7   73.74  73.06  74.06  70.18  71.22
40k     74.21  76.94  73.71  73.11  76.33  73.29  75.39  74.87  75.4   75.09
50k     75.64  73.98  72.62  78.85  76.44  77.55  76.85  77.92  78.5   74.64
60k     73.98  78.54  77.86  76.47  77.0   76.31  76.19  73.87  71.94  76.81
70k     78.29  75.46  76.32  75.3   77.24  75.0   73.35  72.85  75.1   78.49
80k     77.59  75.31  76.68  74.81  74.66  75.14  74.56  79.2   76.39  79.2
90k     75.29  81.34  73.69  78.34  78.17  80.44  74.07  76.6   78.69  80.38
100k    73.5   74.47  76.71  75.2   74.12  76.28  75.05  78.39  74.93  75.96

Figure 8: The classification accuracies of the networks trained on 5 samples/class and tested with the rest of the image using the Indian Pines dataset and neural network based on [16]. The y-axis shows the number of epochs for the pre-training stage, while the x-axis shows the number of epochs for the fine-tuning stage. The results suggest the relative importance of the pre-training stage in comparison to the fine-tuning stage.
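
The figure captions above report Overall Accuracy (OA), Average Accuracy (AA) and the κ coefficient for each classification result. As a minimal sketch, assuming the usual definitions (OA as the fraction of correctly classified evaluated pixels, AA as the mean of the per-class accuracies, and κ as Cohen's kappa), the three scores could be computed from true and predicted labels as shown below. The function name `classification_scores` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return (OA, AA, kappa) for two equal-length sequences of class labels.

    Hypothetical helper: the definitions used here are the standard ones and
    are assumed, not quoted from the paper.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    true_counts = Counter(y_true)   # pixels per class in the ground truth
    pred_counts = Counter(y_pred)   # pixels per class in the prediction
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Overall Accuracy: share of all evaluated pixels that are labelled correctly.
    oa = sum(correct.values()) / n

    # Average Accuracy: mean of per-class accuracies over classes in the ground truth.
    aa = sum(correct[c] / true_counts[c] for c in true_counts) / len(true_counts)

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    kappa = (oa - p_e) / (1.0 - p_e) if p_e < 1.0 else 1.0

    return oa, aa, kappa


if __name__ == "__main__":
    # Toy example with two classes; training pixels would be removed beforehand.
    truth = [0, 0, 0, 0, 1, 1]
    predicted = [0, 0, 0, 1, 1, 1]
    oa, aa, kappa = classification_scores(truth, predicted)
    print(f"OA={oa:.3f} AA={aa:.3f} kappa={kappa:.3f}")
```

For the toy labels, this gives OA ≈ 0.833, AA = 0.875 and κ ≈ 0.667. AA weights every class equally, so it can differ noticeably from OA when class sizes are unbalanced, which is common in hyperspectral scenes.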
