Human Daily Activities Indexing in Videos from Wearable Cameras for Monitoring of Patients with Dementia Diseases


Authors: Svebor Karaman (LaBRI), Jenny Benois-Pineau (LaBRI), Remi Megret (IMS)

Svebor Karaman¹, Jenny Benois-Pineau¹, Rémi Mégret², Vladislavs Dovgalecs², Jean-François Dartigues³, Yann Gaëstel³

¹ LaBRI, Université de Bordeaux, Talence, France, {Svebor.Karaman, Jenny.Benois}@labri.fr
² IMS, Université de Bordeaux, Talence, France, {Remi.Megret, Vladislavs.Dovgalecs}@ims-bordeaux.fr
³ INSERM U.897, Université Victor Segalen Bordeaux 2, Bordeaux, France, {Jean-Francois.Dartigues, Yann.Gaestel}@isped.u-bordeaux2.fr

Abstract

Our research focuses on analysing human activities according to a known behaviorist scenario, in the case of noisy and high-dimensional collected data. The data come from the monitoring of patients with dementia diseases by wearable cameras. We define a structural model of video recordings based on a Hidden Markov Model. New spatio-temporal features, color features and localization features are proposed as observations. First results in recognition of activities are promising.

1. Introduction

In the field of human behavior analysis, video brings an objective vision to medical practitioners. Working with medical doctors, our aim is to develop a method for indexing videos via recognition of activities, obtained from the monitoring of patients with wearable video cameras [12]. The daily activities of interest are specified by medical research in the context of studies of dementia, and in particular of Alzheimer's disease [5]. According to these studies, the analysis of instrumental activities of daily life is one of the most important tools in the early diagnosis of Alzheimer's disease. It is also mandatory in order to monitor the development of the disease. To retrieve the structure of activities we use a Hidden Markov Model (HMM); HMMs have been successfully applied in video analysis [3][4]. In videos, an HMM can be used at a low level, e.g.
for detecting a scene change in video [3], or at a higher level, to reveal the structure of the video according to a known grammar of events, as in a tennis match [4]. Generally, HMMs are used for segmentation, classification and recognition of events in videos with a clear structure. However, the wearable cameras used in patient monitoring provide a long sequence shot of daily activities. The aim of this paper is therefore to propose a scheme to divide the video into coherent segments and to choose a set of descriptors efficiently describing the video content in order to identify the daily activities. In section 2 we present the video acquisition setup and characterize the video data we use. The motion analysis and motion-based temporal clustering are presented in section 3. Section 4 details the choice of descriptors and justifies the description space. In section 5 we present the scenario we have defined with medical doctors and design an HMM. Results are presented in section 6, and conclusions and perspectives of this work are drawn in section 7.

2. Video Acquisition Setup

2.1. The Device

The use of wearable cameras in the observation of patients suffering from dementia diseases can be interesting for medical doctors, since it always keeps an objective vision of the patient's activities, contrary to the patient's relatives, who may see the situation as either worse or better than it is in reality [5]. This may lead to a late detection of the disease. Wearable cameras have been used in previous projects, such as the SenseCam project [6], where the images were recorded as a memory aid for the patient. The WearCam project [8] uses a camera strapped on the head of young children, combined with a wireless transmission from the camera to the recorder. In our project, a mobile video recorder is used in order to obtain the best quality of image in such an embedded context.
The device has to be as light as possible, since the patients are aged persons, and it must also allow people to go on with their daily activities without difficulties due to the device. After studies of several positions in [12] and further in situ analysis, we have found the shoulder position to meet these objectives best. The viewpoint obtained from this position is close to the patient's viewpoint, without the problem of fixing a camera on the patient's head, which is uncomfortable for the patient and may lead to strong non-significant motion. Today our prototype uses a fish-eye lens with an effective diagonal angle of 150°, which allows us to capture most instrumental activities.

2.2. Characteristics of Recorded Videos

The video sequences shot by a wearable camera exhibit important motion, since the camera follows the movements of the person (Fig 2). This leads to some blur in case of strong movements (Fig 2a). Strong lighting changes appear when the person moves to a different room or faces a window (Fig 2b, c). The whole video is recorded in one shot.

3. Motion Analysis

3.1. Global Motion Estimation

In videos from a wearable camera setting, the ego-motion allows us to distinguish situations when the patient moves or remains still, e.g. sitting or standing. Therefore the motion is relevant as a descriptor of the activity of the patient. To extract this information we use the Camera Motion Detection (CMD) method previously developed in [9]. This tool estimates the motion according to a first-order complete affine model. It takes as input the motion vector (dx_i, dy_i)^T of each macro-block in the compressed video stream (Eq. 1). We only use the P images of MPEG videos to estimate the motion model.

    dx_i = a1 + a2 * x_i + a3 * y_i
    dy_i = a4 + a5 * x_i + a6 * y_i        (Eq. 1)

Eq. 1: Motion compensation vector, with (x_i, y_i) the coordinates of a block center.
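As a concrete illustration, the affine model of Eq. 1 can be evaluated per macro-block as in the following sketch (pure Python, with made-up parameter values; this is not the CMD implementation of [9]):

```python
def affine_motion(params, x, y):
    """Apply the first-order complete affine motion model of Eq. 1.

    params = (a1, a2, a3, a4, a5, a6); (x, y) is a macro-block center.
    a1 and a4 are the translation parameters; the other four capture
    zoom, rotation and shear.  Returns the motion vector (dx, dy).
    """
    a1, a2, a3, a4, a5, a6 = params
    dx = a1 + a2 * x + a3 * y
    dy = a4 + a5 * x + a6 * y
    return dx, dy

# A pure translation: a1 = 2, a4 = -1, all other parameters zero.
print(affine_motion((2.0, 0.0, 0.0, -1.0, 0.0, 0.0), 160, 120))  # -> (2.0, -1.0)
```

With all non-translation parameters at zero, every block receives the same vector, which is why a1 and a4 alone summarize the ego-translation used by the features of section 4.2.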
3.2. Motion-Based Temporal Segmentation

Most research in video content indexing uses videos with shot boundaries, whether cuts or progressive transitions [15]. In our case the whole video is a sequence shot. To define an equivalent of the traditional shot, a temporal unit, called a segment, adapted to the specific characteristics of the video must be chosen. The typical approach nowadays consists in splitting the video into segments of a fixed duration, from half a second to one second [2], based on the latency of humans in understanding visual concepts in video. Another approach consists in a more thorough use of the intrinsic camera motion observed in the image plane, by segmenting the sequences into so-called camera viewpoints [13]. In our case, the camera being worn by the patient, the motion is clearly related to the person's position. Therefore, defining a unique point of view as a segment is a straightforward choice. By composing the CMD parameters (Eq. 1) over time, the trajectories of each corner of the image are computed. When the corner trajectories reach an outbound position relative to a predefined threshold t, a "cut" is detected: the current segment ends and a new one starts at the next frame. Previous experiments have shown empirically that a threshold t = 0.2 × image width gives a good segmentation, meaning no viewpoint is missed and there is no over-segmentation of the video. Each segment must contain at least five frames, to ensure at least one localization estimate (see 4.3). A key frame is chosen for each segment as its temporal center.

4. Description Space

The video features in our case have to express the dynamics of activity, the localization in the home environment and the image content. Hence we propose a complex description space where low-level features, such as the MPEG-7 Color Layout Descriptor, are merged with mid-level features in an "early fusion" way.

Fig 2. Examples of videos acquired with a wearable camera: (a) motion blur due to strong motion; (b, c) strong lighting while facing a window.
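The temporal segmentation of section 3.2 can be sketched as follows. This is a deliberate simplification: it accumulates only the translation parameters (a1, a4) per frame rather than composing the full affine model over the four image corners, and the function name and signature are ours, not the paper's:

```python
def segment_by_corner_drift(translations, image_width, ratio=0.2, min_len=5):
    """Cut the frame stream into segments: accumulate the per-frame
    translation (a1, a4) and close the current segment once the drift
    exceeds t = ratio * image_width (0.2 in the paper).  A segment must
    contain at least min_len frames (5 in the paper).  Returns a list
    of (start_frame, end_frame) index pairs."""
    threshold = ratio * image_width
    segments, start = [], 0
    drift_x = drift_y = 0.0
    for i, (tx, ty) in enumerate(translations):
        drift_x += tx
        drift_y += ty
        if max(abs(drift_x), abs(drift_y)) > threshold and i - start + 1 >= min_len:
            segments.append((start, i))  # "cut" detected: close the segment
            start, drift_x, drift_y = i + 1, 0.0, 0.0
    if start < len(translations):
        segments.append((start, len(translations) - 1))
    return segments

# Constant rightward motion of 10 px/frame on a 100 px wide image:
print(segment_by_corner_drift([(10.0, 0.0)] * 10, 100))  # -> [(0, 4), (5, 9)]
```

The key frame of each segment would then be the frame at the temporal center of each returned pair.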
4.1. Cut Histogram

A cut histogram H_c is defined as an N_c-bin histogram, where the i-th bin contains the number of cuts, as computed by the motion-based temporal segmentation, in the 2^i previous frames. The number of bins N_c has been set to 6 or 8, thereby defining a maximum temporal horizon of 256 frames, i.e. 10 seconds. The cut histogram H_c,seg for a segment is the average of the cut histograms H_c of all the frames within this segment. This feature characterizes the dynamics of activities related to the person's displacement.

4.2. Translation Parameter Histogram

The translation parameters of the affine model (Eq. 1) are good indicators of the strength of the ego-motion of a person, which differs depending on the activity. A translation parameter energy histogram H_tpe is defined as an N_e-bin histogram representing the quantized probability distribution of the energy of the translation parameters over each video segment. Two H_tpe histograms are built, one for each translation parameter a1 and a4. The ranges of the histogram bins follow a logarithmic scale defined with a step s_h (Eq. 2), to provide a higher resolution for low-motion parameter values. This histogram is aimed at distinguishing between low- and high-motion activities.

    H_tpe[i] = 1 if log(a^2) < i * s_h,                for i = 1
    H_tpe[i] = 1 if (i-1) * s_h <= log(a^2) < i * s_h, for i = 2 .. N_e - 1
    H_tpe[i] = 1 if log(a^2) >= (i-1) * s_h,           for i = N_e        (Eq. 2)

Eq. 2: Translation parameter histogram, a being either a1 or a4.

4.3. Localization

For the image-based localization technique we use a Bag-of-Features approach [11], detailed in [7], which represents each frame by a signature corresponding to a "visual word" histogram computed from SURF features [14]. The SURF descriptors are quantized using a 3-level quantization tree [11] with a branching factor of 10, yielding 1111-dimensional signatures. The location is then estimated using a 1-NN classifier.
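The 1-NN location estimate can be sketched as below. This is an illustrative pure-Python nearest-neighbour search only: the function name, the distance choice (squared Euclidean) and the tiny 3-bin vectors standing in for the 1111-dimensional BoF signatures are our assumptions:

```python
def nearest_location(signature, labelled_signatures):
    """1-NN location estimate: return the room label of the training
    visual-word histogram closest to the query signature."""
    def dist2(u, v):
        # Squared Euclidean distance between two histograms.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    best_label, best_d = None, float("inf")
    for label, sig in labelled_signatures:
        d = dist2(signature, sig)
        if d < best_d:
            best_label, best_d = label, d
    return best_label

train = [("kitchen", [0.8, 0.1, 0.1]), ("office", [0.1, 0.8, 0.1])]
print(nearest_location([0.7, 0.2, 0.1], train))  # -> kitchen
```

Per segment, such frame-level estimates are then aggregated into the N_l-bin localization histogram described next.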
This feature is an important discriminator with induced semantics: e.g., cooking activities cannot happen in a living room. For each segment an N_l-bin histogram is built, N_l being the number of localization classes. Each bin contains the empirical probability of being in the estimated i-th localization class within the segment.

4.4. Color and Spatial Information

The colors in images are relevant to determine the environment of the patient. In our videos, the strong movements generate blur on images, which does not allow us to identify details. Therefore some global information on the spatial distribution of colors is required. For this purpose we use the MPEG-7 Color Layout Descriptor (CLD) [10], which is based on the DCT transform. Most frequently, the use of 6 coefficients for luminance and 3 coefficients for each chrominance component is depicted as a relevant choice [10]. For each segment, the CLD of the key frame is selected as descriptor.

5. Hidden Markov Model

The practitioners have defined scenarios for understanding the stage of evolution of the patient's dementia. The aim is to record the patient doing daily activities for which autonomy can be evaluated. Hence, the taxonomy of states of the HMM to design is driven by these activities. Each activity is complex, so it can hardly be modeled by a single state. We consider a hierarchical HMM in which the upper level, called the Activity HMM, contains states corresponding to semantic activities such as "working on a computer" or "making coffee", and where the lower-level states consist of elementary states with a nested hierarchical relation: each semantic activity is modeled by an elementary HMM with m states, m being the global structural parameter in this two-level model. The transition matrix of the Activity HMM is fixed a priori according to the patient's home environment, and all initial probabilities are set equal.
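As an illustration of this structure, the skeleton of one m-state elementary HMM might be initialized as below (a minimal pure-Python sketch; spreading the non-looping mass uniformly over the other states is our assumption, not the paper's exact topology, and in practice the paper trains these models with the HTK library):

```python
def elementary_hmm(m, self_loop=0.9):
    """Build the skeleton of an m-state elementary HMM: a transition
    matrix A with a high looping probability A_ii on the diagonal
    (the temporal regularization of section 5; 0.9 in the experiments)
    and equal initial probabilities."""
    if m == 1:
        return [[1.0]], [1.0]
    off = (1.0 - self_loop) / (m - 1)  # remaining mass, spread uniformly
    A = [[self_loop if i == j else off for j in range(m)] for i in range(m)]
    pi = [1.0 / m] * m
    return A, pi

A, pi = elementary_hmm(3)
print([round(p, 2) for p in A[0]])  # -> [0.9, 0.05, 0.05]
```

Baum-Welch re-estimation would then refine A and the per-state GMM parameters from the training observations.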
The non-semantic states of the elementary HMM are modeled by Gaussian Mixture Models (GMM) in the observation space described in section 4. The elementary HMM state transition matrix A and the observation GMM parameters are learned using the Baum-Welch algorithm. In order to introduce temporal regularization in the HMM, recent research focuses on segmental HMMs [16]. In our work we prefer to stay in the classic HMM framework, as it is lighter in terms of complexity, and we obtain temporal regularization by increasing the initial looping probability of each elementary state. Previous experiments have shown an over-segmentation by the HMM when observations were used for each frame. In this work an observation corresponds to a vector obtained for a video segment by concatenating the descriptors defined in section 4. The HMMs are built using the HTK library [1].

6. Results

For learning purposes, we use 10% of the total number of frames for each complex activity. These frames are used to train the localization estimator and the HMM. In this paper, we used the ground-truth localization to train the HMM; the other features were extracted automatically. The tests are done over the segment observations, for which we choose several subspaces of our description space. Examples of configurations are presented in the right column of Table 1. We considered the numbers of elementary states m = 1, 3 or 5 for each activity HMM. The initial looping probabilities A_ii were set to 0.9. The dataset was composed of 3974 frames used for learning and 310 segments for recognition, corresponding to 33 minutes of video.
The 7 different activities ("moving in home office", "moving in kitchen", "going up/down the stairs", "moving outdoors", "moving in the living room", "making coffee", "working on computer") present in the video were annotated. The best recognition performances are presented in Table 1; the corresponding confusion matrices are shown in Fig 3, with lines and columns representing the previously listed activities in this order. The results are very good for some activities, such as "moving in home office", with a precision of 0.94, a recall of 0.81 and an F-score of 0.87 in the configuration (H_c + Localization, 5-state HMMs). However, other activities are much more difficult to detect, such as "moving in the kitchen", with an F-score of 0.47. This reveals that the visual description space is still limited in distinguishing activities which take place in the same environment and involve similar motion activity. The activity "working on computer" is also specific, because the temporal segmentation gives only 2 segments in the video even though this activity represents thousands of frames. The temporal segmentation has to be refined.

7. Conclusions and Perspectives

This article has presented a human activity indexing method based on HMMs with a mixed description space. Results show that the activities which have a strong correlation with localization are well identified. Nevertheless, it is necessary to enrich the description space with features describing objects, audio content and the patient's behavior (e.g. manual activity). Due to the diversity of home environments, it would hardly be possible to train generic models. Hence, we will have to define efficient protocols for large-scale training.

8. References

[1] HTK web site: http://htk.eng.cam.ac.uk
[2] E. Dumont and B. Merialdo, "Rushes video parsing using video sequence alignment". CBMI, 2009, pp. 44-49
[3] J. Boreczky, L. Rowe, "Comparison of Video Shot Boundary Detection Techniques". Journal of Electronic Imaging, 1996
[4] E. Kijak, P. Gros, L. Oisel, "Hierarchical Structure Analysis of Sport Videos Using HMMs". IEEE ICIP, 2003
[5] J.-F. Dartigues, "Methodological Problems in Clinical and Epidemiological Research on Ageing". Revue d'épidémiologie et de santé publique, 2005
[6] S. Hodges et al., "SenseCam: a Retrospective Memory Aid". UBICOMP, 2006, pp. 177-193
[7] V. Dovgalecs, R. Mégret, H. Wannous, Y. Berthoumieu, "Semi-Supervised Learning for Location Recognition from Wearable Video". CBMI, 2010, Grenoble
[8] L. Picardi et al., "WearCam: A Head Wireless Camera for Monitoring Gaze Attention and for the Diagnosis of Developmental Disorders in Young Children". International Symposium on Robot & Human Interactive Communication, 2007
[9] J. Benois-Pineau, P. Kramer, "Camera Motion Detection in the Rough Indexing Paradigm". TREC Video, 2005
[10] T. Sikora, B.S. Manjunath, P. Salembier, "Introduction to MPEG-7". Multimedia Content Description Interface, Wiley, 2002
[11] D. Nister and H. Stewenius, "Scalable Recognition with a Vocabulary Tree". CVPR, 2006
[12] R. Mégret, D. Szolgay, J. Benois-Pineau, Ph. Joly, J. Pinquier, J.-F. Dartigues, C. Helmer, "Wearable video monitoring of people with age dementia: Video indexing at the service of healthcare". CBMI, 2008, pp. 101-108
[13] G. Abdollahian, Z. Pizlo and E. J. Delp, "A study on the effect of camera motion on human visual attention". IEEE ICIP, 2008
[14] H. Bay, A. Ess, T. Tuytelaars and L. Van Gool, "SURF: Speeded Up Robust Features". CVIU, 2008
[15] W. Dupuy, J. Benois-Pineau, D. Barba, "Recovering of Visual Scenarios in Movies by Motion Analysis and Grouping Spatio-temporal Colour Signatures of Video Shots". Proc. of EUSFLAT'2001, 2001, pp. 385-390
[16] M. Delakis, G. Gravier and P. Gros, "Audiovisual integration with Segment Models for tennis video parsing". Computer Vision and Image Understanding, vol. 111, August 2008, pp. 142-154
Table 1. Configuration for best recognition results

    Measure    Score  Configuration
    F-Score    0.64   H_c + Localization, 5-state HMMs
    Recall     0.70   H_tpe + CLD + Localization, 3-state HMMs
    Precision  0.67   H_c + Localization, 5-state HMMs

Fig 3. Confusion matrices for best recognition results: (a) F-score, (b) recall, (c) precision.
