Dynamics of thematic information flows

The studies of the dynamics of topical dataflow of new information in the framework of a logistic model were suggested. The condition of topic balance, when the number of publications on all topics is proportional to the information space and time, w…

Authors: ** D.V. Lande, S.M. Braichevskii **

Dynamics of thematic information flows
D.V. Lande, S.M. Braichevskii
Information center ElVisti, Kyiv, Ukraine

A study of the dynamics of thematic flows of new information within the framework of a logistic model is proposed. The condition of topic balance, under which the number of publications on all topics is proportional to the capacity of the information space and to time, is presented. A general time dependence of the intensity of Internet publications devoted to a particular topic is obtained; unlike the exponential model, it has a saturation region. Some limitations of the logistic model are identified, opening the way for further research.

Key words: information flows, Internet network, logistic model, topic balance

Topics

One of the main features of the network information space is the presence of a dynamic segment [1], whose content changes with time. As a result, the concept of data flows has recently become relevant [2-4], and such flows play an increasingly important role in present-day information technologies. Studying the dynamics of data flows is therefore important and interesting, particularly because the issue has not been researched enough [5].

During recent decades certain progress has been made on the problem of information obsolescence within the framework of the Burton-Kebler model [6], which was developed to evaluate the real period of use of scientific works, and of the approaches of Cole and other authors [7]. It later turned out that the results obtained (as well as the approaches) could be useful in the wider context of information technologies. However, understanding the dynamics of data flows requires a somewhat deeper analysis and more sophisticated techniques. In this work we propose to study the dynamics of thematic flows of new information within the framework of a logistic model.
A general time dependence of the publication intensity is obtained; it appears to correspond to experimental data. Alongside this, limitations of the logistic model are identified, which in turn opens the way for further research.

Available models

As is well known, the Internet space can be conditionally divided into two constituents, stable (static) and dynamic, which have very different characteristics from the point of view of the required integration of data flows. In particular, even the processes of information obsolescence, that is, the loss of its relevance, are described in the Burton-Kebler model by an equation with two components:

$$m(T) = 1 - a e^{-T} - b e^{-2T},$$

where $m(T)$ is the share of useful information in the total data flow after time $T$; the first exponential term corresponds to stable resources, the second to dynamic, new ones.

The stable constituent of the Internet contains "long-term" information, while the dynamic constituent consists of constantly updated resources. Part of the dynamic constituent joins the stable one in the course of time, while a greater part of it "disappears" from the Internet or enters the "hidden" web space, which is not accessible to users through known information-retrieval systems.

The segment of new information is the most markedly dynamic. On the one hand, it has the highest rate of updating; on the other hand, a huge amount of data is generated and distributed there. For our purposes, this segment seems best suited for the research. Generally speaking, information dynamics in the network is due to many factors, most of which cannot be analyzed. As a reasonable assumption, we take it that the general character of the time dependence of the number of thematic publications in the network is determined by very simple regularities, which allow mathematical models to be developed.
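As a purely illustrative aside, the two-component obsolescence expression quoted above can be evaluated directly. The sketch below is not taken from the paper: the function name `useful_share` and the weights `a = 0.6`, `b = 0.4` are assumptions chosen only to show how the quickly decaying term dominates at small $T$ and the slowly decaying term at large $T$.

```python
import math

def useful_share(T, a, b):
    """Evaluate m(T) = 1 - a*exp(-T) - b*exp(-2T).

    a weights the slowly decaying (stable) term,
    b weights the quickly decaying (dynamic) term.
    """
    return 1.0 - a * math.exp(-T) - b * math.exp(-2.0 * T)

# Hypothetical weights, for illustration only (with a + b = 1, m(0) = 0).
a, b = 0.6, 0.4
for T in (0.0, 0.5, 1.0, 2.0, 5.0):
    stable_term = a * math.exp(-T)
    dynamic_term = b * math.exp(-2.0 * T)
    print(f"T={T:4.1f}  m={useful_share(T, a, b):.4f}  "
          f"stable term={stable_term:.4f}  dynamic term={dynamic_term:.4f}")
```

With these assumed weights the dynamic term falls off twice as fast in the exponent, so beyond $T \approx 1$ the remaining deviation of $m(T)$ from 1 is almost entirely due to the stable term.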
In the works known to us that deal with information obsolescence, the Malthus model is used [8] (possibly with modifications such as the superposition of two curves with different parameters). The advantage of the model is that the Malthus equation has an exact solution in the form of a very simple and convenient function, the exponential; from the point of view of interpreting the result, however, it looks very disputable. The main problem is that the exponential is a monotonic function: it cannot describe processes which by their nature must have local extrema. There is no need to prove that news loses its relevance, which results in a decrease in the number of publications.

To obtain a more adequate dependence, we have to turn to more complicated models. The logistic model appears very promising; it was suggested by P. Verhulst [9] to describe population dynamics and by R. Pearl [10] for biological communities, and it was later used successfully in numerous studies. The advantage of the model is, first of all, that it combines simplicity of problem formulation with the possibility of varying the solutions by means of a set of parameters that have a more or less transparent physical meaning.

Topic balance

Let us look at the general picture of the dynamics of thematic data flows, in particular the mechanisms typical of the news segment of the Internet. We presume that the majority of organizations that generate new information work in a stationary regime, which can be characterized by the maximum capacity of the information space $N$ (the question of how such parameters are normalized and measured is not considered in this paper). This means that each generating organization produces an information flow that is constant in terms of the number of characters and messages. What changes with time is the amount of information devoted to each topic.
In other words, an increase in the number of publications on one topic results in a decrease in the number of publications on other topics, so for each time span $T$ we have

$$\sum_{i=1}^{M} \int_{0}^{T} n_i(t)\,dt = NT, \tag{1}$$

where $n_i(t)$ is the number of publications on topic $i$ per unit time and $M$ is the total number of all possible topics. Some of the $n_i(t)$ are always expected to be zero. The dynamics of a particular thematic information flow, described by the density $n_i(t)$, is of major interest.

It is worth mentioning that the word "topic", applied to an information flow, should not be taken too literally. By it we mean a certain abstraction or generalization associated with the activity of information sources. It does have a connection with events in the real world, but its subjective expression may not be as simple as it looks at first sight. For example, the launch of a spacecraft to Mars may cause a number of publications on the expediency of reallocating budget funds in favor of research. Hence it is not always possible to establish a connection between the increased activity of generating sources and the situation in the environment. That is why we will speak about the appearance of a new topic, bearing in mind a set of factors that cause an increase in the number of publications per unit time. The localization of a particular topic in a semantic space and its articulation in communicative mechanisms is a separate issue, which we do not intend to discuss, at least in this paper. We will only state that it can be solved in a wide range of cases. Our major concern is the fact that topics appear and disappear at certain moments (i.e., they lose their relevance and are no longer of interest to people).
It may be assumed theoretically that the sets of publications associated with a given set of topics overlap; namely, some publications can be attributed to several different topics at the same time. In general, such "polytopicality" is a phenomenon that should not be ignored; however, to a first approximation we will assume that it does not distort the general picture. Furthermore, we will assume that during its period of relevance a topic fixes a set of mechanisms that result in an increase in the number of publications having some common features. Different topics may generate different data flows, so in this respect they are not interchangeable.

On a formal level, let us associate two parameters with a topic as an abstract concept: duration (typical "lifetime") $\lambda$ and intensity $D$. In the context of this work we will consider the intensity to be constant. This is a simplification, but it is good enough to identify general trends. The duration does not necessarily coincide with the beginning and the end of an event or a series of events; it characterizes only the time span after which the topic loses its relevance. The intensity can be defined as a quantity that characterizes the number of publications caused by a certain topic, averaged over the interval $\lambda$. The response of the media, described by the quantity $D$, is never instantaneous: there is always a certain time delay. To take this into account, let us introduce a delay factor $\tau$.

Hence we can suggest the following qualitative picture of the dynamics of thematic data flows. The generation of data flows has two constituents: background and thematic. The background constituent is determined by numerous factors that are only weakly connected with each other, and under certain conditions it may approach noise (with respect to thematic classification).
But it ensures the publication of a relatively stable number of materials on the principle "Something must be published!" A new topic causes a process (more exactly, a set of processes) of redistribution of network resources as topical stories appear. The volume of background publications decreases, and that of thematic ones increases. If the durations of two or more topics overlap, thematic publications begin to be redistributed among them, the nature of the redistribution being determined by the values of $\lambda$ and $D$ of each topic. When a topic loses its relevance, the associated resources move either to background flows or to other thematic ones.

In this paper we study the second, thematic constituent, and we focus on the dynamics of flows caused by a single topic. The "interaction" of several topics is a separate subject of research and is beyond our task.

We will give only two examples of real data flows whose behavior corresponds to the model described below. In the first case (Fig. 1a), we considered publications collected from the Internet by a news monitoring system on the topics of the illness and retirement of a famous political figure. Before his illness took a turn for the worse, publications concerning his activities were at a high level. The information about his illness increased the number of publications considerably; it reached the highest saturation level. The information about his giving up his political career decreased the number of publications to the lowest level, at which final stabilization occurred. Another example is the election of the mayor of a big city (Fig. 1b). Before the election campaign started, there were very few publications about this person on the Internet, which corresponded to a low stable level.
The election and appointment to the position of mayor were accompanied by a great number of publications of both positive and negative nature (an upper level). The mayor's activities after the election are accompanied by a number of publications that corresponds to an intermediate stable level.

Fig. 1. Examples of data flows (panels a and b)

Logistic model

The logistic model can be considered a generalization of the Malthus model, which assumes proportionality between the rate of growth of a function and its value at each moment of time:

$$\frac{dn(t)}{dt} = k\,n(t), \tag{2}$$

where $k$ is a coefficient of proportionality. Since we agreed to consider the dynamics of a particular thematic information flow, we omit the topic index $i$ on the quantities $n_i(t)$. The idea is to make the coefficient in the Malthus equation a function of time, so that the solution does not exceed a threshold value. This can be done in various ways, but the most popular is to use a constant that explicitly limits the growth of the solution. In our case we use the capacity $N$. The coefficient then takes the form

$$k\,\bigl(N - r\,n(t)\bigr), \tag{3}$$

where $k$ is the Malthus coefficient and $r$ is a factor that describes negative processes in the system associated with internal factors. We have to take explicitly into account the parameters that characterize the effect of a topic on the publication dynamics. As the intensity $D$ is taken to be constant, its contribution is represented as follows:

$$y(t) = \begin{cases} D, & 0 < t \le \lambda, \\ 0, & t < 0,\ t > \lambda. \end{cases} \tag{4}$$

Correspondingly, we consider two time regions separately: $0 < t \le \lambda$ with $D > 0$, and $t > \lambda$ with $D = 0$; the solutions in them are the functions $u(t)$ and $v(t)$, respectively.
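To make the difference between the constant coefficient in (2) and the limited coefficient (3) concrete, the following minimal sketch integrates both variants with the explicit Euler method. It is an illustration under stated assumptions, not the authors' procedure: the helper `euler` and the values of `k`, `N`, `r`, `n0`, `dt` and `steps` are hypothetical.

```python
def euler(rate, n0, dt, steps):
    """Integrate dn/dt = rate(n) * n with the explicit Euler method."""
    n = n0
    path = [n]
    for _ in range(steps):
        n += dt * rate(n) * n
        path.append(n)
    return path

# Hypothetical parameter values, chosen only for illustration.
k, N, r, n0 = 0.5, 1.0, 2.0, 0.01
dt, steps = 0.01, 4000  # integrate up to t = 40

malthus = euler(lambda n: k, n0, dt, steps)                # constant coefficient, eq. (2)
limited = euler(lambda n: k * (N - r * n), n0, dt, steps)  # coefficient of the form (3)

print("Malthus growth at t = 40:", malthus[-1])
print("limited growth at t = 40:", limited[-1])
print("ceiling N / r           :", N / r)
```

With a coefficient of the form (3), growth stops where $N - r\,n = 0$, so the solution levels off at $N/r$, whereas the Malthus solution grows without bound.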
We obtain the complete solution by matching at the boundary point $\lambda$:

$$n(t) = \begin{cases} u(t), & 0 < t \le \lambda, \\ v(t), & t > \lambda, \end{cases} \qquad u(\lambda) = v(\lambda). \tag{5}$$

The first region corresponds to the growth in the number of publications on a given topic while it is relevant ($D > 0$), possibly with a transition to a saturation level; the second region corresponds to the decline in the number of publications caused by the loss of relevance ($D = 0$). With the parameters normalized to the threshold quantity $N$, the equation for the first region is

$$\frac{du(t-\tau)}{dt} = p\,u(t-\tau)\bigl(1 - q\,u(t-\tau)\bigr) + D\,u(t-\tau), \qquad u(0) = n_0. \tag{6}$$

The quantity $p$ defines the normalized probability that a publication appears per unit time regardless of the relevance of the given topic. This factor reflects background mechanisms of information generation (a typical example is the republication of materials previously published in prestigious information resources). The quantity $D$ characterizes the direct effect of the relevance of the topic. The parameter $q$ characterizes the rate of decrease in the number of publications and is the inverse of the asymptotic value of $u(t)$ when $D = 0$. The initial condition in (6) expresses two aspects of the information dynamics: firstly, the presence of the background constituent of data flows and, secondly, the uncertainty of the exact moment at which a particular topic starts contributing to the generation of publications. Because of this, at time $t = 0$ there already exists a certain number of publications that can be associated with the topic.
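The role of the parameters in equation (6) can be checked numerically. The sketch below, written in the shifted time $s = t - \tau$, integrates the right-hand side of (6) with a classical fourth-order Runge-Kutta scheme and compares the long-time value with the level at which the right-hand side vanishes, $(p + D)/(pq)$. The function names and all parameter values are assumptions made for illustration.

```python
def growth_rhs(u, p, q, D):
    """Right-hand side of equation (6) in the shifted time s = t - tau."""
    return p * u * (1.0 - q * u) + D * u

def integrate(u0, p, q, D, dt=0.001, t_end=60.0):
    """Classical fourth-order Runge-Kutta integration; returns u at t_end."""
    u = u0
    for _ in range(int(round(t_end / dt))):
        k1 = growth_rhs(u, p, q, D)
        k2 = growth_rhs(u + 0.5 * dt * k1, p, q, D)
        k3 = growth_rhs(u + 0.5 * dt * k2, p, q, D)
        k4 = growth_rhs(u + dt * k3, p, q, D)
        u += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return u

# Hypothetical parameter values, chosen only for illustration.
p, q = 0.3, 2.0
for D in (0.0, 0.6):
    for n0 in (0.05, 0.2):
        final = integrate(n0, p, q, D)
        print(f"D={D:.1f}  n0={n0:.2f}  u(60)={final:.6f}  "
              f"(p + D)/(p q)={(p + D) / (p * q):.6f}")
```

For $D = 0$ the long-time value is $1/q$, in line with the interpretation of $q$ given above, and for either value of $D$ the result does not depend on the assumed starting value `n0`.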
For the second region we have

$$\frac{dv(t-\lambda)}{dt} = p\,v(t-\lambda)\bigl(1 - q\,v(t-\lambda)\bigr), \qquad v(\lambda) = u(\lambda). \tag{7}$$

Since in the second region the topic has no effect on the publication dynamics (the equation describes processes that are inertial with respect to the topic), we do not include the delay factor $\tau$ in equation (7). The boundary condition in equation (7) provides the matching of the functions $u(t)$ and $v(t)$. The solution of (6) is

$$u(t) = \frac{u_s}{1 + \left(\dfrac{u_s}{n_0} - 1\right)\exp[-(p+D)(t-\tau)]}, \tag{8}$$

where $u_s$ is the asymptotic value of $u$, which defines the saturation region (if, of course, the dependence has enough time to reach it):

$$u_s = \frac{p + D}{pq}. \tag{9}$$

We note that expression (9) does not depend on the value $n_0$, which shows that initial conditions are not important for the saturation of the information dynamics. No matter what the initial number of publications is, the saturation is determined exclusively by the parameters that characterize the background rate of growth in the number of publications, the quantitative level of relevance and the negative factors of the process. From the practical point of view, we may ignore background factors, which are not easy to study. Curve (8) has an inflection point

$$t_{\mathrm{inf}} = \frac{1}{p+D}\ln\left(\frac{u_s}{n_0} - 1\right) + \tau. \tag{10}$$

Thus we have a so-called S-shaped dependence in the first region, and for $t \sim t_{\mathrm{inf}}$ dependence (8) becomes linear and corresponds to a linear model. For convenience, we rewrite (8) in a different form:

$$u(t) = \frac{u_s \exp[(p+D)(t-\tau)]}{\exp[(p+D)(t-\tau)] + \left(\dfrac{u_s}{n_0} - 1\right)} = \frac{u_s \exp[(p+D)t]}{\exp[(p+D)t] + \left(\dfrac{u_s}{n_0} - 1\right)\exp[(p+D)\tau]}. \tag{11}$$

It is clearly seen that, provided

$$t < \frac{1}{p+D}\ln\left(\frac{u_s}{n_0} - 1\right) + \tau = t_{\mathrm{inf}}, \tag{12}$$

the dependence $u(t)$ is exponential in nature, its form being determined by the delay $\tau$. Hence, for values of $t$ much smaller than $t_{\mathrm{inf}}$, our model agrees with the exponential model. A typical dependence is shown in Fig. 2.

Fig. 2. Growth region

Let us move to the second region. Its solution is

$$v(t) = \frac{u(\lambda)}{q\,u(\lambda) + \bigl(1 - q\,u(\lambda)\bigr)\exp[-p(t-\lambda)]}. \tag{13}$$

If the dependence $u(t)$ has enough time to reach saturation within the time span $t < \lambda$, we may simplify solution (13) as follows:

$$v(t) = \frac{v_s\,(p + D)}{p + D\bigl(1 - \exp[-p(t-\lambda)]\bigr)}, \tag{14}$$

where $v_s = 1/q$ is the asymptotic value of the dependence $v(t)$. As should be expected, the quantity $v_s$ depends neither on the initial conditions nor on the matching at the region boundaries. In the second region the publication dynamics is, to a first approximation, exponential in nature, which is consistent with the exponential model. A typical dependence for the second region is shown in Fig. 3.

Fig. 3. Decline region

So we see that our dependence has a saturation region $u_s$ (for $t \le \lambda$) and an asymptote $v_s$, which describes a gradual decrease in the number of publications to a background level. This means that it agrees qualitatively with the common view of the nature of information dynamics obtained from experimental data. Besides, it also agrees with the linear and exponential models in certain ranges of $t$. A typical complete dependence $n(t)$ is shown in Fig. 4.

Conclusion

Thus the suggested model gives a correct description (at least at the level of qualitative properties) of the time dependence of the publication density caused by a particular topic. It contains a saturation region, which cannot be explained within the framework of an exponential model. We also see that the resulting dependence is not symmetric and has a characteristic "crest" at the boundary of the two regions. The solution of our equation for the second region, in contrast to the first, has no saturation condition; it describes a near-exponential decline that asymptotically tends to zero. This interesting aspect of the curve's behavior is observed in practice in some cases, but not in all of them.
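The qualitative features of the curve (saturation in the growth region, the crest at the boundary $t = \lambda$ and the asymmetric decline) can be reproduced from the closed-form solutions (8), (10) and (13) matched as in (5). The following is a minimal sketch under stated assumptions: the function names and the parameter values in `params` are hypothetical and are not fitted to any data from the paper.

```python
import math

def u(t, p, q, D, n0, tau):
    """Growth branch, equation (8)."""
    u_s = (p + D) / (p * q)  # saturation level, equation (9)
    return u_s / (1.0 + (u_s / n0 - 1.0) * math.exp(-(p + D) * (t - tau)))

def t_inflection(p, q, D, n0, tau):
    """Inflection point of the growth branch, equation (10)."""
    u_s = (p + D) / (p * q)
    return math.log(u_s / n0 - 1.0) / (p + D) + tau

def v(t, p, q, u_lam, lam):
    """Decline branch, equation (13); u_lam is the matched value u(lam)."""
    return u_lam / (q * u_lam + (1.0 - q * u_lam) * math.exp(-p * (t - lam)))

def n(t, p, q, D, n0, tau, lam):
    """Complete dependence, equation (5): u up to lam, v afterwards."""
    if t <= lam:
        return u(t, p, q, D, n0, tau)
    return v(t, p, q, u(lam, p, q, D, n0, tau), lam)

# Hypothetical parameter values, chosen only for illustration.
params = dict(p=0.3, q=2.0, D=0.6, n0=0.05, tau=1.0, lam=15.0)

print("inflection point:", t_inflection(params["p"], params["q"], params["D"],
                                        params["n0"], params["tau"]))
for t in (0, 2, 5, 10, 15, 20, 30, 60):
    print(t, round(n(t, **params), 4))
```

In this sketch the curve rises along an S-shaped branch toward $u_s = (p + D)/(pq)$, peaks at $t = \lambda$, and then relaxes toward $v_s = 1/q$; the two branches take the same value at $t = \lambda$ by construction.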
Experiments show the existence of two more types, which we will not discuss here. We only mention that the simplest realization of the model has been considered. It is possible that more complicated modifications will make it possible to describe all major types of real dynamics.

Fig. 4. Generalized picture of the dynamics of a thematic information flow

Cyclic processes of rise and fall in the activity of information resources present another problem of information dynamics; they are not directly connected with information factors (for example, the periodic decrease in the number of publications at weekends and on holidays). The question of the correlation between the solutions of the suggested logistic equations and the topic balance (1) remains open for research.

Therefore, we have every reason to state that the logistic model does describe the dynamics of a certain category of thematic data flows.

References

1. Braichevskii S.M., Lande D.V. Urgent aspects of current information flow // Scientific and Technical Information Processing. - USA: Allerton Press, Inc. - Vol. 32, part 6. - 2005. - P. 18-31.
2. Department of Defense Trusted Computer System Evaluation Criteria. - DoD, 1985.
3. Handbook for the Computer Security Certification of Trusted Systems. - NRL Technical Memorandum 5540:062A, 12 Feb. 1996.
4. A Guide to Understanding Covert Channel Analysis of Trusted Systems, NCSC-TG-030, ver. 1. - National Computer Security Center, 1993.
5. Gianna M. Del Corso, Antonio Gullí, Francesco Romani. Ranking a stream of news. International World Wide Web Conference. Proceedings of the 14th International Conference on World Wide Web. Chiba, Japan. - 2005. - P. 97-106.
6. Burton R.E., Kebler R.W. The "half-life" of some scientific and technical literatures. American Documentation 1960;1:98-109.
7. Cole P.F. Journal usage versus age of journal // J. Doc. - 1963. - Vol. 19, № 1. - P. 1-10.
8. Malthus T.R. An Essay on the Principle of Population. 1798 (Penguin Books, 1970).
9. Verhulst P.F. Notice sur la loi que la population suit dans son accroissement. Corr. Math. et Phys. 10, 113-121, 18.
10. Pearl R. The Introduction to Medical Biometry and Statistics. Philadelphia, 1930; Ibid. The Natural History of Population. L., 1939.
