Possibility and prevention of inappropriate data manipulation in Polar Data Journal
Stakeholders in the scientific field must always maintain transparency in the process of publishing research results in journals. Unfortunately, although research misconduct has stopped, certain forms of manipulation continue to appear in other forms…
Authors: Takeshi Terui, Yasuyuki Minamiyama, Kazutsuna Yamaji
Possibility and prevention of inappropriate data manipulation in Polar Data Journal

Takeshi Terui, National Institute of Polar Research, Research Organization of Information and Systems, Tachikawa, Japan, 0000-0002-0231-2193
Yasuyuki Minamiyama, Policy Data Lab, The Tokyo Foundation for Policy Research, Minato-ku, Japan, 0000-0002-7280-3342
Kazutsuna Yamaji, National Institute of Informatics, Research Organization of Information and Systems, Chiyoda-ku, Japan, 0000-0001-6108-9385

Abstract: Stakeholders in the scientific field must always maintain transparency in the process of publishing research results in journals. Unfortunately, although research misconduct has stopped, certain forms of manipulation continue to appear in other forms. As new techniques of scientific publishing develop, science stakeholders need to examine the possibility of inappropriate activity in these new platforms. The National Institute of Polar Research in Japan launched a new data journal, Polar Data Journal (PDJ), in 2017 to review the quality of data obtained in the polar region. To maintain transparency in this new data journal, we investigated the possibility of inappropriate data manipulation in peer reviews before the inception of this journal. We clarified inappropriate activity for the data in the peer review and considered preventive measures. We designed a specific workflow for PDJ. This included two measures: (i) the comparison of hash values in the review process and (ii) open peer review report publishing. Using the hash value comparison, we detected two instances of inappropriate data manipulation after the start of the journal. This research will help improve workflow in data journals and data repositories.

Keywords: Fraud Prevention, Hash Value, Peer Review Report

I. INTRODUCTION

All submitted data is reviewed for quality and provenance checks during the review process of the data journal.
Digital data tends to be larger or of higher resolution than what is printed in regular scientific papers. Digital data can be easily changed, moved, or copied on a personal computer, even when it is big data. Therefore, data journal publishers must pay special attention to inappropriate data manipulation during peer reviews and post-publishing reviews. Earlier, it was assumed that improper operations were mostly done on figures and tables in regular papers. Therefore, the checking system depended on human readability and the discrimination ability of the peer reviewers. However, we considered it impossible to check the whole data in a data journal by using the review process followed for regular papers. Therefore, it became necessary to have a data journal-specific review process and rules.

Polar Data Journal (PDJ) is a data journal launched by the National Institute of Polar Research in Japan in January 2017 [1]. PDJ performs quality control of various data observed in the polar regions using peer review; therefore, stakeholders including scientists can confidently use the published data. When we developed the review process in PDJ, we listed where inappropriate data manipulation might occur in each step from submission to publication, and we considered the measures necessary for each listed risk. By developing the framework of data publishing as per the recommendations of the Research Data Alliance standards, we designed a specific review process for PDJ to prevent inappropriate data manipulation [2]. The first step in the process was to calculate the hash value to confirm the identity of the data. The second step was to publish a peer review report on the PDJ website. PDJ is an advanced data journal, which implements these processes as part of the official peer review process.
In this research, we investigate the various kinds of inappropriate data manipulation and examine how to prevent them. We have succeeded in detecting data manipulation twice since the launch of PDJ. We present two case studies about prevention and discuss the influence of the new method on data journals.

II. METHOD

Fig. 1 is the latest flowchart of PDJ from the submission of a paper until its publication. PDJ uses the Editorial Manager (EM in Fig. 1) for the peer review process and uses the JAIRO Cloud [3] system for paper publication. PDJ does not have its own data repository service and uses external data repository services. This figure is included in the PDJ policy and must be followed as a workflow. PDJ also has an authorship policy and a data policy.

Fig. 1. The peer review process diagram of Polar Data Journal, updated from reference [1].

Footnote: Green Network of Excellence Program-Arctic Climate Change Research Project (GRENE-Arctic) and Arctic Challenge for Sustainability Research Project (ArCS) by the Japanese Ministry of Education, Culture, Sports, Science, and Technology.

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

A. Hash Value

The hash value is the fixed-length value that is output when specific data is input into the hash function. In practice this value is unique to the input data, because finding two different inputs with the same hash value is computationally infeasible; therefore, it is possible to detect tampering by comparing the hash values. PDJ uses SHA-256 as the hash function, which has a certification from CRYPTREC in Japan [4]. The input value of the hash function is the target data of the submitted manuscript.
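As an illustration of the hash value described above, the following is a minimal sketch of computing a SHA-256 digest for a data file with Python's standard `hashlib` module. It is not the PDJ secretariat's actual tooling; the function names, the file name `observation.csv`, and the sample contents are hypothetical and chosen only for the example.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo_single_byte_change():
    """Hash a hypothetical data file before and after a one-character edit."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "observation.csv")  # hypothetical file name
        with open(path, "wb") as handle:
            handle.write(b"time,temperature\n2017-01-01,-12.3\n")
        before = sha256_of_file(path)
        # Simulate a small, hard-to-spot edit to one value.
        with open(path, "wb") as handle:
            handle.write(b"time,temperature\n2017-01-01,-12.4\n")
        after = sha256_of_file(path)
    return before, after


if __name__ == "__main__":
    before, after = demo_single_byte_change()
    print("before:", before)
    print("after: ", after)
    print("changed:", before != after)
```

Even a one-character change in the file yields a different digest, which is what makes a simple equality check on two hash values sufficient to show that the data was altered.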
After this data is copied from the data repository to the JAIRO Cloud, the hash values of the data are compared by the PDJ secretariat (see data flow (4), (4'), and (14) in Fig. 1). This comparison makes it possible to know whether the data has been changed at each stage of the review.

B. Peer Review Report

The peer review report is a document that includes all comments and feedback in the review process. Open peer review reports are known from PeerJ, a journal of life and environmental sciences [5]. Nature Communications started to provide the option of publishing the reviewer reports in 2015 [6]. The open peer review report is expected to be a new approach to maintaining transparency. In PDJ, the peer review report is published when the author’s paper is published. The name and affiliation of the referee are disclosed in the report with the approval of the referee. The report also includes the hash values and the download links for the data.

III. INAPPROPRIATE DATA MANIPULATION

Inappropriate data manipulation is the fabrication, falsification, and plagiarism (FFP) of data. Digital data can be easily edited by using a personal computer. Inappropriate data manipulation may be done by any stakeholder of the manuscript. The PDJ editorial committee will judge whether the act of manipulation was intentional or unintentional after a detailed investigation. However, we must develop a method to detect the FFP in the system. Therefore, we need to know the situations in which inappropriate data manipulation might occur during the review process. We considered the possibility of data manipulation for each role in Fig. 1; this information is summarized in Table I. Based on this result, we considered how the hash value and the peer review report are useful for detecting each possibility.

TABLE I. EACH ROLE, FUNCTION, INAPPROPRIATE DATA MANIPULATION, THE PROCESS WHERE IT OCCURS IN FIG. 1, AND MEASURES IN PDJ

Role | Functions | Inappropriate Data Manipulation | Flow in Fig. 1 | Measures in PDJ
Author | Registering data into a trusted data repository; submitting a manuscript with data URL; revising the manuscript and data | Fake data registration | Pre-1 | Journal policy
Author | (as above) | Unauthorized data change during the review process | 2-13 | Hash value
Author | (as above) | Data change after the acceptance of the paper | 14-17 | Hash value
Referee | Reviewing a submitted paper and its data | Data plagiarism | 9 | Journal policy
Referee | (as above) | Comments that induce data edits for the referee’s benefit | 10 | Peer review report
Editor | Nominating the referee; judging manuscript accepted or not | Inappropriate referee nomination | 7 | Peer review report
Editor | (as above) | Notification of inappropriate review results | 12 | Peer review report
Data Repository | Publishing data; archiving data; providing the landing page | Data loss | Mainly after 13 | Hash value
Data Repository | (as above) | Data falsification | Mainly after 13 | Hash value
Data Repository | (as above) | Data fabrication | Pre-1 and after 13 | Journal policy
Secretariat | Supporting the peer review process | Procedural error | 2-17 | System implementation

A. Author

The author is the owner of both the manuscript submitted to the journal and the data registered in the data repository. The author has the authority for data manipulation and operation. The author must describe how the data was created and how to use it. Before submitting a manuscript, the data must be registered with the data repository. The manuscript with an accessible URL of the data is sent into the manuscript submission system by the author (Flow 1 in Fig. 1). The author is the owner of the data; therefore, the following operations are available:

- Fake data registration.
- Unauthorized data change during the review process.
- Data change after the acceptance of the paper.
The judgment of whether data is fake or not is fundamentally a check made by the referee; this process is similar to the process followed for ordinary journals. The hash value comparison does not work for fake data registration because there is no hash value to compare against. We considered that strange data, including false data, would be screened out by the referee’s comments. The PDJ policy requires the data format to be the standard format used by the scientific community; if it is not, the data usage must be described. Although this policy is for quality improvement and data reusability, it may also reduce the amount of work for the referee.

In a standard journal, the referee and the editor do not allow any modifications during the review period without authorization. Likewise, in a data journal the data should not be changed without the consent of the referee or the editor. However, the possibility of data manipulation due to the author’s mistakes cannot be excluded; therefore, data manipulation must be detected in the review process. We designed our process to confirm the identity of the data at the time of submission, revision, and acceptance (see data flow (4), (4'), and (14) in Fig. 1) by copying the data from the data repository and confirming the hash value.

Data renewal or update may occur after the manuscript acceptance due to progress in research. If the author changes the data on the data repository side, un-reviewed data is published, and it is no longer consistent with the paper. Therefore, we need a solution for end users to detect whether or not the data is peer-reviewed. PDJ’s peer review report includes a permanent download link and the hash value of the reviewed data. End users can obtain the hash value from the report and confirm whether or not the most recent data in the data repository is the same.

B. Referee

The referee is responsible for checking the content of the manuscript and the data quality. PDJ has adopted a single-blind peer review process. The referee can perform the following operations on the data:

- Data plagiarism.
- Comments that induce data edits for the referee’s benefit.

Unfortunately, plagiarism by peer reviewers has been reported [7]. Data copying is easy because a replica can be created by just a drag-and-drop operation. When the referee confirms the data, the referee performs this operation (see data flow (8) and (9) in Fig. 1). Therefore, it is possible for the referee to advance their own research by using this data, because the referee has early access to it. This act may be suspected to be data plagiarism. If the data is unpublished, this looks like plagiarism even if the referee does not regard it as such. To overcome this, the data repository would need various features (e.g., previewing in the Web browser, generating temporary download links, and tracking data). However, not all data repositories worldwide have these functions. It is unreasonable to demand complete confidentiality of data seen by a third party. We concluded that it is difficult to detect such referee-side actions in the review process because they are carried out in the referee's environment. There was no technical solution; therefore, we considered ways to reduce plagiarism using policy restrictions. We designed the PDJ rule to require open access for the data under review. With open access, the referee loses the advantages and motivations of early access.

The referee can indirectly induce data editing through the comments to the author (see data flow (10) in Fig. 1). Any comments that improve the data quality are useful.
However, comments that only benefit the referee are not, such as a comment that induces a specific data format conversion to suit the referee’s computer environment. The editor should prevent inductive comments that only benefit the referee, but such comments are not easy to find in specific scientific data because of the high expertise required. We designed the peer review report to include all the referee comments, which can be expected to suppress such unreasonable comments.

C. Editor

The editor plays the most responsible role among the journal stakeholders. Editors have substantial authority over both the author and the peer reviewer. Editors cannot directly manipulate data, but they can use their authority to contribute to inappropriate data manipulation as follows:

- Inappropriate referee nomination.
- Notification of inappropriate review results.

Inappropriate referee nomination (see data flow (7) in Fig. 1) will result in inappropriate data manipulation by the referee (see Section III-B). It is also difficult to detect this activity in the review process. The peer review report always states the editor’s name and feedback. If any manipulation is suspected, it can be traced back from the peer review report.

The editor has strong authority over the selection of peer review results (see data flow (12) in Fig. 1). If the editor reports a review result that has lost neutrality, there is no way to prevent the FFP within the workflow. Both mutual oversight and strong governance by the editorial board would prevent the FFP. Therefore, we concluded that this is outside the scope of the PDJ workflow design. It should be discussed as a matter of publication ethics. The most significant loss for authors who receive a rejection notice is the time and effort spent on the submission.
The authors are expected to recover their costs by submitting their work to another journal; however, this would require a similar review process. We expect the review processes in other journals to be simplified through mechanisms such as cascading peer review [8].

D. Data Repository

The data repository is an information service for registering and publishing data. PDJ recommends using a trusted data repository that has a free-access landing page, an open license policy, and a function for issuing persistent identifiers (e.g., a digital object identifier). The data repository can directly manipulate the registered data through applications on the server. Therefore, the following data operations can occur:

- Data loss
- Data falsification
- Data fabrication

These are always present as information security risks for the data repository. Such events would occur due to failure of, or unauthorized access to, the information system. In this study, we assume that the information security measures of the data repository are sufficient, and we discuss the possibility of these events arising from unexpected activity related to each role in the review process.

If data is overwritten when it is updated, the past data may be lost. Data updates by authors (as shown in Section III-A) may result in the loss of the identity of the peer-reviewed data. End users can confirm the data integrity from the hash value in the peer review report. The most important point here is whether the data from before the update remains in the data repository. The core trustworthy data repository requirements include a version control strategy [9], but they do not describe a specific requirement. The function required here is simply that all pre-update versions remain accessible. The download link of the reviewed data must be unique. The PDJ requirements for data repositories define these functions.
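The end-user check described above can be sketched as follows. This is a minimal illustration, not part of the PDJ system: it assumes the reader has copied the SHA-256 value from the peer review report and has downloaded one or more versions from the repository. The function names and version labels are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def matches_reviewed_version(downloaded: bytes, reported_hash: str) -> bool:
    """True if the downloaded bytes hash to the value copied from the report.

    The reported value is normalized because hex digests are often pasted
    with surrounding whitespace or in upper case.
    """
    return sha256_hex(downloaded) == reported_hash.strip().lower()


def find_reviewed_version(versions: dict, reported_hash: str):
    """Return the label of the archived version that matches the report,
    or None if no accessible version matches (hypothetical helper)."""
    for label, payload in versions.items():
        if matches_reviewed_version(payload, reported_hash):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical repository contents: the reviewed data and a later update.
    versions = {"v1": b"reviewed data", "v2": b"updated data"}
    reported = sha256_hex(b"reviewed data")
    print(find_reviewed_version(versions, reported))
```

If the repository keeps all pre-update versions accessible, a check of this kind lets the reader identify which version was peer-reviewed; if the old version was overwritten, no version matches and the function returns `None`.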
PDJ does not allow the use of a private data repository or a personal website.

Data falsification would occur due to improper operations or configuration on the data repository side. It often happens because of human error or a setting error on the data repository side. Comparing the hash value at each stage of the review process (see Section III-A) detects data falsification. If there is a difference in the hash values, the PDJ secretariat immediately contacts the editor. The editor starts investigating the data falsification. If unauthorized access or an attack is suspected, the PDJ secretariat immediately notifies the management organization of the data repository and requests a response to the information security incident.

Data fabrication can occur as a result of collusion between the data repository and the author. The occurrence of this event requires some mutually beneficial relationship between the data repository and the author. This act reduces the reliability of the data journal and creates a conflict of interest in the data repository. The independence of the repository also needs to be discussed, but this is not possible right now without sufficient information. Future research on repository-driven fraud is expected.

E. Secretariat

The secretariat is responsible for the administrative procedures of the review process. The secretariat handles contact with each role, hash value calculation, data confirmation, and the creation of the peer review report. The secretariat operates according to the journal workflow. This role cannot operate on the data directly. However, some losses may occur as a result of procedural errors or delayed actions, because the secretariat is involved in the entire workflow. PDJ has various services that automate the procedure to reduce human error.

F. Other Services

External information services are mostly used for the journal support system, including submission and publishing.
PDJ uses the EM for the review process and the JAIRO Cloud system for paper publishing. These services are not directly related to inappropriate data manipulation because they are independent of the PDJ editorial board. However, a general information security risk exists.

IV. CASE STUDY

PDJ released six articles in March 2019. Two inappropriate data manipulations have occurred since the launch of PDJ. Both manipulations were detected in the process of confirming the data identity using hash values before publication. We describe the survey results of the two cases.

A. Case 1: Download Link Generation Bug in Data Repository

Despite downloading data from the data repository using the same link, a different hash value was output each time the data was downloaded. The secretariat notified the data repository administrator of this event. The administrator examined the download link on the landing page. The investigation found that when a download is started on the landing page, compression of the registered data is started. Compressed data with different timestamps was created depending on the current date; therefore, the hash value of the downloaded data changed. We concluded that this case was a technical problem in the download implementation of the data repository. The secretariat requested the data repository to prepare a download link that always yields the same hash value. The data repository was improved so that it no longer creates data with different timestamps.

B. Case 2: Data Not Present in Data Repository

The author received comments from the referee and realized that it was necessary to correct the data. Therefore, the author directly contacted the data repository administrator to request withdrawal of the old data and registration of the new data. The administrator did not recognize that the data was under review and processed it as requested by the author.
At the time of revision, the author reported only the corrections to the manuscript and did not report the data correction. When the manuscript was accepted, a comparison of the hash values revealed that the data had been modified after submission. The editorial chairperson asked the data repository administrator to clarify the details of the data update because of the possibility of serious data falsification. As a result, the above situation became clear. After interviewing the author, we realized that the author was not aware that data modifications without notification are prohibited during the review process. We concluded that this was not a malicious act because the author lacked the necessary information. The author was negligent without being malicious. The data repository performed high-level data manipulation, including deletions and changes, after being contacted directly by the author. For this reason, the PDJ secretariat requested the data repository to define strict operation rules and procedures. The secretariat decided to always compare the hash values of the data at the time of revision.

V. SUMMARY

This study presents the possibility of inappropriate data manipulation in PDJ. We also considered how to prevent this manipulation (Table I). The primary measures that we used, such as a strict policy and governance, were the same as the FFP measures for ordinary journals, but the data-specific part was given a new design. As shown in the case studies, a comparison of the hash values succeeded in detecting inappropriate data manipulation. It is necessary to compare the hash values to detect unauthorized data changes during the review process. The secretariat calculates the hash value in the PDJ workflow, but the calculation flow is not yet integrated into the system. It needs to be implemented in the information services related to the data journal.
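As a rough sketch of what an integrated calculation flow might look like, the following code compares the hash of the data copied at each review stage with the hash from the previous stage and flags changes that were not announced. This is an assumption-laden illustration, not the PDJ implementation: the `StageRecord` type, the `revision_declared` flag, and the stage names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StageRecord:
    stage: str                       # e.g. "submission", "revision", "acceptance"
    data: bytes                      # copy of the data fetched from the repository
    revision_declared: bool = False  # hypothetical flag: the author announced a data change


def undeclared_changes(records):
    """Return the stages whose data hash differs from the previous stage
    although no data revision was declared for that stage."""
    flagged = []
    previous_hash = None
    for record in records:
        current_hash = hashlib.sha256(record.data).hexdigest()
        changed = previous_hash is not None and current_hash != previous_hash
        if changed and not record.revision_declared:
            flagged.append(record.stage)
        previous_hash = current_hash
    return flagged


if __name__ == "__main__":
    # A situation resembling Case 2: the data is replaced during revision
    # without the change being reported to the journal.
    history = [
        StageRecord("submission", b"original data"),
        StageRecord("revision", b"corrected data"),
        StageRecord("acceptance", b"corrected data"),
    ]
    print(undeclared_changes(history))
```

Comparing at every stage, rather than only at acceptance, reports the mismatch at the stage where the change happened, which is in line with the secretariat's decision to always compare the hash values at the time of revision.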
To support hash value comparison, the data repository also needs to list the hash value on the landing page together with the data download link. It is not yet clear how well the peer review report will work, because it aims at a social deterrent effect on the referee and the editor. Further experimentation is required. These measures will eventually become obsolete with advances in information technology. It is essential to maintain the reliability of data journals and data disclosure by accumulating various cases in the future.

ACKNOWLEDGMENT

We want to thank the PDJ editorial board members and the secretariat for fully implementing the measures we designed for the data journal. We also thank Dr. Akira Kadokura of the Research Organization of Information and Systems for feedback after the introduction.

REFERENCES

[1] Y. Minamiyama, T. Terui, Y. Murayama, H. Yabuki, K. Yamaji, and M. Kanao, “Launching a new data journal "Polar Data Journal": Toward a new data publishing framework for polar science,” J. Inf. Process. Manag., vol. 60, no. 3, pp. 147–156, 2017.
[2] C. C. Austin et al., “Key components of data publishing: using current best practices to develop a reference model for data publishing,” Int. J. Digit. Libr., vol. 18, no. 2, pp. 77–92, 2017.
[3] A. Maeda, H. Kato, N. Takahashi, and K. Yamaji, “JAIRO Cloud as a system infrastructure,” J. Coll. Univ. Libr., vol. 103, pp. 9–15, 2016.
[4] H. Imai and A. Yamagishi, “CRYPTREC Project Cryptographic Evaluation Project for the Japanese Electronic Government,” in International Conference on the Theory and Application of Cryptology and Information Security, Springer, 2000, pp. 399–400.
[5] I. Hames, “The changing face of peer review,” Sci Ed, vol. 1, no. 1, pp. 9–12, Feb. 2014.
[6] “Transparent peer review at Nature Communications,” Nat. Commun., vol. 6, p. 10277, Dec. 2015.
[7] C. Laine, “Scientific misconduct hurts,” Ann. Intern. Med., vol. 166, no. 2, pp. 148–149, Apr. 2017.
[8] E. F. Barroga, “Cascading peer review for open-access publishing,” Eur Sci Ed, vol. 39, pp. 90–91, 2013.
[9] I. Dillo and L. De Leeuw, “CoreTrustSeal,” Mitteilungen der Vereinigung Österreichischer Bibl. und Bibl., vol. 71, no. 1, pp. 162–170, 2018.