Day 2 from APE — Data, Open Science, and the Future

The second day of the 2020 APE Meeting brings some surprising problems to the fore

Like the first day, Day 2 of the 2020 Academic Publishing in Europe (APE) meeting was jam-packed. It started with a strong session covering some of the implications of the rise of China, and the purchase of EDP Sciences by the Chinese publisher Science Press. (The head of Science Press spoke yesterday, and was well-received.) This China-focused “wake up session” consisted of an energizing give-and-take between Eefke Smit and Ed Gerstner of Nature. The conversation centered on whether Chinese publishers will become more Western in their norms and ethics, and whether Chinese publishers can participate in Open Science when their cell lines and other research variables are prohibited from leaving China.

The next section was about research data, and included the announcement of the 2020 STM Research Data Year. Smit was joined by Chris Graf (Wiley), Grace Baynes (Springer Nature), Joris van Rossum (STM Association), and Niamh O’Connor (PLOS). There was a roughly rehearsed discussion of the positive effects of requiring data availability statements (DAS), and how authors respond better to requirements than to requests (no surprise).

The panel repeatedly cited a 25% citation advantage for papers featuring a DAS, and claimed this incentive should drive publishers to require DAS for authors, and incentivize authors to comply. In other words, they hung a lot of laundry on this line.

I finally asked for a citation to the paper substantiating this claim, as bibliometric research is notorious for being both unreliable and hypothesis-friendly (i.e., massage the data hard enough, and it will say whatever you want).

It turns out the citation advantage claim hasn’t yet made it into a peer-reviewed paper, but is a preprint, and one that’s been kicking around a while. I didn’t get a chance to ask the panelists if any of them had actually read the preprint with a critical eye. Missed opportunity, I suppose.

I have given this manuscript more than a cursory look, and it is a bit of a letdown if you’re looking for evidence of a major citation benefit from DAS inclusion. Methodologically, the approach is a bit odd in that the authors used word-matching rather than DOIs to identify the candidate publishers’ journals. I don’t know how that affected reliability, but it can’t have helped.

The authors also, to their credit, downplay the citation effect, writing:

Turning our attention to the effect of DAS on citation advantage, we note that the . . . policies play a somewhat minor role.

Bottom line? The 25% citation advantage at best increases citations by 22-25% from a baseline of 1.xx, so it is very slight on an absolute basis. This all seems likely to be confounded by things most papers like this don’t account for: authors from high-prestige institutions, more interesting topics that generally garner more citations, or both.

In any event, the statistic this panel pointed to repeatedly comes from a preprint (that is, a claim that has not been peer reviewed) and has potentially consequential confounders the authors didn’t account for. To me, trumpeting this preprint as evidence of a generalizable claim seems like hanging your hat on a toothpick.

Later, there were more discussions of open data and Open Science, this time by a panel of librarians. Listening to them was comforting, as it felt like the world of Putin, Facebook, surveillance capitalism, measles outbreaks, misinformation, polarized politics, Brexit, Trump, Burisma, Biden, and Iran doesn’t exist, and that we’re operating in 1998, when data and scientific claims could frolic across the dial-up modems connecting the World Wide Web, free of worry.

During this, two questions came to mind, which I posed to the group:

  1. How do Open Science and open data account for having to exist in a world of surveillance capitalism, where sensational information or misinformation is preferred by the distribution and discovery services of Facebook, Google, and others? How does good science avoid the “tree falling in the woods” phenomenon?

  2. What if a data-centric equivalent of Sci-Hub emerges? That is, an entity immune from prosecution, harvesting data from studies and recombining it into profiles and surveillance packages it then gives away or sells, in essence weaponizing open data and Open Science? (I’m sure this and more already exists on the Dark Web, but to have it out in the “open” would be a whole different ballgame.)

No good answers were forthcoming, which brings me back to the morning session, where what’s called “surveillance research” was discussed. This type of research includes facial recognition, gait recognition, genotype profiling to generate phenotypic doppelgangers (yes, computers are being programmed to look at your DNA and now extrapolate your face with good reliability), fingerprints at a distance, and general machine learning and AI applications utilizing data without consent. Based on the discussion, the ethical dimensions for IRBs, editors, and publishers are not clear at all, nor is the path to rapidly developing appropriate policies.

One example given was of how photos uploaded to Flickr or the like and given by default a CC-BY license can be used by companies to train facial recognition software. Because the photos aren’t republished, there’s no requirement to acknowledge their source. But kids and others in vulnerable populations can be included in these, for benign or nefarious purposes. Should an editor publish a paper about a machine learning algorithm that uses such data? Should an IRB approve such human subject research in the first place? What kind of training dataset is possible for Big Tech now based on CC-BY articles? Is it being used for good? Wouldn’t licensing it — receiving revenues, requiring disclosure of purpose — be better?

There was also a fascinating story about researchers consciously falsifying data to obscure identities and protect privacy — if they didn’t, they felt human subjects could be identified too easily.

I came away thinking once again that we haven’t thought hard enough about Open Science — not only along the dimensions I recounted last week, but in an age of information manipulation, powerful computer media companies, aggressive nation-state actors seeking power and societal discord, and so forth. We’re in the midst of some combination of an information war and an information crime spree, yet behaving as if it’s not happening. We need to become more trusted ourselves, and perhaps a path to this is to trust Big Tech, nation-states, and others less.

We need to take a lot into account as we move into the future. It’s complicated. Our solutions will be better for taking a breath and doing some scenario planning.

The day wrapped up with a number of announcements about the future of APE, which I won’t recount here, but they are good for the meeting and for its founder.

Finally, the future more broadly was also addressed in a presentation from Michael Mabe, who served as the CEO of the International Association of STM Publishers from 2006 until late 2019. Mabe gave his predictions for what publishing may look like in 2030.

Not surprisingly, we came back around to China, with Mabe wondering if 2030 will see China participating in a US-Euro-centric publishing world, or the US and Europe participating in a China-centric publishing world. The audience suggested that a multi-center world is just as likely, and the history of European and US publishing exchanges suggests this is viable. However, the question of China’s willingness to be as open (culturally, economically, politically) as the West remains the big unknown.

Mabe also predicted that subscriptions and copyright will become irrelevant by 2030. I later caught up with him, and we began to discuss this last point with a small group from an empirical perspective. Now, owing to stringent copyright protections and the subscription model inherent to streaming services, the piracy-damaged music industry is resurgent economically and creatively. Subscriptions also are driving cell phone plans, streaming movies and television, and delivery service levels (Amazon Prime). Even content subscriptions (see bottom) are resurgent. Of all his predictions, this one seems the hardest to pin down — when money’s involved, people get clever, after all. It’s worth noting that Mabe’s 2010 predictions for 2020 were strikingly accurate.

The best news to come out of APE’s 15th iteration is that the meeting is set up to continue for years to come. Look for announcements on this. Plan accordingly.

That’s it from Berlin this year. Thanks for reading.
