February 20, 2017

How Many Grains of Salt Must We Take When Looking at Metrics?

We all want to be scored. We want to know exactly where we stand. We want to know how much people like us. In other words, we want metrics. I am not an expert on human behavior, so I really can't explain the science behind this, but it seems a universal human condition.

Despite solid evidence that workers experience anxiety and are demoralized by being ranked in performance evaluations, companies face considerable resistance from staff when they try to get rid of ratings and rankings.

Despite a common-sense push to promote students in school based on ability rather than age, we have yet to find a creative way around giving grades and having kids take standardized tests to rate how they compare to their peers.

We are comfortable being rated and ranked and we insist on protecting the metrics that currently exist. Added to performance metrics are social metrics. How many “impressions” did my tweet get? What about “total engagements?” Did my “thought” perform better on Facebook? How many “reactions” did I get? How many of those were positive and how many negative?

With the exception of professional athletes, who literally make a living on whether they have positive stats, the only other professionals who seem absolutely enamored of metrics are academics.

All metrics come with an asterisk, a caveat, a grain of salt - or several grains of salt. What follows is a review of some of the strengths and limitations of different metrics around citation, attention, and usage. For each product or platform, I am making a judgment on how many grains of salt one should keep in mind when using the data. I decided to use this source for measuring the number of grains needed:
  • A pinch is a thousand grains of salt;
  • A cup holds a million grains of salt;
  • A bathtub holds a billion grains of salt;
  • and a classroom holds a trillion grains of salt.


Crossref

Citation information from Crossref is fast and relatively accurate, but dependent on publishers depositing XML references along with their Crossref deposits. Participating publishers deposit tagged and parsed references for each paper, chapter, etc. to Crossref. If the references are complete and properly formatted, Crossref tallies the citations from that reference list. Questions remain about the completeness of the literature: very old papers for which there are no Crossref records will not be deposited, and not all publishers are depositing reference data.

Users may find Crossref citation information displayed on journal article pages either as a simple number of citations or as a complete "cited-by" list. Crossref makes the information easy to access via an API. Sharing sites or indexing sites may also use Crossref citation information, but publishers must agree to share the data with the third party. So even among third-party platforms using Crossref citation data, users may see different numbers if a publisher chooses to opt out of allowing a third party to access its data from Crossref.
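To make the API point concrete, here is a minimal sketch of pulling the "cited-by" tally from Crossref's public REST API. The `/works/{doi}` endpoint and the `is-referenced-by-count` field are part of Crossref's documented API; the DOI shown in the usage comment is just a placeholder.

```python
# Minimal sketch of reading a Crossref citation count (assumes the public
# REST API at api.crossref.org; no authentication required).
import json
from urllib.request import urlopen


def citation_count(record: dict) -> int:
    """Extract the 'cited-by' tally from a Crossref /works API response."""
    return record.get("message", {}).get("is-referenced-by-count", 0)


def fetch_crossref_record(doi: str) -> dict:
    """Fetch the Crossref metadata record for a DOI (network required)."""
    with urlopen(f"https://api.crossref.org/works/{doi}") as resp:
        return json.load(resp)


# Usage (requires network; DOI is a placeholder):
# record = fetch_crossref_record("10.1000/example")
# print(citation_count(record))
```

Note that this only reports what publishers have deposited; a paper whose references were never sent to Crossref will show an undercount no matter how the API is queried.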

Grains of Salt - a pinch: For older papers, the citation counts may never be accurate if backfile content is not digitized, or if it is digitized but the references are not deposited into Crossref. "Most Cited" lists using Crossref data may be based on citations over a specific period rather than all time, and this is not always clear on the publisher or index site.


Web of Science

Citation information from Web of Science (WoS) is accessible by subscription only. That said, some publisher and index sites may use WoS data to display "cited by" information if they have paid to surface it. WoS only counts citations from publications indexed by WoS, which includes about 12,000* journals. The actual number of journals in existence is hotly debated, but the 2015 STM Association Report estimated it at about 28,000. WoS publishes a bi-weekly list of the journals included in the database.

Grains of Salt - a bathtub*: Due to the limited size of the WoS database, much caution should be used with its citation metrics. The citation database only goes back to 1990, so older content is not included. Because WoS puts publisher-supplied metadata to many different uses, processing seems less automated, and there can be a lag of several months before supplied articles appear in WoS.

*UPDATE: I have been informed that the WoS database includes 28,000 records, with 20,000 of them collecting citation information. I don't have a URL for this information; it comes from Clarivate Analytics staff. Given that the breadth is larger than originally reported in an earlier version of this post, I will upgrade the "grains of salt" status to a cup. 2/9/2017: I was contacted by a different Clarivate employee who provided these numbers: the full database has about 33,000 journals, but the Core Collection has only 17,500, and all of the citation metrics come from the Core Collection. Further, some selected journals go back as far as 1900. You can read more about the process here.


Scopus

Scopus provides a platform for subscribers to access raw citation information and generate their own analysis, and the recently launched CiteScore adds another level of metric crunching. While this is a spiffy new tool, it sits on top of existing Scopus database functionality for reporting citation counts. Scopus includes 23,000 journals, many more than WoS, and publishes an annual list of the journals included in the database.

Grains of Salt - a cup: While bigger than WoS, Scopus still omits many journals and is only available via subscription. Content goes back to 1996*, so citation information is missing for older content. The processing of publisher-supplied content into Scopus also seems less automated, again because specific file formats and tagging are not required. There can be delays before newly published content appears in Scopus, which causes slight delays in citation updates as well.

UPDATE: Scopus has been adding back content with indexed citations. The database currently includes citations back to 1970. Metadata records without citation data are available back to 1900.


Google Scholar

Google Scholar gets behind-the-paywall access to journals that agree to be crawled. I cannot find any statistics or reports on how many publishers allow this. Google uses the full-text crawl (of paywalled and open content) as well as data from its patent database to count citations. Anyone can access the data. In addition to patent citations, Google Scholar has been adding citations from other sources such as policy documents and preprint servers.

Google Scholar does not provide a list of the publications or web sites it indexes or collects citation data from, though some have tried to guess. Unlike WoS and Scopus, there is no curation process, so everything gets swept into the metrics, regardless of whether it is helpful or appropriate.

Grains of Salt - a classroom: While Google Scholar is free, giving it a leg up in accessibility, it is not at all transparent about what is actually being counted as a citation. Citations are updated when Google Scholar itself is updated. I know that sounds redundant, but the frequency at which things get updated is an open question. For example, here at ASCE, we noticed some issues with how some content was displaying in Google Scholar. We made the changes they suggested and were told that when they re-index the site later this year, those changes will be reflected. This seems to indicate that data is not updated in real time. In fact, the Google Scholar Citations page warns authors that any fix to citations "usually takes 6-9 months for the changes to be reflected in Google Scholar; for very large publishers, it can take much longer."

Google Scholar is also not the cleanest database. Author disambiguation and collapsing multiple versions of a paper into one record can be sloppy. It is very possible for the same paper to appear as multiple records, each with its own citation count, and the duplication is not at all obvious.
Alternative Metrics and Social Sharing Sites


Altmetric

Altmetric is probably the most popular tool for measuring "attention" metrics. In addition to citation data, it collects social media mentions, mainstream media mentions, public policy documents, and "saves" on reference management sites, namely Mendeley. It is also now counting mentions on post-publication peer review sites (such as Publons and PubPeer), Wikipedia, and social sites like Reddit and YouTube. Altmetric does perform some weighting of the mentions. For example, an author blogging about their own work counts less than Gizmodo blogging about the same work.
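The idea of source weighting can be sketched with a toy score. The source categories and weights below are invented for illustration only; Altmetric's actual weighting scheme is not fully public, which is precisely the point made in the next paragraph.

```python
# Toy source-weighted attention score, loosely in the spirit of Altmetric's
# approach. SOURCE_WEIGHTS is hypothetical; real weights are not published.
SOURCE_WEIGHTS = {
    "news_outlet": 8.0,
    "blog": 5.0,
    "policy_document": 3.0,
    "author_blog": 2.0,   # self-promotion counts for less
    "tweet": 1.0,
}


def attention_score(mentions: dict) -> float:
    """Sum mention counts weighted by source type; unknown sources count 0."""
    return sum(SOURCE_WEIGHTS.get(src, 0.0) * n for src, n in mentions.items())


# One news story outweighs several tweets under this toy scheme:
# attention_score({"news_outlet": 1}) -> 8.0
# attention_score({"tweet": 5}) -> 5.0
```

A consequence of any scheme like this is that changing the weights, or adding a new source category, silently changes every score, even for papers nobody is discussing anymore.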

Grains of Salt - a cup: Altmetric is not always transparent about how the little donut score for each paper was created, and the weighting is unknown. Further, it continues to add sources and change its weighting rules as more resources become available. This is not a bad thing, but it certainly means that the Altmetric score on a paper could change significantly over time, even if no one is talking about the paper anymore.


ResearchGate

ResearchGate, a social sharing site, does provide citation counts for papers, but it is not clear where this information comes from. On a spot check, it does not appear that ResearchGate is using Crossref data; it may instead be counting only citations within the full-text articles shared on its site.

Grains of Salt - a classroom: This is a big black box, and we have no idea what happens inside.

Downloads and Usage Metrics

While it may seem logical that the most cited papers would also be the most downloaded, this is often not the case. Still, authors are asking for more and more information about how many times their papers have been downloaded. This shift suggests to me that download counts are being included on CVs and in tenure and promotion packages. For clarity, I am counting a "download" as a full-text download of the PDF or a view of the full-text HTML.

Download information is only available from the journal platform. Subscription-based journals should provide consistent statistics following the COUNTER rules for reporting. COUNTER rules enable an apples-to-apples comparison across publishers and eliminate usage by bots and spiders. Not all open access publishers follow COUNTER rules, as they are not beholden to librarians for usage statistics.
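The flavor of COUNTER-style filtering can be sketched as follows: discard known robot traffic and collapse rapid repeat downloads of the same item by the same user into a single count. The bot list and the 30-second window here are illustrative stand-ins; the COUNTER Code of Practice defines the actual rules and maintains the official robot list.

```python
# Sketch of COUNTER-style download counting (illustrative thresholds only).
from dataclasses import dataclass

ROBOT_AGENTS = {"googlebot", "bingbot", "crawler"}  # stand-in, not the official list
DOUBLE_CLICK_WINDOW = 30  # seconds; illustrative threshold


@dataclass
class Download:
    user: str
    item: str
    timestamp: float  # seconds since epoch
    user_agent: str


def count_downloads(log: list[Download]) -> int:
    """Count downloads, dropping bot hits and collapsing double-clicks."""
    counted = 0
    last_seen: dict[tuple[str, str], float] = {}
    for d in sorted(log, key=lambda d: d.timestamp):
        if any(bot in d.user_agent.lower() for bot in ROBOT_AGENTS):
            continue  # filter bot and spider traffic
        key = (d.user, d.item)
        if key in last_seen and d.timestamp - last_seen[key] <= DOUBLE_CLICK_WINDOW:
            last_seen[key] = d.timestamp
            continue  # repeat click on the same item: count once
        last_seen[key] = d.timestamp
        counted += 1
    return counted
```

A platform that skips these steps will report inflated numbers, which is one reason raw counts from non-COUNTER sources are hard to compare.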

Grains of Salt - a bathtub: If a journal moves from one publisher to another, or if a publisher moves from one journal platform to another, download history may be lost. Total downloads are also undercounted because papers are shared in private groups (as in Mendeley), on commercial sites such as ResearchGate, and of course on illegal hosting sites like Sci-Hub and LibGen. The STM sharing principles call for building usage stats across legal sharing networks, but that infrastructure has not been built yet.

Total downloads (or full-text access counts) are also muddled by versions of articles in institutional repositories, funding agency repositories, and preprint/archiving servers. If an author has posted a version of a paper on multiple web sites, he or she may need to go to every site that reports downloads and add them up.


Does this mean we should give up on metrics?

No. Authors want to know about the impact they are making, and for some authors, career advancement depends partly on this information. That said, we are fully immersed in a culture of sharing, and authors are heavily encouraged, and in some cases mandated, to put multiple copies of their papers on multiple platforms. While this "spread the seeds" approach may sow more visibility, it can also make quantifying citations, views, and attention impossible.

What is important is to let authors and editors know the limitations of the data that are available.

Author: Angela Cochran
Twitter: <@ACOCHRAN12733>
Source: <https://scholarlykitchen.sspnet.org/>