# Data sources¶

Dissemin works with various data sources, providing bibliographic references, full texts and publisher’s policies. Most of these sources only provide a search API to expose their data: we do not store a copy of their database but perform calls to their external API when required. This has the advantage of keeping the local data storage needs very modest, but fetching the appropriate data from the APIs takes some time.

## CrossRef¶

CrossRef is an association of publishers, mainly in charge of issuing Digital Object Identifiers (DOIs) for academic publications. DOIs provide many useful features:

• Redirection to the publication’s page on the publisher’s website, with links of the form http://dx.doi.org/10.1103/physreve.89.033013. When a publisher changes the structure of its website, tells CrossRef where the resources have moved, updating the DOI proxy so that users are redirected to the new location.

• Associating metadata to DOIs, in a uniform format. The metadata associated with a given DOI can be retrieved using content negociation. This is useful to get the metadata associated with a DOI that we discover from other metadata sources. It works as follows:

\$ curl -LH "Accept: application/citeproc+json" http://dx.doi.org/10.1103/physreve.89.033013
{
"indexed":
{
"date-parts": [[2015,6,10]],
"timestamp": 1433897719282
},
"reference-count": 36,
"publisher": "American Physical Society (APS)",
"issue": "3",
{
"content-version": "vor",
"delay-in-days": 13,
"start": {
"date-parts": [[2014,3,14]],
"timestamp": 1394755200000
},
}

],
"DOI": "10.1103/physreve.89.033013",
"type": "journal-article",
"source": "CrossRef",
"title": "Small-scale anisotropic intermittency in magnetohydrodynamic turbulence at low magnetic Reynolds numbers",
"prefix": "http://id.crossref.org/prefix/10.1103",
"volume": "89",
"author": [
{
"affiliation": [ ],
"family": "Okamoto",
"given": "Naoya"
},
{
"affiliation": [ ],
"family": "Yoshimatsu",
"given": "Katsunori"
},
{
"affiliation": [ ],
"family": "Schneider",
"given": "Kai"
},
{
"affiliation": [ ],
"family": "Farge",
"given": "Marie"
}

],
"member": "http://id.crossref.org/member/16",
"container-title": "Physical Review E",
{
"intended-application": "syndication",
"content-version": "vor",
"content-type": "unspecified",
}
],
"deposited":
{
"date-parts":[[2015,4,13]],
"timestamp": 1428883200000
},
"score": 1,
"subtitle": [ ],
"issued":
{
"date-parts":[[2014,3]]
},
"URL": "http://dx.doi.org/10.1103/physreve.89.033013",
"ISSN":
[
"1539-3755",
"1550-2376"
],
"subject":
[
"Condensed Matter Physics",
"Statistical and Nonlinear Physics",
"Statistics and Probability"
]
}

• A search API, basically a machine-readable version of CrossRef Metadata Search. Similar metadata is returned for each search result. The documentation can be found here. By searching for an researcher’s name and browsing through the few first results pages, we get the metadata for most papers written by that researcher and registered at CrossRef. When a researcher is associated with an ORCID id, we also search for papers using their id. This can return papers that do not appear in the ORCID profile (CrossRef has introduced an auto-update feature in 2015 to populate automatically the profiles, but it is an opt-in feature and only applies to subsequent papers). This service only returns DOIs issued by CrossRef, the two other services also work for other DOI registration agencies such as DataCite or MEDRA. DataCite offers a similar service but we do not use it as they cover mostly data.

## SHERPA/RoMEO¶

SHERPA/RoMEO is a service run by JISC which provides a semi-structured representation of publisher’s self-archiving policies. They offer an API, whose functionality is very similar to the search service they offer to their regular users. You can search for a policy by journal or by publisher. Since some publishers have multiple archiving policies, RoMEO recommends to search by journal because it ensures that you will get the policy in place for this specific journal. The publishers app replicates most of RoMEO’s data model, by defining Journal and Publisher models with fields taken from SHERPA. We synchronize our model with SHERPA’s data every two weeks using their dumps.

For many journal articles and all conference papers, RoMEO knows the publisher but not the journal, and the metadata returned by CrossRef contains both the journal (or the proceedings title) and the publisher. We use therefore a two-step approach:

• We search for the journal: if it succeeds, we assign the journal and the corresponding policy to the paper.
• If it fails, we search for the a default policy from the publisher. Default policies are those which have a null romeo_parent_id.

Because the publisher names are not always the same in Crossref and SHERPA, we add heuristics to disambiguate publishers. We use the papers for which a corresponding journal was found in SHERPA and collect their publisher names as seen in Crossref. If we see that a given Crossref publisher name is overwhelmingly associated to a given SHERPA Publisher which is a default publisher policy (romeo_parent_id is null), then we also link Crossref papers with this publisher name but no matching journal to this SHERPA Publisher.

The task to update SHEPA policies is fetch_updates_from_romeo, from publishers.tasks. The first time it is run, it fetches the entire dump. The next time, it will only fetch the publishers which were updated since the last update. To force a full ingestion, nullify all the update dates on the RoMEO publishers with Publisher.objects.update(last_updated=None), and then run fetch_updates_from_romeo.

## ORCID¶

ORCID has a public API that can be used to fetch the metadata of all papers (“works”) made visible of any ORCID profile (unfortunately, very often, the profiles are empty). ORCID does not enforce any strict metadata format, which makes it hard to import papers in Dissemin. Specifically, works do not always have a list of authors (which is a shame given that this service is supposed to solve ambiguity of author names). Even worse, when an authors list is provided, the owner of the ORCID record is almost never identified in this list.

We try to make the most of the available metadata:

• If a DOI is present, fetch the metadata using content negociation ;
• If a Bibtex version of the metadata is available, parse the Bibtex record to extract the title and author names ;
• Otherwise, if no authors are given, skip the paper.

We then try to find which author is the owner of the ORCID record, using a dedicated name-matching heuristic (papers.name.shallower_name_similarity()). The name that matches the most the reference name of the ORCID record is assumed to refer to the record’s owner.

## Proaixy¶

Proaixy is our own OAI-PMH proxy. We use it to discover preprints. It harvests papers from various OAI-PMH sources (notably the sources handled by BASE, see complete list) and re-exposes the result in OAI-PMH, adding a few functionalities.

• Search by fingerprint: each paper in Dissemin has a fingerprint, a robust representation of the title, and sometimes the year of publication or the last names of the authors. Proaixy enables to fetch all records that match a given fingerprint.
• Search by author name or name signature: a similar feature to search for papers matching a given name.

## BASE and CORE¶

Interfaces for the search APIs of BASE and CORE have been implemented but are not used anymore. They basically search for a researcher’s name and go through the first few pages of results.