Extract abstract / full text of a scientific article given its DOI or title

There are quite a few tools for extracting text from PDF files [1-4]. However, the problem with most scientific articles is that the PDF is hard to access directly, mainly because it sits behind a paywall. There are also tools that give easy access to information about a paper, such as its metadata or BibTeX entry [5-6]. I want to go a step further than BibTeX / metadata:

Assuming there is no direct access to the publication's PDF file, is there any way to get at least the abstract of a scientific article given its DOI or title? While searching, I found that there have been attempts at similar things [7]. Does anyone know a website / tool that can help me get / extract the abstract or full text of scientific articles? If there is no such tool, can you give me some tips on how I should go about solving this problem?

thanks

[1] http://stackoverflow.com/questions/1813427/extracting-information-from-pdfs-of-research-papers
[2] https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf
[3] http://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf?lq=1
[4] http://stackoverflow.com/questions/14291856/extracting-article-contents-from-pdf-magazines?rq=1
[5] https://stackoverflow.com/questions/10507049/get-metadata-from-doi
[6] https://github.com/venthur/gscholar
[7] https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar
+5
3 answers

You can look at the Crossref text and data mining (TDM) service ( http://tdmsupport.crossref.org/ ). Crossref provides a free RESTful API, and the TDM service covers more than 4000 publishers. Examples can be found at the link below:

https://github.com/CrossRef/rest-api-doc/blob/master/rest_api_tour.md

But to give a very simple example:

If you go to the link

http://api.crossref.org/works/10.1080/10260220290013453

You will see that, in addition to some basic metadata, there are two other fields: license and link. The first states under which license the publication is provided, and the second gives the URL of the full text. In our example, the license field shows a Creative Commons (CC) license, which means the publication can be used for TDM purposes. Searching Crossref for publications with CC licenses yields hundreds of thousands of publications with their full texts. From my latest research, I can say that Hindawi is the friendliest publisher: they alone provide over 100,000 publications with a CC license. Finally, the full texts can come in XML or PDF format. Of these, XML is highly structured, so it is easy to extract data from it.

To summarize, you can automatically access many full texts through the Crossref TDM service by using its API and simply writing a GET request. If you have further questions, feel free to ask.
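
As a rough illustration of the GET request described above, here is a minimal Python sketch using only the standard library. The function names (fetch_work, license_and_links, is_creative_commons) are made up for this example, and the Creative Commons check is just a substring test on the license URL, which is my own simplification and not something Crossref defines.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.crossref.org/works/"


def fetch_work(doi):
    """GET the Crossref record for a DOI and return its 'message' dict."""
    url = API_URL + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return json.load(response)["message"]


def license_and_links(work):
    """Collect license URLs and (content-type, URL) pairs of full-text links.

    Both fields are optional in a Crossref record, so missing ones
    simply yield empty lists.
    """
    licenses = [entry["URL"] for entry in work.get("license", [])]
    links = [(entry.get("content-type"), entry["URL"])
             for entry in work.get("link", [])]
    return licenses, links


def is_creative_commons(license_urls):
    """Heuristic (an assumption): a CC license URL points at creativecommons.org."""
    return any("creativecommons.org" in url for url in license_urls)
```

For the DOI from the example, you would call license_and_links(fetch_work("10.1080/10260220290013453")) and then download whichever link has the content type you prefer, provided the license allows it.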

Greetings.

+4

If the article is on PubMed (which contains about 25 million documents), you can use the Entrez package for Python to get the abstract.
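
To show roughly what such a lookup involves, here is a sketch that does not use the Entrez package itself. Instead it calls the NCBI E-utilities efetch endpoint (the service such packages wrap) with only the standard library, and it assumes you already know the article's PubMed ID. The function names are made up for this example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid):
    """Build the efetch URL that returns one PubMed record as XML."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def abstract_from_xml(xml_text):
    """Join the text of every AbstractText element, one section per line.

    Structured abstracts have several AbstractText elements, and each may
    contain inline markup, hence itertext() rather than .text.
    """
    root = ET.fromstring(xml_text)
    parts = ["".join(node.itertext()).strip()
             for node in root.iter("AbstractText")]
    return "\n".join(part for part in parts if part)


def fetch_abstract(pmid):
    """Download the record for a PubMed ID and return its abstract text."""
    with urllib.request.urlopen(efetch_url(pmid)) as response:
        return abstract_from_xml(response.read())
```

An empty string from fetch_abstract means the record has no abstract in PubMed.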

0

Crossref might be worth checking out. They allow members to include abstracts with the metadata, but this is optional, so coverage is not complete. According to their help desk, when I asked, abstracts are available for approximately 450,000 DOIs registered since June 2016.

If an abstract exists in their metadata, you can retrieve it using the UNIXREF format. Here is one specific example:

 curl -LH "Accept:application/vnd.crossref.unixref+xml" http://dx.crossref.org/10.1155/2016/3845247 
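
If you would rather do this from Python, the following is a rough equivalent of the curl call, followed by a small parser that pulls the abstract out of the returned XML. I am assuming the abstract sits in an element whose local name is "abstract" (in the records I have seen it is namespaced as jats:abstract), so the parser ignores the namespace. The function names are made up for this example.

```python
import urllib.request
import xml.etree.ElementTree as ET

UNIXREF = "application/vnd.crossref.unixref+xml"


def fetch_unixref(doi):
    """Request the UNIXREF record for a DOI; redirects are followed automatically."""
    request = urllib.request.Request("http://dx.crossref.org/" + doi,
                                     headers={"Accept": UNIXREF})
    with urllib.request.urlopen(request) as response:
        return response.read()


def abstract_from_unixref(xml_data):
    """Return the text of the first 'abstract' element, or None if there is none.

    Assumption: the element's local name is 'abstract'; any namespace
    prefix is stripped before comparing.
    """
    root = ET.fromstring(xml_data)
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == "abstract":
            # Collapse the whitespace left over from nested paragraph tags.
            return " ".join("".join(element.itertext()).split())
    return None
```

Calling abstract_from_unixref(fetch_unixref("10.1155/2016/3845247")) corresponds to the curl example; a result of None means the publisher did not deposit an abstract.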
0
