CiteSeerX API search

Is there a way to access CiteSeerX programmatically (for example, to search by author and/or title)? I am surprised that I cannot find anything relevant; surely others are also trying to get metadata for scientific articles without resorting to scraping?

EDIT: Note that CiteSeerX supports OAI-PMH, but that appears to be an API aimed at digital libraries keeping each other up to date ("content dissemination"), and it does not specifically support search. Moreover, the information on that CiteSeerX page is very sparse, and it even says: "Currently, there are difficulties with OAI."

There is another question about the CiteSeerX API (although not specifically about search); its 2 answers do not solve the problem (one talks about Mendeley, another piece of software, and the other says that OAI-PMH implementations are free to offer extensions to the minimal specification).

Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?

api web-scraping metadata
1 answer

As one of the commenters suggested, I first tried JabRef:

    jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"

However, JabRef does not seem to realize that the query string needs to contain colons, and so it throws an error.

For search results, I ended up scraping the CiteSeerX results with Python and BeautifulSoup:

    import urllib.request
    from bs4 import BeautifulSoup

    # author_last and title hold the search terms, e.g. "Brewer" and "lessons from"
    url = "http://citeseerx.ist.psu.edu/search?q="
    q = "title%3A%28{1}%29+author%3A%28{0}%29&submit=Search&sort=cite&t=doc"
    url += q.format(author_last, title.replace(" ", "+"))
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
    # Take the first hit in the result list
    result = soup.html.body("div", id="result_list")[0].div
    title = result.h3.a.string.strip()
    authors = result("span", "authors")[0].string
    authors = authors[len("by "):].strip()  # drop the leading "by "
    date = result("span", "pubyear")[0].string.strip(", ")
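Hand-written percent-escapes in the query string are easy to get wrong. As a minimal sketch, the same search URL can be built with the standard library's `urllib.parse`. The helper name `build_search_url` is hypothetical, and the parameters (`q`, `submit`, `sort`, `t`) are simply the ones used in the snippet above, not a documented API.

```python
from urllib.parse import urlencode

SEARCH_BASE = "http://citeseerx.ist.psu.edu/search"


def build_search_url(author_last, title):
    """Hypothetical helper: build a CiteSeerX search URL for a title/author pair.

    The query syntax "title:(...) author:(...)" is taken from the scraping
    snippet; urlencode handles the escaping of colons, parentheses and spaces.
    """
    query = "title:({}) author:({})".format(title, author_last)
    params = [
        ("q", query),
        ("submit", "Search"),
        ("sort", "cite"),  # sort by citation count, as in the snippet
        ("t", "doc"),
    ]
    return SEARCH_BASE + "?" + urlencode(params)
```

Calling `build_search_url("Brewer", "lessons from")` yields a URL that can be passed to `urlopen` in place of the string concatenation.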

You can get the document identifier from the results (misleadingly named "doi=..." in the URL of the result link) and then pass it to the CiteSeerX OAI engine to get Dublin Core XML (for example, http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however, the XML ends up containing several dc:date elements, which makes it less useful than the scraped output.
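The two steps in that paragraph, pulling the identifier out of a result link and reading the Dublin Core record, could be sketched roughly as follows. The function names are hypothetical, and `SAMPLE` is a hand-written stand-in for illustration, not a real CiteSeerX response. The URL layout is copied from the example URL, and the namespace is the standard Dublin Core one.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_BASE = "http://citeseerx.ist.psu.edu/oai2"
DC_NS = "{http://purl.org/dc/elements/1.1/}"  # standard Dublin Core namespace


def doc_id_from_link(href):
    """Return the identifier that appears as 'doi=...' in a result link, or None."""
    match = re.search(r"[?&]doi=([0-9.]+)", href)
    return match.group(1) if match else None


def oai_record_url(doc_id):
    """Build the GetRecord URL for one document, keeping ':' unescaped."""
    params = [
        ("verb", "GetRecord"),
        ("metadataPrefix", "oai_dc"),
        ("identifier", "oai:CiteSeerX.psu:" + doc_id),
    ]
    return OAI_BASE + "?" + urlencode(params, safe=":")


def dc_values(xml_text, element):
    """Collect the text of every dc:<element> in the record, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(DC_NS + element)]


# Hand-written sample (an assumption, not real output) with two dc:date elements.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:date>2009-04-19</dc:date>
  <dc:date>1999</dc:date>
</record>"""
```

With a record like `SAMPLE`, `dc_values(SAMPLE, "date")` returns every date in the record, so the caller still has to decide which one is the publication year; that ambiguity is the drawback mentioned above.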

Too bad CiteSeerX makes people resort to scraping, despite all the open archives / open access rhetoric.
