Springer Publishing

Thursday 3 September 2015

Using Google Books API and R to illustrate the general impact of a scientific work over time

In order to get a quick idea of how a book has affected scientific research over time, Google Books API provides that data and R provides the visual!

The Book "The Carbohydrates", edited by Ward Pigman, is an example of a book that you might think has had a significant impact on the landscape of chemical science over the years. If another book cites this one, chances are Google Books will have a record. We can use the Google Books API to have a look.



R code:

library(XML)
library(RCurl)
library(RJSONIO)
result<-getURL("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)

#This returns a text object in R which consists of 10 results in JSON format.

list<-fromJSON(result)

totalcount<-fromJSON(result)[[2]] ##returns the total results number
fromJSON(result)[[3]] ##returns all the listings for the 10 results
fromJSON(result)[[3]][[1]]$volumeInfo$publishedDate ##returns the date the book was published for result number 1.

lapply(fromJSON(result)[[3]],function(x) x$volumeInfo$publishedDate) ##returns the publishing date for all 10 books in the list.

##Again you will need to loop this with a new startIndex value each time until 440 is reached.
#Finally, categorize the book;s impact over time by grouping the dates according to year (because
#this is most likely the only datum consistently available.
#The following loop will amass all the JSON returned.

totalcount<-fromJSON(result)[[2]] ##returns the total results number
list1<-list()
#Begin for loop
for(i in 0:floor(totalcount/10)){

list1[[i]]<-getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=",(i*10)),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)

}

#The following loop will amass only the published date of results. Less data to save and more time between calls (which is a good thing for the servers).

totalcount<-fromJSON(getURL("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))[[2]] ##returns the total results number
vec<-c()
#Begin for loop
for(i in 0:floor(totalcount/10)){

vec<-c(vec,unlist(lapply(fromJSON(getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=",(i*10)),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))[[3]],function(x) x$volumeInfo$publishedDate)))

}

vec

#If you want to call quicker, use the URL to extract only the totalItems and publishedDate information by appending the following to the URL

#&fields=totalItems,items/volumeInfo/publishedDate

#This will return only the dates.

#Display vec in R as a kind of timeline graph using package igraph

#As a saveable function. Input your API key in double quotations and your query in double 
#quotations (URL-encoded).

GBapi<-function(query,key){
totalcount<-fromJSON(getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=",query,"&startIndex=0&key=",key),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))[[2]] ##returns the total results number

list1<-list()
#Begin for loop
for(i in 0:floor(totalcount/10)){

list1[[i+1]]<-fromJSON(getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=",query,"&startIndex=",(i*10),"&key=",key),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))

}

list1

}

And lapply() on the resulting list for the data.



And for comparison

#partition the plot space
par(mfrow=c(2,1))
#Plot one book first. xlim parameter makes sure the windows are the same size.

plot(table(unlist(regmatches(unlist(lapply(gbapi,
function(x) lapply(x$items,function(y) y$volumeInfo$publishedDate))),
gregexpr("[0-9]{4}",
unlist(lapply(gbapi,
function(x) lapply(x$items,function(y) y$volumeInfo$publishedDate))))))),
ylab="Number of Books on Google Books",xlim=c(1800,2015))

title(main="Some books published per year
relating to
'Computational Chemical Graph Theory' by Trinajstic")
#plot the other book below
plot(table(unlist(regmatches(vec,gregexpr("[0-9]{4}",vec)))),xlim=c(1800,2015),
ylab="Number of Books on Google Books")
title(main="Some books published per year
relating to

'The Carbohydrates' by Pigman")

Hopefully this kind of metric provides a useful way to approximate the scholarly impact a book has had on other books. In the sciences, textbooks and other books have always had an authoritative quality to them, so this metric may indicate a certain kind of scientific influence which may include teaching, information gathering and reputation all in one. 

Current difficulties are mostly related to the limits imposed on the user by the Google Books API. At a certain point, the number of books returned on a result page diminishes. A workaround for this is in the works.

No comments:

Post a Comment