
Sunday 13 September 2015

Returning Google Search results in R - "mirex"

Introduction

When searching for specific information on the internet, the keywords we use often have multiple meanings. This is a problem when we try to use statistical measures to gather information quickly: if a search returns 5 million results, how many of them relate to the exact meaning we intended? How many different meanings does a single word have? Statistical metrics easily lose sight of this range of meanings unless it is managed appropriately.

Think of the English word "love" (About 5.7 billion results on Google) and how many websites are dedicated to it. You may be looking for a detailed explication of the Greek notions of love as eros and agape, but end up on someone's careless Facebook post where 'love' is being used sarcastically. You may be directed to companies or people whose names include the word 'Love'.

Chemical nomenclature searching

In the world of chemistry, language is also extremely important and very complicated. IUPAC chemical nomenclature is a kind of agglutinative language, but additionally, many chemicals have their own trade names and traditional names. Some of these names are so old and common that they have acquired many different meanings and contexts over the years.

When searching for information about a chemical called "mirex", a prohibited pesticide, it is important to know that PubChem alone has amassed 121 "synonyms" and alternate names for this molecule. Using R, we can record the estimated number of hits returned by Google Search for each synonym of 'mirex'. The number of hits tells us something about the popularity of a word, but it cannot tell us whether other, non-chemical meanings of the word are artificially inflating the count.

R code and example

The following R code returns the approximate number of Google Search results for each entry in a vector.

library(XML)
library(RCurl)
LIST<-c("mirex")  #your vector (or matrix column) of identifiers
vec<-c()
for(i in 1:length(LIST)){
#Fetch the Google results page for the i-th identifier and read the "resultStats" div
results<-unlist(xpathApply(htmlTreeParse(getURL(paste0("https://www.google.ca/search?q=",URLencode(LIST[i])),
ssl.verifyhost=F,ssl.verifypeer=F,
followlocation=T),useInternalNodes=T),"//div[@id='resultStats']",xmlValue))
#Strip everything except the digits from e.g. "About 16,400,000 results" and keep the number
vec[i]<-as.numeric(paste0(unlist(strsplit(results,"[A-Za-z, ]+")),collapse=""))
}
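
To populate LIST for this example, the synonym list for "mirex" can be pulled straight from PubChem's PUG REST service (the same service described in the 2 September post below); a minimal sketch:

#Fetch the PubChem synonyms for "mirex" as plain text, one synonym per line
synonyms_txt<-getURL("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/mirex/synonyms/txt",
ssl.verifyhost=F,ssl.verifypeer=F)
LIST<-unlist(strsplit(synonyms_txt,"\n"))
length(LIST)  #about 121 synonyms at the time of writing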

For the 121 synonyms listed in PubChem for the pesticide "mirex", the results can be displayed in a bar plot:

barplot(vec,ylim=c(0,6e5))


The two tallest bars extend well past the top of the plot window, into the millions. The synonym for mirex that returned the most hits (about 16,400,000) was "HRS 1276" (searched without double quotation marks). One reason is that, when the phrase is not enclosed in double quotation marks, HRS can be read as "hotel reservation service" or as "hrs", the abbreviation for 'hours', so the query effectively matches pages about '1276 hours'. When enclosed in double quotation marks, "HRS 1276" returns only 973 results. Terms like this can be said to have a high "keyword search entropy"--a concept I will explore at a later time.
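
To compare the quoted and unquoted forms of a synonym programmatically, the phrase can be URL-encoded before it is pasted into the search URL used in the loop above; a minimal sketch:

#URL-encode the phrase with and without surrounding double quotation marks
unquoted<-URLencode("HRS 1276",reserved=TRUE)    #"HRS%201276"
quoted<-URLencode("\"HRS 1276\"",reserved=TRUE)  #"%22HRS%201276%22"
paste0("https://www.google.ca/search?q=",quoted)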

Summary

For a chemical with a high legal profile, such as mirex, it is important to provide appropriate search terms to find the information needed. Perhaps the search terms with the smallest number of hits are the most relevant ones. "Mirex" may be the most popular synonym for the chemical, but what percentage of the hits returned by a Google Search for "mirex" relate to the pesticide, and what portion relate to something else? More hits does not necessarily mean more popular.

Monday 7 September 2015

Cucumber + cherry

When eaten together, cherry and cucumber complement each other. I wouldn't say that they directly enhance each other's flavours, but they seem to produce a slightly unique and positive flavour. That unique flavour, however, is not as strong as the natural cherry flavour still present.

The cherry flavour appears to dominate the combination just slightly and the cucumber flavour is almost overpowered.

Cherry and cucumber are somewhat close in texture because both are crunchy, so the combination is approximately equal to cucumber in texture while containing the full texture of both cucumber and cherry.

Thursday 3 September 2015

Using Google Books API and R to illustrate the general impact of a scientific work over time

In order to get a quick idea of how a book has affected scientific research over time, the Google Books API provides the data and R provides the visuals!

The book "The Carbohydrates", edited by Ward Pigman, is an example of a book that you might expect has had a significant impact on the landscape of chemical science over the years. If another book cites it, chances are Google Books will have a record. We can use the Google Books API to have a look.



R code:

library(XML)
library(RCurl)
library(RJSONIO)
result<-getURL("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)

#This returns a text object in R which consists of 10 results in JSON format.

parsed<-fromJSON(result)

totalcount<-parsed[[2]] ##the total number of results (totalItems)
parsed[[3]] ##all the listings for the 10 results (items)
parsed[[3]][[1]]$volumeInfo$publishedDate ##the date result number 1 was published

lapply(parsed[[3]],function(x) x$volumeInfo$publishedDate) ##the publishing date for all 10 books in the list

##Again, you will need to loop this with a new startIndex value each time until the total of 440 results is reached.
#Finally, categorize the book's impact over time by grouping the dates according to year (because
#this is most likely the only datum consistently available).
#The following loop will amass all the JSON returned.

totalcount<-fromJSON(result)[[2]] ##returns the total results number
list1<-list()
#Begin for loop
for(i in 0:floor(totalcount/10)){

#R lists are 1-indexed, so page i goes in position i+1 (list1[[0]] would be an error)
list1[[i+1]]<-getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=",(i*10)),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)

}

#The following loop will amass only the published date of results. Less data to save and more time between calls (which is a good thing for the servers).

totalcount<-fromJSON(getURL("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))[[2]] ##returns the total results number
vec<-c()
#Begin for loop
for(i in 0:floor(totalcount/10)){

vec<-c(vec,unlist(lapply(fromJSON(getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=",(i*10)),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))[[3]],function(x) x$volumeInfo$publishedDate)))

}

vec
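
#Since only the year is reliably present in publishedDate, a quick sketch of the
#"group by year" step: pull the four-digit year out of each date and tabulate it
#(the same idea is used in the comparison plots further down).
years<-unlist(regmatches(vec,gregexpr("[0-9]{4}",vec)))
table(years)  #number of related books per year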

#If you want the calls to be lighter and quicker, restrict the response to only the totalItems and publishedDate fields by appending the following to the URL

#&fields=totalItems,items/volumeInfo/publishedDate

#This will return only the total count and the dates.
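
#A minimal sketch of the first call with the fields parameter appended, so that only
#totalItems and each publishedDate come back:
result<-getURL("https://www.googleapis.com/books/v1/volumes?q=%22the%20carbohydrates%22%20pigman&startIndex=0&fields=totalItems,items/volumeInfo/publishedDate",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)
fromJSON(result)$totalItems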

#Display vec in R as a kind of timeline graph (the comparison plots below use base graphics rather than package igraph)

#As a saveable function. Pass your API key in double quotation marks and your query in double
#quotation marks (URL-encoded).

GBapi<-function(query,key){
totalcount<-fromJSON(getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=",query,"&startIndex=0&key=",key),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))[[2]] ##returns the total results number

list1<-list()
#Begin for loop
for(i in 0:floor(totalcount/10)){

list1[[i+1]]<-fromJSON(getURL(paste0("https://www.googleapis.com/books/v1/volumes?q=",query,"&startIndex=",(i*10),"&key=",key),ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T))

}

list1

}

Then use lapply() on the resulting list to pull out the data, as in the sketch below.
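
For example, if gbapi holds the list returned by GBapi() for a query (gbapi is the object name used in the comparison code below), the published dates can be collected like this:

#Collect every publishedDate from the result pages stored in gbapi (the GBapi() output)
dates<-unlist(lapply(gbapi,function(x) lapply(x$items,function(y) y$volumeInfo$publishedDate)))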



And for comparison:

#partition the plot space
par(mfrow=c(2,1))
#Plot one book first. xlim parameter makes sure the windows are the same size.

#gbapi holds the output of GBapi() for the 'Computational Chemical Graph Theory' query;
#extract the four-digit years from the published dates and tabulate them
plot(table(unlist(regmatches(unlist(lapply(gbapi,
function(x) lapply(x$items,function(y) y$volumeInfo$publishedDate))),
gregexpr("[0-9]{4}",
unlist(lapply(gbapi,
function(x) lapply(x$items,function(y) y$volumeInfo$publishedDate))))))),
ylab="Number of Books on Google Books",xlim=c(1800,2015))

title(main="Some books published per year
relating to
'Computational Chemical Graph Theory' by Trinajstic")
#plot the other book below
plot(table(unlist(regmatches(vec,gregexpr("[0-9]{4}",vec)))),xlim=c(1800,2015),
ylab="Number of Books on Google Books")
title(main="Some books published per year
relating to
'The Carbohydrates' by Pigman")

Hopefully this kind of metric provides a useful way to approximate the scholarly impact a book has had on other books. In the sciences, textbooks and other books have always had an authoritative quality to them, so this metric may indicate a certain kind of scientific influence that combines teaching, information gathering and reputation all in one.

Current difficulties are mostly related to the limits imposed on the user by the Google Books API. At a certain point, the number of books returned on a result page diminishes. A workaround for this is in the works.

Wednesday 2 September 2015

Using PubChem to match CAS numbers to identifiers

Using the PubChem REST API is the most straightforward approach for new users because everything is done through the URL.
CAS registry numbers are ubiquitous chemical identifiers used in many areas of industry. It is therefore important to be able to connect other chemical identifiers to CAS RNs, improving the visibility of chemicals on the internet.

1. Download data (containing CAS RN)

Domestic Substances List (Canada)
Non-Confidential TSCA Inventory (United States)

2. Search for each one through the PubChem REST API (leveraging R) and return the results as text.

"naphthenic acids"
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/1338-24-5/synonyms/txt

R code:

library(XML)
library(RCurl)
LIST<-c("1338-24-5")  #your vector of CAS RN
#paste0() is vectorized, so this builds one URL per CAS RN and getURL() fetches each one
synonyms<-getURL(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/",LIST,"/synonyms/txt"))

If you request the XML form of the endpoint instead (/synonyms/XML), use xmlTreeParse() on each entry of the vector to turn it into an XML document for slightly easier handling.

3. Create a list object in R by dumping the synonyms for each CAS RN into its own list entry.

For each XML document, get the value of each synonym node and save it as the i-th list entry, as in the sketch below.
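
A minimal sketch of this step, assuming the XML form of the synonyms endpoint (/synonyms/XML) is requested; the Synonym node name is my reading of the PUG REST response, so check it against an actual reply:

library(XML)
library(RCurl)
LIST<-c("1338-24-5")  #your vector of CAS RN
synonym_list<-list()
for(i in 1:length(LIST)){
#Parse the XML reply for the i-th CAS RN
doc<-xmlTreeParse(getURL(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/",LIST[i],"/synonyms/XML")),useInternalNodes=TRUE)
#Save the value of each synonym node as the i-th list entry
#(local-name() sidesteps the XML namespace that PUG REST declares)
synonym_list[[i]]<-unlist(xpathApply(doc,"//*[local-name()='Synonym']",xmlValue))
}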

OR

3b. Use the REST API to make a call for identifiers

"naphthenic acids"
http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/1338-24-5/property/InChI/TXT

3c. Append another column to the matrix containing the identifier (InChI, SMILES, etc.), as in the sketch below.
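
A minimal sketch of 3b and 3c together, assuming the CAS RNs sit in the first column of a character matrix called casmat (the object name is mine):

#Fetch the InChI for each CAS RN through the property endpoint (3b) and
#append the results to the matrix as a new column (3c)
inchi<-sapply(casmat[,1],function(cas) getURL(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/",cas,"/property/InChI/TXT")))
casmat<-cbind(casmat,InChI=gsub("\\s+$","",inchi))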



USEFUL POST:
http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem/