Springer Publishing

Sunday 13 September 2015

Returning Google Search results in R - "mirex"

Introduction

When searching for specific information on the internet, the keywords we use often have multiple meanings. This is a problem when we rely on statistical measures to gather information quickly: if a search returns 5 million results, how many of them relate to the exact meaning you intended? How many different meanings does a single word have? Statistical metrics easily lose sight of this range of meanings unless it is managed appropriately.

Think of the English word "love" (about 5.7 billion results on Google) and how many websites are dedicated to it. You may be looking for a detailed explication of the Greek notions of love as eros and agape, but end up on someone's careless Facebook post where "love" is being used sarcastically. You may also be directed to companies or people whose names include the word "Love".

Chemical nomenclature searching

In the world of chemistry, language is also extremely important and very complicated. IUPAC chemical nomenclature is a kind of agglutinative language, but additionally, many chemicals have their own trade names and traditional names. Some of these names are so old and common that they have acquired many different meanings and contexts over the years.

When searching for information about a chemical called "mirex", a prohibited pesticide, it is important to know that PubChem alone has amassed 121 "synonyms" and alternate names for this molecule. Using R, we can record the estimated number of hits returned by Google Search for each synonym of "mirex". The number of hits tells us something about the popularity of the word, but it cannot tell us whether other, non-chemical meanings of the word are artificially inflating the result counts.

R code and example

The following R code returns the approximate number of Google Search results for each entry in a vector.

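library(XML)
library(RCurl)
# LIST: a character vector (or matrix column) of identifiers; replace this example
LIST <- c("mirex", "HRS 1276")
vec <- rep(NA_real_, length(LIST))
for (i in seq_along(LIST)) {
  # URL-encode the term so that spaces and other special characters survive the request
  url <- paste0("https://www.google.ca/search?q=", URLencode(LIST[i], reserved = TRUE))
  # Note: SSL verification is switched off here, as in the original version of this code
  page <- getURL(url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
  # Text of the result-count element, e.g. "About 16,400,000 results"
  results <- unlist(xpathApply(htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE),
                               "//div[@id='resultStats']", xmlValue))
  # Strip letters, commas and spaces; the entry stays NA if the element was not found
  if (length(results) > 0) vec[i] <- as.numeric(gsub("[A-Za-z, ]+", "", results[1]))
}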
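
The parsing step can be checked on its own, without any network access. The sketch below uses a hypothetical helper, parse_hits, which is not part of the code above. It assumes the count string looks like "About 16,400,000 results", possibly followed by other text, and it takes only the first run of digits and commas.

```r
# Hypothetical helper (an assumption, not part of the post's code):
# extract the first number from a result-count string.
parse_hits <- function(stats_text) {
  # Nothing to parse: return NA rather than failing
  if (length(stats_text) == 0 || is.na(stats_text[1])) return(NA_real_)
  # First run of digits, allowing commas as thousands separators
  m <- regmatches(stats_text[1], regexpr("[0-9][0-9,]*", stats_text[1]))
  if (length(m) == 0) return(NA_real_)
  as.numeric(gsub(",", "", m))
}

parse_hits("About 16,400,000 results")
parse_hits("About 973 results (0.45 seconds)")
parse_hits("No results found")
```

Taking only the first number is a design choice: it keeps any trailing text, such as a timing note, from being glued onto the count.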

For the 121 synonyms listed in PubChem for the pesticide "mirex", the results can be displayed in a bar plot:

barplot(vec,ylim=c(0,6e5))
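
With ylim fixed at 6e5, any count above 600,000 is cut off at the top of the plot window. One alternative, offered here as an assumption rather than the approach used for this post, is to plot the counts on a log10 scale so that small and very large counts fit in the same window. The helper name log_hits is hypothetical.

```r
# Hypothetical helper (an assumption, not part of the post's code):
# log10 transform; adding 1 keeps zero counts defined (they map to 0).
log_hits <- function(hits) {
  log10(hits + 1)
}

# Possible usage with the objects built earlier (vec and LIST):
# barplot(log_hits(vec), names.arg = LIST, las = 2, cex.names = 0.5,
#         ylab = "log10(hits + 1)")
```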


The two tallest bars extend well past the top of the plot window, into the millions. The synonym for mirex that returned the most hits (about 16,400,000) was "HRS 1276", searched without double quotation marks. One likely reason is that, without the quotation marks, the two parts need not be matched as a phrase: HRS can refer to "hotel reservation service", and "hrs" is an abbreviation for "hours", so the query can also match pages about "1276 hours". When enclosed in double quotation marks, "HRS 1276" returns 973 results. Terms like this can be said to have high "keyword search entropy", a concept I will explore at a later time.
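
The quoted and unquoted searches differ only in the query string sent to Google. A minimal sketch follows, assuming a hypothetical helper build_query_url (not part of the code above) and the same google.ca address used earlier. It wraps the term in double quotation marks when an exact-phrase search is wanted and then URL-encodes it.

```r
# Hypothetical helper (an assumption, not part of the post's code):
# build the search URL for a term, optionally as an exact phrase.
build_query_url <- function(term, exact = FALSE) {
  # Wrap in double quotation marks for an exact-phrase search
  if (exact) term <- paste0('"', term, '"')
  # reserved = TRUE also encodes characters such as & and =
  paste0("https://www.google.ca/search?q=",
         utils::URLencode(term, reserved = TRUE))
}

build_query_url("HRS 1276")
# "https://www.google.ca/search?q=HRS%201276"
build_query_url("HRS 1276", exact = TRUE)
# "https://www.google.ca/search?q=%22HRS%201276%22"
```

Running the loop once with exact = FALSE and once with exact = TRUE would give two counts per synonym, which makes cases like "HRS 1276" easy to spot.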

Summary

For a chemical with a high legal profile, such as mirex, it is important to choose appropriate search terms to find the information needed. Perhaps the search terms with the fewest hits are the most relevant ones. Perhaps "mirex" is the most popular synonym for the chemical, but what percentage of the hits returned by a Google Search for "mirex" relate to the pesticide, and what portion relate to something else? More hits do not necessarily mean more popular.
