Physically Chemist: June 2017

Thursday 8 June 2017

International Patent Classification as a printable treemap

The IPC (International Patent Classification) can be downloaded in *.xml format!

Being XML, it holds tree-like data: categories and subcategories of classes of patents. Tree-like data can be transformed into a treemap, which is a cool "square-ish" version of the same data, with each subcategory drawn as a rectangle nested inside its parent.
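To make the tree concrete, here is a minimal sketch of how one symbol breaks into levels. It assumes the fixed-width 14-character layout used in the valid symbols file (section, class, subclass, main group, subgroup). `split_ipc` is a hypothetical helper name, not part of any package, and the symbol is just an example in that layout.

```r
# Hypothetical helper: split one 14-character IPC symbol into its levels.
# Assumed layout: 1 = section, 2-3 = class, 4 = subclass,
# 5-8 = main group, 9-14 = subgroup.
split_ipc <- function(x) {
  c(section   = substr(x, 1, 1),
    class     = substr(x, 2, 3),
    subclass  = substr(x, 4, 4),
    maingroup = substr(x, 5, 8),
    subgroup  = substr(x, 9, 14))
}

parts <- split_ipc("A01B0001000000")
print(parts)

# Each level of the tree is a prefix of the level below it.
path <- Reduce(function(a, b) paste(a, b, sep = "."), parts, accumulate = TRUE)
print(path)
```

The dot-separated prefixes in `path` are the same kind of labels the treemap code builds with substr().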

So I used R to do it.

I made it printable!

Download the IPC valid symbols XML file, then run the following in R.

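library(XML)      # xmlTreeParse(), xpathApply(), xmlAttrs()
library(treemap)  # treemap(), treegraph()

# Parse the valid symbols file (replace YOURDIRECTORY with your own path)
symb<-xmlTreeParse("YOURDIRECTORY/ipc_valid_symbols_20170101/ipc_valid_symbols_20170101.xml",useInternalNodes=TRUE)
# Pull the "symbol" attribute from every node and keep the full 14-character symbols
symb2<-unlist(lapply(xpathApply(symb,"//*",xmlAttrs),function(x) x["symbol"]))
symb3<-symb2[grep("^.{14}",symb2)]

#Treemapping IPC xml classes
# Dot-separate the levels: section.class.subclass.maingroup.subgroup
c4<-sapply(symb3,function(x) paste0(substr(x,1,1),".", substr(x,2,3),".", substr(x,4,4),".", substr(x,5,8),".",substr(x,9,14)))
# One column per level, each a prefix of the next
treemapready<-cbind(substr(c4,1,1),substr(c4,1,4),substr(c4,1,6),substr(c4,1,11),substr(c4,1,18))
# One row per main group; the number of symbols in each becomes the tile size
maingroups<-unique(treemapready[,4])
treemapready4<-cbind(substr(maingroups,1,1),substr(maingroups,1,4),substr(maingroups,1,6),maingroups)
treemapready4df<-data.frame(treemapready4)
# Index the table by name so each count stays with its own group
tmr4df<-cbind(treemapready4df,as.numeric(table(treemapready[,4])[maingroups]))
colnames(tmr4df)<-c("a","b","c","d","e")
treemap(tmr4df,index=colnames(tmr4df[,1:4]),vSize="e")
treegraph(tmr4df,index=colnames(tmr4df[,1:4]))
## Printable version: white tiles on a letter-size page
pdf("ipc.pdf",width=8.5,height=11,paper='special')
treemap(cbind(tmr4df,f="#FFFFFF"),index=colnames(tmr4df[,1:4]),vSize="e",type="color",vColor="f")
dev.off()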

Check your working directory and go find the output PDF!
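The tile sizes are counts of symbols per main group. Here is a minimal sketch of that counting step on made-up labels, keeping each count attached to its own label rather than relying on row order. The labels and the object names `groups` and `counts` are illustrative only.

```r
# Toy main-group labels, deliberately out of alphabetical order.
groups <- c("B.01.D.0001", "A.01.B.0001", "B.01.D.0001", "A.01.B.0003",
            "B.01.D.0001")

# table() keeps every label next to its count, so turning the whole table
# into a data frame cannot pair a count with the wrong label.
counts <- as.data.frame(table(d = groups), stringsAsFactors = FALSE)
colnames(counts) <- c("d", "e")

# Rebuild the parent levels from the label itself.
counts$a <- substr(counts$d, 1, 1)
counts$b <- substr(counts$d, 1, 4)
counts$c <- substr(counts$d, 1, 6)
counts <- counts[, c("a", "b", "c", "d", "e")]
print(counts)
```

A data frame in this shape can be handed to treemap() with index=c("a","b","c","d") and vSize="e", the same column names used in the code above.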

As we say in Gaelic, "sin agad e!😀"

There you go!

Reading brand new Canadian patent applications into R

Part of a series of things I want to do with patents!

First, you need to import the data into R and find a good way of displaying it. So, go to the Canadian Patent Office Record and display an edition as "Accessible Format" (not as PDF). This makes it easy to read into R. Sort of.
Amid the fog, I found that the HTML tag <div> does an excellent job of dividing the sections. Copy and paste the following into R.

library(XML)
cipor1<-htmlTreeParse("http://www.ic.gc.ca/opic-cipo/cpor/eng/view.html?yearSelected=2017&editionSelected=23_Jun-06&docSelected=cpor&docType=weekly&fileFormat=HTML",useInternalNodes=TRUE)
#pages divided into <div>
#take the text of each <div> and split it at every "[51]" (the INID code that marks the Int.Cl. field)
divs<-lapply(xpathApply(cipor1,"//div",xmlValue),function(x) strsplit(x,"\\[51\\]"))

Change the URL in htmlTreeParse() if you would like to read a different edition.
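If you would rather not edit the long string by hand, a small helper can assemble it. This is a sketch only: `cpor_url` is a hypothetical function, the query parameter names are copied from the URL above, and the idea that other editions follow the same pattern is a guess from that single URL.

```r
# Hypothetical helper: build the URL for one weekly edition.
# Parameter names come from the URL in the post; whether other values of
# year and edition resolve to real pages is an assumption.
cpor_url <- function(year, edition) {
  paste0("http://www.ic.gc.ca/opic-cipo/cpor/eng/view.html",
         "?yearSelected=", year,
         "&editionSelected=", edition,
         "&docSelected=cpor&docType=weekly&fileFormat=HTML")
}

url <- cpor_url(2017, "23_Jun-06")
print(url)
```

The result can be passed straight to htmlTreeParse() in place of the literal URL.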

The 'divs' object is a list whose elements contain the tables and headings for each table.

ipcs<-unique(unlist(divs)[grep("^ Int\\.Cl\\.",unlist(divs))])

The object 'ipcs' contains the International Patent Classification descriptions for each patent. That's where we go next.
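To see what the split and the filter do without hitting the website, here is a minimal sketch on made-up text. Only the "[51]" marker and the " Int.Cl." prefix come from the code above; the rest of each string, and the names `toy_divs`, `pieces` and `toy_ipcs`, are invented for illustration, so real pages will look different.

```r
# Made-up text standing in for the xmlValue() of two <div> elements.
toy_divs <- list(
  "Heading [51] Int.Cl. A01B 1/00 (2006.01) [54] SOME TITLE",
  paste0("Other heading [51] Int.Cl. H04L 9/00 (2006.01) [54] ANOTHER TITLE ",
         "[51] Int.Cl. A01B 1/00 (2006.01) [54] SOME TITLE")
)

# Same split as above: "[" and "]" are escaped so they match literally.
pieces <- unlist(lapply(toy_divs, function(x) strsplit(x, "\\[51\\]")))

# Keep only the pieces that start with " Int.Cl." and drop exact duplicates.
toy_ipcs <- unique(pieces[grep("^ Int\\.Cl\\.", pieces)])
print(toy_ipcs)
```

Each surviving piece starts with the classification and still carries whatever text followed it in the same <div>, so further cleaning is needed before the symbols can be used on their own.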
