Springer Publishing

Thursday 8 June 2017

Reading brand new Canadian patent applications into R

Part of a series of things I want to do with patents!

First, you need to import the data into R and find a good way of displaying it. So, go to the Canadian Patent Office Record and open an edition in "Accessible Format" (not as PDF). This makes it easy to read into R. Sort of.
Amid the fog, I found that the HTML tag <div> does an excellent job of dividing the sections. Copy and paste the following into R.

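library(XML)
# Fetch and parse one edition of the record (the "Accessible Format" HTML page)
cipor1 <- htmlTreeParse("http://www.ic.gc.ca/opic-cipo/cpor/eng/view.html?yearSelected=2017&editionSelected=23_Jun-06&docSelected=cpor&docType=weekly&fileFormat=HTML", useInternalNodes = TRUE)
# The page is divided into <div> sections: take the text of each one
# and split it at every "[51]", the marker that sits just before "Int.Cl."
divs <- lapply(xpathApply(cipor1, "//div", xmlValue), function(x) strsplit(x, "\\[51\\]"))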

To read a different edition, change the URL passed to htmlTreeParse().
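If you swap editions often, it may be handier to build the URL from its parts. Here is a minimal sketch: cpor_url() is a hypothetical helper of mine, not part of the XML package, and it assumes the site keeps the same query parameters as the URL above. The edition string is simply copied from the address bar of the edition you opened.

```r
# Hypothetical helper: rebuilds the URL used above from its query parameters.
# Assumes the site keeps the same parameter names as in the original URL.
cpor_url <- function(year, edition) {
  paste0(
    "http://www.ic.gc.ca/opic-cipo/cpor/eng/view.html",
    "?yearSelected=", year,
    "&editionSelected=", edition,
    "&docSelected=cpor&docType=weekly&fileFormat=HTML"
  )
}

# Reproduces the address of the edition parsed above.
url_jun06 <- cpor_url(2017, "23_Jun-06")
```

The result can be passed straight to htmlTreeParse() in place of the long string.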

The 'divs' object is a list with one element per <div>. Each element holds the text of that section (the tables and the headings for each table), split into pieces at every "[51]".
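To see what the split does without hitting the website, here is a small sketch on a made-up string. The text in toy_div is invented for illustration and only imitates the bracketed markers; it is not real patent data.

```r
# Made-up text standing in for the xmlValue() of one <div>; not real data.
toy_div <- paste(
  "[11] 0,000,001 [51] Int.Cl. A01B 1/00 (2006.01) [54] TOY TITLE",
  "[11] 0,000,002 [51] Int.Cl. B02C 2/00 (2006.01) [54] ANOTHER TITLE"
)

# Same split as in the post: break the text wherever "[51]" occurs.
toy_split <- strsplit(toy_div, "\\[51\\]")

# strsplit() returns a list with one character vector per input string,
# which is why the post calls unlist() before searching the pieces.
pieces <- toy_split[[1]]
print(pieces)
```

Every piece after the first begins with " Int.Cl.", which is what the next step searches for.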

ipcs<-unique(unlist(divs)[grep("^ Int\\.Cl\\.",unlist(divs))])

The object 'ipcs' contains the unique pieces that begin with " Int.Cl.", that is, the International Patent Classification descriptions of the patents in the edition. That's where we go next.
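The filtering step can be tried on its own, too. This sketch uses an invented vector (toy_pieces) shaped like the flattened output of the split. It also uses grep(value = TRUE), which returns the matching strings directly instead of their positions, as an alternative to the indexing used above.

```r
# Made-up pieces, shaped like unlist(divs); not real patent data.
toy_pieces <- c(
  "[11] 0,000,001 ",
  " Int.Cl. A01B 1/00 (2006.01) [54] TOY TITLE",
  " Int.Cl. A01B 1/00 (2006.01) [54] TOY TITLE",
  "Some heading text"
)

# Keep only the pieces that start with " Int.Cl." and drop duplicates.
toy_ipcs <- unique(grep("^ Int\\.Cl\\.", toy_pieces, value = TRUE))

# Same result as the indexing form used in the post.
same_as_post <- unique(toy_pieces[grep("^ Int\\.Cl\\.", toy_pieces)])
print(toy_ipcs)
```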
