“Big Data” is oftentimes used as a buzzword to make “Simple Data” appear fancy. Still, the definition is quite clear: to qualify as “Big Data,” a dataset must be too large or too complex for traditional data-processing applications to handle.
Software technologies develop rapidly. What is “traditional” today was non-traditional yesterday. And what is new and advanced today can become traditional tomorrow.
The literature suggests several dimensions along which a dataset can be too large or too complex. In the context of IP industry data, the most relevant are volume, variety, and veracity. The importance of these dimensions has changed over time, so we need to incorporate a time perspective to judge a dataset’s “Big Data-ness.”
Traditionally, it was the sheer volume of data points that made it right to call IP data software applications Big Data tools. I remember that even 10 years ago, while I was working at the Institute of Innovation Management at LMU Munich, we regularly struggled to store and process worldwide patent information in an acceptable amount of time, largely due to the immense size of the database with over 100 million patent publications. Algorithms ran for days on the most advanced servers we could buy on the market. At that time, the datasets were commonly too large for traditional data-processing approaches to handle.
Nowadays, we can process 140 million patent publications in just a couple of seconds on a traditional database like PostgreSQL, the standard relational database used in most applications. I rarely see single queries that take an unreasonable amount of time anymore; when they do, it usually means the setup is poorly configured or the methods are not chosen appropriately.
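To make this concrete, here is a minimal sketch of such a query in Python against PostgreSQL. The table name, columns, and connection string are hypothetical; the point is simply that, with a suitable index, a straightforward aggregation over a table of this size comes back in seconds rather than days.

```python
# Minimal sketch: aggregating over a large patent table in PostgreSQL.
# Table name, columns, and connection string are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=patents user=analyst")
cur = conn.cursor()

# With a B-tree index on filing_year, an aggregation like this can
# often be answered from the index alone, even over ~140M rows.
cur.execute("""
    SELECT filing_year, COUNT(*)
    FROM patent_publications
    WHERE filing_year BETWEEN 2010 AND 2020
    GROUP BY filing_year
    ORDER BY filing_year;
""")
for year, n in cur.fetchall():
    print(year, n)

cur.close()
conn.close()
```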
By now, the issue is mostly variety, which makes processing the information difficult and complex. In the IP industry, we mostly deal with structured datasets (even in full-text analysis) that are much simpler to use than unstructured ones. The 100+ patent offices generally follow a similar format, and there have been quite helpful initiatives to harmonize formats across patent data sources, e.g., the WIPO standards ST.36 and ST.96. So we are not really lacking a proper definition of the structure.
The issues we are currently facing arise from what is pressed into that structure. To give a real-life example: at Patent-Pilot, we process agent information. Some offices collect law firm names, some collect attorney names only, and some collect both. Some collect it in English and the local language; some offer automated translations or transliterations. Address information is oftentimes not available at all. We might see the original information from the time of application, or updates along the way without any indication that an update happened. Yet it is always pushed into the same XML fields. In my eyes, more effort should be dedicated to collecting exactly the same data worldwide in the same format, and to agreeing on transforming backfiles accordingly, so that the collected information becomes truly comparable.
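A hedged sketch of what this looks like in practice: the tag names, office codes, and mapping rules below are purely illustrative (not actual WIPO-standard elements), but they show how the very same field must be interpreted differently depending on which office filled it.

```python
# Illustrative only: tag names and per-office rules are invented, not
# taken from any WIPO standard. The point: one XML field can hold a law
# firm name at one office and an attorney name at another.
import xml.etree.ElementTree as ET

NAME_MEANS_FIRM = {"EP", "CN"}      # assumption for illustration
NAME_MEANS_ATTORNEY = {"US", "JP"}  # assumption for illustration

def parse_agent(xml_snippet: str, office: str) -> dict:
    root = ET.fromstring(xml_snippet)
    name = (root.findtext("name") or "").strip()
    return {
        "office": office,
        "firm_name": name if office in NAME_MEANS_FIRM else None,
        "attorney_name": name if office in NAME_MEANS_ATTORNEY else None,
        # Addresses are frequently missing altogether.
        "address": root.findtext("address"),
    }

print(parse_agent("<agent><name>Smith &amp; Partner</name></agent>", "EP"))
# {'office': 'EP', 'firm_name': 'Smith & Partner',
#  'attorney_name': None, 'address': None}
```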
We have built our own algorithms – by the way, based on HI (= Human Intelligence) rather than Artificial Intelligence – to process the datasets. In other words, most of the work in “Big Data” currently revolves around the standardization of diverse and inconsistent raw data. Here, volume is no longer an issue but, surprisingly, even an incredibly helpful factor that allows us to better decide how to deal with the variety and clean the data. The more data you have, the easier it is to reliably spot the proper or “clean” entries among similar entries and correct the wrong ones. It also improves the training datasets used in machine-learning algorithms.
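As a minimal sketch of how volume helps with cleaning (the blocking key and sample data are simplifying assumptions, not our production logic): group near-duplicate name variants under a crude normalization key and let the most frequent spelling win – a rule that only becomes reliable once you have enough observations per name.

```python
# Sketch of volume-assisted cleaning: among near-duplicate variants,
# the most frequent spelling is usually the "clean" one.
from collections import Counter

def canonicalize(names: list[str]) -> dict[str, str]:
    """Map each raw variant to the most frequent variant sharing its key."""
    def key(name: str) -> str:
        # Crude blocking key (an assumption): lowercase alphanumerics only.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    groups: dict[str, Counter] = {}
    for name in names:
        groups.setdefault(key(name), Counter())[name] += 1

    return {
        variant: counts.most_common(1)[0][0]
        for counts in groups.values()
        for variant in counts
    }

raw = ["Smith & Partner", "Smith & Partner", "SMITH & PARTNER", "Smith&Partner."]
print(canonicalize(raw))
# All three variants map to the majority spelling "Smith & Partner".
```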
The future of Big Data for the profession is certainly most interesting and deserves special attention. I strongly believe that we are currently only scratching the surface of what data can provide to IP practitioners. Big Data will be less a simple source of “information” and instead become crucial for “insights.”
Veracity, i.e., making sense of the data, is the new key Big Data dimension. Patterns are impossible to spot by looking at single data points, or even when processing, grouping, or aggregating the various sources using traditional models. Pattern recognition is very important for what is, in my eyes, the ultimate goal: supporting decision-making based on data.
What does this mean concretely in our use case? From our data, we can reveal which IP law firms work for which clients. We can also identify a law firm’s networks and, for example, whether or not a firm is seeking balanced international relationships (reciprocity).
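A minimal sketch of one such measure (the formula and the numbers are illustrative, not our actual metric): score how balanced the case exchange between two firms is.

```python
# Illustrative reciprocity score between two firms, based on how many
# cases each sends the other. The data below is made up.
def reciprocity(sent_a_to_b: int, sent_b_to_a: int) -> float:
    """1.0 = perfectly balanced exchange, 0.0 = entirely one-sided."""
    total = sent_a_to_b + sent_b_to_a
    if total == 0:
        return 0.0
    return 1.0 - abs(sent_a_to_b - sent_b_to_a) / total

# Firm A sent 40 cases to firm B; firm B sent 35 back.
print(reciprocity(40, 35))  # ~0.93: a fairly balanced relationship
```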
Going one step further, we have built advanced algorithms that allow us to identify whether the client or the IP firm is in charge of the decision-making in international case exchange – a detail that is not available in the raw data as such. This is the power of Big Data.
Ultimately, “insights” would mean processing the information algorithmically to recommend the right decision, for example, which law firms or clients to work with worldwide. In other words: given my own firm’s strategy, who are the ideal partners out there? Where can we identify patterns, by comparing one portfolio against all other law firm and client portfolios, that make a partner or client appear to fit? Conflicts of interest, technical white spots, mutual business relationships, strategic goals, and many other factors can be used to guide the user in the decision-making process – this is what we are currently working on.
The example discussed above relates to supporting a law firm’s internal decision-making. Other applications might help IP attorneys improve the quality of the advice they give their clients during the application process.
In any case, technically, this means moving away from patent publication-based datasets and toward transformed datasets that dynamically calculate figures like client market-share trends or similar applicants. Comparing thousands of law firm portfolios, millions of patent applicant portfolios, or several technical areas in parallel is not an easy task for any database. N×N matrices are still difficult to handle for most modern database systems. This creates the complexity that truly justifies referring to such data-processing methods as too complex for traditional applications – and hence qualifies them as Big Data.
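To illustrate the scale (the shapes and data are made up; this is not our pipeline): if each portfolio is represented as a vector of filing counts per technology class, an all-pairs comparison is a single matrix product – and the resulting N×N matrix stops fitting in memory long before you reach the full market.

```python
# Sketch of the N-by-N comparison problem: pairwise cosine similarity
# between law-firm portfolios, each a vector of filing counts per
# technology class. Shapes and random data are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_firms, n_classes = 5_000, 650        # e.g., roughly IPC-subclass scale
portfolios = rng.poisson(1.0, size=(n_firms, n_classes)).astype(float)

# Normalize rows, then one matrix product yields all N*N similarities.
norms = np.linalg.norm(portfolios, axis=1, keepdims=True)
unit = portfolios / np.clip(norms, 1e-12, None)
similarity = unit @ unit.T             # shape (n_firms, n_firms)

# At full market scale (tens of thousands of firms, millions of
# applicants), this dense matrix no longer fits in memory – exactly
# the "too complex for traditional applications" threshold.
print(similarity.shape, similarity[0, :5])
```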
In a nutshell, advances in technology have started to change the perception of Big Data applications and will continue to do so. This will lead to deeper information processing for IP professionals and give rise to new software products that further support IP attorneys in their decision-making processes.