Learning, Living & Feeling with BIG Data

Click, Click, Click, Click. Cut and paste. Jump screens. Cut and paste. Scroll. Click, Click, Cut and paste. Scroll. Deep breath in. Jump screens. Scroll. Cut and paste. Exhale. Click, Click, Click, Click.

[This post is by Kat and Claudia]

Along with the cracking of compressed spines and groans when standing up far too infrequently to release aching muscles, this is the sound of digital archive work. Kat has been working on the dataset since March 2019 and Claudia joined in September. Seven research assistants[i] were employed at different times to assist – such was the volume of work. Together, we have been crafting, sculpting, fixing, mending, cleaning, translating, cataloguing, saving, discussing and compiling clothing patent data from 1820 through to today.

For months we worked with each other and with hundreds of thousands of digital files. We lived with, dreamt about, inscribed into our brains and bodies, and became excited and bored by the processing of data. We shared insights about the content (constant decisions to be made about what was and wasn’t relevant), the worry (about the time it was taking, and if we were doing it right), the responsibility (is everyone working too hard?), the process (how to ensure shared systems, how to do it better and quicker) and the delight (stumbling across exciting finds and laughing together about amazing inventions). We talked online, pinned things on walls and scribbled fieldnotes about the experience, in attempts to hold onto a period that we also at times desperately longed to pass.

Carolyn Steedman writes eloquently about archive fever:

Archive fever comes on at night, long after the archive has shut for the day. Typically, the fever starts in the early hours of the morning… What keeps you awake… is actually the archive, and the myriads of its dead, who all day long have pressed their concerns upon you… You think: I could get to hate these people; and then; I can never do these people justice; and finally: I shall never get it done (2002:17-18).

While we may never have hated the dataset, we definitely had feelings about it. We were exhausted by it, loved it, grew tired of it and worked on it at a feverish pace. What follows is a brief account of some of the challenges, processes and insights of the experience.

Challenges

European Patent Office (EPO) databases hold a remarkable collection of digitized patent records from over 100 global patent offices. Like all archives, and archival work, these provisions enable big data to be collated and analysed in a way that was not possible a few years ago. However, like all data it requires work – additions, cleaning and organising.

Global patent data presents unique challenges: patent systems change over time (a 200-year span) and across place (95 countries are represented in our sample). This requires complex piecing together of multiple sources, reflection and related research, and lots of time. The quality of records is further shaped by patent offices’ histories, changing classification systems and digital processes. As such, there are shifts in naming/numbering systems, missing data and translation issues that require close attention and repair (and recognition that not everything can be fixed at this point). The skills, developing expertise and tiredness of researchers and our collective ongoing decision-making processes also shape the data.

The POP team are sensitive to these differences and changing practices and seek as much as is possible to render visible the taxonomies and decisions (by patent offices and team members) that shape the resulting datasets. (The POP blog is one way we are doing this.)

Scale

The initial search for ‘A41 – Wearing Apparel’ generated a LOT of patent data. We reduced this down to under 300,000 rich text format (RTF) files by cleaning, fixing and reducing the noise. The size of the Work Package 1 dataset was not known prior to the project starting, as the classification ‘A41’ captures relevant patents (outerwear, protective garments, shirts, corsets, underwear etc) and also patents less relevant to the project (such as manufacturing methods, display and measurement devices and artificial trees and flowers).

And, more data…

We quickly became aware that many of the key clothing inventions we were interested in were absent from the A41 sweep. (For example, space suits are filed under B64G 6/00.) So, we did new IPC/CPC and multi-lingual keyword searches in PATSTAT and Espacenet. We checked and augmented data with other international databases – Google Patents, AUSPAT, PATENTSCOPE, IPONZ, CANADAPAT, WIPO etc. We also visited archives (British Library, Australian National Archives so far – with more planned when travel is possible). These non-A41 searches expand the scale and depth of the dataset. And, we continue to add more work as we format and file data to match.
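To make the gap concrete, here is a minimal sketch of why a sweep limited to one class misses relevant patents. It is an illustration only, not the search we ran in PATSTAT or Espacenet: the function name and the simple prefix matching are our own assumptions.

```python
def matches_any_class(classification: str, prefixes: list[str]) -> bool:
    """Return True if a classification mark starts with any of the given prefixes.

    Hypothetical helper: spaces and letter case are ignored so that
    'B64G 6/00' and 'b64g6/00' are treated the same way.
    """
    normalised = classification.replace(" ", "").upper()
    return any(normalised.startswith(p.replace(" ", "").upper()) for p in prefixes)


space_suit_class = "B64G 6/00"

# A sweep restricted to A41 does not pick up the space suit class...
print(matches_any_class(space_suit_class, ["A41"]))

# ...but widening the list of classes does.
print(matches_any_class(space_suit_class, ["A41", "B64G"]))
```

Each additional class or keyword widens the net in the same way, which is also why the dataset keeps growing.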

Digitisation

The EPO has been providing free access to digitised data via Espacenet since 1998, and this provision has grown from 30 million to over 100 million documents spanning vast international collections. In 2016, it expanded its offering with PATSTAT’s web interface. This digital treasure trove is highly valuable – it makes this type of ambitious project possible, and online access to data is especially useful in these challenging remote-working conditions. However, as indicated above, digitisation practices vary.

While many patent offices are increasingly advancing their digitised offerings, not all data is consistently available. Titles and abstracts can be omitted and, in some cases, the original patent PDFs are not yet online. Issues arise with machine learning content capture when originals are aged – with faded text, ink blots and smudges. Fs become Ss and Ts become t’. Sometimes, it is not possible to transliterate data. Canada’s patent offices, for example, digitized hand-written patents from 1869-1919 on microfilm which have then been scanned and cannot be machine read. In these cases, we have attempted to transcribe them into the dataset.

 


Fig. 1. Low quality scanning: CA69614A Pocket for Garments

 


Fig. 2. Hand writing: CA13974A Improvements in Skirt Adjusters

 

Fig. 3. Missing data: AR 224584 Improvements in Underwear and the like

Translation

The dataset was initially multi-lingual. Some data was partially or entirely in its original language. Some data had already been translated by online systems into English using Google Translate or the EPO’s translation service. While on the whole this ranges from satisfactory to excellent (we have a multi-lingual team), on occasion it resulted in confusing strings of text. This worsened when the initial scanning and digitisation was poor. As we quickly learned, some languages (such as Korean, Japanese and German) are less well served by these processes, due to their specificity and/or graphic nature.

Fig. 4. Poor translation: JPS5480764A Welding Protective Mask

Forwards and backwards

Crafting a dataset is a labour-intensive process. We organised hundreds of thousands of RTFs according to clear and persistent naming conventions of YEAR and COUNTRY. Data starts in 1836 and 94 countries are represented. Each RTF file includes patent number, link to PDF patent (on Espacenet), international classification mark (IPC/CPC), year, country, title and abstract. The latter is especially critical. The translations, additional data input and enormous amount of cleaning we have done mean the dataset is keyword searchable in a way that was not possible before. Some year folders took more than a week to organise and fix (2016 has 15,645 files). It was also not a linear process. We collectively had to think about how to resolve unexpected issues and often returned to ‘finished’ files to check and adjust. The data was ‘editing’ us and our ability to think about and respond to challenges.
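For readers who think in code, the shape of a record and the YEAR and COUNTRY convention might be sketched roughly as follows. This is an illustration under our own assumptions, not our actual tooling: the attribute names, the folder layout (year, then country, then patent number) and the sample records are all hypothetical placeholders.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PatentRecord:
    # The fields are those listed above; the attribute names are assumptions.
    patent_number: str
    pdf_link: str
    classification: str
    year: int
    country: str
    title: str
    abstract: str


def record_path(root: Path, record: PatentRecord) -> Path:
    """Hypothetical layout: <root>/<YEAR>/<COUNTRY>/<patent number>.rtf"""
    filename = record.patent_number.replace(" ", "") + ".rtf"
    return root / str(record.year) / record.country / filename


def keyword_search(records, keyword):
    """Return records whose title or abstract contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [
        r for r in records
        if needle in r.title.lower() or needle in r.abstract.lower()
    ]


# Placeholder records for illustration only; these are not real patents.
records = [
    PatentRecord("XX0000001A", "<link to PDF>", "A41D", 1900, "XX",
                 "Pocket for garments", "A pocket sewn into a skirt."),
    PatentRecord("YY0000002A", "<link to PDF>", "A41B", 1950, "YY",
                 "Improvements in underwear", ""),
]

print(record_path(Path("dataset"), records[0]).as_posix())
print([r.patent_number for r in keyword_search(records, "POCKET")])
```

The second placeholder has an empty abstract on purpose: a record like that can only be found through its title, which is part of why filling in missing titles and abstracts mattered so much for searchability.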

The lively archive

As the above examples show, the dataset is neither fixed nor static. Because of the character of the raw data, it is unique and constantly changing. We are constantly fixing, adding to and editing it. This is also happening in the larger context. In late 2019, Espacenet, now called Classic Espacenet, was updated and launched as Advanced Espacenet. The latter has more advanced search functionality, text viewing and thumbnail capabilities, amongst other things. These changes are welcome. However, our dataset files (as per all files in the PATSTAT database) link to original patent documents in Classic Espacenet. Given the size of the full PATSTAT database, redirecting these links is a vast undertaking and will take time.

Overall, we have found the process dynamic, surprising and physically taxing. Not only have we been shaping the dataset, but we in turn have been shaped by the dataset – learning, living, feeling with the archive.

We return to Steedman, who, in her archive fever over the course of months of work, writes: ‘You think in the delirium: it was their dust that I breathed in’ (1997:19). Our digital archival work is no less material or embodied, just differently pressed into our bodies and imprinted on our memories.

References: 

Steedman, Carolyn. 2002. Dust. Manchester University Press.

Steedman, Carolyn. 1997. Something She Called a Fever: Michelet, Derrida, and Dust. The American Historical Review, 106 (4): 1159-1180.

 

[i] Thank you to Ignacio, Laura, Jaice, Tara, William, Silvester and Ellie.
