Materials Research Activities

Ashley Rindsberg log: Cornell Center for Materials Research Archive Project March/April 2002

Cornell Center for Materials Research Archive Project


Ashley Rindsberg: Detailed Log for Months of March and April, 2002

Now that we are well into the actual scanning phase of the project, it's an opportune time to provide an update on how we've approached the project - in general and in its specifics - up to this point. This will be an informal log, broken down into clusters of weeks, in order to provide information about the different phases of the project rather than recounting activities day by day.

Early March

The first couple of weeks of work on the project mainly involved groping my way through the masses of documents, reports, proposals, correspondence etc. The files did have a filing key and appropriate number-tags attached to each file, but the system was not particularly helpful, probably largely because of the notoriously bad filing methods used ten to fifteen years ago. I pretty much literally just began opening drawers and, at random, pulling files to try to understand what kind of content each file contained. At the same time, I read bits of lengthy histories of the center written in the 1960s, 70s, 80s (particularly one by Robert Sproull) and 90s (Hartman, 1992). This helped me understand what different acronyms meant (NSF, MRL, MSC, DARPA, ARPA, REU etc.) and, more importantly, exactly how each division, program, and project was related - or unrelated - to the center.

The best source of information for this sort of thing, though, was not the long, anecdotal histories located in the Center's archives, but the people in the Center - particularly Helene. Helene was able to answer all my questions regarding different aspects of the Center and a good deal of its history. This was crucial to my understanding the files, which, in turn, was crucial to developing a hierarchy to order the scanned files.

At the same time, I was scouring the files for information on the past directors and associate directors of the center. This was extremely helpful. It gave my groping some sort of objective. More importantly, I was able to identify key members (Sproull, Sack, Leugrans, Silcox) who had also served as directors.

Late March

The first major task was developing the hierarchy I mention above. I did this using both my intuitions about the archive (i.e. thinking of myself as a prospective user of the digital archive) and what I knew of its contents. It turns out that most of the documents in the archive are correspondence of some sort: requests for membership to the Center, expenditure approvals, policy statements from the university, to name a few. So, I made "Correspondence" a sort of master category for the archive. This would take care of ordering a large chunk of the documents, so long as I'd be able to develop the hierarchy downwards (that is, into more specific, narrower categories). The rest of the categories were fairly straightforward. I found that there was a kind of document which took the form of an official or semi-official report; this included the huge annual reports written for the NSF, the quarterly reports (no longer written), and the sometimes long histories written by members of the Center.

As this was going on, Ivan and I began coordinating on the technical issues involved with scanning. I'd done some test scans which were really slow (because of the scanner's configuration) and tedious (because we didn't have any file-naming conventions).

Early April

Thus the development of file-naming conventions and coding. This was, and still is, Ivan's domain. I'll explain the progression of the naming system below. Ivan, Helene, and I sat down with the basic hierarchy I developed and began talking about file-naming and directory structure. Initially, we planned for a hierarchy that was 10 layers deep. So Ivan decided to begin each file name with a 10-character notation, using letters to code for each section of the hierarchy. For example,

RAXXXXXXXX-

initially indicated Reports:Annual, with the series of X's padding the unused positions in the code.

The next number was a 3-digit field for document serial - to code for which document (defining a document as the entire entity from start to finish, not just a page) each scanned-page was a part of. This extended the naming code to

RAXXXXXXXX-001,

for example. Next, Ivan worked in a 4-digit page code, also purely numerical. The decision here turned on whether any documents in the files exceed 1000 pages. I couldn't find a single one, so the code went from

RAXXXXXXXX-001-0001

to

RAXXXXXXXX-001-001.

Then came an 8-character code for the date - yyyymmdd (year, month, day) - and a 2-character alphanumeric code to indicate the author. This left us at

RAXXXXXXXX-001-001-20020419-AA

The last 24 (out of an available ~60) characters Ivan reserved for some sort of written, descriptive title.
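As a rough illustration of the convention at this stage, the Python sketch below assembles a name from its fields. The function name build_scan_name, its parameters, the input checks, and the hyphen placed before the descriptive title are all assumptions made for the example; the log does not say how the title was joined to the rest of the name, nor which date the date field records.

```python
from datetime import date


def build_scan_name(category, doc_serial, page, day, author, title=""):
    """Assemble a file name in the early-April layout (illustrative sketch only)."""
    if not 1 <= len(category) <= 10:
        raise ValueError("category code must be 1-10 characters")
    if not 0 <= doc_serial <= 999 or not 0 <= page <= 999:
        raise ValueError("document serial and page must fit in three digits")
    if len(author) != 2 or not author.isalnum():
        raise ValueError("author code must be two alphanumeric characters")

    # Pad the hierarchy code with X's up to the planned ten layers.
    code = category.upper().ljust(10, "X")
    name = f"{code}-{doc_serial:03d}-{page:03d}-{day:%Y%m%d}-{author}"

    # Assumption: the descriptive title follows a hyphen and is cut at 24 characters.
    if title:
        name += "-" + title[:24]
    return name


print(build_scan_name("RA", 1, 1, date(2002, 4, 19), "AA"))
# RAXXXXXXXX-001-001-20020419-AA
```

Without a title the name is 30 characters long, so a 24-character title brings it to 55, inside the roughly 60 characters available.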

The next phase of developing an effective file-naming convention involved refining the convention after having seen how well it worked with the actual documents. We quickly realized that the hierarchy wouldn't be as deep as we'd thought, so we cut 3 characters out of the initial sequence. We kept the page-number sequence at four characters because I mentioned the possibility (though slight) that a huge document might surface from the basement of Clark Hall, or wherever, and really make some trouble.

Ivan noticed, as I began to scan some initial documents, that the page numbers printed on the page do not necessarily reflect the number of pages in the document. There were frequently unnumbered filler pages in reports which threw off our pagination tag. Ivan therefore added an alphanumeric tag to account for these unnumbered pages. If an unnumbered page followed numbered page 213, for instance, it would be named

RAXXXXX-001-0213-A-...
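One way to read the unnumbered-page rule is sketched below in Python. It assumes the letter attaches to the last printed page number seen, as in the page-213 case, and that consecutive unnumbered pages take A, B, C and so on. The function name page_tags and the use of None for an unnumbered page are hypothetical, chosen only for this example.

```python
from string import ascii_uppercase


def page_tags(printed_numbers):
    """Return a page tag for each scanned page, in physical order.

    printed_numbers holds the number printed on each page, or None
    where the page carries no printed number.
    """
    tags = []
    last_number = 0   # unnumbered pages before any numbered page get 0000-A, ...
    extras = 0        # unnumbered pages seen since the last numbered page
    for number in printed_numbers:
        if number is None:
            if extras >= len(ascii_uppercase):
                raise ValueError("more than 26 unnumbered pages in a row")
            tags.append(f"{last_number:04d}-{ascii_uppercase[extras]}")
            extras += 1
        else:
            last_number, extras = number, 0
            tags.append(f"{number:04d}")
    return tags


print(page_tags([212, 213, None, None, 214]))
# ['0212', '0213', '0213-A', '0213-B', '0214']
```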

The most recent changes to the system were made with time and convenience in mind. Including actual page numbers in the file name is extremely time-consuming, and increases the amount of time needed for each scan.

In addition to the file naming convention, Ivan developed a scan-log so I can record what's been scanned in a day and what it's called. For me, the most useful aspect of the log has been to allow me to indicate missing pages of a document (which are a real pain). I added missing-page information so I'm able to identify the page, find it elsewhere, and scan it.

The file-naming conventions have worked really well, especially since we were able to make changes as we went along. The files are easy to locate and identify in their (electronic) folders, and the folders nicely mirror the actual hierarchy of the archive.

At the same time, Ivan and I were working on the scanner settings.

The scanner settings were relatively easy to configure. We did a test-scan series which ran from the lowest scanner setting - 4bit gray at 400dpi - to one of the highest, true color at around 600dpi. For most documents, there was virtually no difference in the quality of the scan or the printout of the scan. I included the printed test series in the packet I gave you during your visit. In terms of file size, the true color was about ten times the size of the 4bit gray. The other settings were not that much larger, but large enough to outweigh the small enhancements in appearance (an exercise in marginal utility).
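The size gap can be sanity-checked with a back-of-the-envelope calculation. The Python sketch below assumes an uncompressed image, a letter-size page (8.5 x 11 inches), and 24 bits per pixel for true color; none of these figures come from the scanner itself, and the function name raw_scan_bytes is made up for the example.

```python
def raw_scan_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page (assumed letter size)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


gray = raw_scan_bytes(400, 4)     # 4-bit gray at 400 dpi
color = raw_scan_bytes(600, 24)   # true color (assumed 24-bit) at 600 dpi
print(gray, color, color / gray)
# 7480000 100980000 13.5
```

Under these assumptions the true-color scan comes out 13.5 times larger, the same order of magnitude as the roughly tenfold difference seen in the test series; compression or slightly different settings could plausibly account for the gap.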

Ivan also changed the scanning hardware around so that, now, the scanner can continuously scan 10-20 pages without my having to sit at the computer and click the "Scan" button for each page. This didn't take care of all the problems, because the machine isn't sophisticated enough to provide the correct, sequential file number. It simply adds a "1", then a "2", etc. after whatever number comes last. Improving this could really help with productivity.
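One possible remedy is to rename each batch after the fact. The Python sketch below only plans the renames and touches no files. It assumes the scanner appends its counter directly to the name typed before the batch started; the .tif extension, the function name plan_renames, and its parameters are hypothetical, and the date, author, and title fields are left out for brevity.

```python
import os


def plan_renames(base, scanned, doc_prefix, first_page=1):
    """Pair each scanner-generated name with a sequential four-digit page name.

    base       -- the name typed before the batch; the scanner appends 1, 2, ...
    scanned    -- the names the scanner produced, in any order
    doc_prefix -- hierarchy code plus document serial, e.g. "RAXXXXX-001"
    """
    counted = []
    for name in scanned:
        stem, ext = os.path.splitext(name)
        counter = stem[len(base):]
        if not stem.startswith(base) or not counter.isdigit():
            raise ValueError(f"unexpected file name: {name}")
        counted.append((int(counter), name, ext))

    # Sort by the numeric counter so that 10 comes after 2, not after 1.
    counted.sort()
    return [
        (old, f"{doc_prefix}-{first_page + i:04d}{ext}")
        for i, (_, old, ext) in enumerate(counted)
    ]


batch = ["RAXXXXX-001-000110.tif", "RAXXXXX-001-00012.tif", "RAXXXXX-001-00011.tif"]
for old, new in plan_renames("RAXXXXX-001-0001", batch, "RAXXXXX-001", first_page=2):
    print(old, "->", new)
```

Returning the pairs rather than renaming immediately leaves room to check the plan against the scan-log before anything on disk changes.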

Speaking of that, I've found that I can scan about 75 pages in an hour - barring major problems. Right now, there are small, inevitable problems which take time to fix. For example, missing pages in the documents really mix things up. We've addressed this problem indirectly by no longer indicating the page number printed on the page in the file name. Instead, the page-number code in the file name merely records where the page falls in our scanning sequence.

Late April

Right now, I'm working to get things set up for the summer. We've decided it'd be best to take someone on for the summer (I'll be away for the duration). I'm currently searching out large, easy documents for the summer-scanner to scan, according to a schedule which I will devise for him. My hope is for him to complete a major category - e.g. "Reports" - by the end of the summer. This would be extremely helpful, especially considering that scanning these documents requires just that - scanning - and not special knowledge of the archive or of our document hierarchy.

In terms of time estimates for the completion of the project, it's hard to say. My hope is to have the project, or at least the bulk of it, finished by the end of next semester. I don't think this is unreasonable. Any time before then, though, seems very unlikely (at least with things set up the way they are). I'll try to work out a more precise estimate as soon as possible.

This page was last updated on 21 October 2002 by Arne Hessenbruch.