Mining the Electronic Documents for Local Collections
by Raleigh Muns

Transcript of a talk delivered at the Spring, 1995 Depository Library Council Meeting and Federal Depository Conference, Wednesday, April 12, 1995, Arlington, Virginia

OUTLINE

-. Some initial quotes about information
I. Who am I? In which a personal and individual context is set.
II. Why am I doing what I am doing? In which motivation and opportunity are explored.
III. What am I doing? In which the overall approach is explained.
IV. How am I doing it? In which some nuts and bolts are examined.
V. Bells and whistles In which some fancier things about mining and providing are explained.
VI. Risks In which some unforeseen problems are put forth.
VII. Results In which feedback on activities is presented.
VIII. Conclusion

Some initial quotes about information:

When action grows unprofitable, gather information; when information grows unprofitable, sleep.
-Ursula K. Le Guin (The Left Hand of Darkness (1969), ch. 3).

Information is the oxygen of the modern age. It seeps through the walls topped by barbed wire, it wafts across the electrified borders.
-Ronald Reagan (Guardian; London, 14 June 1989).

The government is us; we are the government, you and I.
-Theodore Roosevelt (Speech, 9 Sept. 1902, Asheville, N.C.).

I. Who am I?

By design and trade I am a Reference Librarian and not a Government Documents specialist. I used to belong to GODORT as a "gov docs junkie" but the reality of having children and a librarian's pay caused me to forgo that frill and thrill after about two years. As a product of UCLA I had access to their extensive documents collection and probably became unabashedly addicted to the information the government provides when I ran across a tattered volume of hearings from the 1950's on how comic books were turning America's youth into a bunch of crazed and violent communists. I love that kind of stuff!

The University of Missouri-St.
Louis is a state-supported university that honestly delivers a fine education but, to be honest, has no real reputation as a flagship of higher learning. Established in the early 1960's, living within a budget imposed by a frugal state government, and existing in a country that appears to increasingly be supporting its educational institutions and libraries with Nike slogans ("Just do it"), the university decided, as many others have done, to gorge on the govdocs teat as a full-depository; this was followed by a re-scaling four years later to about 90 percent selectivity, which is where we stand today. Because we are young and underfunded, necessity has led us to rely heavily on the documents collection.

Underfunding also means understaffing, which at UM-St. Louis means that we are all multi-specialists, or, as I like to say, at UM-St. Louis we are ALL government documents librarians. Our single dedicated government documents librarian is a REAL reference librarian and all of the REAL reference librarians can cite SuDoc numbers in our sleep. One of the final pieces in the puzzle has been the intellectual integration of the collection by including the Government Printing Office/OCLC tapes in our online catalog, allowing patrons to access the collection transparently.

II. Why am I doing what I am doing?

1. I see this as traditional librarianship.
2. We are poor.
3. We can.
4. The information we are providing has real-world applications in our mission.

"1. I see this as traditional librarianship."

The main activities of our profession revolve around "access" and "preservation." Simple, basic librarianship consists of acquiring materials (collection development); organizing (cataloging, shelving); intermediating (reference services); and maintaining (preserving). Technical considerations aside, this is exactly what I am doing with a local internet gopher-based collection.

"2. We are poor."
This is a flip way of pointing out the value of the depository program. Because of the materials we receive, we can put resources in other areas not covered by the depository program. There should be nothing new here to any in the audience. What I do is an extension of the desire to extract value from existing resources at a minimal cost.

One of my colleagues contends that what I do is not traditional librarianship. She points out that I am more in the publishing business than the library business. I counter that when we take the traditional roles of librarianship, and apply the context of a specific institution with a specific mission, what I am doing is the same as what we have always done in the profession.

This last part is the practical key to all that I do: the context of what we do. Let me elaborate: rather than become a vacuum cleaner for everything that is out there, I suggest that you act as I do and deal in a world where acquisition decisions of electronic materials (i.e., mined electronic government documents) are the same as acquisition decisions for "real" documents, or "real" non-documents. A projected need must be met. For example, I do not choose to put an electronic document up on our Internet gopher because I think it will be used; I put it up because I know it will be used. This is based on my hands-on experience with the government documents collection via our Reference Desk. When I ran across the Occupational Outlook Handbook on CD-ROM from the depository program, I knew that this was an item that would be in demand because of the constant use of the print version.

Sometime last year I gave a talk in San Francisco that stated "Everything I ever learned of value I learned in library school." The struggle many librarians are having with the new technologies can be mitigated by stepping back and realizing that though information formats are changing radically, the underlying concepts of what we do have not changed.
Evaluation of a resource, for example, should be independent of the medium. What good is it? What need does it meet? Are there alternatives? If the process of accessing an electronic document seems stupid, confusing, and non-intuitive, it is probably because it is stupid, confusing, and non-intuitive. I think what I am saying is that if you are a confused, yet fearless, librarian, you will do fine. Now, we may still use stupid, confusing, non-intuitive resources, but at least we should be doing it with open eyes. Why are we doing this?

"3. We can."

Two conditions come together in a large number of the federal documents I use, and together they make mining the electronic documents a minor technical exercise:

1. The documents are already in an electronic format.
2. Uses of the documents are (usually) not restricted by copyright.

Lots of useful print documents are not copyrighted and require extra effort (prohibitive effort based on most of our resources) to utilize; lots of other electronic documents are simple to use, but are under copyright; but the synergy of these two simple conditions creates an explosive mix that ignited one over-caffeinated, altruistic librarian's ongoing activities.

I would like to point out that one of my frustrations is the problem of determining the copyrighted nature of a depository item. For example, one of the products I have raided is the eminently useable National Trade Data Bank (NTDB) CD-ROM. On the NTDB is an excellent small monograph:

OPPORTUNITIES IN MEXICO: A SMALL BUSINESS GUIDE is the product of a public/private sector initiative among the U.S. Small Business Administration (SBA), the Service Corps of Retired Executives (SCORE) and AT&T. This guide provides U.S. small businesses with practical trade information on exporting to Mexico.

Unfortunately, the Program Description part of the file unambiguously states:

Contents of this publication are copyrighted. All rights are reserved to Free Trade Consultants.
No portion of this book may be reproduced mechanically, electronically or by any other means, including photocopying, without written permission from John L. Manzella, Author, President of Free Trade Consultants, Buffalo, New York.

Since this item is available on the official NTDB gopher (gopher://sunny.stat-usa.gov:70/00/STAT-USA/NTDB/) worrying about this seems absurd, but violation of copyright in our profession is serious, even when it is absurd.

Another item I would dearly love to mine is the Joint Electronic Library CD-ROM (D 5.21:994/2/1 A) which is chock full of all sorts of historical papers from Military War College sources. I have neither the time nor the inclination to pursue to a conclusion the true copyright nature of this source (and suspect that it is a piece-by-piece answer anyway and not global to the entire CD-ROM). However, this is again a barrier to mining information that I would rather were not there.

In any case, although these items are coming through the Depository program, as with printed materials via the program, there is no guarantee that they are in the public domain. This problem is magnified in what I do because of the nature of providing electronic information on the Internet, even locally. It is one thing to make a single photocopy, and yet another to create a resource that can be easily reproduced by fifty million people.

The final piece of "Why am I doing this" is that,

"4. The information we are providing has real-world applications in our mission."

The information items provided are inherently useful. This is not an exercise in academic experimentation, but another dimension in providing desired information to those who need or want it. If I can inspire anyone to contribute to this common pursuit of our profession as I am doing, then I have again leveraged more value than it would appear out of these "dry as dust" government documents.

III. What am I doing?
In brief: raiding, stealing, pointing, mirroring, manipulating documents received via the depository library program or on the Internet.

As institutions, especially government institutions, shift from paper to electronic formats, the availability of electronic documents is exploding, and thus the available opportunities are exploding. Based on what is currently on our gopher, a user can find the Army Area Handbooks, Economic Reports of the President, the U.S. Industrial Outlook, "The Green Book (1994)" Overview of Entitlement Programs, and a list (unique?) of all Depository libraries organized by state.

I would like to even brag a bit about preceding the official National Trade Data Bank gopher site by about a year (and grouse at the same time at the initial announcement of the NTDB's availability on the Internet as "for the first time anywhere"). Though we did not mount all NTDB files on our gopher, we did extract, again, those we found most useful from our immediate experience such as the Background Notes and aforementioned Army Area Handbooks (among others). In fact, by extracting the most useful files (again reflecting our experience with local user needs) we have found that we have cut down on what I call the "noise" on the NTDB CD-ROM of having too rich a body of information. This is an application of the selection and collection activities of traditional librarianship.

The pleasing thing about this is that in mining the electronic documents we are less tied to pure economic forces (how much does an item cost?) and more tied to the intellectual activity of determining patron needs in an almost abstract manner.

Though I am addressing "Mining the Electronic Documents for Local Collections," the borderless nature of the Internet really means that everything is universally accessible. I admit to being driven by local needs, and I encourage you to do the same.
The truth is that many of our local needs are the same local needs as users of the Cleveland Public Library, the Library of Congress, or the America Online service. In fact, according to our user logs, the largest group of users of our Internet gopher government documents are subscribers to America Online.

IV. How am I doing it?

Also, how can YOU do it.

Undeniably, a certain level of technical expertise is required. The more expertise you have, the more you can do, the fancier you can get, the sexier your site, and the happier you can make your patrons. However, you do not need to know how to do computer programming (though if you know any programming, you can do some fun things); you do not need to know calculus or algebra; you do not need to know assembly language programming; in fact, if you have conquered any modern word-processing program, you have already learned what is probably the most difficult (and onerous) part of all I do.

What DO you need?

1. An existing Internet infrastructure of some kind.
2. Public domain files in an electronic (ASCII preferred) format.
3. The aforementioned word-processing skills.
4. The ability to download/upload files from/to local PC's/Macs and your net site.
5. About one hour of instruction (or decent documentation).

Whether you are dealing with the World Wide Web, gopher, ftp sites, or whatever, a necessary, but not sufficient, condition is that someone at your institution be running a machine on the Internet. Mainframes, PC's, Macs, whatever, can all be used to run freeware Internet server software. You will be hard-pressed to find institutions with sites on the Internet that do not have an existing server of some kind already up and running. Your job, Mr. and Ms. Phelps, should you decide to accept it, is to make the human connection to the people running the machines. Without an existing Internet infrastructure of machines, software, and people, you cannot do any of what I am about to describe.
Interestingly, there is a growing array of commercial providers who will do this for you. For $9.95 a month you can lease space on the World Wide Web with a company called Webcom (http://www.webcom.com). They become the infrastructure about which I am talking. This is not a recommendation of Webcom. I am just using them as what I consider a prototypical example of how the commercial sector can provide the needed Internet infrastructure.

In my situation, I noted that some of our computer techies had set up a prototype gopher server on the campus mainframe and I innocently asked if I could have an account called "The Library." After about fifteen minutes of instruction and with a single sheet of paper showing me how to set up gopher menu structures (all done with simple text editors), I was told I could start uploading files that could be accessed.

For those of you who think some mysterious and arcane knowledge is required to put files on the Internet I cannot stress how far from the truth is such a misconception. You can do mysterious and arcane things on the Internet, but being a basic provider is incredibly simple, provided you have an existing Internet infrastructure (or buy access to one).

Now, being a depository library, we (and you no doubt) receive tons of CD-ROMs. This is the crop from which you will harvest. Remember, WHAT you harvest is partly limited by technical considerations, but more critically related to understanding in a real-world sense what information is worth mining.

Initially, I install the software for accessing a CD-ROM as directed by any accompanying documentation. There are still many people that do not know that the information on a CD-ROM is as accessible as files on a diskette or your workstation's hard drive. One does not necessarily need to install special software to look at files on a CD. It is not unusual to have workstations, old and cheap ones, which cannot use the interface software supplied.
Such a workstation may not have enough memory; it may not have a color monitor; it may not have the most recent version of the DOS operating system. By looking at the files on a CD as you would files on a diskette, one can still extract valuable information that would be otherwise inaccessible.

Certainly much of what you can look at directly may require other programs. For example, by looking directly at the directories of files on GPO distributed CD-ROM's I've found groupings of Lotus 1-2-3 spreadsheet files that require spreadsheet software to access. If you have your Internet infrastructure in place and working, it can become as simple as setting up a gopher menu item "1-2-3 Spreadsheets from the 1995 Federal Budget" and then just uploading all of the files from the CD-ROM to the Internet server account. Though not recommended, this could be done without even having such a spreadsheet program yourself.

The key here is to poke around directly and not to rely on the native accessing software. You may find all kinds of neatly arrayed files just sitting around. The Joint Electronic Library CD-ROM mentioned above is an outstanding example. I have used the same technique to extract GIFs or pictures from USGS CD-ROMs to create a local exhibit of disaster photographs.

Also, do not keep yourself from understanding how the "native" search software for a CD-ROM product works, either. The NTDB, and other Dept. of Commerce products, usually have two available interfaces. By familiarizing yourself with the software you can select files on the NTDB to be extracted as separate files (e.g., all Department of State Background Notes come out as separate files for each country) or create one large file with all sections appended.
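
As an illustration only (the talk itself shows no code), the "poke around directly" step can be sketched in Python. The sketch assumes the CD-ROM is already visible as an ordinary directory tree; the function name `survey_by_extension` and the path passed on the command line are hypothetical and not taken from any GPO product.

```python
from collections import Counter
from pathlib import Path
import sys


def survey_by_extension(root):
    """Count the files under `root` by lowercased file extension.

    `root` is assumed to be wherever the CD-ROM shows up on the local
    machine (a drive letter or mount point); that location is an
    assumption of this sketch, not something the disc dictates.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python survey.py D:\
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for ext, n in survey_by_extension(root).most_common():
        print(f"{ext:10} {n}")
```

A listing like this only shows what kinds of files are sitting on the disc; deciding which of them are worth mounting remains the selection judgment described above.
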
Here is where you could create an ftp (file transfer protocol) archive with the entire text of a single Army Area Handbook, or create, as I have done (and as is done at the STAT-USA site that carries the NTDB on the Internet) Army Area Handbooks with each chapter a separate file.

Each product is different and subsequent editions may have updated or changed interface software. The general approach again is to:

1. Access the CD-ROM directly.
2. Familiarize yourself with the supplied interface software.

Of special note is that for those products that (hopefully) come out with a certain regularity, such as the NTDB, the familiarization process will pay off over time as you understand how to extract information with each new edition, and then carry over that expertise to subsequent issues.

V. Bells and Whistles

So far I have spoken broadly about how easy it is to just pull files off a CD-ROM and post them to a gopher or World Wide Web site (and begged the question of exactly HOW to do that as beyond the scope of this, or any presentation - how you do things is so tied into local resources that it is impossible to say in any generic sense how one should proceed). You can do some fascinating things with these files with a little expertise.

First, some files are prohibitively large. Putting the entire Occupational Outlook Handbook on an Internet server is trivial since the CD-ROM version has a single file with the full-text on it. By writing programs that can chop up larger files into constituent pieces, one can add value to the product. Accessing five or six paragraphs on the occupation of "library clerk" using the Internet gopher software is a lot more efficient, and faster, than using that same gopher software to transfer the entire Occupational Outlook Handbook. The same thing can be said for documents such as the North American Free Trade Agreement (NAFTA), the entire Federal Budget, or the Economic Report of the President.
Overall, the value one can add is by judiciously chopping up larger documents for easier access to the constituent pieces. As dull and dry as this may sound, I consider this a necessary component to providing universal access. By catering to the lowest common denominators, whatever the components of those denominators are, the universal access (hopefully) mandated by various federal information distribution programs can be met. One never knows whether an accessor is using a dumb terminal logged onto a Unix computer account or a top-of-the-line, fully networked, high end workstation on the Internet. The lowest common denominator here requires designing for the slowest transfer speed possible. It is going to take longer for someone with a 2400 baud modem to look at a document than for someone with Mosaic on a networked Macintosh.

Another level of value that can be added involves organizing the pieces of chopped up information. By expending more effort, complex documents can be arranged in hierarchies for easier access. Chapters can be listed within which sections can be arranged within which tables can be arrayed, all at different levels. There are no shortcuts to doing this, but when such judicious arrangement is done, we are again acting like librarians more than technicians.

No shortcuts. Librarians, of all professionals, have little problem understanding the importance of thankless tasks. I like to point out a difference between technicians and librarians. If you ask a technician to do something onerous and time-consuming, you are likely to be told "it cannot be done" (and what they mean is "I do not want to spend the time doing this onerous and time-consuming task"). As a counter example, when our government documents librarian was asked about shelf shifting and rearranging our growing collection, the answer was twofold:

1. An analysis that the job would take six months of hard work.
2. Six months of hard work.
Similarly, many of the best things one can do in mining electronic documents for both local and universal collections are time-consuming, onerous, thankless, invisible, and absolutely critical for providing useable and useful online resources. It is fine if you can find files on CD-ROMs or on the Internet chopped up into nice packages. Nevertheless, if you cannot, roll up your sleeves and start hacking.

The chopping up need not be difficult. Most word processing programs can open large documents and allow cutting and pasting. I have written some programs in BASIC that do the chopping automatically. The level of programming skill is that required by the most basic courses of even twenty years ago. In the case of the Occupational Outlook Handbook, all sections of the one large file were flagged with a unique pair of slash characters. By writing a program that chopped up the larger file on every occurrence of "//" I was able to quickly produce files consisting of separate occupations to be mounted on the Internet. Note again that the driving force for doing this was based on a first-hand knowledge at the Reference Desk of the utility of this specific work, and how people use it - it is not read cover to cover but is accessed by specific profession.

Another bell and whistle possible when you create and provide local access to government document collections is what I call commercials on the Internet. I've long proposed (to the snorts of disdain of my colleagues) that we put commercials on our online catalog, or OPAC. When I chop up larger files for local collections, I do just that when I make sure that each piece has a bit of advertisement for the University of Missouri-St. Louis.
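
As an illustration only: the programs mentioned above were written in BASIC and are not reproduced in this talk, so the following is a minimal Python sketch of the same idea, not the original code. It assumes the whole document fits in memory and that "//" never appears inside a section. The function names, the output file naming, and the wording of the tag are all hypothetical.

```python
from pathlib import Path


def chop(text, marker="//"):
    """Split `text` at every occurrence of `marker`, dropping empty pieces."""
    return [piece.strip() for piece in text.split(marker) if piece.strip()]


def tag(piece, source, courtesy):
    """Append a short source note and courtesy line to one piece."""
    return f"{piece}\n\n---\nSource: {source}\n{courtesy}\n"


def write_pieces(text, out_dir, source, courtesy, marker="//"):
    """Write each tagged piece to its own numbered text file.

    The naming scheme (section001.txt, section002.txt, ...) is an
    assumption of this sketch; any scheme your server can list will do.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, piece in enumerate(chop(text, marker), start=1):
        path = out / f"section{i:03d}.txt"
        path.write_text(tag(piece, source, courtesy), encoding="utf-8")
        paths.append(path)
    return paths
```

One caution with so simple a split: any text that legitimately contains "//" (an Internet address, for example) would be cut in two, so it is worth checking the source file for stray occurrences before trusting the output.
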
With the exception of some of the first documents I placed on our gopher, all other electronic documents placed on our gopher and World Wide Web servers have, and will have, innocuous little tags saying something like "access to this chapter of the China Army Area Handbook is brought to you courtesy of the libraries of the University of Missouri-St. Louis." Additionally, information as to the source of the electronic document (e.g., the NTDB for a specific month) is also included. Note these two important functions:

1. Providing provenance information of the document.
2. Advertising the expertise of the university.

Both of these things are extremely relevant. The issue of provenance comes into play when patrons wish to find similar items at a local depository. How many times have you had to deal with a patron carrying a photocopy of a single page of a government document asking "where is the rest of this item?" It is more than a courtesy to include the source of a document in a piece of a larger electronic file - it is a necessity. When a patron retrieving a chapter of an Army Area Handbook from a UM-St. Louis Library Internet node brings the printout to you, there should be no problem directing them to your local holdings. I strongly suggest that this is another dull, dry, and thankless area that is crucial to proper maintenance of local electronic collections.

Advertising one's expertise, I hold, is also relevant and not an ego trip. In an environment where dependency on public support is crucial, it is important that we toot our own horns in an attempt to keep ourselves visible to our local, national, and even international patrons. When America Online subscribers consistently run across "free" depositories of useful information, it is in our mutual self-interest to let these voters on tax issues understand from whence this information is coming.
Without tooting our horn these prototypical America Online users are likely to erroneously assume that it is their network provider (America Online) who is giving them this information. We did it. We do it. We will do it. And if the citizenry benefits from our services it behooves us to let them know to whom to give credit. This is less an ego issue than a survival issue. For an honest public institution such as mine, this is also an opportunity to demonstrate value returned for tax dollars invested. If we can all proceed in this manner, our modest and invisible profession can only benefit.

VI. Risks

Erroneous attribution

A recurring theme of this talk is the connection between local collections and universal access. In practice, what this means is that local activities can be criticized by anyone on the Internet. I have received messages from Norway explaining to me that their country voted NOT to join the European Economic Union (EEU); Austrians have told me about abbreviations in the CIA World Factbook which are in error; and Pakistanis have corrected me on the transcription into English of the name of their currency. By providing access to information you will be setting yourself up as appearing to be the publisher of that information.

Personal Attacks

As a local/universal provider of access to government information, erroneous attribution can lead to personal attacks. I was recently called a jew killer and Benedict Arnold for my efforts of mining government documents for our local collection. My crime? I posted "as is" a copy of the Yugoslav Army Area Handbook from the National Trade Data Bank. The irate virtual patron decided that my publication of this work was a racist slap in his face. As a good librarian I calmly responded to his complaint and apologized and explained the situation.
I said that, at his request, I would forward to my colleagues on the internet a proposal to remove all magazine articles, atlases, globes, and books with the word "Yugoslavia" in them. I heard nothing more.

Responsibility

By providing access to this information you should be setting yourself up as a consistent and responsible provider of information. Another area where we can add value to our local electronic collections is by maintaining our documents, continuing to update them, and making sure that access is robust. The risk is that irresponsibility can be seen immediately by all users of your information.

VII. Results

Relative fame and no fortune are the results. The best we can hope for is enough recognition to continue support for us and our institutions so we can continue to provide resources and services to our constituencies.

One of the most interesting things about setting up local electronic collections on Internet servers is the ability to monitor use. Gopher and World Wide Web servers typically have the capacity to create user logs. These files contain the date, time, accessor, and files accessed by anyone utilizing the local server's resources.

For about two years we at UM-St. Louis have also provided information other than that mined, primarily, from GPO distributed CD-ROMs. Our gopher logs, however, indicate that the items most heavily used are from the Government Documents section of our virtual collection. Specifically, the Army Area Handbooks are the single most heavily used items.

Due to technical problems, we are only able to track what I call "accesses" or "transactions." Whenever a user presses a key to move to another level of the gopher or to retrieve a document, a line of text is written to the day's gopher log indicating, again, date, time, user, and file or path accessed. Accesses of government documents are up from a hundred transactions per month two years ago, to over one hundred THOUSAND transactions monthly today.
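
As an illustration only, a per-document tally of such log lines can be sketched in Python. The talk names the four fields in a log line (date, time, user, and file or path) but not the actual layout, so the tab-separated format assumed here, the function name `tally`, and the "/govdocs/" path prefix in the usage note are all hypothetical.

```python
from collections import Counter


def tally(lines, prefix=""):
    """Count transactions per path for paths that start with `prefix`.

    Each line is ASSUMED to hold four tab-separated fields:
    date, time, user, path. Real gopher server logs may be laid out
    differently, in which case only the parsing step needs to change.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        _date, _time, _user, path = fields
        if path.startswith(prefix):
            counts[path] += 1
    return counts
```

Called as `tally(open("gopher.log"), prefix="/govdocs/")` (both names hypothetical), this would give one count per document path, which is the kind of figure that could show which documents are being accessed rather than only how many transactions occur.
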
In abeyance is my desire to demonstrate WHICH documents are being accessed. Efforts to cease publication of things like the Occupational Outlook Handbook or the Industrial Outlook might be combatted by hard statistics showing continued and heavy use of these documents.

VIII. Conclusions

My conclusions are handwritten at the last minute because, for the life of me, I couldn't come up with any way to tie everything together. I posit that the reason for this is that this is an open-ended, ongoing, amorphous and ambiguous process (a salient feature of these new technologies). There is no conclusion to these activities; products, changing formats of information (e.g., the increasing use of Adobe Acrobat and the PDF file format), and technologies in general are all in flux.

My conclusion, and advice, is to invert the popular environmentalist's aphorism of "think globally and act locally," to the new internet maxim of "think locally and act globally."

WWW Home Page URL:http://www.umsl.edu/~muns/