Mining the Electronic Documents for Local Collections
	by Raleigh Muns

Transcript of a talk delivered at the Spring, 1995 Depository Library
Council Meeting and Federal Depository Conference, Wednesday, April 12,
1995, Arlington, Virginia

OUTLINE

-.      Some initial quotes about information

I.      Who am I?
	In which a personal and individual context is set.

II.     Why am I doing what I am doing?
	In which motivation and opportunity are explored.

III.    What am I doing?
	In which the overall approach is explained.

IV.     How am I doing it?
	In which some nuts and bolts are examined.

V.      Bells and whistles
	In which some fancier things about mining and providing are explained.

VI.     Risks
	In which some unforeseen problems are put forth.

VII.    Results
	In which feedback on activities is presented.

VIII.   Conclusion

Some initial quotes about information:

	When action grows unprofitable, gather information; when
	information grows unprofitable, sleep.

	-Ursula K. Le Guin (The Left Hand of Darkness (1969), ch. 3).


	Information is the oxygen of the modern age. It seeps through
	the walls topped by barbed wire, it wafts across the electrified
	borders.

	-Ronald Reagan (Guardian; London, 14 June 1989).


	The government is us; we are the government, you and I.

	-Theodore Roosevelt (Speech, 9 Sept. 1902, Asheville, N.C.).


I.      Who am I?

	By design and trade I am a Reference Librarian and not a
	Government Documents specialist. I used to belong to GODORT as a
	"gov docs junkie" but the reality of having children and a
	librarian's pay caused me to forgo that frill and thrill after
	about two years. As a product of UCLA I had access to their
	extensive documents collection and probably became unabashedly
	addicted to the information the government provides when I ran
	across a tattered volume of hearings from the 1950's on how
	comic books were turning America's youth into a bunch of crazed
	and violent communists. I love that kind of stuff!

	The University of Missouri-St. Louis is a state-supported
	university that honestly delivers a fine education but, to be
	honest, has no real reputation as a flagship of higher learning.
	Established in the early 1960's, living within a budget imposed
	by a frugal state government, and existing in a country that
	appears to increasingly be supporting its educational
	institutions and libraries with Nike slogans ("Just do it") the
	university decided, as many others have done, to gorge on the
	govdocs teat as a full-depository; this was followed by a
	re-scaling four years later to about 90 percent selectivity
	which is where we stand today.

	Because we are young and underfunded, necessity has led us to
	rely heavily on the documents collection. Underfunding also
	means understaffing, which at UM-St. Louis means that we are all
	multi-specialists, or, as I like to say, at UM-St. Louis we are
	ALL government documents librarians. Our single dedicated
	government documents librarian is a REAL reference librarian and
	all of the REAL reference librarians can cite SuDoc numbers in
	our sleep.

	One of the final pieces in the puzzle has been the intellectual
	integration of the collection by including the Government
	Printing Office/OCLC tapes in our online catalog allowing
	patrons to access the collection transparently.


II.     Why am I doing what I am doing?

	1.      I see this as traditional librarianship.

	2.      We are poor.

	3.      We can.

	4.      The information we are providing has real-world
		applications in our mission.

	"1. I see this as traditional librarianship."  The main
	activities of our profession revolve around activities of
	"access" and "preservation." Simple, basic librarianship
	consists of acquiring materials (collection development);
	organizing (cataloging, shelving); intermediating (reference
	services); and maintaining (preserving). Technical
	considerations aside, this is exactly what I am doing with a
	local Internet gopher-based collection.

	"2. We are poor." This is a flip way of pointing out the value
	of the depository program. Because of the materials we receive,
	we can put resources in other areas not covered by the
	depository program. There should be nothing new here to any in
	the audience. What I do is an extension of the desire to extract
	value from existing resources at a minimal cost.

	One of my colleagues contends that what I do is not traditional
	librarianship. She points out that I am more in the publishing
	business than the library business. I counter that when we take
	the traditional roles of librarianship, and apply the context of
	a specific institution with a specific mission, what I am doing
	is the same as what we have always done in the profession. This
	last part is the practical key to all that I do: the context of
	what we do.

	Let me elaborate: rather than become a vacuum cleaner for
	everything that is out there, I suggest that you act as I do and
	deal in a world where acquisition decisions of electronic
	materials (i.e., mined electronic government documents) are the
	same as acquisition decisions for "real" documents, or "real"
	non-documents. A projected need must be met.

	For example, I do not choose to put an electronic document up on
	our Internet gopher because I think it will be used; I put it up
	because I know it will be used. This is based on my hands-on
	experience with the government documents collection via our
	Reference Desk. When I ran across the Occupational Outlook
	Handbook on CD-ROM from the depository program, I knew that this
	was an item that would be in demand because of the constant use
	of the print version. Sometime last year I gave a talk in San
	Francisco that stated "Everything I ever learned of value I
	learned in library school." The struggle many librarians are
	having with the new technologies can be mitigated by stepping
	back and realizing that though information formats are changing
	radically, the underlying concepts of what we do have not
	changed. Evaluation of a resource, for example, should be
	independent of the medium. What good is it? What need does it
	meet? Are there alternatives? If the process of accessing an
	electronic document seems stupid, confusing, and non-intuitive,
	it is probably because it is stupid, confusing, and
	non-intuitive. I think what I am saying is that if you are a
	confused, yet fearless, librarian, you will do fine. Now, we may
	still use stupid, confusing, non-intuitive resources, but at
	least we should be doing it with open eyes.

	Why are we doing this? "3. We can." Two conditions come together
	in a large number of the federal documents I use which make
	mining the electronic documents a minor technical exercise, and
	they are:

	1.      The documents are already in an electronic format.

	2.      Uses of the documents are (usually) not restricted by
		copyright.

	Lots of useful print documents are not copyrighted and require
	extra effort (prohibitive effort based on most of our resources)
	to utilize; lots of other electronic documents  are simple to
	use, but are under copyright; but the synergy of these two
	simple conditions creates an explosive mix that ignited one
	over-caffeinated, altruistic librarian's ongoing activities.

	I would like to point out that one of my frustrations is the
	problem of determining the copyrighted nature of a depository
	item. For example, one of the products I have raided is the
	eminently useable National Trade Data Bank (NTDB) CD-ROM. On the
	NTDB is an excellent small monograph:

		OPPORTUNITIES IN MEXICO: A SMALL BUSINESS GUIDE is the
		product of a public/private sector initiative among the
		U.S. Small Business Administration (SBA), the Service
		Corps of Retired Executives (SCORE) and AT&T.  This
		guide provides U.S. small businesses with practical
		trade information on exporting to Mexico.

	Unfortunately, the Program Description part of the file
	unambiguously states:

		Contents of this publication are copyrighted. All rights
		are reserved to Free Trade Consultants.  No portion of
		this book may be reproduced mechanically, electronically
		or by any other means, including photocopying, without
		written permission from John L. Manzella, Author,
		President of Free Trade Consultants, Buffalo, New York.

	Since this item is available on the official NTDB gopher
	(gopher://sunny.stat-usa.gov:70/00/STAT-USA/NTDB/) worrying
	about this seems absurd, but violation of copyright in our
	profession is serious, even when it is absurd.

	Another item I would dearly love to mine is the Joint Electronic
	Library CD-ROM (D 5.21:994/2/1 A) which is chock full of all
	sorts of historical papers from Military War College sources. I
	have neither the time nor the inclination to pursue determining
	to a conclusion the true copyright nature of this source (and
	suspect that it is a piece-by-piece answer anyway and not global
	to the entire CD-ROM).  However, this is again a barrier to
	mining information that I would rather were not there. In any
	case, although these items are coming through the Depository
	program, as with printed materials via the program, there is no
	guarantee that
	they are in the public domain. This problem is magnified in what
	I do because of the nature of providing electronic information
	on the Internet, even locally. It is one thing to make a single
	photocopy, and yet another to create a resource that can be
	easily reproduced by fifty million people.

	The final piece of "Why am I doing this" is that, "4. The
	information we are providing has real-world applications in our
	mission." The information items provided are inherently useful.
	This is not an exercise in academic experimentation, but another
	dimension in providing desired information to those who need or
	want it. If I can inspire anyone to contribute to this common
	pursuit of our profession as I am doing, then I have again
	leveraged more value than it would appear out of these "dry as
	dust" government documents.


III.    What am I doing?

	In brief: raiding, stealing, pointing, mirroring, manipulating
	documents received via the depository library program or on the
	Internet. As institutions, especially government institutions,
	shift from paper to electronic formats, the availability of
	electronic documents is exploding, and thus the available
	opportunities are exploding.

	Based on what is currently on our gopher, a user can find the
	Army Area Handbooks, Economic Reports of the President, the U.S.
	Industrial Outlook, "The Green Book (1994)" Overview of
	Entitlement Programs, and a list (unique?) of all Depository
	libraries organized by state.

	I would like to even brag a bit about preceding the official
	National Trade Data Bank gopher site by about a year (and grouse
	at the same time at the initial announcement of the NTDB's
	availability on the Internet as "for the first time anywhere").
	Though we did not mount all NTDB files on our gopher, we did
	extract, again, those we found most useful from our immediate
	experience such as the Background Notes and aforementioned Army
	Area Handbooks (among others). In fact, by extracting the most
	useful files (again reflecting our experience with local user
	needs) we have found that we have cut down on what I call the
	"noise" on the NTDB CD-ROM of having too rich a body of
	information. This is application of the selection and collection
	activities of traditional librarianship. The pleasing thing
	about this is that in mining the electronic documents we are
	less tied to pure economic forces (how much does an item cost?)
	and more tied to the intellectual activity of determining patron
	needs in an almost abstract manner.

	Though I am addressing "Mining the Electronic Documents for
	Local Collections," the borderless nature of the Internet
	really means that everything is universally accessible. I admit
	to being driven by local needs, and encourage you to do the same.
	The truth is that many of our local needs are the same local
	needs as users of the Cleveland Public Library, the Library of
	Congress, or the America Online service. In fact, according to
	our user logs, the largest group of users of our Internet gopher
	government documents are subscribers to America Online.


IV.     How am I doing it?

	Also, how can YOU do it? Undeniably, a certain level of
	technical expertise is required. The more expertise you have,
	the more you can do, the fancier you can get, the sexier your
	site, and the happier you can make your patrons. However, you do
	not need to know how to do computer programming (though if you
	know any programming, you can do some fun things); you do not
	need to know calculus or algebra; you do not need to know
	assembly language programming; in fact, if you have conquered
	any modern word-processing program, you have already learned
	what is probably the most difficult (and onerous) part of all I
	do.

	What DO you need?

	1.      An existing Internet infrastructure of some kind.

	2.      Public domain files in an electronic (ASCII preferred)
		format.

	3.      The aforementioned word-processing skills.

	4.      The ability to download/upload files from/to local
		PC's/Macs and your net site.

	5.      About one hour of instruction (or decent documentation).

	Whether you are dealing with the World Wide Web, gopher, ftp
	sites, or whatever, a necessary, but not sufficient, condition is
	that someone at your institution be running a machine on the
	Internet. Mainframes, PC's, Macs, whatever, can all be used to
	run freeware Internet server software. You will be hard-pressed
	to find institutions with sites on the Internet that do not have
	an existing server of some kind already up and running. Your
	job, Mr. and Ms. Phelps, should you decide to accept it, is to
	make the human connection to the people running the machines.
	Without an existing Internet infrastructure of machines,
	software, and people, you cannot do any of what I am about to
	describe.

	Interestingly, there is a growing array of commercial providers
	who will do this for you. For $9.95 a month you can lease space
	on the World Wide Web with a company called Webcom 
	(http://www.webcom.com). They become the infrastructure about 
	which I am talking. This is not a recommendation of Webcom.  
	I am just using them as what I consider a prototypical example 
	of how the commercial sector can provide the needed Internet 
	infrastructure.

	In my situation, I noted that some of our computer techies had
	set up a prototype gopher server on the campus mainframe and I
	innocently asked if I could have an account called "The
	Library." After about fifteen minutes of instruction and with a
	single sheet of paper showing me how to set up gopher menu
	structures (all done with simple text editors), I was told I
	could start uploading files that could be accessed. For those of
	you who think some mysterious and arcane knowledge is required
	to put files on the Internet I cannot stress how far from the
	truth is such a misconception. You can do mysterious and arcane
	things on the Internet, but being a basic provider is incredibly
	simple, provided you have an existing Internet infrastructure
	(or buy access to one).
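
	To make the "simple text editor" point concrete, here is a
	minimal sketch in Python of what a gopher menu looks like on
	the wire, following the line format in RFC 1436 (item type and
	display text, then selector, host, and port, separated by
	tabs). This is not the single sheet of instructions mentioned
	above, which depends on the local server software; the host
	name and selector paths are made up for illustration.

```python
# Sketch only: builds menu lines in the RFC 1436 Gopher wire format.
# The host name, port, and selector paths below are hypothetical.

def menu_line(item_type, display, selector, host, port=70):
    """Return one Gopher menu line: type+display, selector, host, port."""
    return f"{item_type}{display}\t{selector}\t{host}\t{port}\r\n"


def build_menu(entries, host, port=70):
    """Join (type, display, selector) entries and end with a lone period."""
    lines = [menu_line(t, d, s, host, port) for t, d, s in entries]
    return "".join(lines) + ".\r\n"


if __name__ == "__main__":
    entries = [
        ("1", "Government Documents", "/govdocs"),         # "1" = directory
        ("0", "About The Library", "/library/about.txt"),  # "0" = text file
    ]
    print(build_menu(entries, host="gopher.example.edu"), end="")
```

	Each menu entry is nothing more than one line of tab-separated
	text, which is about as far from arcane as it gets.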

	Now, being a depository library, we (and you no doubt) receive
	tons of CD-ROMs.  This is the crop from which you will harvest.
	Remember, WHAT you harvest is partly limited by technical
	considerations, but more critically related to understanding in
	a real-world sense what information is worth mining.

	Initially, I install the software for accessing a CD-ROM as
	directed by any accompanying documentation. There are still many
	people that do not know that the information on a CD-ROM is as
	accessible as files on a diskette or your workstation's hard
	drive. One does not necessarily need to install special software
	to look at files on a CD.  It is not unusual to have
	workstations, old and cheap ones, which cannot use the interface
	software supplied. It may not have enough memory; it may not
	have a color monitor; it may not have the most recent version of
	the DOS operating system. By looking at the files on a CD as
	you would files on a diskette, one can still extract valuable
	information that would be otherwise inaccessible.

	Certainly much of what you can probably look at directly may
	require other programs. For example, by looking directly at the
	directories of files on GPO distributed CD-ROM's I've found
	groupings of Lotus 1-2-3 spreadsheet files that require
	spreadsheet software to access. If you have your Internet
	infrastructure in place and working, it can become as simple as
	setting up a gopher menu item "1-2-3 Spreadsheets from the 1995
	Federal Budget" and then just uploading all of the files from
	the CD-ROM to the Internet server account. Though not
	recommended, this could be done without even having such a
	spreadsheet program yourself.
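
	One way to do this kind of looking around, sketched in Python
	on the assumption that the CD-ROM shows up as an ordinary drive
	or directory (the drive letter in the example is hypothetical):
	walk the directory tree and group the file names by extension,
	so that clusters of spreadsheet or text files stand out.

```python
import os
from collections import defaultdict


def survey_by_extension(root):
    """Map each lowercase file extension under root to its file paths."""
    found = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            # Files with no extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            found[ext].append(os.path.join(folder, name))
    return dict(found)


if __name__ == "__main__":
    # "D:\\" is a hypothetical CD-ROM drive letter; substitute your own.
    for ext, paths in sorted(survey_by_extension("D:\\").items()):
        print(f"{ext:8} {len(paths)} file(s)")
```

	The listing says nothing about copyright or usefulness; it only
	shows what is sitting on the disc waiting to be evaluated.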

	The key here is to poke around directly and not to rely on the
	native accessing software.  You may find all kinds of neatly
	arrayed files just sitting around. The Joint Electronic Library
	CD-ROM mentioned above is an outstanding example. I have used
	the same technique to extract GIFs or pictures from USGS CD-ROMs
	to create a local exhibit of disaster photographs. Also, do not 
	keep yourself from understanding how the "native" search software 
	for a CD-ROM product works, either. The NTDB, and other Dept. of
	Commerce products, usually have two available interfaces. By
	familiarizing yourself with the software you can select files on
	the NTDB to be extracted as separate files (e.g., all Department
	of State Background Notes come out as separate files for each
	country) or create one large file with all sections appended.
	Here is where you could create an ftp (file transfer protocol)
	archive with the entire text of a single Army Area Handbook, or
	create, as I have done (and as is done at the STAT-USA site that
	carries the NTDB on the Internet) Army Area Handbooks with each
	chapter a separate file.

	Each product is different and subsequent editions may have
	updated or changed interface software. The general approach
	again is to:

	1.      Access the CD-ROM directly

	2.      Familiarize yourself with the supplied interface
		software.

	Of special note is that for those products that (hopefully) come
	out with a certain regularity, such as the NTDB, the
	familiarization process will pay off over time as you understand
	how to extract information with each new edition, and then carry
	over that expertise to subsequent issues.
	

V.      Bells and whistles

	So far I have spoken broadly about how easy it is to just pull
	files off a CD-ROM and post them to a gopher or World Wide Web
	site (and begged the question of exactly HOW to do that as
	beyond the scope of this, or any presentation - how you do
	things is so tied into local resources that it is impossible to
	say in any generic sense how one should proceed). You can do
	some fascinating things with these files with a little
	expertise.

	First, some files are prohibitively large. Putting the entire
	Occupational Outlook Handbook on an Internet server is trivial
	since the CD-ROM version has a single file with the full-text on
	it. By writing programs that can chop up larger files into
	constituent pieces, one can add value to the product. Accessing
	five or six paragraphs on the occupation of "library clerk"
	using the Internet gopher software is a lot more efficient, and
	faster, than using that same gopher software to transfer the
	entire Occupational Outlook Handbook. The same thing can be said
	for documents such as the North American Free Trade Agreement
	(NAFTA), the entire Federal Budget, or the Economic Report of
	the President.

	Overall, the value one can add is by judiciously chopping up
	larger documents for easier access to the constituent pieces. As
	dull and dry as this may sound, I consider this a necessary
	component to providing universal access. By catering to the
	lowest common denominators, whatever the components of those
	denominators are, the universal access (hopefully) mandated by
	various federal information distribution programs can be met.
	One never knows whether an accessor is using a dumb terminal
	logged onto a Unix computer account or a top-of-the-line, fully
	networked, high end workstation on the Internet. The least
	common denominator here requires designing for the slowest
	transfer speed possible. It is going to take a while longer for
	someone with a 2400 baud modem to look at a document than
	someone with Mosaic on a networked Macintosh.

	Another level of value that can be added involves organizing the
	pieces of chopped up information. By expending more effort,
	complex documents can be arranged in hierarchies for easier
	access. Chapters can be listed within which sections can be
	arranged within which tables can be arrayed, all at different
	levels. There are no shortcuts to doing this, but when such
	judicious arrangement is done, we are again acting like
	librarians more than technicians.

	No shortcuts. Librarians, of all professionals, have little
	problem understanding the importance of thankless tasks. I like
	to point out a difference between  technicians and librarians.
	If you ask a technician to do something onerous and
	time-consuming, you are likely to be told "it cannot be done"
	(and what they mean is "I do not want to spend the time doing
	this onerous and time-consuming task"). As a counter example,
	when our government documents librarian was asked about shelf
	shifting and rearranging our growing collection, the answer was
	twofold:

	1.      An analysis that the job would take six months of hard
		work.

	2.      Six months of hard work.

	Similarly, many of the best things one can do in mining
	electronic documents for both local and universal collections
	are time-consuming, onerous, thankless, invisible, and
	absolutely critical for providing useable and useful online
	resources. It is fine if you can find files on CD-ROMs or on the
	Internet chopped up into nice packages. Nevertheless, if you
	cannot, roll up your sleeves and start hacking.

	The chopping up need not be difficult. Most word processing
	programs can take large documents while allowing cutting and 
	pasting. I have written some programs in BASIC that do the 
	chopping automatically. The level of programming skill is that 
	required by the most basic courses of even twenty years ago. In 
	the case of the Occupational Outlook Handbook, all sections of 
	the one large file were flagged with the unique characters of two
	backwards slashes. By writing a program that chopped up the
	larger file on every occurrence of "//"  I was able to quickly
	produce files consisting of separate occupations to be mounted
	on the Internet.
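
	The chopping described above can be sketched in a few lines.
	This is a Python illustration, not the BASIC programs just
	mentioned; the "//" marker is the one quoted above, while the
	function names, output file names, and sample text are
	assumptions made for the example.

```python
import os


def split_sections(text, marker="//"):
    """Split text on every occurrence of marker; drop empty pieces."""
    pieces = (piece.strip() for piece in text.split(marker))
    return [piece for piece in pieces if piece]


def write_sections(sections, out_dir, prefix="section"):
    """Write each section to its own numbered text file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, body in enumerate(sections, start=1):
        path = os.path.join(out_dir, f"{prefix}{number:03d}.txt")
        with open(path, "w", encoding="ascii", errors="replace") as handle:
            handle.write(body + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Made-up sample text standing in for the one large file.
    sample = "Library Clerks\nText...\n//\nLibrarians\nText...\n//"
    for section in split_sections(sample):
        print(section.splitlines()[0])  # first line of each piece
```

	Splitting on every occurrence is only safe when the marker
	never appears inside the text itself, so it is worth checking
	a few of the resulting files by eye before mounting them.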

	Note again that the driving force for doing this was based on a
	first-hand knowledge at the Reference Desk of the utility of
	this specific work, and how people use it - it is not read cover
	to cover but is accessed by specific profession.

	Another bell and whistle  possible when you create and provide
	local access to government document collections is what I call
	commercials on the Internet.  I've long proposed (to the snorts
	of disdain of my colleagues) that we put commercials on our
	online catalog, or OPAC. When I chop up larger files for local
	collections, I do just that when I make sure that each piece has
	a bit of advertisement for the University of Missouri-St. Louis.
	With the exception of some of the first documents I placed on
	our gopher, all other electronic documents placed on our gopher
	and World Wide Web servers have, and will have, innocuous little
	tags saying something like "access to this chapter of the China
	Army Area Handbook is brought to you courtesy of the libraries
	of the University of Missouri-St. Louis." Additionally,
	information as to the source of the electronic document (e.g.,
	the NTDB for a specific month) is also included. Note these two
	important functions:

	1.      Providing provenance information of the document.

	2.      Advertising the expertise of the university.
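
	A minimal sketch, in Python, of how such a tag might be added
	to each piece as it is produced. The function name, the footer
	layout, and the sample chapter text are assumptions made for
	illustration; only the wording of the credit line follows the
	example quoted above.

```python
def tag_section(body, source, credit):
    """Append a short provenance and credit footer to one section."""
    footer = [
        "",
        "-" * 60,
        f"Source: {source}",
        credit,
    ]
    return body.rstrip("\n") + "\n" + "\n".join(footer) + "\n"


if __name__ == "__main__":
    print(tag_section(
        "Chapter text goes here...",  # placeholder body
        source="National Trade Data Bank CD-ROM (month and year of issue)",
        credit=("Access to this chapter is brought to you courtesy of "
                "the libraries of the University of Missouri-St. Louis."),
    ), end="")
```

	Because the footer is added by the same program that does the
	chopping, every piece carries its source and credit without
	any extra hand editing.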

	Both of these things are extremely relevant. The issue of
	provenance comes into play when patrons wish to find similar
	items at a local depository. How many times have you had to deal
	with a patron carrying a photocopy of a single page of a
	government document asking "where is the rest of this item?"  It
	is more than a courtesy to include the source of a document in a
	piece of a larger electronic file - it is a necessity. When a
	patron retrieving a chapter of an Army Area Handbook from a
	UM-St. Louis Library Internet node brings the printout to you,
	there should be no problem directing them to your local
	holdings. I strongly suggest that this is another dull, dry, and
	thankless area that is crucial to proper maintenance of local
	electronic collections.

	Advertising one's expertise, I hold, is also relevant and not an
	ego trip. In an environment where dependency on public support
	is crucial, it is important that we toot our own horns in an
	attempt to keep ourselves visible to our local, national, and
	even international patrons. When America Online subscribers
	consistently run across "free" depositories of useful
	information, it is in our mutual self-interest to let these
	voters on tax issues understand from whence this information is
	coming. Without tooting our horn these prototypical America
	Online users are likely to erroneously assume that it is their
	network provider (America Online) who is giving them this
	information. We did it. We do it. We will do it. And if the
	citizenry benefits from our services it behooves us to let them
	know to whom to give credit. This is less an ego issue than a
	survival issue. For an honest public institution such as mine, this
	is also an opportunity to demonstrate value returned for tax
	dollars invested. If we can all proceed in this manner, our 
	modest and invisible profession can only benefit.


VI.     Risks

	Erroneous attribution

	A recurring theme of this talk is the connection between local
	collections and universal access. In practice, what this
	means is that local activities can be criticized by anyone on
	the Internet. I have received messages from Norway explaining to
	me that their country voted NOT to join the European Economic
	Union (EEU); Austrians have told me about abbreviations in the
	CIA World Factbook which are in error; and Pakistanis have
	corrected me on the transcription into English of the name of
	their currency. By providing access to information you will be
	setting yourself up as appearing to be the publisher of that
	information.

	Personal Attacks

	As a local/universal provider of access to government
	information, erroneous attribution can lead to personal
	attacks. I was recently called a Jew killer and Benedict Arnold
	for my efforts of mining government documents for our local
	collection. My crime? I posted "as is" a copy of the Yugoslav
	Army Area Handbook from the National Trade Data Bank. The irate
	virtual patron decided that my publication of this work was a
	racist slap in his face. As a good librarian I calmly responded
	to his complaint and apologized and explained the situation. I
	said that, at his request, I would forward to my colleagues on
	the Internet a proposal to remove all magazine articles,
	atlases, globes, and books with the word "Yugoslavia" in them. I
	heard nothing more.

	Responsibility

	By providing access to this information you should be setting
	yourself up as a consistent and responsible provider of
	information. Another area where we can add value to our local
	electronic collections is by maintaining our documents,
	continuing to update them, and making sure that access is
	robust. The risk is that irresponsibility can be seen
	immediately by all users of your information.



VII.    Results

	Relative fame and no fortune are the results. The best we can
	hope for is enough recognition to continue support for us and
	our institutions so we can continue to provide resources and
	services to our constituencies. One of the most interesting
	things about setting up local electronic collections on Internet
	servers is the ability to monitor use. Gopher and World Wide Web
	servers typically have the capacity to create user logs. These
	files contain the date, time, accessor, and files accessed by
	anyone utilizing the local server's resources. At UM-St. Louis
	we have, for about two years, also provided information other
	than that mined, primarily, from GPO distributed CD-ROMs. Our
	gopher logs, however, indicate that the items most heavily used
	are from the Government Documents section of our virtual
	collection. Specifically, the Army Area Handbooks are the single
	most heavily used items.

	Due to technical problems, we are only able to track what I call
	"accesses" or "transactions."  Whenever a user presses a key to
	move to another level of the gopher or to retrieve a document, a
	line of text is written to the day's gopher log indicating,
	again, date, time, user, and file or path accessed. Accesses of
	government documents are up from a hundred transactions per month
	two years ago, to over one hundred THOUSAND transactions monthly
	today. In abeyance is my desire to demonstrate WHICH documents
	are being accessed. Efforts to cease publication of things like
	the Occupational Outlook Handbook or the Industrial Outlook
	might be combatted by hard statistics showing continued and
	heavy use of these documents.
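
	As a sketch of how such statistics could be pulled from a log,
	here is a short Python example that counts transactions per
	file or path. The log layout it assumes (date, time, user, and
	path separated by spaces, one transaction per line) is
	hypothetical, as are the sample lines; real gopher and World
	Wide Web servers each write their own formats, so the parsing
	would need to be adapted.

```python
from collections import Counter


def count_accesses(log_lines):
    """Count transactions per path from 'date time user path' lines."""
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 3)  # the path may itself contain spaces
        if len(fields) == 4:          # skip lines that do not fit the layout
            counts[fields[3].strip()] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines in the assumed layout.
    sample = [
        "1995-04-01 09:15:02 user.example.com /govdocs/handbooks/ch01",
        "1995-04-01 09:16:40 user.example.com /govdocs/ooh/clerks",
        "1995-04-02 14:03:11 other.example.org /govdocs/handbooks/ch01",
    ]
    for path, hits in count_accesses(sample).most_common():
        print(f"{hits:6d}  {path}")
```

	Sorting by count puts the most heavily used documents at the
	top, which is the kind of evidence described above.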

	

VIII.   Conclusions

	My conclusions are handwritten at the last minute because, for
	the life of me, I couldn't come up with any way to tie
	everything together. I posit that the reason for this is that
	this is an open-ended, ongoing, amorphous and ambiguous process
	(a salient feature of these new technologies). There is no
	conclusion to these activities; products, changing formats of 
	information (e.g., the increasing use of Adobe Acrobat and the 
	PDF file format), and technologies in general are all in flux.

	My conclusion, and advice, is to invert the popular
	environmentalist's aphorism of 
	
		"think globally and act locally,"
	
	to the new Internet maxim of
	
		"think locally and act globally."



Copyright 1995 by R. Muns
Email Address: SRCMUNS@UMSLVMA.UMSL.EDU

WWW Home Page URL: http://www.umsl.edu/~muns/