August 7, 1998 - University of Washington



January 23, 1999

Gr8 In4mation: An Orthographic Survey of Numbers Online

Primacy of numbers

All computer representations of text are fundamentally numbers.

Numbers are part of search systems: numbering sets, reporting set size, limiting by year.

"People in the data processing community have gotten used to viewing things in a highly simplistic way, dictated by the kind of tools they have at their disposal. And this may suggest another wonderful irony. People are awed by the sophistication and complexity of computers, and tend to assume that such things are beyond their comprehension. But that view is entirely backwards! The thing that makes computers so hard to deal with is not their complexity, but their utter simplicity." (Kent, 1978, p. viii)

** Establish database systems at beginning and give the survey results in brackets throughout the essay.

These are bibliographic databases – focus is not on databases of numbers.

Parts of a bibliographic record that are numeric:

>>toxline

>>datastar

YF Field (Publication Year): You can search by publication year by specifying the entire year (e.g., 1991), or by using only the last two digits of the year (e.g., 91). In the YF field, the year appears as four digits. Note that there are currently over 3,000 records with publication year = 0000 (i.e., the publication year is unknown). Records go back at least as far as 1932. So by 2032, there may be some confusion when entering '32' to indicate the publication year.
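The ambiguity described above can be sketched in a few lines of Python. This is only an illustration: the function name `candidate_years` and the idea of expanding two digits against the database's date range are my assumptions, not a description of how DataStar actually resolves a two-digit year.

```python
def candidate_years(two_digits: int, first_year: int, last_year: int) -> list[int]:
    """Return every four-digit year in [first_year, last_year] that ends in two_digits."""
    if not 0 <= two_digits <= 99:
        raise ValueError("expected a value between 0 and 99")
    return [year for year in range(first_year, last_year + 1) if year % 100 == two_digits]


# With records running from 1932 to 1999, '91' is unambiguous.
print(candidate_years(91, 1932, 1999))  # [1991]

# Once the file reaches 2032, '32' could mean either year.
print(candidate_years(32, 1932, 2032))  # [1932, 2032]
```

As long as the file spans fewer than 100 years the shortcut is safe; the first collision appears exactly when the span reaches a century.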

The Y2K problem as a number-representation problem

“The Bugs in Your Future” Wired January 1999, 7.01, pp. 76 – 77

January 1, 1999 – programs that use “99” as a sentinel value (for example, to indicate that no year value was available for a given database entry) start treating everyday dates as special cases.

August 22, 1999 – GPS software rolls over its week counter for the first time. .. with the system dating from January 5, 1980, the rollover has never been tested live before.

September 9, 1999 – End-of-File Bug (Part 1) Programs that use “9999” as an end-of-file marker may mistake the date 9/9/99 as an end of file.

January 1, 2000 – Even if 85% of Y2K-prone applications are fixed, about 1.7 million will still fail next New Year’s Day.

September 8, 2001 – End-of-file Bug (Part 2) Unix programs using 999,999,999 as an end-of-file marker confuse the data with the date.
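Several of these dates can be checked with simple arithmetic. The sketch below rests on two assumptions that are mine, not the article's: the GPS week count starts at 00:00 UTC on 6 January 1980 (midnight at the end of the 5 January date quoted above) and is stored in 10 bits, and the hypothetical `looks_like_eof` function writes dates as month, day and two-digit year with no separators.

```python
from datetime import datetime, timedelta, timezone

# 999,999,999 seconds after the Unix epoch (1970-01-01 00:00:00 UTC).
unix_nines = datetime.fromtimestamp(999_999_999, tz=timezone.utc)
print(unix_nines.isoformat())  # 2001-09-09T01:46:39+00:00

# Assumption: a 10-bit week counter (1024 weeks) starting 6 January 1980 UTC.
gps_epoch = datetime(1980, 1, 6, tzinfo=timezone.utc)
rollover = gps_epoch + timedelta(weeks=1024)
print(rollover.date())  # 1999-08-22


def looks_like_eof(date: datetime) -> bool:
    """Hypothetical check: does the date, written as M D YY, collide with the marker '9999'?"""
    return f"{date.month}{date.day}{date.year % 100:02d}" == "9999"


print(looks_like_eof(datetime(1999, 9, 9)))   # True
print(looks_like_eof(datetime(1999, 9, 10)))  # False
```

In UTC the 999,999,999-second mark falls early on 9 September 2001; in U.S. time zones it is still the evening of 8 September, which matches the date given in the article.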

Numbers as identifiers

"A person might be identified by a social security number, employee number, membership number in various organizations, military service number, various account or policy numbers (strictly speaking, these latter don't identify him, but something he's related to; on the other hand, you might also say that about a social security number). A department may have a name (Accounting) and a number (Z99). A book has a title, a Library of Congress number, an ISBN (International Standard Book Number), not to mention various Dewey decimal identifiers in local library catalogs. And each copy of a book may have an 'accession number', assigned locally by a library for their overall inventory management." (Kent, 1978, p. 43).

Mapping the number system of one database onto another: Standard Industrial Classification (SIC) codes; Dialog’s MAP command.

Patent databases?

** Importance of tokenization and normalization. Numbers and punctuation push normalization routines to the limit.

The various ways of Writing Numbers

How to write a number: 1, 1.0, one, I, unity, 3 – 2, etc.

Forms of words describing numbers

“…billion, trillion and such jocular coinages as jillion, skillion, zillion…” Hurford, 1987, p. 44

How does one pluralize a number: 1s

Adding an ‘s’ for a plural form changes the number/letter name of a tool or weapon.

dialog

defense newsletters

Another example (here I am looking for articles about the U.S. Navy's ES-3 aircraft):

? s ES()3/ti

This fails to capture a record which has the title, "ES-3s Play Increasing Role in Carrier Operations."

This is not a huge problem, for more often than not, both singular and plural forms exist in the record; I did however find some instances where either only the singular or only the plural form existed in the record. This is also worthy of mention because to someone familiar with these and other aircraft, adding the "s" to the search argument doesn't really make sense (at least initially). It is one thing to see the plural form in text, quite another to search for an "HH-65AS" - there is no such thing. The letter designators at the end have meaning, they indicate the "model" or "version" of the aircraft, and it requires a shift in thinking so as to accommodate the computer.
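Why the plural slips past the query can be shown with a toy word index. The sketch assumes an index that splits on every non-alphanumeric character and folds case; that is my simplification for illustration, not Dialog's documented algorithm, and `tokenize` and `adjacent` are made-up names.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on anything that is not a letter or digit, then upper-case each piece."""
    return [piece.upper() for piece in re.split(r"[^A-Za-z0-9]+", text) if piece]


def adjacent(tokens: list[str], first: str, second: str) -> bool:
    """True when `first` is immediately followed by `second`, in the spirit of ES()3."""
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


title = "ES-3s Play Increasing Role in Carrier Operations."
tokens = tokenize(title)
print(tokens[:2])                    # ['ES', '3S']
print(adjacent(tokens, "ES", "3"))   # False: the indexed word is '3S', not '3'
print(adjacent(tokens, "ES", "3S"))  # True
```

Under these assumptions the searcher has to ask for the word "3S" even though no such designation exists, which is exactly the shift in thinking the note describes.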

How to search for a negative number?

Ancient forms of numbers/written numbers

There are no numerals to worry about at least. All numbers in the King James Bible appear as words. The form taken by these words (beyond the aforementioned lack of hyphens), however, can throw off the searcher unfamiliar with the period conventions. To take a famous example, if you were looking for references to the "Number of the Beast" and simply entered the query:

?s six(w)hundred(w1)sixty(w)six

You'd get zero hits. Finding these references requires either a much-modified search or the knowledge that the period form of this number is "six hundred threescore and six". Questions concerning the form taken by numbers are made more difficult by the fact that modern and archaic forms are used interchangeably. Alongside the 91 uses of the word 'threescore' are 13 of 'sixty', and the 35 'fourscore's alternate with 3 'eighty's. Just a point to keep in mind when searching for Biblical references to a number (a subject of great interest for certain types).
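One way to see that the modern and archaic spellings name the same number is to reduce both to an integer. The following is a minimal sketch under my own assumptions: the vocabulary tables and the function `words_to_int` are illustrative, they cover only the forms mentioned here plus ordinary English number words, and nothing like this is offered by the search system itself.

```python
UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
# Archaic multiples of twenty; only the two forms discussed in the text.
SCORES = {"threescore": 60, "fourscore": 80}


def words_to_int(phrase: str) -> int:
    """Convert a spelled-out number (modern or 'score' style) to an integer."""
    total = 0
    current = 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word in SCORES:
            current += SCORES[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"unknown number word: {word!r}")
    return total + current


print(words_to_int("six hundred threescore and six"))  # 666
print(words_to_int("six hundred sixty six"))           # 666
```

A searcher still has to type the form that is actually in the text; the sketch only shows how many surface forms collapse to one value.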

“But the use of an alternative notation for numerals is seldom, if ever, obligatory, and conventional orthographic forms exist. 365 can be written out three hundred and sixty five. The alternative notation can be seen as an efficient shorthand for the longer forms, although it is no doubt significant that such shorthands are especially common for numeral expressions. But there are other shorthands, such as e.g., i.e., &, +, @, (, =,%, in quite common use.” Hurford, 1987, p. 5

“It is interesting to note that the number 2 is never (standardly) named by an expression like one plus one, although the number 11 is, not surprisingly, often expressed as something like ten plus one.” Hurford, 1987, p. 8.

“…that the French numeral system has the remains of a 20-based system in the expression quatre-vingts. And then someone else will usually chip in with the information that in parts of French-speaking Belgium and Switzerland a purely decimal system with septante, octante, and nonante is found.” Hurford, 1987, p. 15.

Tokenizing and Normalizing Numbers

Query tokenization can make a target impossible to find, or parts of a number can conflict with parts of the query language.

Normalization of numbers strips them of meaning

geobase

Searching for numbers is difficult. It is impossible to determine what punctuation is between numbers.

A search for 0()0()0()0 brought the following results:

0-0-0-0

(0.0(0.0-5.0,P

>>biography master index

Numbers: Punctuation is stripped out of numbers with the exception of hyphens. (We did not find an example of ampersands embedded within numbers and therefore make no claims on this.) As a result, decimal points and commas should be excluded from queries. For example, a record containing the birth and death dates of a particular individual may look like this:

>Smith, Henry 1545?-1649.

(We suspect the question-mark indicates indexer uncertainty.) Queries finding this record would not include the question-mark as the question-mark is not indexed. A question-mark in a query would either be interpreted as a truncation, e.g. 47=>f 1545?, or a wild-card, e.g. 48=>f 1545?-1649.
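The indexing behavior we observed can be mimicked roughly as follows. This is a sketch of the rule as we inferred it (drop all punctuation except hyphens), not the vendor's actual routine, and `index_terms` is a hypothetical name.

```python
import re


def index_terms(field: str) -> list[str]:
    """Drop every punctuation mark except hyphens, then split on whitespace."""
    cleaned = re.sub(r"[^\w\s-]", "", field)
    return cleaned.upper().split()


print(index_terms("Smith, Henry 1545?-1649."))
# ['SMITH', 'HENRY', '1545-1649']
```

Under this rule the question mark never reaches the index, so a query containing one can only be read as a truncation or wild-card symbol.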

Roman numerals

dialog

(quotation database)

This was a relatively harmless error, but a more insidious orthographic error is the difference between a citation for "bk.I" in which the first book is represented by the Roman numeral (essentially, the letter "I") and a citation for "bk.1" in which the first book is represented by the Arabic numeral. The use of Roman numerals (I, II, III, etc.) and Arabic numerals (1, 2, 3, etc.) in the descriptions of books, volumes, parts, chapters, etc. seems to be about equal: most often, the first category is a Roman numeral, followed by Arabic ("bk.I, vol.1, ch.1"). However, this pattern does not always hold true and, in fact, sometimes no descriptions are used at all (I, 12, 150). It's necessary to know how the source is described if you're searching based on these criteria. A search for

s bk()i/nt

would not find a record in which the Note field included the phrase "bk.1."
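Since either numeral system may appear, a careful searcher would try both forms of each number. A small helper could generate them; the names `to_roman` and `numeral_variants` are my own, and the sketch only produces the terms, leaving it to the searcher to combine them in whatever way the system allows.

```python
ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number: int) -> str:
    """Convert 1..3999 to a Roman numeral using the standard subtractive forms."""
    if not 1 <= number <= 3999:
        raise ValueError("Roman numerals are only defined here for 1..3999")
    parts = []
    for value, symbol in ROMAN_PAIRS:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def numeral_variants(number: int) -> list[str]:
    """Return the Arabic and Roman spellings a citation might use."""
    return [str(number), to_roman(number)]


print(numeral_variants(1))   # ['1', 'I']
print(numeral_variants(12))  # ['12', 'XII']
```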

Confusion of numbers/words = zero and “Oh”

agricola

Another number-related problem in Dialog will undoubtedly stump even the most prepared database user. Scanning the basic index reveals that "0" (the number zero) sometimes stands in the place of "O" (the letter O):

E16 1 0BSERVATIONS

E23 2 0CTOBER

E48 2 0LD

>>ei compendex

7. In the CN field (as in the ID field), periods are honored. However, this isn't necessarily true in other fields.

?s 804.2

S25 2 804.2

------------------

>>419-008

>>foodline

>>datastar

>>numbers, punctuated

>>numbers, formulas

>>punctuation, comma

Interestingly, a search for '0' resulted in 44,358 records and a search for '000' resulted in 21 records. Let me qualify the preceding sentence by stating that this phenomenon is interesting to humans only, not computers. The fact that the number zero could be searched using one, two, three or more "zero characters" seems silly from a human perspective because, simply, zero is zero. The thought might never occur to a novice searcher to look for numbers in this fashion. Of course, to a computer, `0' and `000' are quite different and distinct words.

1_:'000' results in 21 records

One citation (FOST Accession #0000384609) included the phrase "20 000" and another citation (FOST Accession #0000372387) contained the phrase "Petrothene NA 214-000 Resin." Therefore, a search for "000" may result in numbers or parts of formulas or titles.

2_:`250' adj `000' results in 1 record

This citation (FOST Accession #0000357359) includes the phrase "250 000." One might conclude that this type of search will work for any number larger than 999. However, a search for:

3_:`14' adj `000' results in 0 records

4_:`14000' results in 26 records

Of these records, one included the word "14000" (without a comma) while another included the word "14,000" (with a comma). (FOST Accession #0000418881 and #0000387345) Searching for exact numbers in the FOST database has proven to be quite arbitrary.
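Given this inconsistency, the safest strategy is to search every plausible spelling of a number. The helper below is a hypothetical sketch (`number_variants` is not a feature of any system discussed here) that lists the three forms seen in these records: unpunctuated, comma-grouped and space-grouped.

```python
def number_variants(number: int) -> list[str]:
    """Return the distinct written forms of a whole number seen in the records."""
    plain = str(number)                          # 14000
    with_commas = f"{number:,}"                  # 14,000
    with_spaces = with_commas.replace(",", " ")  # 14 000
    # dict.fromkeys drops duplicates but keeps order; numbers below 1000 have one form.
    return list(dict.fromkeys([plain, with_commas, with_spaces]))


print(number_variants(14000))  # ['14000', '14,000', '14 000']
print(number_variants(250))    # ['250']
```

The space-grouped form corresponds to the adjacency searches above (for example `250' adj `000'), while the other two forms may or may not be indexed as a single word.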

Numbers treated as phrases

>>datastar

>>pais

Sometimes, though, the period is retained:

PAIS 16_: a.14

AN 961108500 961216.

SD (Sales no. E.96.II.A.14) (UNCTAD/DTCI/32).

The SD (Series Description) field seems to contain many, many examples of numbers and letters joined with punctuation; how these are indexed is rather unpredictable. DataStar claims to drop all punctuation except decimal points in the middle of numbers; this obviously is not always true. Decimal points at the beginning of numbers also seem to be indexed, even when immediately preceded by a letter; and hyphens are sometimes indexed.

>>102-024

>>agricola

>>dialog

>>numbers, punctuated

>>fields, different rules across

>>punctuation, retained in field

Note that a searcher might be confused further if she notices that number punctuation is retained in hard phrases:

?s '0.70 disease ratio'/id

S6 1 '0.70 DISEASE RATIO'/ID

>>411-005

>>book review index

>>dialog

?S "1,2,3"/TI

S53 0 "1,2,3"/TI

?S 1, 2, 3

S55 5 1, 2, 3

Numerical aspects of Dialog and Datastar themselves

- Calibrating truncation or wildcards

- Making back references to sets. Numbers to label search sets. Note DataStar's default to a set label and, if that fails, to a text number.

Creation of forms and crosstabulations online

Numbers as corporate names

>>112-005

>>world reporter

>>dialog

>>names, corporate

>>names, with embedded numbers

>>numbers, spelled out

>>punctuation, hyphen

Numbers can be searched in Dialog, but they may appear as numbers or they may appear spelled out as words:

? s 7 () 11/co

30 7/CO

4 11/CO

S1 0 7 () 11/CO

? s 7 () eleven/co

30 7/CO

33 ELEVEN/CO

S2 29 7 () ELEVEN/CO

? s seven () 11/co

105 SEVEN/CO

4 11/CO

S3 0 SEVEN () 11/CO

? s seven () eleven/co

105 SEVEN/CO

33 ELEVEN/CO

S4 4 SEVEN () ELEVEN/CO
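The four searches above are simply every combination of digit and spelled-out forms. A sketch of how such a list could be generated follows; the lookup table `FORMS` and the function `name_queries` are illustrative assumptions, and the output only reuses the `s ... ()/co` pattern shown in the transcript.

```python
from itertools import product

# Hypothetical lookup of alternative spellings for each numeric part of a name.
FORMS = {"7": ["7", "seven"], "11": ["11", "eleven"]}


def name_queries(parts: list[str], field: str) -> list[str]:
    """Build one adjacency query per combination of spellings."""
    alternatives = [FORMS[part] for part in parts]
    return [f"s {' () '.join(combo)}/{field}" for combo in product(*alternatives)]


for query in name_queries(["7", "11"], "co"):
    print(query)
# s 7 () 11/co
# s 7 () eleven/co
# s seven () 11/co
# s seven () eleven/co
```

The number of combinations doubles with every numeric part, so a name with three such parts would already need eight searches.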

This company also has an embedded hyphen in its name, which gets stripped out in word-indexing but reinstated in phrase-indexing:

? e co=7-eleven

Ref Items Index-term

E1 1 CO=50 OFF STORES INCORPORATED

E2 2 CO=600 GROUP PUBLIC LIMITED COMPANY

E3 0 *CO=7-ELEVEN

E4 29 CO=7-ELEVEN CO.

E5 6 CO=7TH LEVEL INCORPORATED

? e co=seven-eleven

Ref Items Index-term

E1 84 CO=SEVEN NETWORK LIMITED

E2 3 CO=SEVEN SEAS PETROLEUM CORPN

E3 0 *CO=SEVEN-ELEVEN

E4 4 CO=SEVEN-ELEVEN JAPAN CO LIMITED

E5 1 CO=SEVEN-UP BOTTLING CO PLC (NIGERIA)

Numbers inside words

A10tion - attention

Gr8 – great

“H4ck1ng for g1rl13z” at http:hackedphiles/nytimes’hacked/

Dialog example of 4-6 is broken into two words.

>>423-006

>>dialog

>>art literature international

>>words, spelling errors

>>numbers, plus letters - punctuated

>>punctuation, retained in field

>>punctuation, apostrophe

An example of a typo being indexed is the string of terms below:

E5 1 THE'70S

The embedded single quote is part of the string of characters and needs to be used when searching for that thing the computer thinks is a word.

?s "the'70s"

S19 1 "THE'70S"

This is the title that the string is pulled from: "Bookmaking in the'70s; redefining the artist's book."

>>103-029

>>art literature international

>>dialog

>>numbers, punctuated

>>punctuation, retained

>>punctuation, stripped and used to break words

>>punctuation, hyphen

>>punctuation, colon

>>numbers, standard

Numbers are treated the same as words in Dialog database 191, and are broken on the same punctuation. Numbers do not have to be literalized, unless they can be confused with search statements.

**The phrase 6:9-11 (as in the Bible verse Revelation 6:9-11) retains the hyphen, but the colon is stripped out and an adjacency operator must be used.

?s 6()9-11

Record #0126780

Numbers beginning words

Sorting

(dissertation abstracts)

Scanning Dialog's basic index and Datastar's dictionary file reveals numerous index entries that have a low probability of ever being targeted by a query. Here, for example, is the result of a root command in Datastar starting at "0":

0.0L

0.02V

0.2-2-3-32REV

0.30S

0.33PER

0.5MM

"0.33PER" refers to thirty-three cents (per unit), "0.5MM" refers to a precise metric measurement, "0.02V" refers to voltage, and so on. It might be interesting to try to determine what percentage of the Dialog and Datastar index space is wasted, how much network bandwidth is wasted, and how much searcher time is wasted in dealing with the indexes as they now stand.

>>301-008

>>numbers, homonymic use

>>words, spelling errors

This record shows two different spellings for the same man. The TI field spells his name "2pac" while the SU field spells his name "Tupac."

NO: BBIO94018201

AU: Hamilton, Kendall.

TI: Double trouble for 2pac.

SO: Newsweek v. 124 (Dec. 12 '94) p. 62-3

PH: p. 62-3 : pors.

IS: 0028-9604

PB: H. W. Wilson Co.

PL: United States

PD: 1994

RT: art

AC: biography

SU: Shakur, Tupac, rap musician and actor.

>>301-002

>>numbers, and letters

>>numbers, homonymic usage

Numbers as Text

Numbers as words

2 - to

4 - for

8 - ate

4 2sday night - for Tuesday night

Here is an entertaining record containing a word made up of a number, hyphen and letters:

NO: BBIO85000760

AU: Donahue, Deirdre.; Kelley, Jack.; Schindehette, Susan.

TI: In the golden afterglow 10-acious Mary Lou Retton attacks the rest of her life.

Numbers ending words

Numbers as words

Dialog can use single numbers such as 4, but deconstructs compound numbers such as 1,234

“In Beeptalk, The Words Add Up,” NY Times Wed April 29, 1998.

Numbers as money

>>109-040

>>datastar

>>pais

ROOT *L$

R1 1 DOC *L0.25

R2 2 DOCS *L0.28

R3 6 DOCS *L0.3

The '*L' seems to denote English pounds.

Numbers as time

publishing house on the Internet. Its name? "00h00", or "Zero Heure", because he knows very well that he is taking off into new, uncharted territory.

>>405-008

>>biography master index

>>dialog

>>fields, different rules across

>>numbers, dates

>>words, truncation - automatically done

Searching for Dates

Searching for double-digit, Common Era dates in the "year of birth" and "year of death" fields poses a further orthographic challenge. This database assumes truncation after double-digit dates, unless these are qualified as "bc" (e.g., the query yb=11 bc returns only one record).

Thus, if a searcher enters:

"yb=18",

479,853 records are captured. These include the birth years:

1822 (9999995 File 287)

1840s (9999967 File 287)

18 B.C. (9894880 File 287)

18? B.C. (9894881 File 287)

180? (9483076 File 287)

A search for the record(s) containing the exact birth year "18", then, is extremely cumbersome and time-consuming.

A search for "yb=60" also returns a record containing "6th Cent. B. C." in the year of birth field (9529863 File 287). A search for "yb=6th?", however, returns no records.
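The implicit truncation can be pictured as a prefix match on the field value. The sketch below assumes plain prefix matching, which is my simplification: it reproduces the "yb=18" results listed above but does not explain the "yb=60" case, so the real rule is evidently more complicated. The function names are hypothetical.

```python
def implicit_truncation(query: str, values: list[str]) -> list[str]:
    """Return every field value that merely starts with the query string."""
    return [value for value in values if value.startswith(query)]


def exact_match(query: str, values: list[str]) -> list[str]:
    """Return only the values identical to the query, which is what the searcher wanted."""
    return [value for value in values if value == query]


birth_years = ["1822", "1840s", "18 B.C.", "18? B.C.", "180?", "18", "1799"]

print(implicit_truncation("18", birth_years))
# ['1822', '1840s', '18 B.C.', '18? B.C.', '180?', '18']
print(exact_match("18", birth_years))
# ['18']
```

The gap between the two result lists is the "extremely cumbersome" part: the one exact record is buried among everything that shares its first two characters.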

Numbers as scientific notation, mathematical notation

>>dialog

>>datastar

>>dissertation abstracts

It is unclear whether failure to find a particular formula means that it does not exist in the literature, the searcher just failed to translate the formula correctly, or a UMI typographic error is hiding instances of the formula.

With longer formulas that are written in TeX code instead of being presented as a graphic, the formula appears to the searcher as a big mess. We have heard that, in general, putting a formula in an abstract is frowned upon in the sciences, except in dissertations where the abstract is the introduction to the paper.

Some software (MS Word, for example) responds to the TeX code as command information instead of content, causing annoying side effects when working with the abstracts.

Here are extracts from several abstracts showing what longer formulas look like in DA Online:

01592145

A STUDY OF IMPURITY SCREENING IN ALCATOR C-MOD PLASMAS (ARGON, SCANDIUM)

Author: WANG, YING

Degree: PH.D.

Year: 1996

The argon screening efficiency $p\equiv{number\ of\ Ar\ atoms\ in\ the\

plasma\over number\ of Ar\ atoms\ injected}$ was found to be independent of

the divertor target plate strike point locations, the outer gap and the

European/American forms

>>109-019

>>pais

>>epic

>>numbers, money

>>numbers, punctuated

>>punctuation, comma

>>punctuation, period

>>numbers, money - inconsistently punctuated

Note that in the following example, the European author wrote the title with a period as the thousands separator, while the abstractor used a comma.

f francs

Searching ...

S E A R C H R E S U L T S

Search   Records   Search Term
ID       Found
------   -------   -----------
S55      42        francs

R e c o r d 2 o f 42

PAIS International Database (c) 1997 by Public Affairs Information Service, Inc.

NO: 97-0302410

AU: Gilain, Bruno

TI: L'allocation universelle a 8.000 francs: entre necessite et utopie.

SO: Univ Catholique Louvain Inst Rech Econs et Socs 330 Belgian francs Ag 1996 28 leaves

YR: 1996

AB: Reviews theories justifying the provision of a minimum guaranteed income for every individual, in full or partial replacement of traditional social security payments; includes a proposal for payment of 8,000 francs a month to each adult, legally residing in Belgium, and estimates of the cost.
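A normalization routine could treat "8.000" and "8,000" as the same amount. The sketch below uses a heuristic of my own: a period or comma followed by exactly three digits is read as a grouping mark. It is not something either vendor is known to do, and it is deliberately cautious about anything that looks like a decimal fraction.

```python
import re


def normalize_grouped(text: str):
    """Return the integer value of a digit string with '.' or ',' grouping, else None."""
    if text.isdigit():
        return int(text)
    # Heuristic: groups of exactly three digits after each separator.
    # Ambiguity: an American '1.234' (a decimal) would also be read as 1234.
    if re.fullmatch(r"\d{1,3}(?:[.,]\d{3})+", text):
        return int(re.sub(r"[.,]", "", text))
    return None


print(normalize_grouped("8.000"))  # 8000 (title, European style)
print(normalize_grouped("8,000"))  # 8000 (abstract, American style)
print(normalize_grouped("0.70"))   # None (looks like a decimal fraction)
```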

Numbers as accounting data

Numbers as classifications, or class numbers.

Their arrangement.

Numbers as unique database record identifiers.

Examples:

AT&T, @Home near $1.3B deal Article title in USA Today

Lewin, T. (1998, April 29) "In beeptalk, the words add up." NY Times p. ?

The conventional beeper greeting is still in dispute. Most teen-agers like 07734, which is hello if you turn it upside down and squint. But some go with lower-case l's, spelling it as 01134. Or just 14, for an upside-down hi. Upside down is a big thing in beeptalk. There is 710, which becomes OIL, meaning "I'm out of gas," and 87, which is L8 upside down for "late." How about sideways ones like 303 for Mom? Moving into a multicultural realm, there is 50538 (BESOS, or "kisses" in Spanish).

Worth magazine, March 1998

mid-70s

sub-$1,00

mid-50-percent

30-year treasury

ten-year periods ended 9/30/97

550-billion-barrel oil reserve

87-year-olds

Whiskers grow .01 inch per day

call 888-OP-SMILE

three- to five-year pick

development-$1.9 billion

davis rule #1

11:30 p.m.

1950s

$55-a-bottle

2-lb. Burrito

1-800-876-3261

1 800 Get-advice [letters stand in for numbers]

as simple as 1-2-3

top $$$$$ rating

19 mpg city/22 mpg highway

1-800-Franklin ext. T294

NetTax '9X

DataStarej397552

N=220 is the search element

References

Hurford, J.R. (1987). Language and number: The emergence of a cognitive system. Oxford, UK: Basil Blackwell.

Kent, W. (1978). Data and reality: Basic assumptions in data processing reconsidered. Amsterdam: North-Holland.

Citations to check:

Bartsch, R. 1973: The semantics and syntax of number and numbers, in Kimball, Syntax and semantics, v.2, pp. 51-93.

De Villiers, M., 1923: The Numeral Words: their Origin, Meaning, History and Lesson. H. F. and G. Witherby, London.

Research idea:

An investigation of everyday orthography

By understanding everyday orthography, one can establish a criterion for calibrating how easy or difficult a database query language is, that is, its distance from an everyday orthography.

Use an applet with a sound component. Have the user type in his search for the item requested.
