


A Proposed Numerical Data Standard and Automated Network Cluster Analytics

Joseph E. Johnson, PhD

Department of Physics, University of South Carolina,

Columbia SC, 29208, USA

jjohnson@sc.edu

July 7, 2016

Abstract

A standard is proposed for all numeric data that tightly integrates (1) each numerical value with (2) its units, (3) its accuracy (uncertainty) level, and (4) its defining metadata into a new object called a metanumber, with full mathematical processing among such metanumber objects. We propose this metanumber standard for published numeric data in order to lay a foundation for fully automated processing by intelligent agents. The standard has been designed, programmed, and is now operational on a server in a Python environment as a multiuser cloud application accessible from any internet-linked device. We then explore one example of how such a data standard can support artificial intelligence and Big Data with automated cluster analysis of the associated derived networks, using our theorems based upon Markov-type Lie algebras and groups.

Keywords—Units, Metadata, Uncertainty, Numerical Standards, Networks, Clustering, Markov-Type Lie Algebras

I. Introduction

Numerical data from most sources [1,2,3] has the associated units of measurement and the exact definitions of the values embedded in text, located in row and column headings, footnotes, and other locations, in multiple formats. While normally easily understood by humans, computers are unable to accurately infer the exact units for each value, or to extract the exact meaning of the value, in such unstructured environments. Furthermore, assuming that the accuracies of the values are reflected in the number of significant digits, then when values are processed by a computer the results are values with essentially as many digits as the computer retains in single or double precision. Thus the number of significant digits is lost unless it is captured immediately, before any processing occurs, and then propagated correctly throughout the execution of all mathematical operations; otherwise the accuracy of the results is obliterated with the first operation. Big Data problems refer not only to those data sets of vast size which require extensive processing speed, storage, and transmission rates, but also to the massive problems incurred with the astronomical number of existing numerical data tables, each formatted and structured in a different way. This lack of standardization demands human preprocessing every time a table is used, to convert it to the user's framework for subsequent computation, thus frustrating automation. In conclusion, the examination of most published tabular data suggests that it is designed for space compression and easy reading by humans. As space is now rarely a problem, and as human preprocessing becomes unacceptable, we seek a standard that supports simultaneous human and computer readability of the value, uncertainty, units, and exact meaning. We call this standardized string object a "metanumber", as it is actually a string tightly linking all four information components, which are to be managed through all mathematical expressions and operations to give new metanumbers [4,5]. This standard specifically satisfies the following requirements. It is to:

(a) be easily readable by both humans and computers;
(b) be consistent with international scientific units, founded on the SI metric system, with automatic management of dimensional analysis, including other units used in specific subject domains;
(c) provide automatic computation and management of numerical uncertainty;
(d) support a defining metadata standard giving a unique name and exact meaning for each number;
(e) allow the linkage of unlimited metadata qualifiers and tags without the burden of an associated data transfer or processing;
(f) be sufficiently flexible to evolve and encompass existing domain standards that are currently in use;
(g) support computational collaboration and sharing among user groups;
(h) be compatible with current standards in computer coding;
(i) provide an algorithm to convert, as well as possible, current nonstandard representations to this standard, until such time as data is published in this standard;
(j) support the analysis of a number's complete historical computational evolution, tracking the associated path-dependent information loss;
(k) support optional extensions to the SI base units, and easily allow output in any valid units as well as the default metric;
(l) be configured as a module or function that can be included in other languages and systems with an API call;
(m) automatically accomplish conversion to the metric (SI) system, as all default output is metric;
(n) be simple and minimal in its form; and
(o) support the smallest footprint of space with the fastest possible simplification of mathematical expressions.

We will also demonstrate ways in which intelligent agents can operate automatically upon such standardized data, utilizing our recent work on the mathematical foundations of networks and associated cluster identification algorithms. That application utilizes a means of converting metanumber tables of entities with their properties into a form that allows conversion to a network, thus supporting a new automated procedure for cluster identification for each new standardized metanumber table. Our cluster identification algorithm will provide an example of how this metanumber standard can support advanced intelligent analytics.

II. MetaNumber Standards Defined

1 A Numerical Accuracy and Uncertainty Standard

While some numbers are exact, such as the number of people in a room or the value (by definition) of the speed of light, almost all measured values are limited in accuracy, as represented by the number of significant digits shown. In very few cases one might know the standard deviation of the underlying probability distribution, but such knowledge is rarely available. The correct management of numerical accuracy (uncertainty) is very complex, because one is actually replacing a real number with a probability distribution, which is a function. This function needs multiple other real numbers to specify it unless one is confident that it is a normal distribution, and even then such functions do not close into other simple functions under all mathematical operations. While the author's research into this subject led to a new class of number [6,7] (similar to one based upon quantum qubits), that method is so complex that we do not wish to utilize it in this initial standard framework. Programming the correct propagation of accuracy is also a complex task. Thus we were fortunate that an existing open-source Python module (Python Uncertainty) [8] has been developed that can be integrated into our software. Because the number of significant digits is always available, we have chosen this module for the management of numerical accuracy: our Python functions convert an uncertain value into a new class of number called a "ufloat", which retains both the numerical value (represented with correct accuracy) and the associated uncertainty arising from the limited number of digits. Thus it only remained for us to encode methods for when and how to convert a value to a ufloat or to keep it exact.

Our standard is to represent any exact value by using an upper-case "E" for scientific notation for exact numbers, with the lower-case "e" reserved for uncertain values. If an exact value is not in scientific notation then it is to be followed by an "E0", such as "413E0", "8.452E0", or "9.4E5". If these same values were uncertain, as indicated by the number of digits, they would be written as 413, 8.452, and 9.4e5 respectively. The Python algorithm automatically converts all keyed and data table values to the correct form as just described, while all unit conversions are treated as exact values. A virtue of this system is that no other explicit notation is needed to encode numerical accuracy. Furthermore, the Python Uncertainty module also supports other forms of uncertainty, which can be invoked in future releases of our software.
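To make the convention concrete, the following minimal Python sketch (our illustration, not the production MetaNumber code) converts an input string to either an exact float or a ufloat from the open-source uncertainties package; the rule of one unit of uncertainty in the last significant digit is an assumption here, since the text does not fix the exact mapping from digit count to standard deviation.

from uncertainties import ufloat

def parse_metanumber_value(text: str):
    """Exact float for upper-case 'E' notation, else a ufloat whose
    uncertainty is one unit in the last significant digit (assumed rule)."""
    if "E" in text:                        # upper-case E marks an exact value
        return float(text)                 # e.g. "413E0" -> exactly 413.0
    low = text.lower()
    mantissa, _, exp = low.partition("e")
    exponent = int(exp) if exp else 0
    decimals = len(mantissa.split(".")[1]) if "." in mantissa else 0
    sigma = 10.0 ** (exponent - decimals)  # one unit in the last digit
    return ufloat(float(low), sigma)

print(parse_metanumber_value("413E0"))     # 413.0 (exact)
print(parse_metanumber_value("8.452"))     # 8.452 +/- 0.001
print(parse_metanumber_value("9.4e5"))     # ~9.4e5 +/- 1e4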

2 Units & Dimensional Analysis Standard


Our standard format for attaching units to a numerical value is to represent all units by variable names [9] which are attached to the numerical value so that the expression is a valid mathematical expression, such as the velocity 4.69e2*m/s. When data is expressed in tabular form, one has a choice of compressing the notation by writing the units for an entire table, row, or column in the heading, and then requiring the multiplication that attaches the units to the value to be executed at the time of use. While this is the more common convention and requires less storage space, it also requires more processing time when each value is retrieved. As space is now rarely a problem, we have generally chosen to adjoin the units directly to each value, thus increasing storage while reducing subsequent processing time and improving human readability. Naturally, when units are homogeneous over an entire table, row, or column of an extremely large data table, the alternative notation allows the units to be adjoined at the time of retrieval, with the result that a metanumber standardized table in this format requires no more space than otherwise while increasing the processing. That method, offered as an option, labels a row or column with "*", which causes the units that follow in that row or column (or even the table) to be adjoined to the value at the time of retrieval.

The general rules for unit names are as follows:

(1) Units are variable names which together with numbers form valid mathematical expressions: 3*kg+2.4*lb. The conversions are automatically executed by the metanumber software.
(2) Units are single names, such as "ouncetroy" not "troy ounce", since each unit is a single variable name.
(3) Unit names are lower case, using only a–z, 0–9, and the underscore "_" character, and lead with an alphabetic character: e.g. mach, c, m3, s_1, gallon, ampere, kelvin. This is in keeping with modern software conventions, in which variables are lower case and variable names are case sensitive; the lower-case convention also removes potential ambiguity.
(4) Unit names never use superscripts or subscripts, or any information contained in font changes: m3, not m with a superscript 3. Such fonts are treated inconsistently by different software languages, so no information may be represented by a change of font.
(5) Unit names never use Greek or other special characters or symbols: "ohm" not "Ω", "hb" not "ħ", "pi" not "π".
(6) Unit names are defined by nested sequential definitions in terms of the basic SI units, which can also be abbreviated as m, kg, s, a, k, cd. The Python code defines each unit in terms of previous units, back to the basic SI standard.
(7) We use singular spelling, not plural: foot not feet, mile not miles. This avoids spelling mistakes, reduces the code size, and increases processing speed.
(8) The software includes all dimensionless prefixes as separate words, to be compounded in any order as needed: micro*thousand*nano*dozen*ft. Note that "*" or "/" must be present between each unit or prefix, with parentheses used as necessary.
(9) Evaluation of expressions always results in SI (metric) units as the default. To obtain the result in any other units one uses the form (expression) ! (units desired), e.g. c ! (mile/week).
(10) Dimensional errors are trapped: 3*kg + 4*m gives an error.
(11) All unit conversion values are treated as exact numerical values without uncertainty.
(12) A user can use a past result with the form [my_423], where 423 is the sequence number of a previous result, or just [423].
(13) A unit followed by an integer, such as m3, is taken to that power: m3 = m^3 (a cubic meter). An underscore implies a negative power: m_2 = m^-2, meaning per square meter.
(14) The fundamental metric units have been extended with six new optional units: the bit, or b, for a bit of information (T/F, 1/0); the flop, or op, for a single floating point operation; the person, or p, for a living human, as with per capita; the dollar, usd, or d, for the US dollar; bn for baryon number; and ln for lepton number. These units vastly extend dimensional analysis and clarify the meaning of expressions by allowing both information-based units such as flops = flop/s and baud = bit/s, and socioeconomic units such as p/acre (population density) and d/p (income as dollars per person).

A user can find the list of internal units on the web site under "tables". With the metanumber system one can easily calculate the gravitational force between two masses even when the values are expressed in diverse units, such as between 15 lb and 12.33 kg separated by 18.8 inches: g*15*lb*12.33*kg/(18.8*inch)**2, using F = G*m1*m2/(r12)^2 (a simplified sketch of this unit arithmetic follows).
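The following minimal Python sketch (our illustration, not the MetaNumber implementation) shows how units can behave as variables carrying SI dimensional analysis, including the trapping of dimensional errors (rule 10) and the gravitational-force example just given; the Quantity class and unit definitions are our own simplified constructions.

class Quantity:
    """A value plus a vector of exponents over the SI base units."""
    BASE = ("m", "kg", "s", "a", "k", "cd")

    def __init__(self, value, dims=(0, 0, 0, 0, 0, 0)):
        self.value, self.dims = value, tuple(dims)

    def __mul__(self, other):
        other = other if isinstance(other, Quantity) else Quantity(other)
        return Quantity(self.value * other.value,
                        [a + b for a, b in zip(self.dims, other.dims)])
    __rmul__ = __mul__

    def __truediv__(self, other):
        other = other if isinstance(other, Quantity) else Quantity(other)
        return Quantity(self.value / other.value,
                        [a - b for a, b in zip(self.dims, other.dims)])

    def __rtruediv__(self, other):       # handles number / Quantity
        return Quantity(other) / self

    def __pow__(self, n):
        return Quantity(self.value ** n, [d * n for d in self.dims])

    def __add__(self, other):            # rule (10): dimensional errors trapped
        if self.dims != other.dims:
            raise ValueError("dimensional error: incompatible units")
        return Quantity(self.value + other.value, self.dims)

    def __repr__(self):                  # rule (13) notation: m3, s_2, ...
        def fmt(u, e):
            if e == 1:
                return u
            return f"{u}{e}" if e > 0 else f"{u}_{-e}"
        units = "*".join(fmt(u, e) for u, e in zip(self.BASE, self.dims) if e)
        return f"{self.value:g}*{units}" if units else f"{self.value:g}"

# Units defined by nested sequential definitions back to the SI base (rule 6).
m    = Quantity(1, (1, 0, 0, 0, 0, 0))
kg   = Quantity(1, (0, 1, 0, 0, 0, 0))
s    = Quantity(1, (0, 0, 1, 0, 0, 0))
inch = 0.0254 * m
lb   = 0.45359237 * kg
g    = 6.674e-11 * m**3 / (kg * s**2)    # gravitational constant

# The gravitational-force example from the text, in mixed units:
print(g * 15 * lb * 12.33 * kg / (18.8 * inch) ** 2)   # ~2.46e-08*m*kg*s_2 (newtons)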

3 The Metadata Table Format


Standardized metanumbers are to be electronically published in tables of any number of dimensions: one dimension for a list of values such as masses, or two dimensions for an array, often representing entities in rows with the properties of those entities in columns. For one-dimensional data the value can be retrieved with the form [mass_proton], where "mass" is the name of the table on the default directory of the default server and "proton" is the unique index name of the row given in column one. For a two-dimensional table one might write [e_iron_thermal conductivity], which retrieves the thermal conductivity of iron in the elements table abbreviated as "e". Thus the general multidimensional form for retrieving standardized metanumbers is [file name _ row name _ column name], which can be extended to higher-dimensional forms. These names (like database indices) must be unique; the program removes all white space and lowers the case of each index string prior to comparison. If the table is not in the default directory and server, then the metanumber is retrieved as [internet path to server_directory path __ file name _ row name _ column name], allowing data to be stored and retrieved from any computer on the net. It is of great importance to realize that with this standard this expression provides a unique name for every standardized numerical value (metanumber), exactly like the unique email address of every human. These expressions can be used as variables in any mathematical expression, such as 5.4 times the ratio of copper density to zinc density: 5.4*[e_copper_density]/[e_zinc_density]. For books, the file name might be the ISBN number followed by the path to that table or value in the electronic document. This design allows an automated conversion of all metanumber tables into a relational database structure or SAS data sets if desired; however, metanumber tables themselves are stored as simple comma separated values (CSV) in flat files.
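A minimal Python sketch of this retrieval convention is given below (our illustration, with an assumed local file layout rather than the actual server); it applies the stated comparison rule of lowering case and removing white space from index names.

import csv

def normalize(name: str) -> str:
    """Index comparison rule from the text: lower case, white space removed."""
    return "".join(name.lower().split())

def lookup(path: str, table_dir: str = ".") -> str:
    """Resolve a path such as [e_iron_thermal conductivity] against a CSV
    table whose row names are in column 1 and column names in row 1."""
    table, row, col = path.strip("[]").split("_", 2)
    with open(f"{table_dir}/{table}.csv", newline="") as f:
        rows = list(csv.reader(f))
    header = [normalize(h) for h in rows[0]]
    j = header.index(normalize(col))
    for r in rows[1:]:
        if normalize(r[0]) == normalize(row):
            return r[j]                   # the stored metanumber string
    raise KeyError(path)

# e.g. lookup("[e_iron_thermal conductivity]") might return "80.4*w/(m*k)"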

For each table there is metadata that applies to the whole table, with values for variables that include the table number, table name abbreviation, full table name, data source(s), name of table creator, email of table creator, date created, security codes (optional), optional units that are to be adjoined to all values, table format, and remarks. These values are contained in a special table which has one row for every standardized table; this table can be used to access the names of all tables along with the metadata for each as described above. Tables are formatted as comma separated values (CSV), which is easily created or viewed in spreadsheet software; thus commas are not allowed in the tables themselves except as separators. Each table must have a unique (index) name for each row, given in column 1, and for each column, given in row 1. These names must be unique when case is lowered and white space is removed for retrieval. The (1,1) cell in each table is to be in the format %%3986%%wbc, where 3986 is the unique table number and wbc is the unique table abbreviation. Thus it is easy to create a metanumber table from a standard spreadsheet array of values, which is then saved as comma separated values and submitted to the MN archive.
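The following sketch (our illustration, with hypothetical function names) shows how the (1,1) cell format and the uniqueness of normalized index names might be validated when a table is submitted.

import csv, re

CELL_1_1 = re.compile(r"^%%(\d+)%%([a-z][a-z0-9_]*)$")

def normalize(name: str) -> str:
    return "".join(name.lower().split())

def validate_table(path: str) -> None:
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    match = CELL_1_1.match(rows[0][0])
    if not match:
        raise ValueError("cell (1,1) must look like %%3986%%wbc")
    table_number, abbreviation = match.groups()
    cols = [normalize(c) for c in rows[0][1:]]
    # metadata rows beginning with "%" (described below) are exempted here
    row_names = [normalize(r[0]) for r in rows[1:] if not r[0].startswith("%")]
    if len(set(cols)) != len(cols) or len(set(row_names)) != len(row_names):
        raise ValueError("row and column index names must be unique "
                         "after lowering case and removing white space")
    print(f"table {table_number} ({abbreviation}): OK")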

Subsequent rows and columns can contain either standard metanumber values or metadata that applies to the associated row or column. If it is metadata (such as a web link, or other freeform data), then that row or column name is preceded by "%" to indicate this. If the metadata can serve as a unique alternative identifier (index) for a row or column, then the name is preceded by "%%". Rows or columns beginning with %% thus provide auxiliary unique index names, such as an element's symbol (column labeled %%symbol) or its atomic number (%%atomic number) in place of its full name. This design allows unlimited metadata to be associated with any given metanumber value by reference, so it can be retrieved when desired without encumbering any data transfer. For example, in the elements table, row two might begin with "%links", and under the column beginning with "thermal conductivity" in row one, row two could hold a web link to articles on the equations and explanation of the use of thermal conductivity. Thus future intelligent algorithms will be able to navigate the web to associated information using this standard. As a consequence of this design, the simple path to a metanumber, [table_row_column…], provides a universally unique name. This form is used to retrieve the metanumber and use it in expressions as a variable, and the same path name denotes the path to unlimited metadata associated with the value, without the transfer of that metadata. This allows unlimited metadata tags to be linked to pharmaceuticals, accounting expenditures, medical procedures, scientific references, equipment or transportation items, and other conditions or environments such as reference temperatures, longitude/latitude/altitude, and datetime.

While the above framework supports metadata that attaches to an entire table, row, or column of values, one often must attach metadata to specific individual metanumbers, such as longitude and latitude. This can be accomplished with the form *{var1=val1 | var2=val2 | …}, which can be multiplied at the end of any metanumber to provide information specific to that value, such as *{lon=…|lat=…|time=…}. Of special note are the missing value, {.} or {.|reason=div by zero}, and the questionable value, {?} or {?|reason=…}. Any missing value in an expression gives a missing result when evaluated. Any questionable value computes normally, but {?} is attached to the result. When the expression "{anything}" is encountered in a mathematical expression it is converted to '1'; thus 3.45*{anything} results in 3.45. The string "{anything}" is therefore a container for attaching information within the mathematics without altering the results. Our metadata and metanumber standards are compliant with NIST [4] and USGS [5] recommendations, as well as with the latest new basis for the SI/metric standards, which for the first time will base all fundamental units upon fundamental constants [10].
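A minimal sketch of the tag convention follows (our illustration): the braces are collected as metadata and each "{…}" is replaced by '1' so that the remainder evaluates as ordinary arithmetic.

import re

TAG = re.compile(r"\{([^{}]*)\}")

def strip_tags(expression: str):
    """Collect {...} tags as metadata and replace each with '1'."""
    tags = TAG.findall(expression)
    plain = TAG.sub("1", expression)
    missing = any(t.startswith(".") for t in tags)       # the {.} value
    questionable = any(t.startswith("?") for t in tags)  # the {?} value
    return plain, tags, missing, questionable

expr = "3.45*{lon=-81.03|lat=33.99}"
plain, tags, missing, questionable = strip_tags(expr)
print(eval(plain))   # 3.45 -- the tag evaluates as 1 (eval for illustration only)
print(tags)          # ['lon=-81.03|lat=33.99']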

4 Other Components of the MetaNumber Standard

There is a single table that contains all actions of all users, with the following variables: userid, seq#, datetime, input string, evaluated result, unit id# of result, optional subject, and unique name. The (userid, seq#) pair is a unique index for each line in the archive. A user can access their own past archived inputs and evaluated results by seq# value or by a given active subject; a user can only retrieve their own past actions. To obtain one's previous result 181, one would enter [my_181], or just [181], or [alpha32], where "alpha32" is a unique name for that value for that user. To obtain a list containing the values for that sequence number one can enter [all_181], which outputs a Python list of the values given above. A series of calculations can also be collected under a 'subject' name: entering the line "subject = string" sets the "subject" variable for that user to the "string". The subject name is retained for any session until "subject = null" or "subject = " is entered, or until the user logs out. This enables one to identify a set of calculations and remarks under a subject heading for later output. The command "subject = ?" will output the current subject. Further details can be found by entering "help", which gives the procedures for the latest software release, or in the web user guides.
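The sketch below (our illustration, with an assumed schema) models the archive line and the [my_181]-style retrieval; the field names are hypothetical paraphrases of the variables listed above.

from dataclasses import dataclass

@dataclass
class ArchiveLine:
    userid: str
    seq: int
    datetime: str
    input_string: str
    result: str
    unit_id: str
    subject: str = ""
    unique_name: str = ""

archive = {}                                   # keyed by (userid, seq)

def recall(userid: str, key: str) -> str:
    """Resolve [my_181], [181], or a user-chosen unique name like [alpha32]."""
    key = key.strip("[]").removeprefix("my_")
    if key.isdigit():
        return archive[(userid, int(key))].result
    for line in archive.values():              # fall back to the unique name
        if line.userid == userid and line.unique_name == key:
            return line.result
    raise KeyError(key)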

It is often important to document one's work. This can be accomplished by entering the "#" symbol as the first symbol on an input line; anything else can follow this symbol. This is like the comment line in Python code and allows one to insert documentation text, for later use, describing one's calculations. When the system sees the "#" symbol in the first position, all processing is bypassed and the subsequent string is stored as the result. If "###" is encountered first on an input line, then it and all subsequent lines are treated as remarks until another "###" is encountered.

Another feature of the metanumber system is that a metanumber value can itself be written as a mathematical expression. For example, the speed of sound in air depends upon the operating reference temperature "tr" in kelvin, and can be expressed as (331.5+(0.600)*(tr-273.15*k))*m*s_1. The use of such expressions, when applicable, greatly compresses the required storage, since one formula can replace a vast array of values. Temperature values can be input in Celsius as tc(24), for a temperature of 24 degrees Celsius; when evaluated, this function returns the value in kelvin. Likewise tf(55) evaluates a temperature of 55 degrees Fahrenheit and returns that temperature in kelvin. The web site gives all such functions and discusses their use, with additional information at asg.sc.edu.
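For concreteness, the following sketch (our illustration) implements the temperature input functions and the quoted speed-of-sound formula, with plain floats standing in for full metanumbers and all temperatures carried internally in kelvin.

def tc(celsius: float) -> float:
    """Celsius input -> kelvin."""
    return celsius + 273.15

def tf(fahrenheit: float) -> float:
    """Fahrenheit input -> kelvin."""
    return (fahrenheit - 32.0) * 5.0 / 9.0 + 273.15

def speed_of_sound(tr: float) -> float:
    """(331.5 + 0.600*(tr - 273.15)) in m/s, with tr in kelvin."""
    return 331.5 + 0.600 * (tr - 273.15)

print(speed_of_sound(tc(24)))   # 345.9 m/s at 24 degrees Celsius
print(tf(55))                   # 285.93 kelvin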

III. Utility of the MetaNumber Standard for Automated Agents, AI, Big Data, and Cluster Analysis

A. Networks and Clustering

As an example of how automated intelligent agents can operate unassisted in the analysis of numerical information, we will explore how we are developing such agents to convert metanumber tables to networks [11] and then to find clusters within those networks. It is well known that cluster analysis is one of the foundations of intelligent reasoning: our very language is built upon names for entities that cluster in use, appearance, or function, and clustering is likewise the basis for the classification of biological organisms and the names of objects in everyday life. Yet there is no generally agreed upon mathematical or algorithmic definition of clustering, and over a hundred algorithms are currently in use for identifying clusters. The following gives a very brief overview of our research into the mathematical foundations of networks and an associated agnostic algorithm for cluster identification, based upon an underlying Markov-type Lie algebra model with a probability-based proximity metric. We will utilize our previous research as a means of converting two-dimensional metanumber standardized tables of entities (rows) and their properties (columns) into one network for the collection of entities and another network for the properties. The entity-versus-property tables might be medical records, with a person on a given examination date as the entity (row) and about 100 numerical values of their physical profile, blood, and urine analysis, along with metadata descriptor columns. In a nutritional table the entities (rows) might be the 16,000 common foods with 50 or so nutritional contents as properties. Or the table might hold the properties (density, thermal conductivity, etc.) of each of the elements as rows, potentially extended to all 68 million known inorganic and organic compounds. We first give a brief overview of our past network research, and then show how numerical data standardization supports new levels of artificial intelligence.

Networks represent one of the most powerful means of representing information such as relationships (topologies) among abstract objects called nodes (points), identified by sequential integers 1, 2, …, n. A network is defined here by a square (generally sparse, and normally very large) connection matrix Cij of non-negative real numbers, with the diagonal missing, where Cij expresses the strength of connection between node i and node j. The diagonal is missing because there is no meaning to the connection of a thing to itself, and the off-diagonal elements must be non-negative because there is no meaning to a connection weaker than no connection at all. The mathematical classification and analysis of the topologies represented by such objects is one of the most challenging and unsolved of all mathematical domains [12]. Cluster analysis [13] on these networks can uncover the nature of cohesive domains where nodes are more tightly connected, such as similar elements (nickel, cobalt, and iron) or foods with tightly clustered nutritional content.

B. Lie Algebras, Groups, and Markov Monoids

This section gives a very brief overview of our mathematical results that provide a new foundation for network analysis and cluster identification. These results are built upon previous work by the author on a new method of decomposing the continuous general linear (Lie) group [14] of (n × n) transformations. That decomposition [15] showed the general linear Lie group to be the direct sum of two Lie groups: (1) a Markov-type Lie group (with n^2 − n parameters) and (2) an Abelian scaling group (with n parameters). Each group is generated by exponentiation of the associated (Markov or scaling) Lie algebra. The Markov-type generating Lie algebra consists of linear combinations of the basis matrices that have a "1" in one of the (n^2 − n) off-diagonal positions and a "−1" in the corresponding diagonal position of that column (an example of a Lagrange matrix [16]). When any linear combination of such matrices is exponentiated, the resulting matrix M(a) = exp(aL) conserves the sum of the elements of a vector upon which it acts; we call this a Markov-type matrix (transformation). By comparison, the Lie algebra and group that preserve the sum of the squares of a vector's components generate the familiar rotation group in n dimensions. In this general form a Markov-type transformation can transform a vector with positive components into one with some negative components, which is not allowed for a true Markov matrix. However, if one restricts the linear combinations of this specific Lie algebra basis to non-negative values, then, as we proved, one obtains all discrete and continuous Markov transformations of that size. This links all of Markov matrix theory to the theory of continuous Lie groups and provides a foundation for both discrete and continuous Markov transformations. There are a number of other details not covered here. In particular, the number of terms retained in the expansion of the exponential of the Markov Lie algebra matrix gives the number of degrees of separation; moreover, one obtains a Markov matrix regardless of the number of terms retained in the expansion.
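The following numerical sketch (our illustration) constructs the Markov-type Lie algebra basis just described and verifies that exponentiating a non-negative linear combination conserves the sum of a vector's components.

import numpy as np
from scipy.linalg import expm

def basis_matrix(n: int, i: int, j: int) -> np.ndarray:
    """Basis element with +1 at (i, j) and -1 at (j, j), so columns sum to 0."""
    L = np.zeros((n, n))
    L[i, j] = 1.0        # off-diagonal "1"
    L[j, j] = -1.0       # compensating "-1" in that column's diagonal
    return L

n = 4
rng = np.random.default_rng(0)
# A non-negative linear combination of basis matrices -> a Markov generator
L = sum(rng.random() * basis_matrix(n, i, j)
        for i in range(n) for j in range(n) if i != j)
M = expm(0.7 * L)                      # Markov-type transformation M(a) = exp(aL)

x = rng.random(n)
print(np.allclose(M.sum(axis=0), 1))   # columns sum to one: True
print(x.sum(), (M @ x).sum())          # component sum is conserved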

C. Networks as Markov Lie Algebra Elements, Entropy Spectra, and Cluster Identification

Our next discovery [17] was that every possible network (Cij) corresponds to exactly one element of the Markov generating Lie algebra (the elements with non-negative linear combinations) and, conversely, every such Markov Lie algebra generator corresponds to exactly one network. This is achieved by inserting in each diagonal position of the connection matrix Cij the negative of the sum of the values in that column (or in that row). Thus the two sets are isomorphic, and one can study all networks by studying the associated Markov transformations, the generating Lie algebra, and the associated groups [18,19]. Our subsequent discoveries [20,21] are that all nodes in any network can be ordered by sorting the second-order Renyi entropies of the associated columns of that Markov matrix, thus representing the network by a Renyi entropy spectral curve and removing the combinatorial problems that frustrate many areas of network analytics; a second Renyi entropy spectral curve can be generated using the rows rather than the columns for the diagonal determination. One can then both compare two networks (by comparing their entropy spectral curves) and study the change of a network's topology over time. One can even define the "distance" between two networks as the exponentiated negative of the sum of squares of the differences of their Renyi entropies; if the networks differ in the number of nodes, one can use spline smoothing. We recently showed that the entire topology of any network can be exactly represented by the sequence of differences of Renyi entropies of successive orders, which converge very rapidly; this is similar to the Fourier expansion of a function, such as a sound wave, where each successive order carries less important information. Our next, and equally important, result was that an agnostic (assumption-free) identification of the n network clusters is given by the eigenvectors of this Markov matrix. This not only shows the degree of clustering of the nodes but actually ranks the clusters by the corresponding eigenvalues, thus giving a "name" (the eigenvalue) to each cluster; a more descriptive name can be created from the nodal names (indices) of the highest-valued components of the associated eigenvector. The reasoning underlying the cluster identification is that the Markov matrix generated from the altered connection matrix (the Lie algebra element) has eigenvalues that successively represent the rate of approach to equilibrium of a conserved substance that flows among the nodes in proportion to their degree of connectivity. Thus network clusters produce a higher rate of approach to internal equilibrium, as measured by the associated eigenvalue, for flows among the nodes identified by the eigenvector. This is similar to the eigenvalue analysis of n coupled oscillators, where the eigenvectors represent the normal modes of vibration.
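The sketch below (our illustration) implements the diagonal adjustment, the second-order Renyi entropy spectral curve, and the resulting distance between two networks of equal size (spline smoothing, mentioned above for unequal sizes, is omitted).

import numpy as np
from scipy.linalg import expm

def markov_from_network(C: np.ndarray, a: float = 1.0) -> np.ndarray:
    """Insert -(column sum) on the diagonal and exponentiate: M(a) = exp(aL)."""
    L = C.copy()
    np.fill_diagonal(L, 0.0)
    np.fill_diagonal(L, -L.sum(axis=0))
    return expm(a * L)

def renyi2_spectrum(M: np.ndarray) -> np.ndarray:
    """Second-order Renyi entropy of each column, H2 = -log(sum_i p_i^2),
    sorted to give the entropy spectral curve."""
    return np.sort(-np.log((M ** 2).sum(axis=0)))

def network_distance(C1: np.ndarray, C2: np.ndarray) -> float:
    """exp(-sum of squared entropy differences): 1 means identical spectra."""
    h1 = renyi2_spectrum(markov_from_network(C1))
    h2 = renyi2_spectrum(markov_from_network(C2))
    return float(np.exp(-np.sum((h1 - h2) ** 2)))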

D. Metanumber Tables Define Networks and Clusters

Now let us recall the two-dimensional metanumber standardized storage of entities and the properties of those entities. For simplicity, consider a table of metanumber values with a single unique index label for each row and another for each column. One example we have explored is the table of the chemical elements (rows) versus the 50 or so properties of those elements; another is the nutrition table with 16,000 foods versus 56 numerical properties of their nutritional contents. Another such table would be personal medical records of persons versus their approximately one hundred numerical metrics of blood, urine, and other parameters, with multiple records for each person identified by datetime. Another would be the table of properties of pharmaceutical substances, including all inorganic and organic compounds, or even a corporation's manufactured items. Our most recent development is that one can generate two different networks from such a table of values Tij for entities (such as the chemical elements in rows) with properties (such as density and boiling point) in columns, where each column has homogeneous units. To do this we first normalize the metanumber table by finding the mean and standard deviation of each column and then rewriting each value as the number of standard deviations away from the mean, which is set to zero. Since all values in a given property column carry the same units, this process also removes the units, and the new values are dimensionless; for example, the density or thermal conductivity column each has values of a single dimensionality for that column. We then define a network Cij among the n entities (here the elements listed in rows) as the exponentiation of the negative of the sum of the squares of the differences between Tik and Tjk: Cij = exp(−Σk (Tik − Tjk)^2). This gives a maximum connectivity of '1' if the values are all the same and a connection near '0' if they are far apart, as would be expected in the definition of a network; it mirrors the expectation that a measured value differs from the true value to that degree. A similar network can be formed among the properties of the table. One then, as before, adjusts the diagonal to be the negative of the sum of all terms in that column to give a new C, forms the Markov matrix M(a) = exp(aC), and finds the eigenvectors and eigenvalues of M to reveal the associated clustering. The rationale for how this works can be understood when the Markov matrix is viewed as representing the dynamical flow of a conserved hypothetical substance among the nodes: its action on a vector of non-negative values leaves the sum invariant. This methodology includes and generalizes the known methodology of the Lagrange matrix.
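The complete pipeline just described can be sketched as follows (our illustration, with a hypothetical toy table): column z-scores, the proximity network Cij = exp(−Σk (Tik − Tjk)^2), the diagonal adjustment, exponentiation, and eigen-decomposition.

import numpy as np
from scipy.linalg import expm

def entity_clusters(T: np.ndarray, a: float = 1.0, top: int = 3):
    Z = (T - T.mean(axis=0)) / T.std(axis=0)           # dimensionless z-scores
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    C = np.exp(-D2)                                     # proximity network
    np.fill_diagonal(C, 0.0)
    np.fill_diagonal(C, -C.sum(axis=0))                 # Lie algebra element
    M = expm(a * C)                                     # Markov matrix M(a)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)                      # rank clusters by eigenvalue
    # each eigenvector's largest components name the member nodes of a cluster
    return [(vals[k].real, np.argsort(-np.abs(vecs[:, k].real))[:top])
            for k in order]

# Toy table: 6 entities x 3 properties with two obvious groups of rows
T = np.array([[1.0, 2.0, 0.5], [1.1, 2.1, 0.4], [0.9, 1.9, 0.6],
              [5.0, 0.2, 3.0], [5.1, 0.1, 3.1], [4.9, 0.3, 2.9]])
for eigenvalue, members in entity_clusters(T):
    print(round(eigenvalue, 3), members)   # rows 0-2 and 3-5 emerge as clusters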

We are currently exploring the clusters and network analysis that can be generated from tables of standardized metanumber values. We have done this for the elements table, which was very revealing, as it displayed many of the known similarities among elements: the cluster of iron, cobalt, and nickel was very clear, as were other standard clusters of elements such as the halogens and the actinides. We are currently analyzing a table on the properties of pesticides, and we are also performing this analysis on a table of 56 nutrients for 16,000 natural and processed foods, which leads to a network of 16,000 nodes. The study of clustering in foods is based upon their nutrient and chemical properties, as well as the clustering among the properties themselves. Our initial results show three of the dominant food-nutrition clusters to be cooking oils, nuts, and baby foods.

Next, imagine that every known numerical data table in every discipline is standardized as described and that this procedure is executed, giving the dominant clusters for both the entities and their properties. Each cluster is uniquely named, and ordered by degree of clustering, by its associated eigenvalue, and the eigenvector associated with that eigenvalue consists of nodes with unique metadata index names (such as the elements) along with the degree of contribution of each node. Since a metadata entity in one cluster (e.g. iron) has an associated degree of importance in that cluster, one can link that node to a node with the same name (e.g. iron) in another cluster, with a "weight" that is the product of the importance of the node to each cluster. In this manner the metadata name "iron" in all data tables becomes linked in such proportion, creating a network that links all networks via node names with associated product weights. The resulting network would contain all numerical clusters that have nodes labeled "iron", in blood analysis, vitamins, foods, industrial production, alloys and compounds, impurities, etc., creating a "supernet" that spans all numerical information. Because of the numerical standardization of the values, units, and linked metadata, this process can be fully automated, with the supernet growing as data tables are processed and linked. This process not only links the entity and property named nodes but also links, by inference, the metadata associated with each node. We have completed the software design and construction, and we are exploring tools that will allow one to study such an emerging network of linked numerical and metadata information.
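As an illustration of the linkage rule (with hypothetical cluster data, since the supernet tools are still under development), clusters from different tables can be joined through shared node names, weighted by the product of the node's importance in each.

from itertools import combinations

# Each cluster: (source table, eigenvalue-derived name, {node name: importance})
clusters = [
    ("elements",  "cluster_0.93", {"iron": 0.71, "cobalt": 0.52, "nickel": 0.47}),
    ("bloodwork", "cluster_0.88", {"iron": 0.64, "hemoglobin": 0.70}),
    ("foods",     "cluster_0.81", {"iron": 0.33, "spinach": 0.61}),
]

supernet = {}   # edge (cluster a, cluster b) -> summed product weight
for (ta, na, wa), (tb, nb, wb) in combinations(clusters, 2):
    for node in wa.keys() & wb.keys():           # shared metadata names
        edge = (f"{ta}:{na}", f"{tb}:{nb}")
        supernet[edge] = supernet.get(edge, 0.0) + wa[node] * wb[node]

for edge, weight in supernet.items():
    print(edge, round(weight, 3))   # "iron" links all three clusters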

The numerical standardization in terms of metanumber tables leads to other, more holistic networks which can be built from the metanumber structures, with resultant cluster analysis. Users (identified by PIN#) can each be linked to (a) the unit id (UID) hash values of the results of their calculations, as well as to (b) the table, row, and column names of each value. These linkages can be supplemented with linkages of each user to the universal constants that occur in the expressions they evaluate ("hb" or "stephanboltzman"). The resulting network links users to (a) concepts such as thermal conductivity, (b) substances such as silver alloys, and (c) core constants such as the Boltzmann constant, Planck's constant, or the neutron mass. The expansion of this network in different powers, giving the different degrees of separation, can then link users via their computational profiles (the user i × user j component of C), as well as via linkages among substances, metadata tags, and constants. The clustering revealed at different levels of such expansions can then reveal groups of users connected by common computational concepts: users working on particular domains of pharmaceuticals and methodologies are identified as clusters, as are groups of astrophysicists utilizing certain data and models. At the same time, the clustering can identify links among specific substances, models, and methodologies. Our current research is exploring such networks and clusters as the underlying metanumber usage expands, and will be submitted as a more detailed paper on this methodology. By linking the metadata names of the dominant component nodes that comprise each cluster, we can construct a new type of network that links all clusters of numerical information. We are currently studying how to optimally create such a numerical "supernet" and how to build user tools for its exploration, in which the numerical universe is seen as a single network.

IV. Conclusions

A standard for the integration of numerical values with their units, accuracy level, and exact unique path name has been proposed, and has been programmed in Python on a server open for use, compliant with the requirements listed above. The integrated object consisting of value, units, accuracy, and meaning is termed a "metanumber". The MN software can be called to evaluate mathematical expressions that consist of MN objects, either as transactions in a terminal console or through an API call that enhances a user's own software. The standard is sufficiently defined to be totally unambiguous, yet supports sufficient subject-domain flexibility to accommodate other existing standards. Each metanumber has its value adjoined to (a) its units, treated as mathematical variables, (b) the representation of accuracy by the presence of an "E" for exact values, as described above, or otherwise by "e", and (c) a unique metadata path name for every stored numerical value. This unique path name not only defines that number unambiguously, it also provides the link to unlimited descriptive metadata tags. The permanent archiving of every input and result for each user, with sequence number, date-time, unit hash value, and optional subject, has the critical feature that the exact mathematical structure, with all uniquely named metanumber variables, provides in this log the mathematical computational history and path of every newly computed number. Because the input expression is stored with the result in this archival log, all metadata associated with each metanumber is traceable without encumbering the system. Our current work indicates that our new algorithm can effectively preprocess data from web sites and reformat it in the metanumber standard (with some operator oversight).

Our parallel work can take an entity-versus-property table and automatically generate two networks, one among the entities as nodes and one among the properties as nodes. This then allows the determination of clusters, each named by the associated eigenvalue and sorted by the magnitude of that eigenvalue as a metric of the "tightness" of the cluster; the nodal names with the highest weights in the eigenvector provide descriptive metadata giving the identity and composition of the cluster. The solid mathematical foundation of this automated analysis of data structures uses the theory of Lie algebras and groups, continuous and discrete Markov transformations, network topology, and agnostic cluster analysis, and can provide advanced analytics for still another stage of analysis. That next stage is now set for an analysis of the topological networks among users, units, tables, primary constants, unlimited linked metadata, and models. The collaboration model, under a subject name, supports either an individual or groups of users working with secure data and model sharing with documentation. Finally, because the system supports full archival tracing with numerical accuracy, one can study the path dependence of information loss (similar to the path dependence of non-conservative forces in physics). The ability to automate cluster identification with associated metadata linkages is a key component of an intelligent system. Most important, this system supports the instant sharing of standardized data among computers for research groups, corporations, and governmental (and military) data analysis, with fully automatic dimensional analysis, error analysis, and unlimited metadata tracking. We envision the cluster analysis and supernet construction running continuously and automatically on all metanumber archived data. The resulting supernet would encompass all numerical information, allowing one to explore the AI-discovered clustering in our "numerical universe" much as we explore our material universe, whose clusters are galaxies, groups of galaxies, and fibrous structures in spacetime.

References

[1] NIST Fundamental List of Physical Constants.

[2] Bureau of Economic Analysis, NIPA Data Section 7, CSV format.

[3] Statistical Abstract of the United States, Table 945, electricity.html.

[4] NIST Data Standards.

[5] USGS Data Standards, datastandards.php.

[6] Johnson, Joseph E., 2006, Apparatus and Method for Handling Logical and Numerical Uncertainty Utilizing Novel Underlying Precepts, US Patent 6,996,552 B2.

[7] Johnson, Joseph E., and Ponci, F., 2008, Bittor Approach to the Representation and Propagation of Uncertainty in Measurement, AMUEM 2008 International Workshop on Advanced Methods for Uncertainty Estimation in Measurement, Sardagna, Trento, Italy.

[8] Lebigot, Eric O., 2014, A Python Package for Calculations with Uncertainties.

[9] Johnson, Joseph E., 1985, US Registered Copyrights TXu 149-239, TXu 160-303, and TXu 180-520.

[10] Newell, David B., A More Fundamental International System of Units, Physics Today 67(7), 35 (July 2014), doi:10.1063/PT.3.2448.

[11] Estrada, Ernesto, The Structure of Complex Networks: Theory and Applications (2011), ISBN 978-0-19-959175-.

[12] Newman, M. E. J., The Structure and Function of Complex Networks, Department of Physics, University of Michigan; and Newman, M. E. J., Networks: An Introduction, Oxford University Press, 2010.

[13] Bailey, Ken, 1994, "Numerical Taxonomy and Cluster Analysis", in Typologies and Taxonomies, p. 34, ISBN 9780803952591.

[14] Theory of Lie Groups, lecture notes, ~kirillov/mat552/liegroups.pdf.

[15] Johnson, Joseph E., 1985, Markov-Type Lie Groups in GL(n,R), Journal of Mathematical Physics 26(2), 252-257.

[16] Lagrange matrix, ch1.pdf.

[17] Johnson, Joseph E., 2005, Networks, Markov Lie Monoids, and Generalized Entropy, in Computer Network Security: Third International Workshop on Mathematical Methods, Models, and Architectures for Computer Network Security, St. Petersburg, Russia, Springer Proceedings, 129-135, ISBN 3-540-29113-X.

[18] Johnson, Joseph E., 2012, Methods and Systems for Determining Entropy Metrics for Networks, US Patent 8,271,412.

[19] Johnson, Joseph E., 2009, Dimensional Analysis, Primary Constants, Numerical Uncertainty and MetaData Software, APS/AAPT Meeting, Columbia SC.

[20] Johnson, Joseph E., and Campbell, William, 2014, A Mathematical Foundation for Networks with Cluster Identification, KDIR Conference, Rome, Italy.

[21] Johnson, Joseph E., 2014, A Numeric Data-Metadata Standard Joining Units, Numerical Uncertainty, and Full Metadata to Numerical Values, EOS KDIR Conference, Rome, Italy.
