Measure Code Reproduction with Metagenomics

Successful genetic code (viruses, bacteria, plants, animals) reproduces. Genetic code contains patterns. The patterns interact with their environment. The order in the patterns is converted into work, which re-creates the order. Successful patterns persist across generations; those generations, and the evolution of the genetic code they carry, can be measured and tracked over time.

Techniques developed in metagenomics allow us to measure arbitrary genetic code in the environment around us. The "meta" in metagenomics is the information about where, when, and under what environmental conditions genes are found. We can take a sample of DNA from water, soil, or air, identify the different organisms in the sample, and identify functional aspects of those organisms. The "easy" version of metagenomic analysis starts with a library of code "fingerprints" associated with species (or other groupings). The "hard" version does not start with such a library, but identifies species or other groupings through statistical analysis of contiguous code groups. Both versions also use known functions performed by different sequences to identify functional aspects of the organisms.
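To make the "easy" version concrete, here is a minimal Python sketch of reference-based classification: it scores an environmental read against a small library of per-species k-mer "fingerprints." The sequences, species names, and k-mer length are illustrative placeholders, not real reference data.

```python
def kmers(seq, k=8):
    """Break a sequence into overlapping k-mers (the 'fingerprint' units)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def classify(read, fingerprint_library, k=8):
    """Score one environmental read against each species' k-mer fingerprint
    and return the best match -- the 'easy', library-based approach."""
    read_kmers = set(kmers(read, k))
    scores = {
        species: len(read_kmers & fingerprints)
        for species, fingerprints in fingerprint_library.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Illustrative library: species -> set of k-mers previously observed in its genome.
library = {
    "E. coli":     set(kmers("ATGGCTAGCTAGGCTTACGATCGATCGGCTA")),
    "B. subtilis": set(kmers("TTGACGGCTAAGCTAGCTAGGACTACGATCG")),
}

sample_read = "GCTAGCTAGGCTTACGATCG"   # a fragment pulled from water, soil, or air
print(classify(sample_read, library))  # -> "E. coli" (shares the most k-mers)
```

The "hard" version would instead cluster reads by statistical similarity before any labels are assigned; the same scoring idea applies once clusters exist.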

In the context of computer science, the data sets, the assembly of code groups, and the identification of functions are much cleaner. Computer code is often instrumented with trace and debug routines that report when and how it is executed and whether errors or other conditions occur. Even when code is encrypted at rest on disk, it is unencrypted while being processed, and modern chips include multiple processors, some of which can see the unencrypted (processor) side and provide reporting, diagnostics, pipelining, and speculative-execution functions. Operating systems typically include a resource monitor that reports power use by process, where "process" may be as coarse-grained as an entire executable file or more granular, down to the operating-system/kernel function-library level.
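As an example of the per-process reporting already available, here is a small sketch that samples CPU and memory use per process with the third-party psutil library. The sampling interval, and the framing of each sample as a "where and when was this code executed" record, are assumptions for illustration.

```python
import time
import psutil  # third-party; pip install psutil

def sample_processes(interval=1.0):
    """Report CPU and memory use per process -- a crude stand-in for the
    'where and when was this code executed' metadata we want to collect."""
    # Prime the per-process CPU counters, wait, then read the deltas.
    for p in psutil.process_iter():
        try:
            p.cpu_percent(None)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    time.sleep(interval)

    report = []
    for p in psutil.process_iter(attrs=["pid", "name"]):
        try:
            report.append({
                "pid": p.info["pid"],
                "name": p.info["name"],
                "cpu_percent": p.cpu_percent(None),
                "rss_bytes": p.memory_info().rss,
                "timestamp": time.time(),
            })
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return report

# Print the five busiest processes in the sampling window.
for row in sorted(sample_processes(), key=lambda r: r["cpu_percent"], reverse=True)[:5]:
    print(row)
```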

As with "meta"genomics, the place the code groups are measured must also be recorded. The structure of a network address translator ("NAT") is different from the structure of a CPU or GPU (NATs use FPGAs more), as are the code groups found in each environment.

Application of metagenomic techniques to computer media would be "easier" than their application to genetic media, though it would still be non-trivial.

Fortunately, there is already a field of study devoted to code reuse. Its metrics are cohesion and coupling. Cohesion is the degree to which the elements inside a module belong together; it is a measure of the strength of the relationship between a class's methods and data. Among other ways, it can be measured as sequential cohesion, with grouping determined by the output of one part being the input to another. When software is reused, modules with high cohesion tend to be preferable, because high cohesion is associated with several desirable traits of software, including robustness, reliability, reusability, and understandability. In contrast, low cohesion is associated with undesirable traits such as being difficult to maintain, test, reuse, or even understand. Coupling is the degree of interdependence between software modules. Types of coupling include logical coupling, which exploits the release history of a software system to find change patterns among modules or classes: for example, entities that are likely to be changed together, or sequences of changes (a change in class A is always followed by a change in class B).
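As a concrete illustration of logical coupling, the following sketch counts how often pairs of files change in the same commit of a Git repository, using `git log --name-only`. The sentinel string, and ranking raw co-change counts rather than a normalized coupling score, are simplifying assumptions.

```python
import subprocess
from collections import Counter
from itertools import combinations

def logical_coupling(repo_path=".", top=10):
    """Count how often each pair of files changes in the same commit --
    the co-change signal that logical coupling is built from."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:@@commit@@"],
        capture_output=True, text=True, check=True,
    ).stdout

    pair_counts = Counter()
    files_in_commit = []
    for line in log.splitlines():
        if line == "@@commit@@":              # a new commit starts here
            for pair in combinations(sorted(set(files_in_commit)), 2):
                pair_counts[pair] += 1
            files_in_commit = []
        elif line.strip():                    # a file touched by this commit
            files_in_commit.append(line.strip())
    for pair in combinations(sorted(set(files_in_commit)), 2):
        pair_counts[pair] += 1                # flush the last commit

    return pair_counts.most_common(top)

# Pairs that change together most often are candidates for high logical coupling.
for (a, b), n in logical_coupling("."):
    print(f"{n:4d}  {a}  <->  {b}")
```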

The "problem" with only studying software reuse, is that software reuse is a limited subset of all software which is executed. The techniques of software reuse and of metagenomics must be applied to a sample of ALL code executed across all processors, keeping a record of the "meta" information describing where and in what processor the code occurred. 

My hypothesis is that if we apply metagenomic/software reuse analytic techniques to computer media over time, we will observe software processes that coalesce, much as amino-acid networks on early pre-cellular Earth coalesced over time into cellular life.

Computer code also reproduces. Its dominant reproductive strategy is analogous to that of a virus in a commensal or mutualistic relationship with its host: humans. Over time, the role of the host may recede. The code contains patterns. The patterns persist across generations. The generations and their evolution can be tracked. Are any of the code patterns associated with the creation of more hardware to host them?

The required software instrumentation is technically straightforward. Many hardware and software developers already measure the reproduction of both hardware and software, though typically for performance measurement and debugging. We need to measure code reproduction across the entire ecosystem of hardware and software.

We must start an open-source project to instrument software and hardware units so that units report to a public forum when executed. At the option of developers, execution reports can be anonymous, communicating only a unique identifier (for example, one provided via a blockchain). Ideally, the identifier would allow identification of descendants (updates, upgrades) and might also include a functional category. "Units" can be as coarse- or as fine-grained as developers want: an entire program (which would report less frequently), but also subroutines, libraries, etc. Hardware might report only at boot time, though it may be instrumented to report more frequently, and hardware reports may be combined with software reports. Over time, we will improve identification of units.
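What such an execution report might contain is sketched below. The field names, the placeholder endpoint, and the hash-chained identifier scheme are assumptions for illustration, not a defined protocol.

```python
import hashlib
import json
import time
import urllib.request

PUBLIC_FORUM_URL = "https://example.org/execution-reports"  # placeholder, not a real service

def unit_id(code_bytes, parent_id=None):
    """Stable, anonymous identifier for a unit, chained to its parent so that
    descendants (updates, upgrades) can be recognized as related."""
    h = hashlib.sha256()
    if parent_id:
        h.update(parent_id.encode())
    h.update(code_bytes)
    return h.hexdigest()

def report_execution(uid, functional_category=None, send=False):
    """Build the 'I ran' report: only an identifier, an optional functional
    category, and a timestamp -- nothing about what the unit computed."""
    payload = json.dumps({
        "unit_id": uid,
        "category": functional_category,   # e.g. "compression", "rendering"
        "timestamp": time.time(),
    }).encode()
    if send:  # off by default because the endpoint above is a placeholder
        req = urllib.request.Request(PUBLIC_FORUM_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
    return payload

# A coarse-grained unit (a whole program) might report once at start-up:
uid = unit_id(open(__file__, "rb").read())
print(report_execution(uid, functional_category="demo").decode())
```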

There is a more detailed approach, in which we watch the reproduction of code at a more granular level to more accurately measure the surface-area/volume boundary between different reproductive entities. But the outline above provides a rough view, which is good enough to start with.

We can use a similar approach to objectively measure the reproduction of corporations and other social organisms. A graph of all communication inside a corporation is relatively easy to produce: all email, social media, phone calls, and texts, and all bits produced by software, for all employees and software inside the corporate firewall, measured in bit volume per unit time. Do not record what was said, written, or generated by software; record only who communicated with whom (which leaf nodes), how much, and over what time period. Form a similar network of all communication which occurs between this internal network and all external parties: customers, suppliers, and governmental actors. This would reveal communication networks, whether and where the corporation deviates from a "most stable" configuration, and the volume and surface area of communication by and with the corporation. This analysis could identify when corporations deviate from Kleiber's Law.
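A minimal sketch of that measurement, using the third-party networkx library: build the communication graph from (sender, receiver, bits) records only, then compare internal "volume" with boundary "surface area." The node names and bit volumes are made up.

```python
import networkx as nx  # third-party; pip install networkx

# Each edge carries only metadata: who talked to whom and how many bits, per period.
G = nx.Graph()
internal = {"alice", "bob", "carol", "build-server"}       # inside the firewall
external = {"customer-1", "supplier-1", "regulator-1"}     # outside parties

edges = [
    ("alice", "bob", 5_000_000),            # illustrative bit volumes per week
    ("bob", "build-server", 40_000_000),
    ("carol", "alice", 2_000_000),
    ("alice", "customer-1", 1_500_000),
    ("carol", "supplier-1", 800_000),
    ("bob", "regulator-1", 50_000),
]
for a, b, bits in edges:
    G.add_edge(a, b, bits=bits)

# "Volume": bits exchanged entirely inside the organization.
volume = sum(d["bits"] for a, b, d in G.edges(data=True)
             if a in internal and b in internal)
# "Surface area": bits crossing the boundary to customers, suppliers, government.
surface = sum(d["bits"] for a, b, d in G.edges(data=True)
              if (a in internal) != (b in internal))

print(f"internal volume: {volume} bits, boundary surface: {surface} bits, "
      f"surface/volume ratio: {surface / volume:.3f}")
```

Tracking that ratio over time, and across corporations of different sizes, is what would let us test for deviations from a Kleiber-like scaling relationship.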