Measure Code Reproduction with Metagenomics

Successful genetic code reproduces, whether it belongs to viruses, bacteria, plants, or animals. Genetic code contains patterns. Successful patterns persist across generations, and both the generations and the evolution of the code can be measured and tracked over time.

Techniques developed in metagenomics allow us to measure arbitrary genetic code in the environment around us. We can take a sample of DNA from water, soil, or air, identify the different organisms in the sample, and identify functional aspects of those organisms. The "easy" version of metagenomic analysis starts with a library of code "fingerprints" associated with species (or other groupings). A "hard" version does not start with such a library, but identifies species or other groupings through statistical analysis of contiguous code groups. Both versions also use the known functions of particular sequences to identify functional aspects of the organisms.
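As a rough illustration of the "easy" version, the sketch below builds a fingerprint library from reference sequences and assigns each read in a sample to the reference whose fingerprints it shares. It assumes that a "fingerprint" is a short substring (a k-mer) found in only one reference, which is one common simplification and not the only way such libraries are built. The sequences, labels, and function names are made up for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_library(references: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer that occurs in exactly one reference to that reference's label."""
    owners: Dict[str, Set[str]] = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            owners.setdefault(kmer, set()).add(label)
    # Keep only k-mers that are unique to a single label: these are the fingerprints.
    return {kmer: next(iter(labels))
            for kmer, labels in owners.items() if len(labels) == 1}


def classify(read: str, library: Dict[str, str], k: int) -> Optional[str]:
    """Assign a read to the label with the most matching fingerprints, or None."""
    votes = Counter(library[km] for km in kmers(read, k) if km in library)
    return votes.most_common(1)[0][0] if votes else None


# Toy reference sequences and reads (invented, not real genomes).
REFERENCES = {
    "species_a": "ACGTACGTGGCCTTAA",
    "species_b": "TTGACCAGTCAGGATC",
}
K = 4
LIBRARY = build_library(REFERENCES, K)
READS = ["ACGTGGCC", "CCAGTCAG", "NNNNNNNN"]

# Count how many reads in the sample were assigned to each label.
PROFILE = Counter(classify(read, LIBRARY, K) for read in READS)
```

The "hard" version has no `REFERENCES` to start from and would instead have to cluster reads by their statistical similarity, which this sketch does not attempt.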

In the context of computer science, the data sets, the assembly of code groups, and the identification of functions are much cleaner. Computer code is often instrumented with trace and debug routines that report when and how it executes and whether errors or other conditions occur. Even when code is encrypted on disk, it is unencrypted when it runs in the processor. Modern chips include multiple processing units, some of which can see the unencrypted (processor) side and provide reporting and diagnostics alongside pipelining and speculative execution. Operating systems typically include a resource monitor that reports resource use (and, on some systems, power use) by process, where "process" may be as coarse-grained as an entire executable file or more granular, down to the level of operating system or kernel function libraries.

Applying metagenomic techniques to computer media would be "easier" than applying them to genetic media, though it would still be non-trivial.

My hypothesis is that if we apply metagenomics techniques to computer media over time, we will observe software processes that coalesce, much as amino-acid networks on early pre-cellular Earth coalesced over time into cellular life.

Computer code also reproduces. The dominant reproductive strategy is analogous to that of a virus with a commensal or mutualistic relationship with a host, humans. Over time, the role of the host may recede. The code contains patterns. The patterns persist across generations. The generations and their evolution can be tracked.

The required software instrumentation is technically straightforward. Many hardware and software developers already measure the reproduction of both hardware and software, though typically to measure performance and to debug. We need to measure code reproduction across the entire ecosystem of hardware and software.

We must start an open-source project to instrument software and hardware units so that units report to a public forum when executed. At the option of developers, execution reports can be anonymous, communicating only a unique identifier, provided via a blockchain identifier. Ideally, the identifier would allow identification of descendants (updates, upgrades) and may also include a functional category. "Units" can be as coarse-grained or as fine-grained as developers want. A unit can be an entire program (which would report less frequently), but can also be a subroutine, a library, etc. Hardware may report only at boot time, though it may be instrumented to report more frequently. Hardware reports may be combined with software reports. Over time, we will improve identification of units.

  1. Let developers see their own data. The anonymous identifier is decrypted by the developer who installs the instrumentation. Everyone else sees aggregates of anonymized data.
  2. Watch which units reproduce, and how much, over time. When possible, watch descendants.
  3. Monitor (via instrumentation and/or in the network) communication between units.
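
The reporting scheme outlined above could be sketched roughly as follows. This is a minimal illustration under several assumptions, not a design for the project: the public forum is replaced by an in-memory list, and the identifier is a keyed hash (HMAC) of the unit name and version instead of a blockchain identifier. A keyed hash cannot be decrypted. The developer recognizes their own units by recomputing the identifier with their key, while everyone else sees only opaque strings. All class, function, and key names here are hypothetical.

```python
import hashlib
import hmac
import time
from collections import Counter
from dataclasses import dataclass, field
from functools import wraps
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ExecutionReport:
    unit_id: str              # opaque identifier of the executed unit
    parent_id: Optional[str]  # identifier of the version it descends from, if any
    category: Optional[str]   # optional functional category
    timestamp: float          # when the unit executed


@dataclass
class Forum:
    """In-memory stand-in for the public forum that collects reports."""
    reports: List[ExecutionReport] = field(default_factory=list)

    def submit(self, report: ExecutionReport) -> None:
        self.reports.append(report)

    def executions_per_unit(self) -> Counter:
        """Aggregate view: how many times each opaque unit has executed."""
        return Counter(r.unit_id for r in self.reports)


def unit_id(developer_key: bytes, name: str, version: str) -> str:
    """Derive an opaque identifier that only the key holder can recompute."""
    message = f"{name}@{version}".encode()
    return hmac.new(developer_key, message, hashlib.sha256).hexdigest()[:16]


def instrumented(forum: Forum, developer_key: bytes, name: str, version: str,
                 parent_version: Optional[str] = None,
                 category: Optional[str] = None) -> Callable:
    """Decorator that submits one report each time the wrapped unit runs."""
    uid = unit_id(developer_key, name, version)
    pid = unit_id(developer_key, name, parent_version) if parent_version else None

    def decorate(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            forum.submit(ExecutionReport(uid, pid, category, time.time()))
            return func(*args, **kwargs)
        return wrapper
    return decorate


FORUM = Forum()
KEY = b"hypothetical-developer-secret"


@instrumented(FORUM, KEY, "resize_image", "1.0", category="imaging")
def resize_v1(width: int) -> int:
    return width // 2


@instrumented(FORUM, KEY, "resize_image", "1.1", parent_version="1.0",
              category="imaging")
def resize_v2(width: int) -> int:
    return width // 2


for _ in range(3):
    resize_v1(100)
resize_v2(100)

COUNTS = FORUM.executions_per_unit()
```

Here the unit is a single function, but the same decorator could wrap a program's entry point for a coarser unit. The `parent_id` field is what would let an observer follow descendants across updates without knowing what the unit is.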

There is a more detailed approach, where we watch the reproduction of code at a more granular level, to more accurately measure the surface area/volume boundary between different reproductive entities. But the outline, above, would provide a rough view, which is good enough to start with.

We can use a similar approach to objectively measure the reproduction of corporations and other social organisms. A graph of all communication inside a corporation is relatively easy to produce. It would cover all email, social media, phone calls, texts, and all bits produced by software, for all employees and software inside the corporate firewall, measured in bit volume per unit of time. Do not record what was said, written, or generated by software; record only between whom (which leaf nodes), how much, and over what time period. Form a similar network of all communication which occurs between this internal network and all external parties. External parties are customers, suppliers, and governmental actors. This would reveal communication networks, whether and where the corporation deviates from a "most stable" configuration, and the volume and surface area of communication by and with the corporation. This analysis could identify when corporations deviate from Kleiber's Law.
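
A minimal sketch of this measurement, assuming the records for one time window have already been collected as (sender, receiver, bits) triples with no message content. "Volume" is taken here to mean bits exchanged between two internal nodes and "surface" to mean bits exchanged across the firewall, which is one possible reading of the terms. Kleiber's Law states that metabolic rate scales as roughly the 3/4 power of body mass. The text does not say which corporate quantities stand in for mass and metabolic rate, so `scaling_exponent` only fits the exponent for whatever pair of measurements is supplied. All names and figures below are invented for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple

Record = Tuple[str, str, int]  # (sender, receiver, bits); no message content
Edge = Tuple[str, str]


def build_graph(records: Iterable[Record]) -> Dict[Edge, int]:
    """Sum bit volume per undirected pair of endpoints."""
    edges: Dict[Edge, int] = defaultdict(int)
    for sender, receiver, bits in records:
        a, b = sorted((sender, receiver))
        edges[(a, b)] += bits
    return dict(edges)


def volume_and_surface(edges: Dict[Edge, int], internal: Set[str]) -> Tuple[int, int]:
    """Return (bits between internal nodes, bits crossing the boundary)."""
    volume = surface = 0
    for (a, b), bits in edges.items():
        inside = (a in internal) + (b in internal)
        if inside == 2:
            volume += bits
        elif inside == 1:
            surface += bits
        # Edges between two external parties are outside the measurement.
    return volume, surface


def scaling_exponent(sizes: Sequence[float], rates: Sequence[float]) -> float:
    """Least-squares slope of log(rate) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(r) for r in rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Invented records for a single time window.
RECORDS = [
    ("alice", "bob", 800),
    ("bob", "alice", 200),
    ("bob", "carol", 500),
    ("carol", "customer_1", 300),
    ("supplier_1", "alice", 100),
]
INTERNAL = {"alice", "bob", "carol"}

EDGES = build_graph(RECORDS)
VOLUME, SURFACE = volume_and_surface(EDGES, INTERNAL)
```

Repeating the calculation per time window gives bit volume per unit of time. Comparing a fitted exponent across many organizations against 3/4 is one way the deviation mentioned above might be tested, once the choice of quantities is settled.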