Human and Takifugu entropy

Human and Takifugu entropy is measured on a scaffold of the Takifugu genome, 3,387 Takifugu exons and partly assembled Human chromosomes. The exons can be divided into low and high entropic ones. The low entropic ones may come from archaic genes. Comparing the results with the (partly) assembled chromosomes Y and 2 shows that the scaffold genome gives a "random noise" entropy spectrum, whereas the chromosomes give a distinctive spectrum.

 

 

SATOCONOR.COM

J.G. van der Galiën ‘Shannon Entropy of Takifugu rubripes and Homo sapiens’ 4.2. (2005)


SATOCONOR.COM Journal of RANDOMICS

 

 

Shannon Entropy of Takifugu rubripes and Homo sapiens

By Johan G. van der Galiën (M.Sc.)

(For comments e-mail: johan.van.der.galien@satoconor.com)

Version 1.1. April 6, 2006 (version 1.0. from August 27, 2005)

 


 

Abstract:

A new method involving two successive compression stages is around 30% better than the single stage method performed by Ensembl for the databases on their FTP-site. The compression ratios of the second stage actually give a good estimation of the true entropy of the Takifugu and Human genomes (1.923 and 1.720 bits per base).

Takifugu has some low (1.1 - 1.8 bits per base) together with many high (1.8 - 2.0 bits per base) first order entropic exons. The Takifugu scaffold is compositional self-complementary, just like a Human DNA contig database and a Fruitfly DNA chunk database from Ensembl. But the overall base frequencies of 3,387 Takifugu exons show that compositional self-complementary DNA is not a must to store hereditary information effectively.

First order entropy spectra were recorded from the Takifugu scaffold and compared to those of partly assembled Human chromosomes. With this new method low entropic regions (peaks below the noise) around 13 (S/N = 10 dB), 28 (9 dB) and 58 Mbases (15 dB) were detected in Human chromosome-Y and at 89 Mbases (15 dB) in Human chromosome-Two. The last peak falls in the region of the 2p11.2 IGK operon at 88.98 – 89.46 Mbases.

 

1. Introduction

Fugu rubripes or Takifugu rubripes, the Japanese puffer fish, has one of the smallest genomes of the vertebrates: 335 Mbases, about 13% of the size of the human genome. This compact size is largely due to the scarcity of dispersed repetitive sequences, which account for less than 15% of its genome. Yet it contains a set of genes similar to that of humans.1 Takifugu has 225,582 and Humans have 245,231 gene exons according to the latest statistics from Ensembl.2 The intergenic and intronic regions are considered to be highly compact.1 According to Information Theory all of this should lead to high entropy for the whole genome.

Entropy is a concept that comes from Thermodynamics. In Physics entropy has the dimension of the energy of a statistical ensemble divided by its temperature. An ensemble is a dynamical system with a finite number of energy states that are all occupied to some extent (ergodic). The Second Law of Thermodynamics states that the entropy of the universe can only increase (arrow-of-time). The formula for the thermodynamical entropy, in Joules per Kelvin, is S = k log(W), where k is the Boltzmann constant and W is the number of possible molecular configurations. This formula is equivalent to the conceptual definition of information (negentropy); in other words, it relates information (Info) and entropy (H). It is this H, in bits per symbol, that will be calculated for DNA sequences in this paper. This H is also called the true, (near) infinite order Shannon, entropy to distinguish it from thermodynamical entropy.3 When I talk of entropy in this paper I mean the Shannon entropy.4,5
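As an illustration of the quantity used in the rest of this paper, the first order entropy of a sequence follows from the relative frequencies p of the single symbols as H = -sum(p * log2(p)). The sketch below is only a minimal Python rendering of that textbook formula; the function name first_order_entropy and the choice to ignore every symbol outside A, C, G and T are assumptions made for the example, not the original PASCAL code.

```python
import math
from collections import Counter


def first_order_entropy(sequence, alphabet="ACGT"):
    """Shannon entropy in bits per symbol from single-symbol frequencies.

    Symbols outside `alphabet` (for instance N) are ignored (assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Equal base frequencies give the maximum of 2 bits per base.
print(first_order_entropy("ACGTACGT"))  # 2.0
# A strongly biased composition gives a much lower value.
print(first_order_entropy("AAAAAAAT"))
```

With a four letter alphabet the value can never exceed 2 bits per base, which is why the exon entropies reported below all lie at or under 2.00.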

More scientists are working on entropy and DNA; a small selection can be found in the following references.6-8 One of these authors is H.C. Lee. He has said in a lecture "Information decreases with increasing entropy". Lee continued by saying: "Genomes are not closed systems, but the 2nd law does make it difficult for the genome to simultaneously: 1) grow stochastically. 2) acquire more information: a) lose entropy. b) gain order".9 I do not agree with him: according to my knowledge entropy IS the measure of information. So information AND entropy will increase as evolution progresses in time. Info and H formally differ only in the fact that Info is a negative and H is a positive value, but most of the time they are regarded as Info = H.3

 

2. Material and Methods

The databases used in this research were downloaded from the Ensembl website.2 The files are called fugu_rubripes.FUGU2.nov.dna.scaffold.fa (Modified 11 November 2004, 335,997,153 bytes, one base per byte, a preliminary framework of the whole genome) and fugu_rubripes12000.dat (17,217,761 bytes, gene bank). There are 21 .dat files in the Takifugu genebank FTP site of Ensembl called fugu_rubripes0.dat – fugu_rubripes20000.dat. The fugu_rubripes12000.dat was selected for further research solely because it was one of the smallest files, so that it still could be viewed in MS-Notepad, as part of the development of query programs in PASCAL. The files come with a .gz extension; in other words compressed with GZIP.12 You can uncompress them with GZIP –d filename.gz from the MS-DOS prompt.

A compressed file was made from the ASCII flatfile fugu_rubripes.FUGU2.nov.dna.scaffold.fa (one base per byte) with a PASCAL program in accordance with Table 1. Only the A, C, G and T entries were put in the resulting FuguCompress.dat file (4 bases per byte). This file was compressed further to FuguRandom.dat by GZIP -9, which gives the highest compression possible with that algorithm. On these files 35 randomness tests were performed (ENT, RanTests, RaBenZi1 and DIEHARD v0.2 beta).13-16 To estimate the true entropy I also compressed FuguCompress.dat with PASQDA 4.1 (option -6e), the best compressor available, to FuguRandom2.dat.17,18

The first order entropy is calculated either in bits per bit or in bits per base. To fully understand the bits per bit you need Table 1 for the coding of the bases.

 

Base    Bit pair representation
A       00
C       01
G       10
T       11

 

Table 1: The bit pair coding of the bases.
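The two compression stages described above can be sketched as follows. This is a Python illustration under stated assumptions, not the original PASCAL program: the name pack_fasta is hypothetical, FASTA header lines are assumed to start with ">", the first base of each group of four is assumed to go into the highest bit pair of the byte, and a trailing group of fewer than four bases is dropped. The second stage uses gzip at level 9 from the Python standard library as a stand-in for GZIP -9.

```python
import gzip

# Bit pair coding of Table 1.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_fasta(lines):
    """First stage: store four bases per byte, keeping only A, C, G and T."""
    out = bytearray()
    current, filled = 0, 0
    for line in lines:
        if line.startswith(">"):
            continue  # header line, carries no sequence data
        for ch in line:
            if ch not in CODE:
                continue  # N's and line delimiters are skipped
            current = (current << 2) | CODE[ch]
            filled += 1
            if filled == 4:
                out.append(current)
                current, filled = 0, 0
    # An incomplete last byte is discarded (assumption).
    return bytes(out)


def second_stage(packed):
    """Second stage: compress the packed bytes with gzip level 9."""
    return gzip.compress(packed, compresslevel=9)


packed = pack_fasta([">hypothetical_scaffold", "ACGTNNAC", "GTTTTT"])
print(len(packed), len(second_stage(packed)))
```

On very short inputs the gzip header makes the second stage output larger than its input; the ratio only becomes meaningful on files of realistic size.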

 

The statistics of the whole genome were done on fugu_rubripes.FUGU2.nov.dna.scaffold.fa and the screening of the whole genome with one Kbase resolution was done on fugu_rubripes.FUGU2.jul.dna.scaffold.fa (Modified 19 July 2005, 335,997,153 bytes). fugu_rubripes12000.dat was used for measuring the first order entropies of the exons. For comparison Human chromosomes-Y and -Two were also downloaded and screened: homo_sapiens.NCBI35.jul.dna.chromosome.Y.fa (Modified 22 July 2005, 58,663,443 bytes) and homo_sapiens.NCBI35.jul.dna.chromosome.2.fa (Modified 22 July 2005, 247,068,585 bytes).

Queries with first order entropy and base frequency measurements were done with PASCAL programs. Screening the whole genome of Takifugu and the two Human chromosomes with one Kbase resolution, and drawing the graphs, was done with VISUAL BASIC.NET programs. As an indication, screening Human chromosome-Two took about one hour on a 2.8 GHz Pentium-4 with 512 Mbytes.

Graphs with entropy spectra were made by screening with one Kbase resolution, then calculating the mean of the entropy over 33, 65 or 367 of the one Kbase data points and drawing the means as a scatter plot. Each data point was also connected by the DrawLine() method to the previous data point to obtain a spectrum with 33, 65 or 367 Kbases resolution. These graphs were recorded with 30 pixels per Mbase unit on the X-axis. The Y-axis was either 4590 or 1530 pixels per bits per base unit.
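The screening procedure can be sketched in a few lines of Python. This is only an illustration of the idea, not the VISUAL BASIC.NET program: the names window_entropies and block_means are hypothetical, the means are assumed to be taken over consecutive non-overlapping blocks of data points, and one Kbase windows without any A, C, G or T are assumed to be skipped.

```python
import math


def window_entropies(sequence, window=1000):
    """First order entropy (bits per base) of each consecutive window."""
    values = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        counts = [chunk.count(base) for base in "ACGT"]
        total = sum(counts)
        if total == 0:
            values.append(None)  # window of N's only (assumption: skipped)
            continue
        values.append(-sum((n / total) * math.log2(n / total)
                           for n in counts if n))
    return values


def block_means(values, points=33):
    """Mean entropy over consecutive blocks of one Kbase data points."""
    means = []
    for start in range(0, len(values) - points + 1, points):
        block = [v for v in values[start:start + points] if v is not None]
        means.append(sum(block) / len(block) if block else None)
    return means


# Synthetic 66 Kbase sequence: two spectrum points at 33 Kbases resolution.
demo = "ACGT" * 250 * 66
print(block_means(window_entropies(demo), points=33))
```

Plotting the returned means against their position in Mbases gives a spectrum of the kind shown in the figures below.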

 

3. Results

The original fugu_rubripes.FUGU2.nov.dna.scaffold.fa.gz had a size of 98,500,617 bytes. So GZIP achieved a compression factor of 0.2932 (.gz / .fa).

 

 

Item                                    FuguCompress.dat                      FuguRandom.dat
First Order Entropy in bits per bit     1.000000                              0.999768
First Order Entropy in bits per byte    7.866537                              7.997677
DIEHARD v0.2 beta Test                  219 p = zero or one out of 229 p's    131 p = zero or one out of 229 p's
Out of 35 Randomness Tests              Four Tests passed successfully        17 Tests passed successfully

 

Table 2: The results of randomness tests applied on two different compressed files from the scaffold.

 

The file FuguCompress.dat (78,868,306 bytes, compression factor = 0.2347), made by the PASCAL compressor program from fugu_rubripes.FUGU2.nov.dna.scaffold.fa, passes only four of the 35 randomness tests (from the suites ENT, RanTests, RaBenZi1 and DIEHARD v0.2 beta). Since FuguCompress.dat still contains the DNA patterns (I only stored four bases in a byte), it can be compressed additionally. I used GZIP -9, the slowest but best compression option. This gives the file FuguRandom.dat of 77,247,626 bytes. The compression ratio of this second step is 0.9795 and the overall ratio is 0.2299. FuguRandom2.dat of 75,819,935 bytes was obtained with PASQDA 4.1 (option -6e) with a factor of 0.9613, overall 0.2256. The compression ratios obtained with PASQDA for this second stage on the Human chromosomes-Y and -Two are 0.8321 and 0.8881.

The results of the first order entropy and base frequency measurements on fugu_rubripes.FUGU2.nov.dna.scaffold.fa are given in Table 3.

 

Item                                           Value          Percentage
Number of A's                                  86,197,383     27.32%
Number of C's                                  71,511,516     22.67%
Number of G's                                  71,518,763     22.67%
Number of T's                                  86,245,564     27.34%
Total ACGT                                     315,473,226    100%

Number of N's                                  13,687,490     4.07%
Number of X's                                  6,836,437      2.03%
Total ACGTNX                                   335,997,153    100%

First Order Entropy in bits per base (ACGT)    1.993721

Number of zero bits                            315,425,045    49.99%
Number of one bits                             315,521,407    50.01%
Total bits                                     630,946,452    100%

First Order Entropy in bits per bit (ACGT)     1.000000

 

 

Table 3: The results of first order entropy and base frequency measurements done on fugu_rubripes.FUGU2.nov.dna.scaffold.fa.
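The figures of Table 3 can be cross-checked from the four base counts alone. The short Python check below is an illustration added for the reader; it assumes the bit pair coding of Table 1 (A = 00, C = 01, G = 10, T = 11), so that every A contributes two zero bits, every T two one bits, and every C or G one of each.

```python
import math

# Base counts taken from Table 3.
counts = {"A": 86_197_383, "C": 71_511_516, "G": 71_518_763, "T": 86_245_564}

total = sum(counts.values())
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

# Bit counts implied by the coding of Table 1.
zero_bits = 2 * counts["A"] + counts["C"] + counts["G"]
one_bits = 2 * counts["T"] + counts["C"] + counts["G"]

# Whole bytes needed when four bases go in one byte.
packed_bytes = total // 4

print(total, round(entropy, 6))
print(zero_bits, one_bits, packed_bytes)
```

The bit counts and the packed size reproduce the values in Table 3 and the size of FuguCompress.dat exactly, and the entropy agrees with the tabulated 1.993721 bits per base.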

 

The first order entropies of the individual exons found in fugu_rubripes12000.dat (gene bank) are in the 1.1 – 2.00 bits per base range, given in significant digits. Some of the high and low extremes are given in Table 4.

 

Exon identification number       First order entropy in bits per base, given in significant digits (number of bases + N's in exon)

High entropic
Exon_id=SINFRUE00000595424       2.0 (61)
Exon_id=SINFRUE00000748876       2.00 (121)
Exon_id=SINFRUE00000623413       2.00 (813)
Exon_id=SINFRUE00000748661       2.00 (909)
Exon_id=SINFRUE00000594480       2.00 (346)

Low entropic
Exon_id=SINFRUE00000752999       1.3 (93)
Exon_id=SINFRUE00000608397       1.39 (124)
Exon_id=SINFRUE00000762960       1.30 (287)
Exon_id=SINFRUE00000772961       1.1 (50 + 1)
Exon_id=SINFRUE00000734910       1.2 (75)

 

Table 4: Extreme high and low entropic exons found in fugu_rubripes12000.dat.

 

The base frequency and overall first order entropy measurements on fugu_rubripes12000.dat are given in Table 5.

 

Item                            Value                    Percentage
Number of A's in exons          141,062                  25.00%
Number of C's in exons          152,754                  27.07%
Number of G's in exons          151,447                  26.84%
Number of T's in exons          118,990                  21.09%
Total bases in exons            564,253                  100%

Number of N's in exons          67                       0.01% (of total bases and N's)

Number of exons found           3,387

First Order Entropy of exons    1.99317 bits per base

 

 

Table 5: Some statistics on the 3,387 exons in the fugu_rubripes12000.dat genebank.

 

4. Discussion

The compression factor of 0.2932 for fugu_rubripes.FUGU2.nov.dna.scaffold.fa.gz is higher than can be deduced from the fact that one can at least go from storing one base per byte to storing four bases per byte (a consequence of Table 1). According to this the compression factor should be lower (0.25). On top of that, there must be many patterns in the scaffold (repetitive N regions, dispersed repetitive sequences, introns, exons etc.) which can be compressed additionally. I think a fully sequenced and assembled genome would contain even more of these patterns than a scaffold. Theoretically, when you compress at the Shannon Limit the ratio can be much smaller than 0.25. But the Shannon Limit has never been achieved, and I know for a fact that GZIP does not score badly compared to the best compression algorithms.17 The compression factor of 0.2347 for FuguCompress.dat is what can be expected, because four bases now go in a byte and the X's and the N's are deleted. Actually the size of the file is exactly what you calculate from Table 3 (78,868,306.5 calculated versus 78,868,306 found), given that the PASCAL compressing program cannot, of course, process half a byte. FuguCompress.dat can be compressed further with GZIP -9 because it still contains DNA patterns in the contigs. This gives FuguRandom.dat with a compression ratio of 0.9795. The state-of-the-art compressor PASQDA 4.1 reduces the FuguCompress.dat file even further than GZIP: the compression ratio is now 0.9613. This factor can be regarded as a good indication of the true entropy of the scaffold in bits per bit.5 This then leads to a true entropy estimation of 1.923 bits per base. For Human chromosome-Y and -Two this becomes 1.664 and 1.776 bits per base. (True entropy has the property that when divided by the size of the alphabet symbol in bits it gives the true entropy in bits per bit. The compression ratio is the same as an estimation of the true entropy in bits per bit. Consequently, multiplying it by two gives the true entropy estimation in bits per base.5)
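The arithmetic behind these estimates is small enough to write out. The Python lines below only retrace the calculation described in the text, using the file sizes and second stage ratios reported in the Results section; the function name true_entropy_per_base is an illustrative choice.

```python
def true_entropy_per_base(second_stage_ratio, bits_per_base=2):
    """Estimate in bits per base from a second stage compression ratio.

    The ratio is read as bits per bit; a base occupies two bits (Table 1).
    """
    return second_stage_ratio * bits_per_base


# Takifugu scaffold: FuguRandom2.dat / FuguCompress.dat (sizes in bytes).
fugu_ratio = 75_819_935 / 78_868_306
fugu_estimate = true_entropy_per_base(fugu_ratio)

# Human chromosomes: second stage ratios as reported for PASQDA.
human_y = true_entropy_per_base(0.8321)
human_two = true_entropy_per_base(0.8881)
human_mean = (human_y + human_two) / 2

print(round(fugu_ratio, 4), round(fugu_estimate, 3))
print(round(human_y, 3), round(human_two, 3), round(human_mean, 3))
```

These values match the 1.923, 1.664 and 1.776 bits per base quoted above, and the mean of the two Human chromosomes gives the 1.720 bits per base used in the Conclusions.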

From the first order entropy of FuguCompress.dat (1.000000 bits per bit) it looks like genomic DNA is actually a good Random Number Generator. But as mentioned in the Results section, the compressed file only passes four out of 35 randomness tests (passed: Compression Based On Bytes, Arithmetic Mean Of Data Bytes, Arithmetic Mean Of Data Bits and First Order Entropy In Bits Per Bit), which include some of the most popular tests found on the internet. The DIEHARD Test Suite in particular is very critical, and the file passes none of its 14 tests. (Actually there are 17 tests in the suite, but the files are too small for three of them.) This is also reflected by the fact that 219 out of 229 p's are zero or one. A p of zero or one indicates that the specific part of a test is not passed. FuguRandom.dat scores much better, with 17 out of 35 tests passed: Arithmetic Mean Of The Data Bits, First Order Entropy In Bits Per Byte, Compression Based On Bits and Bytes, Craps1 and 2, Up-Down Runs, 3D-Spheres, Minimum Distance, CD Park, Binary Rank 6x8 and 32x32 and 31x31, Birthday Spacings1 and 2, Zipf Double and Zipf Real48. This file is much more random despite the fact that it is only a slightly more compressed version of FuguCompress.dat (second stage compression ratio of 0.9795). So slightly more compression increases the randomness property of a file dramatically.

The number of bases Ensembl specifies in their statistics is 329,140,338. (For Takifugu this equals the Golden Path Length, which means that all the difficult to sequence telomeres and centromeres were also processed.) I found a somewhat larger number: 329,160,716 A, C, G, T and N entries in fugu_rubripes.FUGU2.nov.dna.scaffold.fa. The difference comes from the fact that the clones are sequenced in an overlapping manner and the overlapping sequences are also dumped in the scaffold to be used later on in building the assembly (in other words the one chromosome per .fa file collection to be made by the International Fugu Genome Consortium).

From Table 3 you can see that the PASCAL programs detect the N's, A's, C's, G's and T's. Additionally X's are detected. These come from the fact that there are about 70 preceding ASCII characters with information about the database, and after each chunk of 60 bases there is a delimiter. (A small rectangle, an ASCII character, in MS-Notepad; this becomes a return in MS-Word after copying and pasting.) The Total ACGTNX from Table 3 is exactly the size of the file fugu_rubripes.FUGU2.nov.dna.scaffold.fa in bytes. This is a check that the PASCAL program does indeed detect all the information available in the database. But the percentage of X's should be around (1 / 61) 1.64% and I found 2.03%. Does this mean that there are still errors in the fugu_rubripes.FUGU2.nov.dna.scaffold.fa database?
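The bookkeeping described here can be sketched as a byte classifier. The Python function below is a hypothetical stand-in for the PASCAL program, not a copy of it: it assumes that only upper case A, C, G, T and N count as sequence entries and that every other byte (header characters and line delimiters) is counted as an X.

```python
def count_fasta_bytes(data):
    """Classify every byte of a flatfile as ACGT, N or X (anything else)."""
    counts = {"ACGT": 0, "N": 0, "X": 0}
    for byte in data:
        if byte in b"ACGT":
            counts["ACGT"] += 1
        elif byte == ord("N"):
            counts["N"] += 1
        else:
            counts["X"] += 1  # header text, line delimiters, etc.
    return counts


# Tiny made-up example: a header line followed by one sequence line.
sample = b">hdr\nACGTN\n"
result = count_fasta_bytes(sample)
print(result, sum(result.values()) == len(sample))

# Expected share of X's if they were only one delimiter per 60 bases.
print(round(100 / 61, 2))  # 1.64
```

Because every byte falls in exactly one class, the three counts always add up to the file size, which is the consistency check used in the text.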

The fact that number of A's ≈ T's and C's ≈ G's is eye catching in Table 3. In other words the base complements are almost equal in number. Lee et al. have also observed this phenomenon and call it the compositional self-complementary property of a single DNA strand.6 (Not to be confused with a DNA strand that can curl back on itself because it consists of complementary blocks.) This is also true for the base frequencies in homo_sapiens.NCBI35.nov.dna.contig.fa and drosophila_melanogaster.DROM3A.nov.dna.chunk.fa.19 My assumption (hypothesis 1) is that this has something to do with the chemical and physical stability and / or biochemical activity of helical DNA. In other words nature's laws demand an A's ≈ T's and C's ≈ G's distribution of the bases in a single strand. Of course the relatively small percentage of N's (4% Takifugu, 11% Human, 5% Drosophila), which can be any base, will not alter the picture very much. The same is observed for origins of Fugu, Human and Drosophila, which can be considered as good samples taken at random from genomes despite the fact that they also sometimes contain a very small percentage of N's.19

The first order entropy of the scaffold genome, N's not taken into account, is 1.993721 bits per base and the overall first order entropy of 3,387 exons is 1.99317 bits per base. These figures can be regarded as coming from very large samples taken at random. So they give a good indication for the whole genome and all exons of Takifugu despite the N's. Since first order entropy is calculated from relative frequencies of single base units, this figure for the scaffold is not affected by the random dumping of the contigs.5 But the true entropy from the compression ratio is. One can say definitely that this estimation is too high.

The extremely high entropic exons contain very much information; most likely they code for parts of multifunctional proteins. They all have: number of A's ≈ C's ≈ G's ≈ T's. I assume that this extraordinary circumstance arose later on in evolution and that the LUCA had only low entropic exons, coding for monofunctional proteins, which by natural selection evolved into high information exons in some cases. In other cases exons are still evolving. Maybe there even exist exons in present day species which are reminiscent of or equal to the exons of the LUCA.

 

5. Conclusions

A file compressed by putting 4 bases in a byte from the scaffold is not a good random number generator, although the first order entropy in bits per bit suggests that it is. This can be understood because a truly random file produced this way, with a true entropy very near 1 bit per bit, has number of A's ≈ C's ≈ G's ≈ T's. Such a file would pass all randomness tests. The Takifugu genome does not even fulfil this almost equal base criterion; it also has a true entropy nearer to 0.9613 bits per bit. These are all indications that there are too many higher order patterns in the file compressed this way. It is also shown that further compressing the file (GZIP -9) towards the Shannon Limit improves the randomness property of the DNA data.5

The compression factor found by Ensembl on the scaffold is 0.2932, and my method of storing four bases in a byte and subsequent compression by PASQDA.EXE gives 0.2256 (no N's and X's!). My method is therefore (0.2932 - 0.2256) * 100 / 0.2256 = 30% better.

The first order entropy estimation of the whole genome of Takifugu is 1.993721 bits per base. From 3,387 exons of Takifugu the first order entropy estimation is 1.99317 bits per base. The conclusion is that the sum of all exons has almost the same entropy as the whole genome. An estimation of the true entropy of the Takifugu genome, based on compression data of the scaffold, is 1.923 bits per base. This remains an estimation because you are extrapolating from a scaffold to a genome and the compression is only near the Shannon Limit. Since the Human chromosomes-Y and -Two have a much lower value than the scaffold (1.664 and 1.776 respectively), one can say that the Takifugu genome might have a higher information density than the Human genome. There is also a difference in true entropy between Human chromosomes. An estimation of the Human true entropy is the mean of the chromosomes tested: 1.720 bits per base.

The fact that single strands of DNA are compositional self-complementary may come from relatively recent developments in evolution, because it increases the entropy of the whole genome and increasing entropy is the arrow-of-time. On the other hand it may be that compositional self-complementary DNA has something to do with optimal chemical and physical stability and / or biochemical activity of the helix (hypothesis 1).

The exons of Takifugu can be divided into low and high entropic ones. The border is 1.8 bits per base. It might be possible to develop query programs that find solely low entropic exon genes in the databases of Ensembl.
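A query of that kind could start from a simple classifier around the 1.8 bits per base border. The Python sketch below is only an indication of how such a filter might look; the function names, the example sequences and the decision to count an exon of exactly 1.8 bits per base as high entropic are all assumptions.

```python
import math
from collections import Counter

BORDER = 1.8  # bits per base, the border between low and high entropic exons


def exon_entropy(exon):
    """First order entropy in bits per base over A, C, G and T only."""
    counts = Counter(ch for ch in exon.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def classify_exon(exon):
    """Return 'low' below the border and 'high' at or above it (assumption)."""
    return "low" if exon_entropy(exon) < BORDER else "high"


# Made-up sequences, not exons from the Ensembl databases.
examples = {
    "balanced": "ACGT" * 25,
    "a_rich": "A" * 90 + "C" * 10,
}
for name, seq in examples.items():
    print(name, round(exon_entropy(seq), 2), classify_exon(seq))
```

Applied to every exon record of a .dat file, the same test would separate the two groups summarised in Table 4.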

A compositional self-complementary DNA region is not a must for hereditary information (hypothesis 2).

All data points of a 367 Kbases resolution first order entropy spectrum of the Takifugu scaffold lie somewhere between 1.969 - 1.988 bits per base. In a spectrum with better resolution (33 Kbases, Fig. 1) of the 210 - 240 Mbases region all data points lie between 1.953 - 1.994 bits per base.

 

 

Fig. 1: Part of the first order entropy spectrum of the Takifugu rubripes scaffold. N's not taken into account. Note that this is the 210 - 240 Mbases range of 315 Mbases ACGT in total.

 

Although the range has widened there is also no distinctive structure. The spectrum still looks like "random noise". However the 65 Kbases resolution spectrum (Fig. 2) of the Human Y-chromosome has peaks above the random noise (1.) at 13 (S/N = 10 dB), 28 (S/N = 9 dB) and 58 Mbases (S/N = 10 dB).21

 

Signal-to-noise ratio, in dB (Decibel):

S/N = 20 log10(Amplitude Signal / Amplitude Noise) (1.)
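To give a feeling for the S/N values quoted for the peaks, equation (1.) can be evaluated in both directions. The Python functions below are a plain transcription of that formula; the names snr_db and amplitude_ratio are chosen for this illustration only.

```python
import math


def snr_db(amplitude_signal, amplitude_noise):
    """Signal-to-noise ratio in dB according to equation (1.)."""
    return 20 * math.log10(amplitude_signal / amplitude_noise)


def amplitude_ratio(db):
    """Inverse of equation (1.): amplitude ratio belonging to a dB value."""
    return 10 ** (db / 20)


# A peak ten times the noise amplitude corresponds to 20 dB.
print(snr_db(10, 1))

# Amplitude ratios behind the S/N values reported for the peaks.
for db in (9, 10, 15):
    print(db, round(amplitude_ratio(db), 2))
```

So a 15 dB peak stands out from the noise by a factor of roughly 5.6 in amplitude, and a 9 or 10 dB peak by a factor of about 3.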

 

 

Fig. 2: First order entropy spectrum of chromosome Y of Homo sapiens. N's taken into account. The 65 Kbases data points consisting entirely of N's lie on the 2 bits per base line, shown this way for clarity. Since it was great news some years ago that the human genome was totally sequenced, the N's must be gaps that are not assembled yet.

The 33 Kbases resolution spectrum (Fig. 3) of the Human chromosome-Two has broad bands and a large peak at 89 Mbases (S/N = 15 dB).

 

 

Fig. 3: First order entropy spectrum of chromosome Two of Homo sapiens. N's taken into account; see further Fig. 2.

 

The definition of a scaffold is: In genomic mapping, a series of contigs that are in the right order but not necessarily connected in one continuous stretch of sequence. The definition of a contig is: A stretch of genomic DNA assembled from raw sequence data. The contig lengths vary and may span part of a gene or many genes. When enough overlapping contigs become available they are assembled into whole chromosome sequences. I interpret these definitions in such a way that they can explain the difference between Fig. 1 on the one hand and Fig. 2 and 3 on the other. Fig. 2 and 3 are from chromosomes and Fig. 1 is from a scaffold genome. The entropy spectra of Fig. 2 and 3 clearly show distinctive structure and Fig. 1 does not. Although Fig. 1 looks like random noise it is, just like the other spectra, also a fingerprint, or some kind of bar code if you like, of the database in question. The difference in the spectra can only mean that the scaffold is a random distribution of relatively small contigs and cannot give a distinctive entropy spectrum. The small high and low entropy regions are randomly distributed in a scaffold and not aligned in chromosome sequence order. The spectrum is therefore also random noise, reflecting the distribution of the relatively small contigs in the scaffold. Maybe a better resolution (< 33 Kbases) would give a distinctive spectrum of the scaffold, when the resolution is smaller than the mean contig size. But what is the point of such a spectrum? It cannot be used to locate low and high entropic regions in a genome because the contigs are not in the right order.

It looks like low entropic exons, such as those found in Takifugu, can also be found in Human chromosomes and are clustered in low entropic regions like the ones I identified in Fig. 2 and 3. On the other hand the peaks can also come from dispersed repetitive sequences, because they too can have a low first order entropy. But this is not a must: for instance, repetitive sequences of all ACGT or CAGT etc. will have maximal first order entropy.
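The remark about repetitive sequences is easy to check numerically, because first order entropy only depends on the base composition and not on the order of the bases. The Python lines below are a small demonstration with made-up repeats; the helper name composition_entropy is hypothetical.

```python
import math


def composition_entropy(sequence):
    """First order entropy in bits per base from the A, C, G, T counts."""
    counts = [sequence.count(base) for base in "ACGT"]
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n)


repeats = {
    "ACGT repeat": "ACGT" * 100,  # all four bases equally frequent
    "CAGT repeat": "CAGT" * 100,  # same composition, different order
    "AT repeat": "AT" * 200,      # only two bases
    "AAAT repeat": "AAAT" * 100,  # biased towards A
}
for name, seq in repeats.items():
    print(name, round(composition_entropy(seq), 3))
```

The first two repeats reach the maximum of 2 bits per base although they are perfectly regular, which is exactly why first order entropy alone cannot tell a repeat from an information rich region; only the compositionally biased repeats show up as low entropic.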

The peak at 89 Mbases of Fig. 3 certainly lies in the 88.98 – 89.46 Mbases region of the 2p11.2 IGK operon.2 In the theory that genes sometimes evolved from low to high entropic, or in other words from coding for mono- to multifunctional proteins, this IGK operon may have remained archaic: it may have originated from life forms in the early stages of evolution and hence be very widespread, in a reminiscent form, in present day species.

I must also mention the broad bands at 79 and 84 Mbases in Fig. 3, which indicate whole segments, around 7 Mbases wide, of relatively low first order entropy.

Maybe more fine structure will be seen when the resolution of the graphs is increased.

Entropy spectra of chromosomes give the possibility to characterize and analyze fully assembled genomes. They may also make it possible to locate archaic genes. Such information may be useful for (partly) reconstructing the genome of the LUCA. It could well be that these low entropic regions have more or less the same relative loci in species from the same family, like the mammals.

The entropy spectrum of a DNA molecule is a fingerprint, or a bar code if you like, and can maybe be used to characterize and identify the origin of a DNA sample from (incomplete) sequence data.

 

Acknowledgements

Special thanks to the International Fugu Genome Consortium and the International Human Genome Sequencing Consortium for the use of their excellent databases, downloadable at Ensembl's FTP site.1,2,22

 


 

Notes & References:

1) Anonymous ‘Fugu genome project’ Institute of Molecular and Cell Biology, International Fugu Genome Consortium

http://www.fugu-sg.org/index.html

2) Anonymous ‘Browse a genome’ e! Ensembl

http://www.ensembl.org/index.html

3) Daugman G. ‘Information theory and coding’ Computer Science Tripos Part II, Michaelmas Term 12 Lectures,

http://www.cl.cam.ac.uk/Teaching/2003/InfoTheory/Notes.pdf

4) Meinsma G. ‘Data compression & information theory’

http://wwwhome.math.utwente.nl/~meinsmag/shannon.pdf

5) Van der Galiën J.G. ‘State-of-the-art compressors as tools for true entropy estimations’ Scientia Araneae Totius Orbis 4.4.  (2005)

http://home.versatel.nl/galien8

6) Chang C.H., Hsieh L.S., Chen T.Y., Chen H.D., Luo L.F. and Lee H.C. ‘Shannon information in complete genomes’ IEEE Proc. Computer Sys. Bioinformatics, 20-30 (2004)

http://sansan.phy.ncu.edu.tw/~hclee/rpr/Lee_H_Shannon.pdf

7) Stenkvist B., Strande G. ‘Entropy as an algorithm for the statistical description of DNA cytometric data obtained by image analysis microscopy’ Anal. Cell Pathol. 2(3), 159 - 165 (1990)

8) Kayser K., Kayser G.M., Altiner M., ‘Calculations of non-biased entropy for analysis of DNA distributions’ E J PATHOL

http://ejpath.amu.edu.pl/EJP32/972-04.HTM

9) Lee H.C. ‘Shannon information in complete genomes’ CSB2004, August 17 - 19, Stanford. From a MS-PowerPoint presentation that used to be on the internet.

10) Blom J. ‘The silicon cell: Towards computing the living cell’

http://homepages.cwi.nl/~gollum/SiC/

11) Institute For Advanced Biosciences ‘The E-cell project’

http://www.e-cell.org/

12) Gailly J-L., Adler M. ‘The GZIP home page’

http://www.gzip.org/

13) Walker J. ‘Ent: A pseudorandom number sequence test program’

http://www.fourmilab.ch/random/

14) Knuth D.E. ‘The art of computer programming Volume 2 seminumerical algorithms’ Reading MA, Addison-Wesley (1969)

15) Van der Galiën J.G. ‘Proposal for a new kind of randomness test (Rabenzi)’ Scientia Araneae Totius Orbis 3.3. (2004)

http://home.versatel.nl/galien8

16) Marsaglia G. ‘Diehard battery of tests of randomness v0.2 beta’

http://www.cs.hku.hk/~diehard/

17) Bergmans W. ‘The help file compression test’

http://www.maximumcompression.com/data/hlp.php

18) Mahoney M. ‘The PAQ data compression programs’

http://www.cs.fit.edu/~mmahoney/compression/

19) Results not shown, but available on request.

20) Woese C. ‘The universal ancestor’ Proc. Natl. Acad. Sci. USA, 95, 6854-6859 (1998)

21) Wikipedia ‘Signal-to-noise ratio’

http://en.wikipedia.org/wiki/Signal_to_noise

22) Anonymous ‘Human genome project centres’

http://www.sanger.ac.uk/HGP/publication2001/centres.shtml