Human and Takifugu
entropy is measured on a scaffold of the Takifugu genome, 3,387 Takifugu exons
and partly assembled Human chromosomes. The exons can be divided into low and high
entropic ones. Maybe the low entropic ones are from archaic genes. Comparing
the results with (partly) assembled chromosomes Y and 2 shows that the scaffold
genome gives a "random noise" entropy spectrum whereas the
chromosomes give a distinctive spectrum.
SATOCONOR.COM
J.G. van der Galiën ‘Shannon Entropy of Takifugu
rubripes and Homo sapiens’ 4.2. (2005)
Shannon Entropy of Takifugu
rubripes and Homo sapiens
By Johan G. van der Galiën (M.Sc.)
(For comments e-mail: johan.van.der.galien@satoconor.com)
Version 1.1. April 6, 2006 (version 1.0. from August 27, 2005)
Abstract:
A new method
involving two successive compression stages is around 30% better than the
single stage method performed by Ensembl for the databases on their FTP-site.
The compression ratios of the second stage actually give a good estimation of
the true entropy of the Takifugu and Human genomes (1.923 and 1.720 bits per
base).
Takifugu has some
low (1.1 - 1.8 bits per base) together with many high first order entropic
exons (1.8 - 2.0 bits per base). Takifugu is compositionally self-complementary, just like a
Human DNA contig and a Fruitfly DNA chunk database from Ensembl. But the
overall base frequencies of 3,387 Takifugu exons show that compositionally
self-complementary DNA is not a must to effectively store hereditary information.
First order entropy
spectra were recorded from the Takifugu scaffold and compared to those of partly
assembled Human chromosomes. With this new method low entropic regions (peaks
below the noise) around 13 (S/N = 10 dB), 28 (9 dB) and 58 Mbases (15 dB) were
detected in Human chromosome-Y and at 89 Mbases (15 dB) in Human
chromosome-Two. The last peak falls in the region of the 2p11.2 IGK operon at
88.98 – 89.46 Mbases.
1. Introduction
Fugu rubripes or Takifugu rubripes, the
Japanese puffer fish, has one of the smallest genomes of the vertebrates: 335
Mbases, about 13% of the size of the human genome. This compact size is largely due
to the scarcity of dispersed repetitive sequences, which account for less than
15% of its genome. Yet it contains a set of genes similar to that of humans.1
Takifugu has 225,582 and Humans have 245,231 gene exons according to the latest
statistics from Ensembl.2 The intergenic and intronic regions are
considered to be highly compact.1 According to Information Theory
all of this should lead to high entropy for the whole genome.
Entropy is a concept that comes from
Thermodynamics. In Physics entropy has the dimension of energy of a statistical
ensemble divided by its temperature. An ensemble is a dynamical system with a
finite number of energy states that all are occupied to some extent (ergodic).
The Second Law of Thermodynamics states that the entropy of the universe can
only increase (arrow-of-time). The formula for the thermodynamical entropy, in
joules per kelvin, is S = k log(W), where k is the Boltzmann constant and
W is the number of possible molecular configurations. This formula is
equivalent to the conceptual definition of information (negentropy). In other
words information (Info) and entropy (H). It is this H in bits per symbol that
will be calculated for DNA sequences in this paper. This H is also called the
true, (near) infinite order Shannon, entropy to distinguish it from
thermodynamical entropy.3 When I talk of entropy in this paper I
mean the
Other scientists are also working on
entropy and DNA; see the following references for a small selection.6-8
One of these authors is H.C. Lee. He has said in a lecture "Information
decreases with increasing entropy". Lee continued by saying: "Genomes
are not closed systems, but the 2nd law does make it difficult for the genome
to simultaneously: 1) grow stochastically. 2) acquire more information: a) lose
entropy. b) gain order".9 I do not agree with him: to
my knowledge entropy IS the measure of information. So information AND entropy
will increase as evolution progresses in time. Info and H formally differ only
in the fact that Info is a negative and H is a positive value, but most of the
time they are regarded as Info = H.3
2. Material and Methods
The databases used in this research
were downloaded from the Ensembl website.2 The files are called
fugu_rubripes.FUGU2.nov.dna.scaffold.fa (Modified 11 November 2004, 335,997,153
bytes, one base per byte, a preliminary framework of the whole genome) and
fugu_rubripes12000.dat (17,217,761 bytes, gene bank). There are 21 .dat files
in the Takifugu genebank FTP site of Ensembl called fugu_rubripes0.dat –
fugu_rubripes20000.dat. The fugu_rubripes12000.dat was selected for further
research solely because it was one of the smallest files, so that it still
could be viewed in MS-Notepad, as part of the development of query programs in
PASCAL. The files come with a .gz extension; in other words compressed with
GZIP.12 You can uncompress them with GZIP –d filename.gz from
the MS-DOS prompt.
A compressed file was made from the
ASCII flatfile fugu_rubripes.FUGU2.nov.dna.scaffold.fa (one base per byte) with
a PASCAL program in accordance with Table 1. Only the A, C, G and T entries were
put in the resulting FuguCompress.dat file (4 bases per byte), which was
compressed further to FuguRandom.dat by GZIP -9 to give the highest compression
possible with that algorithm. On these files 35 randomness tests were
performed (ENT, RanTests, RaBenZi1 and DIEHARD v0.2 beta).13-16 To
estimate the true entropy I also compressed FuguCompress.dat with
PASQDA 4.1 (option -6e), the best compressor available, to FuguRandom2.dat.17,18
The first order entropy is
calculated either in bits per bit or in bits per base. To fully understand the
bits per bit you need Table 1 for the coding of the bases.
| Base | Bit pair representation |
|---|---|
| A | 00 |
| C | 01 |
| G | 10 |
| T | 11 |

Table 1: The bit pair coding of the bases.
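The packing step implied by Table 1 can be sketched as follows. This is a minimal illustration in Python, not the original PASCAL program: the function name `pack_bases`, the bit order inside a byte (first base in the highest two bits) and the decision to drop a trailing group of fewer than four bases are all assumptions. Header lines are assumed to have been stripped beforehand.

```python
# Bit-pair codes taken from Table 1.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bases(sequence: str) -> bytes:
    """Pack A/C/G/T characters into bytes, four bases per byte.

    Any other character (N's, line breaks) is skipped.
    Assumption: the first base of each group lands in the two highest bits.
    """
    out = bytearray()
    current = 0
    filled = 0
    for ch in sequence:
        code = CODES.get(ch)
        if code is None:
            continue  # not one of the four bases, so not stored
        current = (current << 2) | code
        filled += 1
        if filled == 4:
            out.append(current)
            current = 0
            filled = 0
    # Assumption: an incomplete final group (1 to 3 bases) is discarded.
    return bytes(out)


packed = pack_bases("ACGTNNTTTTAC")
```

With this coding, a sequence of only A, C, G and T shrinks to one quarter of its one-base-per-byte size before any pattern-based compression is applied.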
The statistics of the whole genome were
done on fugu_rubripes.FUGU2.nov.dna.scaffold.fa and the screening of the whole
genome with one Kbase resolution was done on
fugu_rubripes.FUGU2.jul.dna.scaffold.fa (Modified 19 July 2005, 335,997,153
bytes). fugu_rubripes12000.dat was used for measuring the first order entropies
of the exons. For comparison Human chromosomes-Y and -Two were also downloaded
and screened: homo_sapiens.NCBI35.jul.dna.chromosome.Y.fa (Modified 22 July
2005, 58,663,443 bytes) and homo_sapiens.NCBI35.jul.dna.chromosome.2.fa
(Modified 22 July 2005, 247,068,585 bytes).
Queries with first order entropy and
base frequency measurements were done with PASCAL programs. Screening the
whole genome of Takifugu and the two Human chromosomes with one Kbase
resolution, and drawing the graphs, was done with VISUAL BASIC.NET programs. As
an indication, screening Human chromosome-Two took about one hour on a 2.8 GHz
Pentium-4 with 512 Mbytes.
Graphs with entropy spectra were
calculated by screening with one Kbase resolution, then taking the
mean of the entropy over 33, 65 or 367 of the one Kbase data points and plotting
these means as a scatter plot. Each data point was also connected by the DrawLine() method to the
previous data point to obtain a spectrum with 33, 65 or 367 Kbases resolution.
These graphs were recorded with 30 pixels per Mbase on the X-axis. The Y-axis
scale was either 4590 or 1530 pixels per bit per base.
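The screening and averaging described above could look roughly like the sketch below. It is written in Python rather than VISUAL BASIC.NET, and several details are assumptions: the means are taken over non-overlapping groups of data points, the input holds only A, C, G and T, and the names `first_order_entropy` and `entropy_spectrum` are illustrative.

```python
import math
from collections import Counter


def first_order_entropy(window: str) -> float:
    """First order Shannon entropy in bits per base of one window."""
    counts = Counter(b for b in window if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: a window without any bases scores zero
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def entropy_spectrum(sequence: str, window: int = 1000, smooth: int = 33):
    """Entropy per `window` bases, then the mean over groups of `smooth` points.

    Assumption: groups do not overlap, so each output point covers
    window * smooth bases (33 Kbases with the defaults).
    """
    per_window = [
        first_order_entropy(sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, window)
    ]
    return [
        sum(per_window[i:i + smooth]) / smooth
        for i in range(0, len(per_window) - smooth + 1, smooth)
    ]


# 66,000 bases of a perfectly balanced repeat give two data points of 2 bits per base.
spectrum = entropy_spectrum("ACGT" * 16500)
```

Changing `smooth` to 65 or 367 gives the other two resolutions mentioned in the text.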
3. Results
The original
fugu_rubripes.FUGU2.nov.dna.scaffold.fa.gz had a size of 98,500,617 bytes. So
there was a compression factor of 0.2932 (.gz / .fa) done by GZIP.
| | FuguCompress.dat | FuguRandom.dat |
|---|---|---|
| First Order Entropy in bits per bit | 1.000000 | 0.999768 |
| First Order Entropy in bits per byte | 7.866537 | 7.997677 |
| DIEHARD v0.2 beta Test | 219 p = zero or one out of 229 p's | 131 p = zero or one out of 229 p's |
| Out of 35 Randomness Tests | Four Tests passed successfully | 17 Tests passed successfully |

Table 2: The results of randomness tests applied on two different compressed files from the
scaffold.
The compressed file FuguCompress.dat
(78,868,306 bytes, compression factor = 0.2347), made by the PASCAL compressor
program from fugu_rubripes.FUGU2.nov.dna.scaffold.fa, passes only four of the
35 randomness tests (from the suites ENT, RanTests,
RaBenZi1 and DIEHARD v0.2 beta). Since FuguCompress.dat still contains the DNA
patterns (I only stored four bases per byte), it can be compressed
further. I used GZIP -9, the slowest but best compression option. This gives
the file FuguRandom.dat of 77,247,626 bytes. The compression ratio of this second step is
0.9795 and the overall ratio is 0.2299. FuguRandom2.dat of 75,819,935 bytes was obtained
with PASQDA 4.1 (option -6e) with a factor of 0.9613, overall 0.2256. The
compression ratios of the Human chromosomes-Y and -Two for this second stage,
done by PASQDA, are 0.8321 and 0.8881.
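As a worked check, the ratios quoted above follow directly from the reported file sizes. The short Python sketch below only redoes that arithmetic; the variable names are illustrative and no files are read.

```python
# File sizes in bytes, as reported in the text.
SCAFFOLD_FA = 335_997_153      # fugu_rubripes.FUGU2.nov.dna.scaffold.fa
FUGU_COMPRESS = 78_868_306     # four bases per byte
FUGU_RANDOM = 77_247_626       # FuguCompress.dat after GZIP -9
FUGU_RANDOM2 = 75_819_935      # FuguCompress.dat after PASQDA 4.1


def ratio(compressed: int, original: int) -> float:
    """Compression ratio as compressed size divided by original size."""
    return compressed / original


first_stage = ratio(FUGU_COMPRESS, SCAFFOLD_FA)      # about 0.2347
second_gzip = ratio(FUGU_RANDOM, FUGU_COMPRESS)      # about 0.9795
overall_gzip = ratio(FUGU_RANDOM, SCAFFOLD_FA)       # about 0.2299
second_pasqda = ratio(FUGU_RANDOM2, FUGU_COMPRESS)   # about 0.9613
overall_pasqda = ratio(FUGU_RANDOM2, SCAFFOLD_FA)    # about 0.2256
```

All five values agree with the figures in the text to within 0.0001.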
The results of the first order
entropy and base frequency measurements on
fugu_rubripes.FUGU2.nov.dna.scaffold.fa are given in Table 3.
| Item | Value | Percentage |
|---|---|---|
| Number of A’s | 86,197,383 | 27.32% |
| Number of C’s | 71,511,516 | 22.67% |
| Number of G’s | 71,518,763 | 22.67% |
| Number of T’s | 86,245,564 | 27.34% |
| Total ACGT | 315,473,226 | 100% |
| Number of N’s | 13,687,490 | 4.07% |
| Number of X’s | 6,836,437 | 2.03% |
| Total ACGTNX | 335,997,153 | 100% |
| First Order Entropy in bits per base (ACGT) | 1.993721 | |
| Number of zero bits | 315,425,045 | 49.99% |
| Number of one bits | 315,521,407 | 50.01% |
| Total bits | 630,946,452 | 100% |
| First Order Entropy in bits per bit (ACGT) | 1.000000 | |

Table 3: The results of first order entropy and base frequency measurements done on
fugu_rubripes.FUGU2.nov.dna.scaffold.fa.
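The entropy values in Table 3 can be reproduced from the base counts alone with the standard first order Shannon formula, H = -Σ p log2(p). The Python sketch below does this; the bit counts are derived from the base counts through the coding of Table 1 (A = 00, C = 01, G = 10, T = 11). The function name `entropy_bits` is illustrative.

```python
import math

# Base counts from Table 3.
A, C, G, T = 86_197_383, 71_511_516, 71_518_763, 86_245_564


def entropy_bits(counts) -> float:
    """First order Shannon entropy in bits per symbol from raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(n / total * math.log2(n / total) for n in counts if n > 0)


per_base = entropy_bits([A, C, G, T])          # about 1.993721 bits per base

# With the Table 1 coding, A contributes two zero bits, T two one bits,
# and C and G one zero bit and one one bit each.
zero_bits = 2 * A + C + G
one_bits = 2 * T + C + G
per_bit = entropy_bits([zero_bits, one_bits])  # about 1.000000 bits per bit
```

The derived bit counts match the "Number of zero bits" and "Number of one bits" rows of the table exactly.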
The first order entropies of the
individual exons found in fugu_rubripes12000.dat (gene bank) are in the 1.1
– 2.00 bits per base range, given in significant digits. Some of the high and
low extremes are given in Table 4.
| Exon identification number | First order entropy in bits per base, given in significant digits (number of bases + N’s in exon) |
|---|---|
| High entropic | |
| Exon_id=SINFRUE00000595424 | 2.0 (61) |
| Exon_id=SINFRUE00000748876 | 2.00 (121) |
| Exon_id=SINFRUE00000623413 | 2.00 (813) |
| Exon_id=SINFRUE00000748661 | 2.00 (909) |
| Exon_id=SINFRUE00000594480 | 2.00 (346) |
| Low entropic | |
| Exon_id=SINFRUE00000752999 | 1.3 (93) |
| Exon_id=SINFRUE00000608397 | 1.39 (124) |
| Exon_id=SINFRUE00000762960 | 1.30 (287) |
| Exon_id=SINFRUE00000772961 | 1.1 (50 + 1) |
| Exon_id=SINFRUE00000734910 | 1.2 (75) |

Table 4: Extreme high and low entropic exons found in fugu_rubripes12000.dat.
The base frequency and overall first order entropy measurements on
fugu_rubripes12000.dat are given in Table 5.
| Item | Value | Percentage |
|---|---|---|
| Number of A’s in exons | 141,062 | 25.00% |
| Number of C’s in exons | 152,754 | 27.07% |
| Number of G’s in exons | 151,447 | 26.84% |
| Number of T’s in exons | 118,990 | 21.09% |
| Total bases in exons | 564,253 | 100% |
| Number of N’s in exons | 67 | 0.01% (of total bases and N's) |
| Number of exons found | 3,387 | |
| First Order Entropy of exons | 1.99317 bits per base | |

Table 5: Some statistics on the 3,387 exons in the fugu_rubripes12000.dat genebank.
4. Discussion
The compression factor of 0.2932 for
fugu_rubripes.FUGU2.nov.dna.scaffold.fa.gz is higher than can be deduced from
the fact that one can at least go from storing one base per byte to storing
four bases per byte (this is a consequence of Table 1). According to this the
compression factor should be lower (0.25). But on top of that comes the fact
that there must be many patterns in the scaffold (repetitive N regions,
dispersed repetitive sequences, introns, exons etc.) which can additionally be
compressed. I think a fully sequenced and assembled genome would contain even
more of these patterns than a scaffold. And theoretically when you compress at
the
Judging from the first order
entropy (1.000000 bits per bit) of FuguCompress.dat, it looks as if genomic DNA is
actually a good Random Number Generator. But as is mentioned in the Results
section the compressed file only passes four out of 35 randomness tests
(Passed: Compression Based On Bytes, Arithmetic Mean Of Data Bytes, Arithmetic
Mean Of Data Bits and First Order Entropy In Bits Per Bit), which include some of
the most popular tests found on the internet. Specifically the DIEHARD Test Suite is
very critical and the file passes none of these 14 tests. (Actually there are
The number of bases that Ensembl specifies in their
statistics is 329,140,338 (for Takifugu this is equal to the Golden Path Length, which
means that all the difficult to sequence telomeres and centromeres were also
processed). I found a somewhat larger number: 329,160,716 A,
C, G, T and N entries in fugu_rubripes.FUGU2.nov.dna.scaffold.fa. The
difference comes from the fact that the clones are sequenced in an overlapping
manner and the overlapping sequences are also dumped in the scaffold to be used
later on in building the assembly (in other words the one chromosome per .fa
file collection to be made by the International Fugu Genome Consortium).
From Table 3 you can see that the
PASCAL program detects the N’s, A’s, C’s, G’s and T’s. Additionally
X’s are detected. These come from the fact that there are about 70 preceding ASCII
characters with information about the database and that after each chunk of 60 bases
there is a delimiter. (A small rectangle, an ASCII character, in MS-Notepad;
this becomes a return in MS-Word after copying and pasting.) The Total ACGTNX
from Table 3 is exactly the size of the file
fugu_rubripes.FUGU2.nov.dna.scaffold.fa in bytes. This is a check that the
PASCAL program does indeed detect all the information available in the
database. But the percentage of X's should be around (1 / 61) 1.64% and I found
2.03%. Does this mean that there are still errors in the
fugu_rubripes.FUGU2.nov.dna.scaffold.fa database?
The fact that number of A's ≈
T's and C's ≈ G's is eye catching from Table
The first order entropy, N’s not
taken into account, of the scaffold genome is 1.993721 bits per base and the
overall first order entropy of 3,387 exons is 1.99317 bits per base. These
figures can be regarded as coming from very large samples taken at random. So
they give a good indication for the whole genome and all exons of Takifugu
despite the N's. Since first order entropy is calculated from the relative
frequencies of single base units, this figure for the scaffold is not affected
by the random dumping of the contigs.5 But the true entropy estimated from the
compression ratio is affected. One can say definitely that this estimation is too
high.
The extreme high entropic exons
contain very much information; most likely they code for parts of
multifunctional proteins. They all have: number of A's ≈ C's ≈ G's ≈
T's. I assume that this extraordinary circumstance arose later on in
evolution and that the LUCA had only low entropic exons, coding for
monofunctional proteins, which by natural selection evolved into high
information exons in some cases. In other cases exons are still evolving. Maybe
there even exist exons in present day species which are reminiscent of or equal to
the exons of the LUCA.
5. Conclusions
A file compressed by putting 4 bases
in a byte from the scaffold is not a good random number generator, although the
first order entropy in bits per bit does suggest that it is. This can be understood
because a truly random file produced this way, with a true entropy very near 1
bit per bit, has number of A's ≈ C's ≈ G's ≈ T's. Such a file would pass
all randomness tests. The Takifugu genome does not even fulfil this almost
equal base criterion; moreover it has a true entropy closer to 0.9613 bits per bit.
These are all indications that there are too many higher order patterns in the file
compressed this way. It is also shown that further compressing the file (GZIP -9)
to near the Shannon Limit improves the randomness property of the DNA data.5
The compression factor found
by Ensembl on the scaffold is 0.2932, and my method of storing four bases in a
byte and subsequent compression by PASQDA.EXE gives 0.2256 (no N's and X's!).
My method is therefore (0.2932 - 0.2256) * 100 / 0.2256 = 30% better.
The first order entropy estimation
of the whole genome of Takifugu is 1.993721 bits per base. From 3,387 exons of
Takifugu the first order entropy estimation is 1.99317 bits per base. The
conclusion is that the sum of all exons has almost the same entropy as the
whole genome. An estimation of the true entropy of the Takifugu genome, based
on compression data of the scaffold, is 1.923 bits per base. This remains
an estimation because you are extrapolating from a scaffold to a genome and the
compression is only near the Shannon Limit. Since the Human chromosomes-Y and
-Two have a much lower value than the scaffold (1.664 and 1.776 respectively)
one can say that the Takifugu genome might have a higher information density
than the Human genome. There is also a difference in true entropy between Human
chromosomes. An estimation of the Human true entropy is the mean of the
chromosomes tested: 1.720 bits per base.
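The figures in the two paragraphs above can be retraced with a few lines of arithmetic. The Python sketch below assumes, as the numbers suggest, that the true entropy estimate in bits per base is simply the second-stage compression ratio multiplied by the 2 bits that each base occupies in the packed file. The names are illustrative.

```python
BITS_PER_PACKED_BASE = 2.0  # four bases per byte, see Table 1


def bits_per_base(second_stage_ratio: float) -> float:
    """Estimate of the true entropy from the second-stage compression ratio."""
    return BITS_PER_PACKED_BASE * second_stage_ratio


takifugu = bits_per_base(0.9613)    # about 1.923 bits per base
human_y = bits_per_base(0.8321)     # about 1.664 bits per base
human_two = bits_per_base(0.8881)   # about 1.776 bits per base
human_mean = (human_y + human_two) / 2   # about 1.720 bits per base

# Relative gain of the two-stage method over the single GZIP stage.
improvement_percent = (0.2932 - 0.2256) * 100 / 0.2256   # about 30
```

Because the estimate scales linearly with the compression ratio, any compressor that does better than PASQDA on the packed file would lower these values further.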
The fact that single strands of DNA
are compositionally self-complementary can come from developments not long ago in
evolution, because it increases the entropy of the whole genome and increasing
entropy is the arrow-of-time. On the other hand it can be that compositionally
self-complementary DNA has something to do with optimal chemical and physical
stability and / or biochemical activity of the helix (hypothesis 1).
The exons of Takifugu can be divided
into low and high entropic. The border is 1.8 bits per base. It might be
possible to develop query programs that find solely low entropic exon genes in
the databases of Ensembl.
A compositional self-complementary
DNA region is not a must for hereditary information (hypothesis 2).
All data points of a 367 Kbases
resolution first order entropy spectrum of the Takifugu scaffold lie somewhere
between 1.969 - 1.988 bits per base. In a spectrum with better resolution (33
Kbases, Fig. 1) of the 210 - 240 Mbases region all data points lie between
1.953 - 1.994 bits per base.

Fig. 1: Part of the
first order entropy spectrum of the Takifugu rubripes scaffold. N's not
taken into account. Note that this is the 210 - 240 Mbases range of 315 Mbases
ACGT in total.
Although the range has widened there
is still no distinctive structure. The spectrum still looks like "random
noise". However the 65 Kbases resolution spectrum (Fig. 2) of the Human
Y-chromosome has peaks above the random noise (see Eq. 1.) at 13 (S/N = 10 dB), 28 (S/N = 9 dB) and 58 Mbases (S/N
= 10 dB).21
Signal-to-noise ratio in dB (decibel):
S/N = 20 log10(Amplitude Signal / Amplitude Noise) (1.)
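To give a feeling for Eq. 1., the Python sketch below converts between an amplitude ratio and decibels. How the signal and noise amplitudes were read from the spectra is not specified here, so the functions take them as plain numbers; the names `snr_db` and `amplitude_ratio` are illustrative.

```python
import math


def snr_db(signal_amplitude: float, noise_amplitude: float) -> float:
    """Signal-to-noise ratio in dB according to Eq. 1."""
    return 20.0 * math.log10(signal_amplitude / noise_amplitude)


def amplitude_ratio(db: float) -> float:
    """Inverse of Eq. 1.: the amplitude ratio that corresponds to a dB value."""
    return 10.0 ** (db / 20.0)


ratio_9_db = amplitude_ratio(9.0)     # about 2.8 times the noise amplitude
ratio_10_db = amplitude_ratio(10.0)   # about 3.2 times the noise amplitude
ratio_15_db = amplitude_ratio(15.0)   # about 5.6 times the noise amplitude
```

A peak reported at 15 dB therefore has an amplitude of roughly five to six times that of the noise.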

Fig. 2: First order
entropy spectrum of chromosome Y of Homo sapiens. N's taken into
account. 65 Kbases data points consisting of all N's lie on the 2 bits per base line,
shown this way for clarity. Since it was great news that the human genome was
totally sequenced some years ago the N's must be gaps that are not assembled
yet.
The 33 Kbases resolution spectrum
(Fig. 3) of the Human chromosome-Two has broad bands and a large peak at 89
Mbases (S/N = 15 dB).

Fig. 3: First
order entropy spectrum of chromosome Two of Homo sapiens, N's taken into
account, see further Fig. 2.
The definition of a scaffold is: In
genomic mapping, a series of contigs that are in the right order but not
necessarily connected in one continuous stretch of sequence. The definition of
a contig is: A stretch of genomic DNA assembled from raw sequence data. The
contig lengths vary and may span part of a gene or many genes. When enough
overlapping contigs become available they are assembled into whole chromosome
sequences. I interpret these definitions so that they can explain the difference
between Fig. 1 and Fig. 2 and 3. Fig. 2 and 3 are from chromosomes and
Fig. 1 is from a scaffold genome. The entropy spectra of Fig. 2 and 3 clearly
show distinctive structure and Fig. 1 does not. Although Fig. 1 looks like
random noise it is, just like the other spectra, also a fingerprint, or some kind
of a bar code if you like, of the database in question. The difference in the
spectra can only mean that the scaffold is a random distribution of relatively
small contigs and cannot give a distinctive entropy spectrum. The small high
and low entropy regions are randomly distributed in a scaffold and not aligned
in chromosome sequence order. The spectrum is of course also random noise,
reflecting the distribution of the relatively small contigs in the scaffold.
Maybe a better resolution (< 33 Kbases) would give a distinctive spectrum of
the scaffold, when the resolution is smaller than the mean contig size. But
what is the point of such a spectrum? It cannot be used to locate low and high
entropic regions in a genome because the contigs are not in the right order.
It looks as if low entropic
exons, like those which can be found in Takifugu, can also be found in Human
chromosomes and are clustered in low entropic regions like those I identified in Fig.
2 and 3. On the other hand the peaks can also come from dispersed repetitive
sequences, because they too can have a low first order entropy. But this is not a
must: for instance, repetitive sequences of all ACGT or CAGT etc. will have
maximal first order entropy.
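The last remark can be illustrated with a small Python sketch: a pure ACGT repeat reaches the maximal first order entropy of 2 bits per base, yet a general-purpose compressor shrinks it to almost nothing because of its higher order pattern. The sketch uses Python's `zlib` as a stand-in compressor, which is an assumption for illustration and not one of the tools used in this paper.

```python
import math
import zlib
from collections import Counter


def first_order_entropy(sequence: str) -> float:
    """First order Shannon entropy in bits per base."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


repeat = "ACGT" * 2500                      # 10,000 bases of a pure repeat
entropy = first_order_entropy(repeat)       # 2.0 bits per base, the maximum

# One base per byte in, compressed bytes out; level 9 is the strongest setting.
compressed = zlib.compress(repeat.encode("ascii"), 9)
compression_ratio = len(compressed) / len(repeat)   # far below 0.25
```

This is the difference between first order entropy, which only looks at single base frequencies, and the true entropy estimated from compression.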
The peak at 89 Mbases of Fig. 3 is
certainly in the 88.98 – 89.46 Mbases region of the 2p11.2 IGK operon.2 In the
theory which says that genes sometimes evolved from low to high entropic, or
in other words from coding for mono- to multifunctional proteins, this IGK can
still be a remaining archaic gene: originated from life forms in the early stages of
evolution and hence very widespread, in a reminiscent form, in present
day species.
I also must mention the broad bands
at 79 and 84 Mbases of Fig. 3, which indicate whole segments of around 7 Mbases
wide with relatively low first order entropy.
Maybe more fine structure will be
seen when the resolution of the graphs is increased.
Entropy spectra of chromosomes give
the possibility to characterize and analyze fully assembled genomes. They may
also make it possible to locate archaic genes. Such information may be useful
for (partly) reconstructing the genome of the LUCA. It could well be that these
low entropic regions have more or less the same relative loci in species from
the same family, like the mammals.
The entropy spectrum of a DNA
molecule is a fingerprint, or a bar code if you like, and can maybe be used to
characterize and identify the origin of a DNA sample from (incomplete) sequence
data.
Acknowledgements
Special thanks to the International
Fugu Genome Consortium and the International Human Genome Sequencing Consortium
for the use of their very excellent databases, downloadable at Ensembl's FTP
site.1,2,22
-o0o-
Notes & References:
1) Anonymous ‘Fugu genome
project’ Institute of Molecular and Cell Biology, International Fugu Genome
Consortium
http://www.fugu-sg.org/index.html
2) Anonymous ‘Browse a
genome’ e! Ensembl
http://www.ensembl.org/index.html
3) Daugman G. ‘Information
theory and coding’ Computer Science Tripos Part II, Michaelmas Term 12
Lectures,
http://www.cl.cam.ac.uk/Teaching/2003/InfoTheory/Notes.pdf
4) Meinsma G. ‘Data compression
& information theory’
http://wwwhome.math.utwente.nl/~meinsmag/shannon.pdf
5) Van der Galiën J.G.
‘State-of-the-art compressors as tools for true entropy estimations’ Scientia
Araneae Totius Orbis 4.4. (2005)
http://home.versatel.nl/galien8
6) Chang C.H., Hsieh L.S.,
Chen T.Y., Chen H.D., Luo L.F. and Lee H.C. ‘Shannon information in complete
genomes’ IEEE Proc. Computer Sys. Bioinformatics, 20-30 (2004)
http://sansan.phy.ncu.edu.tw/~hclee/rpr/Lee_H_Shannon.pdf
7) Stenkvist B., Strande G.
‘Entropy as an algorithm for the statistical description of DNA cytometric data
obtained by image analysis microscopy’ Anal. Cell Pathol. 2(3),
159 - 165 (1990)
8) Kayser K., Kayser G.M.,
Altiner M., ‘Calculations of non-biased entropy for analysis of DNA
distributions’ E J PATHOL
http://ejpath.amu.edu.pl/EJP32/972-04.HTM
9) Lee H.C. ‘
10) Blom J. ‘The silicon
cell: Towards computing the living cell’
http://homepages.cwi.nl/~gollum/SiC/
11) Institute For Advanced
Biosciences ‘The E-cell project’
12) Gailly J-L., Adler M.
‘The GZIP home page’
13) Walker J. ‘Ent: A
pseudorandom number sequence test program’
http://www.fourmilab.ch/random/
14) Knuth D.E. ‘The art of computer programming Volume 2
seminumerical algorithms’
15) Van der Galiën J.G.
‘Proposal for a new kind of randomness test (Rabenzi)’ Scientia
Araneae Totius Orbis 3.3.
(2004)
http://home.versatel.nl/galien8
16) Marsaglia G. ‘Diehard
battery of tests of randomness v0.2 beta’
http://www.cs.hku.hk/~diehard/
17) Bergmans W. ‘The help
file compression test’
http://www.maximumcompression.com/data/hlp.php
18) Mahoney M. ‘The PAQ data
compression programs’
http://www.cs.fit.edu/~mmahoney/compression/
19) Results not shown, but
available on request.
20) Woese C. ‘The
universal ancestor’ Proc. Natl. Acad. Sci. USA, 95, 6854-6859 (1998)
21) Wikipedia
‘Signal-to-noise ratio’
http://en.wikipedia.org/wiki/Signal_to_noise
22) Anonymous ‘Human genome
project centres’
http://www.sanger.ac.uk/HGP/publication2001/centres.shtml