No, there's more to us than just DNA. For example, methylation (the addition of methyl chemical groups to some bases, which isn't tracked in "normal" DNA sequencing) controls which genes get expressed by which cells. Plus there's a reasonable chance of errors in the sequencing, because the DNA has to be copied repeatedly to identify the bases.
- Data size
Most bioinformatics data formats are plain ASCII, so even the reference data would be about 3 GB per person. But the 1000 Genomes data contains sequencing reads in which each DNA base is sampled multiple times (20-40x is a typical "read depth") so that errors in identifying bases can be minimised. Each base of each of these samples has an associated quality score (roughly a 6-bit value), plus there are identifiers for all of the billion-odd reads per person.
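To make that concrete, here's a rough back-of-the-envelope sketch of the raw data size. The 30x depth, 100-base reads, and per-read identifier overhead are illustrative assumptions, not figures from any specific dataset:

```python
# Back-of-the-envelope estimate of raw sequencing data size for one person.
GENOME_BASES = 3_000_000_000   # ~3 billion bases in a haploid human genome
READ_DEPTH = 30                # each base sampled ~30 times (20-40x is typical)
READ_LENGTH = 100              # bases per read (assumed; platform dependent)
ID_BYTES = 40                  # assumed per-read identifier overhead

total_bases = GENOME_BASES * READ_DEPTH           # 9e10 sampled bases
reads = total_bases // READ_LENGTH                # ~900 million reads
size_bytes = 2 * total_bases + reads * ID_BYTES   # 1 byte base + 1 byte quality

print(f"reads: {reads:,}")                        # ~a billion reads per person
print(f"raw size: {size_bytes / 1e9:.0f} GB")     # ~216 GB uncompressed
```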
For fairly simple organisms we have 'cold booted' DNA: a cell's DNA has been swapped for artificial DNA from a closely related organism, and the cell still worked. So, while humans have a lot of genetic information, I suspect you could get it to work if you're willing to accept fairly low success chances.
As for read errors, each one is effectively just a 'mutation', and those are generally fairly harmless. If you stay below, say, 1,000 mutations (which would still take very high accuracy) you haven't significantly reduced your chances of success.
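A quick sanity check on what a 1,000-mutation budget implies. The 3-billion-base figure comes from the discussion above; the conversion to the Phred scale (Q = -10·log10(P)) is the standard quality-score formula:

```python
import math

GENOME_BASES = 3_000_000_000   # bases in the genome (from the discussion above)
ERROR_BUDGET = 1_000           # the mutation budget suggested above

per_base_error = ERROR_BUDGET / GENOME_BASES   # ~3.3e-7, i.e. 1 in 3 million
phred = -10 * math.log10(per_base_error)       # Phred scale: Q = -10 * log10(P)

print(f"allowed per-base error rate: {per_base_error:.1e}")  # 3.3e-07
print(f"equivalent Phred quality:    Q{phred:.0f}")          # ~Q65

# A single read is typically Q30-40, which is why a 20-40x consensus of
# overlapping reads is needed to reach this kind of accuracy.
```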
We have the raw data on the actual DNA for several people, which is what you need to make a copy. What we don't know is what the data means and what mutations exist in the wild, which is what you'd want to know before starting to make changes.
Actually, we don't have the full DNA sequence for any human. For example, if you look at the data from, say, the Genome Reference Consortium, the first 10,000 bases of Chromosome 1 are designated as N, i.e. unknown.
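You can check this yourself with a minimal sketch; the file path is a placeholder for any GRC reference FASTA (e.g. GRCh38 chromosome 1):

```python
import re

def n_runs(fasta_path, min_len=1000):
    """Yield 1-based (start, end) positions of runs of N in the first FASTA record."""
    parts, in_record = [], False
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if in_record:
                    break              # only look at the first record
                in_record = True
            else:
                parts.append(line.strip().upper())
    seq = "".join(parts)
    for m in re.finditer(r"N+", seq):
        if m.end() - m.start() >= min_len:
            yield m.start() + 1, m.end()

# Placeholder path -- point it at a real reference FASTA:
# for start, end in n_runs("chr1.fa"):
#     print(start, end)    # on GRCh38 chr1 the first run is 1..10000
```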
True that we don't have a full sequence, but that's not the best example. The telomeres (chromosome ends) consist of the same short set of bases repeated thousands of times. Recent research suggests that their length is probably super important, and we're good at approximating that length but not at measuring it exactly.
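A small illustration of why that is: the human telomere repeat unit (TTAGGG) is far shorter than a sequencing read, so reads from the telomere are nearly interchangeable and can only be counted, not placed at a unique position. The function below is just a toy sketch of that idea:

```python
TELOMERE_UNIT = "TTAGGG"   # the human telomere repeat unit

def telomeric_fraction(read):
    """Fraction of a read covered by exact copies of the telomere repeat."""
    count, i = 0, 0
    while i + len(TELOMERE_UNIT) <= len(read):
        if read[i:i + len(TELOMERE_UNIT)] == TELOMERE_UNIT:
            count += 1
            i += len(TELOMERE_UNIT)
        else:
            i += 1
    return count * len(TELOMERE_UNIT) / len(read)

# Two reads of pure repeat look identical, so neither can be anchored to a
# unique position -- we can estimate total repeat content, not exact length.
print(telomeric_fraction(TELOMERE_UNIT * 16))          # 1.0
print(telomeric_fraction("ACGT" + TELOMERE_UNIT * 8))  # ~0.92, mostly repeat
```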
There are a bunch of regions of 'N' in the reference sequence; most are just repeats.
The genome is incredibly complex, and yes, there's much we still can't represent accurately. As one example, some genes are given a single location in the reference genome, while every person actually has multiple copies scattered across the genome.