Insights from the revised complete genome sequences of Acinetobacter baumannii strains AB307-0294 and ACICU belonging to global clones 1 and 2

2. Abstract

The Acinetobacter baumannii global clone 1 (GC1) isolate AB307-0294, recovered in the USA in 1994, and the global clone 2 (GC2) isolate ACICU, isolated in 2005 in Italy, were among the first A. baumannii isolates to be completely sequenced. AB307-0294 is susceptible to most antibiotics and has been used in many genetic studies, and ACICU belongs to a rare GC2 lineage. The complete genome sequences, originally determined using 454 pyrosequencing technology, which is known to generate sequencing errors, were re-determined using Illumina MiSeq and MinION (ONT) technologies, and a hybrid assembly was generated using Unicycler. Comparison of the resulting new high-quality genomes to the earlier 454-sequenced versions identified a large number of nucleotide differences affecting protein coding features, and allowed the sequence of the long and highly repetitive bap and blp1 genes to be properly resolved for the first time in ACICU. Comparisons of the annotations of the original and revised genomes revealed a large number of differences in the protein coding features (CDSs), underlining the impact of sequence errors on protein sequence predictions and core gene determination. On average, 400 predicted CDSs were longer or shorter in the revised genomes and about 200 CDS features were no longer present.

3. Impact statement

The genomes of the first 10 A. baumannii strains to be completely sequenced underpin a large amount of published genetic and genomic analysis. However, most of their genome sequences contain substantial numbers of errors, as they were sequenced using 454 pyrosequencing, which is known to generate errors, particularly in homopolymer regions, and were finished using manual PCR and capillary sequencing steps to bridge contig gaps and repetitive regions. Assembly of the very large and internally repetitive genes for the biofilm-associated proteins Bap and BLP1 was a recurring problem.
As these strains continue to be used for genetic studies and their genomes continue to be used as references in phylogenomics studies, including core gene determination, there is value in improving the quality of their genome sequences. To this end, we re-sequenced two such strains that belong to the two major globally distributed clones of A. baumannii, using a combination of highly accurate short-read and gap-spanning long-read technologies. Annotation of the revised genome sequences eliminated hundreds of incorrect CDS feature annotations and corrected hundreds more. Given that these revisions affected hundreds of non-existent or incorrect CDS features currently cluttering GenBank protein databases, it can be envisaged that similar revision of other early bacterial genomes that were sequenced using error-prone technologies will affect thousands of CDSs currently listed in GenBank and other databases. These corrections will improve the quality of predicted protein sequence data stored in public databases. The revised genomes will also improve the accuracy of future genetic and comparative genomic analyses incorporating these clinically important strains.

4. Data summary

The corrected complete genome sequence of A. baumannii AB307-0294 has been deposited in GenBank under accession number CP001172.2 (chromosome). The corrected complete genome sequence of ACICU has been deposited in GenBank under accession numbers CP031380 (chromosome), CP031381 (pACICU1) and CP031382 (pACICU2). The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
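
The abstract summarises the annotation comparison as counts of CDSs that became longer, became shorter, or were no longer present in the revised genomes. The following Python sketch illustrates one simple way such a tally could be produced. It is not the authors' pipeline: it assumes, purely for illustration, that both annotations are available as dictionaries mapping a shared locus tag to a predicted protein sequence, and the function name `classify_cds`, the category labels and the toy sequences are all hypothetical.

```python
from collections import Counter


def classify_cds(original, revised):
    """Classify each original CDS by how its predicted protein changed.

    Both arguments map a locus tag to a protein sequence. Sharing locus
    tags between the two annotations is an assumption of this sketch.
    """
    result = {}
    for tag, old_protein in original.items():
        new_protein = revised.get(tag)
        if new_protein is None:
            result[tag] = "absent"        # feature no longer annotated
        elif len(new_protein) > len(old_protein):
            result[tag] = "longer"        # e.g. a false frameshift removed
        elif len(new_protein) < len(old_protein):
            result[tag] = "shorter"
        elif new_protein == old_protein:
            result[tag] = "unchanged"
        else:
            result[tag] = "substituted"   # same length, different residues
    return result


# Toy data (invented sequences, not taken from either genome).
original = {
    "cds1": "MKTLLA",
    "cds2": "MKV",
    "cds3": "MSTNPQRL",
    "cds4": "MAAA",
    "cds5": "MLLK",
}
revised = {
    "cds1": "MKTLLA",
    "cds2": "MKVTTGHL",
    "cds3": "MSTN",
    "cds5": "MLIK",
}

classes = classify_cds(original, revised)
summary = Counter(classes.values())
print(summary)
```

In practice, locus tags are often reassigned when a genome is re-annotated, so a real comparison would more likely pair features by sequence similarity or by position after aligning the two assemblies; the dictionary lookup here only keeps the sketch short.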
