Time-resolved comparative molecular evolution of oxygenic photosynthesis

Oxygenic photosynthesis starts with the oxidation of water to O2, a light-driven reaction catalysed by photosystem II. Cyanobacteria are the only prokaryotes capable of water oxidation and therefore, it is assumed that relative to the origin of life and bioenergetics, the origin of oxygenic photosynthesis is a late innovation. However, when exactly water oxidation originated remains an unanswered question. Here we use relaxed molecular clocks to compare one of the two ancestral core duplications that are unique to water-oxidizing photosystem II, that leading to CP43 and CP47, with some of the oldest well-described events in the history of life. Namely, the duplication leading to the Alpha and Beta subunits of the catalytic head of ATP synthase, and the divergence of archaeal and bacterial RNA polymerases and ribosomes. We also compare it with more recent events such as the duplication of cyanobacteria-specific FtsH metalloprotease subunits, of CP43 variants used in a variety of photoacclimation responses, and the speciation events leading to Margulisbacteria, Sericytochromatia, Vampirovibrionia, and other clades containing anoxygenic phototrophs. We demonstrate that the ancestral core duplication of photosystem II exhibits patterns in the rates of protein evolution through geological time that are nearly identical to those of the ATP synthase, RNA polymerase, or the ribosome. Furthermore, we use ancestral sequence reconstruction in combination with comparative structural biology of photosystem subunits, to provide additional evidence supporting the premise that water oxidation had originated before the ancestral core duplications. Our work suggests that photosynthetic water oxidation originated closer to the origin of life and bioenergetics than can be documented based on species trees alone.

polymerases and ribosomes. We also compare it with more recent events such as the 23 duplication of cyanobacteria-specific FtsH metalloprotease subunits, of CP43 variants used in 24 a variety of photoacclimation responses, and the speciation events leading to 25 Margulisbacteria, Sericytochromatia, Vampirovibrionia, and other clades containing 26 anoxygenic phototrophs. We demonstrate that the ancestral core duplication of photosystem 27 II exhibits patterns in the rates of protein evolution through geological time that are nearly 28 identical to those of the ATP synthase, RNA polymerase, or the ribosome. Furthermore, we 29 use ancestral sequence reconstruction in combination with comparative structural biology of 30 photosystem subunits, to provide additional evidence supporting the premise that water 31 oxidation had originated before the ancestral core duplications. Our work suggests that 32 (formerly Melainabacteria) [3,4], followed by Sericytochromatia [5] and Margulisbacteria 48 [6]. Currently, no photosynthetic strains have been described in these groups of uncultured 49 bacteria and this has led to the hypothesis that oxygenic photosynthesis arose during the time 50 spanning the divergence of Vampirovibrionia and the MRCA of Cyanobacteria, starting from 51 an entirely non-photosynthetic ancestral state [5,7]. Molecular clock studies have suggested 52 that the span of time between the divergence of Cyanobacteria and Vampirovibrionia could 53 be of several hundred million years [8,9]. However, it is still unclear from molecular clock 54 analyses and the microfossil record when exactly the MRCA of Cyanobacteria occurred [10]. 55 For example, two recent independent analysis placed the same cyanobacterial ancestor 56 around a mean age younger than 1.5 Ga [11] and older than 3.5 Ga [12]. 57 58

Evolution of photosystem II 59
The heart of PSII is made up of a heterodimeric reaction centre (RC) core coupled to a core 60 antenna. The two subunits of the RC core of PSII are known as D1 and D2, and these are 61 associated respectively with the antenna subunits known as CP43 and CP47. D1 and CP43 62 make up one monomeric half of the RC, and D2 and CP47, the other half. Water oxidation is 63 catalysed by a Mn4CaO5 cluster coordinated by ligands from both D1 and CP43 [13,14]. The 64 cluster is functionally coupled to a redox active tyrosine-histidine pair (YZ-H190) also 65 located in D1, which relays electrons from Mn to the oxidised chlorophyll pigments of the 66 128  Figure S2). We found that the level of sequence identity between any two 141 type I RC proteins is always greater than between a type I RC protein and CP43/CP47 142 (Supplementary Table S1). Therefore, the distance between CP43/CP47 and other type I 143 antenna domains is the largest distance in the molecular evolution of RC proteins after that 144 between type I and type II RC core domains. The phylogenetic relationships between type I 145 and type II RCs have been reviewed in detail before [19,34,35]. These are briefly 146 summarized and schematized in Supplementary Figure S3. 147 The phylogeny of Alpha and Beta subunits of the F-type ATP synthase showed that 148 all Cyanobacteria have a F-type ATP synthase, and a fewer number of strains have an 149 additional Na + -translocating ATPase (N-ATPase) of the bacterial F-type, as had been 150 reported before [36]. We found that MSV have a standard F-type ATP synthase 151 (Supplementary Figure S4), but some N-ATPase Alpha and Beta sequences were also found 152 in Vampirovibrionia and Sericytochromatia datasets, but not in Margulisbacteria. In this 153 study we focused on the standard F-type ATP synthase of Cyanobacteria for further analysis. 154 The phylogeny of bacterial RNA polymerase subunit β (RpoB), with the intention of 155 molecular clock analysis, focused on Cyanobacteria and MSV, as well as phyla with known 156 phototrophic representatives and included Thermotogae and Aquificae as potential roots 157 (Supplementary Figure S5). The tree was largely consistent with previous observed 158 relationships between the selected groups [37], within Cyanobacteria and MSV,and within 159 other phototrophs and their non-phototrophic relatives [38,39]. The only exception was 160 Aquificae, which branched as a sister clade to Acidobacteria, a feature that had been reported 161 before for RpoB [27], and likely represents an ancient horizontal gene transfer event. 162 163

Distances and rates 164
To gather temporal information, we compared the phylogenetic distances between CP43 and 165 CP47, Alpha and Beta subunits of the F-type ATP synthase, and archaeal and bacterial RpoB 166 (visualized in Figure 2, but see also Figures S1b, S2 and S3b). We found that the distances 167 between Alpha and Beta, and the divergence of archaeal and bacterial RpoB, are very large 168 relative to the distance between the divergence of Vampirovibrionia and Cyanobacteria. In 169 the case of RpoB, the distance between Vampirovibrionia and Cyanobacteria is about a fifth 170 of the distance between Archaea and Bacteria. However, the distance between CP43 and 171 CP47 (but also between D1 and D2 [18]) is of similar magnitude to that between Alpha and 172 Beta, and to that between archaeal and bacterial RpoB, but substantially surpasses the 173 distance between MVS and Cyanobacteria ( Figure 2). These observations suggest that 174 ancestral proteins to CP43/CP47 and D1/D2 existed before the divergences of MVS. 175 176 177 178   Supplementary Table S2). We found consistently, that Vampirovibrionia and 195 Margulisbacteria have larger within-group mean distances compared to Cyanobacteria,which 196 suggests greater rates of evolution in the non-photosynthetic clades. These were consistently 197 larger for Margulisbacteria relative to the other two groups. For example, RpoB in 198 Vampirovibrionia and Margulisbacteria showed 1.6x and 4.0x larger corrected mean 199 distances than Cyanobacteria, respectively (Supplementary Table S2). At the level of the 200 concatenated ribosomal proteins dataset, Margulisbacteria showed an almost 2-fold larger 201 within-group mean distance than Cyanobacteria. 202 We then compared the rates of evolution of CP43 and CP47 with those of Alpha and 203 Beta, using a Bayesian relaxed molecular clock approach with identical calibrations, 204 molecular clock parameters, and a simplified, but highly constrained sequence dataset (see 205 Materials and Methods for an expanded rationale). The goal of these experiment is not to use 206 the clock to estimate divergence times, but to measure and compare the rates of protein 207 evolution that are required to simulate any chosen span of time between the ancestral 208 duplications and the MRCA of Cyanobacteria. We used an autocorrelated log normal model 209 of rate variation with a non-parametric CAT+Γ model of amino acid substitutions to extract 210 rates of evolution. We will refer to the span of time between the duplication points leading to 211 Alpha and Beta (dAB), or to CP43 and CP47 (dCP), and the MRCA of Cyanobacteria as ΔT 212 (schematized in Figure 3). 213 In Figure 4a to d we examine the changes in the rate of evolution under specific 214 evolutionary scenarios. In the case of ATP synthase, we first assumed that the MRCA of 215 Cyanobacteria occurred after the GOE to simulate scenarios similar to those presented in [8] 216 or [11], at about 1.7 Ga; and that dAB occurred at 3.5 Ga (ΔT = 1.8 Ga). Under these 217 conditions the average rate of evolution of Alpha and Beta is calculated to be 0.28 ± 0.06 218 substitutions per site per Ga (δ Ga -1 ). We will refer to the average rate through the 219 Proterozoic as νmin. In this scenario, the rate of evolution at the point of duplication, which we 220 denote νmax, is 7.32 ± 1.00 δ Ga -1 , making νmax/νmin 26. In other words, when the span of time 221 between the ancestral pre-LUCA duplication and the MRCA of Cyanobacteria is 1.8 Ga, the 222 rate of evolution at the point of duplication is about 26 times greater than any rate observed 223 through the diversification of Cyanobacteria or photosynthetic eukaryotes. To place these 224 rates in the larger context of protein evolution, we encourage the reader to refer to 225 unchanged, νmax is 6.02 ± 0.9 δ Ga -1 resulting in a νmax/νmin of 21. If instead we keep the 245 duplication at 3.5 Ga but assume that the MRCA of Cyanobacteria occurred before the GOE 246 to simulate a more conservative scenario, at 2.6 Ga (ΔT = 1.1 Ga), we obtain that νmin is 247 consequently slower, 0.25 ± 0.06 δ Ga -1 , when compared to a post-GOE ancestor. This older 248 MRCA (smaller ΔT) thus leads to a rise in νmax, calculated to be 10.22 ± 1.37 δ Ga -1 and 249 leading to a νmax/νmin of 40. Given that the phylogenetic distance is a constant, the rate of 250 evolution increases with a decrease in ΔT following a power law function. We had observed 251 nearly identical evolutionary patterns for the core RC proteins D1 and D2 of PSII [18]. The 252 change in νmax/νmin as a function of ΔT is shown in Figure 4d. The core antenna of PSII, CP43 and CP47, showed patterns of divergence very similar to 274 those of Alpha and Beta, both in terms of phylogenetic distances between paralogues and 275 rates of evolution between orthologues (Figure 4a and b). The average rate of evolution of 276 CP43 and CP47, assuming that the MRCA of Cyanobacteria occurred at 1.7 Ga, and the 277 duplication (dCP) at 3.5 Ga (ΔT = 1.8 Ga), is 0.19 ± 0.05 δ Ga -1 . Slower than for Alpha and 278 Beta under the same condition. This slower rate is consistent with the fact that CP43 and 279 CP47 show less sequence divergence between orthologues at all taxonomic ranks of oxygenic 280 phototrophs when compared to Alpha and Beta (see Supplementary Table S3). Furthermore, 281 the rate at dCP, νmax, was 5.17 ± 0.84 δ Ga -1 , generating a νmax/νmin of 27, similar to Alpha 282 and Beta ( Figure 4). Thus, even when ΔT is 1.8 Ga, the rate at duplication point needs to be 283 27 times greater than the average rates observed during the Proterozoic. If we consider 284 instead that the MRCA of Cyanobacteria occurred at 2.6 Ga and dCP at 3.5 Ga (ΔT = 1.1 285 Ga), this would slowdown νmin to 0.16 ± 0.04 δ Ga -1 , while νmax would increase to 7.81 ± 1.01 286 δ Ga -1 resulting in a νmax/νmin of 49. Therefore, the molecular evolution of the core subunits of 287 PSII parallels that of ATP synthase both in terms of rates and distances through geological 288 time. 289 We then studied a relatively recent gene duplication event (Figure 4c), which 290 occurred long after the LUCA, but also after the MRCA of Cyanobacteria: that leading to 291 . This more recent duplication served as a 292 point of comparison and control (see Figure 3 for a scheme). In marked contrast to dAB, the 293 rate at the point of duplication was 0.66 ± 0.21 δ Ga -1 . We found that FtsH1 is evolving at an 294 average rate of 1.42 ± 0.29, while FtsH2 at a rate of 0.24 ± 0.06 δ Ga -1 under the assumption 295 that MRCA of Cyanobacteria occurred at 1.7 Ga. Thus, under the assessed conditions, FtsH1 296 is evolving about 5.3 times faster than FtsH2, while the latter is evolving at a rate similar to 297 that of Alpha and Beta. If the MRCA of Cyanobacteria is assumed to have occurred at 2.6 298 Ga, all rates slowdown respectively, but the rate of FtsH1 remains over five times faster than 299 FtsH2. Unlike dAB, dH0 is consistent with classical neofunctionalization, in which the copy 300 that gains new function experiences an acceleration of the rate of evolution [41,42]. Like 301 PSII and ATP synthase, the calculated rates of evolution match observed distances as 302 estimated by the change in the level of sequence identity as a function of time, in which the 303 fastest evolving FtsH1 accumulated greater sequence change than FtsH2 in the same period 304 (Supplementary Table S3). 305 Given that the complex evolution of CP43 and CBP involved several major 306 duplication events and potentially large variations in the rate of evolution ( Figure 1 and 307 Supplementary Figure S1), we carried out a molecular clock analysis of a large dataset of 392 308 CP43 and 465 CBP proteins using cross-calibrations across paralogues. Clocks were executed 309 with no constraint on the MRCA of Cyanobacteria. The estimated mean divergence time for 310 the oldest node, the duplication at the origin of CBP, is 2.23 Ga (95% confidence interval, CI: 311 1.90 -2.69 Ga) using an autocorrelated log normal model (see Figure 4e  average rate of evolution of CP43, not including CBP sequences, was found to be 0.14 ± 0.05 317 δ Ga -1 , which is in the same range as determined in the simplified, but highly constrained 318 experiment above. We noted a 6-fold increase in the rate of evolution associated with the 319 duplication leading to the farlip-CP43 variant (Figure 4e). This duplication led to an 320 acceleration of the rate similar in magnitude to that of FtsH1/FtsH2 and is consistent with a 321 neofunctionalization as the photosystems evolved to use far-red light and bind chlorophyll f. 322 CBP sequences, on average, display rates of evolution about three times faster than 323 CP43 ( Figure 4e). However, the serial duplications that led to the evolution of CP43-derived 324 light harvesting complexes resulted in accelerations in the rate of evolution of a similar 325 magnitude as observed for dAB and dCP. The largest of these is associated with the origin of 326 PcbC [30], a variant commonly found in heterocystous Cyanobacteria and Cyanobacteria that 327 use alternative pigments, such as chlorophyll b, d and f. The ancestral node of PcbC was 328 timed at 2.07 Ga (95% CI: 1.76 -2.50 Ga) with a rate of 11.7 ± 2.42 δ Ga -1 , decelerating 329 quickly, but stabilizing at about four times faster rates than the average rate of CP43. We find 330 it noteworthy that the fast rates of evolution associated with the origin of CBP are not 331 associated with very large spans of time between these and CP43, nor did it result in very old 332 root node ages despite the use of very broad constraints. 333 334

Species divergence 335
To understand the evolution of MSV relative to Cyanobacteria we wished to apply a 336 molecular clock to a system where the calculated rates could be compared to observed rates 337 as determined by distances between species of known divergence times or at similar 338 taxonomic ranks. We found RpoB to be suitable for this because it has been inherited 339 vertically with few instances of horizontal gene transfer and had enough signal to resolve 340 known phylogenetic relationships between and within clades. In collecting the RpoB 341 sequences, we noted for the first time that Margulisbacteria and Vampirovibrionia share a 342 comparatively greater level of divergence at similar taxonomic ranks than Cyanobacteria. For 343 example, the level of sequence divergence of RpoB from two species of Termititenax 344 (Margulisbacteria) [43] is about 40% greater than the distance between Gloeobacter spp. and 345 any other cyanobacterium (not including gaps and insertions), the latter being the largest 346 distance between oxygenic phototrophs. In the case of Gastranaerophilales 347 (Vampirovibrionia) [5], which are specialised gut bacteria and should therefore not be much 348 older than animals, the level of sequence identity of RpoB was found to be 70% for the two 349 most distant strains in this group, contrasted to 84% for Gloeobacter spp. when compared to 350 any other cyanobacterium. As listed in Supplementary Table S2, within-group mean 351 distances suggest that faster rates are widespread and not just unique to RpoB. 352 We implemented a set of 12 calibrations across bacteria, including two calibrations on 353 Margulisbacteria and two in Vampirovibrionia with the aim of covering both slower and 354 faster evolving lineages. The following results are based on an autocorrelated log normal 355 molecular clock using CAT+Γ, a root constrained with a broad interval ranging from between 356 4.52 and 3.41 Ga, and as described in Materials and Methods ( Figure 5). We found this to 357 perform well and provided results comparable to other independent studies that did not 358 combine a full set of MVS sequences and other clades with phototrophs in a single tree 359 (Table 1). Nonetheless, a pipeline of sensitivity experiments tested the dependency of these 360 results on models and prior assumptions: these are shown and described in Supplementary 361 Figure S7 and S8. 362 The root of the tree (divergence of Thermotogae) was timed at 3.64 Ga (95% CI: 3.42 363 -4.11 Ga) and the divergence of Cyanobacteria at 2.74 Ga (95% CI: 2.46 -3.12 Ga). Thus, 364 the span of time between the mean age of the root and the mean age of the MRCA of 365 Cyanobacteria was calculated to be 0.89 Ga. The span of time between Margulisbacteria and 366 Cyanobacteria was found to be 0.67 Ga; and between Vampirovibrionia and Cyanobacteria 367 0.44 Ga (Figure 4f and 5). The latter is a value that is consistent with previous studies using 368 entirely different rationales, datasets and calibrations [8,9]. 369 We also noted an exponential decrease in the rates of evolution of RpoB through the 370 Archean, which stabilised at current levels in the Proterozoic ( Figure 4f). The rate at the root 371 node was calculated to be 2.37 ± 0.45 δ Ga -1 and the average rate of evolution of RpoB 372 during the Proterozoic was found to be 0.19 ± 0.06 δ Ga -1 . The average rate of cyanobacterial 373 RpoB was 0.14 ± 0.04 δ Ga -1 ; for Margulisbacteria was 0.44 ± 0.17 δ Ga -1 , and for 374 Vampirovibrionia 0.19 ± 0.05 δ Ga -1 : about 3.1x and 1.3x the mean cyanobacterial rate 375 respectively. These rates agree reasonably well with the observed distances (Supplementary  376   Table S2), further indicating that the calibrations used in these clades performed well and 377 recapitulated patterns of evolution consistent with lifestyle and trophic modes. Nevertheless, 378 we suspect that the values for MSV represent underestimations of the true rates of evolution 379 (slower than they should be), as some of the clades that include symbionts still appear much 380 older than anticipated from their hosts (Supplementary Text S2). 381 Furthermore, a more complex, but commonly used model like CAT+GTR+Γ 382 implementing a birth-death prior with 'soft bounds' on the calibrations, resulted in rates that 383 were smoothed out, which translated into spread-out divergence times with Margulisbacteria 384 and Vampirovibrionia evolving at 1.9x and 0.7x times the cyanobacterial rates, respectively 385

394
Horizontal bars within the tree mark 95% confidence intervals. These are shown in selected nodes of interest for 395 clarity but see Table 1. To investigate the effect of the age of the root on the divergence time of MVS and 410

396
Cyanobacteria we also varied the root prior from 3.2 to 4.4 Ga (Supplementary Figure S7). 411 We noted that regardless of the time of origin of Bacteria (approximated by the divergence of 412 Thermotogae in our analysis), a substantially faster rate is required during the earliest 413 diversification events, decreasing through the Archaean and stabilizing in the Proterozoic. 414 This matches well the patterns of evolution of ATP synthase and PSII core subunits as shown 415 in the previous section. 416 We then compared the above RpoB molecular clock with a different clock that 417 included a set of 112 diverse sequences from Archaea, in addition to the sequences from 418 Thermotogae, MSV, and Cyanobacteria, but removing all other bacterial phyla (Figure 6a). 419 We found that the calculated average rate of evolution of bacterial RpoB during the 420 Proterozoic was slower (0.09 ± 0.03 δ Ga -1 ) than in the absence of archaeal sequences (0.19 ± 421 0.06 δ Ga -1 ), resulting in overall older mean ages (see Figure 4g). However, the rate at the 422 Bacteria/Archaea divergence point was 6.87 ± 1.17 δ Ga -1 , similar to the rate for dAB and 423 dCP, requiring therefore an exponential decrease in the rates similar to that observed for ATP 424 synthase and the PSII core subunits. Notably, this rate and exponential decrease is associated 425 with a span of time between the mean estimated ages of the LUCA and the MRCA of  younger ages compared to RpoB (see Table 1

Structural analysis 455
A fundamental premise of our investigation is that water oxidation started before the 456 duplication of D1 and D2, and CP43 and CP47. The rationale behind this premise has been 457 laid out before [17,48], and more extensively recently [18]. This rationale raises the question 458 of how the D2/CP47 side of the RC lost its capacity to carry out water-splitting catalysis. To 459 gain further insight on the nature of the structural site around the water-oxidising complex in 460 the ancestral photosystem, we used ancestral sequence reconstruction to predict the most 461 probable ancestral states. We will refer to the ancestral protein to D1 and D2 as D0 462 (Supplementary Figure S9). We generated 14 predicted D0 sequences using a combination of 463 three ASR methods and amino acid substitution models. On average the 14 D0 sequences had 464 87.12 ± 0.55% sequence identity indicating that the different algorithms provided largely 465 consistent results. While the regions that include all transmembrane helices are aligned 466 unambiguously, the N-terminal and C-terminal ends were aligned less confidently due to 467 greater sequence variability at both ends. Nonetheless, we found that the predicted D0 468 sequences retain more identity with D1 than D2 along the entire sequence. The level of 469 sequence identity of D0 compared to the D1 (PsbA1) of Thermosynechococcus vulcanus was 470 found to be 69.58 ± 0.55%, and 36.32 ± 0.15% compared to D2. The D1 ligands H332, E333, D342, and A344 located at the C-terminus.
3) The CP43 475 ligands, E354 and R357, located in the extrinsic loop between the 5 th and 6 th helices, with the 476 latter residue less than 4 Å from the Ca. Remarkably, there is structural and sequence 477 evidence supporting the loss of ligands in these three different regions of CP47/D2. 478 In all the D0 sequences, at position D1-170 and 189, located in the unambiguously 479 aligned region, the calculated most likely ancestral states were E170 and E189, respectively. 480 The mutation D170E results in a PSII phenotype with activity similar to that of the wild-type 481

500
The CP43 subunit of PSII with the extrinsic loop shown in colours. c The CP47 subunit of PSII. Immediately 501 after the 5 th helix (E), a long alpha helix protrudes outside the membrane in both CP43 and CP47 and showing 502 structural and sequence identity (orange). We denote this helix EF 2 . After EF 2 structural differences are noticed 503 between CP43 and CP47 as schematised in panel d. In CP43, after helix EF 2 a loop is found (shown in red 504 ribbons), which we denote EF 3 . This contains the residues that bind the Mn 4 CaO 5 cluster and it is followed by a 505 domain that resembles EF 1 in the HbRC at a structural level. In CP47, EF 3 and EF 1 retain sequence identity with 506 the respective regions in CP43. CP47 has additional sequence that is not found in CP43 (EF 4 , purple). The green 507 arrows mark the position at which the domain swap occurred in CP43 relative to CP47. We found that the 508 CP43-E354 and R357 are found in the equivalent domain in CP47 as E436 and N438 coordinating a Ca atom. 509 N438 (EF 3 ) links to EF 1 via K332. It is unclear if the EF 1 region in the HbRC is strictly homologous to that in 510 CP43 and CP47 as very little sequence identity is found between the two: however, a couple of conserved 511 residues between all EF 1 may suggest it emerged from structural domains present in the ancestral RC protein 512 (see Supplementary Figure S13 We then noted that in the swapped region (EF3 in Figure 7d), sequence identity is 526 retained between CP43 and CP47 (Supplementary Figure S12). We found that CP43-E354 527 and R357 are equivalent to CP47-E435 and N438. An inspection of the crystal structure of 528 cyanobacterial PSII showed that these two residues specifically bind a Ca of unknown 529 function in (Figure 7c and d). The presence of an equivalent glutamate to CP43-E354 in 530 CP47 is consistent with this being already present before duplication. 531 Finally, a peculiar but well-known trait conserved across Cyanobacteria and 532 photosynthetic eukaryotes is that the 5' end of the psbC gene (CP43) overlaps with the 3' end 533 of the psbD gene (D2) usually over 16 bp (Supplementary Table S6 In consequence, to argue that oxygenic photosynthesis is a late evolutionary 576 innovation relative to the origin of life, one must first demonstrate that the rates of protein 577 evolution of PSII and ATP synthase catalytic core subunits, RNA polymerase, and the 578 ribosome, have not remained proportional to each other through geological time. In such a 579 way that PSII-exclusively-experienced unprecedentedly faster rates of evolution. One 580 could potentially argue that the origin of water oxidation itself could account for this 581 hypothetical unprecedentedly faster rates. However, because of the relationships between 582 speed, distance, and time, a decrease in ΔT of about 60% (from 1.1 to 0.46 Ga for example), 583 would result in a 30-fold additional increase in the rates at the earliest stages of 584 diversification, resulting in photosystems that would be evolving at rates that surpass those of 585 the fastest evolving peptide toxins (Figure 4d and Supplementary Text S1). It should be 586 noted, that a ΔT of over a billion years already accounts for a period of fast evolution at the 587 origin of all these ancient systems. 588 The reader should also be reminded that these patterns of evolution, though they 589 might appear somewhat unusual, do not emerge from the application of any particular 590 molecular clock approach or computational analysis, but from the inherently long 591 phylogenetic distance that separates Alpha and Beta, Archaea and Bacteria, or the core 592 subunits of PSII, together with relatively slow rates of evolution throughout their entire 593 multibillion-year diversification process. 594 Archaea diversified rapidly after the LUCA and hypothesized that the long distance between 604 archaeal and bacterial ribosomal proteins could be attributed to fast rates of ribosome 605 evolution during their divergence [64], but see also [65]. Our analyses suggest that the more 606 explosive the diversification of prokaryotes, the greater the chance that photosynthetic water-607 splitting is an ancestral trait of all life. And yet, if phototrophic communities-whether 608 oxygenic or not-already existed 3.2 to 3.4 Ga ago, as it is supported by the geochemical 609 record [47,66], then the earliest bacteria were likely photosynthetic. That the earliest bacteria 610 were photosynthetic is entirely consistent with the evolution of RC proteins, which indicates 611 that the structural and functional specialization that led to the two photosystem types, 612 antedated the diversification processes leading to the known phyla containing photosynthetic 613 bacteria [34]. This is also consistent with the fact that none of the groups of extant 614 photosynthetic bacteria appear to be older than the earliest geochemical evidence for 615 photosynthesis. 616 The presented data is also consistent with recent studies of the oxygenation of the 617 planet, which suggests that even if photosynthetic O2 evolution started as early as the oldest 618 rocks, the properties of early Earth biogeochemistry would have maintained very low 619 concentrations of O2 over the Archean without the need to invoke any particular biological 620 innovation or trigger to coincide with the GOE [67-69]. 621 622

Structural constraints 623
We showed that the phylogenetic distance between CP43/CP47 and other type I RC proteins 624 is the second largest distance after that between type I and type II (Supplementary Figure S2). acceptors between A and F X . c A monomeric unit of PSII (CP43/D1 and CP47/D2 in the inset). It has a 654 structural organization similar to that of type I RCs. However, the monomeric unit is split into two proteins after 655 the 6 th transmembrane helix. The Mn 4 CaO 5 cluster is coordinated by D1, but is directly connected to the antenna 656 domain via E354 and R357 in manner that resembles the HbRC. d A monomeric unit of an anoxygenic type II 657 reaction (PbRC, Proteobacteria). Unlike type I RCs, type II RCs lack F X . Instead, a non-heme Fe 2+  CP43/D1 and CP47/D2 suggest a mechanism for heterodimerization and loss of catalysis on 665 one side that accounts for all ligands. These include direct amino acid substitutions, the 666 domain swap within the extrinsic region of the ancestral protein to CP47, and the gene 667 overlap between psbC (CP43) and psbD (D2). It would be difficult to reconcile these unique 668 structural and genetic features with a scenario in which PSII evolved water oxidation at a late 669 stage, or starting as a purple anoxygenic type II RC, once the heterodimerization process was 670 well underway or completed, or in the absence of core antenna domains, given that these 671 interact directly with the electron donor side of PSII, in a manner strikingly similar to that of 672 the homodimeric type I RC of the Firmicutes [19]. What is more, these structural and 673 functional features also suggest that the atypical forms of D1, lacking ligands to the Mn4CaO5 674 cluster, such as the so-called chlorophyll f synthase [32,71], or the so-called "rogue" or 675 "sentinel" D1 [31,72], diverged from forms of D1 that were able to support water oxidation. 676 Even if they originated from early duplications that could have antedated the MRCA of 677 Cyanobacteria [2] and as additionally supported by ASR analysis (Supplementary Figure S9  678 and S10). 679 We have calculated that (oxygenic) PSII has experienced the slowest rates of 680 evolution between type II RCs, with the core of the anoxygenic type II RC of Proteobacteria 681 and Chloroflexi evolving approximately five times faster than the core of PSII [18]. That 682 considerably faster rate has led to conspicuous structural changes of the anoxygenic RC 683 relative to PSII and type I photosystems. The consequence of these differences in the rates are 684 visually apparent (Figure 8 and Supplementary Figure S3) as the anoxygenic RC has 685 experienced greater sequence and structural change than PSII. That is, PSII retains a greater 686 number of ancestral traits that can be traced to before the ancestral core duplications. For 687 example, like type I, PSII retains: 1) substantially greater structural symmetry at the core, 2) 688 the antenna functionally linked to the core and the electron donor side, 3) lack of histidine 689 ligands to the monomeric photochemical chlorophylls (marked M in Figure 8), or 4) the use 690 of a chlorophyll a derived pigment as the primary electron acceptor. This list is not meant to 691 be exhaustive, but see refs. [19,73] for additional detail. It follows then, that because these 692 conserved traits can be traced to the common ancestor of type I and type II, then the rates of 693 evolution of PSII should have remained slow and constrained relative to that of other 694 photosystems, since before the core duplications. These structural constraints demonstrate 695 that the core subunits of PSII did not experience unprecedentedly faster rates of evolution. In 696 fact, and not surprisingly, the anoxygenic type II RC which experienced the highest rates of 697 evolution, has accumulated greater change and greater loss of ancestral traits. 698 We think it is plausible that there was never a discrete origin of photosynthesis, but 707 that the process may trace back to abiogenic photochemical reactions, some of which may 708 have resulted in the oxidation of water, and at the interface of nascent membranes, membrane 709 proteins, photoactive tetrapyrroles and other inorganic cofactors: much in the same way that 710 ribosomes may have originated at the interface of nascent genetics and protein synthesis [74]. Methanocaldococcus jannaschii was used as a BLAST query. A total 213 sequences across 752 the entire diversity of Archaea were obtained. Two version of these were observed, those 753 with split B (RpoB1 and RpoB2) and those with full-length B subunits: from the latter, 112 754 full-length B subunits were selected for molecular clock analysis. 755 All sequences were aligned with Clustal Omega using 5 combined guided trees and 756 Hidden Markov Model Iterations [83]. Given that RpoB sequences are known to contain 757 many clade specific indels [27], we further processed this particular dataset using Gblocks 758 [84] to remove these indels and poorly  To compare the level of sequence identity between RC proteins, two datasets of 10 798 random amino acid sequences were generated using the Sequence Manipulation Suit [101]. 799 The datasets contained sequences of 350 and 750 residues. These were independently aligned 800 as described above, resulting in 45 pairwise sequence identity comparisons for each dataset. 801 These random sequence datasets were used as a rough minimum threshold of identity. 802 Alignments of RC proteins were generated using three representative sequences spanning 803 known diversity. Cyanobacterial CP43, CP47, standard D1 and D2 sequences were from 804

Quantification of rates of evolution (rationale) 813
Molecular clocks are conventionally used to estimate divergence times. In general terms, 814 given: 1) a tree topology, which sets the relationship between taxa; 2) a sequence alignment, 815 which sets the phylogenetic distance between taxa; and 3) some known events (calibrations), 816 which set the rates of evolution, the molecular clock can then estimate divergence times. This 817 means that if the tree topology and divergence times for two sets of protein sequences are the 818 same, any differences in phylogenetic distances between these two should only reflect 819 differences in the rate of evolution. Thus, assuming that CP43/CP47 and Alpha/Beta have 820 mainly been inherited vertically in Cyanobacteria and photosynthetic eukaryotes, any 821 difference in phylogenetic distance between the two is the result of differences in the rates of 822 evolution. For example, the level of sequence identity between CP43 in Cyanidioschizon and 823 Arabidopsis is 78%, and the level of sequence identity between Alpha in the same species is 824 69%. Given that these plastid-encoded subunits have mostly been inherited vertically since 825 the MRCA of Archaeplastida, then one can argue that Alpha is evolving somewhat faster 826 than CP43. This is because faster rates of protein evolution should lead to faster rates of 827 change resulting in a faster decrease in the level of sequence identity (increase in 828 phylogenetic distance). Now, because CP43 and CP47 are paralogues, we can then use this 829 approach to estimate the rates of evolution at the moment of duplication under a given  Node 18 denotes the MRCA of Cyanobacteria. The age for this node is highly debated 872 ranging from before to after the GOE. Node 19 represents the duplication events leading to 873 CP43 and CP47, and to Alpha and Beta. To calculate the rates of evolution under different 874 scenarios, node 18 and node 19 were varied. Firstly, a molecular clock was run using a 875 scenario that assumed that the MRCA of Cyanobacteria postdated the GOE. To do this, node 876 18 was set to be between 1.6 and 1.8 Ga, which emulates results reported in recent studies [8, 877 11]. This was compared to a scenario that assumed that the MRCA of Cyanobacteria 878 antedated the GOE, and thus node 18 was set to be between 2.6 and 2.8 Ga, which simulates 879 other evolutionary scenarios [103,114]. In both cases, the duplication event (node 19) was 880 set to be 3.5 Ga old, or changed as stated in the main text, by assigning a gamma prior at the 881 desired time fixed with a narrow standard deviation of 0.05 Ga. In a separate experiment, the 882 age of the duplication was varied while maintaining node 18 restricted to between 1.6 and 1.8 883 Ga, while node 19, the root, was set with a gamma prior with an average varied from 0.8 to 884 4.2 and with a narrow standard deviation of 0.05 Ga. 885 The period of time between the duplication event (node 19), which led to the 886 divergence of CP43 and CP47, and the MRCA of Cyanobacteria (node 18), we define as ΔT. 887 ΔT is calculated as the subtraction of the mean age of node 19 and node 18. For PSII, we 888 used node 18 from the CP43 subunit and for ATP synthase we used node 18 from the Alpha 889 subunit. In consequence, varying the age of the duplication from 0.8 to 4.2 Ga allows changes 890 in the rate of evolution to be simulated with varying ΔT, ranging from 0.2 to 2.5 Ga. That is 891 to say, it allows the rate of evolution at the point of duplication to be calculated if it occurred 892 at any point in time before the MRCA of Cyanobacteria. 893 894 Molecular clocks were calculated with Phylobayes 3.3f using a log normal autocorrelated 896 molecular clock model under the CAT+Γ non-parametric model of amino acid substitution 897 and a uniform distribution of equilibrium frequencies. We preferred a CAT model instead of 898 a CAT+GTR as the latter only outperforms the former on sequence alignments of over 1000 899 sites and it is also much less computationally expensive. Four discrete categories for the 900 gamma distribution were used and four chains were executed in parallel through all 901 experiments carried out in this study. The instant rates of evolution, which are the rates at 902 each internal node in the tree, were retrieved from the output files of Phylobayes. These rates 903 are calculated by the software as described by the developers elsewhere [115,116] Figure S14b was used as 915 template for the calculation of the rate of evolution and was based on the topology presented 916 by Shao,et al. [40] It was concluded that the Cyanobacteria-inherited closest paralog to 917 FtsH1 and FtsH2 in photosynthetic eukaryotes was also acquired before their initial 918 duplication. Therefore, from all FtsH paralogs in photosynthetic eukaryote genomes, those 919 with greater sequence identity to cyanobacterial FtsH1/2 were used. Because this duplication 920 is specific to Cyanobacteria, a few additional strains were included in this tree following 921 well-established topologies [103,105]. Calibrations were placed as indicated in 922 Supplementary Figure S14b. To test the change in the rate of evolution at the time of 923 duplication in comparison with CP43/CP47, node 19 was set to 1.6-1.8 Ga or 2.6-2.8 Ga and 924 molecular clocks were run as described in the preceding paragraph. 925 Finally, we conducted a large molecular clock using the combined 897 CP43 and CBP 926 sequences, including 40 eukaryotic CP43 sequences, to test whether using a more complex 927 phylogeny would result in rates of evolution substantially different to those calculated with 928 the method described above. Calibrations were assigned as illustrated in Supplementary 929 Figure S15. Cross-calibrations were used across paralogs constraining the origin of 930 heterocystous Cyanobacteria. In this case, only the minimum constraint of 0.72 Ga was used 931 with no maximum constraint to allow greater flexibility. Additional calibrations were 932 assigned also across paralogs (point 20 in Supplementary Figure S15), this was considered as 933 the node made by Richelia intracellularis and its closest sister sequence, as implemented in 934 ref.
[114] This strain is a specific endosymbiont of a diatom and its divergence was set to be 935 no older than the earliest discussed age for diatoms [118]. The root equivalent to the MRCA 936 of Cyanobacteria (divergence of Gloeobacter in CP43) was not calibrated. The root of the 937 tree was varied: first it was given a maximum age of 4.52 Ga as recently implemented and 938 justified by Betts,et al. [11] as the earliest plausible time in which the planet was inhabitable 939 after the moon forming impact [119], and no minimum age was used. A second tree was 940 executed with no constraint on the root and no root prior. A third root was implemented 941 constrained to be between 2.3 Ga (the GOE) and 3.2 Ga. The latter date represents the age of 942 the cyanobacteria-like well-preserved microbial mats of the Berberton Greenstone Belt in 943 South Africa and neighbouring Eswatini [66]. Rates were obtained using the autocorrelated 944 CAT+Γ model as described above. Because these root constraints did not have a strong effect 945 in the overall estimated rates, we carried out an additional control applying an uncorrelated 946 gamma clock model [120] with a root constrained at 4.52 Ga and no minimum age. 947 948

Molecular clock of RpoB and concatenated ribosomal proteins 949
The primary objective of this experiment was not to determine the absolute time of origin of 950 Cyanobacteria, but to understand the spans of time between Cyanobacteria and their relatives. 951 We also wanted to understand what rates of evolution are associated with those spans of time 952 and how these change under different evolutionary scenarios. To do this, we applied a 953 molecular clock to the phylogeny of RpoB sequences described above. We implemented 12 954 calibrations. The calibrations were assigned on the phylogeny as shown in Supplementary 955 Figure S16 and listed in Table 2. A set of calibrations consisted of the earliest unambiguous 956 evidence for Chroococcales Cyanobacteria of the Belcher group (point 21), the age of which 957 has been recently revisited to 2.01 Ga [121]. This was assigned to the younger node from 958 where Chroococcales strains branch out in the tree, with no maximum restrictions. The 959 appearance of heterocystous Cyanobacteria were restricted from 0.72 Ga and 1.56 Ga as 960 described above. No constraints on the node representing the MRCA of Cyanobacteria were 961 used. However, for rigor, we also tested an alternative single calibration on the node 962 representing the MRCA of Cyanobacteria with a maximum of 2.01 Ga and no minimum, and 963 with no other calibrations in the clade. This considered a scenario in which crown group 964 Cyanobacteria are younger than the Belcher fossils. 965 In addition to cyanobacterial calibrations, we also applied the often-used biomarker 966 evidence for phototrophic Chlorobi and Chromatiaceae at 1.64 Ga [122], see for example 967 refs. [9, 63] These were used as a minimum with no maximum constraints (node 30 and 31 968 respectively in Supplementary Figure S16). 969 A set of calibrations were chosen from well-described symbiotic relationships. Two of 970 these are known in the phylum Margulisbacteria. The first one is Termititenax, these are 971 specific ectosymbionts of spirochetes that live within oxymonad protists in the gut of diverse 972 termites and cockroaches [43]. The two Termititenax sequences used in this analysis 973 clustered together and therefore we calibrated this node to be between 396 Ma, some of the 974 oldest fossil evidence of insects [123]  reported to form close associations with eukaryotes. The clade Gastranaerophilales is thought 987 to be composed mostly of strains that inhabit the animal gut. Thus, we calibrated the node 988 separating two strains isolated from human and koala faeces that clustered together with a 989 maximum age of 201 Ma, representing the Jurassic split of marsupials and placental 990 mammals [127] (node 24a). However, the koala and human sequences were embedded in the 991 Gastranaerophilales clade within other sequences from the human gut. Because of this, we 992 trialled changing this calibration to 55 Ma instead, the oldest primate fossil [128] and 993 assuming that the retrieved sequences from the human gut had a common ancestor younger 994 than the MRCA of primates. Alternatively, we tested moving this calibration to the ancestral 995 nodes of the clade that included all the human gut sequences (node 24b). Gastranaerophilales 996 is closely related to the order Vampirovibrionales, which include Vampirovibrio 997 chlorellavorus. This strain is a predator of the eukaryotic green algae Chlorella [129], and 998 therefore we trialled a calibration assuming that Gastranaerophilales and Vampirovibrionales 999