Origin of photosynthetic water oxidation at the dawn of life

Oxygenic photosynthesis starts with the oxidation of water to O2. Cyanobacteria are the only known prokaryotes capable of oxygenic photosynthesis, and it is therefore assumed that water oxidation is a late innovation relative to the origin of life. However, when exactly oxygenic photosynthesis originated remains hotly debated. Here we show that the origin of photosystem II, the water-splitting enzyme, occurred at an early stage during the evolution of life and long before the origin of Cyanobacteria. We use relaxed molecular clocks, ancestral sequence reconstruction, and comparative structural biology to demonstrate that photosystem II exhibits patterns of evolution through geological time that are indistinguishable from those of ATP synthase, RNA polymerase, or the ribosome, some of the oldest known enzymes. Our work suggests that water oxidation originated during the establishment of bioenergetics, reaching farther into the past than can be documented based on species trees alone.


Introduction
The origin of oxygenic photosynthesis is considered a turning point in the history of life, marking the transition from the ancient world of anaerobes into a productive aerobic world that permitted the emergence of complex life [1]. Oxygenic photosynthesis starts with photosystem II (PSII), the water-oxidising and O2-evolving enzyme of Cyanobacteria and photosynthetic eukaryotes. Today PSII is a highly conserved, multicomponent, membrane protein complex, which was inherited by the most recent common ancestor (MRCA) of Cyanobacteria in a form that is structurally and functionally quite similar to that found in all [...]

This would imply that Margulisbacteria, Sericytochromatia, and Vampirovibrionia (MSV) could have been ancestrally more like Cyanobacteria than is apparent from their specialized heterotrophic lifestyles. Hence, to time the origin of oxygenic photosynthesis one should first resolve when PSII evolved water-oxidation photochemistry relative to the MRCA of Cyanobacteria.

The heart of PSII is made up of a heterodimeric reaction centre (RC) core coupled to a core antenna. The two subunits of the RC core of PSII are known as D1 and D2, and these are associated respectively with the antenna subunits known as CP43 and CP47. D1 and CP43 make up one monomeric half of the RC and D2 and CP47 the other half. Water oxidation is catalysed by a Mn4CaO5 cluster coordinated by ligands from both D1 and CP43 [12,13]. The cluster is functionally coupled to a redox active tyrosine-histidine pair (YZ-H190) also located in D1, which relays electrons from Mn to the oxidised chlorophyll pigments of the RC during charge separation [14]. In a cycle of four consecutive light-driven charge separation events, O2 is released in the decomposition of two water molecules.
Photosystems evolved first as homodimers [15,16]: therefore, the core and the antenna of PSII originated from ancestral gene duplication events that antedated the MRCA of Cyanobacteria. In this way CP43/D1 retain sequence and structural identity with CP47/D2. The conserved structural and functional traits between CP43/D1 and CP47/D2 suggest that the ancestral PSII homodimer, prior to the duplication events, was not only structurally similar to [...]

[...] of the distance between Archaea and Bacteria. However, the distance between CP43 and CP47 (but also between D1 and D2 [11]) is of similar magnitude to that between Alpha and Beta, and to that between archaeal and bacterial RpoB, but substantially surpasses the distance between MSV and Cyanobacteria (Figure 2). These observations suggest that ancestral proteins to CP43/CP47 and D1/D2 existed before the divergences of MSV.

We compared the within-group mean distances for Alpha, Beta, RpoB, and a concatenated dataset of ribosomal proteins compiled in a previous independent study [36] (see Supplementary Table S2). We consistently found that Vampirovibrionia and Margulisbacteria have larger within-group mean distances than Cyanobacteria, which suggests greater rates of evolution in the non-photosynthetic clades. These were consistently larger for Margulisbacteria relative to the other two groups. For example, RpoB in Vampirovibrionia and Margulisbacteria showed 1.6 and 4.0 times larger corrected mean distances than Cyanobacteria, respectively (Supplementary Table S2). At the level of the concatenated ribosomal proteins dataset, Margulisbacteria showed an almost 2-fold larger within-group mean distance than Cyanobacteria.
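As an illustration of the comparison just described, the following minimal sketch computes a within-group mean distance as the plain average of all pairwise distances among members of one clade. The distance matrix, the clade labels, and the function name `within_group_mean_distance` are hypothetical and serve only to show the arithmetic; the values are not taken from Supplementary Table S2, and the correction applied to the reported distances is not reproduced here.

```python
from itertools import combinations


def within_group_mean_distance(dist, labels, group):
    """Average of all pairwise distances among sequences labelled `group`."""
    members = [i for i, lab in enumerate(labels) if lab == group]
    pairs = list(combinations(members, 2))
    if not pairs:
        raise ValueError(f"group {group!r} needs at least two sequences")
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


# Hypothetical symmetric distance matrix for six sequences (toy values only).
labels = ["Cyano", "Cyano", "Cyano", "Marg", "Marg", "Marg"]
dist = [
    [0.00, 0.10, 0.12, 0.60, 0.62, 0.65],
    [0.10, 0.00, 0.14, 0.61, 0.63, 0.66],
    [0.12, 0.14, 0.00, 0.59, 0.64, 0.67],
    [0.60, 0.61, 0.59, 0.00, 0.40, 0.44],
    [0.62, 0.63, 0.64, 0.40, 0.00, 0.48],
    [0.65, 0.66, 0.67, 0.44, 0.48, 0.00],
]

cyano_mean = within_group_mean_distance(dist, labels, "Cyano")
marg_mean = within_group_mean_distance(dist, labels, "Marg")
fold_difference = marg_mean / cyano_mean

print(f"Cyano: {cyano_mean:.2f}  Marg: {marg_mean:.2f}  fold: {fold_difference:.1f}")
```

A larger within-group mean at a comparable taxonomic rank is what the text interprets as a signature of faster evolution in that clade.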
We then compared the rates of evolution of CP43 and CP47 with those of Alpha and Beta using a Bayesian relaxed molecular clock approach with identical calibrations, molecular clock parameters, and a simplified, but highly constrained sequence dataset (see Materials and Methods for an expanded rationale). We used an autocorrelated log normal model of rate variation with a non-parametric CAT+Γ model of amino acid substitutions to extract rates of evolution. We will refer to the span of time between the duplication points leading to Alpha and Beta (dAB), or to CP43 and CP47 (dCP), and the MRCA of Cyanobacteria as ΔT (schematized in Figure 3).

In Figure 4a to d we examine the changes in the rate of evolution under specific evolutionary scenarios. In the case of ATP synthase, we first assumed that the MRCA of Cyanobacteria occurred after the GOE, at about 1.7 Ga, and that dAB occurred at 3.5 Ga (ΔT = 1.8 Ga). Under these conditions the average rate of evolution of Alpha and Beta is calculated to be 0.28 ± 0.06 substitutions per site per Ga (δ Ga⁻¹). We will refer to the average rate through the Proterozoic as νmin. In this scenario, the rate of evolution at the point of duplication, which we denote νmax, is 7.32 ± 1.00 δ Ga⁻¹, making νmax/νmin 26. In other words, when the span of time between the ancestral pre-LUCA duplication and the MRCA of Cyanobacteria is 1.8 Ga, the rate of evolution at the point of duplication is about 26 times greater than any rate observed through the diversification of Cyanobacteria or photosynthetic eukaryotes.

Now, if we consider a scenario in which dAB is 4.0 Ga, leaving all other constraints unchanged, νmax is 6.02 ± 0.9 δ Ga⁻¹, resulting in a νmax/νmin of 21.
If instead we keep the duplication at 3.5 Ga but assume that the MRCA of Cyanobacteria occurred before the GOE at 2.6 Ga (ΔT = 1.1 Ga), we obtain that νmin is consequently slower, 0.25 ± 0.06 δ Ga⁻¹, when compared to a post-GOE ancestor. This older MRCA (smaller ΔT) thus leads to a rise in νmax, calculated to be 10.22 ± 1.37 δ Ga⁻¹ and leading to a νmax/νmin of 40. Given that the phylogenetic distance is a constant, the rate of evolution increases with a decrease in ΔT following a power law function. The change in νmax/νmin as a function of ΔT is shown in Figure 4d. We had observed nearly identical evolutionary patterns for the core RC proteins D1 and D2 of PSII [11].

The core antenna of PSII, CP43 and CP47, showed patterns of divergence very similar to those of Alpha and Beta, both in terms of phylogenetic distances between paralogues and rates of evolution between orthologues (Figure 4a and b). The average rate of evolution of CP43 and CP47, assuming that the MRCA of Cyanobacteria occurred at 1.7 Ga and the duplication (dCP) at 3.5 Ga (ΔT = 1.8 Ga), is 0.19 ± 0.05 δ Ga⁻¹, slightly slower than for Alpha and Beta under the same condition. This slower rate is consistent with the fact that CP43 and CP47 show less sequence divergence between orthologues at all taxonomic ranks of oxygenic phototrophs when compared to Alpha and Beta (see Supplementary Table S3). Furthermore, the rate at dCP, νmax, was 5.17 ± 0.84 δ Ga⁻¹, generating a νmax/νmin of 27, similar to Alpha and Beta (Figure 4). Thus, even when ΔT is 1.8 Ga, the rate at the duplication point needs to be 27 times greater than the average rates observed during the Proterozoic. If we consider instead that the MRCA of Cyanobacteria occurred at 2.6 Ga and dCP at 3.5 Ga (ΔT = 1.1 Ga), this would slow down νmin to 0.16 ± 0.04 δ Ga⁻¹, while νmax would increase to 7.81 ± 1.01 δ Ga⁻¹, resulting in a νmax/νmin of 49.
Therefore, the molecular evolution of the core subunits of PSII parallels that of ATP synthase both in terms of rates and distances through geological time.

We then studied a relatively recent gene duplication event (Figure 4c), which occurred long after the LUCA, but also after the MRCA of Cyanobacteria: that leading to Cyanobacteria-specific FtsH1 and FtsH2 (dH0) [38]. This more recent duplication served as a point of comparison and control (see Figure 3 for a scheme). In marked contrast to dAB, the rate at the point of duplication was 0.66 ± 0.21 δ Ga⁻¹. We found that FtsH1 is evolving at an average rate of 1.42 ± 0.29 δ Ga⁻¹, while FtsH2 is evolving at a rate of 0.24 ± 0.06 δ Ga⁻¹, under the assumption that the MRCA of Cyanobacteria occurred at 1.7 Ga. Thus, under the assessed conditions, FtsH1 is evolving about 5.3 times faster than FtsH2, while the latter is evolving at a rate similar to that of Alpha and Beta. If the MRCA of Cyanobacteria is assumed to have occurred at 2.6 Ga, all rates slow down accordingly, but the rate of FtsH1 remains over five times faster than that of FtsH2. Unlike dAB, dH0 is consistent with classical neofunctionalization, in which the copy that gains a new function experiences an acceleration of the rate of evolution [39,40]. As for PSII and ATP synthase, the calculated rates of evolution match observed distances as estimated by the change in the level of sequence identity as a function of time, in which the fastest evolving FtsH1 accumulated greater sequence change than FtsH2 in the same period (Supplementary Table S3).

Given that the complex evolution of CP43 and CBP involved several major duplication events and potentially large variations in the rate of evolution (Figure 1 [...] Table S4 for a comparison of estimated ages under different models).
The mean divergence time for the node representing the CP43 inherited by the MRCA of Cyanobacteria was calculated to be 2.22 Ga (95% CI: 1.88-2.68 Ga). Thus, a span of time of only 15 Ma is seen between these two mean ages. The average rate of evolution of CP43, not including CBP sequences, was found to be 0.14 ± 0.05 δ Ga⁻¹, which is in the same range as determined in the simplified, but highly constrained experiment above. We noted a 6-fold increase in the rate of evolution associated with the duplication leading to the farlip-CP43 variant (Figure 4e). This duplication led to an acceleration of the rate similar in magnitude to that of FtsH1/FtsH2 and is consistent with a neofunctionalization process as the photosystems evolved to use far-red light to drive water-splitting.

CBP sequences, on average, display rates of evolution about three times faster than CP43 (Figure 4e). However, the serial duplications that led to the evolution of CP43-derived light harvesting complexes resulted in accelerations in the rate of evolution of a similar magnitude as observed for dAB and dCP. The largest of these is associated with the origin of PcbC [28], a variant commonly found in heterocystous Cyanobacteria and Cyanobacteria that use alternative pigments, such as chlorophyll b, d and f. The ancestral node of PcbC was timed at 2.07 Ga (95% CI: 1.76-2.50 Ga) with a rate of 11.7 ± 2.42 δ Ga⁻¹, decelerating quickly, but stabilizing at rates about four times faster than the average rate of CP43. We find it noteworthy that the fast rates of evolution associated with the origin of CBP are not associated with very large spans of time between these and CP43, nor did they result in very old root node ages despite the use of very broad constraints.
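The νmax/νmin ratios quoted in this section follow from the reported rates by simple division, and ΔT is a difference of node ages. The short sketch below recomputes them for the four scenarios in which both rates are stated in the text. The names `SCENARIOS`, `rate_ratio` and `delta_t` are illustrative only. Because the inputs are the rounded values printed above, the recomputed ratios can differ from the reported ones by a fraction of a unit (for example, about 40.9 against a reported 40).

```python
def rate_ratio(nu_max, nu_min):
    """Ratio of the rate at the duplication point to the average Proterozoic rate."""
    return nu_max / nu_min


def delta_t(duplication_age_ga, mrca_age_ga):
    """Span of time (Ga) between a duplication and the MRCA of Cyanobacteria."""
    return duplication_age_ga - mrca_age_ga


# (nu_max, nu_min, ratio as reported in the text); rates in substitutions per site per Ga.
SCENARIOS = {
    "Alpha/Beta, MRCA at 1.7 Ga": (7.32, 0.28, 26),
    "Alpha/Beta, MRCA at 2.6 Ga": (10.22, 0.25, 40),
    "CP43/CP47, MRCA at 1.7 Ga": (5.17, 0.19, 27),
    "CP43/CP47, MRCA at 2.6 Ga": (7.81, 0.16, 49),
}

RATIOS = {name: rate_ratio(nu_max, nu_min)
          for name, (nu_max, nu_min, _) in SCENARIOS.items()}

for name, ratio in RATIOS.items():
    reported = SCENARIOS[name][2]
    print(f"{name}: recomputed {ratio:.1f}, reported {reported}")

# Post-GOE scenario: duplication at 3.5 Ga, MRCA of Cyanobacteria at 1.7 Ga.
print(f"Delta T = {delta_t(3.5, 1.7):.1f} Ga")
```

Laying the scenarios out side by side makes the trend described in the text easy to see: an older MRCA lowers νmin and raises νmax, so the ratio grows as ΔT shrinks.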

Species divergence
To understand the evolution of MSV relative to Cyanobacteria we wished to apply a molecular clock to a system where the calculated rates could be compared to observed rates as determined by distances between species of known divergence times or at similar taxonomic ranks. We found RpoB to be suitable for this because it has been inherited vertically with few instances of horizontal gene transfer and had enough signal to resolve known phylogenetic relationships between and within clades. In collecting the RpoB sequences, we noted for the first time that Margulisbacteria and Vampirovibrionia share a comparatively greater level of divergence at similar taxonomic ranks than Cyanobacteria. For example, the level of sequence divergence of RpoB from two species of Termititenax (Margulisbacteria) [41] is about 40% greater than the distance between Gloeobacter spp. and any other cyanobacterium, the latter being the largest distance between oxygenic phototrophs. In the case of Gastranaerophilales (Vampirovibrionia) [5], which are specialised gut bacteria and should therefore not be much older than animals, the level of sequence identity of RpoB was found to be 70% for the two most distant strains in this group, contrasted to 84% for Gloeobacter spp. when compared to any other cyanobacterium. As listed in Supplementary Table S2, within-group mean distances suggest that faster rates are widespread and not unique to RpoB.

We implemented a set of 12 calibrations across bacteria, including two calibrations on Margulisbacteria and two in Vampirovibrionia with the aim of covering both slower and faster evolving lineages. The following results are based on an autocorrelated log normal molecular clock using CAT+Γ and a root with a broad interval ranging between 4.52 and 3.41 Ga, as described in Materials and Methods (Figure 5).
We found this to perform well and better than other more complex models (e.g. CAT+GTR+Γ with birth-death priors and soft bounds on calibrations). In addition, it provided results comparable to other independent studies that did not combine a full set of MSV sequences and other clades with phototrophs in a single tree (Table 1). Nonetheless, a pipeline of sensitivity experiments tested the dependency of these results on models and prior assumptions: these are shown and described in Supplementary Figures S7 and S8.

The root of the tree (divergence of Thermotogae) was timed at 3.64 Ga (95% CI: 3.42-4.11 Ga) and the divergence of Cyanobacteria at 2.74 Ga (95% CI: 2.46-3.12 Ga). Thus, the span of time between the mean age of the root and the MRCA of Cyanobacteria was calculated to be 0.89 Ga. The span of time between Margulisbacteria and Cyanobacteria was found to be 0.67 Ga, and between Vampirovibrionia and Cyanobacteria 0.44 Ga (Figures 4f and 5). The latter is a value that is consistent with previous studies using entirely different rationales, datasets and calibrations [8,9]. However, we note that the lower bound of the confidence interval between Sericytochromatia and Vampirovibrionia overlaps with the upper bound of Cyanobacteria, by over 200 Ma in the latter case.

We also noted an exponential decrease in the rates of evolution of RpoB through the Archean, which stabilised at current levels in the Proterozoic (Figure 4f). The rate at the root node was calculated to be 2.37 ± 0.45 δ Ga⁻¹ and the average rate of evolution of RpoB during the Proterozoic was found to be 0.19 ± 0.06 δ Ga⁻¹. The average rate of cyanobacterial RpoB was 0.14 ± 0.04 δ Ga⁻¹; for Margulisbacteria it was 0.44 ± 0.17 δ Ga⁻¹, and for Vampirovibrionia 0.19 ± 0.05 δ Ga⁻¹: about 3.1 and 1.3 times the mean cyanobacterial rate, respectively.
These rates agree reasonably well with the observed distances, further indicating that the calibrations used in these clades performed well (Supplementary Table S2). Nevertheless, we suspect that the values for MSV represent underestimations of the true rates of evolution (slower than they should be), as some of the clades that include symbionts still appear much older than anticipated from their hosts (Supplementary Text S1). In contrast, the more complex CAT+GTR+Γ model implementing a birth-death prior with soft bounds on the calibrations resulted in smoothed out rates, which translated into unrealistically spread out divergence times (Supplementary Figure S8, models n to p), with Margulisbacteria and Vampirovibrionia evolving at 1.9 and 0.7 times the cyanobacterial rates, respectively.

To investigate the effect of the estimated divergence time of the root on the divergence time of MSV and Cyanobacteria we also varied the root prior from 3.2 to 4.4 Ga (Supplementary Figure S7). We noted that regardless of the time of origin of Bacteria (approximated by the divergence of Thermotogae in our analysis), a substantially faster rate is required during the earliest diversification events, decreasing through the Archean and stabilizing in the Proterozoic. This matches well the patterns of evolution of ATP synthase and PSII core subunits as shown in the previous section.

We then compared the above RpoB molecular clock with a different clock that included a set of 112 diverse sequences from Archaea, in addition to the sequences from Thermotogae, MSV, and Cyanobacteria, but removing all other bacterial phyla (Supplementary Figure S9).
We found that the calculated average rate of evolution of bacterial RpoB during the Proterozoic was slower (0.09 ± 0.03 δ Ga⁻¹) than in the absence of archaeal sequences (0.19 ± 0.06 δ Ga⁻¹), resulting in overall older mean ages (see Figure 4g and Supplementary Figure S8a and b). However, the rate at the Bacteria/Archaea divergence point was 6.87 ± 1.17 δ Ga⁻¹, similar to the rate for dAB and dCP, therefore requiring an exponential decrease in the rates similar to that observed for ATP synthase and the PSII core subunits. Remarkably, this rate and exponential decrease are associated with a span of time between the mean estimated ages of the LUCA and the MRCA of Cyanobacteria of 1.21 Ga. The span between Vampirovibrionia and Cyanobacteria was found to be 0.46 Ga with an overlap in the confidence intervals of about 120 Ma (Supplementary Figure S9a).

Figure 5 also highlights that neither the MRCA of any of the groups containing phototrophs nor their divergence from their closest non-phototrophic relatives appears to be older than one of the oldest and best accepted pieces of geochemical evidence for photosynthesis at 3.41 Ga [42] (Table 1). This is consistent with the molecular evolution of photosynthetic RCs, which indicates that the early evolution of RC proteins, including the divergence of type I and type II RCs, antedated the diversification of the known groups of phototrophs [32] (see also Supplementary Figure S3). It suggests that losses of photosynthesis and/or extinctions of phototrophs have been a common and continuous process through time. [...] Cyanobacteria.
Despite the fact that this dataset generated younger ages compared to RpoB (see Table 1 and Supplementary Figure S7), a similar exponential decrease in the rates was observed, with a rate at the Bacteria/Archaea divergence point measured at 4.62 ± 0.71 δ Ga⁻¹ and an average rate of bacterial ribosomal protein evolution of 0.30 ± 0.07 δ Ga⁻¹ (Figure 4h). The estimated mean age for the MRCA of Cyanobacteria was found to be 2.22 Ga (95% CI: [...]

Structural analysis
A fundamental premise of our investigation is that water oxidation started before the duplication of D1 and D2, and CP43 and CP47. The rationale behind this premise has been laid out before [16,43], and more extensively recently [11]. This rationale raises the question of how the D2/CP47 side of the RC lost its capacity to carry out water-splitting catalysis. To gain further insight into the nature of the structural site around the water-oxidising complex in the ancestral photosystem, we used ancestral sequence reconstruction to predict the most probable ancestral states. We will refer to the ancestral protein to D1 and D2 as D0 (Supplementary Figure S10). We generated 14 predicted D0 sequences using a combination of three ASR methods and amino acid substitution models. On average the 14 D0 sequences had 87.12 ± 0.55% sequence identity, indicating that the different algorithms provided largely consistent results. While the regions that include all transmembrane helices are aligned unambiguously, the N-terminal and C-terminal ends were aligned less confidently due to greater sequence variability at both ends. Nonetheless, we found that the predicted D0 sequences retain more identity with D1 than D2 along the entire sequence. The level of sequence identity of D0 compared to the D1 (PsbA1) of Thermosynechococcus vulcanus was found to be 69.58 ± 0.55% and 36.32 ± 0.15% compared to D2. [...] The D1 ligands H332, E333, D342, and A344 located at the C-terminus. 3) The CP43 ligands, E354 and R357, located in the extrinsic loop between the 5th and 6th helices, with the latter residue less than 4 Å from the Ca. Remarkably, there is structural and sequence evidence supporting the loss of ligands in these three different regions of CP47/D2.
In all the D0 sequences, at positions D1-170 and 189, located in the unambiguously aligned region, the calculated most likely ancestral states were E170 and E189, respectively. The mutation D170E results in a PSII phenotype with activity similar to that of the wild-type [...] Supplementary Figure S12 and Table S5. In contrast, D2 has strictly conserved phenylalanine residues at these positions, but the PP of phenylalanine being found at either of these positions was less than about 5% for all predicted D0 sequences. As a comparison, the redox active tyrosine residues YZ (D1-Y161) and YD (D2-Y160), which are strictly conserved between D1 and D2, have predicted average PPs of 68.8% (FastML), 98.8% (MEGA) and 98.6% (PAML). Therefore, the ligands to the catalytic site in the ancestral protein leading to D2 were likely lost by direct substitutions to phenylalanine residues, while retaining the redox active D2-Y160 (YD) and H189 pair (Supplementary Figure S11).

Prompted by the finding of a Ca-binding site at the electron donor site of the homodimeric type I RC of Heliobacteria (Firmicutes) with several similarities to the Mn4CaO5 cluster of PSII, including a link to the antenna domain and the C-terminus [17], we revisited the sequences and structural overlaps of CP43 and CP47. We found that a previously unnoticed structural rearrangement within the extrinsic loop occurred in one subunit relative to the other (marked EF3 and EF4 in Figure 6, Supplementary Figures S13 and S14). CP43 retains the simplest domain, being about 60 residues shorter than CP47. If CP43 retains the ancestral fold, the additional sequence in CP47's swapped domain (EF4 in Figure 6d) would have contributed to the loss of a catalytic cluster as it inserted one phenylalanine residue (CP47-F360) into the electron donor site, less than 4 Å from YD. An equivalent residue does not exist in CP43.
We then noted that in the swapped region (EF3 in Figure 6d), sequence identity is retained between CP43 and CP47 (Supplementary Figure S13). We found that CP43-E354 and R357 are equivalent to CP47-E435 and N438. An inspection of the crystal structure of cyanobacterial PSII showed that these two residues specifically bind a Ca of unknown function (Figure 6c and d). The presence of an equivalent glutamate to CP43-E354 in CP47 is consistent with this being already present before duplication.

Finally, a peculiar but well-known trait conserved across Cyanobacteria and photosynthetic eukaryotes is that the 5' end of the psbC gene (CP43) overlaps with the 3' end of the psbD gene (D2), usually over 16 bp (Supplementary Table S6 [...]

There is an emerging consensus that oxygenic photosynthesis was occurring already at 3.0 Ga [56-58]. Well-preserved Cyanobacteria-like microbial mats are suggestive, although not entirely conclusive, of oxygenic photosynthesis at 3.2 Ga [59,60]. The possibility of oxygenic photosynthesis existing as early as the Eoarchean has been discussed before, but the evidence is contentious. If the MRCA of Cyanobacteria occurred 3.0 to 3.2 Ga ago, or even before that, it would imply slower rates of evolution within Cyanobacteria than currently anticipated from most molecular clock analyses (Table 1). It would also imply slower rates of evolution of PSII than reported in this work, pushing the ancestral duplications towards even earlier times. If that is the case, it becomes even more likely that water oxidation is a trait primordial to life given the large ΔT and the constraints imposed by the rates of evolution [11], but it would still necessitate losses of oxygenic photosynthesis across Bacteria or extinction events. Alternatively, it may be that ancestral forms of Bacteria had more in common with Cyanobacteria than is apparent based on today's biodiversity of prokaryotes [36,37].
Our data are consistent with recent models of the stepwise oxygenation of the planet, which suggest that even if relatively high fluxes of O2 (4.5 × 10¹³ mol O2 equivalent per year) started about 4.0 Ga ago, the inherent properties of global biogeochemical cycling would result in a "great" oxygenation at around 2.5 to 2 Ga, while maintaining low concentrations of O2 over the Archean, and without involving any particular triggers [61]. In this study, we found no evidence to justify an origin of oxygenic photosynthesis that coincided with the GOE. For this to be the case, exceedingly high rates of amino acid substitutions of the PSII core subunits would be required relative to other RC proteins [11], occurring at a late stage in their evolution, and followed by a precipitous decline in the rate. While we cannot completely reject the possibility that this sudden spike occurred, neither phylogenetic data nor the comparative structural biology of the photosystems supports a scenario in which PSII experienced rates of evolution greatly superior to those of any other RC. In fact, structural constraints indicate that the opposite is likely to be the case.

Structural constraints
We have calculated that (oxygenic) PSII has experienced the slowest rates of evolution among type II RCs, with the core of the anoxygenic type II RC of Proteobacteria and Chloroflexi evolving approximately five times faster than the core of PSII [11]. That considerably faster rate has led to conspicuous structural changes of the anoxygenic RC relative to PSII and type I RCs. This is not only visually apparent (Figure 7): PSII also retains greater structural symmetry at its core and a greater number of conserved structural traits with type I RCs not found in its anoxygenic type II RC cousin [17,62]. It follows then that because these conserved traits can be traced to the homodimeric stage of the earliest RC, the rates of evolution of PSII should have remained slow relative to those of other RCs since before the core duplications, as has been the case for well over the past two billion years. However, the large distance between the core subunits of PSII not only already accounts for a period of fast evolution at its origin, but it also requires a large span of time between the core duplications and the MRCA of Cyanobacteria.

One of the earliest events in the evolution of photosynthesis is the structural and functional specialisation that led to type I and type II RCs. It is conventionally considered that the first six transmembrane helices of RC proteins make up the antenna domain, while the photochemical core encompasses the last five helices. In actuality, the antenna domain extends to the 8th helix, both in type I RCs and in PSII, with the latter retaining one antenna chlorophyll in the equivalent 8th helix (marked Z in Figure 7), as well as substantial sequence [...]

To compare the level of sequence identity between RC proteins, two datasets of 10 random amino acid sequences were generated using the Sequence Manipulation Suite [91].
The datasets contained sequences of 350 and 750 residues. These were independently aligned as described above, resulting in 45 pairwise sequence identity comparisons for each dataset. These random sequence datasets were used as a rough minimum threshold of identity. Alignments of RC proteins were generated using three representative sequences spanning known diversity. Cyanobacterial CP43, CP47, standard D1 and D2 sequences were from Gloeobacter violaceus, Stanieria cyanosphaera, and Nostoc sp. PCC 7120; Heliobacterial [...]

Molecular clocks are conventionally used to estimate divergence times. In general terms, given: 1) a tree topology, which sets the relationship between taxa; 2) a sequence alignment, which sets the phylogenetic distance between taxa; and 3) some known events (calibrations), which set the rates of evolution, the molecular clock can then estimate divergence times. This means that if the tree topology and divergence times for two sets of protein sequences are the same, any differences in phylogenetic distances between these two should only reflect differences in the rate of evolution. Thus, assuming that CP43/CP47 and Alpha/Beta have mainly been inherited vertically in Cyanobacteria and photosynthetic eukaryotes, any difference in phylogenetic distance between the two is the result of differences in the rates of evolution. For example, the level of sequence identity between CP43 in Cyanidioschyzon and Arabidopsis is 78%, and the level of sequence identity between Alpha in the same species is 69%. Given that these plastid-encoded subunits have mostly been inherited vertically since the MRCA of Archaeplastida, one can argue that Alpha is evolving somewhat faster than CP43. This is because faster rates of protein evolution should lead to faster rates of [...]

Node 18 denotes the MRCA of Cyanobacteria. The age for this node is highly debated, ranging from before to after the GOE.
Node 19 represents the duplication events leading to CP43 and CP47, and to Alpha and Beta. To calculate the rates of evolution under different scenarios, node 18 and node 19 were varied. Firstly, a molecular clock was run using a scenario that assumed that the MRCA of Cyanobacteria postdated the GOE. To do this, node 18 was set to be between 1.6 and 1.8 Ga, which emulates results reported in recent studies [8,104]. This was compared to a scenario that assumed that the MRCA of Cyanobacteria antedated the GOE, and thus node 18 was set to be between 2.6 and 2.8 Ga, which simulates other evolutionary scenarios [93,105]. In both cases, the duplication event (node 19) was set to be 3.5 Ga old, or changed as stated in the main text, by assigning a gamma prior at the desired time fixed with a narrow standard deviation of 0.05 Ga. In a separate experiment, the age of the duplication was varied while maintaining node 18 restricted to between 1.6 and 1.8 Ga, while node 19, the root, was set with a gamma prior with an average varied from 0.8 to 4.2 Ga and with a narrow standard deviation of 0.05 Ga.

The period of time between the duplication event (node 19), which led to the divergence of CP43 and CP47, and the MRCA of Cyanobacteria (node 18), we define as ΔT. ΔT is calculated by subtracting the mean age of node 18 from that of node 19. For PSII, we used node 18 from the CP43 subunit and for ATP synthase we used node 18 from the Alpha subunit. In consequence, varying the age of the duplication from 0.8 to 4.2 Ga allows changes in the rate of evolution to be simulated with varying ΔT, ranging from 0. [...]

[...] template for the calculation of the rate of evolution and was based on the topology presented by Shao et al. [38]. It was concluded that the Cyanobacteria-inherited closest paralog to FtsH1 and FtsH2 in photosynthetic eukaryotes was also acquired before their initial duplication.
Therefore, from all FtsH paralogs in photosynthetic eukaryote genomes, those with greater sequence identity to cyanobacterial FtsH1/2 were used. Because this duplication is specific to Cyanobacteria, a few additional strains were included in this tree following well-established topologies 93,95. Calibrations were placed as indicated in Supplementary Figure S15b. To test the change in the rate of evolution at the time of duplication in comparison with CP43/CP47, node 19 was set to 1.6-1.8 Ga or 2.6-2.8 Ga and molecular clocks were run as described in the preceding paragraph.

Finally, we ran a large molecular clock using the combined 897 CP43 and CBP sequences, including 40 eukaryotic CP43 sequences, to test whether using a more complex phylogeny would result in rates of evolution substantially different from those calculated with the method described above. Calibrations were assigned as illustrated in Supplementary Figure S16. Cross-calibrations were used across paralogs constraining the origin of heterocystous Cyanobacteria. In this case, only the minimum constraint of 0.72 Ga was used, with no maximum constraint, to allow greater flexibility. Additional calibrations were also assigned across paralogs (point 20 in Supplementary Figure S16), on the node made by Richelia intracellularis and its closest sister sequence, as implemented in ref. 105. This strain is a specific endosymbiont of a diatom and its divergence was set to be no older than the earliest discussed age for diatoms 109. The root equivalent to the MRCA of Cyanobacteria (divergence of Gloeobacter in CP43) was not calibrated. The root of the tree was varied: first it was given a maximum age of 4.52 Ga, as recently implemented and justified by Betts et al. 104 as the earliest plausible time at which the planet was habitable after the Moon-forming impact 110, and no minimum age was used.
A second tree was run with no constraint on the root and no root prior. A third tree was run with the root constrained to be between 2.3 Ga (the GOE) and 3.2 Ga. The latter date represents the age of the cyanobacteria-like, well-preserved microbial mats of the Barberton Greenstone Belt in South Africa 111. Rates were obtained using the autocorrelated CAT+Γ model as described above. Because these root constraints did not have a strong effect on the overall estimated rates, we carried out an additional control applying an uncorrelated gamma clock model 112 with a root constrained at 4.52 Ga and no minimum age.
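The ΔT quantity defined earlier in this section (mean age of node 19 minus mean age of node 18) can be sketched as follows. This is only an illustrative calculation, not the pipeline used in the study: the `Scenario` class and `delta_t` function are hypothetical names, and the node 18 ages of 1.7 and 2.7 Ga are assumed stand-ins (the midpoints of the 1.6-1.8 Ga and 2.6-2.8 Ga constraints) for the posterior mean ages that a molecular clock run would actually report.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One clock scenario. All ages are in Ga and are illustrative only."""
    label: str
    node19_mean: float  # mean age of the duplication (node 19)
    node18_mean: float  # mean age of the MRCA of Cyanobacteria (node 18)


def delta_t(s: Scenario) -> float:
    """Span between the duplication and the MRCA of Cyanobacteria, in Ga."""
    return s.node19_mean - s.node18_mean


# Assumed values: the duplication at 3.5 Ga, and node 18 at the midpoint
# of each constraint window rather than at a real posterior mean.
scenarios = [
    Scenario("MRCA postdates the GOE", node19_mean=3.5, node18_mean=1.7),
    Scenario("MRCA antedates the GOE", node19_mean=3.5, node18_mean=2.7),
]

for s in scenarios:
    print(f"{s.label}: delta T = {delta_t(s):.2f} Ga")
```

With these assumed inputs the two scenarios give ΔT values of about 1.8 and 0.8 Ga. For the same phylogenetic distance, a shorter ΔT implies a faster rate of evolution along the branches that follow the duplication.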

Molecular clock of RpoB and concatenated ribosomal proteins
The primary objective of this experiment was not to determine the absolute time of origin of Cyanobacteria, but to understand the spans of time between Cyanobacteria and their relatives. We also wanted to understand what rates of evolution are associated with those spans of time and how these change under different evolutionary scenarios. To do this, we applied a molecular clock to the phylogeny of RpoB sequences described above. We implemented 12 calibrations, assigned on the phylogeny as shown in Supplementary Figure S17 and listed in Table 2. A set of calibrations consisted of the earliest unambiguous evidence for Chroococcales Cyanobacteria of the Belcher group (point 21), the age of which has recently been revised to 2.01 Ga 113. This was assigned to the younger node from which Chroococcales strains branch out in the tree, with no maximum restrictions. The appearance of heterocystous Cyanobacteria was restricted to between 0.72 Ga and 1.56 Ga as described above. No constraints on the node representing the MRCA of Cyanobacteria were used. However, for rigor, we also tested an alternative single calibration on the node representing the MRCA of Cyanobacteria with a maximum of 2.01 Ga and no minimum, and with no other calibrations in the clade. This considered a scenario in which crown group Cyanobacteria are younger than the Belcher fossils.

In addition to cyanobacterial calibrations, we also applied the often-used biomarker evidence for phototrophic Chlorobi and Chromatiaceae at 1.64 Ga 114 (see, for example, refs. 9,54). These were used as a minimum with no maximum constraints (nodes 30 and 31, respectively, in Supplementary Figure S17).

Gastranaerophilales clade within other sequences from the human gut.
Because of this, we trialled changing this calibration to 55 Ma instead, the age of the oldest primate fossil 120, assuming that the sequences retrieved from the human gut had a common ancestor younger than the MRCA of primates. Alternatively, we tested moving this calibration to the ancestral node of the clade that included all the human gut sequences (node 24b). Gastranaerophilales is closely related to the order Vampirovibrionales, which includes Vampirovibrio chlorellavorus. This strain is a predator of the eukaryotic green alga Chlorella 121, and therefore we trialled a calibration assuming that Gastranaerophilales and Vampirovibrionales radiated after the MRCA of eukaryotes (node 25). We thus assigned a maximum calibration to this node of 1.8 Ga, representing the earliest described plausible eukaryote fossils 100, and no minimum age.

Another highly specific obligate symbiosis is that of the betaproteobacterium Polynucleobacter necessarius and ciliates of the genus Euplotes (Spirotrichea) 122. Polynucleobacter has close free-living phototrophic relatives within the same genus 122. We gave the node separating the phototrophic and non-phototrophic Polynucleobacter (node 26) a maximum age of 444 Ma for the oldest fossil evidence of spirotrichs, as implemented in Parfrey et al. 123, which predates the radiation of the genus Euplotes 124.

Another well-known association is that of the soil bacteria Bradyrhizobium and legumes. Thus, we gave the node separating Bradyrhizobium spp. from its closest relative in the RpoB tree, Xanthobacter autotrophicus, a maximum age of 86 Ma for Rosids, which contain legumes 83 (node 27).

The Rickettsiales are Alphaproteobacteria that exist in very close association with eukaryotes 125, an association that may reach back to the lineage leading to the origin of mitochondria 126.
Therefore, we assumed that the divergence of Rickettsiales occurred before the MRCA of eukaryotes and gave this node a minimum age of 1.8 Ga 100 (node 28). Finally, the family Anaplasmataceae contains bacteria that exist in close association with insects as endosymbionts (e.g. Wolbachia) or as parasite vectors (e.g. Anaplasma). Therefore, we set a maximum constraint for the MRCA of Wolbachia and Anaplasma (node 29), excluding Neorickettsia, to be as old as the earliest evidence for insects, about 395 Ma ago 115.

To constrain the age of the root, we first set a broad gamma prior with an average of 3.8 Ga and a standard deviation of 0.5 Ga. We found this to perform well and used it as a benchmark to compare with a range of evolutionary models and the effects of key calibrations (Supplementary Figure S8). Alternatively, we applied a broad calibration on the root with a maximum of 4.52 Ga as described above and a minimum of 3.41 Ga, which is the earliest well-accepted evidence for photosynthesis 42. This evidence was hypothesized to be anoxygenic in ref. 42

FtsH is universally conserved in Bacteria, has a hexameric structure like that of ATP synthase's catalytic head, and is usually found as homohexamers, but also as heterohexamers. The MRCA of Cyanobacteria likely inherited three variant FtsH subunit forms, one of which appears to have duplicated after the divergence of the genus Gloeobacter, and possibly other early-branching Cyanobacteria 38. This late duplication led to FtsH1 and FtsH2, which form heterohexamers with FtsH3, following the nomenclature of Shao et al. 38. FtsH1/FtsH3 is found in the cytoplasmic membrane of Cyanobacteria, while FtsH2/FtsH3 is involved in the degradation of PSII and other thylakoid membrane proteins.

Superimposed at the top are the implied distribution and divergence time for ATP synthase and PSII.
Horizontal bars within the tree mark 95% confidence intervals. These are shown only at selected nodes of interest for clarity, but see Table 1.

donor side is bound from an extrinsic loop between the 5th (E) and 6th (F) helices. This extrinsic loop, EF1 (blue), is made of two small alpha helices. The fourth molecular view, furthest to the right, shows the link between the electron donor site and EF1 in closer detail. b The CP43 subunit of PSII with the extrinsic loop shown in colours. c The CP47 subunit of PSII. Immediately after the 5th helix (E), a long alpha helix protrudes outside the membrane in both CP43 and CP47, showing structural and sequence identity (orange). We denote this helix EF2. After EF2, structural differences are noticed between CP43 and CP47, as schematised in panel d. In CP43, after helix EF2 a loop is found (shown in red ribbons), which we denote EF3. This contains the residues that bind the Mn4CaO5 cluster and it is followed by a domain that resembles EF1 in the HbRC at a structural level. In CP47, EF3 and EF1 retain sequence identity with the respective regions in CP43. CP47 has additional sequence that is not found in CP43 (EF4, purple). The green arrows mark the position at which the domain swap occurred in CP43 relative to CP47. We found that CP43-E354 and R357 are found in the equivalent domain in CP47 as E436 and N438, coordinating a Ca atom. N438 (EF3) links to EF1 via K332. It is unclear if the EF1 region in the HbRC is strictly homologous to that in CP43 and CP47, as very little sequence identity is found between the two; however, a couple of conserved residues between all EF1 may suggest it emerged from structural domains present in the ancestral RC protein (see Supplementary Figure S14).

are the RC core domain (grey and light-blue ribbons).
Below the structure, the organization of the 11 helices is laid down linearly for guidance: 1N denotes the first N-terminal helix and 11C the last C-terminal helix. P denotes the "special pair" pigment; M, the "monomeric" bacteriochlorophyll electron donor; and A, the primary electron acceptor. FX is the