Introduction

Plasmodium vivax is the most geographically widespread malaria species, with active transmission throughout South and Southeast Asia, Latin America, the Pacific, and parts of Africa1. P. vivax is also an emerging threat in areas where P. falciparum has been eliminated or is on track for elimination2. Transmission of P. vivax is more complex due to the presence of hypnozoites in the liver, which cause disease relapse without a new infectious mosquito bite3, and earlier gametocytogenesis than P. falciparum4, meaning that it can be transmitted earlier in an infection, even among asymptomatic patients5. Genomic epidemiology is a powerful tool for understanding parasite biology and transmission, and while some studies have been performed on P. vivax6,7,8,9,10,11,12,13,14,15,16,17,18, most have been done in P. falciparum19,20,21,22,23,24. Given the challenging biology impeding control25, there is an urgent need for the development of high-throughput genotyping methods to better understand this neglected parasite.

Genomic epidemiology studies of P. falciparum using whole genome sequencing (WGS), molecular inversion probes (MIPs), and amplicon sequencing have generated useful data to support malaria elimination by identifying parasite population structure19,20, importation and transmission networks and connectivity20, and the geographical spread of antimalarial19,20,22,23,24,26,27,28 and diagnostic resistance29. Such studies of P. vivax also exist, but are rarer6,7,8,9,10,11,12,13,14,15,16,17,18. However, they are similarly necessary, especially in South America, where P. vivax is now the predominant species and sympatric with P. falciparum15. Peru had the fourth largest number of malaria cases in the Americas in 2022, with the majority being P. vivax30. Most transmission occurs in the Amazonian region and department of Loreto31,32, where morbidity and economic losses stemming from malaria are greater than in the rest of the country33. Cases tend to cluster at the household and community levels34, with most malaria cases being asymptomatic and submicroscopic34,35,36. While the malaria burden in Peru has significantly declined from 1990 levels, progress has stalled and even reversed in the past decade33, with increases from 2010 onwards reversing the successes of the early 2000s32. Furthermore, P. vivax prevalence has been slower to decline than P. falciparum due to its unique biology that makes it more resistant to most current control measures, which are designed to target P. falciparum37.

In contrast to P. falciparum, where there are multiple large-scale genotyping panels in use19,20,22,23,24,26,27,28,29, there are relatively few existing panels for P. vivax. Genotyping panels based on targeted deep sequencing have been used to study P. falciparum populations and their capacity to escape control interventions in Africa and Latin America, including population structure and gene flow, geographic transmission patterns, origins of imported malaria cases, spread of antimalarial drug resistance, and genetic diversity of vaccine targets19,20,22,23,24,26,27,28,29. P. vivax genotyping, however, has historically relied on microsatellite genotyping with SNP barcodes and multiplex amplicon sequencing panels only more recently being used in focal studies18,38,39,40,41,42. A summary of existing panels is shown in Table 1.

Table 1 Summary of Existing P. vivax Genotyping Panels

MIPs are a highly multiplexed targeted sequencing approach that allow thousands of loci to be captured in a single tube. They are high-throughput and cost-effective for large genomic studies43 (see Supplementary Methods for an estimate) and have several advantages including the ability to tile across genes and integrated unique molecular identifiers (UMIs) that tag specific strands of DNA and allow bioinformatic error correction. This approach has been highly successfully used for studying P. falciparum epidemiology in Africa19,20,22,23. Thus, we sought to replicate it by creating four panels of MIPs to address multiple pertinent questions regarding P. vivax biology, including population structure, antimalarial resistance orthologs, and selection in potential blood-stage vaccine targets. Using publicly available P. vivax WGS datasets (details in Supplementary Methods, Supplementary Data 1), we designed our panels based on all known P. vivax genetic diversity to make them deployable in any epidemiological context. Here, we report the use of these MIP panels on 866 P. vivax samples collected in the Peruvian Amazon by NAMRU SOUTH between 2011 and 2018 (Supplementary Table 1) which enabled us to detect the frequency of putative antimalarial resistance alleles, identify balancing selection on red blood cell (RBC) invasion ligands, calculate complexity of infection, detect copy number variation in multiple genes, and reveal highly related (IBD ≥ 0.99) clonal transmission networks that span years.

Results

High-throughput sequencing and genotyping of Peruvian P. vivax isolates

Following panel optimization and MIP rebalancing on high parasitemia isolates, we performed MIP captures on 866 samples collected by NAMRU South, primarily in Loreto (Fig. 1). Two captures were performed for each sample, with one for the tiled gene panel (PvG) and one multiplexed capture for three SNP panels (PvFST, PvMAF5, PvMAF40), with a repool (only samples with average MIP < 100X UMI coverage included in repool) after initial sequencing to improve sequencing coverage and depth. We successfully extracted usable sequences from 685 samples. Additional details on sequencing output (Supplementary Fig. 1, Supplementary Fig. 2), samples successfully sequenced (Supplementary Table 1) and impact of parasitemia (Supplementary Fig. 3) are described in the Supplementary Results and Discussion. The number of samples retained for each step of the analysis is shown in Supplementary Fig. 4.

Fig. 1: Study sites and sample size.
figure 1

A Locations of sample collection by community within Peru, scaled by size of circle. Darker gray boundaries identify the regions and departments of Loreto (where all but one of the circles are located) and Junín. Blue rectangle defines Iquitos, expanded in (B) Inset map depicting region around Iquitos, scaled by color of shading. Map created in R using open data from GADM database, v 3.6. www.gadm.org and the Sf package (https://cran.r-project.org/package=sf) with GPL-2 license (https://cran.r-project.org/web/licenses/GPL-2).

Clonal transmission detected in Belén 2012–2014

To assess relatedness between isolates, a principal component analysis (PCA) was performed for 685 samples that passed quality filters and had sufficient associated metadata for further analysis. The PCA revealed no broad geographical or temporal structure, but PC1 and PC2 included multiple clusters that were associated by both city and year (plotted by city, collection year, and both in Supplementary Fig. 5). To investigate these clusters further, pairwise identity by descent (IBD) was calculated for these samples using the MIPanalyzer R package (https://github.com/mrc-ide/MIPanalyzer). The median, mean, and maximum pairwise IBD were 0, 0.0878, 0.99, respectively. The overall distribution was bimodal, with 599 isolates being distantly related to at least one other pair (IBD < 0.25) and 448 being very highly related to at least one other pair (IBD ≥ 0.9) (Supplementary Fig. 6, Supplementary Fig. 7, Supplementary Fig. 8, Supplementary Fig. 9). To detect evidence of clonal transmission, IBD networks were generated between isolates with IBD ≥ 0.99 for five communities with a sufficient number of isolates: Belén, Indiana, Iquitos, Punchana, and San Juan Bautista (Fig. 2). In Belén, multiple clonal clusters were detected (Fig. 2), with many of these clusters spanning multiple years of collection. Smaller clonal networks were also detected in isolates collected from 2016 and 2017 in Indiana, 2012 and 2013 in Punchana, 2016 and 2017 in Punchana, and 2015–2017 in San Juan Bautista (Fig. 2).

Fig. 2: Identity by descent (IBD) networks for five communities within the vicinity of Iquitos.
figure 2

Only monoclonal samples are included. Each network node is a sample, with the edges representing IBD estimates. Nodes are color coded by collection year. An IBD threshold of 0.99 was used to generate networks. The Belén network shows evidence of clonal transmission in 2013 that is closely related to isolates collected in 2012 and 2014. In all five cities, there are networks connecting closely related isolates across years. A Belén, B Indiana, C Iquitos, D Punchana, and E San Juan Bautista.

Most infections are monoclonal

Of 697 isolates (all samples that passed quality filters, regardless of metadata) analyzed with THEREALMcCOIL, the majority (637, 91.4%) were monoclonal (Fig. 3). Among the 60 isolates estimated to be polyclonal, 44 (73.3%) were estimated to contain two clones, 14 (23.3%) were estimated to contain three, and two (3.3%) were estimated to contain four clones. The median COI estimate across all samples was 1.

Fig. 3: Violin plot depicting the complexity of infection (COI) of P. vivax isolates included in the analysis.
figure 3

A all P. vivax isolates analyzed. Each dot represents the mean COI estimate for a given isolate. The majority of isolates (91.4%, n = 637/697) are monoclonal, with 44 isolates (6.31%) containing two clones, 14 isolates (2.01%) containing three clones, and two isolates (0.287%) containing four clones. B P. vivax isolates from five communities with largest sample numbers. COI does not differ significantly by community. C P. vivax isolates from the same five communities ordered by collection year. COI does not differ significantly by year.

Copy number variation detected in duffy and reticulocyte binding proteins

Prior to calculating copy number based on UMI depth, the larger dataset of 685 samples was filtered for missingness and high coverage (see “Methods” for details), as well as monoclonality, leaving 595 samples for copy number calculation. Copy number variants (CNV) were detected in twelve samples (2.02%) (Fig. 4). All twelve samples contained multiple copies of reticulocyte binding protein 2b (Pvrbp2b), eleven of which had two copies while one had three copies. Ten of the twelve samples showed CNV only in Pvrbp2b, while two had additional duplicated genes (Fig. 4). The sample with three copies of Pvrbp2b also had two copies of Duffy Binding Protein (Pvdbp), Pvrbp1b, Pvrbp2a, and Pvrbp2c (Fig. 4). The final sample had a duplication in Pvrbp2c. No CNV was detected in either Pvrbp1a or erythrocyte binding protein (Pvebp).

Fig. 4: Upset plot showing the frequency of copy number variation (CNV) among monoclonal isolates.
figure 4

The majority of isolates (98.0%) do not exhibit CNV. CNV in Pvrbp2b was observed in twelve (2.01%) samples, with one isolate having three copies, as well as duplications in Pvdbp, Pvrbp1b, Pvrbp2a, and Pvrbp2c. One other isolate showed duplications in both Pvrbp2b and Pvrbp2c. No CNV was detected in either Pvebp or Pvrbp1a.

Mutations in antimalarial resistance orthologs

Allele frequencies were calculated for mutations that are found in antimalarial resistance orthologs in P. vivax. The full set of allele frequencies for all antimalarial resistance ortholog genes is reported in Supplementary Data 2. Mutations were detected in Pvcrt, Pvdhfr-ts, Pvpppk-dhps, and Pvmdr1 (Table 2). Allele frequencies in detected mutations are mostly rare (below 15%) (Table 2), though the Pvpppk-dhps G383A mutation was detected at a relatively high frequency (49.6%) (Table 2).

Table 2 Frequencies of mutations in antimalarial resistance orthologs detected in the sample set

Potential balancing selection in multiple RBC invasion ligands

In addition to determining the allele frequencies of antimalarial resistance ortholog markers, we also calculated Tajima’s D in 100 bp bins with a 50 bp step size across loci encoding merozoite surface and secreted proteins that are involved in the red blood cell invasion that we tiled with the PvG MIP panel. These ligands are potential P. vivax vaccine targets (see Supplementary Methods). The strongest signatures of balancing selection (Tajima’s D > 2), were detected in 13 of 20 genes analyzed, including apical membrane antigen 1 (Pvama1), Pvdbp, Pvebp, merozoite surface protein 1 (Pvmsp1), Pvrbp1a, Pvrbp1b, Pvrbp2a, Pvrbp2b, Pvrbp2c, tryptophan-rich antigen 111 (Pvtrag11), Pvtrag19, Pvtrag22, Pvtrag34, and Pvtrag36 (Fig. 5). The median value across all RBC invasion ligands was −0.577, while the minimum was −1.39 (Pvrbp1a) and the maximum was 3.14 (Pvtrag11). No Tajima’s D values <−2, consistent with strong directional selection, were detected in any RBC invasion ligand. All polymorphisms detected in RBC invasion ligands are reported in Supplementary Data 3 and all Tajima’s D values are shown in Supplementary Fig. 10.

Fig. 5: Scatter plots depicting Tajima’s D values in 100 bp bins with a 50 bp step size across potential vaccine target genes.
figure 5

Gray boxes indicate gene regions that were not captured by MIPs. Only genes with Tajima’s D values > 2, i.e., showing strong signatures of balancing selection, are depicted. Dots are placed in the midpoint of each bin, with values >|2| highlighted in red. A Pvama1, B Pvdbp, C Pvebp, D Pvmsp1, E Pvrbp1a, F Pvrbp1b, G Pvrbp2a, H Pvrbp2b, I Pvrbp2c, J Pvtrag11, K Pvtrag19, L Pvtrag22, M Pvtrag36.

Discussion

In this study, we demonstrate that the high-throughput MIP genotyping approach used to characterize the genomic epidemiology of P. falciparum in Africa is also effective and informative for South American P. vivax parasite populations. Combined, our novel genotyping panels are the broadest genotyping panel currently available, working reliably on P. vivax isolates with parasitemias ≥100 parasites/µL to characterize population structure, relatedness, antimalarial resistance ortholog mutation frequency, and identify signatures of selection in candidate vaccine targets. By deploying them on samples from Peru, we were able to simultaneously detect highly related (IBD ≥ 0.99) samples within the same community across years, detect CNV in Pvdbp and multiple Pvrbps, estimate COI, identify mutations in multiple antimalarial resistance orthologs, and detect balancing selection in 13 vaccine candidate genes.

These new MIP panels build on both the success of P. falciparum MIPs and other existing P. vivax genotyping panels to enable a thorough analysis of multiple key elements of P. vivax biology. MIPs are more easily multiplexed than amplicon panels and incorporate unique molecular identifiers that enable correction of amplification bias, but are typically less sensitive than amplicons. Crucially, like the microhaplotype panels42 and unlike the existing amplicon panels18,38, our MIP panels should be deployable in any geographic context without modifications. In addition, unlike any existing panels, our MIP panels contain both geographically discriminating and neutral SNPs in addition to 21 vaccine-candidate genes and five genes that are antimalarial resistance orthologs. The 1,200 SNPs genotyped across the three panels are more than sufficient to generate robust IBD estimates20,44. This single set of panels therefore enables the simultaneous detection of population structure and transmission networks along with surveillance of potential variants of concern for antimalarial resistance and mutations that may decrease the effectiveness of vaccines.

While our MIP panels are efficient, reliable, and high-throughput, they nevertheless will require further refinement to better meet every need for a widely deployed P. vivax genotyping panel. As with the amplicon, SNP, and microhaplotype panels, MIPs require sophisticated molecular laboratories with sequencing capabilities, which are rare in malaria-endemic regions. In addition, the computational analyses necessary to analyze the genotyping results require substantial bioinformatic resources and expertise, which are also in short supply in malaria-endemic regions. Another limitation of both the MIP and amplicon panels is that the incorporation of antimalarial resistance orthologs is of questionable significance to malaria control programs since none of these mutations are validated. In addition, unlike the microhaplotype panels, the MIP panels have not been validated for use in treatment efficacy studies (TES), which are critical to distinguish reinfection from recrudescence and relapse. Finally, a major limitation of MIPs is inconsistent performance in low-parasitemia samples, which hampers highly-multiplex amplicon panels as well. Since P. vivax typically causes low-density infections, many of which are undiagnosed in the Amazon36, we are currently limited to analyzing the high-density minority of infections, which could mean missing vital population data.

In addition to demonstrating the viability of MIPs to characterize and surveil P. vivax in a highly endemic region, we also identified multiple aspects of Peruvian P. vivax biology with potential implications for malaria control and elimination plans in the country. Within the immediate vicinity of Iquitos, we identified multiple clonal (IBD ≥ 0.99) networks with temporal-geographic connections. The Amazon region is known for its high clonality18, which contributes to a high proportion of submicroscopic infections34. Most notably in our study, the community of Belén appears to have experienced clonal P. vivax transmission from 2012-2014, which has also been observed elsewhere in the Peruvian Amazon18. Iquitos is known to be a source of P. vivax transmissions in the region, due to the highly mobile population37,45. The presence of clonal transmission in the suburb of Belén suggests that this strain likely spread elsewhere as well, meaning that monitoring transmission in the Iquitos region should provide a representative picture of most P. vivax within the Peruvian Amazon.

Several genomic epidemiology studies of P. vivax have occurred in Loreto with variable findings. Transmission patterns are complex, with a heterogenous spatio-temporal malaria distribution that is correlated with human mobility patterns, with travel history, and occupation influencing prevalence17,34. Urban areas have a similar proportion of monoclonal and polyclonal infections, but monoclonal infections predominate in rural areas and complexity of infection (COI) is higher in urban areas45, although most isolates regardless of urbanization are monoclonal17,18,37,46,47. While heterozygosity (He) is similar in all localities, multiple studies have identified substantial differentiation by geography17,18,46, which is also reflected in principal component analysis that separates rural and urban isolates45. These studies have also identified substantial substructure and clonal propagation in the parasite population18, with Iquitos functioning as the main source of parasites for the rest of the region45. Clonal populations reflected by isolates from the same spatio-temporal context clustering with high identity by descent (IBD) have been identified18. However, a study of isolates collected between 2015 and 2019 identified three subpopulations but no geographic substructure37.

CNV was rare in these samples and most frequently found in Pvrbp2b, although one isolate exhibited CNV in five genes. Although CNV has been previously observed in Pvebp48, we did not detect any. CNV has previously been identified in Pvrbp2a and Pvrbp2b, with duplications detected in two Thai isolates49. However, duplications in Pvrbp1b and Pvrbp2c have not been previously described. The Pvrbp gene family plays an important role in invading reticulocytes50,51,52, is immunogenic, and is a potential vaccine target. While there are multiple functional Pvrbp genes, the family also contains pseudogenes50. The Pvrbps vary in their degree of polymorphism, e.g., Pvrbp2c is more polymorphic than Pvrbp149, and their different domains may be under differential selective pressure49. Previous analyses have not identified deviations from neutrality in the Pvrbps53,54.

We also detected a duplication in Pvdbp in one isolate. Pvdbp plays a critical role in red blood cell (RBC) invasion in Duffy antigen/receptor for chemokines (DARC)-positive individuals, who make up the vast majority of people living in areas with high P. vivax transmission55. As such, PvDBP is a promising blood-stage vaccine candidate55. However, studies have demonstrated low Pvdbp immunogenicity in the Amazon region, and Pvdbp is highly polymorphic55. Pvdbp copy number variation (CNV) has been detected globally in isolates from both DARC-positive and DARC-negative individuals56,57. The implications of the CNV we detected in 14 isolates are somewhat unclear, but may suggest that vaccines targeting DBP or the RBPs that do not account for this variability may have reduced efficacy58.

While none of the mutations in antimalarial resistance orthologs in P. vivax are validated, they have been associated with resistance to chloroquine (CQ)59,60, sulfadoxine-pyrimethamine (SP)61,62,63, and other drugs64,65. Since CQ is still the first-line treatment for P. vivax in Peru and many other contexts30, monitoring putatively resistance-conferring mutations to this drug is critical to maintaining efficacious treatment programs. Previous studies have identified mutations in chloroquine resistance transporter (Pvcrt), which have been associated with vivax CQ resistance37, that were nearly at fixation17. However, since these mutations are in intronic regions, the functional significance is unclear17,59. Polymorphisms in multidrug resistance protein 1 (Pvmdr1) have been detected at heterogeneous levels throughout the Peruvian Amazon60, including in the CQ resistance-associated Y976F and F1076L mutations18,60. In our analysis, we find the Pvcrt K9 duplication at a relatively low level (11%), while both Pvmdr1 Y976F and F1076L are lower frequency than the 15.2% found with amplicon sequencing elsewhere in Loreto18.

SP is used as a partner drug in artemisinin combination therapy for P. falciparum. In addition, while SP is not a preferred treatment for P. vivax, it is sometimes given in areas where P. falciparum and P. vivax co-occur, although it was removed as a first-line treatment in Peru in 200166. As with Pvcrt, previous studies have identified mutations in dihydrofolate reductase (Pvdhfr) and dihydropteroate synthase (Pvdhps) ranging in frequency from very low (1.3%) to fixation in Peru17,18,37, which may have occurred while SP was still used for P. falciparum treatment in Peru37. The frequency of the Pvdhps A383G mutation is similar in our analysis (49.6%) to that found with amplicon sequencing (61.7%)18. We also found similar frequencies of the Pvdhfr S117N mutation (3.3% vs. 1.3%), but identified no isolates with the S58R mutation, in contrast to the relatively high frequency (54.3%) found in the other study18. Conversely, we identified the F57L mutation at a low frequency (1.6%), while the other study did not find it at all18. Our findings also contrast with data from PNG that found the F57L and S58R mutations to be consistently linked61. Although the clinical and operational significance of these mutations associated with reduced CQ and SP susceptibility is unclear, monitoring them may enable Peru’s malaria elimination goals, e.g., by increasing surveillance intensity in regions where their frequency is rising.

We also demonstrate evidence of balancing selection in 13 RBC invasion ligands which are potential vaccine targets, including Pvama1, Pvdbp, Pvebp, Pvmsp1, four members of the Pvrbp family, and four members of the Pvtrag family. The biological aspects of balancing selection in these targets are discussed in the Supplementary Results and Discussion. While the PvG panel identified potential balancing selection in 13 genes, it is limited by our ability to successfully design MIPs to target the whole genes. For some genes, such as Pvama1, we were unable to design MIPs targeting the beginning of the gene, while for others, such as Pvrbp2a, there are multiple smaller regions within the gene that we were unable to target. Other genes, such as Pvcsp, contain repeats that are difficult to target with the short length of the capture design. A future redesign of the PvG panel may improve coverage within these genes of interest. Nonetheless, even with this limitation, we were able to find signatures of selection.

Our four MIP panels, along with existing amplicon and microhaplotype panels, are poised to begin replicating the successful genomic epidemiology studies of P. falciparum in P. vivax. While their performance is less reliable with lower parasitemia (a common feature of P. vivax infections), this shortcoming may be overcome by incorporating enrichment methods such as selective whole genome amplification prior to MIP capture. In this analysis of Peruvian isolates, we demonstrate their utility to simultaneously detect population structure, CNV, COI, mutation frequencies in antimalarial resistance orthologs, and signatures of selection. As malaria control programs move toward the elimination of P. falciparum, such a tool for P. vivax will be critical if indeed it expands to fill the niche left by P. falciparum. Future MIP analyses of isolates from other geographical regions and/or collection years will likely enable a robust characterization of global and/or temporal trends in P. vivax biology, and refinement of the panel for low density infections.

Methods

Ethics approval

The protocol for this study was approved by the Institutional Review Board of the U.S Naval Medical Research Unit SOUTH (NAMRU SOUTH) in compliance with all applicable federal regulations governing the protection of human subjects (protocol NMRCD.2007.0004). Informed consent was obtained from all participants. For participants aged 8 to 17 years, written consent from parents or guardians and assent from participants were obtained.

Sample collection

Samples were collected between 2011 and 2018 as part of a malaria passive surveillance study conducted in Iquitos, the main city in the Peruvian Amazon Basin, and its surrounding areas in the Loreto region. Loreto consistently contributed 90–95% of all malaria cases reported in Peru during the study period (2011–2018) with an approximate distribution of P. vivax and P. falciparum of 80% and 20%, respectively. The mean API in Loreto region for vivax malaria during the study period was 33.0 cases/1000 population with a range from 9.2 (2011) to 48.9 (2014) and malaria incidence was mainly homogeneous in all the study sites (https://www.dge.gob.pe/portalnuevo/publicaciones/boletines-epidemiologicos/).

We enrolled individuals older than 1 year old, with fever or history of fever during the previous 72 h in the absence of another cause of fever. Enrolled participants provided a written and/or assent form. We collected clinical, demographic and epidemiological information67 and up to 8 milliliters of venous whole blood. Two thin and thick smears were prepared and stained with 10% Giemsa for each participant. Slides were read independently by two dedicated microscopists; discordant results were solved by a third microscopist. All participants were microscopy positive to Plasmodium vivax with a parasitemia range of 12 to 141,373 par/ul during the study period68,69.

Sample processing and sequencing

Dried blood spots were extracted using the Chelex method, as previously described20,43. Briefly, DNA was extracted from three 6-mm punches, which represent approximately 25 µL of blood. Extracted DNA was used for molecular inversion probe (MIP) captures, which were sequenced as previously described for P. falciparum20,43. The human MIP protocol was used instead of the parasite MIP protocol due to the GC richness of the P. vivax genome. When comparing the parasite and human MIP protocols, the differing buffer and enzyme in the respective capture and PCR steps, in addition to the increased temperatures in the PCR cycling conditions allow for more efficient amplification. Due to the difficulty in capturing the P. vivax genome from human samples, the amount of DNA input into the capture reaction was increased from 3 µL to 5 µL and the volume of water in the capture master mix was decreased, in exchange. Four new MIP panels were designed for P. vivax based on the PvP01 genome assembly using variation data from 918 published P. vivax genomes (Supplementary Table 2): 1) PvG, a tiled gene panel designed to densely genotype genes of interest, including antimalarial resistance ortholog markers and invasion ligands that are potential vaccine targets (Supplementary Table 3); 2) PvFST, a panel designed to capture geographically discriminating SNPs between and within continents; 3) PvMAF5, a panel designed to capture rare neutral (non-coding) SNPs (minor allele frequency <0.05); and 4) PvMAF40, a panel designed to capture common neutral SNPs (MAF > 0.4). The details of the MIP design scheme are given in  Supplementary Methods and probe sequences are provided in Supplementary Data 47. The generated libraries were sequenced on an Illumina Nextseq 550 platform using 150 bp paired-end sequencing with dual indexing using Nextseq 500/550 Mid-output Kit v2 in the case of the gene panels and 75 bp paired-end sequencing in the case of the SNP panels.

MIP variant calling and filtering

MIP sequencing data was processed using MIPWrangler within MIPTools (https://github.com/bailey-lab/MIPTools), which normalizes read counts based on unique molecular identifiers (UMIs) that are incorporated in the MIPs. Variant calling for all panels was performed using FreeBayes70 within MIPTools. For the SNP panels, we specified a minimum UMI depth of 3, a within-sample allele frequency threshold of 0.5, and a minimum alternate read count of 2 to obtain 616 variant SNPs across 760 samples (PvFST), 437 variant SNPs across 763 samples (PvMAF5), and 377 variant SNPs across 750 samples (PvMAF40). Bcftools71 v1.9 was used to filter the resulting VCF to include only targeted SNPs. The filtered VCFs were imported into R using vcfR68 (version 1.14), where they were assessed for sample and variant missingness. Samples with >50% missingness on any panel were excluded from the VCFs, which were then sorted and concatenated with bcftools. Finally, bcftools was used to subset the resulting VCF to include only biallelic SNPs and exclude SNPs with >50% missingness. The tiled gene panel was processed in a similar fashion, yielding 13,927 variant SNPs across 740 samples.

Population structure analysis

MIPanalyzer (https://github.com/mrc-ide/MIPanalyzer) was used to assess parasite relatedness and detect population structure. We first performed PCA of all samples. Given the unwieldy size and lightly structured nature of the dataset, we subsetted samples by city for plotting, with samples coded by collection year. To more thoroughly determine population structure and relatedness, we calculated identity by descent (IBD) of monoclonal isolates using maximum likelihood estimation. IBD networks connecting samples from each city with IBD ≥ 0.99 were generated using ggraph69 (version 2.1).

Complexity of Infection (COI), i.e., the number of parasite clones present in each sample, was calculated with McCOILR (https://ojwatson.github.io/McCOILR/index.html, version 1.3), an R implementation of THE REAL McCOIL (v2) categorical method72.

Analysis of antimalarial resistance orthologs and potential vaccine candidates

All amino acid changes detected with the tiled gene panel and their respective frequencies were summarized using miplicoRn (https://github.com/bailey-lab/miplicorn, v0.2.1) and flextable73 (v0.9). We prioritized the analysis of mutations that have previously been associated with drug resistance.

To detect potential selection on the potential vaccine candidates, we used vcf-kit (https://github.com/AndersenLab/VCF-kit) to calculate Tajima’s D in 100 bp sliding windows with a 50 bp step size.

Prior to CNV calculation, samples and MIPs were filtered for missingness as follows: samples with <5 unique molecular identifier (UMI) counts for more than 50% of MIPs were excluded, as were MIPs with <5 UMI counts for more than 50% of samples sequenced. For CNV calculation at loci of interest, the goal was to mitigate technical variations in the read coverage while maintaining true biological modifications. To correct the raw read data for systematic errors, z-score normalization of UMI coverage of each MIP was performed by accounting for potential sample and locus effects. Individual UMI coverages were transformed into z-score as \({z}_{ij}=\frac{|UM{I}_{ij}-{\underline{X}}_{UMI}|}{{\sigma }_{UMI}}\), where UMIij is the UMI coverage of a given MIP (or microhaplotype) \(i\) within a sample \(j\) and \({\underline{X}}_{UMI}\) and \({\sigma }_{UMI}\) represent the mean of per sample means and standard deviations of UMI coverage, respectively. For each sample, the mean and standard deviation of UMI coverages were initially calculated across all loci included in the MIP panel. Consequently, this step generated distributions of means and standard deviations for the entire sample set in which we computed averages to obtain the parameters \({\underline{X}}_{UMI}\) and \({\sigma }_{UMI}\). The relative read abundance of a candidate gene within a sample, which corresponds to CNV, was estimated by dividing the z-score (\({z}_{{ij}}\)) of each MIP targeting this gene by the averaged z-score of all the other genes captured that are known not to be subject to CNV. As this method deals with sample and locus-derived errors and not batch effects, it was only applied to calculate CNV in samples that are sequenced together in the same run. The approach was previously validated in laboratory strains of P. falciparum, including 3D7, 7G8, and Dd2, in independent runs. Unlike 3D7 and 7G8, Dd2, the positive control with multiple copies of mdr174,75 consistently showed CNV > 2 for this gene. We have also previously used MIP data to generate precise copy number estimates76.

Reporting summary

Further information on research design is available in Nature Portfolio Reporting Summary linked to this article.