Multi-center planning study of radiosurgery for intracranial metastases through Automation (MC-PRIMA) by crowdsourcing prior web-based plan challenge study

Background: Planning radiosurgery to multiple intracranial metastases is complex and shows large variability in dosimetric quality among planners and treatment planning systems (TPS). This project aimed to determine whether autoplanning using the Multiple Brain Mets (AutoMBM) software can improve plan quality and reduce inter-planner variability by crowdsourcing results from a prior international planning study. Methods: Twenty-four institutions autoplanned with AutoMBM on a five-metastases case from a prior international planning competition, from which population statistics (means and variances) of 23 dosimetric metrics and the resulting composite plan score (maximum score = 150) of other TPS (Eclipse, Monaco, RayStation, iPlan, GammaPlan, MultiPlan) were crowdsourced. Plan results of AutoMBM and each of the other TPS were compared using two-sample t-tests for means and Levene's tests for variances. Plan quality of AutoMBM was correlated with the planner's experience and compared between academic and non-academic centers. Results: AutoMBM produced plans with a composite plan score comparable to GammaPlan, MultiPlan, Eclipse and iPlan (127.6 vs. 131.7 vs. 127.3 vs. 127.3 and 126.7; all p > 0.05) and superior to Monaco and RayStation (118.3 and 108.6; both p < 0.05). Inter-planner variability of overall plan quality was lowest for AutoMBM among all TPS (all p < 0.05). AutoMBM's plan quality did not differ between academic and non-academic centers and was uncorrelated with planning experience (all p > 0.05).


Introduction
Historically, stereotactic radiosurgery (SRS) was offered only to treat a limited number of lesions [1]. Recent evidence has established the safety of SRS for patients with more than four lesions, and even beyond ten, without compromising overall survival or increasing the incidence of SRS-related adverse events [2][3][4][5].
Besides the prohibitively long treatment duration associated with treating a large number of lesions, each of which may require additional shots for GammaKnife and CyberKnife, or in the case of C-arm linac a separate isocenter, the other major challenge concerns treatment planning. Planners typically rely on experience to determine and iteratively refine a range of treatment platform- and delivery technique-specific variables. A particular aspect of this exhaustive process is to minimize the dose bridging between lesions, which leads to unwanted high dose in the surrounding normal tissues. As the number of lesions increases, the complexity of finding an optimal set of variables by, for example, adjusting the size and shape of the collimation aperture and the irradiation trajectory (e.g., beam incident angles) can rapidly expand beyond the capacity of a human planner. The planner's experience eventually becomes the determinant of the plan quality and the main driver of plan variability, which might not be adequately compensated for with the aid of general inverse optimizers.
On the other hand, standardization of plan quality might not necessarily be achieved through plan benchmarking, as shown by the UK National Health Service commissioning of SRS services [6]. Despite the provision of a detailed protocol, guidelines and feedback, marked variations among the SRS plans from 21 institutions were still noted for two plan benchmark cases comprising three and seven lesions, not just between treatment platforms but also within the same type of treatment platform. The variation of plan quality was found to be particularly large among institutions that deployed C-arm linacs for SRS, indicating a strong dependence on the planner's skill.
Automated planning (AP) through scripts, templates or plan library solutions has demonstrated its potential to improve planning throughput and plan quality in many disease sites [7][8][9]. Among the few commercially available AP solutions, only two were developed with dedicated algorithms to address the unique problem of plan optimization for multiple brain metastases. In a dual-center study, Haisong et al. found favourable target dose conformity and intermediate dose volumes (10 and 8 Gy) received by the normal brain in non-AP plans generated on the Eclipse TPS (Varian, Palo Alto, CA, USA) for volumetric modulated arc radiotherapy (VMAT) delivery, but worse low dose volume (< 5 Gy), compared to those autoplanned by the Multiple Brain Mets SRS™ (AutoMBM; Brainlab, Munich, Germany) software for dynamic conformal arc (DCA) delivery [10]. In a more recent study by Hofmaier et al., the same AutoMBM as used in Haisong et al. [10] was found to generate plans that outperformed non-AP plans using the Monaco (Elekta, Stockholm, Sweden) TPS for VMAT delivery [11]. Earlier, Gevaert et al. indicated that AutoMBM was able to produce superior target dose conformity and normal brain dose compared with manually optimized VMAT plans on Eclipse [12]. Vergalasova et al., in another multi-center study, found that the DCA plans from AutoMBM produced generally worse target dose conformity than the other AP solution of VMAT by HyperArc™ (Eclipse, Varian) and the non-AP solution of GammaKnife plans [13]. Nonetheless, all these studies and others [6,14] merely reflect the experiences of academically oriented practices and might not be generalizable to clinical environments where quality control and improvement programs are less developed. The level of expertise of the participating institutions, the case selection and the limited number of plans that could be generated in the experiment and comparison arms, etc., prohibit unbiased results and objective conclusions in the above single- and multi-center planning studies.
To gain better insights into the potential of AP for multiple brain metastases, a large-scale multi-center study of the full range of planning solutions that reflects a variety of academic and non-academic practices should be performed. However, this requires extensive work to create benchmark plans on multiple TPS with and without AP solutions. Moghanaki et al. have recently adopted a crowdsourcing approach to analyze the quality and variability of a large number of stereotactic body radiotherapy plans for a single case of lung cancer using a web-based platform [15]. Hardcastle et al. investigated the challenges of spinal radiosurgery by crowdsourcing 149 plans from an international plan challenge study that was organized by the Trans-Tasman Radiation Oncology Group (TROG) [16]. The same group also crowdsourced 160 plans for a case of SRS to five brain metastases from another international planning study [17]. None of the 160 submissions to this planning competition was known to be autoplanned by AutoMBM. The crowdsourcing approach makes it possible to efficiently benchmark a plan solution belonging to a certain category of treatment platform, delivery technique or TPS against the results that already exist for the other categories in the cloud database. The feasibility of adapting the crowdsourcing approach has been demonstrated by Giles et al., who benchmarked their in-house developed conformal arc informed VMAT (CAVMAT) solution [18].
This Multi-Center Planning Radiosurgery for Intracranial Metastases through Automation (MC-PRIMA) study adapted the crowdsourcing approach to benchmark the performance of AutoMBM against the same 160 plans in the TROG planning study that were analyzed by Hardcastle et al. [17]. Unlike their study, which focused on the plan variability by delivery technique (e.g., GammaKnife, CyberKnife, VMAT, intensity modulated therapy), the primary objective of MC-PRIMA was to assess the potential of AutoMBM to improve the quality and reduce the variability of plans versus other existing TPS without a dedicated AP solution for SRS of multiple brain metastases. As a secondary objective, to understand how AutoMBM may address the technical demands of SRS planning, the correlation of AutoMBM plan quality with planning experience and the difference in plan quality between academic and non-academic centers were further evaluated.

Recruitment of participants
This study involved as broad a spectrum of participants as possible to emulate the constitution of participants in the TROG planning competition. The final recruitment included nine non-academic and fourteen academic institutions from seven regions (North / South America n = 2/2; Europe n = 15; Asia n = 2; Middle East n = 1; Africa n = 1; Australasia n = 1). To be eligible for MC-PRIMA, an institution either had to use the Multiple Brain Mets SRS™ (AutoMBM) software clinically for autoplanning (AP) or had to have undergone training from the vendor to use AutoMBM. Furthermore, these institutions must have had prior experience with other SRS planning solutions. The vendor of AutoMBM was also invited to participate.

Plan study dataset, planning protocols and plan quality metrics
This study adapted the international planning study case of five brain metastases that originated from the Trans-Tasman Radiation Oncology Group (TROG) Local HER 0 trial [19]. This study case was published through the publicly accessible web-based plan challenge study platform ProKnow (Elekta, Stockholm, Sweden). The dataset includes anonymized CT images from the skull apex to the second cervical vertebra and a set of tumor and normal organ contours defined according to the Local HER 0 trial [19]. The resolution of the planning CT is 1.0 × 0.468 × 0.468 mm³. The gross tumor volumes (GTV) of sizes 0.52 (GTV1), 0.39 (GTV2), 0.07 (GTV3), 2.82 (GTV4) and 0.12 (GTV5) cm³ were defined in the frontal, parietal and temporal lobes and the cerebellum adjacent to the brainstem. The sphericity of the GTV, calculated using the OpenCAD extension to 3DSlicer [20] as a measure of the roundness or spherical nature of the target, is 0.54, 0.59, 0.64, 0.59 and 0.66 for GTV1-5, respectively. A perfect sphere has the maximum sphericity of 1; the smaller the value, the less spherical the target.
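To make the sphericity values easier to interpret, the following is a minimal sketch that assumes the common Wadell definition, i.e. the surface area of a sphere with the same volume as the target divided by the target's actual surface area. The study does not state which formula the 3DSlicer extension applies, so the function name and the formula are illustrative assumptions only.

```python
import math

def sphericity(volume_cm3: float, surface_area_cm2: float) -> float:
    """Wadell sphericity (assumed definition): area of the equal-volume
    sphere divided by the actual surface area of the structure."""
    if volume_cm3 <= 0 or surface_area_cm2 <= 0:
        raise ValueError("volume and surface area must be positive")
    equal_volume_sphere_area = math.pi ** (1.0 / 3.0) * (6.0 * volume_cm3) ** (2.0 / 3.0)
    return equal_volume_sphere_area / surface_area_cm2

# Sanity checks on ideal shapes (not study data).
r = 0.5  # cm
sphere_value = sphericity(4.0 / 3.0 * math.pi * r ** 3, 4.0 * math.pi * r ** 2)
cube_value = sphericity(1.0, 6.0)  # unit cube
```

Under this assumed definition a sphere evaluates to 1 and a cube to roughly 0.81, so the reported values of 0.54 to 0.66 would indicate targets that are markedly less regular than either shape.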
Other normal organs included in the study case were both eyes, lenses, optic nerves, hippocampuses, brainstem, optic chiasm and normal brain (i.e., brain minus all GTVs). Per the Local HER 0 trial, GTVs were treated without safety margin for microscopic disease (i.e., no clinical target volume) and geometric uncertainty (i.e., no planning target volume), and no planning risk volume (PRV) for normal organs was defined.

User process
Participants were provided with instructions that included the planning protocol and a web-link to download the plan dataset. Prior to the start of planning, participants were advised to gain a full understanding of the planning protocol and the plan quality metrics. Participants then performed AP using AutoMBM and uploaded the resulting plan to the principal investigators for data analysis. Besides collection of the SRS plans, the experience of the planners was documented.

Treatment planning with AutoMBM
AutoMBM is an automated planning solution to treat multiple brain metastases by mono-isocentric arc delivery with a multileaf collimator (MLC) on a C-arm linac. Unlike the previous studies, which all used the early AutoMBM version 1.5, over half of our participants planned with the later AutoMBM version 2.0 using different C-arm linacs. Common to both versions is the template-driven automated planning process. The planner defines a set of templates called Clinical Protocols and Setup Protocols, catering to specific treatment objectives and irradiation geometry, respectively. Since version 2.0, AutoMBM employs an inverse algorithm which actively optimizes target heterogeneity, normal brain dose as well as OAR dose by optimizing collimator angle, field shapes, arc weight (monitor unit) and BEV margins. Besides the change of the optimization algorithm, AutoMBM v.2.0 adapts the highly automated planning approach to enable users to explicitly adjust, in a graphical interface, the target dose homogeneity as well as the dose constraints and their strength on the critical organs that may have been defined in the Clinical Protocol.
In order to perform a crowd-knowledge-based planning benchmark of AutoMBM against other non-AP treatment planning systems (TPS), this study adhered to the same planning protocol as defined in the original TROG planning competition (Supplementary Table S1), except that all participants were required to achieve 20 Gy to cover 99% of every GTV as a hard constraint. Other plan quality metrics are also given in Table S1. All institutions applied their own AP templates to reflect their clinical practices. Among institutions that planned with AutoMBM v.1.5, the general approach to achieve the optimal plan quality metrics was primarily to change the AP template. Among institutions planning with AutoMBM v.2.0, the dose distribution could have been further optimized by adjusting the prescription isodose line for individual targets, and the dose-volume constraints and their strength on other critical organs when necessary. Final doses were calculated with the pencil beam algorithm with an adaptive dose calculation grid of resolution from 0.63 to 1.25 mm³, except for one institution that applied Monte Carlo dose-to-medium calculation at a resolution of 1.9 × 1.9 × 2.0 mm³ and a statistical uncertainty of 2%.

Extraction and analysis of plan quality metric
For all cases, the dose matrices were exported in DICOM format at a uniform 1 mm³ resolution. The objective scoring algorithm underlying the ProKnow platform was developed by Nelms et al. [21]. Each plan was scored based on 23 metrics including target coverage per lesion, Paddick conformity index (PCI) [22] and R50% [23] as a proxy for the steepness of the dose gradient per total as well as individual lesion volume, etc. (Supplementary Table S1).
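The conformity and gradient metrics named in this section can be written out explicitly. The sketch below assumes the commonly used definitions (Paddick conformity index, R50% as the half-prescription isodose volume divided by the target volume, and the gradient index as the half-prescription isodose volume divided by the prescription isodose volume); the exact definitions used for scoring are those in Supplementary Table S1, and all function and variable names here are hypothetical.

```python
def paddick_ci(target_vol: float, rx_isodose_vol: float, target_vol_in_rx: float) -> float:
    """Paddick conformity index: (TV_PIV)^2 / (TV * PIV); 1 is ideal."""
    return target_vol_in_rx ** 2 / (target_vol * rx_isodose_vol)

def r50(half_rx_isodose_vol: float, target_vol: float) -> float:
    """R50%: volume of the 50% prescription isodose divided by target volume."""
    return half_rx_isodose_vol / target_vol

def gradient_index(half_rx_isodose_vol: float, rx_isodose_vol: float) -> float:
    """Gradient index: 50% prescription isodose volume / prescription isodose volume."""
    return half_rx_isodose_vol / rx_isodose_vol

# Hypothetical volumes in cm^3, chosen only to illustrate the arithmetic.
tv, piv, tv_piv, piv_half = 1.0, 1.25, 1.0, 5.0
example_pci = paddick_ci(tv, piv, tv_piv)   # 1.0^2 / (1.0 * 1.25) = 0.8
example_r50 = r50(piv_half, tv)             # 5.0 / 1.0 = 5.0
example_gi = gradient_index(piv_half, piv)  # 5.0 / 1.25 = 4.0
```

Because the target volume sits in the denominator of both PCI and R50%, very small lesions such as the 0.07 cm³ GTV3 are particularly sensitive to small absolute changes in isodose volume.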
Each metric has a maximum score and a baseline score for the ideal value and the minimum requirement, respectively. A score of zero was given to any metric that did not meet the minimum requirement. The scores of these 23 metrics summed to a maximum of 150 points. All scoring metrics were obtained directly from the AutoMBM software and their values were entered into an Excel sheet (Microsoft Excel version 2102, Microsoft Corporation, Redmond, USA) with formulae written specifically for this study to calculate their respective scores and the composite score according to the exact scoring functions devised in the TROG planning competition (Supplementary Table S1). It is important to note that the dosimetry scoring matrix was based on the Local HER 0 trial protocol [19] and did not necessarily reflect the clinical practice of individual participants.
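The scoring rule described above (a baseline score at the minimum requirement, a maximum score at the ideal value, and zero below the minimum requirement) can be illustrated as follows. This is only a sketch: it assumes linear interpolation between the two anchor points, whereas the actual scoring functions are those of Supplementary Table S1, and every number and name in the example is hypothetical.

```python
from typing import Iterable

def metric_score(value: float, minimum: float, ideal: float,
                 baseline_score: float, max_score: float) -> float:
    """Score one metric. Works for higher-is-better (ideal > minimum) and
    lower-is-better (ideal < minimum) metrics. Linear interpolation between
    the minimum requirement and the ideal value is an assumption."""
    span = ideal - minimum
    if span == 0:
        raise ValueError("ideal and minimum must differ")
    fraction = (value - minimum) / span  # 0 at the minimum requirement, 1 at the ideal
    if fraction < 0:
        return 0.0                        # minimum requirement not met
    fraction = min(fraction, 1.0)         # no bonus beyond the ideal value
    return baseline_score + fraction * (max_score - baseline_score)

def composite_score(scores: Iterable[float]) -> float:
    """Composite plan score as the plain sum of the metric scores."""
    return sum(scores)

# Hypothetical metrics: a coverage metric (higher is better) and a
# maximum-dose metric (lower is better).
coverage_pts = metric_score(97.0, minimum=95.0, ideal=99.0, baseline_score=5.0, max_score=10.0)
max_dose_pts = metric_score(9.0, minimum=12.0, ideal=6.0, baseline_score=2.0, max_score=4.0)
failed_pts = metric_score(13.0, minimum=12.0, ideal=6.0, baseline_score=2.0, max_score=4.0)
total_pts = composite_score([coverage_pts, max_dose_pts, failed_pts])
```

A spreadsheet formula of the kind described in the text would encode the same three cases per metric before summing across all 23 metrics.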
For comparison, this study crowdsourced the population statistics of seven TPS (Table 1) that were employed in the TROG planning competition. One of the TPS, Pinnacle (Philips Radiation Oncology Systems, Fitchburg, WI, USA), was used in only three plan submissions; its population statistics were crowdsourced but were not included in the statistical comparisons. Note that different techniques such as static-field/arc modulated radiotherapy (IMRT / VMAT), dynamic conformal arc radiotherapy (DCA) delivered with circular collimator or multileaf collimator (MLC), and single or multiple isocenters may have been planned and were combined to produce the population statistics per TPS. The population statistics that are publicly accessible from the cloud-based ProKnow system are the medians, means, and one standard deviations (S.D.). The PCI and dose gradient index (GI) [24] values per lesion that were calculated automatically for the AutoMBM plans were also recorded for quantitative analyses.
All centers calibrated the machine output so that one monitor unit gave one cGy at the depth of maximum dose for a reference 10 × 10 cm² field.

Comparison of plan variability and quality between TPS
To facilitate statistical comparisons of the variability and averaged performance of dose statistics between AutoMBM and each of the other TPS, this study assumed that the sample followed a normal distribution with the sample number, mean and standard deviation known for each TPS. For comparison of the standard deviation between AutoMBM and each of the other TPS, Levene's tests were used. For comparison of means, we used two-sample t-tests with either equal or unequal variances, according to the results of the preceding Levene's tests, for PCI, R50%, GI, normal brain volume receiving 10 and 12 Gy (NB V10Gy and NB V12Gy), maximum dose (Dmax) of the chiasm, eyes and lenses, dose to 0.3 cm³ (D0.3cc) of the brainstem, volume receiving 8 Gy (V8Gy) in the optic nerves, mean dose (Dmean) to the hippocampuses, the composite plan score, and the monitor units (MU) as a complexity metric [25]. Normal distributions of the different dose metrics were generated and statistical tests were performed using Matlab v. R2018a (MathWorks Inc., MA, USA).
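The comparison procedure described in this section can be sketched in Python with SciPy (the study itself used Matlab, so this is an equivalent illustration rather than the authors' code). It assumes that each TPS is represented by a synthetic normal sample rescaled to reproduce the published sample size, mean and standard deviation exactly; the rescaling step, the function names and the example numbers are assumptions.

```python
import numpy as np
from scipy import stats

def synth_sample(n: int, mean: float, sd: float, rng: np.random.Generator) -> np.ndarray:
    """Draw n normal values and rescale them so that the sample mean and
    sample SD (ddof=1) match the published summary statistics exactly."""
    x = rng.standard_normal(n)
    x = (x - x.mean()) / x.std(ddof=1)
    return mean + sd * x

def compare_tps(n_a, mean_a, sd_a, n_b, mean_b, sd_b, alpha=0.05, seed=0):
    """Levene's test on synthetic samples, then a two-sample t-test from the
    summary statistics with equal or unequal variances as Levene indicates."""
    rng = np.random.default_rng(seed)
    a = synth_sample(n_a, mean_a, sd_a, rng)
    b = synth_sample(n_b, mean_b, sd_b, rng)
    _, levene_p = stats.levene(a, b)
    equal_var = bool(levene_p >= alpha)
    _, t_p = stats.ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b,
                                        equal_var=equal_var)
    return {"levene_p": float(levene_p), "equal_var": equal_var, "t_p": float(t_p)}

# Hypothetical summary statistics (n, mean, SD) for two planning systems.
result = compare_tps(24, 125.0, 5.0, 30, 110.0, 20.0)
```

Because Levene's test operates on a synthetic sample, its p-value depends on the random seed, whereas the t-test here is computed directly from the summary statistics and does not.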

Dependence of AutoMBM plan quality on the planner's experience and the nature of the treating center
The potential of AutoMBM to lower or even eliminate the dependence of the plan quality, in terms of composite plan score, on the general and SRS planning experience and on the nature of the SRS treating center (i.e., academic vs. non-academic) was evaluated by Pearson's correlation and two-sample t-test, respectively. Shapiro-Wilk tests for normality were performed on the quantities in all correlation analyses.
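The analysis described above can be sketched with SciPy as follows. The data layout, the function names and the example values are hypothetical; only the choice of tests (Shapiro-Wilk, Pearson's correlation, two-sample t-test) comes from the text.

```python
import numpy as np
from scipy import stats

def experience_vs_quality(years, scores):
    """Shapiro-Wilk normality check on both variables, then Pearson's r
    between planning experience and composite plan score."""
    years = np.asarray(years, dtype=float)
    scores = np.asarray(scores, dtype=float)
    _, p_norm_years = stats.shapiro(years)
    _, p_norm_scores = stats.shapiro(scores)
    r, p = stats.pearsonr(years, scores)
    return {"r": float(r), "p": float(p),
            "shapiro_p_years": float(p_norm_years),
            "shapiro_p_scores": float(p_norm_scores)}

def academic_vs_nonacademic(academic_scores, nonacademic_scores):
    """Two-sample t-test on composite scores grouped by center type."""
    _, p = stats.ttest_ind(academic_scores, nonacademic_scores)
    return float(p)

# Hypothetical inputs: years of SRS planning experience and composite scores.
demo = experience_vs_quality([1, 3, 5, 8, 12, 20], [126, 129, 125, 128, 127, 126])
demo_group_p = academic_vs_nonacademic([126, 129, 125, 128], [127, 126, 128, 125])
```

A correlation coefficient near zero with a large p-value, as reported in the results, would indicate that the composite score does not track planner experience.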

Autoplanning with AutoMBM
Table 1 summarizes the characteristics of the AutoMBM submission plans. The majority of institutions applied five table angles (19 of 24) and a gantry arc length of 160° in the AP. Fig. 1 summarizes the distributions of the treatment table angle and the gantry arc length defined by different planners in their AP templates, and the automatically optimized collimator angle.

Comparison of plan variability and quality between TPS
Table 2 gives the dosimetric results of the AP solution using AutoMBM and non-AP solutions using other TPS that were crowdsourced from the cloud-based ProKnow system.The statistical significance at p < 0.05 in the comparison of the mean and the standard deviation is indicated by the bold value in Table 2. Supplementary Table S2 gives the corresponding objective score for each evaluated target and OAR, normalized to the respective maximum per TPS.

Dependence of plan quality on the planner's experience and the nature of the treating center
The overall plan quality, evaluated by the composite score, shows no dependence on the participants' experience in SRS and general planning (Fig. 4), with Pearson's correlation coefficients r = -0.06 (p = 0.767) and -0.06 (p = 0.764), respectively. These statistical results are not affected after adjusting for the potentially influential parameter of the machine's MLC width in the ANOVA tests. Furthermore, the difference in the means of the composite score between academic and non-academic centers is not significant (two-sample t-test; p = 0.975), and remains insignificant after further adjusting for the technical factor of MLC width (ANOVA; p = 0.481).

Discussion
The evaluation of automated stereotactic radiosurgery (SRS) planning against other manual solutions by either inverse or forward optimization is often subject to bias even in multi-center studies. This study overcame the classical limitation of planning benchmarks by adapting a prior web-based plan challenge study from which dose metric statistics of 160 plans from a range of treatment planning systems (TPS) can be crowdsourced. This allowed us to bypass the need to involve many institutions to generate a large number of comparison plans while still achieving adequate statistical power in the critical appraisal of autoplanning by the Multiple Brain Mets SRS™ (AutoMBM) software. Another main advantage of crowdsourcing plan results from a heterogeneous group of academic, non-academic and standalone institutions is that biases due to the variables of individual experience in SRS and equipment-specific characteristics (i.e., treatment machines) were effectively reduced.

Characteristics of AutoMBM plans
AutoMBM incorporates the user-defined template in its AP solution. With this AP approach the planner retains some degree of freedom to navigate the solution space for possibly better plan dosimetry. Definition of these templates, and hence the beam geometry, is generally a non-trivial task, requiring a certain level of expertise of the   angle defined therein. Nevertheless, most of the planners were believed to have set the minimum collimator angle as 4°, which was favourably chosen by AutoMBM. The auto-optimized collimator angle may depend on a number of factors such as the couch angle, the gantry arc span and the machine characteristics (e.g., maximum field size and width of the multileaf collimator (MLC), etc.). For all plans that resulted in > 10° collimator rotation, the collimator aperture had a limited field size in one dimension of 22 cm.

Inter-planner variation in plan quality using AutoMBM and other non-Auto TPS
The heterogeneity of the template definition and machine characteristics did not appear to contribute to excessive variability in the plan quality from AutoMBM. Considering two of the most important metrics in SRS, the values of PCI and GI showed relatively small dispersion among the institutions for each lesion (Fig. 2). The large GI values (10.06 and 9.34) and low PCI values (0.30 and 0.36) all corresponded to the smallest lesion of 0.07 cm³. Interestingly, the machine characteristics, and more specifically the width of the MLC, might not be the decisive factor behind these outlying values, as other AutoMBM plans created for the same type of machine and 5 mm MLC on Elekta Agility (Elekta, Crawley, UK) and Varian Millennium 120 (Varian, CA, USA) achieved GI of 6.89 and 6.84, and PCI of 0.53 and 0.56 for this smallest lesion, respectively. Furthermore, the beam arc geometry of these outlying plans did not differ substantially from that of other AutoMBM plans either, regardless of the machine / MLC model. What may influence the plan quality instead lies in the different settings of the Clinical Protocol template and the level of planner-enabled interactive smart tuning for controlling the degree of normal tissue sparing and MU spread during the optimization.
An important aspect that influenced the variability of plan quality among planners from different institutions is the treatment planning system (TPS) in use. This study showed that AutoMBM did not differ from most other TPS without SRS-dedicated AP concerning the variability of PCI, except for Monaco (Elekta, MO, USA) and iPlan (BrainLab, Munich, Germany). There were, however, marked differences in the inter-planner variability of R50% among TPS. AutoMBM reduced this variability compared to some TPS, especially those that are not dedicated to SRS such as Eclipse (Varian, CA, USA), Monaco and RayStation (Raysearch, Stockholm, Sweden). The fact that the GammaPlan TPS dedicated to GammaKnife SRS (Elekta, Stockholm, Sweden) achieved significantly smaller inter-planner variability in R50% compared to AutoMBM was likely associated with the historical practice of dose prescription at about the 50% isodose [26]. We also observed such practice in most participants using GammaPlan, whereas the distribution of prescription isodose level was much wider for planners using Eclipse, Monaco, and RayStation in the TROG planning competition. The difficulty of achieving uniform PCI and GI using Monaco was found even within the same institution. Hofmaier et al. obtained wide ranges of CI and GI values of 0.38-0.88 and 3.35-33.0 for the Monaco plans, which were reduced to 0.58-0.89 and 3.50-15.73 by AutoMBM [11]. Similar difficulty in obtaining uniform GI within the same institution was reported for Eclipse [11]. Gevaert et al. achieved GI at one standard deviation (1 S.D.) of 3.1 vs. 1.6 when planning VMAT on Eclipse without AP and dynamic conformal arc (DCA) by AutoMBM, respectively [12]. On the other hand, the majority of planners using iPlan were believed to have performed forward dose optimization for DCA delivery using circular collimator or MLC. Such an approach is effective at producing good target dose conformity when the target shape is fairly regular [27]. Given the poor sphericity of the targets in this TROG benchmark case, the possibility of creating a complex dose distribution conforming to the irregular targets' surfaces became critically planner-dependent, hence the significant variability of the PCI.
One may anticipate lower variability of plan quality when plans are generated for treatment platforms from the same vendor. This was generally true for the dedicated SRS delivery platforms but not for the linac-based platforms. The one S.D. of composite plan score is comparable between MultiPlan for CyberKnife (Accuray, CA, USA), GammaPlan for GammaKnife and AutoMBM. For CyberKnife planning on other vendor-independent TPS, which has recently become possible on RayStation, the proposed crowdsourcing approach is deemed useful to efficiently evaluate the inter-institution / planner variability compared with the vendor-dependent MultiPlan. For plans that were generated on Monaco and Eclipse for linacs presumably belonging to the same vendors (Elekta and Varian, respectively), the excessive variability of composite plan scores (p < 0.05) compared to AutoMBM could be partly attributed to the absence of a dedicated AP algorithm to manage the overlapping of MLC apertures between targets and the sharing of the same pairs of MLC leaves by two or more targets, which increased the dose bridging between targets in the normal brain. This problem also applied to the other TPS independent of the linac vendor, such as RayStation. As demonstrated in previous studies [18,28,29], the dose bridging problem cannot be easily resolved even with collimator angle optimization and finer MLC width, owing to the intrinsic limitation of the optimizer coupled with the MLC sequencer. In contrast, AutoMBM automated the allocation of targets among arcs to treat as many targets by as many arcs as possible by optimizing the collimator angle while avoiding two or more metastases sharing the same pair of MLC leaves. If two targets shared a leaf pair, the targets were assigned to different arcs at the same couch position. Otherwise, both targets were assigned to the same arc. This fully automated planning process was responsible for reducing the planner's intervention in the control of the intermediate-to-low dose spill and consequently reduced the variability of R50% and of the normal brain volume receiving 10 and 12 Gy (NB V10Gy and NB V12Gy, respectively).
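The allocation rule described above (targets that share a leaf pair go to different arcs at the same couch position, otherwise they share an arc) can be illustrated with a deliberately simplified sketch. It is not the AutoMBM algorithm: it assumes each target is reduced to a fixed extent along the leaf-stacking axis in the beam's eye view, ignores how that projection changes with gantry and collimator angle, and uses a greedy grouping. All names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# (low, high) extent in mm of a target along the leaf-stacking axis (assumed input)
Interval = Tuple[float, float]

def shares_leaf_pair(a: Interval, b: Interval) -> bool:
    """Two targets share a leaf pair if their extents along the
    leaf-stacking axis overlap."""
    return a[0] < b[1] and b[0] < a[1]

def allocate_targets(extents: Dict[str, Interval]) -> List[List[str]]:
    """Greedily group targets into arcs at one couch position so that no two
    targets in the same arc share a leaf pair."""
    arcs: List[List[str]] = []
    for name in sorted(extents, key=lambda k: extents[k][0]):
        for arc in arcs:
            if not any(shares_leaf_pair(extents[name], extents[other]) for other in arc):
                arc.append(name)  # no conflict: treat in the same arc
                break
        else:
            arcs.append([name])   # conflict with every existing arc: new arc
    return arcs

# Hypothetical extents: GTV1 and GTV2 overlap, GTV3 is clear of both.
example_arcs = allocate_targets({"GTV1": (0.0, 10.0), "GTV2": (5.0, 15.0), "GTV3": (20.0, 30.0)})
```

In this toy example GTV1 and GTV3 are grouped into one arc and GTV2 is moved to a second arc, which is the behaviour that prevents an open leaf pair from irradiating normal brain between two targets.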
The uniqueness of AutoMBM in achieving uniform target dose coverage is clearly demonstrated in Table 2. One reason is that this study required the planner to prescribe 20 Gy to cover 99% of the GTV (GTV V20Gy) for every lesion. The algorithm of AutoMBM guaranteed the precise prescription of GTV V20Gy = 99% by a stochastic optimization of arc weights that followed the final dose distribution resulting from the optimized arc configuration and dynamic arc MLC sequencing. With other TPS, renormalization was inevitably needed given deviations from the desired target coverage at the end of the final dose calculation. This process involved a non-trivial trade-off between target coverage and PCI that varied according to the clinical preference of individual planners and radiation oncologists. When separate plans were created for different lesions, the optimal target coverage vs. PCI that had been achieved by renormalization for one plan was likely to change after plans for individual lesions were summed. This situation necessitated repeated renormalization for every lesion by trial and error until the planner achieved a global, but also compromised, optimum of target coverage and PCI for all lesions. In cases where all lesions were co-optimized in one single plan, the renormalization after the final dose calculation would affect the target coverage and PCI of individual lesions all at once, making it nearly impossible to achieve uniform target coverage.
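The renormalization step discussed above for the other TPS can be sketched for a single target as follows. This is a minimal illustration that assumes equal-volume voxels and a single global scale factor; it is not how AutoMBM sets its arc weights, and the function name and example doses are hypothetical.

```python
import numpy as np

def renormalization_factor(target_dose_gy, prescription_gy: float = 20.0,
                           coverage: float = 0.99) -> float:
    """Global scale factor that makes the requested fraction of (equal-volume)
    target voxels receive at least the prescription dose."""
    dose = np.sort(np.asarray(target_dose_gy, dtype=float))[::-1]  # hottest first
    # Number of voxels that must be covered; the small epsilon guards
    # against floating-point round-up in coverage * size.
    n_cover = max(1, int(np.ceil(coverage * dose.size - 1e-9)))
    dose_at_coverage = dose[n_cover - 1]  # lowest dose among the covered voxels
    return prescription_gy / dose_at_coverage

# Hypothetical voxel doses (Gy) inside one target before renormalization.
voxel_dose = np.linspace(18.0, 24.0, 100)
factor = renormalization_factor(voxel_dose)
covered_fraction = float(np.mean(voxel_dose * factor >= 20.0 - 1e-9))
```

When several lesions are co-optimized in one plan, a single factor of this kind is computed from one target but scales the dose to all of them, which is why uniform coverage across lesions is difficult to reach by renormalization alone.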
As an overall effect, AutoMBM achieved significantly more uniform composite plan scores compared to other non-AP TPS for linac-based SRS, as evidenced in Table 2. Moreover, the larger variability of plan quality also suggested a greater spread of plan complexity, which was inferred from the significantly larger number of monitor units [30].

Overall plan quality between AutoMBM and other non-autoplan TPS
The ability to harmonize SRS plan quality is by far not enough to consider AutoMBM a viable solution to SRS for multiple brain metastases. Equally important is that AutoMBM achieves a plan quality standard that is comparable to other non-AP TPS. Despite the statistical significance, V20Gy of GTV1-5 was grossly comparable across different TPS regardless of AP. One exception was observed with iPlan, where the mean V20Gy for GTV4 was merely 81%, which was likely caused by an extreme outlier as indicated by the one S.D. of 31%.
The fact that AutoMBM could not achieve dose fall-off (R50%, NB V12Gy and NB V10Gy) comparable to the dedicated radiosurgery TPS GammaPlan and MultiPlan despite AP augmentation was not entirely surprising. It was largely because AutoMBM is an AP solution for linac-based radiosurgery using MLC, which shows a broader dosimetric penumbra compared with the conical collimators used in GammaKnife and CyberKnife. GammaPlan and MultiPlan also used a large number of isocenters and non-isocentric confocal beams, which could protect the normal brain and other OARs significantly better than other TPS for non-coplanar radiosurgery on linacs [31][32][33][34]. Nonetheless, the composite plan scores of GammaPlan, MultiPlan and AutoMBM were statistically equal, although AutoMBM showed slightly worse dose statistics in other OARs in general. Because the scoring functions (Supplementary Table S1) that were designed per the Local HER-0 trial had higher weights on the target coverage, AutoMBM scored more points from the ideal target coverage for all lesions than GammaPlan and eventually achieved an equivalent composite plan score. On the other hand, MultiPlan achieved almost full scores from target coverage, and higher scores from better CI, NB V12Gy and other OARs, and eventually a higher composite plan score, although the difference was statistically insignificant. Rossi et al. have recently shown that the boundary of CyberKnife plan quality could be pushed even further by AP using the vendor-independent iCycle software in prostate radiotherapy [35]. As another vendor-independent TPS has also become available for CyberKnife planning, further crowdsourced plan benchmarks will provide more insights into the role of the TPS and the incorporation of AP in the overall plan quality.
Results for linac-based SRS differed between TPS, with Eclipse and iPlan showing very similar R50%, normal brain doses and doses in other OARs to AutoMBM, while Monaco and RayStation almost reversed the results in comparison to AutoMBM. Although Monaco and RayStation scored almost the maximum possible points of 45.9 and 49.9 out of 50 for GTV V20Gy, their worse performance in PCI, R50%, NB V10Gy, NB V12Gy and dose to 0.3 cc of brainstem (D0.3cc) caused them to lose a great many points on these high-weighted dosimetric metrics and ultimately resulted in significantly lower composite plan scores. A possible reason for these results could be that a large number of plans on Eclipse and iPlan were created for Varian linacs assuming the finest MLC width of 2.5 mm, while those on Monaco were created for Elekta linacs assuming the finest MLC width of 4-5 mm [28,36,37]. Regardless of this hypothesis, it is still clear that planners using the AP-powered AutoMBM were able to overcome the influence of machine / MLC configuration and achieve comparable and even superior dose statistics to other existing TPS. The ability of AP to reduce the plan quality variability among treatment machines is not unique to AutoMBM but has also been reported for other TPS in other treatment sites such as head and neck, pancreas and rectal cancers [38].

Limitations
The dosimetric scoring matrix and the eventual composite plan score were uniquely devised based on the Local HER-0 trial protocol. Another cloud-based international plan challenge study used a different scoring matrix [18]. Although the interpretation of overall plan quality may change with the scoring algorithm and the definition of plan quality [39], the findings that AutoMBM could reduce inter-planner variability and improve the average performance across a range of dose metrics remain valid. The retrospective nature of this study precluded case selection bias in favor of AutoMBM. Nevertheless, there was a small possibility of bias stemming from prior knowledge of the scores achieved by the other TPS, which might have steered the participants to work strategically on certain scoring metrics to achieve higher scores. Yet, this study achieved RATING scores of 94% and 92% as assessed by two authors (XX and XX) [40].
It is acknowledged that the statistical comparisons may be subject to debate because of the assumption of normality in the scoring metrics. The ProKnow system provided only the population statistics of dosimetric results per either TPS or delivery technique. Statistical comparisons using Levene's tests for variances and two-sample t-tests for means represented the best effort and were deemed reasonable for quantifying the relative performance of TPS. As this study focused on TPS as the main factor influencing plan quality and variability, the impact of delivery technique was ignored, partly because the ProKnow system lacked information about the delivery techniques planned per TPS. Delivery technique is known to influence plan quality. For example, dosimetric comparisons on the same TPS showed better PCI and GI with VMAT than with IMRT [41] and comparable PCI and GI between mono- and multi-isocentric DCA [12]. Contradictory results also exist for comparisons within the same delivery technique on the same TPS, for example, cone-based vs. MLC-based DCA on linac [8,42,43] and on CyberKnife [44,45]. Although institutions should not be limited in their choice of delivery technique, the above studies together with the present MC-PRIMA suggest that, especially for linac-based SRS to multiple brain metastases, the delivery technique(s) should be carefully chosen per TPS. We strongly recommend that organizers of international plan competitions share detailed information on individual plan submissions on the web platforms for their upcoming as well as past studies, such as the one for the TROG Local HER-0 trial study case. The availability of this information is important to fully unlock the potential of plan crowdsourcing, so that potentially interacting factors such as TPS, machine characteristics and delivery technique can be taken into account in the statistical analyses [46]. With the ultimate goal of knowledge sharing via cloud-based plan crowdsourcing, SRS institutions can be better informed about which TPS combined with which delivery technique helps avoid suboptimal and highly variable plan quality.
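As an illustration of how two group means can be compared when only population statistics (mean, SD and number of plans) are available, the following minimal sketch computes a two-sample t statistic and its degrees of freedom. It assumes the unequal-variance (Welch) form, which is not necessarily the variant used in this study, and the function name and input values are hypothetical rather than taken from ProKnow or from Table 2.

```python
import math


def welch_t_from_stats(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for Welch's two-sample t-test from summary statistics.

    Assumption: the unequal-variance form is used here for illustration only.
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two plans")
    # Squared standard errors of the two group means
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics for one dosimetric metric in two groups
t_stat, dof = welch_t_from_stats(10.0, 2.0, 25, 8.0, 2.0, 25)
```

The two-sided p value then follows from the t distribution with `dof` degrees of freedom. The sketch covers the comparison of means only.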
Plan complexity has implications for plan deliverability [25]. In the original TROG plan competition and in this study, which involved a spectrum of delivery techniques planned by different TPS, the number of MU per Gy may be the simplest and most widely applicable complexity metric for evaluating and comparing the relative plan deliverability between TPS [47]. Other complexity metrics, such as the modulation complexity score [48] and variations of the nominal dose rate or gantry speed [49], apply only to certain delivery techniques that can be planned by certain TPS and were therefore unsuitable for this study. In Table 2, the MU resulting from AutoMBM was found to be significantly lower than that from other TPS, which may result in superior deliverability [30]. Although highly recommended [40], a dry-run test of the deliverability and dosimetric accuracy of the submitted AutoMBM plans was not requested in MC-PRIMA. Post-planning dosimetric validation is commonly omitted in non-sponsored multicenter planning studies [10,13,46,50] like MC-PRIMA, predominantly because the financial resources needed to arrange a rigorous dosimetry audit, as in clinical trials, are generally beyond reach. Nonetheless, all institutions that participated in MC-PRIMA had a rigorous commissioning program and routine plan quality assurance in place to demonstrate acceptable deliverability and accuracy of their clinical AutoMBM plans.
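The MU per Gy metric discussed above can be sketched as a simple ratio. The definition assumed here (total plan MU divided by the prescription dose) and the numbers used are illustrative assumptions, not values from Table 2.

```python
def mu_per_gy(total_mu, prescription_dose_gy):
    """Return monitor units per Gy of prescribed dose.

    Assumption: the metric is total plan MU divided by the prescription dose.
    """
    if prescription_dose_gy <= 0:
        raise ValueError("prescription dose must be positive")
    return total_mu / prescription_dose_gy


# Hypothetical plan: 6000 MU in total for a 20 Gy prescription
example_ratio = mu_per_gy(6000.0, 20.0)
```

Under this definition, a lower ratio indicates fewer MU delivered per unit of prescribed dose.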

Dependence of plan quality on the planner's experience and the nature of treating center
This study collected information about the planning experience of each individual planner. When the composite plan score was plotted against the general planning experience and the radiosurgery planning experience of the planner, no significant correlations were found. Plan scores from planners at academic centers were also compared with those from non-academic centers, again without any significant difference. Other studies [51,52] have shown that AutoMBM could outperform human planners, but this study is, to the best of our knowledge, the first to demonstrate that AutoMBM could also eliminate the dependence of plan quality on planning experience and the nature of the treating center. As radiosurgery is increasingly applied to treat multiple brain metastases, AutoMBM offers a viable option to alleviate the global shortage of planners/dosimetrists, who typically take years to develop their expert skills. Further crowdsourced plan benchmarking from other cloud-based international plan challenge studies is warranted to validate the results of this MC-PRIMA study and to investigate whether other AP solutions in general are able to improve plan quality and reduce variability in SRS planning for multiple brain metastases.
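A correlation between planning experience and composite plan score of the kind described above can be sketched as follows. The choice of the Pearson coefficient is an assumption, since the correlation method is not stated here, and the experience and score values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two samples of equal length with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def correlation_t(r, n):
    """Return the t statistic (n - 2 degrees of freedom) for testing r = 0."""
    return r * math.sqrt((n - 2) / (1.0 - r ** 2))


# Hypothetical data: years of planning experience and composite plan scores
years = [1, 3, 5, 10, 15, 20]
scores = [127, 129, 126, 128, 127, 128]
r = pearson_r(years, scores)
t_r = correlation_t(r, len(years))
```

A coefficient close to zero, with a correspondingly small t statistic, would be consistent with plan quality that does not depend on experience.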

Conclusions
Plan crowdsourcing offers an efficient means to benchmark a new TPS or delivery technique. This plan crowdsourcing study shows the promise of AutoMBM to achieve clinically acceptable plans with minimal inter-planner variability for SRS to multiple intracranial metastases, independent of planning experience and institution.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: M.K.H. Chan was an employee at Imperial College NHS Healthcare Trust, UK and University Hospital Essen, Germany at the time of preparing this study. The current study was conducted without the involvement of these employers. Matthew Wong was an employee at Tuen Mun Hospital, China at the time of participation in this study. The authors received technical support in the collection of plan submissions and technical information on the Multiple Brain Metastases SRS™ software from Brainlab (Brainlab AG, Munich, Germany). No author received financial support for the study design or materials, the analysis and interpretation of data, or the writing of the publication. Part of this work was presented as an e-Poster at the ASTRO annual meeting 2019.

Fig. 1. Polar graphs showing the distributions of table angles, gantry angles covered by the arcs and collimator angles for the twenty-four AutoMBM plans. Each axis (solid grey) in these graphs represents the angle of the table, gantry or collimator, and the radial grid lines (dotted light grey) indicate the number of these table, gantry and collimator angles, respectively.

Fig. 4. Plot of radiosurgery and general planning experiences versus composite plan score of AutoMBM.

Table 1
Characteristics of the AutoMBM plan submissions in MC-PRIMA study.
D.) of the 23 plan quality metrics and their respective scores.

Table 2
Means ± one standard deviation (SD) of various dosimetric parameters calculated for AutoMBM and crowdsourced for different planning solutions from the cloud-based ProKnow platform.