This is the accepted manuscript version of the article from the Journal of Stroke and Cerebrovascular Diseases, 2021 Jul;30(7):105832.

The final version of the article can be found on the journal's website:


BACKGROUND: Machine learning algorithms must be trained on accurate, representative datasets to become valuable clinical tools that generalize across a varied population. We aimed to review machine learning applications in the stroke literature, assess the geographic distribution of the datasets and patient cohorts used to train these models, and compare that distribution with the geographic distribution of stroke to identify disparities.

AIMS: An initial search of the PubMed database identified 582 studies. Title and abstract screening excluded 489 papers, and 106 full texts were assessed. Of these 106 studies, 79 were excluded because they used cohorts from outside the United States or were review articles or editorials. The remaining 27 studies were included in this analysis.

SUMMARY OF REVIEW: Of the 27 studies included, 7 (25.9%) used patient data from California, 6 (22.2%) were multicenter, 3 (11.1%) used data from Massachusetts, 2 (7.4%) each used data from Illinois, Missouri, and New York, and 1 (3.7%) each used data from South Carolina, Washington, West Virginia, and Wisconsin. One study (3.7%) used data from both Utah and Texas. These figures were qualitatively compared with a CDC study in which stroke prevalence was highest in Mississippi (4.3%), followed by Oklahoma (3.4%), Washington D.C. (3.4%), Louisiana (3.3%), and Alabama (3.2%), while the prevalence in California was 2.6%.
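The tallies above can be reproduced with a short script. The following is a minimal sketch, not the authors' analysis code: the dictionary names and the helpers `study_shares` and `unrepresented` are hypothetical, and the inputs are only the counts and CDC prevalence figures quoted in this abstract. It matches single-state cohorts by exact name, so it cannot say whether any of the multicenter studies drew on the high-prevalence states.

```python
# Study counts by cohort location, as reported in the summary above.
study_counts = {
    "California": 7,
    "Multicenter": 6,
    "Massachusetts": 3,
    "Illinois": 2,
    "Missouri": 2,
    "New York": 2,
    "South Carolina": 1,
    "Washington": 1,
    "West Virginia": 1,
    "Wisconsin": 1,
    "Utah and Texas": 1,  # one study drawing on both states
}

# Stroke prevalence (%) from the CDC figures quoted above.
cdc_prevalence = {
    "Mississippi": 4.3,
    "Oklahoma": 3.4,
    "Washington D.C.": 3.4,
    "Louisiana": 3.3,
    "Alabama": 3.2,
    "California": 2.6,
}


def study_shares(counts):
    """Return each location's share of all included studies, in percent."""
    total = sum(counts.values())
    return {loc: round(100 * n / total, 1) for loc, n in counts.items()}


def unrepresented(prevalence, counts):
    """List states with a quoted prevalence but no single-state cohort,
    ordered from highest to lowest prevalence."""
    missing = [(state, p) for state, p in prevalence.items() if state not in counts]
    return [state for state, _ in sorted(missing, key=lambda sp: -sp[1])]


shares = study_shares(study_counts)
gaps = unrepresented(cdc_prevalence, study_counts)

for loc, pct in shares.items():
    print(f"{loc}: {study_counts[loc]} ({pct}%)")
print("High-prevalence states without a single-state cohort:", ", ".join(gaps))
```

Run as written, the script lists each location with its count and share, followed by the quoted high-prevalence states that have no single-state cohort among the included studies.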

CONCLUSIONS: A strong disconnect exists between the datasets and patient cohorts used to train machine learning algorithms in clinical research and the geographic distribution of stroke in the populations where clinical tools built on these algorithms will be implemented. To reduce bias and improve generalizability and accuracy, future machine learning studies should use datasets drawn from a varied patient population that reflects the unequal distribution of stroke risk factors. Such datasets would greatly improve the usability of these tools and support their accuracy on a nationwide scale.
