An unanswered call: Where is multilevel modeling in male infertility research involving semen parameters?

Consider This



Emily M. D'Agostino, Dr.P.H., M.S., M.A.
Miami-Dade Department of Parks, Recreation and Open Spaces
275 NW 2nd St., Ste. 416
Miami, Florida


In March 2000, both Fertility and Sterility and the Journal of the American Statistical Association published papers calling for greater sophistication of analyses pertaining to human fertility, including the use of multilevel methods. Infertility occurs in approximately 12% of reproductive-aged couples in the United States and can be attributed to male factors in up to 60% of cases. Because semen parameters vary both between men and within the same man over time, multilevel methods are well suited to research in which semen quality serves as a proxy for male infertility. However, many studies reporting on semen parameters collect only one sample per male, although the literature indicates that multiple specimens improve study precision and reduce misclassification of infertility status. Further, even when multiple samples per male are collected, or when data are clustered by other factors such as study site or geographic region, multilevel models are rarely used. In essence, despite compelling work demonstrating the necessity of multilevel methods in semen parameters research, this important analytic tool has yet to be adopted as standard research practice. Here, I discuss the specific application of multilevel methods to male infertility research to reinforce the call for drawing from this important methodology when working with clustered data.

Consider This:

Infertility occurs in approximately 12% of reproductive-aged couples in the United States and can be attributed to male factors in up to 60% of cases (1, 2). In March 2000, Weinberg and Dunson published a manuscript in the Journal of the American Statistical Association calling for greater sophistication of analytic methods used in fertility research. The authors noted that multilevel methods accommodate variability in fertility associated with environmental and genetic factors, as well as changes in exposures over time (3). That same month, Fertility and Sterility featured Hogan and Blazar’s paper demonstrating the importance of multilevel methods for analyzing clustered data, such as when embryos or oocytes are the unit of analysis (4). Depending on the statistical tests used, this approach may improve the accuracy of confidence intervals and p values compared with traditional regression models. Consistent with these papers, multilevel methods should be used to model clustered data observed across semen specimens. In fact, the fertility literature reports high intraclass correlations (a measure of how similar observations are within the same group or cluster relative to observations from different groups) for semen parameters including sperm count, concentration, motility, morphology, and ejaculate volume, ranging from 0.58 to 0.92 (5-7). Semen parameters also show within-person variability over time (5-7), which must be considered given that the World Health Organization recommends collecting multiple semen samples from individual men to improve study precision and avoid underdiagnosis of infertility (8). Because multilevel methods take into account both between- and within-male variability, they are typically the most appropriate method for analyzing data using semen quality as a proxy for male infertility.
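
To make the intraclass correlation concrete, the following minimal sketch computes a one-way ANOVA estimate of the ICC from repeated samples. It is illustrative only: the function name is hypothetical, the values are made up rather than taken from any cited study, and the estimator assumes a balanced design (the same number of samples per man).

```python
from statistics import mean


def icc_oneway(groups):
    """One-way ANOVA estimate of the intraclass correlation.

    `groups` holds one list of repeated measurements per man.
    Assumes a balanced design: every man has the same number of samples.
    """
    n = len(groups)            # number of men
    k = len(groups[0])         # samples per man
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 men with the same number (>= 2) of samples each")

    man_means = [mean(g) for g in groups]
    grand_mean = mean(man_means)  # equals the overall mean when the design is balanced

    # Between-men and within-man mean squares
    msb = k * sum((m - grand_mean) ** 2 for m in man_means) / (n - 1)
    msw = sum((y - m) ** 2 for g, m in zip(groups, man_means) for y in g) / (n * (k - 1))

    return (msb - msw) / (msb + (k - 1) * msw)


# Made-up measurements: three men, two samples each (illustration only)
samples = [[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]]
print(round(icc_oneway(samples), 3))  # prints 0.882
```

In this toy example most of the variation lies between men rather than between a man's own samples, so the estimate is high. Unbalanced data, or data with no variation at all, would need a different estimator or extra handling.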

Surprisingly, however, male infertility researchers have yet to routinely draw from this framework when examining semen parameters. Indeed, a search of Google Scholar and PubMed combining the terms “Male infertility” and “Andrology” with “multilevel model” and related terms (“mixed model”, “hierarchical model” and “random-effects model”) identified approximately 20 papers over the last two decades that used a multilevel analytic approach. Although some sources may have been overlooked, multilevel methods are nevertheless extremely uncommon in this literature. To begin, many studies do not collect more than one semen sample per male. This is unfortunate given that the consistency of semen parameters across samples from the same men is significantly reduced when three or fewer samples are collected (6). Also, for studies that examine factors associated with fertility status, collecting multiple samples per male would reduce the potential for misclassification of the outcome (5). Further, even when multiple samples are collected from each male, or when data are clustered by other factors such as study site or geographic region, multilevel models are rarely used. In essence, despite compelling work demonstrating the necessity of multilevel methods in male infertility research involving semen parameters, this important analytic tool has not been adopted as standard research practice.
Why, then, are multilevel analyses of semen quality data not routinely used in male infertility research? Here, I reiterate the relevance of multilevel modeling to infertility research involving semen parameters, in the hope that more scholars in the field will adopt it. I provide an introduction to multilevel modeling as applied to work in this area, and offer a hypothetical example of how a researcher examining the association between obesity and male infertility would apply multilevel methods to their data. To this end, I encourage male infertility/andrology researchers to draw from this important methodology when analyzing data collected at different levels of observation.

Overview of generalized linear multilevel models as applied to male infertility research

Clustered data are common in infertility research (4). For example, an investigation of the association between occupational exposures and male infertility (e.g. using semen quality as a proxy) may collect individual specimens from multiple men across different work sites. Data from men who work in the same location would be considered clustered or grouped by work site in this study. Alternatively, repeated specimens collected from the same men within the study (i.e. repeated measures) would be considered data that are clustered or grouped within each individual. It is well established that observations on individuals in the same cluster/group, or observations collected repeatedly from the same individual over time, tend to differ less from one another (i.e. they are correlated or dependent) than observations from different clusters/groups (9). Indeed, men from the same work sites are likely to share common exposures related to fertility outcomes of interest. Likewise, research has shown that semen parameters from repeated observations of the same man are more similar (and therefore have less variation/smaller values for error terms in a statistical model) than parameters from different men (5-7).
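
One common way to write the simplest model for such data is a two-level random-intercept model. The notation below is an assumption introduced here for illustration and is not taken from the cited sources: i indexes specimens, j indexes the cluster (a man, or a work site), x_ij is an exposure, u_j is the cluster-level error, and e_ij is the specimen-level error.

```latex
\[
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2),
\qquad e_{ij} \sim N(0, \sigma_e^2)
\]
\[
\operatorname{Corr}\left(y_{ij},\, y_{i'j} \mid x\right)
  = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2},
\qquad i \neq i'
\]
```

Because two specimens from the same cluster share the term u_j, they are correlated; under this model the correlation equals the intraclass correlation. When the cluster-level variance is zero, the model reduces to a conventional regression with independent errors.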

What are the benefits of multilevel methods for male infertility research?  

Conventional linear regression models do not take into account differences in specimens across men, or across repeated specimens collected from the same man. In other words, traditional regression models do not account for whether observations within a total pool come from different groups of individuals, or whether data are drawn from repeated observations within the same individuals. This similarity in observations collected from members of the same cluster/group, or in repeated observations over time from the same individual, violates a main assumption of simple random sampling: that observations are independent of one another (9). As a result, when semen specimens are analyzed as a proxy for male infertility and traditional regression methods are used, the beta (i.e. slope) coefficient estimates are likely to carry underestimated standard errors (i.e. understated uncertainty), which may result in Type I error (i.e. incorrect rejection of the null hypothesis). For clustered/grouped observations, however, we can circumvent this problem by using generalized linear multilevel models (multilevel models) (9). Multilevel models include both individual- and group-level error terms, providing for a more complex error structure. Multilevel models then weight individual-level coefficients toward between-group estimates (10). The final beta coefficient estimates for multilevel models therefore draw information both from observations within a group (i.e. semen specimens collected from men at the same workplace, or multiple specimens collected from the same male) and from pooled information across different groups or different men with multiple specimens, respectively (9, 10). When analyzing data that come from different levels or clusters, multilevel models therefore generate more accurate beta coefficient estimates compared with conventional regression methods.
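
As a rough illustration of why ignoring clustering understates standard errors, the sketch below applies the standard design-effect approximation, 1 + (m - 1) x ICC, to the ICC range reported for semen parameters (0.58 to 0.92). The function names, the choice of three samples per man, and the 100-man study size are hypothetical. The approximation assumes the same number of samples per man and applies most directly to a mean or to a predictor that does not change within a man, so it is a back-of-envelope guide rather than a substitute for fitting a multilevel model.

```python
import math


def design_effect(icc, samples_per_man):
    """Variance inflation from clustering: 1 + (m - 1) * ICC (equal cluster sizes)."""
    return 1.0 + (samples_per_man - 1) * icc


def se_inflation(icc, samples_per_man):
    """Factor by which a naive (independence-assuming) standard error is too small."""
    return math.sqrt(design_effect(icc, samples_per_man))


def effective_sample_size(n_observations, icc, samples_per_man):
    """Number of independent observations carrying the same information."""
    return n_observations / design_effect(icc, samples_per_man)


# Hypothetical study: 100 men, 3 semen samples each (300 specimens in total)
men, samples_per_man = 100, 3
for icc in (0.58, 0.92):  # ICC range reported in the fertility literature (5-7)
    print(
        f"ICC={icc:.2f}: "
        f"design effect={design_effect(icc, samples_per_man):.2f}, "
        f"SE inflation={se_inflation(icc, samples_per_man):.2f}, "
        f"effective n={effective_sample_size(men * samples_per_man, icc, samples_per_man):.0f}"
    )
```

Under these assumptions, 300 specimens carry far less information than 300 independent observations, which is why treating them as independent overstates precision.
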
Although outside the scope of this manuscript, the literature addresses the number and size of groups needed in multilevel analyses to avoid Type II error (i.e. failure to detect a significant effect when one exists). Briefly, multilevel models may not be appropriate when the total number of groups is small (e.g. too few men in the study, or too few clusters). In such instances, studies may be underpowered (11). The literature provides examples of the number and size of groups necessary to avoid this error (11, 12). Also, although multilevel models have the potential to improve the accuracy of beta coefficients as compared with conventional regression methods, they are particularly complex. Care must be taken to properly set up models in order to avoid erroneous conclusions. In summary, multilevel methods can provide more accurate beta coefficient estimates for andrology research when data come from clusters or different levels of organization.

Hypothetical multilevel analyses on the male obesity-infertility association 

While a growing body of literature associates male obesity with infertility, including hypogonadism, impaired spermatogenesis, reduced circulating testosterone levels, lower semen quality, and lower live birth rates per cycle of assisted reproduction technology (13), authors seldom draw from multilevel models to examine this association. For example, in a meta-analysis on obesity and infertility which included 21 studies (14), not one paper used multilevel models to analyze the data. However, seven of the studies drew from more than one specimen per participant (repeated measures), including one study which simultaneously analyzed multiple biological specimens associated with fertility collected from the same individuals, and five studies examined data that had potential clustering by other factors including geographic area, lab technician, and study center. None of these papers described methods which tested whether multilevel models were appropriate (although these tests are uncomplicated for trained individuals). In other words, although multilevel analyses may have been the most appropriate method for analyzing data from 12 of the 21 studies in this meta-analysis, conventional approaches were used. What, then, would a multilevel analysis of male obesity and infertility look like? Suppose a researcher were to draw from a large prospective cohort of men followed over six months to examine the association of weight loss and semen quality, with semen specimens and weight collected at baseline and every two months (i.e. repeated observations). The main predictor for the model would be change in weight from baseline; the outcome would be semen quality (concentration, morphology and motility); and potential confounding variables could include diet, exercise, smoking status, alcohol and drug use, and time from abstinence (see Figure 1).
In this hypothetical study, repeated observations collected from the same male over different specimen collection times would include the predictor (weight change) and potential confounders (e.g. abstinence time and alcohol/drug intake via diary since last visit). These exposures differ from those which occur at the person level (i.e. observations that vary across men in the study), such as age (measured in years), marital status, and smoking status (ever smoked/not). By separating variability at the specimen level from variability at the person level, the multilevel model would account for correlation (i.e. interdependence) between repeated observations, thereby producing more accurate standard errors than a traditional linear regression approach and reducing the potential for Type I error. The statistical model testing for the association between semen quality and weight change would be adjusted for potential confounders (diet, marital status, smoking status, drug/alcohol use, and abstinence time). The specific data analysis may be performed using statistical software such as the lmer() function in R (lme4 package), the xtmixed command in Stata (College Station, TX), the MIXED command in SPSS (Chicago, IL), and PROC MIXED in SAS 9.3 (Cary, NC) (15).
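
One possible way to write the model for this hypothetical study is sketched below. All symbols are assumptions introduced for illustration: i indexes specimens and j indexes men; y_ij is a semen quality measure, ΔW_ij is weight change from baseline, A_ij is abstinence time, and Age_j and Smoke_j are person-level covariates. Only two confounders per level are shown to keep the sketch short, and the random slope on weight change is one modeling choice among several.

```latex
% Level 1: specimen i within man j
\[
y_{ij} = \pi_{0j} + \pi_{1j}\,\Delta W_{ij} + \pi_{2}\,A_{ij} + e_{ij},
\qquad e_{ij} \sim N(0, \sigma_e^2)
\]
% Level 2: man j
\[
\pi_{0j} = \beta_{00} + \beta_{01}\,\mathrm{Age}_j + \beta_{02}\,\mathrm{Smoke}_j + u_{0j},
\qquad
\pi_{1j} = \beta_{10} + u_{1j}
\]
\[
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}
\right)
\]
```

Here β_10 is the average association between weight change and semen quality, u_0j lets each man have his own baseline level, and u_1j lets the weight change association differ across men.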

Figure 1. Comparison of conventional linear regression models and repeated measures multilevel models in infertility research examining the association of weight change and semen quality. Whereas the traditional model (left) assumes independence across observations within a data pool, the multilevel model takes into account differences in variance between specimens collected from the same male (middle graph). Repeated measures multilevel models allow for differences over time in the relationship between different trajectories of weight change and semen quality (right) across different men within the same data pool. Schematization based on a modification of Plewis (16).

Multilevel modeling—a necessary analytical tool for andrology research 

In summary, multilevel models provide male infertility researchers with a methodology to account for variance across clustered data, particularly when examining semen parameters as a proxy for infertility. This analytic approach is relevant to data collected at different levels of observation in infertility research, including but not limited to both specimen-level (e.g. abstinence period, pattern of alcohol use, exposure to heat or tight clothing) and person-level (e.g. workplace factors, history of smoking, drug use, STIs, and urban vs. rural place of residence) exposures. To this end, I hope more infertility scholars will draw from multilevel models where clustering of observations occurs.


1. Louis JF, Thoma ME, Sorensen DN, et al. The prevalence of couple infertility in the United States from a male perspective: Evidence from a nationally representative sample. Andrology. 2013;1(5):741-748. 

2. Irvine DS. Epidemiology and aetiology of male infertility. Hum Reprod. 1998;13 Suppl 1:33-44. 

3. Weinberg CR, Dunson DB. Some issues in assessing human fertility. J Am Stat Assoc. 2000;95(449):300-303. 

4. Hogan JW, Blazar AS. Hierarchical logistic regression models for clustered binary outcomes in studies of IVF-ET. Fertil Steril. 2000;73(3):575-581. 

5. Chiu YH, Edifor R, Rosner BA, et al. What does a single semen sample tell you? implications for male factor infertility research. Am J Epidemiol. 2017;186(8):918-926. 

6. Francavilla F, Barbonetti A, Necozione S, Santucci R, Cordeschi G, Macerola B, et al. Within-subject variation of seminal parameters in men with infertile marriages. Int J Androl. 2007;30(3):174-81. 

7. Leushuis E, van der Steeg JW, Steures P, Repping S, Bossuyt PM, Blankenstein MA, et al. Reproducibility and reliability of repeated semen analyses in male partners of subfertile couples. Fertil Steril. 2010 Dec;94(7):2631-5. 

8. World Health Organization (WHO). WHO laboratory manual for the examination and processing of human semen. 5th ed. Geneva, Switzerland: World Health Organization, Department of Reproductive Health and Research; 2010. 

9. Luke DA. Multilevel modeling. Vol 143. Sage; 2004:82. 

10. DiPrete TA, Forristal JD. Multilevel models: Methods and substance. Annu Rev Sociol. 1994;20(1):331-357. 

11. Theall KP, Scribner R, Broyles S, Yu Q, Chotalia J, Simonsen N, et al. Impact of small group size on neighbourhood influences in multilevel models. J Epidemiol Community Health. 2011 Aug;65(8):688-95. 

12. Bauer DJ, Preacher KJ, Gil KM. Conceptualizing and testing random indirect effects and moderated mediation in multilevel models: new procedures and recommendations. Psychol Methods. 2006 Jun;11(2):142-63. 

13. Chambers TJ, Richard RA. The impact of obesity on male fertility. Hormones. 2015;14:563-568. 

14. Sermondade N, Faure C, Fezeu L, Shayeb AG, Bonde JP, Jensen TK, et al. BMI in relation to sperm count: an updated systematic review and collaborative meta-analysis. Hum Reprod Update. 2013;19(3):221-31. 

15. Albright JJ, Marinova DM. Estimating Multilevel Models using SPSS, Stata, SAS, and R. Indiana University Scholar Works [Internet]. 2010:12/24/17. Available from: 

16. Plewis I. Multilevel models. Social research update [Internet]. 1998;23:9/2/2017. Available from:


Fertility and Sterility

Editorial Office, American Society for Reproductive Medicine

Fertility and Sterility® is an international journal for obstetricians, gynecologists, reproductive endocrinologists, urologists, basic scientists and others who treat and investigate problems of infertility and human reproductive disorders. The journal publishes juried original scientific articles in clinical and laboratory research relevant to reproductive endocrinology, urology, andrology, physiology, immunology, genetics, contraception, and menopause. Fertility and Sterility® encourages and supports meaningful basic and clinical research, and facilitates and promotes excellence in professional education, in the field of reproductive medicine.


Mary Samplaski 14 days ago

This is a relatively novel way of looking at semen analysis testing, but it makes a lot of sense. I will be very curious to see whether this methodology provides more accurate results when it is applied. I will also be curious to see if it eventually becomes widely used, and how clinicians are trained in this methodology of evaluation.

Emily D’Agostino 8 days ago

Thanks Mary, it is true that further research should address whether multilevel models yield more accurate results in semen analysis testing.  Applying this methodology to analyses involving clustered data in other fields suggests it does, and further, that false conclusions may be drawn if it is not used.  It would be ideal to foster more opportunities to train clinicians in this methodology. I hope other epidemiologists share my interest in helping clinicians and their teams become more familiar with this approach.