Analysis of Birth weight using Singular Value Decomposition
The birth weight of newborn babies has drawn much attention from researchers over the last three decades. Birth weight plays a vital role in a baby's health, and many researchers, such as (2), (1) and (4), have analyzed it. The aim of this paper is to analyze birth weight and some other birth-weight-related variables using singular value decomposition and multiple linear regression.
💡 Research Summary
The manuscript reports an exploratory study that seeks to identify maternal factors influencing newborn birth weight by applying both Singular Value Decomposition (SVD) and multiple linear regression to a small dataset collected from a private hospital in Chennai, India, during the calendar year 2008. The authors first introduce the public‑health importance of low birth weight (defined as < 2500 g) and list a range of maternal characteristics that have been implicated in previous research, including gestational age, maternal height, weight, age, blood pressure, and hemoglobin concentration.
Data are presented in a single table (Table I) that groups newborns into five birth‑weight intervals (<2000 g, 2000‑2400 g, 2400‑2800 g, 2800‑3200 g, >3600 g) and reports the average values of the five maternal variables for each interval. The authors treat this five‑by‑five matrix as the data matrix A, where rows correspond to weight categories and columns to the maternal predictors. No information is given about the number of individual births that contributed to each average, nor about the variability within each group.
In the SVD analysis the matrix A is factorised as A = U S Vᵀ. All five singular values are computed, but the authors retain only the two largest (rank‑2 approximation) and truncate the remaining components. They then project the original predictor matrix X and the response vector Y (birth weight) into the reduced space using the formulas X̂ = X Uₖ Sₖ⁻¹ and Ŷ = Y Uₖ Sₖ⁻¹, where Uₖ and Sₖ contain the first two columns of U and the corresponding 2×2 block of S, respectively. Cosine similarity is calculated between each transformed predictor and the transformed response. The resulting similarity scores are reported as 0.94228 for hemoglobin (X₅), 0.9298 for blood pressure (X₄), 0.2371 for maternal height (X₁), 0.3608 for maternal weight (X₂), and 0.5169 for maternal age (X₃). The authors interpret the similarity values as “the proportion of variability in Y explained by each X” and conclude that hemoglobin concentration is the dominant factor.
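The projection and similarity steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the 5×5 matrix `A` and the response vector `y` are randomly generated placeholders (the values of Table I are not reproduced here), and the helper names `truncated_svd`, `project`, and `cosine` are hypothetical, not taken from the manuscript.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k (m x k), the inverse of S_k (k x k), all singular values, and V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(1.0 / s[:k]), s, Vt

def project(v, U_k, S_k_inv):
    """Map a length-m vector into the k-dimensional space: v U_k S_k^{-1}."""
    return v @ U_k @ S_k_inv

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder data (assumption): rows = weight categories, columns = predictors.
rng = np.random.default_rng(0)
A = rng.uniform(1.0, 10.0, size=(5, 5))
y = rng.uniform(1.5, 4.0, size=5)  # placeholder response, one value per row

k = 2  # keep the two largest singular values
U_k, S_k_inv, s, Vt = truncated_svd(A, k)
y_hat = project(y, U_k, S_k_inv)

# Project each predictor column and compare it with the projected response.
similarities = [
    cosine(project(A[:, j], U_k, S_k_inv), y_hat) for j in range(A.shape[1])
]
for j, sim in enumerate(similarities, start=1):
    print(f"X{j}: cosine similarity with Y = {sim:.4f}")
```

Two properties are worth noting. Projecting column j of `A` in this way simply returns the first k entries of the j-th right singular vector, and a cosine similarity lies in [-1, 1], so it measures an angle in the reduced space rather than a share of variance.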
Parallel to the SVD work, a conventional multiple linear regression model is fitted:
Y = β₀ + β₁ X₁ + β₂ X₂ + β₃ X₃ + β₄ X₄ + β₅ X₅ + ε
The estimated coefficients (Table II) are β₁ = –96.82, β₂ = 9.05, β₃ = 68.39, β₄ = –29.46, β₅ = 45.02, with an overall R² = 0.738. The authors also compute individual R² contributions (Table III) and report that hemoglobin alone accounts for 8.1 % of the variance, while the other variables contribute between 1.5 % and 7.3 %. A side‑by‑side comparison (Table IV) shows that both the SVD‑based similarity scores and the regression‑based R² point to hemoglobin as the most influential predictor.
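For readers who want to see how coefficients and an overall R² of this kind are obtained, here is a minimal least-squares sketch. It assumes synthetic individual-level data with five predictors; the sample size, the "true" coefficients, and the function name `fit_ols` are all illustrative assumptions and bear no relation to the values in Tables II and III.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y = b0 + X b by ordinary least squares; return coefficients and R^2."""
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)          # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation
    return beta, 1.0 - ss_res / ss_tot

# Synthetic data (assumption): 200 observations, 5 predictors X1..X5.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 5))
true_intercept = 3.0
true_beta = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = true_intercept + X @ true_beta + rng.normal(scale=0.5, size=n)

beta_hat, r2 = fit_ols(X, y)
print("intercept:", round(float(beta_hat[0]), 3))
print("slopes:   ", np.round(beta_hat[1:], 3))
print("R^2:      ", round(r2, 3))
```

The R² computed here is an in-sample quantity: it describes fit to the data used for estimation and says nothing by itself about predictive performance on new births.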
In the discussion the authors argue that increasing maternal hemoglobin through better nutrition could reduce the incidence of low birth weight, and they list a series of public‑health recommendations (prenatal check‑ups, iron supplementation, control of maternal weight, etc.).
While the study’s aim is commendable, several methodological shortcomings limit the credibility of its conclusions. First, the dataset is effectively reduced to five aggregated observations; the lack of raw case numbers, standard deviations, or any measure of sampling error makes it impossible to assess statistical significance or to perform any robust validation. Second, the use of SVD for variable importance is unconventional. SVD is a matrix factorisation technique primarily used for dimensionality reduction, noise filtering, or latent‑semantic analysis; interpreting cosine similarity between reduced‑space vectors as a direct measure of explanatory power lacks a solid statistical foundation. Third, the regression analysis shows severe multicollinearity: variance inflation factors (VIF) for X₂ and X₅ exceed 70, indicating that the coefficient estimates are unstable and that their variances are heavily inflated. No remedial steps (e.g., ridge regression, variable selection, or principal component regression) are reported. Fourth, the authors provide no out‑of‑sample validation, cross‑validation, or bootstrapping to test the generalisability of either model. Consequently, the reported R² = 0.738 may be an over‑optimistic estimate of predictive performance.
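The variance inflation factor mentioned above is defined as VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on all the other predictors. The sketch below computes it with plain NumPy on synthetic data in which two columns are deliberately made almost collinear; the data and the function name `vif` are assumptions for illustration only.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X (n x p)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (assumption): columns 0 and 1 are nearly collinear,
# column 2 is independent of both.
rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.1 * rng.normal(size=n), rng.normal(size=n)])

vifs = vif(X)
for j, v in enumerate(vifs):
    print(f"column {j}: VIF = {v:.1f}")
```

A common rule of thumb treats values above roughly 10 as a sign of problematic collinearity, so values above 70 indicate that the affected coefficients are very poorly determined.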
Finally, the manuscript suffers from numerous typographical, formatting, and citation errors, which further hinder reproducibility. In sum, the paper demonstrates an interesting attempt to combine linear algebraic decomposition with classical regression, but the execution falls short of the standards required for reliable inference. Future work should (a) collect a larger, individual‑level dataset with proper measurement of variability, (b) justify the choice of SVD for variable ranking or replace it with more appropriate techniques such as principal component analysis coupled with regression, (c) address multicollinearity through regularisation or variable selection, and (d) validate the models on independent data. Only with these improvements can the claim that maternal hemoglobin is the primary driver of birth weight be substantiated.
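Recommendations (c) and (d) could be combined along the following lines: fit a ridge regression and choose its penalty by k-fold cross-validation. This is a sketch under assumptions, not a prescription. The data are synthetic with two nearly collinear predictors, and the names `ridge_fit` and `kfold_mse` as well as the grid of penalty values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring removes the intercept
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)       # everything not in the held-out fold
        b0, b = ridge_fit(X[train], y[train], lam)
        pred = b0 + X[fold] @ b
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data (assumption): columns 0 and 1 are nearly collinear.
rng = np.random.default_rng(2)
n = 150
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.05 * rng.normal(size=n), rng.normal(size=n)])
y = 1.0 + X @ np.array([1.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=n)

cv_scores = {lam: kfold_mse(X, y, lam) for lam in (0.0, 0.1, 1.0, 10.0)}
best_lam = min(cv_scores, key=cv_scores.get)
for lam, mse in cv_scores.items():
    print(f"lambda = {lam:>4}: cross-validated MSE = {mse:.4f}")
print("selected lambda:", best_lam)
```

With `lam = 0` the fit reduces to ordinary least squares, so the same routine gives an out-of-sample error estimate for the unregularised model as well.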