Perhaps some people might be interested in our latest preprint:
I can see at least a couple of potential problems with this paper, at first glance:

I’m not at all sure that Linear Regression (LR), Structural Equation Modeling (SEM) and Machine Learning (ML) are collectively exhaustive as approaches to this sort of problem. I’m very sure that ML is not the only nonlinear statistical approach available.

From what I have heard of ML, it tends to be substantially better at predicting the dataset it was trained on than at predicting new data it was not trained on. It is therefore good practice to hold back a substantial amount of the available data, train the model on the remainder, and then test the model against the held-out portion. I did not (unless I missed it) see any indication in the paper that this was done.
I think this is what they did to address that:
We used the TensorFlow (39) Adam optimizer with 10-fold cross-validation to train both
LR and NN models. The TensorFlow error function mirrors ordinary least squares
estimation, which is commonly used to train LR. All performance was computed using
10-fold cross-validated predictions. One tenth of rows were held out as validation set and the
remaining observations are used as the training set. Ten models were trained with different
holdout sets, such that each observation was in the validation set once.
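For anyone who wants to see the splitting scheme in that quote concretely, here is a minimal sketch in plain Python. It is not the authors' code: the function name `k_fold_indices`, the shuffling, and the seed are my own assumptions. It only shows how ten holdout sets can be built so that every row lands in the validation set exactly once.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once.

    Hypothetical helper for illustration, not taken from the preprint.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        # Training rows are everything outside the current holdout fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


# Example: 25 rows split into 10 folds.
for train_idx, val_idx in k_fold_indices(25, k=10):
    # A model would be fit on train_idx and scored on val_idx here.
    assert not set(train_idx) & set(val_idx)
```

Performance is then computed from the held-out predictions only, which is what addresses the overfitting concern raised above.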
You’ll have to read closer then.
Our key point doesn’t require exhaustive application of methods.
I read it through (understood almost all of it) and I have a comment and a question:
Comment: This is really cool. Now that I’m working in institutional research (doing a fair amount of higher ed data science) I have been thinking about a similar issue in student success and outcome predictions. We are often using race as an “easy” variable, knowing that it’s a proxy for a lot of things. What I’m seeing, though, is that an increasing number of students are either not reporting race or reporting “two or more”. In light of this and all our discussions about race from a genetic perspective, I’m wondering if higher ed needs to work on replacing race as a major variable of interest with more complicated variables (multiple, nonlinear) that are more meaningful for intervention, policy, and determining causation.
Question: I’ve done a little bit of modeling (all linear so far) and don’t know a lot about neural networks yet, but my understanding is that NN models are more “black box”, hence the “hidden” layer, which will hopefully give you a better R^2 but makes it harder to actually know how the variables are related to each other. It seems like this would make things like policy suggestions or meaningful interventions, as opposed to passively being able to make predictions, more difficult. Thoughts?
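One common, model-agnostic way to get variable rankings out of a black-box model is permutation importance: shuffle one column at a time and see how much the prediction error grows. I don't know whether this is the ranking method used in the preprint, so treat the following as a rough sketch under that caveat. The names `mse`, `permutation_importance`, and `black_box` are hypothetical, and the toy data is synthetic.

```python
import random


def mse(predict, X, y):
    """Mean squared error of a prediction function over rows X."""
    return sum((predict(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, seed=0):
    """Increase in MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        # Rebuild the rows with only this one column permuted.
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        importances.append(mse(predict, X_perm, y) - baseline)
    return importances


# Synthetic toy data: the outcome depends nonlinearly on column 0 only.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [row[0] ** 2 for row in X]


def black_box(row):
    # Stand-in for any fitted model's predict function.
    return row[0] ** 2


importances = permutation_importance(black_box, X, y)
```

This says how much a model relies on each variable, not how the variables are causally related, so it only partly addresses the policy question.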
So I see. My apologies.
Yes, but my point is that you would be conflating improvement in fit due to neural models over non-neural models with improvement due to relaxing the constraint of linearity (which is available even in non-neural models).
Except our point has almost nothing to do with this. Certainly nothing directly. Whether one uses neural networks or nonlinear models of a different sort, our point that modeling nonlinearities better untangles these relationships, even when the nonlinearities are subtle, still stands.
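To make the "relaxing linearity without a neural network" idea concrete, here is a small sketch on synthetic data (nothing here comes from the preprint). It fits ordinary least squares twice, once with a linear term only and once with an added squared term, and compares in-sample R^2. The helper names `fit_ols` and `r_squared`, the coefficients, and the noise level are all assumptions chosen for illustration.

```python
import random


def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def r_squared(X, y, beta):
    """In-sample coefficient of determination for a fitted linear model."""
    pred = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic outcome with a mild quadratic component plus noise.
rng = random.Random(0)
x = [rng.uniform(-2, 2) for _ in range(500)]
y = [0.5 * xi + 0.3 * xi ** 2 + rng.gauss(0, 1) for xi in x]

X_lin = [[1.0, xi] for xi in x]             # intercept + linear term
X_quad = [[1.0, xi, xi ** 2] for xi in x]   # adds a squared term
r2_lin = r_squared(X_lin, y, fit_ols(X_lin, y))
r2_quad = r_squared(X_quad, y, fit_ols(X_quad, y))
print(f"linear R^2 = {r2_lin:.3f}, quadratic R^2 = {r2_quad:.3f}")
```

Because the quadratic model nests the linear one, its in-sample R^2 cannot be lower; a fair comparison would use held-out predictions, as in the cross-validation the paper describes.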
A comment:
In contrast to linear models, nonlinear models ranked income, neighborhood disadvantage, and experiences of discrimination higher in importance while modeling birthweight than race.
I found this sentence from the abstract hard to parse. I suggest …
In contrast to linear models, nonlinear models ranked income, neighborhood disadvantage, and experiences of discrimination higher in importance than race while modeling birthweight.
My take is that the NN found the relationships to the more meaningful variables (Income, neighborhood, discrimination), rather than Race as a marker for these other factors. The model fit is not that different, R^2 0.17 versus 0.14 is not a big deal, but finding the more meaningful relationship is pretty nifty.
@Michael_Okoko just posted an article about Race as a marker variable which makes a similar point in a different way. Race is a proxy for more meaningful factors, and we like our models to be meaningful.