Perhaps some people might be interested in our latest preprint:
I can see at least a couple of potential problems with this paper, at first glance:
- I’m not at all sure that Linear Regression (LR), Structural Equation Modeling (SEM), and Machine Learning (ML) are collectively exhaustive as approaches to this sort of problem. I’m very sure that ML is not the only non-linear statistical approach available.
- From what I have heard of ML, it tends to be substantially better at predicting the dataset it was trained on than at predicting new, unseen data. It is therefore good practice to hold back a substantial portion of the available data, train the model on the remainder, and then test the model against the held-out portion. I did not (unless I missed it) see any indication in the paper that this was done.
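For readers unfamiliar with the practice, the hold-out idea is simple to sketch. This is illustrative plain Python, not code from the paper; the function name and fraction are my own choices:

```python
import random

def train_test_split(rows, test_frac=0.2, seed=0):
    """Shuffle the rows and hold back a fraction as an untouched test set.

    The model is trained only on the first return value; the second is
    used purely to check generalization to unseen data.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

train, test = train_test_split(list(range(100)))
```

With 100 rows and `test_frac=0.2`, this leaves 80 rows for training and 20 held out for evaluation, and no row appears in both sets.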
I think this is what they did to address that:
We used the TensorFlow (39) Adam optimizer with 10-fold cross validation to train both LR and NN models. The TensorFlow error function mirrors ordinary least squares estimation, which is commonly used to train LR. All performance was computed using 10-fold cross-validated predictions. One tenth of rows were held out as the validation set and the remaining observations were used as the training set. Ten models were trained with different hold-out sets, such that each observation was in the validation set once.
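The 10-fold scheme described above can be sketched in plain Python. This is an illustration of the general technique, not the paper's actual code, and the round-robin fold assignment is an assumption:

```python
def kfold_indices(n_rows, k=10):
    """Yield (train, validation) index lists for k-fold cross validation.

    Each row is assigned to exactly one fold (round-robin here), so every
    observation appears in the validation set exactly once across folds.
    """
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    for held_out in range(k):
        val = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, val

splits = list(kfold_indices(25, k=10))
```

Ten (train, validation) splits come back; concatenating the ten validation lists recovers every row exactly once, which is the property the quoted passage describes.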
You’ll have to read more closely, then.
Our key point doesn’t require exhaustive application of methods.
I read it through (understood almost all of it) and I have a comment and a question:
Comment: This is really cool. Now that I’m working in institutional research (doing a fair amount of higher ed data science) I have been thinking about a similar issue in student success and outcome predictions. We are often using race as an “easy” variable, knowing that it’s a proxy for a lot of things. What I’m seeing, though, is that there is an increasing number of students who are either not reporting race or reporting “two or more”. In light of this, and of all our discussions about race from a genetic perspective, I’m wondering if higher ed needs to work on retiring race as a major variable of interest and replacing it with more complicated variables (multiple, non-linear) that are more meaningful for intervention, policy, and determining causation.
Question: I’ve done a little bit of modeling (all linear so far) and don’t know a lot about neural networks yet, but my understanding is that NN models are more “black box”, hence the “hidden” layer, which will hopefully give you a better r^2 but makes it harder to actually know how the variables are related to each other. It seems like this would make things like policy suggestions or meaningful interventions, as opposed to passively being able to make predictions, more difficult. Thoughts?
So I see. My apologies.
Yes, but my point is that you would be conflating improvement in fit due to neural models over non-neural models with improvement due to relaxing the constraint of linearity (which is available even in non-neural models).
Except our point has almost nothing to do with this, certainly nothing directly. Whether one uses neural networks or nonlinear models of a different sort, our point still stands: even subtle nonlinearities better untangle these relationships.
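For what it’s worth, relaxing the linearity constraint without any neural network can be as simple as a polynomial least-squares fit. Here is a plain-Python sketch using the normal equations; the function names are illustrative and this is not code from the paper:

```python
def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations:
    a model nonlinear in x that is still not a neural network."""
    n = degree + 1
    # Normal equations A @ coef = b for the Vandermonde design matrix.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, n))) / A[r][r]
    return coef  # coef[i] multiplies x**i

# Data generated from y = 1 + 2x + 3x^2 is recovered exactly.
coef = fit_poly([0, 1, 2, 3], [1, 6, 17, 34], degree=2)
```

The quadratic term captures curvature that a straight-line LR would miss, which is the sense in which nonlinearity is available outside of neural models.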
@jordan this would be another domain worth applying the idea to.
A comment:
In contrast to linear models, non-linear models ranked income, neighborhood disadvantage, and experiences of discrimination higher in importance while modeling birthweight than race.
I found this sentence from the abstract hard to parse. I suggest …
In contrast to linear models, non-linear models ranked income, neighborhood disadvantage, and experiences of discrimination higher in importance than race while modeling birthweight.
My take is that the NN found the relationships to the more meaningful variables (income, neighborhood, discrimination), rather than to race as a marker for those other factors. The model fit is not that different (an R^2 of 0.17 versus 0.14 is not a big deal), but finding the more meaningful relationship is pretty nifty.
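One common way nonlinear models are given variable “importance” scores is permutation importance: shuffle one column and measure how much prediction error rises. I don’t know that this is the paper’s exact method, so treat this plain-Python sketch as illustrative:

```python
import random

def permutation_importance(predict, X, y, col, seed=0):
    """Importance of column `col` = increase in mean squared error
    after shuffling that column's values across rows.

    A column the model truly relies on hurts accuracy when scrambled;
    a column it ignores leaves the error unchanged.
    """
    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    base = mse(X)
    rng = random.Random(seed)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return mse(X_perm) - base

# Toy model that uses only column 0; column 1 should score zero.
X = [[float(i), float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
imp0 = permutation_importance(lambda r: r[0], X, y, col=0)
imp1 = permutation_importance(lambda r: r[0], X, y, col=1)
```

In the toy example the ignored column gets an importance of exactly zero, while the used column gets a positive score; an analogous ranking over income, neighborhood, and race variables is the kind of result the abstract describes.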
@Michael_Okoko just posted an article about race as a marker variable which makes a similar point in a different way. Race is a proxy for more meaningful factors, and we like our models to be meaningful.