Spatial statistical machine learning models to assess the relationship between development vulnerabilities and educational factors in children in Queensland, Australia

Background

The health and development of children during their first year of full-time school is known to affect their social, emotional, and academic capabilities throughout and beyond early education. Physical health, motor development, social and emotional well-being, learning styles, language and communication, cognitive skills, and general knowledge are all considered important aspects of a child's health and development. Many organisations and governmental agencies therefore seek to continually improve their understanding of the factors that determine or influence development vulnerabilities among children. This article studies the relationships between development vulnerabilities and educational factors among children in Queensland, Australia.

Methods

Spatial statistical machine learning models are reviewed and compared in the context of a study of geographic variation in the association between development vulnerabilities and attendance at preschool among children in Queensland, Australia. A new spatial random forest (SRF) model is suggested that can explain more of the spatial variation in data than other approaches.

Results

In the case study, spatial models provided a better fit than models that ignored the spatial variation in the data. The SRF model was the only model that explained all of the spatial variation in each of the development vulnerabilities considered. The spatial analysis revealed that attendance at preschool strongly influences the physical health domain vulnerability and the emotional maturity vulnerability among children in their first year of school.

Conclusion

This study confirmed that it is important to take into account the spatial nature of data when fitting statistical machine learning models.
A new spatial random forest model was introduced and was shown to explain more of the spatial variation and provide a better model fit in the case study of development vulnerabilities among children in Queensland. At the small-area population level, increased attendance at preschool was strongly associated with reduced physical and emotional development vulnerabilities among children in their first year of school. Supplementary Information The online version contains supplementary material available at 10.1186/s12889-022-14541-7.

Moran's I is defined as

$$ I = \frac{n}{S} \, \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} z_i z_j}{\sum_{i=1}^{n} z_i^2}, $$

where $z_i = x_i - \bar{x}$ and $S = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}$. Here, $x_i$ is the independent variable, $\bar{x}$ is the associated sample mean, and $w_{ij}$ is an element of the spatial weight matrix, which shows the degree of spatial connection between regions i and j. There are two types of spatial weight matrix [17]. One type is based on distances among observations (e.g., inverse distances raised to some power, bandwidth as the nth nearest neighbour distance, ranked distances, spatial kernel functions), and the second type is calculated using contiguity relationships. Nearest neighbours typically supply enough spatial dependence information in spatial modelling and prediction [9]; for individual-tree data, for example, Voronoi cells are one of the optimal ways of identifying trees' nearest neighbours. In a contiguity matrix, a value of 0 indicates no common border between two observations and 1 indicates the presence of a common border; these are the values used in the formula above [14].

Tango's MEET

Let $c_i$ be the observed count in region i, $n_i$ the population size of region i, C the total number of observations, N the total population, $d_{ij}$ the distance between regions i and j, and $u_{j(i)}$ the population size in region i and its j nearest neighbours. Tango's excess events test (EET) can then be defined as

$$ \mathrm{EET} = \sum_{i} \sum_{j} w_{ij} \left( c_i - \frac{n_i C}{N} \right) \left( c_j - \frac{n_j C}{N} \right), $$

where $w_{ij}$ is typically defined as $w_{ij} = e^{-4 (d_{ij}/\lambda)^2}$ [26] or $w_{ij} = e^{-d_{ij}/\lambda}$ [28], with $\lambda$ a measure of the spatial scale of clustering. Another weight function based on nearest neighbours is $w_{ij} = (1/l)^s$, where l indicates the closest area to area i. A second choice is the adjacent-neighbourhood weight, which takes the value 1 if areas i and j are neighbours or i = j, and 0 otherwise. Finally, a population-based weight is another possible weight function, which uses the product of the proportions of the corresponding counties in the total population, $w_{ij} = (n_i/N)(n_j/N)$ [26, 28].
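As a concrete illustration, Moran's I can be computed directly from a vector of values and a contiguity matrix. This is a minimal numpy sketch; the four-region example is hypothetical:

```python
import numpy as np

def morans_i(x, w):
    """Moran's I for values x and spatial weight matrix w."""
    n = len(x)
    z = x - x.mean()               # deviations from the sample mean
    s = w.sum()                    # S = sum of all weights
    return (n / s) * (z @ w @ z) / (z @ z)

# Toy example: four regions on a line, contiguity (common-border) weights.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(round(morans_i(x, w), 3))
```

A monotone spatial trend like this one yields a positive value, indicating positive spatial autocorrelation.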

B Statistical machine learning algorithms

Random Forest
A popular type of machine learning method is the decision tree, which has been shown to be fast and flexible, and can deal with large amounts of data [27]. A decision tree is built by grouping data into more homogeneous groups using a recursive binary partitioning algorithm, with each binary split selected according to specified splitting criteria. The random forest (RF) approach suggested by Breiman [5] creates and combines a large number of individual decision trees generated from the data set of interest. Random forests address some of the drawbacks of a single classification and regression tree (CART), such as over-fitting, sensitivity to correlated variables, and the large sets of splits needed to describe non-linear relationships. The RF is a combination of random sub-space methods [22] and bagging [3]. Every decision tree is generated by randomly sampling a proportion (usually approximately two-thirds) of the training data and leaving the remainder out of training. Furthermore, at each decision node, only a subset of features is picked at random during the tree construction process. The final result is generated using the majority vote (classification) or the average prediction of all trees (regression). At the same time, the data left out of training for each tree are used to compute an overall goodness-of-fit assessment called the out-of-bag (OOB) error estimate [5]. The importance of the covariates can be calculated using the OOB error; the most effective approach is to use the increase in mean squared error (iMSE) measure, in which the values of each feature are permuted randomly and the resulting increase in the OOB error is calculated. This method can be used for both regression and classification problems in many different fields [29, 20, 1, 21, 2].
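The OOB error and permutation-based variable importance described above can be sketched as follows. This is an illustration using scikit-learn rather than the R packages used in the case study, and the simulated data are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=300)  # X[:, 2] is pure noise

# oob_score=True computes the out-of-bag goodness-of-fit estimate.
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB R^2: {rf.oob_score_:.2f}")

# Permutation importance mirrors the iMSE idea: shuffle one feature at a
# time and measure how much the prediction error increases.
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean.round(2))  # feature 0 should dominate
```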

Spatial decision tree
Breiman et al. [5] presented CART (Classification And Regression Trees) as a statistical approach for building tree predictors for regression and classification. In the classification instance, each observation is defined by input variables collected in a vector X and a binary label Y that serves as the output (or response) variable. The general principle of CART is to recursively partition the input space using binary splits and then determine an optimal partition for prediction. The traditional representation of the model connecting Y to X is a tree that represents the model's underlying creation process. If the explanatory variables are spatial coordinates, we obtain a spatial decision tree, which induces a tessellation of the space of the X variables. A cell of this tessellation corresponds to a leaf of the decision tree. For a leaf of this tree, the response variable Y is constant and corresponds to the majority label of the observations belonging to that leaf. In the spatial database approach, classification is viewed as an organisation of items using their characteristics and their neighbours' properties. These might be direct neighbours, neighbours of neighbours, and so on, up to degree N. The presence of geographic variables in the data differentiates a spatial decision tree from an ordinary decision tree, but the phases are the same. In a spatial decision tree, as in an ordinary decision tree, entropy is used to assess a sample data set's heterogeneity (diversity); entropy increases as the data set becomes more varied [12, 30].
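When the explanatory variables are the spatial coordinates alone, a fitted tree tessellates the plane into rectangular cells, one per leaf, each predicting the majority label of the observations it contains. A minimal scikit-learn sketch on hypothetical data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(500, 2))               # (u, v) spatial coordinates
labels = (coords[:, 0] + coords[:, 1] > 10).astype(int)  # a simple spatial pattern

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(coords, labels)

# Each leaf is a rectangular cell of the tessellation; its prediction is the
# majority label of the training observations that fall in the cell.
print("number of cells (leaves):", tree.get_n_leaves())
print("accuracy on training data:", round(tree.score(coords, labels), 2))
```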

Geographical Random forest
To understand more about this algorithm, we use a regression equation of the form

$$ Y_i = a(x_i) + e, \qquad (3) $$

where $Y_i$ is the ith observation's value of the response variable, $a(x_i)$ is the RF's non-linear prediction based on a set of covariates x, and e is the error term. For GRF, equation (3) is extended as follows:

$$ Y_i = a(u_i, v_i)(x_i) + e, \qquad (4) $$

where $a(u_i, v_i)(x_i)$ is the RF model prediction for location i and $(u_i, v_i)$ are the spatial coordinates. For each location, a sub-model is created that takes into account only nearby observations [9]. The area in which the sub-model operates is called the neighbourhood (or kernel), and the maximum distance between a location i and its neighbourhood is called the bandwidth [15]. The neighbourhood is defined either by the n nearest neighbours, with neighbours defined as spatial units that share an edge or vertex [24], or by a circle whose radius is the bandwidth [6]. Two kinds of kernels are commonly used, "adaptive" and "fixed" [13]; an adaptive kernel is beneficial when sampling density varies across space [9].
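The per-location sub-models can be sketched as follows: for each prediction location, an RF is trained only on that location's n nearest neighbours (an adaptive kernel). This is a simplified Python illustration with simulated data, not the implementation used in the case study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

def grf_predict(coords, X, y, query_coords, query_X, bandwidth=50):
    """Adaptive-kernel GRF sketch: one local RF per query location,
    trained on its `bandwidth` nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=bandwidth).fit(coords)
    _, idx = nn.kneighbors(query_coords)
    preds = np.empty(len(query_coords))
    for q, neighbours in enumerate(idx):
        local_rf = RandomForestRegressor(n_estimators=50, random_state=0)
        local_rf.fit(X[neighbours], y[neighbours])   # fit on nearby data only
        preds[q] = local_rf.predict(query_X[q:q + 1])[0]
    return preds

rng = np.random.default_rng(2)
coords = rng.uniform(0, 1, size=(200, 2))
X = rng.normal(size=(200, 2))
# The relationship varies over space: the slope depends on the u coordinate.
y = (1 + coords[:, 0]) * X[:, 0] + 0.05 * rng.normal(size=200)

preds = grf_predict(coords, X, y, coords[:5], X[:5])
print(preds.round(2))
```

Because each sub-model sees only local data, it can adapt to the spatially varying slope that a single global RF would average away.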

Neural Network
Like the RF, neural networks (NNs) provide more representational flexibility and freedom from the constraints of a linear model [25]. A NN is made up of many, perhaps millions, of tightly linked basic processing nodes. The majority of today's neural networks are structured into layers of nodes and are "feed-forward," meaning that data flow in just one direction through them. A single node may be linked to multiple nodes in the layer below it, from which it receives data, as well as several nodes in the layer above it, to which it transmits data. A node assigns a numerical "weight" to each of its incoming connections. Over each connection, the node receives a data item (a number) and multiplies it by the associated weight; the resulting products are then added together to produce a single number. If that number falls below a certain threshold, the node does not send any data to the next layer. An input, a hidden, and an output layer make up a basic NN. The connection weights of a basic NN from the hidden to the output layer can be interpreted as the coefficients of a linear model of non-linearly transformed variables, namely the outputs of the hidden neurons [10].
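The weighted-sum-and-threshold behaviour described above can be written in a few lines. This is a toy forward pass with hypothetical weights, not a trained network:

```python
import numpy as np

def forward(x, w_hidden, w_out, threshold=0.0):
    """Minimal feed-forward pass: each node sums its weighted inputs and
    passes the result on only if it exceeds the threshold."""
    h = w_hidden @ x                     # weighted sums at the hidden layer
    h = np.where(h > threshold, h, 0.0)  # nodes below threshold send nothing
    return w_out @ h                     # weighted sum at the output node

x = np.array([1.0, 2.0])
w_hidden = np.array([[0.5, -0.2],        # 3 hidden nodes, 2 inputs
                     [-1.0, 0.3],
                     [0.2, 0.4]])
w_out = np.array([1.0, 1.0, -0.5])
print(forward(x, w_hidden, w_out))
```

Note that the output is a linear combination of the hidden-node outputs, matching the interpretation of the hidden-to-output weights as linear-model coefficients.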

Neural Network for Spatial Data
The key distinction between a GWANN and a basic ANN is that a GWANN calculates an error signal using a geographically weighted error function rather than the standard quadratic error function [10, 7]. The geographically weighted error function is given by

$$ E = \frac{1}{n} \sum_{i=1}^{n} v_i \, (t_i - o_i)^2, \qquad (5) $$

where $t_i$ is the target value, $o_i$ the output of output neuron i, $v_i$ the geographical weight based on the distance between the observation and the location of output neuron i, and n the number of targets. In equation (5), the difference between the output neuron's value and the target value is weighted by the spatial distance between the output neuron's location and the observation; when the output neuron's location and the observation are close, the difference is given more weight than when they are farther apart. A 10-fold cross-validation (CV) is typically used to determine the number of GWANN training iterations. The models are trained within each fold until their performance on the current fold's test data has not improved for many iterations; the additional iterations are designed to give networks a chance to break free of local minima. This method, known as "early stopping with patience" [4], significantly reduces the chance of overfitting the training data. The iteration with the best mean performance across all folds, as well as the performance value obtained, are then reported.
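Such a geographically weighted error can be sketched as squared differences down-weighted by distance. The Gaussian kernel and all values below are illustrative assumptions, not the weighting used in the cited GWANN implementation:

```python
import numpy as np

def gw_error(targets, outputs, neuron_locs, obs_loc, bandwidth=1.0):
    """Geographically weighted quadratic error (sketch): squared errors
    weighted by a Gaussian kernel of the distance between each output
    neuron's location and the observation's location."""
    d = np.linalg.norm(neuron_locs - obs_loc, axis=1)
    v = np.exp(-(d / bandwidth) ** 2)        # closer locations weigh more
    return np.mean(v * (targets - outputs) ** 2)

targets = np.array([1.0, 2.0, 3.0])
outputs = np.array([1.1, 1.8, 3.5])
neuron_locs = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
obs_loc = np.array([0.0, 0.0])               # the observation's location
print(round(gw_error(targets, outputs, neuron_locs, obs_loc), 4))
```

The largest raw error (0.5, at the third neuron) contributes almost nothing because that neuron is farthest from the observation.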

Relative importance
Garson [8] devised a method for calculating the relative importance of each of the input variables based on the connection weights. In this algorithm, each variable's input is stored as a weight in the network model, and the contribution of each of these variables to the output is largely determined by the magnitude and direction of these link weights. A positive connection weight enhances the magnitude of the network output, whereas a negative weight suppresses the value of the response variable [19]. Furthermore, when compared to the other factors, a variable with a considerably larger connection weight is regarded as having a bigger influence on the network output. Thus,

$$ RI_x = \frac{\sum_{y=1}^{Y} \left( |w_{xy}| \, |w_{yz}| \, / \, \sum_{x'} |w_{x'y}| \right)}{\sum_{x''} \sum_{y=1}^{Y} \left( |w_{x''y}| \, |w_{yz}| \, / \, \sum_{x'} |w_{x'y}| \right)}, $$

where $RI_x$ denotes the relative importance of input neuron x, $w_{xy}$ the final weight of the connection from input neuron x to hidden neuron y, $w_{yz}$ the final weight of the connection from hidden neuron y to the output neuron z, and Y the total number of hidden neurons.
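Garson's computation can be carried out directly on the weight matrices. A numpy sketch with hypothetical weights for a network with three inputs, two hidden neurons, and one output:

```python
import numpy as np

def garson(w_xy, w_yz):
    """Garson's algorithm (sketch): relative importance of each input from
    the absolute connection weights of a one-hidden-layer network.
    w_xy: (inputs, hidden) input-to-hidden weights
    w_yz: (hidden,) hidden-to-output weights"""
    c = np.abs(w_xy) * np.abs(w_yz)   # contribution of each input via each hidden node
    c = c / c.sum(axis=0)             # share of each input within a hidden node
    ri = c.sum(axis=1)
    return ri / ri.sum()              # normalise so importances sum to 1

w_xy = np.array([[2.0, 0.5],
                 [0.5, 0.5],
                 [0.1, 0.1]])         # 3 inputs, 2 hidden nodes
w_yz = np.array([1.0, -1.0])
print(garson(w_xy, w_yz).round(3))
```

Note that only the magnitudes of the weights enter the calculation, so Garson's measure reports strength of influence but not its direction.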

Linear Model
Given a dependent or response vector y and a matrix of input variables X, a general linear model can be expressed as

$$ y = X\beta + \epsilon, $$

where β is a vector of regression coefficients and ϵ is a vector of residuals. In normal linear regression, ϵ is assumed to have a zero-mean Gaussian distribution. The matrix X contains independent variables or covariates {X_1, X_2, ..., X_p}, as well as any interactions or functions of these variables (e.g., quadratic or cubic terms to allow for non-linear relationships between the independent and response variables).
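For instance, such a model with a quadratic term can be fitted by ordinary least squares in a few lines (a numpy sketch on simulated data; the coefficients are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1 ** 2])   # intercept, linear, quadratic term
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)     # zero-mean Gaussian residuals

# Ordinary least squares estimate of beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat.round(2))
```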
Linear Models for Spatial Data

Generalized additive models (GAMs) are non-parametric extensions of linear regression in which non-parametric smoothers f_j are applied to each predictor and the component responses are combined additively, i.e.,

$$ y = \beta_0 + \sum_{j=1}^{p} f_j(X_j) + \epsilon. $$

Spatial regression
The general form of the spatial regression model known as SARMA (Spatial Autoregressive Moving Average [11]) is

$$ Y = \rho W Y + X\beta + u, \qquad u = \lambda W u + \epsilon, $$

where Y is the dependent variable, X is the matrix of independent variables, ρ is the spatial autoregressive parameter, W is a spatial weights matrix, β is a regression coefficient vector, λ is a spatial error coefficient, and ϵ is a residual vector.

Conditional Autoregressive Model (CAR)
A popular spatial prior proposed by Leroux [16] includes a spatial random effect ψ such that

$$ \psi \sim N(0, D), $$

with covariance matrix D, where D is usually described by its generalized inverse [16]:

$$ D^{-} = \frac{1}{\sigma^2} \left[ \rho R + (1 - \rho) I \right]. $$

Here, R is the intrinsic autoregression matrix representing the neighbourhood structure of the regions, with typical element $R_{ij}$ equal to $n_i$ when i = j, where $n_i$ is the number of neighbours of region i, and $-I(i \sim j)$ otherwise, where $I(i \sim j)$ is an indicator function taking the value 1 when i and j are neighbours.
The term ρ ∈ [0, 1] is a spatial dependence parameter whose two extreme cases give rise to the independence model (i.e., $\psi_i = v_i$ and $D = \sigma^2 I$) and the intrinsic autoregression (i.e., $\psi_i = u_i$ and $D = \sigma^2 R^{-}$), respectively [16]. The spatial residual is typically assumed to follow an independent normal distribution, $v_i \sim N(0, \sigma_v^2)$. For ρ close to 1, the conditional variance approaches $\sigma^2 / n_i$, and for ρ close to 0 it approaches $\sigma^2$, that is, it is independent of the number of neighbours $n_i$ [23]. The univariate full conditional distribution for $\psi_i \mid \psi_{-i}$ can be written as

$$ \psi_i \mid \psi_{-i} \sim N\!\left( \frac{\rho \sum_{j \sim i} \psi_j}{\rho n_i + 1 - \rho}, \; \frac{\sigma^2}{\rho n_i + 1 - \rho} \right), $$

where $\psi_{-i}$ denotes the random effect vector with the ith component deleted.
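The full conditional of the Leroux prior can be sketched as a small function; the adjacency matrix and parameter values below are illustrative:

```python
import numpy as np

def car_full_conditional(i, psi, A, rho, sigma2):
    """Full conditional of psi_i under the Leroux CAR prior:
    mean = rho * sum of neighbouring psi_j / (rho * n_i + 1 - rho),
    var  = sigma2 / (rho * n_i + 1 - rho).
    A is the binary adjacency (neighbour) matrix."""
    n_i = A[i].sum()
    denom = rho * n_i + 1 - rho
    mean = rho * (A[i] @ psi) / denom
    var = sigma2 / denom
    return mean, var

# Four regions on a line: 1-2-3-4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
psi = np.array([0.5, -0.2, 0.1, 0.4])
mean, var = car_full_conditional(1, psi, A, rho=0.8, sigma2=1.0)
print(round(mean, 3), round(var, 3))
```

The limiting cases behave as described: with rho = 1 the variance is sigma2 / n_i (intrinsic autoregression), and with rho = 0 it is sigma2 with zero mean (independence).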
C Hyper-parameters of the models used for the case study

Random Forest
The most important hyper-parameters for the random forest are mtry and ntree, which were tuned using 10-fold cross-validation with three repeats (random search) in the caret package. We found mtry = 3; to find ntree, we tried different values (500, 1000, 1500, 2000, 2500) and calculated the RMSE. The optimal number of trees was 1000.
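An analogous search can be sketched in Python, where scikit-learn's max_features and n_estimators play the roles of mtry and ntree. This is an illustrative sketch on simulated data, not the caret workflow used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# max_features ~ mtry, n_estimators ~ ntree; k-fold CV scores each pair by RMSE.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_features": [1, 2, 3],
                "n_estimators": [100, 300]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)
```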

Spatial Random Forest
The same method was used to find the optimal parameters; we found mtry = 3 and ntree = 1000.

Geographical Random Forest
For the geographical random forest, we used the adaptive kernel, which is appropriate when density differs across space, as is the case in our dataset. The bandwidth was set to 400, which gave the lowest RMSE, with mtry = 3 and ntree = 1000.

Neural network

hidden = c(10, 3), threshold = 0.01, stepmax = 100,000.

D Spatial maps
The actual and estimated responses using the Spatial Random Forest (SRF), Geographical Random Forest (GRF), and CAR models, where A: the actual responses; B: the estimated responses from the SRF model; C: the estimated responses from the GRF model; D: the estimated responses from the CAR model.