Use of Statistical Models in Predicting Groundnut Yield in Relation to Weather Parameters

In Tamil Nadu, groundnut is an essential and major oilseed crop, mainly grown under rainfed conditions. Changes in weather parameters can affect the productivity of groundnut. Hence, crop yield forecasting based on weather parameters is essential for proper planning, decision-making, and buffer stocking policy formulation. For data with multicollinearity, penalized regression models, i.e., Ridge, Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net (ENet), are better alternatives to classical linear regression. Data on weather parameters, namely maximum temperature (Tmax), minimum temperature (Tmin), morning relative humidity (RH I), evening relative humidity (RH II), and rainfall, were collected for 29 years from 1991 to 2019. The weather indices approach was used in this study. The collected data were partitioned into training and testing datasets, and the hyperparameters of the penalized regression models were tuned using cross-validation. The performance of the models was evaluated using the adjusted coefficient of determination (R²adj), Root Mean Squared Error (RMSE), normalized RMSE (nRMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) as the goodness-of-fit criteria. The results revealed that all the penalized regression models provide a better fit to the data. SMLR and ENet were found to predict with better accuracy. Hence, these methods can be used for groundnut yield forecasting during the Kharif season for the Coimbatore district of Tamil Nadu.


INTRODUCTION
Groundnut is one of the most essential oilseed crops of India. In Tamil Nadu, it is grown annually on about 0.35 million hectares with a production of about 0.98 million tons. Within India, Tamil Nadu ranks third in production, contributing 9.74% of the total groundnut production of the country, with an average yield of 2840 kg ha⁻¹ (Directorate of Economics and Statistics, 2019-20). In Tamil Nadu, two-thirds of the groundnut cultivation area is under rainfed conditions and the remaining one-third is under irrigated conditions.
The overall growth of the Indian economy relies on the performance of agriculture, which depends upon the weather conditions every year. A timely and reliable forecast of crop yield is of great importance for monsoon-dependent countries like India, where the economy is mainly based on agricultural production (Vinaya et al., 2017).
Crop yield is influenced by technological change and weather variability. Technological factors increase yield smoothly through time and, therefore, year or some other time parameter can be used to study the overall effect of technology on yield (Agarwal et al., 1980). Generally, there are two approaches to crop yield forecasting: crop simulation models and empirical statistical models (Bocca and Rodrigues, 2016). Though crop simulation models are precise, they are input data-intensive, and the lack of sufficient data sets limits their application to smaller scales rather than regional scales.
Hence, empirical statistical models with simple regression techniques have been largely used as an alternative to process-based simulation models (Lobell and Burke, 2010; Shi et al., 2013). Calibrated and tested statistical models lead to successful crop yield forecasting based on weather parameters. Since groundnut is mainly cultivated under rainfed conditions, weather conditions strongly affect its productivity. Most previous studies have used Multiple Linear Regression (MLR) to develop statistical crop yield prediction models (Rai et al., 2013; Dhekle et al., 2014; Kumar et al., 2014). However, MLR results in over-fitting when (a) the number of samples is less than the number of predictors, or (b) multicollinearity exists, i.e., when the independent variables are correlated (Verma et al., 2016). One of the consequences of multicollinearity is large standard errors of the regression coefficients, which makes inference based on the fitted model inaccurate (Yakubu, 2010; Dormann et al., 2013).
To overcome these drawbacks, feature selection and penalized regression methods such as Stepwise Multiple Linear Regression (SMLR), least absolute shrinkage and selection operator (LASSO), elastic net (ENet) and ridge regression can be used (Das et al., 2017). In this context, the main objective of our study is to develop statistical groundnut yield forecasting models for the Coimbatore district of Tamil Nadu and to select among them based on their predictive performance and efficiency.

Data Collection
Time series data on the yield of groundnut (Arachis hypogaea L.) for the Coimbatore district of Tamil Nadu for 29 years (1991 to 2019) were collected from the Season and Crop Report, Department of Economics and Statistics. Daily weather data were collected from the Agro Climate Research Centre, TNAU. Data on five weather variables, namely maximum temperature (Tmax, °C), minimum temperature (Tmin, °C), morning and evening relative humidity (RH I and RH II, %) and rainfall (mm), were used for a total of 18 weeks of crop cultivation, covering the 14th to 31st standard meteorological week (SMW), because groundnut in the Coimbatore district is usually sown during April-May (Chithiraipattam). Daily data of Tmax, Tmin, RH I and RH II were converted into weekly averages, whereas the weekly sum was used for rainfall. Out of the 29 years of data, 24 years were used for calibration, while the remaining 5 years were used for validation.
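The weekly aggregation and the calibration/validation split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the input layout (a flat list of daily values starting on the first day of SMW 14) and the chronological order of the split are assumptions.

```python
from statistics import mean


def weekly_aggregate(daily_values, how):
    """Collapse daily values into consecutive 7-day blocks.

    how="mean" suits Tmax, Tmin, RH I and RH II; how="sum" suits rainfall.
    Assumes (hypothetically) that the list holds whole weeks only.
    """
    if how not in ("mean", "sum"):
        raise ValueError("how must be 'mean' or 'sum'")
    if len(daily_values) % 7 != 0:
        raise ValueError("expected whole weeks of daily data")
    reducer = mean if how == "mean" else sum
    return [reducer(daily_values[d:d + 7])
            for d in range(0, len(daily_values), 7)]


def split_calibration_validation(yearly_rows, n_calibration=24):
    """Split yearly records into calibration and validation sets.

    Assumption: the first n_calibration years are used for calibration
    and the remaining years for validation.
    """
    return yearly_rows[:n_calibration], yearly_rows[n_calibration:]
```

For 29 yearly records, `split_calibration_validation` returns 24 calibration rows and 5 validation rows.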

Detrending of Yield Time Series Data
Fluctuations in yield data over the years due to technology differences, climatic variability, etc., lead to a nonlinear and non-stationary trend which has to be removed. The correlation between detrended yield and weather parameters is used to calculate the weights for model development (Wu et al., 2007). In the present investigation, a simple linear regression model was applied to detrend the yield of groundnut.
$$Y_t = \beta_0 + \beta_1 t \tag{1}$$
where t is the time period, Y_t is the crop yield at time t, and β0 and β1 are the coefficients. The residuals of this model (the detrended yield) were used for indices calculation (Trnka et al., 2009).
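As an illustration of equation (1), the linear trend can be fitted by ordinary least squares and the residuals kept as the detrended yield. The sketch below is a plain-Python example under the assumption that t runs from 1 to n in yearly steps; the function name is hypothetical.

```python
def detrend_linear(yields):
    """Fit Y_t = b0 + b1 * t by ordinary least squares with t = 1..n.

    Returns (b0, b1, residuals); the residuals are the detrended yield.
    """
    n = len(yields)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(yields) / n
    # Closed-form slope and intercept of simple linear regression.
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, yields))
    b1 = sxy / sxx
    b0 = y_bar - b1 * t_bar
    residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, yields)]
    return b0, b1, residuals
```

By construction the residuals sum to zero, so the detrended series fluctuates around the fitted trend line.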

Weather Indices Approach
For each weather variable, two indices are developed: one as the simple total of the weather parameter over the different weeks and the other as a weighted total, where the weights are the correlation coefficients between detrended yield and the weather variable in the respective weeks.

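Unweighted weather indices: $$Z_{i0} = \sum_{w=1}^{m} X_{iw}$$
Weighted weather indices: $$Z_{i1} = \sum_{w=1}^{m} r_{iw} X_{iw}$$
where
X_{iw} = value of the i-th weather variable in the w-th week
r_{iw} = correlation coefficient of detrended yield with the i-th weather variable in the w-th week
m = week of forecast
Both cases can be written together as $Z_{ij} = \sum_{w=1}^{m} r_{iw}^{j} X_{iw}$: for j = 0 we have the unweighted indices and for j = 1 the weighted indices. In total, 11 weather variables were generated following the above procedure; they are presented in Table 1.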
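A small sketch of how the two indices could be computed for one weather variable is given below. The data layout (`weekly[y][w]` holding the value for year y and week w), the function names and the choice of a zero weight for a week with no variation are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant series gets zero weight
    return sxy / sqrt(sxx * syy)


def weather_indices(weekly, detrended_yield):
    """Return (z0, z1, r) for one weather variable.

    z0[y] is the unweighted index (sum over weeks) for year y,
    z1[y] is the weighted index (sum of r_w * X_w) for year y,
    r[w] is the correlation of week w's values with detrended yield.
    """
    n_weeks = len(weekly[0])
    r = [pearson_r([row[w] for row in weekly], detrended_yield)
         for w in range(n_weeks)]
    z0 = [sum(row) for row in weekly]
    z1 = [sum(rw * x for rw, x in zip(r, row)) for row in weekly]
    return z0, z1, r
```

The weights are computed across years for each week, so a week whose weather moves together with detrended yield contributes more to the weighted index.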

Stepwise Multiple Linear Regression
Multiple Linear Regression (MLR) is the most straightforward approach for the development of statistical models. However, its application to datasets with a large number of explanatory variables is not always successful (Balabin et al., 2011). A stepwise regression procedure was adopted to select the best regression variables among the many independent variables (Singh et al., 2014). A fundamental problem with stepwise regression is that some real explanatory variables that have causal effects on the dependent variable may happen to be statistically insignificant, while nuisance variables may be coincidentally significant (Smith et al., 2018). Hence, we also considered alternative methods, namely penalized regression methods.

Penalized Regression
Penalized regression is a better alternative to the ordinary least squares linear regression model when predictors are correlated. Penalized regression adds a constraint (penalty) to the estimating equation. The consequence of imposing this penalty is to shrink the coefficient values towards zero, which allows the less contributive variables to have a coefficient close to or equal to zero. The logic behind using penalized regression here is to reduce the impact of multicollinearity, since the independent variables in the study are related to one another.

Ridge Regression
Ridge regression shrinks the regression coefficients so that variables with a minor contribution to the outcome have their coefficients close to zero. The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called the L2-norm, which is the sum of the squared coefficients (Hoerl and Kennard, 1970).
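The ridge estimates minimize
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$
where y_i is the dependent variable, x_{ij} is the j-th independent variable, β_j is the corresponding coefficient and λ is the tuning parameter of the L2-norm penalty. A larger value of λ means a greater amount of shrinkage. Ridge regression keeps all the predictors in the model without making any variable selection.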

Lasso Regression (Least Absolute Shrinkage And Selection Operator)
Lasso regression shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called the L1-norm, which is the sum of the absolute values of the coefficients. In lasso regression, the penalty has the effect of forcing some of the coefficient estimates, those with a minor contribution to the model, to be exactly equal to zero (Tibshirani, 1996). One obvious advantage of lasso regression over ridge regression is that it produces simpler and more interpretable models that incorporate only a reduced set of predictors.
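The difference between the two penalties is easiest to see in the special case of orthonormal predictors, where each penalized coefficient is a simple function of its least squares estimate. The sketch below assumes the objectives ½·RSS + (λ/2)·Σβ² for ridge and ½·RSS + λ·Σ|β| for lasso; the function names are hypothetical and the formulas do not hold for correlated predictors.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge estimate for an orthonormal predictor: proportional shrinkage.

    The coefficient is scaled towards zero but never becomes exactly zero.
    """
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso estimate for an orthonormal predictor: soft-thresholding.

    Coefficients with |beta_ols| <= lam are set exactly to zero.
    """
    magnitude = max(abs(beta_ols) - lam, 0.0)
    return magnitude if beta_ols >= 0 else -magnitude
```

With λ = 0.5, a least squares coefficient of 0.3 is removed by the lasso but only scaled down by ridge, which is why lasso performs variable selection and ridge does not.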

Elastic Net Regression
Elastic Net combines the characteristics of both lasso and ridge, i.e., it is penalized with both the L1-norm and the L2-norm (Zou and Hastie, 2005). The consequence of this is to effectively shrink coefficients (as in ridge regression) and to set some coefficients to zero (as in LASSO). Hence, it reduces the impact of different features while not eliminating all of the features (Cho et al., 2009).
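The elastic net estimates minimize
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\Big[\alpha\sum_{j=1}^{p}|\beta_j| + (1-\alpha)\sum_{j=1}^{p}\beta_j^2\Big]$$
where y_i is the dependent variable, x_{ij} is the j-th independent variable, β_j is the corresponding coefficient, λ is the penalty and α (between 0 and 1) sets the mix between the L1 and L2 penalties.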
These methods have two parameters, namely lambda (λ) and alpha (α), which need to be optimized. The overall strength of the penalty is controlled by the tuning parameter λ (Hastie and Qian, 2014). The optimal λ values were selected by minimizing the average mean square error in leave-one-out cross-validation (Piaskowski et al., 2016). The other tuning parameter, α, was set at 0 for ridge, 1 for LASSO and 0.5 for ENet. The data were analyzed using the 'glmnet' R package (Friedman et al., 2009).
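The tuning procedure can be illustrated with a small, self-contained sketch. It is not the 'glmnet' implementation used in the study: it assumes the objective (1/2n)·RSS + λ[α·Σ|β| + ((1−α)/2)·Σβ²], fitted by cyclic coordinate descent on centred (but not standardized) predictors, and all function names are hypothetical. Constant factors in the objective differ between texts and only rescale λ.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def fit_enet(X, y, lam, alpha, n_iter=200):
    """Elastic net by cyclic coordinate descent.

    alpha=0 gives ridge, alpha=1 gives lasso. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residuals for beta = 0
    beta = [0.0] * p
    col_ss = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # constant predictor: leave coefficient at zero
            # Covariance of predictor j with the partial residual.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n
            rho += col_ss[j] * beta[j]
            new_b = soft_threshold(rho, lam * alpha)
            new_b /= col_ss[j] + lam * (1.0 - alpha)
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new_b
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def predict(model, row):
    intercept, beta = model
    return intercept + sum(b * x for b, x in zip(beta, row))


def loocv_mse(X, y, lam, alpha):
    """Mean squared prediction error under leave-one-out cross-validation."""
    errors = []
    for k in range(len(y)):
        model = fit_enet(X[:k] + X[k + 1:], y[:k] + y[k + 1:], lam, alpha)
        errors.append((y[k] - predict(model, X[k])) ** 2)
    return sum(errors) / len(errors)


def select_lambda(X, y, lambdas, alpha):
    """Pick the lambda in the candidate grid with the smallest LOOCV error."""
    return min(lambdas, key=lambda lam: loocv_mse(X, y, lam, alpha))
```

In use, `select_lambda` would be called once per method with α fixed at 0, 1 or 0.5 and a grid of candidate λ values, and the model would then be refitted on the full calibration set with the selected λ.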

Model Performance
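The performance of the developed statistical models was tested using the adjusted R² (R²adj), root mean square error (RMSE), normalised RMSE (nRMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), calculated using the following formulae:
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
$$nRMSE = \frac{RMSE}{\bar{y}} \times 100$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
where y_i = actual value, ŷ_i = model output, ȳ = mean of the actual values, n = number of observations and p = number of predictors. An R²adj close to 1 and an RMSE close to 0 indicate better performance of the developed models. Likewise, the smaller the MAE and MAPE values, the better the fit of the model. According to nRMSE, the model performance is judged as excellent, good, fair and poor when the values are in the range of <10%, 10-20%, 20-30% and >30%, respectively (Jamieson et al., 1991).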
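The goodness-of-fit measures listed above can be computed as in the following sketch. The function names are hypothetical, MAPE is returned as a fraction rather than a percentage, and the treatment of values falling exactly on a class boundary (10, 20 or 30%) is an assumption, since the published ranges do not specify it.

```python
from math import sqrt


def rmse(actual, predicted):
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def nrmse(actual, predicted):
    """RMSE as a percentage of the mean observed value."""
    return 100.0 * rmse(actual, predicted) / (sum(actual) / len(actual))


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, expressed as a fraction."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def adjusted_r2(actual, predicted, n_predictors):
    n = len(actual)
    y_bar = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def rate_nrmse(value):
    """Classify nRMSE (%) as excellent, good, fair or poor."""
    if value < 10:
        return "excellent"
    if value <= 20:  # assumption: boundary values fall in the better class
        return "good"
    if value <= 30:
        return "fair"
    return "poor"
```

Because nRMSE scales the error by the mean observed yield, it allows models to be compared on a unit-free basis.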

Summary Statistics Of Yield Data
The summary statistics of the groundnut yield data (1991-2019) of the Coimbatore district of Tamil Nadu are presented in Table 2. The maximum yield was 2877 kg ha⁻¹, whereas the minimum yield was 1519 kg ha⁻¹. The coefficient of variation of yield was found to be 17.84%. A normal Q-Q plot was constructed to test the normality of the yield data; it confirmed normality, thus satisfying the basic assumption of parametric models (Fig. 1). Figure 2 shows Pearson's coefficient of correlation between all variables. Significant positive correlations (correlation coefficient greater than 0.5) were found between yield and Z11, Z30, Z31 and Z51 (p<0.05). The yield was found to be strongly correlated with those variables (p<0.01) and the correlation coefficients ranged between 0.50 and 0.59 (Iqbal et al., 2019).

Groundnut Yield Forecasting Models
For Multiple Linear Regression (MLR), the regression coefficients along with their standard errors and variance inflation factor (VIF) values are shown in Table 3. All the predictors included in the model together explained 82.15% of the variation. VIF values of more than 5, observed for most of the variables, may be considered a cause for concern, whereas a value of more than 10 indicates severe multicollinearity (Sheather, 2009; Kutner et al., 2004). Since this indicated multicollinearity in the data, we opted for alternative approaches to address the problem.
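For reference, the VIF of predictor j is 1/(1 − R²_j), where R²_j is the coefficient of determination from regressing predictor j on the remaining predictors. The sketch below covers only the special case of two predictors, where R²_j equals their squared Pearson correlation; the function names are hypothetical and the thresholds follow the text above.

```python
def squared_correlation(x1, x2):
    """Squared Pearson correlation between two predictors."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    return (s12 * s12) / (s11 * s22)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r2 = squared_correlation(x1, x2)
    if r2 >= 1.0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / (1.0 - r2)


def vif_severity(vif):
    """Label a VIF value using the thresholds of 5 and 10."""
    if vif > 10:
        return "severe"
    if vif > 5:
        return "concern"
    return "acceptable"
```

With more than two predictors, R²_j has to come from a multiple regression of each predictor on all the others, which this sketch does not implement.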
This study utilized ridge, LASSO and ENet regression and SMLR as alternative methods to MLR due to the presence of multicollinearity. In the regularization techniques (ridge, LASSO and ENet), cross-validation was used to select the optimal lambda (λ min) values (Fig. 3). The results of applying these methods for groundnut yield prediction are shown in Table 4.

Stepwise Multiple Linear Regression (SMLR)
The developed SMLR model explains about 83.76% of the variation in yield due to weather parameters. The most critical parameters identified using SMLR were maximum temperature, followed by morning relative humidity, minimum temperature and rainfall. This model is considered excellent according to its nRMSE value (Jamieson et al., 1991).

Ridge Regression
The ridge regression model retains all the predictors, whereas the other methods use a reduced number of predictors, thus reducing model complexity. Ridge regression explains 87.85% of the variation in yield due to all the predictors. The most influential parameter was found to be maximum temperature, followed by rainfall. The developed ridge model is considered good according to its nRMSE value.

LASSO (Least Absolute Shrinkage and Selection Operator)
In LASSO, feature selection is performed along with regularization of the parameters, which helps prevent the model from overfitting. The developed model explains about 87.46% of the variation in yield. The most influential parameter was found to be maximum temperature, followed by rainfall. The developed model is considered excellent according to its nRMSE value.

Elastic Net Regression
In the ENet method, which is a combination of both ridge and LASSO, maximum temperature was found to be the most critical parameter, followed by rainfall. The developed model explains about 87.48% of the variation in yield. The nRMSE value indicated that the model performance was excellent.
To compare the performance of SMLR and the penalized regression techniques, we used goodness-of-fit measures, i.e., R²adj, RMSE, MAE and MAPE. The adjusted coefficient of determination (R²adj) was significant for all the models included in the study. In terms of R²adj, ridge regression was found to have a better fit than the other models. The RMSE value of 114.40 for SMLR was the lowest, followed by MLR and the other models. The ENet method had the minimum MAE of 149.62, followed by LASSO and SMLR. In terms of MAPE, SMLR had the lowest value of 0.082, followed by MLR and the other models. However, the best prediction is not necessarily provided by the model that fits the calibration data best. Hence, we evaluated the predictive performance of the models using the validation data set and found that the RMSE value of 178.06 for SMLR was the lowest, followed by ENet and the other models. According to the nRMSE values, all models except ridge were considered excellent (<10%), whereas ridge was considered good (10-20%).

CONCLUSION
In the present investigation, five different multivariate models were compared for predicting the groundnut yield of the Coimbatore district during the Kharif season. The classical Multiple Linear Regression (MLR) model showed the presence of multicollinearity. Therefore, the performance of the other models, namely SMLR, ridge, LASSO and ENet, was ranked based on the different goodness-of-fit measures (Table 4). SMLR and ENet provided the better fit to the data, indicating that these models can be used for groundnut yield forecasting in the studied region.

Acknowledgment
The Agro Climate Research Centre, TNAU and Department of Economics and Statistics, Tamil Nadu are duly acknowledged for providing the required weather data and groundnut yield data.

Ethics statement
No specific permits were required for the described field studies because no human or animal subjects were involved in this research.

Consent for publication
All the authors agreed to publish the content.

Competing interests
There was no conflict of interest in the publication of this content.