Precision in Prediction: Groundwater Level Forecasting with Random Forest Regression in Coimbatore's Upper Bhavani River Basin Area

Er.Ravanashree; Dr.Balaji Kannan; Dr.K.Arunadevi; Dr.CS.Sumathi

doi:https://doi.org/10.29321/MAJ.10.701133

Research Article | Open Access | Peer Review

Volume: 112

Issue: March(1-3)

Pages: 52 - 57

Published: May 07, 2025

Abstract

This study presents a highly accurate method for predicting groundwater levels using Random Forest Regression (RFR) in the Upper Bhavani River Basin area of Coimbatore, India. Daily groundwater level data from 1995 to 2021 were analysed along with relevant environmental factors. The model demonstrated exceptional performance, with R² values of 0.9999 and 0.9994 for the training and testing datasets, respectively.

Copyright

© The Author(s), 2025. Published by Madras Agricultural Students' Union in Madras Agricultural Journal (MAJ). This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited by the user.

Keywords

Groundwater level prediction; Random Forest Regression; Upper Bhavani River Basin; Machine learning

1. Introduction

Accurate prediction of groundwater levels is essential for effective water resource management, especially in regions facing water scarcity or overexploitation of aquifers. The Upper Bhavani River Basin in Coimbatore, Tamil Nadu, is one area where sustainable groundwater management is vital for agriculture and urban water supply. Traditional methods often struggle to capture the complex interactions between various environmental factors and groundwater dynamics. Machine learning approaches have shown promise in addressing these challenges in recent years. This study explores the application of Random Forest Regression (RFR) in predicting groundwater levels with high precision in this critical watershed.

2. Study Area:

The Upper Bhavani River Basin is located in the Western Ghats, in the western part of Coimbatore district, Tamil Nadu, India. It is an integral part of the larger Bhavani River system, a significant tributary of the Cauvery River. The basin is predominantly agricultural, with horticultural crops being the primary cultivation. Groundwater availability in the area is governed by its aquifer system and rock types. The basin is crucial for irrigation, domestic water supply, and industrial use in the region, and it faces challenges such as seasonal water scarcity and overexploitation of groundwater, making accurate groundwater level prediction essential for sustainable water resource management.

Groundwater level forecasting has become increasingly important for sustainable water resource management, especially in regions facing water scarcity. Over the past decade, machine learning techniques have gained prominence due to their ability to handle complex, non-linear relationships in hydrological systems. Random Forest Regression (RFR) has emerged as a powerful tool for groundwater level prediction. Rajaee et al. (2019) compared various machine learning techniques and found that RFR often outperforms other methods in terms of accuracy and robustness. They attributed this to RFR's ability to handle high-dimensional data and its resistance to overfitting. In a study focused on semi-arid regions, Sahoo et al. (2017) demonstrated the effectiveness of RFR in predicting groundwater levels under varying climatic conditions. They highlighted the importance of feature selection in improving model performance. To address data scarcity, Naghibi et al. (2020) proposed a hybrid approach combining RFR with other data-driven methods. Their results showed improved prediction accuracy, especially in areas with limited historical data. Raghavendra et al. (2014) used multiple linear regression (MLR) to predict pest incidence in cotton crops.

For the specific context of river basins, Chen et al. (2020) applied RFR to forecast groundwater levels in a complex river-aquifer system. They found that incorporating river stage data significantly enhanced the model's predictive power. In agricultural watersheds similar to the Upper Bhavani River Basin, Nair and Kumar (2018) demonstrated the superiority of RFR over traditional time series models. They emphasized the importance of including land use and irrigation data as input features. Recent work by Prasad et al. (2022) has focused on integrating remote sensing data with RFR models. Their approach showed promise in improving long-term groundwater level forecasts, particularly in data-scarce regions.

3. Methodology

3.1 Data Collection:

Daily groundwater level data, measured in meters below ground level (mbgl), were collected for the period 1995 to 2021 from the Water Resources Information System (WRIS). In addition to groundwater levels, several environmental variables were included in the analysis. Rainfall data were sourced from the India Meteorological Department (IMD), and additional climatic parameters (soil moisture, relative humidity, minimum temperature, and maximum temperature) were collected from NASA POWER. The basin is predominantly agricultural, with coconut being the primary cultivation.

Selection of parameters

Fig 1 Rainfall over the year

Fig 2 Water level (m) over year

Fig 3 Minimum and Maximum Temperatures over the year

Fig 4 Soil Moisture over the year

Fig 5 Relative Humidity over the year

The Pearson correlation coefficient (r) is used to assess the linear relationship between two variables. It is calculated as

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ represent individual data points, and $\bar{x}$ and $\bar{y}$ denote the means of the x and y variables respectively. The process involves calculating the means, computing the differences from these means, determining the products of these differences, summing these products and the squares of the differences, and dividing the sum of products by the square root of the product of the summed squares. The resulting r-value ranges from -1 to 1, with -1 indicating a perfect negative linear relationship, 1 signifying a perfect positive linear relationship, and 0 suggesting no linear relationship. Values between 0 and 1 (or 0 and -1) indicate varying degrees of positive (or negative) linear relationships. This method forms the foundation for creating correlation heatmaps, which visually represent relationships among multiple variables.
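The calculation steps described above can be sketched in a few lines of plain Python. This is only an illustration: the function name pearson_r and the sample values are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, following the steps in the text."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # Step 1: means of both variables
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Step 2: differences from the means
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    # Step 3: sum of products and sums of squared differences
    sum_products = sum(a * b for a, b in zip(dx, dy))
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)
    # Step 4: divide by the square root of the product of summed squares
    denom = math.sqrt(sum_sq_x * sum_sq_y)
    if denom == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sum_products / denom


# Hypothetical values for illustration only (not data from the study).
rainfall = [0.0, 5.0, 10.0, 15.0, 20.0]
soil_moisture = [0.20, 0.25, 0.30, 0.35, 0.40]  # exactly linear in rainfall
print(pearson_r(rainfall, soil_moisture))
```

Because the illustrative soil moisture values are an exact linear function of rainfall, the coefficient comes out at the upper end of the range; real field data would give a smaller magnitude.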

The heatmap reveals several important relationships among environmental variables. Rainfall shows positive correlations with both soil moisture and water level, indicating that increased precipitation is associated with higher soil moisture content and elevated water levels. Conversely, relative humidity exhibits negative correlations with minimum and maximum temperatures, suggesting that higher temperatures are associated with lower humidity levels. A strong positive correlation exists between minimum and maximum temperatures, which is expected given their related nature. Additionally, water level and soil moisture demonstrate a positive correlation, implying that higher water levels coincide with increased soil moisture. These interrelationships provide valuable insights into the complex dynamics of environmental factors, highlighting how changes in one variable are linked to changes in others within the ecosystem.
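A correlation matrix of the kind that underlies such a heatmap could be computed with pandas as in the sketch below. The column names and the synthetic values are assumptions made for illustration; they are not the study's data, and the resulting coefficients say nothing about the basin.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for illustration only (not the study's data).
rng = np.random.default_rng(0)
n = 365
rainfall = rng.gamma(shape=2.0, scale=5.0, size=n)
t_min = rng.normal(20.0, 2.0, size=n)

df = pd.DataFrame({
    "rainfall": rainfall,
    # soil moisture constructed to rise with rainfall, plus noise
    "soil_moisture": 0.2 + 0.01 * rainfall + rng.normal(0.0, 0.02, size=n),
    "t_min": t_min,
    # maximum temperature constructed to track minimum temperature
    "t_max": t_min + 8.0 + rng.normal(0.0, 1.0, size=n),
})

# Pairwise Pearson coefficients between all columns
corr = df.corr(method="pearson")
print(corr.round(2))
```

The resulting square, symmetric matrix has ones on the diagonal and can be passed to any plotting library to draw the heatmap.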

3.2 Data Pre-processing:

To ensure data quality, rows with missing values (NA) were removed using the pandas call df.dropna(inplace=True). The dataset was then split into training (70%) and testing (30%) sets using the train_test_split function from scikit-learn, with random_state=42 for reproducibility. Feature scaling was performed using StandardScaler to standardize the input variables, which is crucial for many machine learning algorithms to perform optimally.
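The pre-processing steps above might look roughly like the following sketch. The column names and the synthetic data are hypothetical. Fitting the scaler on the training split only is an assumption made here to avoid leaking test-set statistics; the text does not state how the scaler was fitted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical column names).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "rainfall": rng.gamma(2.0, 5.0, n),
    "soil_moisture": rng.uniform(0.1, 0.5, n),
    "rel_humidity": rng.uniform(40.0, 95.0, n),
    "t_min": rng.normal(20.0, 2.0, n),
    "t_max": rng.normal(30.0, 2.0, n),
    "water_level_mbgl": rng.uniform(2.0, 15.0, n),
})
df.loc[[3, 17, 58], "rainfall"] = np.nan  # simulate three rows with gaps

# 1. Remove rows with missing values
df.dropna(inplace=True)

# 2. Split into 70% training and 30% testing data
X = df.drop(columns="water_level_mbgl")
y = df["water_level_mbgl"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# 3. Standardize the inputs (assumption: scaler fitted on training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Tree-based models such as Random Forests are largely insensitive to monotonic rescaling of the inputs, so the scaling step mainly matters if the same pipeline is reused with other algorithms.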

3.3 Model Development:

A Random Forest Regression model was developed using the pre-processed data. The model was trained on the scaled training dataset and evaluated on both the training and testing sets.

$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$

Where:

$\hat{y}$ = the predicted output

$B$ = the number of trees in the forest

$f_b(x)$ = the prediction of the $b$th tree

$x$ = the input features

$$\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{m=1}^{M} c_{mb} \, I(x \in R_{mb})$$

Where:

$M$ = the number of regions (leaf nodes) in each tree

$c_{mb}$ = the predicted value in region $m$ of tree $b$

$R_{mb}$ = the $m$th region in the $b$th tree

$I(\cdot)$ = an indicator function that equals 1 if the condition is true, 0 otherwise

This formula shows that each tree partitions the feature space into regions, and the prediction for a given input is determined by which region it falls into for each tree. The final prediction is the average of these individual tree predictions.
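The averaging step can be checked numerically. The sketch below assumes scikit-learn's RandomForestRegressor, with synthetic data and an arbitrary choice of 50 trees; the text does not report the implementation or hyperparameters actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.1, size=300)

# B = 50 trees (an arbitrary choice for this sketch)
forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X, y)

X_new = rng.uniform(0.0, 1.0, size=(5, 3))

# f_b(x): prediction of each individual tree, stacked into shape (B, n_samples)
tree_preds = np.stack([tree.predict(X_new) for tree in forest.estimators_])

# y_hat = (1/B) * sum over b of f_b(x)
manual_avg = tree_preds.mean(axis=0)

# Compare the manual average with the forest's own prediction
print(np.allclose(manual_avg, forest.predict(X_new)))
```

Each tree's prediction for a sample is the value stored in the leaf region that the sample falls into, which is the inner sum over regions in the second formula.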

4. Results and Discussion

Fig. 7 Prediction of water level for train data

Fig. 8 Prediction of water level for test data

The RFR model demonstrated exceptional performance in predicting groundwater levels.

Table 1: Model Performance

Metric	Training	Testing
R²	0.9999	0.9994
MSE	0.0004	0.0025
RMSE	0.0196	0.0499
MAE	0.002	0.0062

The extremely high R² values (0.9999 for training and 0.9994 for testing) indicate that the model explains nearly all the variance in both datasets, suggesting an exceptionally high level of accuracy. The very low Mean Squared Error (MSE) values (0.0004 for training and 0.0025 for testing) reflect minimal average squared differences between predicted and actual values, demonstrating excellent model precision. The Root Mean Squared Error (RMSE) values (0.0196 for training and 0.0499 for testing) further confirm the model's high accuracy. The slight increase in RMSE for the testing data is expected and indicates good generalization. The low Mean Absolute Error (MAE) values (0.002 for training and 0.0062 for testing) show that the average magnitude of prediction errors is minimal.
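For reference, the four metrics can be computed from paired observed and predicted values as in the following sketch. The function name regression_metrics and the sample values are hypothetical and are not the study's predictions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MSE, RMSE and MAE for paired observations and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n                               # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination
    return {"r2": r2, "mse": mse, "rmse": rmse, "mae": mae}


# Hypothetical water levels in mbgl, for illustration only.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.1, 5.9, 7.2, 8.0]
print(regression_metrics(observed, predicted))
```

Since RMSE is the square root of MSE, the RMSE values in Table 1 agree with the reported MSE values up to rounding.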

4.2 Model Interpretation

The slight increase in error metrics from training to testing data is expected and indicates that the model generalizes exceptionally well, with negligible overfitting. The consistently low error values across all metrics suggest that the RFR model has effectively captured the underlying patterns in the data.

4.3 Limitations and Future Work

While the model shows remarkable performance, future studies could explore the following areas:

- Investigating the relative importance of each input variable in predicting groundwater levels

- Comparing the RFR model with other machine learning algorithms or traditional hydrological models

- Extending the study to include longer-term predictions or different geographical regions
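As a possible starting point for the first item above, impurity-based importances can be read from a fitted forest. The sketch assumes scikit-learn's RandomForestRegressor and uses hypothetical feature names with synthetic data. The target is constructed so that the first feature dominates, so the ranking printed here says nothing about the actual basin.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names; synthetic data for illustration only.
feature_names = ["rainfall", "soil_moisture", "rel_humidity", "t_min", "t_max"]
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(500, len(feature_names)))

# Synthetic target dominated by the first column by construction
y = 10.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.1, size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to 1
importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can be biased when inputs are strongly correlated, as minimum and maximum temperature are here, so a permutation-based measure may be a useful cross-check.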

5. Conclusion

This study demonstrates the high potential of Random Forest Regression in predicting groundwater levels with exceptional accuracy. The model's performance, as evidenced by near-perfect R² values and very low error metrics, suggests it can be a valuable tool for water resource managers and policymakers. This approach can contribute to more effective and sustainable groundwater management strategies by accurately predicting groundwater levels.

Acknowledgments: The authors sincerely thank TNAU for providing groundwater level data. We are also grateful to Agricultural Engineering College and Research Institute, Coimbatore for their assistance and access to computational resources. Special thanks to ArunaDevi K for valuable guidance, and to the anonymous reviewers for their constructive feedback.