Data-driven modelling can provide important insight for the control and optimization of
industrial processes. In particular, it can infer hard-to-measure parameters, and make
predictions. Its application supplements and enhances the capability of the existing apparatuses
for on-line monitoring of these processes. However, recent studies indicated that data-driven
models performed poorly in applications for pollutant removal and green energy production.
This study examined the issues and proposed possible solutions.
In the field of pollutant removal, there is a tendency to generate unbalanced datasets following
a highly-skewed and heavy-tailed distribution. In a case study on soft sensor development for
measuring total phosphorus (TP) in processed wastewater (effluent), the ridge regression model
initially yielded good performance (adjusted R² = 0.89). However, when the adjusted R² was
measured for the frequently occurring, lower-amplitude data points, the performance was
significantly worse (R² < 0.29). On the other hand, when the adjusted R² was measured for the
sparsely occurring, higher-amplitude data points, its value (R² = 0.91) was closer to the
overall R². Based on mathematical simulations, it was concluded that the overall R² was biased
towards the highest amplitudes for this distribution. Moreover, it was observed that the model
errors arose because the estimator most highly correlated with effluent TP at these higher
amplitudes (effluent orthophosphates) did not have a strong linear correlation with effluent TP
at the lower amplitudes. The problem was addressed by developing a ridge regression ensemble,
combining one model dedicated to extreme high amplitudes and a second model for lower
amplitudes. Results indicated that the ridge regression ensemble performed significantly better
than a single model with the highest R². As typical biochemical processes are associated with
highly skewed data sets, our results provide valuable insights for future data modelling in
biochemical applications.
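The bias of the overall R² towards the highest amplitudes can be illustrated with a small simulation. The sketch below is not the thesis's own simulation: the log-normal target, the constant-spread noise, the seed and the cut-offs for the "low" and "high" subsets are all assumptions chosen for illustration, and it uses plain R² rather than adjusted R² to stay short.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical heavy-tailed target (log-normal), standing in for a skewed signal.
y = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
# Hypothetical "model": the true value plus noise of constant spread.
y_hat = [v + rng.gauss(0.0, 0.5) for v in y]

# Sort by true amplitude, then split off the two subsets of interest.
pairs = sorted(zip(y, y_hat))
n_low = len(pairs) // 2    # frequent, lower-amplitude half (assumed cut-off)
n_high = len(pairs) // 10  # sparse, highest-amplitude tenth (assumed cut-off)
low, high = pairs[:n_low], pairs[-n_high:]

r2_all = r_squared(y, y_hat)
r2_low = r_squared(*zip(*low))
r2_high = r_squared(*zip(*high))

print(f"overall R2: {r2_all:.2f}")
print(f"lower-amplitude R2: {r2_low:.2f}")
print(f"higher-amplitude R2: {r2_high:.2f}")
```

Because the total variance is dominated by the tail, the same absolute error looks small against the whole data set but large against the narrow spread of the low-amplitude points, so the subset R² values diverge even though the error level is identical everywhere.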
In the field of green energy production, biomass cultivation can have significant lead times,
which makes it a point of interest to know whether or not yield will be satisfactory. In this case,
data on optical density and pH during the cultivation cycle were available to use as predictors.
However, these predictors were also highly collinear in time, resulting in overfitting for most
models. L1 regularization is known to be a robust method against overfitting; however, the
results showed that it removed all of the redundant input variables, leading to the loss
of accountability for important chemical reactions. Unlike in other applications, important
chemical elements cannot be removed even if doing so improves the empirical model.
Further examination indicated that penalizing the model coefficients with L2 regularization
reduced the overfitting from collinear variables without removing any of the variables. This
resulted in better performance for the case study. These results provide important insights for
future data modelling efforts in chemical processes.
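The contrast between the two penalties can be sketched on a toy problem with two collinear predictors. Everything here is hypothetical and is not the thesis's data or models: `x2` is an attenuated, slightly noisy copy of `x1`, the target depends equally on both, and the penalty weight `lam` is an arbitrary choice. The L1 fit uses coordinate descent with soft-thresholding, and the L2 fit uses the closed-form solution for two features.

```python
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def soft_threshold(value, threshold):
    """Shrink value toward zero; exactly 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_two_features(x1, x2, y, lam, sweeps=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * (|b1| + |b2|)."""
    n = len(y)
    cols = [x1, x2]
    b = [0.0, 0.0]
    for _ in range(sweeps):
        for j in (0, 1):
            k = 1 - j
            # Partial residual: remove the other feature's contribution.
            partial = [yi - b[k] * xk for yi, xk in zip(y, cols[k])]
            rho = dot(cols[j], partial) / n
            b[j] = soft_threshold(rho, lam) / (dot(cols[j], cols[j]) / n)
    return b

def ridge_two_features(x1, x2, y, lam):
    """Closed form for (1/2n)||y - Xb||^2 + (lam/2) * (b1^2 + b2^2)."""
    n = len(y)
    a11 = dot(x1, x1) / n + lam
    a22 = dot(x2, x2) / n + lam
    a12 = dot(x1, x2) / n
    r1, r2 = dot(x1, y) / n, dot(x2, y) / n
    det = a11 * a22 - a12 * a12  # Cramer's rule on the 2x2 system
    return [(r1 * a22 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det]

rng = random.Random(0)  # fixed seed so the run is repeatable
x1 = [rng.gauss(0.0, 1.0) for _ in range(200)]
x2 = [0.9 * v + rng.gauss(0.0, 0.01) for v in x1]  # nearly collinear with x1
y = [a + b + rng.gauss(0.0, 0.1) for a, b in zip(x1, x2)]  # both matter equally

lam = 0.1  # assumed penalty weight, same value for both fits
lasso_coefs = lasso_two_features(x1, x2, y, lam)
ridge_coefs = ridge_two_features(x1, x2, y, lam)

print("L1 coefficients:", lasso_coefs)
print("L2 coefficients:", ridge_coefs)
```

In this setup the L1 fit puts all of the weight on `x1` and sets the coefficient of `x2` to exactly zero, while the L2 fit shrinks both coefficients but keeps both predictors in the model, which mirrors the behaviour described above.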
In essence, this thesis identified two contextual problems in data-driven modelling for
sustainable technologies, through case studies in pollutant removal and energy production
specifically. In both cases, the problem concerns a characteristic of the predictors that
is treated as noise or as otherwise detrimental to modelling. The solution was to highlight
these features in order to make better use of the underlying patterns when estimating the
target parameter as a continuous variable.