Eastsheng's Wiki

Paper Reproduction: Predicting Methane Hydrate Equilibrium Conditions with Machine Learning

2024-02-26 18:37:28


Paper overview

Renewable and Sustainable Energy Reviews 173 (2023) 113103

Title: A reliable model to predict the methane-hydrate equilibrium: An updated database and machine learning approach
Data: The database includes 987 data samples for pure water, 15 different single-salt brines, and 6 different multi-salt systems.
Models: Multilayer Perceptron (MLP) ANN, Decision Tree (DT), and Extremely Randomized Trees (ET)
Implementation: Scikit-Learn library
Prediction target: equilibrium temperature

Database

  • The gathered database includes 987 data samples for pure water, 15 different single-salt brines, and 6 different multi-salt systems. Covering 15 different salts, it is the largest database gathered on the topic of interest so far.

Data checking

  • Errors in the devices or human error in recording: corrected.
  • The database was checked for duplicate samples: one copy is kept and the repeats are removed.
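
The duplicate check can be sketched in a few lines. This is only an illustration: the row layout (salt, concentration, P, T), the sample values, and the 6-decimal rounding tolerance are assumptions of this note, not details from the paper.

```python
def drop_duplicates(samples):
    """Keep the first occurrence of each sample; input order is preserved."""
    seen = set()
    unique = []
    for sample in samples:
        # Round floats so the same measurement stored with tiny numerical
        # differences compares equal (the 6-decimal tolerance is an assumption).
        key = tuple(round(v, 6) if isinstance(v, float) else v for v in sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


# Hypothetical rows: (salt, concentration in wt%, P in MPa, T in K)
rows = [
    ("NaCl", 5.0, 6.0, 279.1),
    ("NaCl", 5.0, 6.0, 279.1),  # exact duplicate of the first row
    ("KCl", 3.0, 4.5, 277.8),
]
unique_rows = drop_duplicates(rows)
print(len(rows), "->", len(unique_rows))
```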

Feature selection

  • Almost all experimental works in the literature report the salt concentration C in wt%. However, the authors opted to use the ions' mole fractions (x) for modeling, for several reasons:

    • Using C, it is not possible to correctly address multi-salt systems.
    • Using ion concentrations, there are 12 characteristic features for brines, fewer than the 15 different salts. Fewer features not only decrease the processing load but also lower the probability of overfitting.
    • More importantly, using ion concentrations, it is possible to address salts not present in the reported experimental data samples.
  • In addition to the brine characteristics, the pressure P is used as the operational parameter to predict the equilibrium temperature T of the CH4 hydrate.
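
To make the feature idea concrete, here is a minimal sketch that converts salt concentrations in wt% into ion mole fractions. The basis (100 g of solution, complete dissociation, mole fractions taken over water plus all ions) and the helper name `ion_mole_fractions` are assumptions of this note; the paper's exact convention may differ.

```python
# Molar masses in g/mol.
M_WATER = 18.015
SALTS = {
    # salt: (molar mass, {ion: number of ions per formula unit})
    "NaCl": (58.44, {"Na+": 1, "Cl-": 1}),
    "MgCl2": (95.21, {"Mg2+": 1, "Cl-": 2}),
}


def ion_mole_fractions(wt_percent):
    """Convert {salt: wt%} for one brine into {ion: mole fraction}.

    Assumed basis: 100 g of solution, complete dissociation, and mole
    fractions taken over water plus all dissolved ions.
    """
    water_mass = 100.0 - sum(wt_percent.values())
    moles = {"H2O": water_mass / M_WATER}
    for salt, wt in wt_percent.items():
        molar_mass, ions = SALTS[salt]
        n_salt = wt / molar_mass
        for ion, count in ions.items():
            # Ions shared by several salts (here Cl-) accumulate in one feature.
            moles[ion] = moles.get(ion, 0.0) + count * n_salt
    total = sum(moles.values())
    return {species: n / total for species, n in moles.items() if species != "H2O"}


# A hypothetical multi-salt brine: 3 wt% NaCl + 2 wt% MgCl2.
x = ion_mole_fractions({"NaCl": 3.0, "MgCl2": 2.0})
print(x)
```

Because both salts contribute to the same `Cl-` feature, a multi-salt brine, or a salt that never appears in the data but whose ions do, maps onto the same feature vector layout.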

Data splitting

  • The gathered database includes 977 data samples after removing the duplicates.
  • Before splitting, 40 data samples belonging to MgBr2 were set aside for additional testing. Although no other reference reports experimental data for this salt, the database contains samples with Mg²⁺ and Br⁻ ions from other salts.
  • If the developed model gives an acceptable prediction for MgBr2, it should be usable for any hypothetical salt whose ions are available in the database.
  • The remaining 937 data samples were split into training and testing subsets in the ratio of 80/20.
  • A 5-fold Cross-Validation (CV) was used to validate the developed models.
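
The splitting scheme can be sketched with scikit-learn as follows. The arrays are synthetic stand-ins with the same sample counts as the paper, not the real database. The 13-column layout (12 ion features plus P), the position of the MgBr2 rows, the random seeds, and running the CV on the training subset are all assumptions of this note.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 977 de-duplicated samples (not the real data).
n_total, n_features = 977, 13  # assumed layout: 12 ion features + P
X = rng.random((n_total, n_features))
y = rng.random(n_total)
is_mgbr2 = np.zeros(n_total, dtype=bool)
is_mgbr2[:40] = True  # pretend the first 40 rows are the MgBr2 samples

# 1) Set the MgBr2 samples aside as an additional, fully unseen test set.
X_extra, y_extra = X[is_mgbr2], y[is_mgbr2]
X_rest, y_rest = X[~is_mgbr2], y[~is_mgbr2]

# 2) Split the remaining 937 samples 80/20 into training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=0
)

# 3) 5-fold cross-validation folds, built here on the training subset.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_train)]

print(len(X_extra), len(X_train), len(X_test), fold_sizes)
```

With these counts, `train_test_split` yields 749 training and 188 testing samples, since the test size is rounded up.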

Models

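MLP (Multilayer Perceptron)

  • An MLP is a black-box model: the fitting process is not explicitly visible;
  • The structure of an MLP consists of an input layer, one or more hidden layers, and an output layer;
  • The numbers of neurons in the input and output layers correspond to the numbers of input and output features;
  • The number of neurons in the hidden layers is arbitrary.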
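
A small sketch of this layer structure using scikit-learn's `MLPRegressor`. The data are synthetic, and the 13 input features and the (16, 8) hidden layers are arbitrary choices for illustration, not the architecture used in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 13))  # synthetic inputs with 13 features (assumed count)
y = X @ rng.random(13)     # synthetic single output

# Two hidden layers with 16 and 8 neurons: an arbitrary illustrative choice.
mlp = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
mlp.fit(X, y)

# One weight matrix per pair of consecutive layers:
# input -> hidden 1 -> hidden 2 -> output.
layer_shapes = [w.shape for w in mlp.coefs_]
print(layer_shapes)
```

The printed shapes are `[(13, 16), (16, 8), (8, 1)]`: the first and last dimensions are fixed by the numbers of input and output features, while the hidden sizes are free choices.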

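DT (Decision Tree)

  • This predictive tool is a non-parametric model;
  • A DT works by developing simple decision rules based on the input variables;
  • Because a DT can be visualized, it can be regarded as a white-box model;
  • However, if a DT is too deep, it loses its simplicity of intuitive understanding and interpretation.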
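
The white-box property, and how it fades with depth, can be seen in a short sketch. The data are synthetic, and the two feature names `P` and `x_ion` are placeholders chosen for this note.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = 270.0 + 15.0 * X[:, 0] - 5.0 * X[:, 1]  # synthetic target, not real data

# A shallow tree can be printed and read as a handful of if/else rules.
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(shallow, feature_names=["P", "x_ion"]))

# An unconstrained tree fits the same data with far more leaves,
# which makes the rule set much harder to read.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
print("leaves:", shallow.get_n_leaves(), "vs", deep.get_n_leaves())
```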

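ET (Extremely Randomized Trees)

  • ET is an ensemble method that benefits from averaging several DTs to give better results and minimize the probability of overfitting.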
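
The averaging can be checked directly: in scikit-learn's `ExtraTreesRegressor`, the ensemble prediction is the mean of the individual trees' predictions. The sketch below uses synthetic data and an arbitrary 50 trees.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(200)  # synthetic

et = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predict with every individual tree, then average the results by hand.
X_new = rng.random((5, 3))
per_tree = np.stack([tree.predict(X_new) for tree in et.estimators_])
ensemble_mean = per_tree.mean(axis=0)

print(np.allclose(ensemble_mean, et.predict(X_new)))
```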

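Hyperparameter optimization

  • Hyperparameters represent the external shape of the model and should be determined before the training step;
  • The model's parameters, on the other hand, are determined during the training stage. For a given dataset, the optimal set of hyperparameters ensures the best performance of the model.
  • Needless to say, like all other operations, hyperparameter optimization is done using only the training dataset.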
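
One possible way to do this in scikit-learn is a cross-validated grid search fitted on the training subset only. The notes above do not record which search strategy the paper used, so `GridSearchCV`, the parameter grid, and the synthetic data below are assumptions of this note.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.standard_normal(300)  # synthetic
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are fixed before training; this small grid is illustrative.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}

search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)  # the test subset is never touched here

best_rmse = -search.best_score_  # the scorer returns negative RMSE
print(search.best_params_, best_rmse)
```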

Model fitting

  • The 5-fold CV RMSE was acquired from 20 trials for each of the MLP, DT, and ET models.
  • As each trial uses a different random state, the results differ from trial to trial. The closer the error values are to each other across random states, the more stable the model.
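
A minimal sketch of such a stability check, shown for a decision tree on synthetic data. Using the trial index as the seed for both the model and the fold shuffling, and fixing `max_depth=5`, are assumptions of this note.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.standard_normal(300)  # synthetic

rmse_per_trial = []
for seed in range(20):  # 20 trials, each with its own random state
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    model = DecisionTreeRegressor(max_depth=5, random_state=seed)
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse_per_trial.append(-scores.mean())  # mean RMSE over the 5 folds

rmse_per_trial = np.array(rmse_per_trial)
# A small spread across trials indicates a stable model.
print("mean RMSE:", rmse_per_trial.mean(), "std:", rmse_per_trial.std())
```

Repeating the loop with `MLPRegressor` and `ExtraTreesRegressor` in place of the tree would give a comparable spread for each model type.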

Accuracy assessment

  • The models are assessed with the Residual ($Res_i$), Coefficient of determination ($R^2$), $RMSE$, Absolute Relative Deviation ($ARD$), and Average Absolute Relative Deviation ($AARD$).

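
For reference, the metrics can be computed as in the sketch below. These are the commonly used definitions, assumed here rather than copied from the paper: the residual is taken as experimental minus predicted, and ARD and AARD are expressed in percent of the experimental temperature. The three temperature pairs are made-up values.

```python
import math


def accuracy_metrics(t_exp, t_pred):
    """Res_i, R^2, RMSE, ARD_i (%) and AARD (%) under the assumed definitions."""
    n = len(t_exp)
    residuals = [e - p for e, p in zip(t_exp, t_pred)]  # Res_i = T_exp - T_pred
    mean_exp = sum(t_exp) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((e - mean_exp) ** 2 for e in t_exp)
    ard = [100.0 * abs(r) / e for r, e in zip(residuals, t_exp)]
    return {
        "Res": residuals,
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "ARD": ard,
        "AARD": sum(ard) / n,
    }


# Hypothetical experimental and predicted equilibrium temperatures in K.
m = accuracy_metrics([280.0, 285.0, 290.0], [281.0, 285.0, 288.0])
print(m["R2"], m["RMSE"], m["AARD"])
```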

Github Project