2025/04/10

I Want to Build a Prediction Model Easily with AutoML! (Using AutoGluon)

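Goal: Try Out AutoGluon!

In this post I work through the quickstart for AutoGluon, one of the AutoML tools that has been getting attention in recent years, to understand what it can actually do.

Why Take On This Challenge

With the recent growth of generative AI and the data science field, students' data science skills are improving year after year. As a result, in the data scientist internship we run every year, there have been cases where the model answer scored lower than the students' solutions.

That is not a good situation, so I decided to try AutoML to see whether it could help improve prediction accuracy.

AutoML (automated machine learning) is a technology that simplifies the process of building machine learning models so that accurate models can be created without specialist knowledge. Among AutoML tools, AutoGluon is known for being particularly easy to use and powerful.

I have never used AutoML and am a complete beginner, but through this challenge I want to learn the basics of AutoGluon, try it on a real dataset, and get a feel for its convenience and performance.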

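What Is AutoGluon

AutoGluon is an open-source AutoML framework developed by Amazon. It has the following features.

Easy installation and use: models can be trained and used for prediction in a few lines of code.
Support for many data formats: it handles tabular data, images, text, time series, and other data types.
Automated process: many steps are automated, including feature engineering, model selection, and hyperparameter tuning.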

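AutoGluon Tabular - Quickstart

This section is a translation of the English quickstart, so please overlook any slightly odd wording.
If you would like to run it yourself, you can do so from the linked page.

Installation

First, install AutoGluon and import TabularDataset and TabularPredictor.
TabularDataset: used to load the data
TabularPredictor: used to train models and make predictions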

!python -m pip install --upgrade pip
!python -m pip install autogluon

from autogluon.tabular import TabularDataset, TabularPredictor

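Sample Data

Dataset used: Nature issue 7887
The goal with this dataset is to predict the signature of a knot from its properties.
10,000 training examples and 5,000 test examples were sampled from the original data (original data).
The sampled dataset lets this tutorial run quickly, but AutoGluon can also handle the full dataset if needed.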
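
The sampling idea described above can be sketched roughly as follows. This is only an illustration, not the script used to prepare the tutorial files: the helper name `sample_rows` and the toy rows are hypothetical. On a loaded dataset, which behaves like a pandas DataFrame, `DataFrame.sample(n=..., random_state=...)` should give the same effect.

```python
import random


def sample_rows(rows, n, seed=0):
    """Return n rows chosen without replacement, reproducible for a given seed."""
    if n > len(rows):
        raise ValueError("cannot sample more rows than are available")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(rows, n)


# Toy stand-in for the full dataset (hypothetical values, not the real knot data).
full = [{"id": i, "signature": (i % 5) * 2 - 4} for i in range(1000)]
subset = sample_rows(full, 100)
print(len(subset))
```

Fixing the seed keeps the subset stable between runs, which makes it easier to compare results while prototyping.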

data_url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(f'{data_url}train.csv')
train_data.head()

Unnamed: 0	chern_simons	cusp_volume	hyperbolic_adjoint_torsion_degree	injectivity_radius	longitudinal_translation	meridinal_translation_imag	meridinal_translation_real	short_geodesic_imag_part	short_geodesic_real_part	Symmetry_0	Symmetry_Z/2 + Z/2	volume	signature
0	70746	0.090530	12.226322	10	0.507756	10.685555	1.144192	-0.519157	-2.760601	1.015512	1.0	11.393225	-2
1	240827	0.232453	13.800773	14	0.413645	10.453156	1.320249	-0.158522	-3.013258	0.827289	1.0	12.742782	0
2	155659	-0.144099	14.761030	14	0.436928	13.405199	1.101142	0.768894	2.233106	0.873856	0	15.236505	2
3	239963	-0.171668	13.738019	22	0.249481	27.819496	0.493827	-1.188718	-2.042771	0.498961	0	17.279890	-8
4	90504	0.235188	15.896359	10	0.389329	15.330971	1.036879	0.722828	-3.056138	0.778658	0	16.749298	4

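The target variable is stored in the "signature" column, which according to the quickstart contains 18 unique integers (the training log below reports 13 unique values in this sampled training set). pandas does not recognize this column as categorical, but AutoGluon handles this for us.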

label = 'signature'
train_data[label].describe()

Output

count    10000.000000
mean        -0.022000
std          3.025166
min        -12.000000
25%         -2.000000
50%          0.000000
75%          2.000000
max         12.000000
Name: signature, dtype: float64
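
Because describe() summarizes the label as a number, it can help to look at it as a set of classes instead. The sketch below counts how often each label value occurs and lists values that fall under a minimum count, which is the kind of situation the training log warns about for classes with fewer than 10 examples. It illustrates the idea only and is not AutoGluon's implementation. The helper names and the toy label list are hypothetical; with pandas, `train_data[label].value_counts()` would give the counts directly.

```python
from collections import Counter


def class_counts(labels):
    """Count how many times each label value appears."""
    return Counter(labels)


def rare_classes(labels, min_count):
    """Return the sorted label values that appear fewer than min_count times."""
    counts = Counter(labels)
    return sorted(value for value, n in counts.items() if n < min_count)


# Toy labels (hypothetical, shaped like the 'signature' column).
labels = [-2, 0, 2, -8, 4, 0, 0, 2, -2, 0]
print(class_counts(labels))
print(rare_classes(labels, min_count=2))
```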

Training

predictor = TabularPredictor(label=label).fit(train_data)

Output

  No path specified. Models will be saved in: "AutogluonModels/ag-20241205_012036"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version:  1.2
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       11.57 GB / 12.67 GB (91.3%)
Disk Space Avail:   74.20 GB / 107.72 GB (68.9%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
    Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
    presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
    presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
    presets='high'         : Strong accuracy with fast inference speed.
    presets='good'         : Good accuracy with very fast inference speed.
    presets='medium'       : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "/content/AutogluonModels/ag-20241205_012036"
Train Data Rows:    10000
Train Data Columns: 18
Label Column:       signature
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
    First 10 (of 13) unique label values:  [-2, 0, 2, -8, 4, -4, -6, 8, 6, 10]
    If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       multiclass
Preprocessing data ...
Warning: Some classes in the training set have fewer than 10 examples. AutoGluon will only keep 9 out of 13 classes for training and will not try to predict the rare classes. To keep more classes, increase the number of datapoints from these rare classes in the training data or reduce label_count_threshold.
Fraction of data from classes with at least 10 examples that will be kept for training models: 0.9984
Train Data Class Count: 9
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    11841.83 MB
    Train Data (Original)  Memory Usage: 1.37 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 5 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Useless Original Features (Count: 1): ['Symmetry_D8']
        These features carry no predictive signal and should be manually investigated.
        This is typically a feature which has the same value for all rows.
        These features do not need to be present at inference time.
    Types of features in original data (raw dtype, special dtypes):
        ('float', []) : 14 | ['chern_simons', 'cusp_volume', 'injectivity_radius', 'longitudinal_translation', 'meridinal_translation_imag', ...]
        ('int', [])   :  3 | ['Unnamed: 0', 'hyperbolic_adjoint_torsion_degree', 'hyperbolic_torsion_degree']
    Types of features in processed data (raw dtype, special dtypes):
        ('float', [])     : 9 | ['chern_simons', 'cusp_volume', 'injectivity_radius', 'longitudinal_translation', 'meridinal_translation_imag', ...]
        ('int', [])       : 3 | ['Unnamed: 0', 'hyperbolic_adjoint_torsion_degree', 'hyperbolic_torsion_degree']
        ('int', ['bool']) : 5 | ['Symmetry_0', 'Symmetry_D3', 'Symmetry_D4', 'Symmetry_D6', 'Symmetry_Z/2 + Z/2']
    0.3s = Fit runtime
    17 features in original data used to generate 17 features in processed data.
    Train Data (Processed) Memory Usage: 0.96 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.42s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 8985, Val Rows: 999
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': [{}],
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],
    'CAT': [{}],
    'XGB': [{}],
    'FASTAI': [{}],
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsUnif ...
    0.2232   = Validation score   (accuracy)
    9.92s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: KNeighborsDist ...
    0.2132   = Validation score   (accuracy)
    0.05s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: NeuralNetFastAI ...
    0.9409   = Validation score   (accuracy)
    16.79s   = Training   runtime
    0.04s    = Validation runtime
Fitting model: LightGBMXT ...
/usr/local/lib/python3.10/dist-packages/dask/dataframe/__init__.py:42: FutureWarning: 
Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.

  warnings.warn(msg, FutureWarning)
    0.9459   = Validation score   (accuracy)
    10.69s   = Training   runtime
    0.24s    = Validation runtime
Fitting model: LightGBM ...
    0.956    = Validation score   (accuracy)
    9.71s    = Training   runtime
    0.33s    = Validation runtime
Fitting model: RandomForestGini ...
    0.9449   = Validation score   (accuracy)
    8.86s    = Training   runtime
    0.12s    = Validation runtime
Fitting model: RandomForestEntr ...
    0.9499   = Validation score   (accuracy)
    10.04s   = Training   runtime
    0.11s    = Validation runtime
Fitting model: CatBoost ...
    0.956    = Validation score   (accuracy)
    73.03s   = Training   runtime
    0.01s    = Validation runtime
Fitting model: ExtraTreesGini ...
    0.9469   = Validation score   (accuracy)
    4.42s    = Training   runtime
    0.13s    = Validation runtime
Fitting model: ExtraTreesEntr ...
    0.9429   = Validation score   (accuracy)
    2.84s    = Training   runtime
    0.13s    = Validation runtime
Fitting model: XGBoost ...
    0.957    = Validation score   (accuracy)
    16.0s    = Training   runtime
    0.35s    = Validation runtime
Fitting model: NeuralNetTorch ...
    0.9419   = Validation score   (accuracy)
    79.07s   = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMLarge ...
    0.9499   = Validation score   (accuracy)
    16.1s    = Training   runtime
    0.42s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    Ensemble Weights: {'RandomForestEntr': 0.25, 'ExtraTreesGini': 0.25, 'KNeighborsUnif': 0.167, 'NeuralNetFastAI': 0.167, 'XGBoost': 0.083, 'NeuralNetTorch': 0.083}
    0.965    = Validation score   (accuracy)
    0.25s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 264.38s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 1512.1 rows/s (999 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/content/AutogluonModels/ag-20241205_012036")
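
The log above notes that no save path and no preset were specified, and that the problem type was inferred. A minimal sketch of setting these explicitly might look like the following. The helper names `build_kwargs` and `train` are hypothetical, and the argument names (`path`, `problem_type`, `eval_metric`, `presets`, `time_limit`) reflect my understanding of the TabularPredictor API, so please check them against the version you have installed.

```python
def build_kwargs(label, path=None, presets="medium_quality", time_limit=None):
    """Collect constructor and fit() arguments in one place (hypothetical helper)."""
    init_kwargs = {
        "label": label,
        "problem_type": "multiclass",  # state the task instead of relying on inference
        "eval_metric": "accuracy",
    }
    if path is not None:
        init_kwargs["path"] = path  # fixed save location instead of a timestamped folder
    fit_kwargs = {"presets": presets}
    if time_limit is not None:
        fit_kwargs["time_limit"] = time_limit  # training budget in seconds
    return init_kwargs, fit_kwargs


def train(train_data, label, **options):
    """Train a predictor with explicit settings; requires AutoGluon to be installed."""
    from autogluon.tabular import TabularPredictor  # imported lazily on purpose

    init_kwargs, fit_kwargs = build_kwargs(label, **options)
    return TabularPredictor(**init_kwargs).fit(train_data, **fit_kwargs)
```

For example, `train(train_data, label, path="AutogluonModels/knot", time_limit=600)` would save to a known folder (the folder name here is made up), which can later be reopened with `TabularPredictor.load(...)` as the last log line shows.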

Prediction

test_data = TabularDataset(f'{data_url}test.csv')
y_pred = predictor.predict(test_data.drop(columns=[label]))

Evaluation

predictor.evaluate(test_data, silent=True)

Output

{'accuracy': 0.9478,
 'balanced_accuracy': 0.754478262473782,
 'mcc': 0.9360368834449522}
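
The gap between accuracy (0.9478) and balanced accuracy (about 0.754) is easier to read with the definitions in hand. The sketch below uses the common definition of balanced accuracy as the mean of per-class recall. It is a plain-Python illustration with hypothetical function names and toy labels, not the code AutoGluon runs internally.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy example: one frequent class and two rare ones, one of which is always missed.
y_true = [0] * 8 + [2] + [4]
y_pred = [0] * 8 + [2] + [0]
print(round(accuracy(y_true, y_pred), 3))           # 0.9
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.667
```

One plausible reading of the result above is that a few rare classes are predicted poorly or not at all, which barely moves accuracy but pulls balanced accuracy down. I have not verified this against the per-class results.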

TabularPredictor also provides a leaderboard() method, which can be used to evaluate how each trained model performs on the test data.

predictor.leaderboard(test_data)

model	score_test	score_val	eval_metric	pred_time_test	pred_time_val	fit_time	pred_time_test_marginal	pred_time_val_marginal	fit_time_marginal	stack_level	can_infer	fit_order
WeightedEnsemble_L2	0.9478	0.964965	accuracy	2.777582	0.660655	136.499218	0.025436	0.001888	0.246331	2	True	14
LightGBM	0.9456	0.955956	accuracy	0.704927	0.331303	9.709910	0.704927	0.331303	9.709910	1	True	5
XGBoost	0.9448	0.956957	accuracy	1.877720	0.350646	16.003580	1.877720	0.350646	16.003580	1	True	11
LightGBMLarge	0.9444	0.949950	accuracy	3.199392	0.421252	16.101254	3.199392	0.421252	16.101254	1	True	13
CatBoost	0.9432	0.955956	accuracy	0.065079	0.011186	73.033620	0.065079	0.011186	73.033620	1	True	8
RandomForestEntr	0.9384	0.949950	accuracy	0.284559	0.108530	10.044177	0.284559	0.108530	10.044177	1	True	7
NeuralNetFastAI	0.9364	0.940941	accuracy	0.102912	0.041506	16.789817	0.102912	0.041506	16.789817	1	True	3
ExtraTreesGini	0.9360	0.946947	accuracy	0.413286	0.126837	4.417963	0.413286	0.126837	4.417963	1	True	9
ExtraTreesEntr	0.9358	0.942943	accuracy	0.434792	0.127124	2.836171	0.434792	0.127124	2.836171	1	True	10
RandomForestGini	0.9352	0.944945	accuracy	0.266627	0.117757	8.860353	0.266627	0.117757	8.860353	1	True	6
NeuralNetTorch	0.9320	0.941942	accuracy	0.035788	0.012760	79.072856	0.035788	0.012760	79.072856	1	True	12
LightGBMXT	0.9320	0.945946	accuracy	1.222842	0.243437	10.694745	1.222842	0.243437	10.694745	1	True	4
KNeighborsDist	0.2210	0.213213	accuracy	0.038112	0.016588	0.045746	0.038112	0.016588	0.045746	1	True	2
KNeighborsUnif	0.2180	0.223223	accuracy	0.037879	0.018488	9.924494	0.037879	0.018488	9.924494	1	True	1
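
WeightedEnsemble_L2 sits at the top of the leaderboard, and the training log lists the weights it assigns to the base models. As a simplified picture of what a weighted ensemble does, the sketch below averages each model's class probabilities using such weights. The function name, model names, weights, and probabilities are all hypothetical, and this is not AutoGluon's internal code.

```python
def weighted_average(probas, weights):
    """Combine per-model class probabilities using normalized weights.

    probas:  model name -> list of class probabilities
    weights: model name -> non-negative weight
    """
    n_classes = len(next(iter(probas.values())))
    total = sum(weights.values())
    combined = [0.0] * n_classes
    for name, w in weights.items():
        for k, p in enumerate(probas[name]):
            combined[k] += (w / total) * p
    return combined


# Toy inputs: three models, three classes.
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}
probas = {
    "model_a": [0.6, 0.3, 0.1],
    "model_b": [0.2, 0.7, 0.1],
    "model_c": [0.1, 0.8, 0.1],
}
combined = weighted_average(probas, weights)
predicted = max(range(len(combined)), key=combined.__getitem__)
print(predicted)  # index of the class with the highest combined probability
```

In this toy case model_a alone would pick class 0, but the combined probabilities favor class 1, which shows how an ensemble can overrule a single model.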

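Conclusion

In this quickstart tutorial, we looked at AutoGluon's basic fit and predict functionality using TabularDataset and TabularPredictor.

AutoGluon simplifies the model training process by removing the need for manual feature engineering and hyperparameter tuning.

To learn more about AutoGluon's other features, such as customizing the training and prediction steps or extending it with custom feature generators, models, and metrics, check out the detailed tutorials.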

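Summary

So, what did you think?
As the final table shows, AutoGluon compared many models with very little code!
Training takes longer because it tries a variety of models, but AutoGluon looks very useful when you are deciding which model to use.

AutoGluon handles data preprocessing, hyperparameter optimization, architecture optimization, and model ensembling (stacking) in one go, so it is well worth keeping an eye on.

Beyond the tabular data covered here, it also supports areas such as image classification, image segmentation, object detection, natural language, and multimodal prediction, so if you are interested, please try the quickstart for each of them.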

Reference Links
https://atmarkit.itmedia.co.jp/ait/articles/2203/24/news004.html
https://auto.gluon.ai/stable/index.html
https://pages.awscloud.com/rs/112-TZM-766/images/1.AWS_AutoML_AutoGluon.pdf

Note: This article reflects information as of April 2025.

Author: Mynavi Engineer Blog Editorial Team
