Machine Learning Performance on Small Datasets

Dr. Bob Data

Center for Applied ML

Abstract

We compare three popular ML algorithms on datasets of 100 to 1000 samples. Our results suggest that simpler models often outperform complex neural networks in low-data regimes.

1. Methodology

We tested three algorithms:

1. Linear Regression (baseline)
2. Random Forest (ensemble method)
3. Small Neural Network (3 hidden layers)

Datasets ranged from 100 to 1000 samples, with 10 features each.
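
As a rough illustration of this setup, the sketch below builds one dataset per sample size with 10 features each. The synthetic generator, the specific sizes, the noise level, the 80/20 split, and the helper name make_small_dataset are all assumptions made for the example; the text does not say how the study's datasets were obtained. Note that make_regression produces a linear signal, which tends to favor linear models.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Hypothetical sample sizes spanning the 100 to 1000 range given in the text
SAMPLE_SIZES = [100, 250, 500, 1000]
N_FEATURES = 10  # stated in the text


def make_small_dataset(n_samples, seed=0):
    """Return (X_train, X_test, y_train, y_test) for one synthetic dataset."""
    # Synthetic linear-plus-noise data: an assumption, not the study's data
    X, y = make_regression(
        n_samples=n_samples, n_features=N_FEATURES, noise=10.0, random_state=seed
    )
    # The 80/20 split is an illustrative choice
    return train_test_split(X, y, test_size=0.2, random_state=seed)


datasets = {n: make_small_dataset(n) for n in SAMPLE_SIZES}
for n, (X_train, X_test, y_train, y_test) in datasets.items():
    print(f"n={n}: train {X_train.shape}, test {X_test.shape}")
```

At the smallest size the test set holds only 20 samples, so a single split gives a noisy estimate of model quality.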

2. Implementation

All experiments used Python with scikit-learn:

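from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# X_train, X_test, y_train, y_test are assumed to hold the train/test split
# of one dataset, prepared before this point.

# One estimator per algorithm listed in Section 1
models = {
    'linear': LinearRegression(),
    'rf': RandomForestRegressor(n_estimators=100),
    # MLPRegressor sets layer widths via hidden_layer_sizes (three hidden layers here)
    'nn': MLPRegressor(hidden_layer_sizes=(64, 32, 16))
}

# Fit each model on the training split and report R² on the held-out test split
for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"{name}: R² = {score:.3f}")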
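
For scikit-learn regressors, score returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The following minimal sketch computes that quantity by hand and checks it against the library. The small random dataset and the 40/10 split are made up purely for illustration and are not the study's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Tiny made-up dataset, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=50)

# Fit on the first 40 rows, evaluate on the last 10
model = LinearRegression().fit(X[:40], y[:40])
y_true, y_pred = y[40:], model.predict(X[40:])

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both library routes should agree with the manual value
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
assert np.isclose(r2_manual, model.score(X[40:], y[40:]))
print(f"R² = {r2_manual:.3f}")
```

R² is at most 1 and can be negative on a test set, which happens when a model predicts worse than the test-set mean.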

3. Results

As shown in Figure 3.1, linear regression achieves the best performance on datasets with fewer than 500 samples.
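
A comparison of this kind could be set up roughly as follows, averaging test R² over several seeded datasets per sample size so that one lucky split does not decide the ranking. This is a sketch under stated assumptions rather than the procedure behind the reported results: the synthetic data, the number of repeats, the chosen sizes, and the helper names build_models and mean_test_r2 are all hypothetical. Because the synthetic signal is linear, the sketch cannot by itself confirm the ranking above.

```python
import warnings

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor


def build_models(seed):
    """Fresh, seeded copies of the three model types named in the text."""
    return {
        "linear": LinearRegression(),
        "rf": RandomForestRegressor(n_estimators=100, random_state=seed),
        "nn": MLPRegressor(hidden_layer_sizes=(64, 32, 16), random_state=seed),
    }


def mean_test_r2(n_samples, n_repeats=3):
    """Average test R² per model over several seeded synthetic datasets."""
    scores = {}
    for seed in range(n_repeats):
        # Synthetic data with 10 features: an assumption, not the study's data
        X, y = make_regression(
            n_samples=n_samples, n_features=10, noise=10.0, random_state=seed
        )
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        for name, model in build_models(seed).items():
            with warnings.catch_warnings():
                # The MLP may hit its default iteration cap before converging
                warnings.simplefilter("ignore", ConvergenceWarning)
                model.fit(X_train, y_train)
            scores.setdefault(name, []).append(model.score(X_test, y_test))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}


results = {n: mean_test_r2(n) for n in (100, 500)}
for n, by_model in results.items():
    print(n, {name: round(r2, 3) for name, r2 in by_model.items()})
```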

4. Conclusion

In low-data regimes, simpler is better. Before reaching for deep learning, try linear models: they are interpretable, fast, and often more accurate at these sample sizes.

References

1. Breiman, Leo. "Random forests." Machine Learning 45(1), 5-32, 2001.