Dr. Bob Data
Center for Applied ML
We compare three popular ML algorithms on small datasets of 100 to 1000 samples. Our results suggest that simpler models often outperform complex neural networks in low-data regimes.
We tested three algorithms:
1. Linear Regression (baseline)
2. Random Forest (ensemble method)
3. Small Neural Network (3 hidden layers)
Datasets ranged from 100 to 1000 samples, with 10 features each.
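The datasets themselves are not reproduced here. As a stand-in for trying the comparison, the sketch below generates a synthetic regression problem of the stated shape (10 features, a chosen number of samples) and splits it into training and test sets. The linear ground truth, the noise level, the 80/20 split, and the helper names `make_dataset` and `split` are all assumptions for illustration, not the data or protocol used in the experiments.

```python
import random


def make_dataset(n_samples, n_features=10, noise=0.5, seed=0):
    """Hypothetical synthetic data: y is a noisy linear function of X."""
    rng = random.Random(seed)
    # Assumed ground-truth weights, drawn once per dataset
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        target = sum(w * x for w, x in zip(weights, row)) + rng.gauss(0.0, noise)
        X.append(row)
        y.append(target)
    return X, y


def split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the indices and hold out the first test_fraction of them."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Smallest dataset size mentioned in the text
X, y = make_dataset(100)
X_train, X_test, y_train, y_test = split(X, y)
```

Nested lists like these are accepted by scikit-learn estimators as array-like input, so the resulting `X_train`, `y_train`, `X_test`, and `y_test` can be passed straight to `fit` and `score`.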
All experiments used Python with scikit-learn:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
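# Train models (X_train, y_train, X_test, y_test are assumed to be defined already)
models = {
    'linear': LinearRegression(),
    'rf': RandomForestRegressor(n_estimators=100),
    # MLPRegressor takes hidden_layer_sizes, not hidden_layers
    'nn': MLPRegressor(hidden_layer_sizes=(64, 32, 16)),
}

# Fit each model and report its R² on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"{name}: R² = {score:.3f}")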
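For scikit-learn regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. A minimal pure-Python sketch of that quantity follows, included only to make explicit what the printed numbers measure; `r2_score_manual` is a hypothetical helper name and is not part of the experiments.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant; otherwise SS_tot is zero and
    the ratio is undefined.
    """
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A value of 1 means perfect predictions, 0 matches a model that always predicts the mean of the test targets, and negative values are worse than that baseline. Negative scores are possible for a poorly fitted model on a small test set.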
As shown in Figure 3.1, linear regression achieves the best performance on datasets with fewer than 500 samples.
In low-data regimes, simpler is often better. Before reaching for deep learning, try linear models: they are interpretable, fast, and often more accurate.