Nearest neighbors

Good as a baseline for small datasets; the predictions are easy to explain.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# setup for the loop: cancer dataset, one accuracy list per curve
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)

for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.legend()

Linear models

Go-to first algorithm to try; works well on very large datasets and on very high-dimensional data.

Linear models for regression

$$ \hat{y} = w[0] \cdot x[0] + w[1] \cdot x[1] + \dots + w[p] \cdot x[p] + b $$
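The prediction is just a dot product of the weights w with the features x, plus the intercept b. A minimal sketch with made-up weights (the values here are illustrative, not learned):

```python
import numpy as np

# hypothetical parameters for p = 3 features
w = np.array([0.5, -1.2, 2.0])
b = 0.3
x = np.array([1.0, 0.0, 2.0])

# y_hat = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b, i.e. a dot product plus intercept
y_hat = np.dot(w, x) + b
print(y_hat)  # 0.5*1.0 + (-1.2)*0.0 + 2.0*2.0 + 0.3 = 4.8
```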

import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
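After fitting, the learned slope(s) are stored in `lr.coef_` and the offset b in `lr.intercept_`. A self-contained sketch (using synthetic 1-D data in place of mglearn's wave dataset, so the numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic 1-D regression data standing in for make_wave
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)

print("coef_:", lr.coef_)            # learned slope per feature, shape (n_features,)
print("intercept_:", lr.intercept_)  # learned offset b
print("test R^2:", lr.score(X_test, y_test))
```

`score` returns R² for regressors, so 1.0 would be a perfect fit on the test set.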

<aside> 💡 To prevent overfitting, ‘regularization’ techniques penalize the magnitude of the feature coefficients in addition to minimizing the error between predicted and actual observations.

</aside>
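In scikit-learn, ridge regression (L2 penalty) and lasso (L1 penalty) implement this idea. A sketch on made-up data with more features than the signal needs, where plain least squares tends to overfit (the `alpha` values are arbitrary starting points, not tuned):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# synthetic data: 30 features, but only the first two carry signal
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge adds alpha * ||w||^2 to the loss; Lasso adds alpha * ||w||_1,
# which can drive some coefficients exactly to zero
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test R^2:", model.score(X_test, y_test))
```

Larger `alpha` means stronger shrinkage of the coefficients; `alpha=0` recovers ordinary least squares.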