### [Algorithms] - Multi class classification

From Techopedia: Linear multiclass classification is a specific kind of targeted algorithm philosophy in machine

**Ridge Regression**

Ridge regression learns the weights $w$ and intercept $b$ using the same least-squares criterion as ordinary linear regression, but adds a penalty for large variations in the $w$ parameters.

The addition of a penalty parameter is called regularization. Regularization is an important concept in machine learning. It is a way to prevent overfitting by reducing the model complexity. It improves the likely generalization performance of a model by restricting the model’s possible parameter settings.
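One common way to write the penalized criterion is shown below. The symbols $x_i$, $y_i$, $N$ (number of training points) and $p$ (number of features) are notation assumed here for illustration; $\alpha$ corresponds to the `alpha` argument used in the code further down.

$$
RSS_{ridge}(w, b) = \sum_{i=1}^{N} \bigl(y_i - (w \cdot x_i + b)\bigr)^2 + \alpha \sum_{j=1}^{p} w_j^2
$$

The first term is the usual least-squares error; the second is the L2 penalty. A larger $\alpha$ means more regularization and a simpler model.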

The practical effect of using ridge regression is to find feature weights, $w$, that fit the data well while keeping many of those weights small. On a regression problem with dozens or hundreds of features, the resulting improvement in accuracy on unseen data can be significant.
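As a rough illustration of this shrinking effect, the sketch below fits `Ridge` with several penalty strengths and prints the size of the weight vector each time. The data comes from `make_regression`, a synthetic stand-in chosen here as an assumption; it is not the crime dataset used in these notes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data (an assumption, not the crime dataset).
X_demo, y_demo = make_regression(n_samples=100, n_features=20,
                                 noise=10.0, random_state=0)

alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 norm summarizes the overall size of the weight vector.
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print('alpha = {:>6}: L2 norm of weights = {:.2f}'
          .format(alpha, coef_norms[-1]))
```

The printed norm should fall as `alpha` grows, which is the shrinkage described above.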

```
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

# X_crime, y_crime: crime dataset features and target, assumed to be loaded beforehand
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                    random_state=0)

# alpha sets the strength of the L2 penalty
linridge = Ridge(alpha=20.0).fit(X_train, y_train)

print('Crime dataset')
print('ridge regression linear model intercept: {}'
      .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
      .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
      .format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
      .format(linridge.score(X_test, y_test)))
print('Number of non-zero features: {}'
      .format(np.sum(linridge.coef_ != 0)))
```

```
Crime dataset
ridge regression linear model intercept: -3352.4230358463437
ridge regression linear model coeff:
[ 1.95091438e-03 2.19322667e+01 9.56286607e+00 -3.59178973e+01
6.36465325e+00 -1.96885471e+01 -2.80715856e-03 1.66254486e+00
-6.61426604e-03 -6.95450680e+00 1.71944731e+01 -5.62819154e+00
8.83525114e+00 6.79085746e-01 -7.33614221e+00 6.70389803e-03
9.78505502e-04 5.01202169e-03 -4.89870524e+00 -1.79270062e+01
9.17572382e+00 -1.24454193e+00 1.21845360e+00 1.03233089e+01
-3.78037278e+00 -3.73428973e+00 4.74595305e+00 8.42696855e+00
3.09250005e+01 1.18644167e+01 -2.05183675e+00 -3.82210450e+01
1.85081589e+01 1.52510829e+00 -2.20086608e+01 2.46283912e+00
3.29328703e-01 4.02228467e+00 -1.12903533e+01 -4.69567413e-03
4.27046505e+01 -1.22507167e-03 1.40795790e+00 9.35041855e-01
-3.00464253e+00 1.12390514e+00 -1.82487653e+01 -1.54653407e+01
2.41917002e+01 -1.32497562e+01 -4.20113118e-01 -3.59710660e+01
1.29786751e+01 -2.80765995e+01 4.38513476e+01 3.86590044e+01
-6.46024046e+01 -1.63714023e+01 2.90397330e+01 4.15472907e+00
5.34033563e+01 1.98773191e-02 -5.47413979e-01 1.23883518e+01
1.03526583e+01 -1.57238894e+00 3.15887097e+00 8.77757987e+00
-2.94724962e+01 -2.33454302e-04 3.13528914e-04 -4.13169509e-04
-1.80309962e-04 -5.74054525e-01 -5.17742507e-01 -4.20670933e-01
1.53383596e-01 1.32725423e+00 3.84863158e+00 3.03024594e+00
-3.77692644e+01 1.37933464e-01 3.07676522e-01 1.57128807e+01
3.31418306e-01 3.35994414e+00 1.61265911e-01 -2.67619878e+00]
R-squared score (training): 0.671
R-squared score (test): 0.494
Number of non-zero features: 88
```

The effect of increasing $\alpha$ is to shrink the coefficients toward 0 and toward each other. But if the features have very different scales, they will also make very different contributions to the penalty. Transforming the input features so they are all on the same scale means the ridge penalty is applied more "fairly" to all features, without unduly weighting some more than others just due to a difference in scales.
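A minimal sketch of that preprocessing step follows, using `MinMaxScaler` (the same scaler used in the lasso example in these notes). The two-feature dataset is hypothetical and built only so that the features sit on very different scales.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
# Hypothetical data: feature 0 lies in [0, 1], feature 1 in [0, 10000].
X_demo = np.column_stack([rng.uniform(0, 1, 200),
                          rng.uniform(0, 10000, 200)])
y_demo = 3.0 * X_demo[:, 0] + 0.0005 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max on the test data

ridge_raw = Ridge(alpha=20.0).fit(X_tr, y_tr)
ridge_scaled = Ridge(alpha=20.0).fit(X_tr_scaled, y_tr)

print('weights without scaling: {}'.format(ridge_raw.coef_))
print('weights with scaling:    {}'.format(ridge_scaled.coef_))
print('R-squared (test) without / with scaling: {:.3f} / {:.3f}'
      .format(ridge_raw.score(X_te, y_te),
              ridge_scaled.score(X_te_scaled, y_te)))
```

Fitting the scaler on the training split only, then applying it to the test split, keeps information about the test data out of the preprocessing.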

This was reviewed in our previous post about feature normalization.

**Lasso Regression**

Lasso regression is another form of regularized linear regression that uses an **L1 regularization** penalty for training, instead of the **L2 regularization** penalty used by Ridge regression.

This has the effect of setting the parameter weights in $w$ to zero for the least influential variables, which is called a "sparse solution."
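Written out, the lasso criterion differs from the ridge criterion only in the penalty term, which sums absolute values instead of squares. The notation $x_i$, $y_i$, $N$ and $p$ is assumed here for illustration, and scikit-learn's `Lasso` scales the error term slightly differently, so treat this as a sketch of the idea rather than the library's exact objective.

$$
RSS_{lasso}(w, b) = \sum_{i=1}^{N} \bigl(y_i - (w \cdot x_i + b)\bigr)^2 + \alpha \sum_{j=1}^{p} |w_j|
$$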

When to use ridge versus lasso regression:

Use Ridge if there are many variables with small/medium sized effects.

Use Lasso if there are only a few variables with medium/large effects.
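To make the contrast concrete, here is a small sketch that fits both models to the same synthetic data and counts the non-zero weights. The dataset (50 features, only 5 of them informative) and the value `alpha=5.0` are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only a few variables carry a real effect.
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=5, noise=10.0,
                                 random_state=0)

ridge = Ridge(alpha=5.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=5.0, max_iter=10000).fit(X_demo, y_demo)

n_ridge = int(np.sum(ridge.coef_ != 0))
n_lasso = int(np.sum(lasso.coef_ != 0))
print('non-zero weights, ridge: {}'.format(n_ridge))
print('non-zero weights, lasso: {}'.format(n_lasso))
```

Ridge shrinks weights but leaves them non-zero, while lasso drops many features entirely, which suits the "few variables with medium/large effects" case.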

```
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# X_crime, y_crime: crime dataset features and target, assumed to be loaded beforehand
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                    random_state=0)

# Fit the scaler on the training data only, then apply it to both splits
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=2.0, max_iter=10000).fit(X_train_scaled, y_train)

print('Crime dataset')
print('lasso regression linear model intercept: {}'
      .format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'
      .format(linlasso.coef_))
print('Non-zero features: {}'
      .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
      .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
      .format(linlasso.score(X_test_scaled, y_test)))
print('Top Features with non-zero weight:')

# Pair each column name with its weight, sorted by absolute weight (largest first)
for e in sorted(list(zip(list(X_crime), linlasso.coef_)),
                key=lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))
```

```
Crime dataset
lasso regression linear model intercept: 1186.612061998579
lasso regression linear model coeff:
[ 0. 0. -0. -168.18346054
-0. -0. 0. 119.6938194
0. -0. 0. -169.67564456
-0. 0. -0. 0.
0. 0. -0. -0.
0. -0. 0. 0.
-57.52991966 -0. -0. 0.
259.32889226 -0. 0. 0.
0. -0. -1188.7396867 -0.
-0. -0. -231.42347299 0.
1488.36512229 0. -0. -0.
-0. 0. 0. 0.
0. 0. -0. 0.
20.14419415 0. 0. 0.
0. 0. 339.04468804 0.
0. 459.53799903 -0. 0.
122.69221826 -0. 91.41202242 0.
-0. 0. 0. 73.14365856
0. -0. 0. 0.
86.35600042 0. 0. 0.
-104.57143405 264.93206555 0. 23.4488645
-49.39355188 0. 5.19775369 0. ]
Non-zero features: 20
R-squared score (training): 0.631
R-squared score (test): 0.624
Top Features with non-zero weight:
PctKidsBornNeverMar, 1488.365
PctKids2Par, -1188.740
HousVacant, 459.538
PctPersDenseHous, 339.045
NumInShelters, 264.932
MalePctDivorce, 259.329
PctWorkMom, -231.423
pctWInvInc, -169.676
agePct12t29, -168.183
PctVacantBoarded, 122.692
pctUrban, 119.694
MedOwnCostPctIncNoMtg, -104.571
MedYrHousBuilt, 91.412
RentQrange, 86.356
OwnOccHiQuart, 73.144
PctEmplManu, -57.530
PctBornSameState, -49.394
PctForeignBorn, 23.449
PctLargHouseFam, 20.144
PctSameCity85, 5.198
```

So, 20 out of 88 features have non-zero weight in this example. The five features with the largest-magnitude weights, that is, the strongest relationships between input variable and outcome for this dataset, are:

PctKidsBornNeverMar, the percentage of kids born to people who never married,

PctKids2Par, the percentage of kids in family housing with two parents,

HousVacant, the number of vacant houses,

PctPersDenseHous, the percentage of persons in dense housing (1+ person/room), and

NumInShelters, the number of people in homeless shelters.

```
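# Refit lasso for a range of alpha values and compare sparsity and fit.
# Uses X_train_scaled, X_test_scaled, y_train, y_test from the previous block.
# Note: '{:.0f}' prints alpha = 0.5 as 0 in the output below.
for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    linlasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    print('Alpha = {:.0f}:\tFeatures kept: {}\tr-squared training, test:\t{:.2f}, {:.2f}\r'
          .format(alpha, np.sum(linlasso.coef_ != 0), r2_train, r2_test))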
```

```
Alpha = 0: Features kept: 35 r-squared training, test: 0.65, 0.58
Alpha = 1: Features kept: 25 r-squared training, test: 0.64, 0.60
Alpha = 2: Features kept: 20 r-squared training, test: 0.63, 0.62
Alpha = 3: Features kept: 17 r-squared training, test: 0.62, 0.63
Alpha = 5: Features kept: 12 r-squared training, test: 0.60, 0.61
Alpha = 10: Features kept: 6 r-squared training, test: 0.57, 0.58
Alpha = 20: Features kept: 2 r-squared training, test: 0.51, 0.50
Alpha = 50: Features kept: 1 r-squared training, test: 0.31, 0.30
```

As with ridge regression, there is an optimal range of values for $\alpha$ that will differ between datasets and between feature preprocessing methods.
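One way to search for a good value without touching the test set is cross-validation. This is a different technique from the manual sweep above, shown here only as a hedged sketch: it uses scikit-learn's `LassoCV` on synthetic stand-in data (an assumption), with the same candidate values as the loop above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (an assumption, not the crime dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=5, noise=10.0,
                                 random_state=0)

candidate_alphas = [0.5, 1, 2, 3, 5, 10, 20, 50]

# 5-fold cross-validation over the candidate penalty strengths.
lasso_cv = LassoCV(alphas=candidate_alphas, cv=5,
                   max_iter=10000).fit(X_demo, y_demo)

best_alpha = float(lasso_cv.alpha_)
print('alpha chosen by cross-validation: {}'.format(best_alpha))
```

The chosen value depends on the data and on any scaling applied beforehand, which is the point made above.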

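**Polynomial Features with Linear Regression**

Suppose we have a set of two-dimensional data points with features $x_0$ and $x_1$.

We could transform each data point by adding additional features that were the three unique multiplicative combinations of $x_0$ and $x_1$, yielding the following:

$$
(x_0, x_1) \rightarrow (x_0, x_1, x_0^2, x_0 x_1, x_1^2)
$$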

The degree of the polynomial specifies how many variables participate at a time in each new feature (above: 2). Note that this is still a weighted linear combination of features, so it's still a linear model. Intuitively, this allows polynomials to be fit to the training data instead of simply a straight line, while still using the same least-squares criterion.

This approach of adding new features, such as polynomial features, can be very effective for regression as well as classification. For example, housing prices may vary as a quadratic function of both the lot size and the amount of taxes paid on the property.

It is important to be careful about polynomial feature expansion with high degree, because this can lead to complex models that overfit. For this reason, polynomial feature expansion is often combined with a regularized learning method like ridge regression.

```
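from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: X_F1, y_F1 are the 7-feature Friedman #1 synthetic regression data;
# the original notes do not show how they were created.
X_F1, y_F1 = make_friedman1(n_samples=100, n_features=7, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1,
                                                    random_state=0)

# Baseline: ordinary least squares on the original features
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w): {}'
      .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
      .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
      .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
      .format(linreg.score(X_test, y_test)))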
```

```
linear model coeff (w): [ 4.42036739 5.99661447 0.52894712 10.23751345 6.5507973 -2.02082636
-0.32378811]
linear model intercept (b): 1.543
R-squared score (training): 0.722
R-squared score (test): 0.722
```

```
from sklearn.preprocessing import PolynomialFeatures

# Expand the original features into all polynomial terms up to degree 2
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)

X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                    random_state=0)

# Same unregularized least-squares fit, now on the expanded features
linreg = LinearRegression().fit(X_train, y_train)

print('(poly deg 2) linear model coeff (w):\n{}'
      .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
      .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.3f}'
      .format(linreg.score(X_train, y_train)))
print('(poly deg 2) R-squared score (test): {:.3f}\n'
      .format(linreg.score(X_test, y_test)))
```

```
(poly deg 2) linear model coeff (w):
[ 3.40951018e-12 1.66452443e+01 2.67285381e+01 -2.21348316e+01
1.24359227e+01 6.93086826e+00 1.04772675e+00 3.71352773e+00
-1.33785505e+01 -5.73177185e+00 1.61813184e+00 3.66399592e+00
5.04513181e+00 -1.45835979e+00 1.95156872e+00 -1.51297378e+01
4.86762224e+00 -2.97084269e+00 -7.78370522e+00 5.14696078e+00
-4.65479361e+00 1.84147395e+01 -2.22040650e+00 2.16572630e+00
-1.27989481e+00 1.87946559e+00 1.52962716e-01 5.62073813e-01
-8.91697516e-01 -2.18481128e+00 1.37595426e+00 -4.90336041e+00
-2.23535458e+00 1.38268439e+00 -5.51908208e-01 -1.08795007e+00]
(poly deg 2) linear model intercept (b): -3.206
(poly deg 2) R-squared score (training): 0.969
(poly deg 2) R-squared score (test): 0.805
```

The polynomial features version appears to have overfit. Note that the R-squared score is nearly 1 on the training data, and only 0.8 on the test data. The addition of many polynomial features often leads to overfitting, so it is common to use polynomial features in combination with regression that has a regularization penalty, like ridge regression.

```
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                    random_state=0)

# Ridge with its default penalty strength (alpha=1.0) on the expanded features
linreg = Ridge().fit(X_train, y_train)

print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
      .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
      .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
      .format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
      .format(linreg.score(X_test, y_test)))
```

```
(poly deg 2 + ridge) linear model coeff (w):
[ 0. 2.229281 4.73349734 -3.15432089 3.8585194 1.60970912
-0.76967054 -0.14956002 -1.75215371 1.5970487 1.37080607 2.51598244
2.71746523 0.48531538 -1.9356048 -1.62914955 1.51474518 0.88674141
0.26141199 2.04931775 -1.93025705 3.61850966 -0.71788143 0.63173956
-3.16429847 1.29161448 3.545085 1.73422041 0.94347654 -0.51207219
1.70114448 -1.97949067 1.80687548 -0.2173863 2.87585898 -0.89423157]
(poly deg 2 + ridge) linear model intercept (b): 5.418
(poly deg 2 + ridge) R-squared score (training): 0.826
(poly deg 2 + ridge) R-squared score (test): 0.825
```

Note that this model outperforms both the linear model (test R-squared 0.722) and the version with polynomial features that was trained using non-regularized regression (test R-squared 0.805), reaching a test R-squared of 0.825.
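The expansion and the regularized fit can also be chained into a single estimator. The sketch below uses scikit-learn's `make_pipeline`, which is a packaging choice made here and not something the original notes use. It also assumes the data is the 7-feature Friedman #1 synthetic set from `make_friedman1`.

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumption: a 7-feature Friedman #1 dataset stands in for X_F1, y_F1.
X_demo, y_demo = make_friedman1(n_samples=100, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Degree-2 expansion followed by ridge regression, fit as one model.
model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
n_weights = model.named_steps['ridge'].coef_.shape[0]

print('expanded features: {}'.format(n_weights))
print('R-squared (training): {:.3f}'.format(train_r2))
print('R-squared (test): {:.3f}'.format(test_r2))
```

With a pipeline, the raw features are split first and every preprocessing step is fit on the training portion only, which matters more for data-dependent steps such as scaling.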

These notes were taken from the Coursera course Applied Machine Learning in Python. The information is presented by Kevyn Collins-Thompson, PhD, an associate professor of Information and Computer Science at the University of Michigan.