05. NumPy Arrays
Now that we've loaded the data in pandas, we need to split the input and output into NumPy arrays in order to apply the classifiers in scikit-learn. This is done in the following way: say we have a pandas DataFrame called df, like the following, with four columns labeled A, B, C, and D:
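The table is not reproduced here, so as a rough stand-in, a small frame with the same column labels can be built by hand. In this sketch only the values of column A (1, 5, 9) are taken from the output shown in this lesson; the values in B, C, and D are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's dataframe.
# Column A matches the output shown in the lesson; B, C, and D are assumed values.
df = pd.DataFrame({
    "A": [1, 5, 9],
    "B": [2, 6, 10],
    "C": [3, 7, 11],
    "D": [4, 8, 12],
})

print(df)
```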

If we want to extract column A, we do the following:
>> df['A']
0 1
1 5
2 9
Name: A, dtype: int64
Now, if we want to extract more columns, we just need to specify them, as follows:
>> df[['B', 'D']]
And the result is the following DataFrame:
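Since the resulting table is not shown here, the sketch below illustrates the two kinds of selection on a hypothetical frame (the values of B, C, and D are assumed, not taken from the lesson). Single brackets with one label give back a pandas Series, while a list of labels inside the brackets gives back a DataFrame containing just those columns.

```python
import pandas as pd

# Hypothetical frame; only column A's values come from the lesson output.
df = pd.DataFrame({
    "A": [1, 5, 9],
    "B": [2, 6, 10],
    "C": [3, 7, 11],
    "D": [4, 8, 12],
})

single = df["A"]         # one label -> Series
subset = df[["B", "D"]]  # list of labels -> DataFrame with only B and D

print(type(single).__name__)
print(type(subset).__name__)
print(subset)
```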

And finally, we turn these pandas DataFrames into NumPy arrays. The command for turning a DataFrame df into a NumPy array is very simple:
>> numpy.array(df)
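As a minimal sketch of the conversion, again on a hypothetical frame whose B, C, and D values are assumed: converting the whole DataFrame should give a two-dimensional array with one row per dataframe row, while converting a single column gives a one-dimensional array. pandas also provides df.to_numpy(), which is expected to yield the same values for a frame like this one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only column A's values come from the lesson output.
df = pd.DataFrame({
    "A": [1, 5, 9],
    "B": [2, 6, 10],
    "C": [3, 7, 11],
    "D": [4, 8, 12],
})

arr = np.array(df)        # whole frame -> 2-D array (rows x columns)
same = df.to_numpy()      # pandas' own conversion method
col = np.array(df["A"])   # single column -> 1-D array

print(arr.shape)
print(col.shape)
print(np.array_equal(arr, same))
```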
Now, try it yourself! Working with the same dataframe that we loaded in pandas previously, split it into the features X and the labels y, and turn them into NumPy arrays.
Note: The capitalization may look strange, since X is capitalized whereas y is lowercase, but this is standard notation: X represents a matrix of (possibly) several columns, and y a single column vector.
Start Quiz:
import pandas as pd
import numpy as np
data = pd.read_csv("data.csv")
# TODO: Separate the features and the labels into arrays called X and y
X = None
y = None
data.csv:
x1,x2,y
0.78051,-0.063669,0
0.28774,0.29139,0
0.40714,0.17878,0
0.2923,0.4217,0
0.50922,0.35256,0
0.27785,0.10802,0
0.27527,0.33223,0
0.43999,0.31245,0
0.33557,0.42984,0
0.23448,0.24986,0
0.0084492,0.13658,0
0.12419,0.33595,0
0.25644,0.42624,0
0.4591,0.40426,0
0.44547,0.45117,0
0.42218,0.20118,0
0.49563,0.21445,0
0.30848,0.24306,0
0.39707,0.44438,0
0.32945,0.39217,0
0.40739,0.40271,0
0.3106,0.50702,0
0.49638,0.45384,0
0.10073,0.32053,0
0.69907,0.37307,0
0.29767,0.69648,0
0.15099,0.57341,0
0.16427,0.27759,0
0.33259,0.055964,0
0.53741,0.28637,0
0.19503,0.36879,0
0.40278,0.035148,0
0.21296,0.55169,0
0.48447,0.56991,0
0.25476,0.34596,0
0.21726,0.28641,0
0.67078,0.46538,0
0.3815,0.4622,0
0.53838,0.32774,0
0.4849,0.26071,0
0.37095,0.38809,0
0.54527,0.63911,0
0.32149,0.12007,0
0.42216,0.61666,0
0.10194,0.060408,0
0.15254,0.2168,1
0.45558,0.43769,1
0.28488,0.52142,1
0.27633,0.21264,1
0.39748,0.31902,1
0.5533,1,0
0.44274,0.59205,0
0.85176,0.6612,0
0.60436,0.86605,0
0.68243,0.48301,0
1,0.76815,1
0.72989,0.8107,1
0.67377,0.77975,1
0.78761,0.58177,1
0.71442,0.7668,1
0.49379,0.54226,1
0.78974,0.74233,1
0.67905,0.60921,1
0.6642,0.72519,1
0.79396,0.56789,1
0.70758,0.76022,1
0.59421,0.61857,1
0.49364,0.56224,1
0.77707,0.35025,1
0.79785,0.76921,1
0.70876,0.96764,1
0.69176,0.60865,1
0.66408,0.92075,1
0.65973,0.66666,1
0.64574,0.56845,1
0.89639,0.7085,1
0.85476,0.63167,1
0.62091,0.80424,1
0.79057,0.56108,1
0.58935,0.71582,1
0.56846,0.7406,1
0.65912,0.71548,1
0.70938,0.74041,1
0.59154,0.62927,1
0.45829,0.4641,1
0.79982,0.74847,1
0.60974,0.54757,1
0.68127,0.86985,1
0.76694,0.64736,1
0.69048,0.83058,1
0.68122,0.96541,1
0.73229,0.64245,1
0.76145,0.60138,1
0.58985,0.86955,1
0.73145,0.74516,1
0.77029,0.7014,1
0.73156,0.71782,1
0.44556,0.57991,1
0.85275,0.85987,1
0.51912,0.62359,1
Solution:
import pandas as pd
import numpy as np
data = pd.read_csv("data.csv")
# Separate the features and the labels into arrays called X and y
X = np.array(data[['x1', 'x2']])
y = np.array(data['y'])
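One way to sanity-check the solution without the file on disk is to read a few rows from a string. The sketch below uses five rows copied from data.csv above; io.StringIO stands in for the file, and the name sample_csv is hypothetical. It also makes the X and y convention concrete: X comes out two-dimensional (one row per sample, one column per feature) and y one-dimensional.

```python
import io

import numpy as np
import pandas as pd

# Five rows copied from data.csv; sample_csv is a hypothetical name for this excerpt.
sample_csv = """x1,x2,y
0.78051,-0.063669,0
0.28774,0.29139,0
0.40714,0.17878,0
0.15254,0.2168,1
0.45558,0.43769,1
"""

data = pd.read_csv(io.StringIO(sample_csv))

X = np.array(data[['x1', 'x2']])  # features: 2-D array, shape (n_samples, 2)
y = np.array(data['y'])           # labels: 1-D array, shape (n_samples,)

print(X.shape)
print(y.shape)
```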