NumPy Arrays

Now that we've loaded the data in pandas, we need to split the input and output into NumPy arrays so that we can apply the classifiers in scikit-learn. Say we have a pandas DataFrame called df with four columns labeled A, B, C, and D.

If we want to extract column A, we do the following:

>> df['A']
    0    1
    1    5
    2    9
    Name: A, dtype: int64
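To make the selection above reproducible, here is a minimal sketch that builds a small DataFrame and extracts column A. Only the values of column A (1, 5, 9) appear in the output above; the values used for B, C, and D are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical contents: only column A's values (1, 5, 9) come from the text;
# the values for B, C, and D are assumed for illustration.
df = pd.DataFrame({
    "A": [1, 5, 9],
    "B": [2, 6, 10],
    "C": [3, 7, 11],
    "D": [4, 8, 12],
})

col_a = df["A"]               # a single label returns a pandas Series
print(type(col_a).__name__)   # Series
print(col_a.tolist())         # [1, 5, 9]
```

Selecting with a single label gives back a one-dimensional Series rather than a DataFrame, which is why the printed output ends with a Name and dtype line.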

To extract several columns at once, we pass a list of column names:

>> df[['B', 'D']]

The result is a DataFrame containing only columns B and D.
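As a rough illustration of multi-column selection, the sketch below uses a small DataFrame with assumed values (the original table is not reproduced here, so every number other than column A is hypothetical) and checks what comes back.

```python
import pandas as pd

# Hypothetical contents, assumed for illustration only.
df = pd.DataFrame({
    "A": [1, 5, 9],
    "B": [2, 6, 10],
    "C": [3, 7, 11],
    "D": [4, 8, 12],
})

subset = df[["B", "D"]]        # a list of labels returns a DataFrame
print(subset.shape)            # (3, 2)
print(list(subset.columns))    # ['B', 'D']

single = df[["B"]]             # a one-element list still returns a DataFrame
print(single.shape)            # (3, 1)
```

Note the double brackets: the outer pair is the indexing operation and the inner pair is an ordinary Python list of column names.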

Finally, we turn these pandas DataFrames into NumPy arrays. Converting a DataFrame df takes a single call (written np.array(df) when NumPy is imported as np, as in the quiz below):

>> numpy.array(df)
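The conversion can be sketched end to end as follows. The DataFrame values are assumed for illustration, and the `to_numpy()` call is shown only as an alternative that pandas also provides, not as something this lesson relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical contents, assumed for illustration only.
df = pd.DataFrame({
    "A": [1, 5, 9],
    "B": [2, 6, 10],
    "C": [3, 7, 11],
    "D": [4, 8, 12],
})

arr = np.array(df)         # whole DataFrame -> 2-D array (rows x columns)
col = np.array(df["A"])    # single column (Series) -> 1-D array
alt = df.to_numpy()        # pandas' own conversion method

print(arr.shape)                  # (3, 4)
print(col.shape)                  # (3,)
print(np.array_equal(arr, alt))   # True
```

A DataFrame becomes a two-dimensional array, while a single column becomes a one-dimensional array. Column labels and the index are dropped in both cases.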

Now, try it yourself! Working with the same DataFrame that we loaded in pandas previously, split it into the features X and the labels y, and turn both into NumPy arrays.

Note: The capitalization may look strange, since X is uppercase and y is lowercase, but this is standard notation: X represents a matrix with possibly several columns, and y represents a single column vector.

Start Quiz:

Files: quiz.py, data.csv, solution.py (contents listed below, in that order)

import pandas as pd
import numpy as np

data = pd.read_csv("data.csv")

# TODO: Separate the features and the labels into arrays called X and y

X = None
y = None

x1,x2,y
0.78051,-0.063669,0
0.28774,0.29139,0
0.40714,0.17878,0
0.2923,0.4217,0
0.50922,0.35256,0
0.27785,0.10802,0
0.27527,0.33223,0
0.43999,0.31245,0
0.33557,0.42984,0
0.23448,0.24986,0
0.0084492,0.13658,0
0.12419,0.33595,0
0.25644,0.42624,0
0.4591,0.40426,0
0.44547,0.45117,0
0.42218,0.20118,0
0.49563,0.21445,0
0.30848,0.24306,0
0.39707,0.44438,0
0.32945,0.39217,0
0.40739,0.40271,0
0.3106,0.50702,0
0.49638,0.45384,0
0.10073,0.32053,0
0.69907,0.37307,0
0.29767,0.69648,0
0.15099,0.57341,0
0.16427,0.27759,0
0.33259,0.055964,0
0.53741,0.28637,0
0.19503,0.36879,0
0.40278,0.035148,0
0.21296,0.55169,0
0.48447,0.56991,0
0.25476,0.34596,0
0.21726,0.28641,0
0.67078,0.46538,0
0.3815,0.4622,0
0.53838,0.32774,0
0.4849,0.26071,0
0.37095,0.38809,0
0.54527,0.63911,0
0.32149,0.12007,0
0.42216,0.61666,0
0.10194,0.060408,0
0.15254,0.2168,1
0.45558,0.43769,1
0.28488,0.52142,1
0.27633,0.21264,1
0.39748,0.31902,1
0.5533,1,0
0.44274,0.59205,0
0.85176,0.6612,0
0.60436,0.86605,0
0.68243,0.48301,0
1,0.76815,1
0.72989,0.8107,1
0.67377,0.77975,1
0.78761,0.58177,1
0.71442,0.7668,1
0.49379,0.54226,1
0.78974,0.74233,1
0.67905,0.60921,1
0.6642,0.72519,1
0.79396,0.56789,1
0.70758,0.76022,1
0.59421,0.61857,1
0.49364,0.56224,1
0.77707,0.35025,1
0.79785,0.76921,1
0.70876,0.96764,1
0.69176,0.60865,1
0.66408,0.92075,1
0.65973,0.66666,1
0.64574,0.56845,1
0.89639,0.7085,1
0.85476,0.63167,1
0.62091,0.80424,1
0.79057,0.56108,1
0.58935,0.71582,1
0.56846,0.7406,1
0.65912,0.71548,1
0.70938,0.74041,1
0.59154,0.62927,1
0.45829,0.4641,1
0.79982,0.74847,1
0.60974,0.54757,1
0.68127,0.86985,1
0.76694,0.64736,1
0.69048,0.83058,1
0.68122,0.96541,1
0.73229,0.64245,1
0.76145,0.60138,1
0.58985,0.86955,1
0.73145,0.74516,1
0.77029,0.7014,1
0.73156,0.71782,1
0.44556,0.57991,1
0.85275,0.85987,1
0.51912,0.62359,1

import pandas as pd
import numpy as np

data = pd.read_csv("data.csv")

# Separate the features and the labels into arrays called X and y

X = np.array(data[['x1', 'x2']])
y = np.array(data['y'])
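As a sanity check on the solution, the sketch below applies the same two lines to a handful of rows copied from data.csv. The rows are inlined with io.StringIO (an assumption made only so the example runs without the file on disk) and the resulting shapes are printed.

```python
import io

import numpy as np
import pandas as pd

# Four rows copied from data.csv, inlined so the sketch is self-contained.
sample_csv = io.StringIO(
    "x1,x2,y\n"
    "0.78051,-0.063669,0\n"
    "0.28774,0.29139,0\n"
    "0.15254,0.2168,1\n"
    "1,0.76815,1\n"
)
data = pd.read_csv(sample_csv)

X = np.array(data[["x1", "x2"]])   # features: 2-D array, one row per sample
y = np.array(data["y"])            # labels: 1-D array, one entry per sample

print(X.shape)      # (4, 2)
print(y.shape)      # (4,)
print(y.tolist())   # [0, 0, 1, 1]
```

X and y should always have the same number of rows. With the full data.csv, the first dimension of each would equal the number of data rows in the file.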
