10. Quiz: Testing in sklearn
Splitting a dataset into training and testing data is very easy with sklearn. All we need is the function train_test_split. It takes as inputs X and y, and returns four things:
X_train: The training input
X_test: The testing input
y_train: The training labels
y_test: The testing labels
The call to the function looks as follows:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
The last parameter, test_size, is the fraction of the points that we want to use for testing. In the above call, we are using 25% of our points for testing, and 75% for training.
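To see how test_size affects the sizes of the four outputs, here is a minimal sketch on a small made-up array. The toy data and the random_state value are assumptions for illustration only and are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 points with 2 features each, and 8 labels.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# test_size=0.25 holds out a quarter of the points for testing.
# random_state is fixed only so that the shuffle is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (6,) (2,)
```

With 8 points and test_size = 0.25, 2 points go to the testing set and the remaining 6 to the training set.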
Let's practice! We'll again use the dataset from the previous section:

In the following quiz, use the train_test_split function to split the dataset into training and testing sets. The size of the testing set must be 20% of the total size of the data. Call your training sets X_train and y_train, and your testing sets X_test and y_test.
Click on Test Run to see a visualization of the results, where the training set will be drawn as circles, and the testing set as squares. Then when you're done, click on Submit to check your code!
Start Quiz:
# Reading the csv file
import pandas as pd
data = pd.read_csv("data.csv")
# Splitting the data into X and y
import numpy as np
X = np.array(data[['x1', 'x2']])
y = np.array(data['y'])
# Import statement for train_test_split
from sklearn.model_selection import train_test_split
# TODO: Use the train_test_split function to split the data into
# training and testing sets.
# The size of the testing set should be 20% of the total size of the data.
# Your output should contain 4 objects.
x1,x2,y
0.78051,-0.063669,0
0.28774,0.29139,0
0.40714,0.17878,0
0.2923,0.4217,0
0.50922,0.35256,0
0.27785,0.10802,0
0.27527,0.33223,0
0.43999,0.31245,0
0.33557,0.42984,0
0.23448,0.24986,0
0.0084492,0.13658,0
0.12419,0.33595,0
0.25644,0.42624,0
0.4591,0.40426,0
0.44547,0.45117,0
0.42218,0.20118,0
0.49563,0.21445,0
0.30848,0.24306,0
0.39707,0.44438,0
0.32945,0.39217,0
0.40739,0.40271,0
0.3106,0.50702,0
0.49638,0.45384,0
0.10073,0.32053,0
0.69907,0.37307,0
0.29767,0.69648,0
0.15099,0.57341,0
0.16427,0.27759,0
0.33259,0.055964,0
0.53741,0.28637,0
0.19503,0.36879,0
0.40278,0.035148,0
0.21296,0.55169,0
0.48447,0.56991,0
0.25476,0.34596,0
0.21726,0.28641,0
0.67078,0.46538,0
0.3815,0.4622,0
0.53838,0.32774,0
0.4849,0.26071,0
0.37095,0.38809,0
0.54527,0.63911,0
0.32149,0.12007,0
0.42216,0.61666,0
0.10194,0.060408,0
0.15254,0.2168,1
0.45558,0.43769,1
0.28488,0.52142,1
0.27633,0.21264,1
0.39748,0.31902,1
0.5533,1,0
0.44274,0.59205,0
0.85176,0.6612,0
0.60436,0.86605,0
0.68243,0.48301,0
1,0.76815,1
0.72989,0.8107,1
0.67377,0.77975,1
0.78761,0.58177,1
0.71442,0.7668,1
0.49379,0.54226,1
0.78974,0.74233,1
0.67905,0.60921,1
0.6642,0.72519,1
0.79396,0.56789,1
0.70758,0.76022,1
0.59421,0.61857,1
0.49364,0.56224,1
0.77707,0.35025,1
0.79785,0.76921,1
0.70876,0.96764,1
0.69176,0.60865,1
0.66408,0.92075,1
0.65973,0.66666,1
0.64574,0.56845,1
0.89639,0.7085,1
0.85476,0.63167,1
0.62091,0.80424,1
0.79057,0.56108,1
0.58935,0.71582,1
0.56846,0.7406,1
0.65912,0.71548,1
0.70938,0.74041,1
0.59154,0.62927,1
0.45829,0.4641,1
0.79982,0.74847,1
0.60974,0.54757,1
0.68127,0.86985,1
0.76694,0.64736,1
0.69048,0.83058,1
0.68122,0.96541,1
0.73229,0.64245,1
0.76145,0.60138,1
0.58985,0.86955,1
0.73145,0.74516,1
0.77029,0.7014,1
0.73156,0.71782,1
0.44556,0.57991,1
0.85275,0.85987,1
0.51912,0.62359,1
# Reading the csv file
import pandas as pd
data = pd.read_csv("data.csv")
# Splitting the data into X and y
import numpy as np
X = np.array(data[['x1', 'x2']])
y = np.array(data['y'])
# Import statement for train_test_split
from sklearn.model_selection import train_test_split
# TODO: Use the train_test_split function to split the data into
# training and testing sets.
# The size of the testing set should be 20% of the total size of the data.
# Your output should contain 4 objects.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
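One quick way to sanity-check a split like this is to compare the sizes of the resulting pieces. The sketch below is only an illustration: it uses stand-in random data with the same shape as the quiz data (100 rows, two features) instead of reading data.csv, and the helper name check_split is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def check_split(X, y, test_size):
    """Split X and y, then return (n_train, n_test) for inspection."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    # Inputs and labels must stay aligned, so their lengths must match.
    assert len(X_train) == len(y_train)
    assert len(X_test) == len(y_test)
    return len(X_train), len(X_test)


# Stand-in data (an assumption), not the real file.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = rng.integers(0, 2, size=100)

n_train, n_test = check_split(X, y, test_size=0.2)
print(n_train, n_test)  # 80 20
```

For 100 points and test_size = 0.2, the testing set holds 20 points and the training set holds the other 80.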