I have this pipeline:
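import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

diamonds = sns.load_dataset("diamonds")
# Build feature/target arrays
X, y = diamonds.drop("cut", axis=1), diamonds["cut"]
# Train/test split (assumed; the snippet as posted does not show it)
X_train, X_test, y_train, y_test = train_test_split(X, y)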
# Set up the colnames
to_scale = ["depth", "table", "x", "y", "z"]
to_log = ["price", "carat"]
categorical = X.select_dtypes(include="category").columns
scale_pipe = make_pipeline(StandardScaler())
log_pipe = make_pipeline(PowerTransformer())
categorical_pipe = make_pipeline(OneHotEncoder(sparse=False))
transformer = ColumnTransformer(
    transformers=[
        ("scale", scale_pipe, to_scale),
        ("log_transform", log_pipe, to_log),
        ("oh_encode", categorical_pipe, categorical),
    ]
)
knn_pipe = Pipeline([("prep", transformer), ("knn", KNeighborsClassifier())])
# Fit/predict/score
_ = knn_pipe.fit(X_train, y_train)
preds = knn.predict(X_test)
When I run it, it is fitting to the data perfectly fine but I can’t score or make predictions because I am getting this error:
ValueError: could not convert string to float: 'G'
The above exception was the direct cause of the following exception:
ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
It is a classification problem, so I thought the error occurred because I didn’t encode the target. But even after using LabelEncoder on the target, I am still getting the same error.
What might be the reason? I tried the pipeline with other models too. The error is the same. BTW, I am using the built-in Diamonds dataset of Seaborn.
Solution
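It looks like you did not predict the values for X_test with your knn_pipe. The variable knn that you use in your last line is actually undefined in the example you provide. I guess you have defined it somewhere in your original code, presumably as an estimator outside the pipeline. Calling it directly skips the "prep" step, so it receives the raw string categories (such as the color 'G') and raises this error.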
Anyway, just change
preds = knn.predict(X_test)
to
preds = knn_pipe.predict(X_test)
and it will work.
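To make the difference concrete, here is a minimal sketch that contrasts the two calls. It is only an illustration under some assumptions: the small random DataFrame is a made-up stand-in for the diamonds data (so it runs without downloading anything), only three columns are used, and knn here is the fitted final step pulled out of the pipeline, which may not be how knn was defined in your code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the diamonds data: two numeric columns, one categorical
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "depth": rng.normal(61, 1.5, n),
        "price": rng.uniform(300, 18000, n),
        "color": pd.Categorical(rng.choice(list("DEFGHIJ"), n)),
    }
)
y = rng.choice(["Ideal", "Premium", "Good"], n)  # string labels, no encoding needed

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

transformer = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["depth", "price"]),
        ("oh_encode", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ]
)
knn_pipe = Pipeline([("prep", transformer), ("knn", KNeighborsClassifier())])
knn_pipe.fit(X_train, y_train)

# Predicting through the pipeline applies "prep" to X_test first
preds = knn_pipe.predict(X_test)

# Predicting with the bare estimator bypasses "prep", so it sees raw strings
knn = knn_pipe.named_steps["knn"]  # assumed stand-in for the undefined knn
try:
    knn.predict(X_test)
    raw_predict_failed = False
except ValueError as err:
    raw_predict_failed = True
    print("Bare estimator failed:", err)
```

The same applies to scoring: knn_pipe.score(X_test, y_test) runs the preprocessing before scoring, whereas calling score on an estimator outside the pipeline does not.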