Hi,
I have a training dataset which is 67% of all my data. Then an evaluation dataset which is 33%.
They’ve been randomly shuffled. Somehow, there are some values in the evaluation dataset which didn’t appear in training, and that is causing the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[4] = 54 is not in [0, 53)
After some googling, this seems to be because not all of the vocabulary values appear in the training dataset: the vocabulary built from training has 53 entries, so valid indices are 0–52, and index 54 is out of range. I want to just extend the vocab size but I’m unsure how to do it.
The relevant lines of code would be these ones I think:
for feature_name in CATEGORICAL_COLUMNS:
    # gets a list of all unique values from the given feature column
    vocabulary = dftrain[feature_name].unique()
    feature_columns.append(
        tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))
The vocabulary here is a list of values, not a scalar length, so I can’t simply add to it.
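(For anyone landing here with the same error: one approach, not from the original post, is the num_oov_buckets parameter of tf.feature_column.categorical_column_with_vocabulary_list, which reserves extra indices after the vocabulary for values never seen in training, e.g. categorical_column_with_vocabulary_list(feature_name, vocabulary, num_oov_buckets=1). The idea can be sketched in plain Python without TensorFlow:)

```python
def lookup(value, vocabulary, num_oov_buckets=1):
    """Map a value to an integer index.

    In-vocabulary values get their position in the list; unseen values
    hash into one of num_oov_buckets extra slots appended after the
    vocabulary, mimicking what num_oov_buckets does in TensorFlow's
    categorical_column_with_vocabulary_list.
    """
    if value in vocabulary:
        return vocabulary.index(value)
    # Out-of-vocabulary: deterministic bucket in [len(vocab), len(vocab) + num_oov_buckets)
    return len(vocabulary) + hash(value) % num_oov_buckets

vocab = ["a", "b", "c"]
print(lookup("b", vocab))  # 1 — in-vocabulary, index within the list
print(lookup("z", vocab))  # 3 — unseen, lands in the single OOV bucket
```

With num_oov_buckets=1 every unseen value maps to the same extra index, so the embedding/indicator size becomes len(vocabulary) + 1 and the "indices[i] = N is not in [0, M)" error goes away for evaluation-only values.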
Any ideas or more information required?
submitted by /u/Cwlrs