transformations¶
scikit-learn-compliant transformers, for use as part of pipelines.
Classes

CategoricalImputer([missing_values, …]) – Impute missing values from a categorical/string np.ndarray or pd.Series with the most frequent value on the training data.
EmbeddingVectorizer([max_sequence_length]) – Converts text into padded sequences.
LabelEncoder – Encode labels with value between 0 and n_classes-1.
TimeSeriesVectorizer([max_sequence_length])
TypeConversionEncoder(conversion_type)
class transformations.CategoricalImputer(missing_values='NaN', strategy='most_frequent', fill_value='?', fill_unknown_labels=False, copy=True)¶
Impute missing values from a categorical/string np.ndarray or pd.Series with the most frequent value on the training data.

Parameters:
- missing_values : string or "NaN", optional (default="NaN") – The placeholder for the missing values. All occurrences of missing_values will be imputed. None and np.nan are treated as the same; use the string value "NaN" for them.
- strategy : string, optional (default='most_frequent') – The imputation strategy. If "most_frequent", replace missing values using the most frequent value along each column; can be used with strings or numeric data. If "constant", replace missing values with fill_value; can be used with strings or numeric data.
- fill_value : string, optional (default='?') – The value that all instances of missing_values are replaced with if strategy is set to "constant". This is useful if you don't want to impute with the mode, or if there are multiple modes in your data and you want to choose a particular one. If strategy is not set to "constant", this parameter is ignored.
- copy : boolean, optional (default=True) – If True, a copy of X will be created.

Attributes:
- fill_ : str – The imputation fill value.
static _get_null_mask(X, value)¶
Compute the boolean mask X == missing_values.

_get_unknown_label_mask(X)¶
Compute the boolean mask of unknown labels (values not seen during training).
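The null-mask computation can be sketched as follows (a hypothetical standalone helper, not the library's internal method). The subtlety is that np.nan compares unequal to everything, including itself, so the "NaN" placeholder has to be matched explicitly:

```python
import numpy as np

def get_null_mask(X, value):
    """Boolean mask marking entries of X equal to the missing-value
    placeholder (sketch; treats None, np.nan, and the string "NaN" alike)."""
    if value == "NaN":
        # v != v is True only for NaN values
        return np.array([v is None or v != v or v == "NaN" for v in X], dtype=bool)
    return np.asarray(X) == value
```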
fit(X, y=None)¶
Get the most frequent value.

Parameters:
- X : np.ndarray or pd.Series – Training data.
- y – Passthrough for Pipeline compatibility.

Returns:
- self : CategoricalImputer
transform(X)¶
Replaces missing values in the input data with the most frequent value of the training data.

Parameters:
- X : np.ndarray or pd.Series – Data with values to be imputed.

Returns:
- np.ndarray – Data with imputed values.
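The fit/transform contract documented above can be sketched with a minimal pure-Python stand-in (not the library implementation; the real class also handles the fill_unknown_labels and copy options):

```python
from collections import Counter

class MostFrequentImputer:
    """Minimal sketch of the documented behavior: learn the most frequent
    value during fit, substitute it for missing values during transform."""

    def __init__(self, missing_values="NaN", strategy="most_frequent", fill_value="?"):
        self.missing_values = missing_values
        self.strategy = strategy
        self.fill_value = fill_value

    def _is_missing(self, v):
        # None and np.nan are treated the same as the string "NaN"
        if self.missing_values == "NaN":
            return v is None or v != v or v == "NaN"
        return v == self.missing_values

    def fit(self, X, y=None):
        if self.strategy == "constant":
            self.fill_ = self.fill_value
        else:  # "most_frequent"
            observed = [v for v in X if not self._is_missing(v)]
            self.fill_ = Counter(observed).most_common(1)[0][0]
        return self

    def transform(self, X):
        return [self.fill_ if self._is_missing(v) else v for v in X]
```

For example, `MostFrequentImputer().fit(['a', 'b', 'b', None]).transform(['a', None])` yields `['a', 'b']`: the mode 'b' is learned on fit and substituted for the missing entry.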
class transformations.EmbeddingVectorizer(max_sequence_length=None)¶
Converts text into padded sequences. The output of this transformation is consistent with the required format for Keras embedding layers.

For example, 'the fat man' might be transformed into [2, 0, 27, 1, 1, 1], if the embedding_sequence_length is 6.

There are a few sentinel values used by this layer:
- 0 is used for the UNK token (tokens which were not seen during training)
- 1 is used for the padding token (to fill out sequences that are shorter than embedding_sequence_length)
fit(X, y=None)¶
static generate_embedding_sequence_length(observation_series)¶
static pad(input_sequence, length, pad_char)¶
Pad the given iterable so that it is the correct length.

Parameters:
- input_sequence – Any iterable object
- length (int) – The desired length of the output
- pad_char (str or int) – The character or int to be added to short sequences

Returns: A sequence of len length
Return type: []
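A minimal sketch of this padding behavior (the docstring does not say how over-long inputs are handled; this sketch assumes they are truncated so the result is always exactly `length` items):

```python
def pad(input_sequence, length, pad_char):
    # Truncate sequences longer than `length`, then extend short ones
    # with pad_char until the result has exactly `length` items
    seq = list(input_sequence)[:length]
    return seq + [pad_char] * (length - len(seq))
```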
static prepare_input(X)¶
process_string(input_string)¶
Turn a string into padded sequences, consistent with Keras's Embedding layer:
- Simple preprocess & tokenize
- Convert tokens to indices
- Pad sequence to be the correct length

Parameters: input_string (str) – A string, to be converted into a padded sequence of token indices
Returns: A padded, fixed-length array of token indices
Return type: [int]
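The three steps above can be sketched as a standalone function. The explicit token_index and sequence_length arguments are a simplification for illustration; the real method presumably reads this state from the fitted instance:

```python
def process_string(input_string, token_index, sequence_length,
                   unk_index=0, pad_index=1):
    # 1. Simple preprocess & tokenize
    tokens = input_string.lower().split()
    # 2. Convert tokens to indices; tokens unseen in training map to UNK (0)
    indices = [token_index.get(tok, unk_index) for tok in tokens]
    # 3. Pad (or truncate) the sequence to the fixed length; padding token is 1
    indices = indices[:sequence_length]
    return indices + [pad_index] * (sequence_length - len(indices))
```

With token_index = {'the': 2, 'man': 27} and a sequence length of 6, 'the fat man' maps to [2, 0, 27, 1, 1, 1], matching the class docstring's example: 'fat' is unseen, so it becomes the UNK index 0, and the tail is filled with the padding index 1.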
transform(X)¶
class transformations.LabelEncoder¶
Encode labels with value between 0 and n_classes-1.
Read more in the User Guide.
Attributes:
- classes_ : array of shape (n_class,) – Holds the label for each class.
LabelEncoder can be used to normalize labels.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) #doctest: +ELLIPSIS
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) #doctest: +ELLIPSIS
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
See also:
- sklearn.preprocessing.OneHotEncoder – encode categorical integer features using a one-hot aka one-of-K scheme.
fit(y)¶
Fit label encoder.

Parameters:
- y : array-like of shape (n_samples,) – Target values.

Returns:
- self : returns an instance of self.
fit_transform(y, **kwargs)¶
Fit label encoder and return encoded labels.

Parameters:
- y : array-like of shape [n_samples] – Target values.

Returns:
- y : array-like of shape [n_samples]
inverse_transform(y)¶
Transform labels back to original encoding.

Parameters:
- y : numpy array of shape [n_samples] – Target values.

Returns:
- y : numpy array of shape [n_samples]
transform(y)¶
Transform labels to normalized encoding.

Parameters:
- y : array-like of shape [n_samples] – Target values.

Returns:
- y : array-like of shape [n_samples]