transformations

SKLearn-compliant transformers, for use as part of pipelines

Classes

CategoricalImputer([missing_values, …]) – Impute missing values from a categorical/string np.ndarray or pd.Series with the most frequent value on the training data.
EmbeddingVectorizer([max_sequence_length]) – Converts text into padded sequences.
LabelEncoder – Encode labels with value between 0 and n_classes-1.
TimeSeriesVectorizer([max_sequence_length])
TypeConversionEncoder(conversion_type)
class transformations.CategoricalImputer(missing_values='NaN', strategy='most_frequent', fill_value='?', fill_unknown_labels=False, copy=True)

Impute missing values from a categorical/string np.ndarray or pd.Series with the most frequent value on the training data.

Parameters:

missing_values : string or "NaN", optional (default="NaN")
The placeholder for the missing values. All occurrences of missing_values will be imputed. None and np.nan are treated as being the same; use the string value "NaN" for them.

strategy : string, optional (default='most_frequent')
The imputation strategy.
  • If "most_frequent", then replace missing values using the most frequent value along each column. Can be used with strings or numeric data.
  • If "constant", then replace missing values with fill_value. Can be used with strings or numeric data.

fill_value : string, optional (default='?')
The value that all instances of missing_values are replaced with if strategy is set to "constant". This is useful if you don't want to impute with the mode, or if there are multiple modes in your data and you want to choose a particular one. If strategy is not set to "constant", this parameter is ignored.

copy : boolean, optional (default=True)
If True, a copy of X will be created.

Attributes:

fill_ : str
The imputation fill value.
static _get_null_mask(X, value)

Compute the boolean mask X == missing_values.
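For illustration, a plausible sketch of this masking logic under the documented treatment of None and np.nan (hypothetical code, not the module's actual implementation):

import numpy as np
import pandas as pd

def get_null_mask(X, value):
    # Hypothetical re-implementation for illustration only.
    if value == 'NaN':
        # None and np.nan are treated as the same placeholder, per the docs above
        return pd.isnull(np.asarray(X, dtype=object))
    return np.asarray(X) == value

get_null_mask(np.array(['a', None, np.nan], dtype=object), 'NaN')
# -> array([False,  True,  True])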

_get_unknown_label_mask(X)

Compute the boolean mask of entries in X whose labels were not seen during training.

fit(X, y=None)

Get the most frequent value.

Parameters:

X : np.ndarray or pd.Series
Training data.

y : Passthrough for Pipeline compatibility.

Returns:

self : CategoricalImputer
transform(X)

Replaces missing values in the input data with the most frequent value of the training data.

Parameters:

X : np.ndarray or pd.Series
Data with values to be imputed.

Returns:

np.ndarray
Data with imputed values.
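A minimal usage sketch (it assumes this module is importable as transformations; the expected outputs follow from the parameter descriptions above):

import numpy as np
import pandas as pd
from transformations import CategoricalImputer

s = pd.Series(['a', 'b', 'b', np.nan])

# Default strategy: fill with the most frequent training value ('b' here)
imputer = CategoricalImputer()
imputer.fit(s).transform(s)    # expected: array(['a', 'b', 'b', 'b'], dtype=object)

# "constant" strategy: fill with fill_value instead of the mode
constant = CategoricalImputer(strategy='constant', fill_value='?')
constant.fit(s).transform(s)   # expected: array(['a', 'b', 'b', '?'], dtype=object)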
class transformations.EmbeddingVectorizer(max_sequence_length=None)

Converts text into padded sequences. The output of this transformation is consistent with the required format for Keras embedding layers.

For example ‘the fat man’ might be transformed into [2, 0, 27, 1, 1, 1], if the embedding_sequence_length is 6.

There are a few sentinel values used by this layer:

  • 0 is used for the UNK token (tokens which were not seen during training)
  • 1 is used for the padding token (to fill out sequences that are shorter than embedding_sequence_length)
fit(X, y=None)
static generate_embedding_sequence_length(observation_series)
static pad(input_sequence, length, pad_char)

Pad the given iterable so that it is the correct length.

Parameters:
  • input_sequence – Any iterable object
  • length (int) – The desired length of the output.
  • pad_char (str or int) – The character or int to be added to short sequences
Returns: A sequence of the requested length.
Return type: list
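The exact implementation isn't shown here, but a plausible sketch of the documented padding behavior (truncation of over-length inputs is an assumption):

def pad(input_sequence, length, pad_char):
    # Truncate anything past `length`, then right-pad short sequences
    padded = list(input_sequence)[:length]
    padded.extend([pad_char] * (length - len(padded)))
    return padded

pad([2, 0, 27], 6, 1)  # -> [2, 0, 27, 1, 1, 1], matching the example above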

static prepare_input(X)
process_string(input_string)

Turn a string into a padded sequence, consistent with Keras's Embedding layer:

  • Simple preprocess & tokenize
  • Convert tokens to indices
  • Pad sequence to be the correct length
Parameters: input_string (str) – A string, to be converted into a padded sequence of token indices
Returns: A padded, fixed-length array of token indices
Return type: [int]
transform(X)
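A minimal usage sketch (assuming the module is importable as transformations; the exact indices depend on the fitted vocabulary):

import pandas as pd
from transformations import EmbeddingVectorizer

docs = pd.Series(['the fat man', 'the thin man walks'])

vectorizer = EmbeddingVectorizer(max_sequence_length=6)
vectorizer.fit(docs)

# Each document becomes a fixed-length row of token indices (0 = unseen
# token, 1 = padding), ready to feed into a Keras Embedding layer.
sequences = vectorizer.transform(docs)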
class transformations.LabelEncoder

Encode labels with value between 0 and n_classes-1.

Read more in the scikit-learn User Guide.

Attributes:

classes_ : array of shape (n_class,)
Holds the label for each class.

LabelEncoder can be used to normalize labels.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) #doctest: +ELLIPSIS
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) #doctest: +ELLIPSIS
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
See also:

sklearn.preprocessing.OneHotEncoder : encode categorical integer features using a one-hot aka one-of-K scheme.
fit(y)

Fit label encoder.

Parameters:

y : array-like of shape (n_samples,)
Target values.

Returns:

self : returns an instance of self.

fit_transform(y, **kwargs)

Fit label encoder and return encoded labels.

Parameters:

y : array-like of shape [n_samples]
Target values.

Returns:

y : array-like of shape [n_samples]

inverse_transform(y)

Transform labels back to original encoding.

Parameters:

y : numpy array of shape [n_samples]
Target values.

Returns:

y : numpy array of shape [n_samples]

transform(y)

Transform labels to normalized encoding.

Parameters:

y : array-like of shape [n_samples]
Target values.

Returns:

y : array-like of shape [n_samples]

class transformations.TimeSeriesVectorizer(max_sequence_length=None)
fit(X, y=None)
transform(X)
class transformations.TypeConversionEncoder(conversion_type)
fit(X, y=None)
transform(X)
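Since all of these transformers are SKLearn-compliant, they compose in a standard sklearn Pipeline. A hypothetical sketch (str as the conversion_type argument is assumed for illustration; its exact semantics are not documented above):

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from transformations import CategoricalImputer, TypeConversionEncoder

# Impute missing categories first, then coerce every value to str.
pipeline = Pipeline([
    ('impute', CategoricalImputer()),
    ('convert', TypeConversionEncoder(str)),  # str is an assumed example argument
])

cleaned = pipeline.fit_transform(pd.Series(['a', 'b', 'b', np.nan]))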