Preprocessing

Functions used to preprocess the data.

src.data.preprocessing.get_synonyms(phrase, sample_size)[source]

Create synonyms using WordNet's synsets method

Parameters
  • phrase (str) –

  • sample_size (int) –

Returns

synonym_set (list): List containing synonyms for the given phrase. Only returned if sample_size is greater than the number of synonyms in the set for the phrase.

synonym_set_sampled (list): List containing sampled synonyms for the given phrase.

Return type

list
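The return rule described above can be sketched as follows. This is a minimal illustration, not the module's code: `sample_synonyms` is a hypothetical helper, the hard-coded candidate list stands in for the WordNet lookup, and the use of `random.sample` for the sampling step is an assumption.

```python
import random


def sample_synonyms(synonyms, sample_size, seed=None):
    """Return all synonyms, or a random sample of `sample_size` of them.

    Hypothetical helper: `synonyms` stands in for the WordNet lookup result.
    """
    unique = sorted(set(synonyms))  # deduplicate; sort for a stable order
    if sample_size > len(unique):
        # Fewer synonyms than requested: return the full list.
        return unique
    # Otherwise draw `sample_size` distinct synonyms (assumed sampling method).
    return random.Random(seed).sample(unique, sample_size)


candidates = ["car", "auto", "automobile", "machine", "car"]
print(sample_synonyms(candidates, 10))              # all 4 unique synonyms
print(len(sample_synonyms(candidates, 2, seed=0)))  # 2
```

The optional `seed` argument is only there to make the sketch reproducible; the documented signature has no such parameter.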

src.data.preprocessing.lemmatization(tokens: pandas.core.series.Series)[source]

Lemmatize tokens using NLTK's WordNetLemmatizer

Parameters

tokens (pd.Series) – Series of tokens

Returns

Series containing lemmatized tokens

Return type

tokens (pd.Series)

src.data.preprocessing.pca(features: pandas.core.frame.DataFrame, components: int = 5)[source]
Parameters
  • features (pd.DataFrame) –

  • components (int) –

Return type

(pd.DataFrame)

src.data.preprocessing.preprocess(data: pandas.core.series.Series, expansion: bool = False)[source]

Preprocess text using tokenization, removal of punctuation and stopwords, optional text expansion, and stemming

Parameters
  • data (pd.Series) – Series of strings

  • expansion (bool) – Whether to apply word expansion to the data

Returns

Series of np.arrays containing preprocessed text

Return type

data (pd.Series)

src.data.preprocessing.query_expansion(tokens: pandas.core.series.Series, sample_size=2)[source]

Expand series of tokens with synonyms

Parameters
  • tokens (pd.Series) – Series of tokens

  • sample_size (int) –

Return type

new_tokenlist (pd.Series)
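As a rough illustration of synonym-based expansion, the sketch below appends up to `sample_size` synonyms after each token. It rests on several assumptions: `expand_tokens` and the `SYNONYMS` table are hypothetical stand-ins for the module's WordNet lookup, a plain list replaces the `pd.Series` input, and placing synonyms directly after their source token is a guess rather than documented behavior.

```python
import random

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "car": ["auto", "automobile", "machine"],
    "fast": ["quick", "rapid"],
}


def expand_tokens(tokens, sample_size=2, seed=None):
    """Append up to `sample_size` synonyms after each token that has any."""
    rng = random.Random(seed)
    expanded = []
    for token in tokens:
        expanded.append(token)
        synonyms = SYNONYMS.get(token, [])
        # Sample without replacement, capped at the number available.
        expanded.extend(rng.sample(synonyms, min(sample_size, len(synonyms))))
    return expanded


print(expand_tokens(["fast", "red", "car"], sample_size=2, seed=0))
```

Tokens without an entry in the table, such as "red" here, pass through unchanged.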

src.data.preprocessing.removal(tokens: pandas.core.series.Series)[source]

Remove punctuation, stopwords and NA values

Parameters

tokens (pd.Series) – Series of tokens

Returns

Series containing tokens with punctuation, stopwords and NA values removed

Return type

tokens (pd.Series)
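The filtering described above might look like the following sketch. It is not the module's implementation: `remove_noise` is a hypothetical name, the stopword set is a deliberately tiny placeholder, and a plain list is used in place of the `pd.Series` input.

```python
import string

# Hypothetical, deliberately tiny stopword list; the module's list is not shown.
STOPWORDS = {"the", "a", "an", "is", "of", "and"}


def remove_noise(tokens):
    """Drop missing values, punctuation-only tokens, and stopwords."""
    cleaned = []
    for token in tokens:
        if not isinstance(token, str):
            continue  # missing value such as None or NaN
        if all(ch in string.punctuation for ch in token):
            continue  # punctuation-only (or empty) token
        if token in STOPWORDS:
            continue  # stopword
        cleaned.append(token)
    return cleaned


print(remove_noise(["the", "cat", ",", None, "is", "asleep", "!"]))
# ['cat', 'asleep']
```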

src.data.preprocessing.split_and_scale(X_y_train, X_test, X_val=None, components_pca: int = 0)[source]
Parameters
  • X_y_train –

  • X_test –

  • X_val –

  • components_pca (int) –

Returns
  • X –

  • y –

  • X_test –

  • test_pair –

src.data.preprocessing.stemming(tokens: pandas.core.series.Series)[source]

Stem tokens using NLTK's PorterStemmer

Parameters

tokens (pd.Series) – Series of tokens

Returns

Series containing stemmed tokens

Return type

tokens (pd.Series)

src.data.preprocessing.tokenization(text: str)[source]

Tokenize text using nltk.word_tokenize and convert it to lowercase

Parameters

text (str) – String of text

Returns

Series containing lowered tokens

Return type

(pd.Series)
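To show the shape of the output without requiring NLTK's tokenizer data, the sketch below uses a regular expression as a simplified stand-in for `nltk.word_tokenize`. `simple_tokenize` is a hypothetical name, it returns a list rather than a `pd.Series`, and its handling of contractions and special tokens will differ from NLTK's.

```python
import re


def simple_tokenize(text):
    """Lowercase `text` and split it into word and punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


print(simple_tokenize("Hello, World! It works."))
# ['hello', ',', 'world', '!', 'it', 'works', '.']
```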