Preprocessing
Functions for preprocessing the data.
- src.data.preprocessing.get_synonyms(phrase, sample_size)[source]
Create synonyms using WordNet's synsets method
- Parameters
phrase (str) –
sample_size (int) –
- Returns
List containing synonyms for the given phrase. Only returned if sample_size is larger than the number of synonyms for the phrase.
synonym_set_sampled (list): List containing sampled synonyms for the given phrase.
- Return type
synonym_set (list)
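A minimal sketch of the behaviour described above, not the module's actual code: collect lemma names from WordNet synsets, then return either the full set or a random sample. The helper names `wordnet_synonyms` and `sample_synonyms` and the `seed` parameter are hypothetical. The lookup step assumes NLTK and its `wordnet` corpus are installed.

```python
import random


def wordnet_synonyms(phrase):
    """Collect lemma names from all WordNet synsets of `phrase`.

    Assumes nltk is installed and the 'wordnet' corpus has been downloaded.
    """
    from nltk.corpus import wordnet  # imported lazily; needs the corpus data

    names = set()
    for synset in wordnet.synsets(phrase):
        for lemma in synset.lemmas():
            names.add(lemma.name().replace("_", " "))
    names.discard(phrase)  # the phrase itself is not a useful synonym
    return sorted(names)


def sample_synonyms(synonym_set, sample_size, seed=None):
    """Return every synonym when sample_size exceeds the set size, else a random sample."""
    synonyms = sorted(set(synonym_set))
    if sample_size > len(synonyms):
        return synonyms
    return random.Random(seed).sample(synonyms, sample_size)
```

Under these assumptions, `sample_synonyms(wordnet_synonyms("car"), 2)` would yield two synonyms chosen at random.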
- src.data.preprocessing.lemmatization(tokens: pandas.core.series.Series)[source]
Lemmatize tokens using NLTK's WordNetLemmatizer
- Parameters
tokens (pd.Series) – Series of tokens
- Returns
Series containing lemmatized tokens
- Return type
tokens (pd.Series)
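One way this could look, assuming each element of the Series is a list of tokens (the reference does not state the element type). The name `lemmatize_series` and the injectable `lemmatize` argument are hypothetical. The default path assumes NLTK and its `wordnet` corpus are available.

```python
import pandas as pd


def lemmatize_series(tokens, lemmatize=None):
    """Lemmatize every token in a Series whose elements are lists of tokens."""
    if lemmatize is None:
        # Assumes nltk and its 'wordnet' corpus are available.
        from nltk.stem import WordNetLemmatizer

        lemmatize = WordNetLemmatizer().lemmatize
    return tokens.apply(lambda row: [lemmatize(tok) for tok in row])
```

Passing a custom `lemmatize` callable makes the sketch usable without the corpus download, for example in unit tests.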
- src.data.preprocessing.pca(features: pandas.core.frame.DataFrame, components: int = 5)[source]
- Parameters
features (pd.DataFrame) –
components (int) –
- Return type
(pd.DataFrame)
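The reference gives no description for this function, so the following is only a guess at its shape, based on the signature: reduce a numeric feature frame to `components` principal components. It assumes scikit-learn's `PCA`. The name `pca_sketch` and the `pc_N` column labels are hypothetical.

```python
import pandas as pd
from sklearn.decomposition import PCA


def pca_sketch(features, components=5):
    """Project a numeric feature frame onto its first `components` principal components."""
    model = PCA(n_components=components)
    reduced = model.fit_transform(features)
    columns = [f"pc_{i + 1}" for i in range(components)]
    # Keep the original index so rows can still be matched to their source data.
    return pd.DataFrame(reduced, columns=columns, index=features.index)
```

`components` must not exceed the smaller of the row count and the column count of `features`.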
- src.data.preprocessing.preprocess(data: pandas.core.series.Series, expansion: bool = False)[source]
Preprocess text by tokenizing, removing punctuation and stopwords, optionally expanding the text, and stemming
- Parameters
data (pd.Series) – Series of strings
expansion (bool) – Decide whether to use word expansion on data or not
- Returns
Series of np.arrays containing preprocessed text
- Return type
data (pd.Series)
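The pipeline described above can be sketched roughly as follows. This is an illustration under assumptions, not the module's implementation. The regex tokenizer, the tiny `STOPWORDS` set, the name `preprocess_sketch` and the injectable `stem` argument are all hypothetical. Text expansion is left out, and rows come back as lists rather than np.arrays.

```python
import re
import string

import pandas as pd

# Tiny illustrative stopword list; a real pipeline would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}
PUNCTUATION = set(string.punctuation)


def preprocess_sketch(data, stem=None):
    """Tokenize, drop punctuation and stopwords, then stem each token."""
    if stem is None:
        # Assumes nltk is installed; PorterStemmer needs no corpus download.
        from nltk.stem import PorterStemmer

        stem = PorterStemmer().stem

    def clean(text):
        # Words become one token each; every punctuation mark is its own token.
        tokens = re.findall(r"\w+|[^\w\s]", text.lower())
        kept = [t for t in tokens if t not in PUNCTUATION and t not in STOPWORDS]
        return [stem(t) for t in kept]

    return data.apply(clean)
```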
- src.data.preprocessing.query_expansion(tokens: pandas.core.series.Series, sample_size=2)[source]
Expand series of tokens with synonyms
- Parameters
tokens (pd.Series) – Series of tokens
sample_size (int) –
- Return type
new_tokenlist (pd.Series)
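A hedged sketch of the expansion step, assuming each element of the Series is a list of tokens. The `lookup` callable stands in for a WordNet-backed function such as `get_synonyms`. The name `expand_tokens` is hypothetical, and the sketch takes the first `sample_size` synonyms instead of sampling at random, to keep the output deterministic.

```python
import pandas as pd


def expand_tokens(tokens, lookup, sample_size=2):
    """Append up to `sample_size` synonyms after each token.

    `lookup` maps a token to a list of synonyms; it stands in for a
    WordNet-backed function such as get_synonyms.
    """
    def expand(row):
        expanded = []
        for tok in row:
            expanded.append(tok)
            expanded.extend(lookup(tok)[:sample_size])
        return expanded

    return tokens.apply(expand)
```

With a lookup that maps `"car"` to `["auto", "automobile", "motorcar"]`, the row `["fast", "car"]` becomes `["fast", "car", "auto", "automobile"]`.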
- src.data.preprocessing.removal(tokens: pandas.core.series.Series)[source]
Remove punctuation, stopwords and NA values
- Parameters
tokens (pd.Series) – Series of tokens
- Returns
Series containing tokens with punctuation, stopwords and NA values removed
- Return type
tokens (pd.Series)
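As an illustration of the removal step, the sketch below drops missing rows and filters punctuation and stopword tokens. It assumes that "NA values" refers to missing rows and that each remaining row is a list of tokens. The small `STOPWORDS` set and the name `remove_noise` are hypothetical; the real function may rely on a fuller stopword list.

```python
import string

import pandas as pd

# Tiny illustrative stopword list, not the one used by the module.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}
PUNCTUATION = set(string.punctuation)


def remove_noise(tokens):
    """Drop missing rows, then strip punctuation and stopword tokens from each row."""
    def keep(tok):
        return tok not in PUNCTUATION and tok.lower() not in STOPWORDS

    cleaned = tokens.dropna()
    return cleaned.apply(lambda row: [tok for tok in row if keep(tok)])
```

The surviving rows keep their original index labels, so they can still be aligned with the source data.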
- src.data.preprocessing.split_and_scale(X_y_train, X_test, X_val=None, components_pca: int = 0)[source]
- Parameters
X_y_train –
X_test –
X_val –
components_pca (int) –
- Returns
X, y, X_test, test_pair