Pipeline

Controls all preprocessing, feature creation, feature selection, and modelling steps.

class src.pipeline.Pipeline(collection: Optional[str] = None, queries: Optional[str] = None, queries_val: Optional[str] = None, queries_test: Optional[str] = None, features: Optional[str] = None, qrels_val: Optional[str] = None, qrels_test: Optional[str] = None, features_test: Optional[str] = None, features_val: Optional[str] = None)[source]

Bases: object

Combines the download, preprocessing, modelling, and evaluation steps.

collection

Imports collection data from .pkl file if not None

Type

str

queries

Imports queries data from .pkl file if not None

Type

str

queries_val

Imports queries_val data from .pkl file if not None

Type

str

queries_test

Imports queries_test data from .pkl file if not None

Type

str

features

Imports features data from .pkl file if not None

Type

pd.DataFrame

qrels_val

Imports qrels_val data from .pkl file if not None

Type

str

qrels_test

Imports qrels_test data from .pkl file if not None

Type

str

features_test

Imports features_test data from .pkl file if not None

Type

pd.DataFrame

features_val

Imports features_val data from .pkl file if not None

Type

pd.DataFrame

collection = None

create_BM25_features()[source]

Creates BM25 features.
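
The entry above does not say how the BM25 score is computed. As a rough illustration only, the following sketch implements the standard Okapi BM25 formula on whitespace-tokenized text. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the tokenization are assumptions and are not taken from `src.pipeline`.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with Okapi BM25.

    Hypothetical helper for illustration; assumes a non-empty passage list.
    """
    tokenized = [p.lower().split() for p in passages]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of passages that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the "1 +" keeps the value non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer passages are penalized via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A passage that shares no term with the query receives a score of 0.0, and passages with more (and rarer) matching terms score higher.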

create_POS_features()[source]

Creates POS features.

create_bert_embeddings()[source]

Creates BERT embeddings.

create_bert_feature(path_collection: str = 'data/embeddings/bert_collection_embeddings.pkl', path_query: str = 'data/embeddings/bert_query_embeddings.pkl')[source]

Creates BERT features.

Parameters
  • path_collection (str) – Path to “bert_collection_embeddings.pkl”

  • path_query (str) – Path to “bert_query_embeddings.pkl”

Returns

None
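
The `create_*_feature` methods take one file with collection embeddings and one with query embeddings, but the documentation does not state how the two are combined into a feature. One common choice is the cosine similarity between the query vector and the passage vector. The sketch below assumes that choice; `cosine_similarity`, `similarity_feature`, and the dictionary layout of the embeddings are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_feature(pairs, query_embeddings, passage_embeddings):
    """Return one similarity value per (query_id, passage_id) pair.

    Assumed layout: both embedding arguments map an id to a list of floats.
    """
    return [
        cosine_similarity(query_embeddings[query_id], passage_embeddings[passage_id])
        for query_id, passage_id in pairs
    ]
```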

create_glove_embeddings()[source]

Creates GloVe embeddings.

create_glove_embeddings_tfidf_weighted()[source]

Creates TF-IDF-weighted GloVe embeddings.
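
A TF-IDF-weighted embedding is usually a weighted mean of word vectors in which rare terms count more than frequent ones. The sketch below shows that idea under stated assumptions: the helper names, the `log(N / df) + 1` IDF variant, and the plain-dictionary word vectors are illustrative and may differ from what the pipeline does.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Map each term to log(N / document frequency) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def weighted_embedding(tokens, vectors, idf):
    """TF-IDF-weighted mean of the word vectors of `tokens`.

    `vectors` maps a word to a list of floats (hypothetical layout).
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token, tf in Counter(tokens).items():
        if token not in vectors:
            continue  # skip out-of-vocabulary tokens
        weight = tf * idf.get(token, 1.0)
        total = [acc + weight * x for acc, x in zip(total, vectors[token])]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known token: return the zero vector
    return [x / weight_sum for x in total]
```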

create_glove_feature(path_collection: str = 'data/embeddings/glove_collection_embeddings.pkl', path_query: str = 'data/embeddings/glove_query_embeddings.pkl')[source]

Creates GloVe features.

Parameters
  • path_collection (str) – Path to “glove_collection_embeddings.pkl”

  • path_query (str) – Path to “glove_query_embeddings.pkl”

Returns

None

create_interpretation_features()[source]

Creates interpretation features.

create_jaccard_feature()[source]

Creates Jaccard feature.
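
For reference, the Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The following minimal sketch computes it for a query and a passage; the function name and the lowercase whitespace tokenization are assumptions, since the documentation does not describe the tokenization used.

```python
def jaccard_similarity(query, passage):
    """Jaccard similarity of the token sets of two strings (0.0 if both are empty)."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    union = query_terms | passage_terms
    if not union:
        return 0.0
    return len(query_terms & passage_terms) / len(union)
```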

create_sentence_features()[source]

Creates sentence features.

create_test_features()[source]

Creates features for test data.

create_tfidf_embeddings()[source]

Creates TF-IDF embeddings.
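
To make the term concrete, a TF-IDF embedding represents each text as a vector with one entry per vocabulary term, where the entry is the term frequency multiplied by the inverse document frequency. The sketch below uses the unsmoothed textbook definition; `tfidf_vectors` is a hypothetical helper, and a library vectorizer with different smoothing or normalization may be what the pipeline uses.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, vectors) with one dense TF-IDF vector per text."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            # Relative term frequency times log(N / document frequency).
            counts[term] / len(doc) * math.log(n_docs / doc_freq[term]) if doc else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors
```

A term that occurs in every text gets weight 0 under this definition, which is why smoothed variants are common in practice.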

create_tfidf_feature(path_collection: str = 'data/embeddings/tfidf_collection_embeddings.pkl', path_query: str = 'data/embeddings/tfidf_query_embeddings.pkl')[source]

Creates TF-IDF features.

Parameters
  • path_collection (str) – Path to “tfidf_collection_embeddings.pkl”

  • path_query (str) – Path to “tfidf_query_embeddings.pkl”

Returns

None

create_train_features()[source]

Creates features for the training data.

create_val_features()[source]

Creates features for validation data.

create_w2v_embeddings()[source]

Creates word2vec embeddings.

create_w2v_embeddings_tfidf_weighted()[source]

Creates TF-IDF-weighted word2vec embeddings.

create_w2v_feature(path_collection: str = 'data/embeddings/w2v_collection_embeddings.pkl', path_query: str = 'data/embeddings/w2v_query_embeddings.pkl')[source]

Creates word2vec features.

Parameters
  • path_collection (str) – Path to “w2v_collection_embeddings.pkl”

  • path_query (str) – Path to “w2v_query_embeddings.pkl”

Returns

None

create_w2v_tfidf_feature(path_collection: str = 'data/embeddings/w2v_tfidf_collection_embeddings.pkl', path_query: str = 'data/embeddings/w2v_tfidf_query_embeddings.pkl')[source]

Creates TF-IDF-weighted word2vec features.

Parameters
  • path_collection (str) – Path to “w2v_tfidf_collection_embeddings.pkl”

  • path_query (str) – Path to “w2v_tfidf_query_embeddings.pkl”

Returns

None

evaluate(name: Optional[str] = None, model: str = 'nb', pca: int = 0, pairwise_model: Optional[str] = None, pairwise_top_k: int = 50, search_space: Optional[list] = None, trials: int = 20, models_path: Optional[str] = None, store_model_path: Optional[str] = None)[source]

Evaluates the performance of the model.

Parameters
  • name (str) – Name of the experiment

  • model (str) – Specify model to test performance on

  • pca (int) – Number of PCA components to use (no PCA if 0)

  • pairwise_model (str) –

  • pairwise_top_k (int) –

  • search_space (list) –

  • models_path (str) –

  • store_model_path (str) – Path to store model to

Returns

None
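
The documentation does not name the metric that `evaluate` reports. For passage ranking tasks, mean reciprocal rank (MRR) at a cutoff is a frequently used measure, so the sketch below shows how it could be computed. It is an assumption that this metric is relevant here; the function name and the dictionary inputs are hypothetical.

```python
def mean_reciprocal_rank(rankings, relevant, k=10):
    """Mean reciprocal rank at cutoff k.

    rankings: {query_id: [passage_id, ...]} ordered best first.
    relevant: {query_id: set of relevant passage_ids}.
    """
    total = 0.0
    for query_id, ranked in rankings.items():
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in relevant.get(query_id, set()):
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings) if rankings else 0.0
```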

features = Empty DataFrame Columns: [] Index: []

features_test = Empty DataFrame Columns: [] Index: []

features_val = Empty DataFrame Columns: [] Index: []

forward_selection(model: str = 'nb', pca: int = 0, name=None)[source]

Performs forward feature selection to determine best features.

Parameters
  • model (str) – Specify model to test performance on

  • pca (int) – Number of PCA components to use (no PCA if 0)

  • name (str) – Name of the experiment

Returns

None
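
Forward feature selection is typically a greedy loop: start with no features, repeatedly add the single feature that improves a score the most, and stop when no addition helps. The sketch below shows that loop in isolation. `greedy_forward_selection` and the `score_fn` callable are hypothetical; in the pipeline the score would presumably come from training and validating the chosen model, which is not shown here.

```python
def greedy_forward_selection(features, score_fn):
    """Greedily add the feature that most improves score_fn until none helps.

    score_fn takes a list of feature names and returns a number (higher is better).
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every candidate extension of the current feature set.
        scored = [(score_fn(selected + [feature]), feature) for feature in remaining]
        candidate_score, candidate = max(scored, key=lambda item: item[0])
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score
```

Each round costs one call to `score_fn` per remaining feature, so the total number of evaluations grows quadratically with the number of features.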

preprocess(expansion=False)[source]

Preprocesses the data.

Parameters

expansion (bool) – Whether query expansion should be used

Returns

None

qrels_test = None

qrels_val = None

queries = None

queries_test = None

queries_val = None

save(name: str, path: str = 'data/processed')[source]

Saves created DataFrames as .pkl files.

Parameters
  • name (str) – Specify name of dataset

  • path (str) – Path to store dataset to

Returns

None
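
As an illustration of the save and reload cycle, the sketch below pickles a set of objects to one file each and reads a file back. It uses the standard-library `pickle` module so that it runs without extra dependencies; the real method stores DataFrames, and the file naming scheme `<name>_<label>.pkl` as well as both helper names are assumptions.

```python
import pickle
from pathlib import Path


def save_objects(objects, name, path="data/processed"):
    """Pickle each object to <path>/<name>_<label>.pkl and return the file paths.

    The naming scheme is hypothetical, not taken from the pipeline.
    """
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for label, obj in objects.items():
        target = directory / f"{name}_{label}.pkl"
        with target.open("wb") as handle:
            pickle.dump(obj, handle)
        written.append(target)
    return written


def load_object(file_path):
    """Load a single pickled object from disk."""
    with Path(file_path).open("rb") as handle:
        return pickle.load(handle)
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.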

setup(qrel_sampling: int = 20, training_sampling: int = 200, irrelevant_sampling: int = 0, datasets: Optional[list] = None, path: str = 'data/TREC_Passage')[source]

Samples from the different datasets and initializes pipeline.

Parameters
  • qrel_sampling (int) – Number of samples to draw from “2019qrels-pass.txt”

  • training_sampling (int) – Number of samples to draw from “qidpidtriples.train.full.2.tsv”

  • irrelevant_sampling (int) –

  • datasets (list) – List of datasets to consider

  • path (str) – Path to datasets

Returns

None
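
The methods documented above could be chained roughly as follows. This is a sketch only: the call order is an assumption, the embedding steps (for example `create_bert_embeddings`) may need to run before the feature steps, and `run_experiment` is a hypothetical helper. It accepts any object that exposes the documented methods, so it can be tried with a stand-in before running it on a real `Pipeline` instance.

```python
def run_experiment(pipeline, name="baseline"):
    """Call the documented steps in one plausible order (order is assumed)."""
    pipeline.setup(qrel_sampling=20, training_sampling=200)
    pipeline.preprocess(expansion=False)
    pipeline.create_train_features()
    pipeline.create_val_features()
    pipeline.create_test_features()
    pipeline.save(name=name)
    pipeline.evaluate(name=name, model="nb")


# Intended use, assuming the package is importable in your environment:
#   from src.pipeline import Pipeline
#   run_experiment(Pipeline(), name="baseline")
```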