Pipeline
Controls all preprocessing, feature creation, selection and modelling steps.
- class src.pipeline.Pipeline(collection: Optional[str] = None, queries: Optional[str] = None, queries_val: Optional[str] = None, queries_test: Optional[str] = None, features: Optional[str] = None, qrels_val: Optional[str] = None, qrels_test: Optional[str] = None, features_test: Optional[str] = None, features_val: Optional[str] = None)[source]
Bases: object
Class to combine the different download, preprocessing, modeling and evaluation steps.
- collection
Imports collection data from .pkl file if not None
- Type
str
- queries
Imports queries data from .pkl file if not None
- Type
str
- queries_val
Imports queries_val data from .pkl file if not None
- Type
str
- queries_test
Imports queries_test data from .pkl file if not None
- Type
str
- features
Imports features data from .pkl file if not None
- Type
pd.DataFrame
- qrels_val
Imports qrels_val data from .pkl file if not None
- Type
str
- qrels_test
Imports qrels_test data from .pkl file if not None
- Type
str
- features_test
Imports features_test data from .pkl file if not None
- Type
pd.DataFrame
- features_val
Imports features_val data from .pkl file if not None
- Type
pd.DataFrame
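The attribute descriptions above suggest that each constructor argument is a path to a .pkl file that is read only when the argument is not None. A minimal sketch of that pattern using the standard-library pickle module follows. The helper name load_optional_pickle is hypothetical and not part of src.pipeline, and the real class may load its data differently.

```python
import os
import pickle
import tempfile
from typing import Any, Optional


def load_optional_pickle(path: Optional[str]) -> Any:
    """Return the unpickled object at `path`, or None when no path is given.

    Hypothetical helper for illustration; not taken from src.pipeline.
    """
    if path is None:
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip a small made-up object through a temporary .pkl file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "queries.pkl")
    with open(demo_path, "wb") as handle:
        pickle.dump({"q1": "what is a pipeline"}, handle)

    loaded = load_optional_pickle(demo_path)   # object read from disk
    missing = load_optional_pickle(None)       # nothing to load
```

Only unpickle files from trusted sources, since pickle can execute arbitrary code while loading.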
- collection = None
- create_bert_feature(path_collection: str = 'data/embeddings/bert_collection_embeddings.pkl', path_query: str = 'data/embeddings/bert_query_embeddings.pkl')[source]
Creates BERT features.
- Parameters
path_collection (str) – Path to “bert_collection_embeddings.pkl”
path_query (str) – Path to “bert_query_embeddings.pkl”
- Returns
none
- create_glove_feature(path_collection: str = 'data/embeddings/glove_collection_embeddings.pkl', path_query: str = 'data/embeddings/glove_query_embeddings.pkl')[source]
Creates GloVe features.
- Parameters
path_collection (str) – Path to “glove_collection_embeddings.pkl”
path_query (str) – Path to “glove_query_embeddings.pkl”
- Returns
none
- create_tfidf_feature(path_collection: str = 'data/embeddings/tfidf_collection_embeddings.pkl', path_query: str = 'data/embeddings/tfidf_query_embeddings.pkl')[source]
Creates TF-IDF features.
- Parameters
path_collection (str) – Path to “tfidf_collection_embeddings.pkl”
path_query (str) – Path to “tfidf_query_embeddings.pkl”
- Returns
none
- create_w2v_feature(path_collection: str = 'data/embeddings/w2v_collection_embeddings.pkl', path_query: str = 'data/embeddings/w2v_query_embeddings.pkl')[source]
Creates word2vec features.
- Parameters
path_collection (str) – Path to “w2v_collection_embeddings.pkl”
path_query (str) – Path to “w2v_query_embeddings.pkl”
- Returns
none
- create_w2v_tfidf_feature(path_collection: str = 'data/embeddings/w2v_tfidf_collection_embeddings.pkl', path_query: str = 'data/embeddings/w2v_tfidf_query_embeddings.pkl')[source]
Creates TF-IDF-weighted word2vec features.
- Parameters
path_collection (str) – Path to “w2v_tfidf_collection_embeddings.pkl”
path_query (str) – Path to “w2v_tfidf_query_embeddings.pkl”
- Returns
none
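One common way to build TF-IDF-weighted word2vec features is to average the word vectors of a text, weighting each token by its TF-IDF score. The sketch below illustrates that idea in plain Python with made-up two-dimensional vectors. The function names, the smoothed IDF formula and the toy data are assumptions for illustration; how create_w2v_tfidf_feature actually computes its features is not stated here.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every token in `docs`."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {
        token: math.log((1 + n_docs) / (1 + df)) + 1.0
        for token, df in doc_freq.items()
    }


def weighted_embedding(
    tokens: List[str],
    vectors: Dict[str, List[float]],
    idf: Dict[str, float],
) -> List[float]:
    """TF-IDF-weighted mean of the word vectors of `tokens`.

    Assumes `vectors` is non-empty. Tokens without a vector or an IDF
    value are skipped; a text with no usable token gives a zero vector.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token, count in Counter(tokens).items():
        if token not in vectors or token not in idf:
            continue
        weight = (count / len(tokens)) * idf[token]  # term frequency * idf
        total = [t + weight * v for t, v in zip(total, vectors[token])]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


# Toy data: made-up vectors, not real word2vec output.
toy_vectors = {"the": [1.0, 1.0], "cat": [1.0, 0.0], "dog": [0.0, 1.0]}
toy_docs = [["the", "cat"], ["the", "dog"]]
toy_idf = idf_table(toy_docs)
embedding = weighted_embedding(["the", "cat"], toy_vectors, toy_idf)
```

Because "cat" is rarer than "the" in the toy corpus, it receives the larger weight and pulls the result toward its own vector.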
- evaluate(name: Optional[str] = None, model: str = 'nb', pca: int = 0, pairwise_model: Optional[str] = None, pairwise_top_k: int = 50, search_space: Optional[list] = None, trials: int = 20, models_path: Optional[str] = None, store_model_path: Optional[str] = None)[source]
Evaluates the performance of the model.
- Parameters
name (str) – Name of the experiment
model (str) – Specify model to test performance on
pca (int) – PCA components to use (None if 0)
pairwise_model (str) –
pairwise_top_k (int) –
search_space (list) –
models_path (str) –
store_model_path (str) – Path to store model to
- Returns
none
- features = Empty DataFrame Columns: [] Index: []
- features_test = Empty DataFrame Columns: [] Index: []
- features_val = Empty DataFrame Columns: [] Index: []
- forward_selection(model: str = 'nb', pca: int = 0, name=None)[source]
Performs forward feature selection to determine best features.
- Parameters
model (str) – Specify model to test performance on
pca (int) – PCA components to use (None if 0)
name (str) – Name of the experiment
- Returns
none
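Forward selection is usually a greedy loop: start with no features, repeatedly add the single feature that improves a score the most, and stop when no addition helps. The sketch below shows that generic loop. The name forward_select and the toy scoring function are hypothetical; the actual method presumably scores each candidate set by training the model named by model, which is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection over `candidates` (higher score is better)."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current selection.
        trials = [(score(selected + [feature]), feature) for feature in remaining]
        top_score, top_feature = max(trials)
        if top_score <= best_score:
            break  # no candidate improves the score any further
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Toy score: each feature adds a made-up gain, minus a small cost per feature.
TOY_GAINS = {"bert": 0.5, "tfidf": 0.3, "glove": 0.0}


def toy_score(features: List[str]) -> float:
    return sum(TOY_GAINS[f] for f in features) - 0.05 * len(features)


chosen, final_score = forward_select(list(TOY_GAINS), toy_score)
```

With these toy gains the loop picks "bert" first, then "tfidf", and stops because adding "glove" would lower the score.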
- preprocess(expansion=False)[source]
Preprocesses the data.
- Parameters
expansion (bool) – Whether query expansion should be used
- Returns
none
- qrels_test = None
- qrels_val = None
- queries = None
- queries_test = None
- queries_val = None
- save(name: str, path: str = 'data/processed')[source]
Saves created DataFrames as .pkl files.
- Parameters
name (str) – Specify name of dataset
path (str) – Path to store dataset to
- Returns
none
- setup(qrel_sampling: int = 20, training_sampling: int = 200, irrelevant_sampling: int = 0, datasets: Optional[list] = None, path: str = 'data/TREC_Passage')[source]
Samples from the different datasets and initializes pipeline.
- Parameters
qrel_sampling (int) – Number of samples to draw from “2019qrels-pass.txt”
training_sampling (int) – Number of samples to draw from “qidpidtriples.train.full.2.tsv”
irrelevant_sampling (int) –
datasets (list) – List of datasets to consider
path (str) – Path to datasets
- Returns
none
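To illustrate what sampling from a qrels file can look like, the sketch below parses lines in the standard TREC qrels layout (query ID, iteration, passage ID, rating) and keeps all judgments for a random subset of queries. This is only one possible reading of qrel_sampling. The function names, the fixed seed and the example rows are made up for illustration, and setup may sample differently.

```python
import random
from typing import Iterable, List, Tuple

QrelRow = Tuple[str, str, int]  # (query id, passage id, rating)


def parse_qrels(lines: Iterable[str]) -> List[QrelRow]:
    """Parse whitespace-separated 'qid iteration pid rating' lines."""
    rows: List[QrelRow] = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, pid, rating = parts
        rows.append((qid, pid, int(rating)))
    return rows


def sample_queries(rows: List[QrelRow], n: int, seed: int = 0) -> List[QrelRow]:
    """Keep every judgment belonging to `n` randomly chosen queries."""
    qids = sorted({qid for qid, _, _ in rows})
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = set(rng.sample(qids, min(n, len(qids))))
    return [row for row in rows if row[0] in chosen]


# Made-up rows in qrels layout; not taken from the real dataset.
demo_lines = [
    "101 Q0 9001 2",
    "101 Q0 9002 0",
    "102 Q0 9003 1",
    "103 Q0 9004 3",
    "",
]
demo_rows = parse_qrels(demo_lines)
demo_sample = sample_queries(demo_rows, n=2)
```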