Datasets

Used to download, import, and sample the datasets.

src.data.dataset.download(remote_url: Optional[str] = None, path: Optional[str] = None)[source]

Downloads files

Parameters
  • remote_url (str) – URL to dataset

  • path (str) – Location to store downloaded data.

Returns

None
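A minimal sketch of what such a download step might look like, using only the standard library. The name download_sketch and its return value are assumptions made for illustration; the real download() is documented as returning nothing, and its implementation is not shown here.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_sketch(remote_url: str, path: str) -> Path:
    """Hypothetical stand-in for download(): fetch one file into `path`."""
    target_dir = Path(path)
    # Create the destination directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last segment of the URL path.
    filename = Path(urlparse(remote_url).path).name
    target = target_dir / filename
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(remote_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Returning the saved location is a convenience for the caller in this sketch only.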

src.data.dataset.download_dataset(datasets: Optional[list] = None, path: str = 'data/TREC_Passage')[source]

Downloads the required files and unzips them by calling download and unzip

Parameters
  • datasets (list) – List of required files

  • path (str) – Location to store downloaded data.

Returns

None

src.data.dataset.import_collection(path: str = 'data/TREC_Passage', qrels_val: Optional[list] = None, qrels_test: Optional[list] = None, triples: Optional[list] = None, samples: int = 0)[source]

Imports data from the collection.tsv file

Parameters
  • path (str) – Location of dataset

  • qrels_val (list) –

  • qrels_test (list) –

  • triples (list) –

  • samples (int) – Specify number of rows to be imported from dataset

Returns

Data frame containing IDs and Passages from collection dataset

Return type

df (pd.DataFrame)
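The effect of the samples parameter can be illustrated with a small, dependency-free sketch. It assumes collection.tsv holds one tab-separated "pid, passage" pair per line and that samples = 0 means "read every row"; neither point is confirmed above. read_collection_sketch is a hypothetical helper, and the rows are toy data.

```python
import csv
import io
from itertools import islice


def read_collection_sketch(handle, samples: int = 0):
    """Hypothetical reader: return (pid, passage) rows from a TSV stream."""
    # QUOTE_NONE keeps quotation marks inside passages as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Assumption: a non-positive `samples` means no row limit.
    limit = samples if samples > 0 else None
    return [(int(pid), passage) for pid, passage in islice(reader, limit)]


# Toy stand-in for collection.tsv.
demo = io.StringIO("0\tfirst passage\n1\tsecond passage\n2\tthird passage\n")
rows = read_collection_sketch(demo, samples=2)
```

With pandas, a comparable limit is available through the nrows argument of pandas.read_csv together with sep="\t".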

src.data.dataset.import_qrels(path: str = 'data/TREC_Passage', samples: int = 5)[source]

Imports data from 2019qrels-pass.txt as validation set and from 2020qrels-pass.txt as test set

Parameters
  • path (str) – Location of dataset

  • samples (int) – Specify number of rows to be imported from dataset

Returns
  • df_val (pd.DataFrame) – Data frame containing validation set

  • df_test (pd.DataFrame) – Data frame containing test set
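For orientation, a sketch of parsing qrels lines, assuming the files follow the standard TREC qrels layout of four whitespace-separated fields (query ID, iteration, passage ID, relevance rating). parse_qrels_sketch and the Qrel tuple are hypothetical names, and the lines are toy data rather than real judgments.

```python
from typing import NamedTuple


class Qrel(NamedTuple):
    qid: int
    pid: int
    rating: int


def parse_qrels_sketch(lines, samples: int = 5):
    """Hypothetical parser for TREC-style qrels lines: 'qid Q0 pid rating'."""
    parsed = []
    for line in lines:
        # Stop once the requested number of rows has been collected.
        if len(parsed) >= samples:
            break
        # Skip blank lines.
        if not line.strip():
            continue
        qid, _iteration, pid, rating = line.split()
        parsed.append(Qrel(int(qid), int(pid), int(rating)))
    return parsed


# Toy lines standing in for the validation and test qrels files.
qrels_val = parse_qrels_sketch(["1 Q0 10 0", "1 Q0 11 2"])
qrels_test = parse_qrels_sketch(["2 Q0 20 3", "2 Q0 21 1", "3 Q0 30 0"], samples=2)
```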

src.data.dataset.import_queries(path: str = 'data/TREC_Passage', collection: Optional[list] = None)[source]

Imports training queries

Parameters
  • path (str) – Location of dataset

  • collection (list) –

Returns

Training query IDs and content

Return type

df (pd.DataFrame)

src.data.dataset.import_training_set(path: str = 'data/TREC_Passage', samples: int = 200)[source]

Imports data from qidpidtriples.train.full.2.tsv as training set

Parameters
  • path (str) – Location of dataset

  • samples (int) – Specify number of rows to be imported from dataset

Returns

Data frame containing training set

Return type

df (pd.DataFrame)
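A sketch of reading the first samples rows of a triples file, assuming each line holds three tab-separated integers: a query ID, the ID of a relevant passage, and the ID of a non-relevant passage. The column names and read_triples_sketch are hypothetical, and the rows are toy data.

```python
import csv
import io
from itertools import islice


def read_triples_sketch(handle, samples: int = 200):
    """Hypothetical reader for 'qid<TAB>pos_pid<TAB>neg_pid' rows."""
    reader = csv.reader(handle, delimiter="\t")
    # islice stops after `samples` rows, so large files are never read in full.
    return [
        {"qid": int(qid), "pos_pid": int(pos), "neg_pid": int(neg)}
        for qid, pos, neg in islice(reader, samples)
    ]


# Toy stand-in for the triples file.
triples_demo = io.StringIO("1\t10\t11\n1\t10\t12\n2\t20\t21\n")
triples = read_triples_sketch(triples_demo, samples=2)
```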

src.data.dataset.import_val_test_queries(path: str = 'data/TREC_Passage', qrels_val: Optional[list] = None, qrels_test: Optional[list] = None)[source]

Imports validation and test queries

Parameters
  • path (str) – Location of dataset

  • qrels_val (list) –

  • qrels_test (list) –

Returns

Query validation IDs and content test_df (pd.DataFrame): Query test IDs and content

Return type

val_df (pd.DataFrame)

src.data.dataset.unzip(file: Optional[str] = None)[source]

Unzips files

Parameters
  • file (str) – Specify file to unzip

Returns

None
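A minimal sketch of an extraction step, assuming the archive should be unpacked into the directory that contains it. The archive format used by the real unzip() is not stated above, so this sketch relies on shutil.unpack_archive, which infers the format (.zip, .tar.gz, and others) from the file name. unzip_sketch is a hypothetical name.

```python
import shutil
from pathlib import Path


def unzip_sketch(file: str) -> Path:
    """Hypothetical stand-in for unzip(): extract next to the archive."""
    archive = Path(file)
    # The archive format is chosen from the file extension.
    shutil.unpack_archive(str(archive), extract_dir=str(archive.parent))
    return archive.parent
```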