Main class DataReader
src.data_readers.data_reader.DataReader
DataReader class for reading a predetermined dataset and transforming the data for the UI.
Source code in src/data_readers/data_reader.py
__init__(configs)
Initialize the data reader.
Attributes:
- configs (dict): configuration dict of the dataset
- name (str): name of the dataset, as set in the run_apps args
- query_col (str): name of the query column
- sensitive_col (str): name of the sensitive column (e.g. gender) used for applying fairness interventions (optional)
- data_path (str): path to the dataset file
- output_file_path (str): path to the output file where the transformed dataset will be saved
Source code in src/data_readers/data_reader.py
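As a rough illustration of how the attributes listed for `__init__` could be populated from `configs`, the sketch below uses a stand-in class. The flat key names simply mirror the documented attribute names and the example values are made up; the project's actual configuration schema is not shown on this page and may be nested differently.

```python
from pathlib import Path


class SketchReader:
    """Stand-in for DataReader.__init__; not the project's implementation."""

    def __init__(self, configs: dict):
        self.configs = configs
        # Hypothetical flat keys named after the documented attributes.
        self.name = configs["name"]
        self.query_col = configs["query_col"]
        # Treated here as optional, so a missing key yields None.
        self.sensitive_col = configs.get("sensitive_col")
        self.data_path = Path(configs["data_path"])
        self.output_file_path = Path(configs["output_file_path"])


# Example values are invented for illustration only.
reader = SketchReader({
    "name": "amazon",
    "query_col": "category",
    "data_path": "datasets/amazon",
    "output_file_path": "datasets/amazon/format_data",
})
```

Keeping the paths as `Path` objects makes it straightforward to build the 'test' and 'train' subdirectories later on.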
read(split)
Read dataset file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
split | str | The split of the dataset to read ('test' or 'train'). | required |
Returns:
Type | Description |
---|---|
tuple | If split is 'test': a tuple containing the dataframes of documents, queries, and experiment lists. |
tuple | If split is 'train': a tuple containing the dataframes of documents and queries. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If the dataset file or query file is not found. |
Source code in src/data_readers/data_reader.py
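The split-dependent return value of `read(split)` can be sketched as follows. This is a simplified stand-in, not the method itself: it uses the standard `csv` module instead of pandas so that it runs without dependencies, it assumes the directory layout that `save_data()` is documented to produce, and it leaves the experiment lists as an empty placeholder because their source is not described on this page.

```python
import csv
import tempfile
from pathlib import Path


def load_csv(path: Path) -> list[dict]:
    # open() raises FileNotFoundError when the file is missing.
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def read_sketch(output_dir: Path, split: str) -> tuple:
    documents = load_csv(output_dir / split / "data.csv")
    queries = load_csv(output_dir / "query.csv")
    if split == "test":
        experiments = []  # placeholder: origin of experiment lists is assumed unknown
        return documents, queries, experiments
    return documents, queries


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train").mkdir()
    (root / "train" / "data.csv").write_text("doc_id,text\n1,hello\n")
    (root / "query.csv").write_text("query_id,query\n7,greeting\n")
    train_result = read_sketch(root, "train")
    try:
        read_sketch(root, "test")  # no test/data.csv was written
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Callers therefore need to unpack two values for 'train' and three for 'test'.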
save_data()
Save the transformed data in splits.
This method creates the necessary directories and saves the transformed data to CSV files.
The data is saved in the following structure:
- The main output directory is created at self.output_file_path.
- Inside the main output directory, two subdirectories are created: 'test' and 'train'.
- The transformed test data is saved as 'data.csv' inside the 'test' subdirectory.
This will be displayed in the UI.
- If there is transformed train data available, it is saved as 'data.csv' inside the 'train' subdirectory.
This will be used for training the ranker or fairness intervention.
- The dataset queries are saved as 'query.csv' inside the main output directory.
Note: The method assumes that the necessary data has already been transformed and is available.
Returns:
Type | Description |
---|---|
None |
Source code in src/data_readers/data_reader.py
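The directory layout described for `save_data()` can be reproduced with a small sketch. It is an approximation under stated assumptions: the function name `save_sketch` and its arguments are hypothetical, and rows are written with the standard `csv` module rather than from pandas DataFrames.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path: Path, rows: list[dict]) -> None:
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def save_sketch(output_dir: Path, test_rows, train_rows, query_rows) -> None:
    # Create <output>/test and <output>/train, including the parent directory.
    for split in ("test", "train"):
        (output_dir / split).mkdir(parents=True, exist_ok=True)
    write_csv(output_dir / "test" / "data.csv", test_rows)
    if train_rows:  # train data is only written when it is available
        write_csv(output_dir / "train" / "data.csv", train_rows)
    write_csv(output_dir / "query.csv", query_rows)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "format_data"
    save_sketch(out, [{"doc_id": 1}], [], [{"query": "q"}])
    created = sorted(p.relative_to(out).as_posix() for p in out.rglob("*"))
```

In this run no train data is passed, so the 'train' directory exists but stays empty.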
Extending DataReader Class
Here are a few examples of how to extend the DataReader class.
Data Reader for Amazon dataset
src.data_readers.data_reader_amazon.DataReaderAmazon
Bases: DataReader
Source code in src/data_readers/data_reader_amazon.py
transform_data()
Transform data into pandas.DataFrame and apply cleaning steps.
Reads the 'amazon.csv' file from the specified data path, drops rows with missing values, and performs data transformations on the columns. Returns the transformed data.
Returns:
Name | Type | Description |
---|---|---|
dataframe_query | DataFrame | A DataFrame containing the transformed query data. |
data_train | DataFrame | A DataFrame containing the transformed training data. |
data_test | DataFrame | A DataFrame containing the transformed testing data. |
Source code in src/data_readers/data_reader_amazon.py
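The extension pattern used by the readers on this page is to subclass the base reader and implement `transform_data()` so that it returns the query data, the training data, and the testing data. The sketch below shows that contract with stand-in classes. Everything in it is an assumption for illustration: the base class is a minimal placeholder rather than `DataReader`, plain lists of dicts stand in for DataFrames, and the half-and-half split rule is made up.

```python
class BaseReaderSketch:
    """Stand-in for DataReader; only the extension point is modelled."""

    def transform_data(self):
        raise NotImplementedError


class ToyReader(BaseReaderSketch):
    def __init__(self, rows):
        self.rows = rows

    def transform_data(self):
        # Drop rows with missing values, as the Amazon reader is described to do.
        clean = [
            row for row in self.rows
            if all(value not in (None, "") for value in row.values())
        ]
        # Unique queries in first-seen order.
        queries = list(dict.fromkeys(row["query"] for row in clean))
        # Hypothetical split rule: first half for training, rest for testing.
        cut = len(clean) // 2
        return queries, clean[:cut], clean[cut:]


rows = [
    {"query": "laptop", "title": "A"},
    {"query": "laptop", "title": None},
    {"query": "phone", "title": "B"},
    {"query": "phone", "title": "C"},
]
queries, data_train, data_test = ToyReader(rows).transform_data()
```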
Data Reader for CVS dataset
src.data_readers.data_reader_cvs.DataReaderCvs
Bases: DataReader
Source code in src/data_readers/data_reader_cvs.py
transform_data()
Transform data into pandas.DataFrame and apply cleaning steps.
This method reads and preprocesses data from multiple files and directories.
It iterates over each occupation directory, reads the query description from a JSON file,
formats the query as plain text, and appends it to the dataframes_occupations list.
It then lists all JSON files in each occupation directory, reads the candidate data from each file,
preprocesses the candidate data, and appends it to the dataframes_candidates list.
Finally, it concatenates all the DataFrames into a single DataFrame and returns the result.
Returns:
Name | Type | Description |
---|---|---|
dataframe_occupations | DataFrame | A DataFrame containing the preprocessed query data. |
data_train | DataFrame | A DataFrame containing the preprocessed candidate data for training. |
data_test | DataFrame | A DataFrame containing the preprocessed candidate data for testing. |
Source code in src/data_readers/data_reader_cvs.py
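The directory walk described for the CVS reader's `transform_data()` can be sketched with the standard library. The file name `query.json`, the `description` key, and the added `occupation` field are hypothetical; the real dataset layout is not documented here, and the final concatenation into DataFrames is left out.

```python
import json
import tempfile
from pathlib import Path


def collect(root: Path, query_file: str = "query.json"):
    occupations, candidates = [], []
    for occ_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # One query description per occupation directory (assumed file name).
        query = json.loads((occ_dir / query_file).read_text())
        occupations.append(
            {"occupation": occ_dir.name, "query": query["description"].strip()}
        )
        # Every other JSON file in the directory is treated as one candidate.
        for path in sorted(occ_dir.glob("*.json")):
            if path.name == query_file:
                continue
            record = json.loads(path.read_text())
            record["occupation"] = occ_dir.name
            candidates.append(record)
    return occupations, candidates


with tempfile.TemporaryDirectory() as tmp:
    nurse = Path(tmp) / "nurse"
    nurse.mkdir()
    (nurse / "query.json").write_text(json.dumps({"description": " Registered nurse "}))
    (nurse / "cv_1.json").write_text(json.dumps({"name": "x"}))
    occupations, candidates = collect(Path(tmp))
```

Each of the two lists could then be turned into a single DataFrame, which matches the "concatenate everything at the end" step in the description.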
Data Reader for Flickr dataset
src.data_readers.data_reader_flickr.DataReaderFlickr
Bases: DataReader
Source code in src/data_readers/data_reader_flickr.py
transform_data()
Transform data into pandas.DataFrame and apply cleaning steps.
Reads the Flickr data from a CSV file, performs data preprocessing, and returns the transformed data.
Returns:
Type | Description |
---|---|
tuple | A tuple containing the transformed data, in the order listed below. |
- dataframe_query (pandas.DataFrame): A DataFrame containing the transformed query data.
- data_train (pandas.DataFrame): A DataFrame containing the transformed training data.
- data_test (pandas.DataFrame): A DataFrame containing the transformed testing data.
Source code in src/data_readers/data_reader_flickr.py
Data Reader for Xing dataset
src.data_readers.data_reader_xing.Candidate
Bases: object
Represents a candidate in a set that is passed to a search algorithm. A candidate is composed of a qualification and a list of protected attributes (strings). If the list of protected attributes is empty or null, the candidate belongs to a non-protected group. The natural ordering is established by the qualification.
Source code in src/data_readers/data_reader_xing.py
isProtected
property
True if the list of ProtectedAttribute elements is non-empty, False otherwise.
__init__(work_experience, edu_experience, hits, qualification, protectedAttributes, member_since, degree)
@param qualification: describes how qualified the candidate is to match the search query
@param protectedAttributes: list of strings that represent the protected attributes this candidate has (e.g. gender, race, etc.); if the list is empty or null, this is a candidate from a non-protected group
Source code in src/data_readers/data_reader_xing.py
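The two behaviours documented for `Candidate`, the `isProtected` property and the natural ordering by qualification, can be illustrated with a reduced stand-in. `CandidateSketch` is hypothetical and keeps only two of the constructor arguments; how the real class implements its comparisons is not shown on this page.

```python
from dataclasses import dataclass, field
from functools import total_ordering
from typing import List, Optional


@total_ordering
@dataclass(eq=False)
class CandidateSketch:
    qualification: float
    protectedAttributes: Optional[List[str]] = field(default_factory=list)

    @property
    def isProtected(self) -> bool:
        # An empty list and None both mean "non-protected group".
        return bool(self.protectedAttributes)

    def __eq__(self, other):
        if not isinstance(other, CandidateSketch):
            return NotImplemented
        return self.qualification == other.qualification

    def __lt__(self, other):
        if not isinstance(other, CandidateSketch):
            return NotImplemented
        return self.qualification < other.qualification


pool = [
    CandidateSketch(0.4, ["female"]),
    CandidateSketch(0.9),
    CandidateSketch(0.7, None),
]
ranked = sorted(pool, reverse=True)  # best qualification first
```

Because ordering depends only on the qualification, two candidates with the same score compare as equal regardless of their protected attributes.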
src.data_readers.data_reader_xing.DataReaderXing
Bases: DataReader
Reads profiles collected from Xing for certain job description queries. The profiles are available in JSON format. They are read into a data frame indexed by the search queries that were used to obtain the candidate profiles.
The columns consist of arrays of Candidates: the protected ones, the non-protected ones, and one that contains all candidates in the same order as they were collected from the Xing website.
Search query | PROTECTED | NON-PROTECTED | ORIGINAL ORDERING |
---|---|---|---|
Administrative Assistant | [protected1, protected2, ...] | [nonProtected1, nonProtected2, ...] | [nonProtected1, protected1, ...] |
Auditor | [protected3, protected4, ...] | [nonProtected3, nonProtected3, ...] | [protected4, nonProtected3, ...] |
... | ... | ... | ... |
The protected attribute of a candidate is their sex. A candidate's sex was manually determined from the profile name. Depending on the dominating sex of a search query result, the other one was set as the protected attribute (e.g. for administrative assistant the protected attribute is male, for auditor it is female).
Source code in src/data_readers/data_reader_xing.py
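The rule for choosing the protected attribute per query (the sex that does not dominate the result list) can be expressed in a few lines. This is only one way to encode the rule: the function name and the "m"/"f" encoding are assumptions, and the class description does not say whether the project computes this automatically or sets it by hand.

```python
from collections import Counter


def protected_sex(sexes):
    """Return the non-dominating value, or None if only one value occurs."""
    counts = Counter(sexes)
    dominant, _ = counts.most_common(1)[0]
    others = [value for value in counts if value != dominant]
    return others[0] if others else None


assistant_protected = protected_sex(["f", "f", "m"])
auditor_protected = protected_sex(["m", "m", "f", "m"])
single_group = protected_sex(["f", "f"])
```

On a tie, `most_common` keeps first-seen order, so a real implementation would need an explicit tie-breaking rule.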
__determineEduMonths(r)
Takes a person's profile as a JSON node and computes the total number of months of education this person has.
Parameters:
r : JSON node
Source code in src/data_readers/data_reader_xing.py
__determineIfProtected(r, protAttr)
Takes a JSON profile and determines whether the person belongs to the protected group.
Parameters:
r : JSON node, a person description in JSON (everything below the node "profile")
Source code in src/data_readers/data_reader_xing.py
__determineWorkMonths(r)
Takes a person's profile as a JSON node and computes the total number of months of work experience this person has.
Parameters:
r : JSON node
Source code in src/data_readers/data_reader_xing.py
__readFileOfQuery(filename)
Takes one .json file, reads all of its information, creates Candidate objects from this information, and sorts them into three arrays: one contains all protected candidates, one contains all non-protected candidates, and one contains all candidates in the same order as they appear in the JSON file.
@param filename: the JSON file's name
@return:
- key: the search query string
- protected: array that contains all protected candidates
- nonProtected: array that contains all non-protected candidates
Source code in src/data_readers/data_reader_xing.py
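The three-way sorting described for `__readFileOfQuery` can be sketched as below. The JSON field names (`query`, `profiles`, `sex`) are hypothetical, the function takes a JSON string rather than a file name to stay self-contained, and plain dicts stand in for Candidate objects.

```python
import json


def partition(raw_json: str, protected_value: str):
    data = json.loads(raw_json)
    key = data["query"]  # hypothetical field holding the search query string
    protected, non_protected, original = [], [], []
    for profile in data["profiles"]:
        original.append(profile)  # keeps the order found in the file
        if profile["sex"] == protected_value:
            protected.append(profile)
        else:
            non_protected.append(profile)
    return key, protected, non_protected, original


sample = json.dumps({
    "query": "Auditor",
    "profiles": [
        {"name": "a", "sex": "m"},
        {"name": "b", "sex": "f"},
        {"name": "c", "sex": "m"},
    ],
})
key, protected, non_protected, original = partition(sample, protected_value="f")
```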
transform_data()
Transform data into pandas.DataFrame and apply cleaning steps.
Reads the XING data from a JSON file, performs data preprocessing, and returns the transformed data.
Returns:
Type | Description |
---|---|
tuple | A tuple containing the transformed data, in the order listed below. |
- dataframe_query (pandas.DataFrame): A DataFrame containing the transformed query data.
- data_train (pandas.DataFrame): A DataFrame containing the transformed training data.
- data_test (pandas.DataFrame): A DataFrame containing the transformed testing data.
Source code in src/data_readers/data_reader_xing.py