EECS 675: Multicore and GPU Programming

Spring 2020

Project 2

Due: April 6, 2020 (emailed by 11:59:59 p.m.)

Consider the following two files that contain various types of data for 500 U.S. cities:

The two files contain the same basic data in two different forms. The csv file will probably be easier for you to parse, but the json file contains additional descriptive information that tells you what the columns of the spreadsheet mean. You can simply search the json file for "description". Skip the first instance, which is the description of the data file as a whole. The second and subsequent instances of "description" identify the data stored in columns A, B, C, …, respectively.
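The "description" search described above could be sketched in C++ roughly as follows. This is only a minimal sketch, not a required approach: the function names findDescriptions and columnDescriptions are made up for illustration, and the code assumes every "description" key in the json file is followed by a colon and a quoted string value. It does not decode escapes such as \n; it simply keeps the character after a backslash.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect the quoted string that follows each "description" key, in file order.
// Assumption: each key appears as "description" : "some text".
std::vector<std::string> findDescriptions(const std::string& json) {
    const std::string key = "\"description\"";
    std::vector<std::string> found;
    std::size_t pos = 0;
    while ((pos = json.find(key, pos)) != std::string::npos) {
        pos += key.size();
        std::size_t colon = json.find(':', pos);
        if (colon == std::string::npos) break;
        std::size_t open = json.find('"', colon + 1);
        if (open == std::string::npos) break;
        std::string value;
        std::size_t i = open + 1;
        for (; i < json.size() && json[i] != '"'; ++i) {
            // On a backslash, skip it and keep the escaped character as-is.
            if (json[i] == '\\' && i + 1 < json.size()) ++i;
            value += json[i];
        }
        found.push_back(value);
        pos = i;  // continue searching after this value
    }
    return found;
}

// Drop the first description, which describes the data file as a whole.
// Element 0 of the result then corresponds to column A, element 1 to column B, ...
std::vector<std::string> columnDescriptions(const std::string& json) {
    std::vector<std::string> all = findDescriptions(json);
    if (!all.empty()) all.erase(all.begin());
    return all;
}
```

Calling columnDescriptions on the full text of the json file (read into a std::string) would then give one entry per spreadsheet column, provided the assumption about the file layout holds.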

Project Requirements

  1. The rank 0 process must read the csv file referenced above and store all the data. You can treat columns with range data (stored, for example, as "(13.3, 14.0)") in any way you want. You will not need to use data from those columns in any part of this project.
  2. Use the parameters in argv to determine how the data is to be communicated to other processes as well as how results are to be sent back to the rank 0 process. In general, you will be computing some quantity or quantities, and then reporting the results to the standard output. All I/O (both reading the file as well as writing results to standard output) must be done strictly by the rank 0 process.
  3. All queries will be handled either by a "scatter – reduce" or a "broadcast – gather" strategy. The value of argv[1] will be either "sr" or "bg", respectively.
  4. The choices for "sr" are:
    Example mpirun command                  Answer reported by rank 0 process for this example command
    mpirun -np 20 proj2 sr max D            New York, NY, Population2010 = 8175133.00
    mpirun -np 20 proj2 sr min D            Burlington, VT, Population2010 = 42417.00
    mpirun -np 20 proj2 sr avg CO           Average OBESITY_CrudePrev = 23.40
    mpirun -np 20 proj2 sr number AS gt 55  Number cities with COLON_SCREEN_CrudePrev gt 55 = 430
    mpirun -np 20 proj2 sr number CO lt 20  Number of cities with OBESITY_CrudePrev lt 20 = 109

    The string following "max", "min", "avg", and "number" refers to a column in the original spreadsheet (e.g., column D, column CO, and column AS as in the table above). No relational operators other than "lt" and "gt" ("less than" and "greater than", respectively) are needed.

    Note that a city is only reported for "max" and "min".

    For the example commands here in #4, divide the work evenly among the number of processes specified in the "-np" directive. There are 500 cities in the data file, so exit immediately with an error message (reported only by rank 0) if the program is launched with an "-np" value that does not evenly divide 500.

  5. The choices for "bg" are:
    Example mpirun command              Answer reported by rank 0 process for this example command
    mpirun -np 4 proj2 bg max D E I CO  max Population2010 = 8175133.00; New York, NY
                                        max ACCESS2_CrudePrev = 51.50; Pharr, TX
                                        max ARTHRITIS_CrudePrev = 36.80; Charleston, WV
                                        max OBESITY_CrudePrev = 38.80; Dayton, OH
    mpirun -np 4 proj2 bg min D E I CO  min Population2010 = 42417.00; Burlington, VT
                                        min ACCESS2_CrudePrev = 4.20; Newton, MA
                                        min ARTHRITIS_CrudePrev = 9.40; College Station, TX
                                        min OBESITY_CrudePrev = 12.20; Milpitas, CA
    mpirun -np 4 proj2 bg avg D E I CO  avg Population2010 = 206041.62
                                        avg ACCESS2_CrudePrev = 19.41
                                        avg ARTHRITIS_CrudePrev = 22.43
                                        avg OBESITY_CrudePrev = 23.40

    Again, note that a city is only reported for "max" and "min".

    For the example commands here in #5, each process must do the work associated with one of the requested columns. Therefore the number of processes specified by "-np" must be the same as the number of columns to be examined. Notice that in the examples shown, we wanted to examine four columns (D, E, I, and CO), hence we specified "-np 4". Immediately terminate the program with an error message if the number of processes is not the same as the number of columns to scan.
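The column labels and the "-np" checks in #4 and #5 could be handled along the following lines. This is a hedged sketch rather than a required design: the names columnIndex, launchError, and kNumCities are hypothetical, the zero-based index (A is 0) assumes you store each csv row as an array of fields in spreadsheet order, and the error message wording is only a placeholder. The sketch contains no MPI calls; in the real program numProcs would come from MPI_Comm_size, and only rank 0 would print the message before every process exits.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

const int kNumCities = 500;  // number of cities in the data file

// Convert a spreadsheet column label to a zero-based index:
// A -> 0, Z -> 25, AA -> 26, and so on.
int columnIndex(const std::string& label) {
    if (label.empty()) throw std::invalid_argument("empty column label");
    int index = 0;
    for (char ch : label) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (!std::isalpha(c))
            throw std::invalid_argument("bad column label: " + label);
        // Treat the label as a base-26 number whose digits run from 1 (A) to 26 (Z).
        index = index * 26 + (std::toupper(c) - 'A' + 1);
    }
    return index - 1;
}

// Return an empty string if the launch is valid; otherwise return an
// error message (intended to be printed by rank 0 only).
// numColumns is only consulted for the "bg" strategy.
std::string launchError(const std::string& strategy, int numProcs, int numColumns) {
    if (strategy == "sr") {
        if (numProcs <= 0 || kNumCities % numProcs != 0)
            return "sr: -np must evenly divide " + std::to_string(kNumCities);
    } else if (strategy == "bg") {
        if (numProcs != numColumns)
            return "bg: -np must equal the number of columns";
    } else {
        return "argv[1] must be \"sr\" or \"bg\"";
    }
    return "";
}
```

For the example commands above, columnIndex gives 3 for D, 44 for AS, and 92 for CO; launchError("sr", 20, 1) and launchError("bg", 4, 4) both return an empty string, while launchError("sr", 3, 1) returns an error message because 3 does not evenly divide 500.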

By the way...

Do you know why I specified "scatter" for the queries in #4, but "broadcast" for the queries in #5? Obviously I want you to get experience with both, but there is more to it than that. (I could ask a similar question of why "reduce" in #4 and "gather" in #5, but that should be a bit more obvious.)

Project Submission

Remove any object files (i.e., *.o) and your linked executable program. Then create and send a tar file of the project2 directory to me at jrmiller@ku.edu.