This page demonstrates Python tips and tricks that I use in my everyday programming as an atmospheric science graduate student.
-Brian Blaylock

Friday, August 4, 2017

Lessons Learned using the Open Science Grid

I have been playing with the Open Science Grid (OSG), a high-throughput computing system. I am using it to compute statistics from the last 2.5 years of HRRR data stored in the CHPC Pando archive. The archive is currently 35 TB and growing every day.

OSG storage limits: 102,400 MB and 500,000 files

The OSG appears to run faster in the early morning. Maybe fewer people are on it then.

The downside is the file transfer time if you need to move lots of files back to your own computer. I processed three years of HRRR data, generating almost 700 GB of statistical output for a single variable in 4 hours, but transferring all 8,700+ files to my home institution took 1 day and 7 hours. Still, that is faster than running the job on our local nodes: the same job took 7.5 days on wx4, which has 8 cores.

For my third run on the Open Science Grid, I calculated 30-day running percentiles of composite reflectivity (REFC:entire atmosphere) from the HRRR data for every hour of the year. It took 2 hours and 10 minutes, running 8,784 jobs (one job for every hour of the year, including the leap day). About 10-15 of the jobs failed for an unknown reason; I simply re-ran those locally, which didn't take long. The run created 331 GB of data (~43 MB per file), a little smaller than the temperature or wind grids.

I've seen over 455 jobs run simultaneously on the OSG.

File Size

LatLon: 18 MB

Variable    | Total size (per file)  | All Map Images | Transfer time
TMP:2 m     | 684 GB (~82 MB/file)   | 8.87 GB        | 6.5 hours (8 cores)
WIND:10 m   | 704 GB (~87 MB/file)   | 8.98 GB        | 2.5 hours (32 cores)
DPT:2 m     | _ GB (~84 MB/file)     | _ GB           | _ hours (32 cores)
REFC:entire | _ GB (~_ MB/file)      | _ GB           | _ hours (32 cores)

Each full run should have 8,784 jobs (one for every hour of the year, including the leap day).

Since I still get errors when running on the OSG, I have to do a little babysitting in my workflow:
  1. Create the condor_submit script on the OSG.
    • A Python script helps me with this (see the sketch after this list).
  2. Submit the job on the OSG:
    • condor_submit job.submit
  3. Copy the files to local CHPC storage using scp.
    • Again, another Python script loops through all available files.
  4. Rerun the statistics calculations locally for files that are missing.
    • For a reason I haven't investigated yet, some files are never created on the OSG. I rerun these locally because redoing 10-15 files doesn't take nearly as much time as the 8,000+ that run on the OSG.
  5. Rerun bad files.
    • Some of the files returned from the OSG are incomplete, so I check the file sizes and rerun locally any job whose output is noticeably smaller than the rest.
  6. Remove the files from the OSG.
    • These jobs produce a lot of data, over 500% of my allocation on the OSG, so I remove the output files as soon as they are transferred.
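
To show what step 1 looks like, here is a minimal sketch of the kind of Python helper I use to write the submit file. The wrapper script name (run_stats.sh), the hours.txt manifest, the log directory, and the resource request are placeholders, not necessarily my actual settings.

    # make_submit.py -- write an HTCondor submit file with one job per hour of the year
    from datetime import datetime, timedelta

    # Build the list of 8,784 hours for a leap year (e.g., 2016)
    hours = []
    d = datetime(2016, 1, 1)
    while d.year == 2016:
        hours.append(d.strftime("%Y%m%d%H"))
        d += timedelta(hours=1)

    with open("hours.txt", "w") as f:
        f.write("\n".join(hours) + "\n")

    with open("job.submit", "w") as f:
        f.write("universe       = vanilla\n")
        f.write("executable     = run_stats.sh\n")    # wrapper that runs the statistics script
        f.write("arguments      = $(hour)\n")
        f.write("request_memory = 8GB\n")             # the percentile jobs need ~8 GB of memory
        f.write("output         = logs/$(hour).out\n")
        f.write("error          = logs/$(hour).err\n")
        f.write("log            = logs/$(hour).log\n")
        f.write("queue hour from hours.txt\n")

Submitting is then just condor_submit job.submit (step 2).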

A note about saving HDF5 files:
It is best to save each array at the "top level" of the HDF5 file. In my first iteration of creating these files, I stored all my percentile calculations as a single 3D array, which requires pulling a whole slab of data just to index an individual level, i.e. file['percentile values'][0][10, 10] to get one point of data. It is much better and more efficient to grab data stored as 2D arrays at the "top level," the way my mean array is stored, i.e. file['mean'][10, 10]. With multiprocessing, I created a time series of the mean data at a point in 30 seconds, but it took 33 minutes to create the same time series for the percentile data!!! I'm pretty sure this is due to the added dimension of the percentile variable. A sketch of the two layouts follows.
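
Here is a minimal sketch of the two layouts using h5py and random stand-in data; the dataset names are examples, and the real files hold all nine percentile levels.

    import h5py
    import numpy as np

    ny, nx = 1059, 1799   # HRRR grid dimensions

    # First iteration: every percentile level stacked in one 3D dataset.
    with h5py.File("stats_3d.h5", "w") as f:
        f.create_dataset("percentile values", data=np.random.rand(9, ny, nx))

    # Better: one 2D dataset per statistic at the "top level" of the file.
    with h5py.File("stats_2d.h5", "w") as f:
        f.create_dataset("mean", data=np.random.rand(ny, nx))
        f.create_dataset("p95", data=np.random.rand(ny, nx))

    # Point reads: the 3D layout pulls a full 1059x1799 slab before indexing the point,
    # while the 2D layout reads the single value directly.
    with h5py.File("stats_3d.h5", "r") as f:
        slow = f["percentile values"][0][10, 10]
    with h5py.File("stats_2d.h5", "r") as f:
        fast = f["mean"][10, 10]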

Each HRRR file has a grid size of 1059x1799 (1.9 million grid points) and 136 different meteorological variables. That is a lot of data to sift through.

The deal with using the OSG is that your jobs need to be embarrassingly parallel. Thus, I have framed the research questions I want answered in an embarrassingly parallel framework.
In my specific case, I have chosen to compute hourly statistics for each month, a job that can be done in 288 parts (12 months x 24 hours). This method gives me about 60-90 samples for each hour (60 samples if two years are available, or 90 samples if three years are available). A back-of-the-envelope sketch of the decomposition is below.
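
This is just a quick illustration of the decomposition and the approximate sample counts; the real job list is what gets fed to the submit script.

    # One independent job for every (month, hour) pair: 12 * 24 = 288 jobs.
    jobs = [(month, hour) for month in range(1, 13) for hour in range(24)]
    assert len(jobs) == 288

    # Approximate sample count for one job: ~30 days per month times the years archived.
    for years in (2, 3):
        print(years, "years ->", 30 * years, "samples per (month, hour) pair")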

To give you an idea of the speed of computing on the OSG: I have a job that takes about 2 hours to run on the meso4 server, which has 32 processors, and creates 288 files of ~65 MB each. I submitted the same job to the OSG, and it ran in about 15 minutes. The difference is that on meso4 I ran the 288 jobs in serial, whereas on the OSG I ran the 288 jobs simultaneously, each on a machine with between 8 and 48 processors, as soon as the computing resources became available. If you don't count the time spent waiting in the queue for resources, the computations themselves take less than one minute per job (typically between 10 and 60 seconds, with an average of about 30 seconds) spread across 8,000+ cores. The trade-off is the transfer time: files move from the compute node (farmed out somewhere in the United States) to my OSG home directory, and then from the OSG home directory to meso4 via scp. The transfer from the OSG to meso4 can take up to 24 minutes if done in serial (about 4-5 seconds per file for 288 files). A sketch of a parallel transfer is below.
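
This is a minimal sketch of the kind of parallel scp loop I mean, assuming SSH keys are set up; the hostname, username, paths, and file-name pattern are placeholders rather than my real ones.

    # transfer.py -- pull the OSG output files back to meso4 with several scp calls at once
    import subprocess
    from multiprocessing import Pool

    REMOTE = "blaylockbk@osg-login.example.edu"   # placeholder login host
    SRC = "/home/blaylockbk/hrrr_stats"           # placeholder remote directory
    DST = "/local/hrrr_stats"                     # placeholder local directory

    def fetch(fname):
        """Copy one file; each scp still costs ~4-5 seconds, but many run at once."""
        return subprocess.call(["scp", "%s:%s/%s" % (REMOTE, SRC, fname), DST])

    if __name__ == "__main__":
        files = ["stats_m%02d_h%02d.nc" % (m, h)
                 for m in range(1, 13) for h in range(24)]   # the 288 monthly-hourly files
        p = Pool(8)            # 8 transfers at a time
        p.map(fetch, files)
        p.close()
        p.join()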



I am sifting through TONS of data. Each variable for each hour is about 1 MB in GRIB2 format, downloaded from the Pando archive. Converting the field to NetCDF when saving the final results bloats the file size. On the OSG I have had to request computers with a minimum of 8 GB of memory in order to complete the computations. In the future, when there are 2 or 3 years of additional HRRR data, this method of calculating statistics will require a computer with a minimum of 16 GB of available memory. (Computing the max, min, and mean of the data set is not memory intensive. Computing percentiles is memory intensive, because all the values need to be held and then sorted; a quick sketch of the difference is below.)
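
To illustrate the memory difference, here is a minimal sketch with random stand-in data: the running max and mean keep only one grid's worth of state between samples, while the percentiles need every sample stacked in memory before sorting.

    import numpy as np

    ny, nx = 1059, 1799        # HRRR grid
    n_samples = 90             # ~90 samples for a (month, hour) pair with three years of data

    # Running max and mean: memory use stays at a single grid, no matter how many samples.
    running_max = np.full((ny, nx), -np.inf)
    running_sum = np.zeros((ny, nx))
    for _ in range(n_samples):
        sample = np.random.rand(ny, nx).astype(np.float32)   # stand-in for one HRRR field
        running_max = np.maximum(running_max, sample)
        running_sum += sample
    mean = running_sum / n_samples

    # Percentiles: every sample must be held and sorted, so memory grows with the archive.
    # 90 float32 grids is already ~0.7 GB for a single variable.
    stack = np.empty((n_samples, ny, nx), dtype=np.float32)
    for i in range(n_samples):
        stack[i] = np.random.rand(ny, nx).astype(np.float32)
    p95 = np.percentile(stack, 95, axis=0)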

Examples of data computed:


Note: the graphs shouldn't be interpreted as a continuum. Each month is independent of the others, which explains the steep steps in value from month to month. The steps occur because each value is a monthly maximum. For example, the maximum for the 23rd hour in May and the maximum for the 0th hour in June can be very different, because the May maximum may come from May 30th while the June maximum may come from June 30th. Thus, the line representing the maximum of each hour in each month cannot be considered continuous.

Because I only have 2.5 years of data, a loop showing a map of the maximum or minimum values appears to show a couple of underlying weather features moving around the map. These are caused by the extreme events of individual days.


Most useful commands:
condor_submit name_of_job_file.submit    # submit the jobs described in a submit file
condor_q                                 # show the job queue
condor_q -nobatch                        # show each job on its own line instead of batched
condor_q blaylockbk                      # show only my jobs
condor_rm 1234567890                     # remove a job (or cluster) by its ID

The queue looks different depending on whether you are at the normal command prompt or have loaded a Singularity image. I find that the queue seen from within the Singularity image is filled with someone else's submitted jobs. Jobs I've submitted get through the queue really quickly from the normal command prompt (when I'm not in the Singularity image).
  • Data transfer of large files is somewhat cumbersome. I use the scp command to move files from the OSG to the CHPC environment.
  • I use the netCDF Python module for writing the data, with basic compression (complevel=1), which turns a 180 MB file into a 65 MB file. Each file contains nine statistical fields (max, min, mean, p1, p5, p10, p90, p95, p99), fields for latitude and longitude, and details on the number of cores used to compute the statistics, the total time, and the beginning and ending dates. (A writing sketch follows this list.)
  • I couldn't set up SSH keys with PuTTY, so I gave up.
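
For the netCDF bullet, here is a minimal sketch of writing one of those compressed files with the netCDF4 module; the variable names, dimensions, and attribute values are illustrative rather than my exact schema.

    from netCDF4 import Dataset
    import numpy as np

    nc = Dataset("hrrr_stats_sample.nc", "w", format="NETCDF4")
    nc.createDimension("y", 1059)
    nc.createDimension("x", 1799)

    # Nine statistics plus latitude and longitude, all with basic zlib compression.
    names = ["max", "min", "mean", "p1", "p5", "p10", "p90", "p95", "p99",
             "latitude", "longitude"]
    for name in names:
        var = nc.createVariable(name, "f4", ("y", "x"), zlib=True, complevel=1)
        var[:] = np.zeros((1059, 1799), dtype="f4")   # placeholder data

    # Bookkeeping stored as global attributes: cores used, total time, begin/end dates.
    nc.cores = 32
    nc.timer = "0:00:42"
    nc.begin_date = "2015-04-18"
    nc.end_date = "2017-08-04"
    nc.close()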

Python multiprocessing is easy to use, but it doesn't really gain me any ground for what I'm doing. I primarily use multiprocessing to speed up the downloads; it doesn't appear to make much difference in the compute time. Below is a scatter plot, for 288 files, of the number of cores used in the computation versus the number of seconds it took to complete the computations. Keep in mind that some of these jobs calculate statistics from a large sample size, while others use a smaller sample size.

Below is for the 2m Temperature file:

What do we learn about the number of samples in a computation? For a given number of samples, the jobs can run either fast or slow:



Running the same script on the OSG for another variable, this time 10 m wind, gives a different result. Here you see that more cores do increase the speed of the computations, up to a point.

Number of samples for each hour: April 18, 2015 is when the archive started.



I have plotted some of the data from my experiment:


Then, I can compare these statistics to the observed temperature and the HRRR analysis data:

Wind, with the 95th percentile