Diving deeper into the analysis and API

This notebook is meant to give a quick introduction into the main features and workflows of pyaerocom.

This includes brief introductions into the following features:

What?

Relevant classes and/or methods

Finding model and observation data

`pya.browse_database <https://pyaerocom.met.no/api.html#pyaerocom.io.utils.browse_database>`__

Reading of gridded model data

`pya.io.ReadGridded <https://pyaerocom.met.no/api.html#pyaerocom.io.readgridded.ReadGridded>`__

Working with gridded data

`pya.GriddedData <https://pyaerocom.met.no/api.html#pyaerocom.griddeddata.GriddedData>`__

Reading of ungridded observation data

`pya.io.ReadUngridded <https://pyaerocom.met.no/api.html#reading-of-ungridded-data>`__

Working with ungridded data

`pya.UngriddedData <https://pyaerocom.met.no/api.html#pyaerocom.ungriddeddata.UngriddedData>`__

Working with data from individual site locations

`pya.StationData <https://pyaerocom.met.no/api.html#pyaerocom.stationdata.StationData>`__

Colocation of model and observational data

High-level: `pya.colocation_auto <https://pyaerocom.met.no/api.html#module-pyaerocom.colocation_auto>`__

Low-level: `pya.colocation <https://pyaerocom.met.no/api.html#module-pyaerocom.colocation>`__

Working with colocated data

`pya.ColocatedData <https://pyaerocom.met.no/api.html#pyaerocom.colocateddata.ColocatedData>`__

In a graphical way it introduces the main data object and processing routines for model and observation comparisons with pyaerocom, illustrated in the following flowchart:

Basic processing flowchart of pyaerocom

Only that in this example “Data server” is the local computer that has the minimal testdataset as an example dataset.

[1]:
import pyaerocom as pya
pya.__version__
[1]:
'0.10.0rc2'

Should be at least 0.10.X

Check access to testdata

NOTE: details regarding testdata access and intialization are covered in tutorial notebook getting_started_setup.ipynb.

[2]:
from pyaerocom.testdata_access import initialise
initialise()
Adding data search directory /home/jonasg/MyPyaerocom/testdata-minimal/modeldata.
Adding data search directory /home/jonasg/MyPyaerocom/testdata-minimal/obsdata.
Adding data search directory /home/jonasg/MyPyaerocom/testdata-minimal/config.
Adding ungridded dataset AeronetSunV3L2Subset.daily located at /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSunV3Lev2.daily/renamed.Reader: <class 'pyaerocom.io.read_aeronet_sunv3.ReadAeronetSunV3'>
Adding ungridded dataset AeronetSDAV3L2Subset.daily located at /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSDAV3Lev2.daily/renamed.Reader: <class 'pyaerocom.io.read_aeronet_sdav3.ReadAeronetSdaV3'>
Adding ungridded dataset AeronetInvV3L2Subset.daily located at /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetInvV3Lev2.daily/renamed.Reader: <class 'pyaerocom.io.read_aeronet_invv3.ReadAeronetInvV3'>
Adding ungridded dataset EBASSubset located at /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/EBASMultiColumn.Reader: <class 'pyaerocom.io.read_ebas.ReadEbas'>
pyaerocom-testdata is ready to be used. The data is available at /home/jonasg/MyPyaerocom/testdata-minimal

Model data: Reading of and working with gridded data

This section provides an introduction into the following pyaerocom classes and architectures:

*you may click the links to see the online documentation of these classes.

Pre-remark on the ReadGridded class

As you could see in tutorial getting_started_setup.ipynb the ReadGridded class makes extensive use of the AeroCom file naming conventions. So if you have model data that is stored using different conventions (e.g. CMIP6), this class will not be of much help (yet) for filtering the correct files to read. In that case you may locate a model NetCDF file yourself and pass it directly into a GriddedData object on initialisation.

The testdataset contains data from the TM5 model, which is used in the following. You can use the browse_database function of pyaerocom to find model ID’s (which can be quite cryptic sometimes) using wildcard pattern search.

[3]:
pya.browse_database('*TM5*')

Pyaerocom ReadGridded
---------------------
Data ID: TM5-met2010_CTRL-TEST
Data directory: /home/jonasg/MyPyaerocom/testdata-minimal/modeldata/TM5-met2010_CTRL-TEST/renamed
Available experiments: ['AP3']
Available years: [2010]
Available frequencies ['daily' 'monthly']
Available variables: ['abs550aer', 'od550aer']
[3]:
['TM5-met2010_CTRL-TEST']
[4]:
model_id = 'TM5-met2010_CTRL-TEST'
reader = pya.io.ReadGridded(model_id)

You can have a look at the individual files and corresponding metadata using the file_info attribute:

[5]:
reader.file_info
[5]:
var_name year ts_type vert_code data_id name meteo experiment perturbation is_at_stations 3D filename
0 abs550aer 2010 daily Column TM5-met2010_CTRL-TEST TM5 met2010 AP3 CTRL2019 False False aerocom3_TM5-met2010_AP3-CTRL2019_abs550aer_Co...
3 abs550aer 2010 monthly Column TM5-met2010_CTRL-TEST TM5 met2010 AP3 CTRL2019 False False aerocom3_TM5-met2010_AP3-CTRL2019_abs550aer_Co...
1 od550aer 2010 daily Column TM5-met2010_CTRL-TEST TM5 met2010 AP3 CTRL2019 False False aerocom3_TM5-met2010_AP3-CTRL2019_od550aer_Col...
2 od550aer 2010 monthly Column TM5-met2010_CTRL-TEST TM5 AP3 CTRL2016 False False aerocom3_TM5_AP3-CTRL2016_od550aer_Column_2010...

You can also filter this attribute based on what you are interested in. E.g.:

[6]:
reader.filter_files(var_name='od550aer')
[6]:
var_name year ts_type vert_code data_id name meteo experiment perturbation is_at_stations 3D filename
1 od550aer 2010 daily Column TM5-met2010_CTRL-TEST TM5 met2010 AP3 CTRL2019 False False aerocom3_TM5-met2010_AP3-CTRL2019_od550aer_Col...
2 od550aer 2010 monthly Column TM5-met2010_CTRL-TEST TM5 AP3 CTRL2016 False False aerocom3_TM5_AP3-CTRL2016_od550aer_Column_2010...
[7]:
od550aer = reader.read_var('od550aer')
[8]:
od550aer.quickplot_map();
../_images/pyaerocom-tutorials_getting_started_analysis_22_0.png

Ups, this looks rather incomplete. The reason is that pyaerocom picked the available daily dataset, which is cropped in the minimal testdataset for storage purpose. Let’s try monthly.

[9]:
od550aer_tm5 = reader.read_var('od550aer', ts_type='monthly')
od550aer_tm5.quickplot_map();
../_images/pyaerocom-tutorials_getting_started_analysis_24_0.png

Looking better. You may wonder why only January is displayed here. This is because quickplot_map picks the first available timestamp in the dataset, you may specify that explicitly.

Under the hood pyaerocom.GriddedData is based on the iris.Cube object class (iris library) and features very similar functionality (and more).

The loaded Cube instance can be accessed via:

[10]:
od550aer_tm5.cube
[10]:
Atmosphere Optical Thickness Due To Ambient Aerosol (1) time latitude longitude
Shape 12 90 120
Dimension coordinates
time x - -
latitude - x -
longitude - - x
Attributes
Conventions CF-1.6
computed False
concatenated False
contact Twan van Noije (noije@knmi.nl)
data_id TM5-met2010_CTRL-TEST
experiment AP3
experiment_id AP3-CTRL2016
from_files ['/home/jonasg/MyPyaerocom/testdata-minimal/modeldata/TM5-met2010_CTRL...
institute_id KNMI
institution Royal Netherlands Meteorological Institute, De Bilt, The Netherlands
meteo
model_id TM5
outliers_removed False
perturbation CTRL2016
project_id AeroCom Phase 3
reader None
references Van Noije, T.P.C., et al. (Geosci. Model Dev., 7, 2435-2475, 2014); Van...
region None
regridded False
source TM5-mp: CTM ERA-Interim 3x2 34L
title TM5 model output prepared for AeroCom Phase 3
ts_type monthly
var_name_read n/d
Cell methods
point longitude, latitude
mean time

If you have not heard of xarray, you should check it out. If you have heard of it (or maybe even used it already) you may convert a GriddedData object to an xarray.DataArray via:

[11]:
xarr = od550aer_tm5.to_xarray()
xarr
[11]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
'od550aer'
  • time: 12
  • lat: 90
  • lon: 120
  • dask.array<chunksize=(12, 90, 61), meta=np.ndarray>
    Array Chunk
    Bytes 518.40 kB 263.52 kB
    Shape (12, 90, 120) (12, 90, 61)
    Count 10 Tasks 2 Chunks
    Type float32 numpy.ndarray
    120 90 12
    • time
      (time)
      object
      2010-01-15 12:00:00 ... 2010-12-15 12:00:00
      standard_name :
      time
      long_name :
      Time
      array([cftime.DatetimeJulian(2010, 1, 15, 12, 0, 0, 0),
             cftime.DatetimeJulian(2010, 2, 14, 0, 0, 0, 0),
             cftime.DatetimeJulian(2010, 3, 15, 12, 0, 0, 0),
             cftime.DatetimeJulian(2010, 4, 15, 0, 0, 0, 0),
             cftime.DatetimeJulian(2010, 5, 15, 12, 0, 0, 0),
             cftime.DatetimeJulian(2010, 6, 15, 0, 0, 0, 0),
             cftime.DatetimeJulian(2010, 7, 15, 12, 0, 0, 0),
             cftime.DatetimeJulian(2010, 8, 15, 12, 0, 0, 0),
             cftime.DatetimeJulian(2010, 9, 15, 0, 0, 0, 0),
             cftime.DatetimeJulian(2010, 10, 15, 12, 0, 0, 0),
             cftime.DatetimeJulian(2010, 11, 15, 0, 0, 0, 0),
             cftime.DatetimeJulian(2010, 12, 15, 12, 0, 0, 0)], dtype=object)
    • lat
      (lat)
      float64
      -89.0 -87.0 -85.0 ... 87.0 89.0
      standard_name :
      latitude
      long_name :
      Center coordinates for latitudes
      units :
      degrees
      array([-89., -87., -85., -83., -81., -79., -77., -75., -73., -71., -69., -67.,
             -65., -63., -61., -59., -57., -55., -53., -51., -49., -47., -45., -43.,
             -41., -39., -37., -35., -33., -31., -29., -27., -25., -23., -21., -19.,
             -17., -15., -13., -11.,  -9.,  -7.,  -5.,  -3.,  -1.,   1.,   3.,   5.,
               7.,   9.,  11.,  13.,  15.,  17.,  19.,  21.,  23.,  25.,  27.,  29.,
              31.,  33.,  35.,  37.,  39.,  41.,  43.,  45.,  47.,  49.,  51.,  53.,
              55.,  57.,  59.,  61.,  63.,  65.,  67.,  69.,  71.,  73.,  75.,  77.,
              79.,  81.,  83.,  85.,  87.,  89.])
    • lon
      (lon)
      float64
      -181.5 -178.5 ... 172.5 175.5
      standard_name :
      longitude
      long_name :
      Center coordinates for longitudes
      units :
      degrees
      array([-181.5, -178.5, -175.5, -172.5, -169.5, -166.5, -163.5, -160.5, -157.5,
             -154.5, -151.5, -148.5, -145.5, -142.5, -139.5, -136.5, -133.5, -130.5,
             -127.5, -124.5, -121.5, -118.5, -115.5, -112.5, -109.5, -106.5, -103.5,
             -100.5,  -97.5,  -94.5,  -91.5,  -88.5,  -85.5,  -82.5,  -79.5,  -76.5,
              -73.5,  -70.5,  -67.5,  -64.5,  -61.5,  -58.5,  -55.5,  -52.5,  -49.5,
              -46.5,  -43.5,  -40.5,  -37.5,  -34.5,  -31.5,  -28.5,  -25.5,  -22.5,
              -19.5,  -16.5,  -13.5,  -10.5,   -7.5,   -4.5,   -1.5,    1.5,    4.5,
                7.5,   10.5,   13.5,   16.5,   19.5,   22.5,   25.5,   28.5,   31.5,
               34.5,   37.5,   40.5,   43.5,   46.5,   49.5,   52.5,   55.5,   58.5,
               61.5,   64.5,   67.5,   70.5,   73.5,   76.5,   79.5,   82.5,   85.5,
               88.5,   91.5,   94.5,   97.5,  100.5,  103.5,  106.5,  109.5,  112.5,
              115.5,  118.5,  121.5,  124.5,  127.5,  130.5,  133.5,  136.5,  139.5,
              142.5,  145.5,  148.5,  151.5,  154.5,  157.5,  160.5,  163.5,  166.5,
              169.5,  172.5,  175.5])
  • standard_name :
    atmosphere_optical_thickness_due_to_ambient_aerosol
    long_name :
    Ambient Aerosol Optical Thickness at 550 nm
    institution :
    Royal Netherlands Meteorological Institute, De Bilt, The Netherlands
    institute_id :
    KNMI
    source :
    TM5-mp: CTM ERA-Interim 3x2 34L
    model_id :
    TM5
    references :
    Van Noije, T.P.C., et al. (Geosci. Model Dev., 7, 2435-2475, 2014); Van Noije, T.P.C., et al. (manuscript in preparation)
    experiment_id :
    AP3-CTRL2016
    project_id :
    AeroCom Phase 3
    title :
    TM5 model output prepared for AeroCom Phase 3
    Conventions :
    CF-1.6
    contact :
    Twan van Noije (noije@knmi.nl)
    from_files :
    ['/home/jonasg/MyPyaerocom/testdata-minimal/modeldata/TM5-met2010_CTRL-TEST/renamed/aerocom3_TM5_AP3-CTRL2016_od550aer_Column_2010_monthly.nc']
    data_id :
    TM5-met2010_CTRL-TEST
    var_name_read :
    n/d
    ts_type :
    monthly
    regridded :
    False
    outliers_removed :
    False
    computed :
    False
    concatenated :
    False
    meteo :
    experiment :
    AP3
    perturbation :
    CTRL2016
    cell_methods :
    longitude: latitude: point time: mean

Simply print the object.

[12]:
print(od550aer)
pyaerocom.GriddedData: TM5-met2010_CTRL-TEST
Grid data: atmosphere_optical_thickness_due_to_ambient_aerosol / (1) (time: 365; latitude: 11; longitude: 11)
     Dimension coordinates:
          time                                                 x              -              -
          latitude                                             -              x              -
          longitude                                            -              -              x
     Attributes:
          Conventions: CF-1.6
          NCO: 4.7.2
          computed: False
          concatenated: False
          contact: Twan van Noije (noije@knmi.nl)
          data_id: TM5-met2010_CTRL-TEST
          experiment: AP3
          experiment_id: AP3-CTRL2019
          from_files: ['/home/jonasg/MyPyaerocom/testdata-minimal/modeldata/TM5-met2010_CTRL...
          history: Wed Jul  8 10:31:53 2020: ncks -d lat,20,30 -d lon,20,30 raw/aerocom3_TM5-met2010_AP3-CTRL2019_od550aer_Column_2010_daily.nc...
          institute_id: KNMI
          institution: Royal Netherlands Meteorological Institute, De Bilt, The Netherlands
          meteo: met2010
          model_id: TM5
          outliers_removed: False
          perturbation: CTRL2019
          project_id: AeroCom Phase 3
          reader: None
          references: Van Noije, T.P.C., et al. (Geosci. Model Dev., 7, 2435-2475, 2014); Bergman,...
          region: None
          regridded: False
          source: TM5-mp, r1058: CTM ERA-Interim 3x2 34L
          title: TM5 model output prepared for AeroCom Phase 3
          ts_type: daily
          var_name_read: n/d
     Cell methods:
          point: longitude, latitude
          mean: time

Dimension coordinates can be simply accessed either using [] or . operator, e.g.

[13]:
od550aer['latitude']
[13]:
DimCoord(array([-49., -47., -45., -43., -41., -39., -37., -35., -33., -31., -29.]), bounds=array([[-50., -48.],
       [-48., -46.],
       [-46., -44.],
       [-44., -42.],
       [-42., -40.],
       [-40., -38.],
       [-38., -36.],
       [-36., -34.],
       [-34., -32.],
       [-32., -30.],
       [-30., -28.]]), standard_name='latitude', units=Unit('degrees'), long_name='Center coordinates for latitudes', var_name='lat')
[14]:
od550aer.longitude
[14]:
DimCoord(array([61.5, 64.5, 67.5, 70.5, 73.5, 76.5, 79.5, 82.5, 85.5, 88.5, 91.5]), bounds=array([[60., 63.],
       [63., 66.],
       [66., 69.],
       [69., 72.],
       [72., 75.],
       [75., 78.],
       [78., 81.],
       [81., 84.],
       [84., 87.],
       [87., 90.],
       [90., 93.]]), standard_name='longitude', units=Unit('degrees'), long_name='Center coordinates for longitudes', var_name='lon')

They are instances of iris.coords.DimCoords, as defined in the underlying Cube instance used in the GriddedData object.

Time stamps

Time stamps are represented as numerical values with respect to a reference date and frequency, according to the CF conventions. They can be accessed via the time attribute of the data class (if the data contains a time dimension).

[15]:
od550aer_tm5.time
[15]:
DimCoord(array([58454.5, 58484. , 58513.5, 58544. , 58574.5, 58605. , 58635.5,
       58666.5, 58697. , 58727.5, 58758. , 58788.5]), standard_name='time', units=Unit('days since 1850-01-01 00:00', calendar='julian'), long_name='Time', var_name='time')

You may also want the time-stamps in the form of actual datetime-like objects. These can be computed using the time_stamps() method:

[16]:
od550aer.time_stamps()[0:3]
[16]:
array(['2010-01-01T00:00:00.000000', '2010-01-02T00:00:00.000000',
       '2010-01-03T00:00:00.000000'], dtype='datetime64[us]')

As introduced above, maps of individual time stamps can be plotted using the quickplot_map method. Above we used the default call, which chooses the first available timestamp. You may also specify which date you are interested in:

[17]:
od550aer_tm5.quickplot_map('2010-10');
../_images/pyaerocom-tutorials_getting_started_analysis_41_0.png

If you want more control on the input parameters of the map plotting function (e.g. color-binning, lower, upper limit, colorbar, etc.), you may use the underlying plot method (that is also used in GriddedData.quickplot_map, which is available at `pya.plot.mapping.plot_griddeddata_on_map <https://pyaerocom.met.no/api.html#pyaerocom.plot.mapping.plot_griddeddata_on_map>`__, e.g.:

[18]:
pya.plot.mapping.plot_griddeddata_on_map(od550aer_tm5[1], xlim=(-60, 60), ylim=(-30, 30), vmin=0, vmax=0.4, log_scale=False);
../_images/pyaerocom-tutorials_getting_started_analysis_43_0.png

Regional filtering can be performed using the Filter class.

Rectangular regions

An overview of rectangular AeroCom default regions can be accessed via:

[19]:
print(pya.const.OLD_AEROCOM_REGIONS)
['WORLD', 'ASIA', 'AUSTRALIA', 'CHINA', 'EUROPE', 'INDIA', 'NAFRICA', 'SAFRICA', 'SAMERICA', 'NAMERICA']

Let’s choose north Africa as an example. Create instance of Filter class:

[20]:
f = pya.Filter('NAFRICA')

You can print its region attribute to see the edges:

[21]:
print(f.region)
pyaeorocom Region
Name: NAFRICA
Longitude range: [-17, 50]
Latitude range: [0, 40]
Longitude range (plots): [-17, 50]
Latitude range (plots): [0, 40]

Now apply to the model data object:

[22]:
od550aer_nafrica = f(od550aer_tm5)

Compare shapes:

[23]:
od550aer_nafrica
[23]:
pyaerocom.GriddedData
Grid data: <iris 'Cube' of atmosphere_optical_thickness_due_to_ambient_aerosol / (1) (time: 12; latitude: 22; longitude: 23)>
[24]:
od550aer_tm5
[24]:
pyaerocom.GriddedData
Grid data: <iris 'Cube' of atmosphere_optical_thickness_due_to_ambient_aerosol / (1) (time: 12; latitude: 90; longitude: 120)>

As you can see, the filtered object is reduced in the longitude and latitude dimension. Let’s have a look:

[25]:
od550aer_nafrica.quickplot_map('March 2010');
../_images/pyaerocom-tutorials_getting_started_analysis_56_0.png

Binary region masks

Available HTAP binary filter masks can be accessed via:

[26]:
print(pya.const.HTAP_REGIONS)
['PAN', 'EAS', 'NAF', 'MDE', 'LAND', 'SAS', 'SPO', 'OCN', 'SEA', 'RBU', 'EEUROPE', 'NAM', 'WEUROPE', 'SAF', 'USA', 'SAM', 'EUR', 'NPO', 'MCA']

And they are handled in the same way as the rectangular regions:

[27]:
pya.Filter('OCN')(od550aer_tm5).quickplot_map();
Failed to compute / add area weighted mean. Reason: ValueError('Format specifier missing precision')
../_images/pyaerocom-tutorials_getting_started_analysis_60_1.png

As you can see the provided HTAP region masks are only valid within 60\(^\circ\)S to 60\(^\circ\)N.

Filtering of time

Filtering of time is not included in the Filter class (which only allows for regional filtering) but can be easily performed from the GriddedData object directly. If you know the indices of the time stamps you want to crop, you can simply use numpy indexing syntax (remember that we have a 3D array containing time, latitude and lonfgitude).

Let’s say we are interested in the (northern hemispheric) summer months of June to September.

Since the time dimension corresponds the first index in the 3D data (time, lat, lon), and since we know, that we have monthly 2010 data (see above), we may use:

[28]:
od550aer_summer = od550aer_tm5[5:8]
od550aer_summer.time_stamps()
[28]:
array(['2010-06-15T00:00:00.000000', '2010-07-15T12:00:00.000000',
       '2010-08-15T12:00:00.000000'], dtype='datetime64[us]')

However, this methodology might not always be handy (imagine you have a 10 year dataset of 3hourly sampled data and want to extract three months in the 6th year …). In that case, you can perform the cropping using the actual timestamps:

[29]:
od550aer_tm5.crop(time_range=('6-2010', '9-2010')).time_stamps()
[29]:
array(['2010-06-15T00:00:00.000000', '2010-07-15T12:00:00.000000',
       '2010-08-15T12:00:00.000000'], dtype='datetime64[us]')

Data selection over multiple dimensions

Inspired by the xarray.DataArray.sel method, a similar method was implemented in GriddedData:

[30]:
od550aer_tm5.sel(time='April 2010', longitude=(90, 179), latitude=(-50, 20)).quickplot_map();
../_images/pyaerocom-tutorials_getting_started_analysis_68_0.png

NOTE: Before release of version 0.10.0, there was a bug that led to a crash if a time range (i.e. time=(start, stop)) was passed into the sel method.

You may regrid GriddedData using the regrid method (for regional regridding) or the resample_time method (for temporal resmpling). Like already done above, the calls may be combined, e.g.:

[31]:
lowres = od550aer_tm5.regrid(lat_res_deg=10, lon_res_deg=20).resample_time('yearly')
lowres
[31]:
pyaerocom.GriddedData
Grid data: <iris 'Cube' of od550aer / (1) (time: 1; latitude: 18; longitude: 18)>

As you can see, the time dimension only has one entry, as expected, as the data only contains 2010 timestamps and we computed a yearly average, lat and lon dimensions are also reduced, accordingly.

[32]:
lowres.quickplot_map();
../_images/pyaerocom-tutorials_getting_started_analysis_73_0.png

Regional averaging

The actual cell sizes of latitude and longitude coordinates vary, dependent on where you are, that is, they are largest close to the equator, and smallest near the poles. When computing a regional average, this needs to be considered (i.e. values need to be weighted by their actual cell size). This is area weighted regridding is implemented in the iris library and is used by default in GriddedData, for instance, when calling:

[33]:
od550aer_tm5.mean()
[33]:
0.11864813532841474

You may specify if you do not want to use area weighting:

[34]:
od550aer_tm5.mean(areaweighted=False)
[34]:
0.09825691

Makes quite a difference, doesn’t it?

Time-series at individual coordinates can be extracted from a GriddedData object via:

[35]:
ts_data = od550aer_tm5.to_time_series(latitude=60, longitude=11)
ts_data
[35]:
[StationData([('dtime', []),
              ('var_info', BrowseDict([('od550aer', {'units': Unit('1')})])),
              ('station_coords',
               {'latitude': None, 'longitude': None, 'altitude': None}),
              ('data_err', BrowseDict()),
              ('overlap', BrowseDict()),
              ('numobs', BrowseDict()),
              ('data_flagged', BrowseDict()),
              ('filename', None),
              ('station_id', None),
              ('station_name', None),
              ('instrument_name', None),
              ('PI', None),
              ('country', None),
              ('country_code', None),
              ('ts_type', 'monthly'),
              ('latitude', 61.0),
              ('longitude', 10.5),
              ('altitude', nan),
              ('data_id', 'TM5-met2010_CTRL-TEST'),
              ('dataset_name', None),
              ('data_product', None),
              ('data_version', None),
              ('data_level', None),
              ('revision_date', None),
              ('website', None),
              ('ts_type_src', None),
              ('stat_merge_pref_attr', None),
              ('od550aer',
               2010-01-15 12:00:00    0.049607
               2010-02-14 00:00:00    0.061162
               2010-03-15 12:00:00    0.069986
               2010-04-15 00:00:00    0.097556
               2010-05-15 12:00:00    0.103770
               2010-06-15 00:00:00    0.107482
               2010-07-15 12:00:00    0.146354
               2010-08-15 12:00:00    0.145518
               2010-09-15 00:00:00    0.078066
               2010-10-15 12:00:00    0.077722
               2010-11-15 00:00:00    0.037447
               2010-12-15 12:00:00    0.039024
               dtype: float32)])]

As you can see from the output, the return value of this method is a list, that contains one pyaerocom.StationData object. The reason why this method returns a list is because it is usually called with many input coordinates (e.g. all site locations of an observation network), and thus, returns a list of StationData objects, one for each input coordinate.

The StationData object is basically a dictionary-like object with some extra functionality.

[36]:
station = ts_data[0]

The actual time-series is a pandas.Series object and can be accessed through the variable name (remember, GriddedData instances are single variable).

[37]:
ts = station['od550aer']
ts
[37]:
2010-01-15 12:00:00    0.049607
2010-02-14 00:00:00    0.061162
2010-03-15 12:00:00    0.069986
2010-04-15 00:00:00    0.097556
2010-05-15 12:00:00    0.103770
2010-06-15 00:00:00    0.107482
2010-07-15 12:00:00    0.146354
2010-08-15 12:00:00    0.145518
2010-09-15 00:00:00    0.078066
2010-10-15 12:00:00    0.077722
2010-11-15 00:00:00    0.037447
2010-12-15 12:00:00    0.039024
dtype: float32
[38]:
ax = ts.plot()
ax.set_title('TM5 AOD Oslo')
ax.set_ylabel('AOD [550 nm]');
../_images/pyaerocom-tutorials_getting_started_analysis_86_0.png

This is as much as there is to say for now about GriddedData and StationData for now. Let’s have a little closer look at observations. After all, the main purpose of the AeroCom initiative is to compare models with observations. As we shall see below, the just introduced StationData object will play a key role when bringing gridded model data (GriddedData) together with ungridded observational data, such as measurements of a certain variable at a given site location.

In the following section the reading of ungridded data is illustrated based on the example of AERONET version 3 (level 2) data.

Observational data: Reading of and working with ungridded data

This section provides brief introductions into the following pyaerocom classes and architectures:

Primer on observational data

Other than model data, which can be provided as a gridded object over a certain domain (e.g. latitude, longitude, time) and in that, can be considered fully sampled, observational data is usually sparsely sampled in space and time.

That is, consider a network of observations of a certain variable (e.g. od550aer, or AOD), with many different site locations around the globe. Each of these sites is measuring the variable at that exact location, and the whole network of sites makes a point cloud of site locations in the latitude, longitude domain. In addition, since these are real world measurements, the temporal sampling itself between the different sites is not synchronised, that is, each site is measuring independently of any other site.

For instance, the AERONET network is a global network of sun photometer measurements, that can measure the AOD at several wavelengths based on measurements of the solar irradiance. Thus, at the least, these measurements require 2 things:

  1. Daylight

  2. A clear sky

Thus, it is needless to say, that a site in Antarctica cannot measure at the same time as a site in Ny-Ålesund (actually, that is also not strictly true, as AERONET now also provides AOD measurements based on the lunar irradiance, but I hope you got the point anyways).

This should illustrate, that it is more difficult to define a harmonised and yet, flexible data format for such observational databases. In pyaerocom, the UngriddedData object is designed for such point cloud data and typically holds the data belonging to a whole observation network, that is:

The ``UngriddedData`` object can be considered a point-cloud-like dataobject that holds individual time-series from many locations around the globe and the associated metadata for each site and measurement

Moreover, since observational data typically comes from many different observation networks, the formats in which these data are stored typically vary from network to network, which makes it harder to read the data, compared to model data which typically comes as NetCDF file and these days, most often follow some metadata conventions such as the CF conventions.

Data from the AERONET network (that is introduced in the following), for instance, is provided in the form of column seperated text files per measurement station, where columns correspond to different variables and data rows to individual time stamps.

As a result, custom reading routines for individual observation networks need to be implemented, and pyaerocom provides reading support for many commonly used observational databases such as AERONET, or the EBAS or EARLINET data.

The basic workflow for reading of ungridded data, such as Aeronet data, is very similar to the reading of gridded data (comprising a reading class that handles a query and returns a data class, here UngriddedData. However, under the hood, the implementation is a little more complicated, as there are reading classes for each supported network, as illustrated in the following flowchart:

Flowchart ungridded reading

The actual classes handling the reading of data (for a given dataset) are indicated in blue. The orange ReadUngridded class is a factory class, that knows about the blue reading classes via a unique ID (similar to the gridded reading). Thus, as indicated, as a user, you do not need to know which exact reading class you need, you just need the ID and ReadUngridded will know which (blue) reader to use. To summarise, what you need for reading an ungridded dataset is:

  1. A path where the actual datafiles are located

  2. An unique ID, that links that path with a name

  3. A reader that can read the class

The first 2 points are available via:

[39]:
pya.const.OBSLOCS_UNGRIDDED
[39]:
OrderedDict([('AeronetSunV3L2Subset.daily',
              '/home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSunV3Lev2.daily/renamed'),
             ('AeronetSDAV3L2Subset.daily',
              '/home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSDAV3Lev2.daily/renamed'),
             ('AeronetInvV3L2Subset.daily',
              '/home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetInvV3Lev2.daily/renamed'),
             ('EBASSubset',
              '/home/jonasg/MyPyaerocom/testdata-minimal/obsdata/EBASMultiColumn')])

And the reader classes that are supposed to be used for each of these IDs is provided in the ReadUngridded class header:

[40]:
pya.io.ReadUngridded.SUPPORTED_READERS
[40]:
[pyaerocom.io.read_aeronet_invv3.ReadAeronetInvV3,
 pyaerocom.io.read_aeronet_invv2.ReadAeronetInvV2,
 pyaerocom.io.read_aeronet_sdav2.ReadAeronetSdaV2,
 pyaerocom.io.read_aeronet_sdav3.ReadAeronetSdaV3,
 pyaerocom.io.read_aeronet_sunv2.ReadAeronetSunV2,
 pyaerocom.io.read_aeronet_sunv3.ReadAeronetSunV3,
 pyaerocom.io.read_earlinet.ReadEarlinet,
 pyaerocom.io.read_ebas.ReadEbas,
 pyaerocom.io.read_gaw.ReadGAW,
 pyaerocom.io.read_aasetal.ReadAasEtal,
 pyaerocom.io.read_ghost.ReadGhost]

The link between ID (keys of const.OBSLOCS_UNGRIDDED) and reader is available in the actual readers themselves, e.g.:

[41]:
pya.io.read_aeronet_sunv3.ReadAeronetSunV3.SUPPORTED_DATASETS
[41]:
['AeronetSunV3Lev1.5.daily',
 'AeronetSunV3Lev1.5.AP',
 'AeronetSunV3Lev2.daily',
 'AeronetSunV3Lev2.AP',
 'AeronetSunV3L2Subset.daily']

But these are details that you usually do not need to worry about. If you want to register a new observation dataset, you need the 3 points specified above and can add it via:

[42]:
aeronet_sun_datadir = f'{pya.const.OUTPUTDIR}/testdata-minimal/obsdata/AeronetSunV3Lev2.daily/renamed'
pya.const.add_ungridded_obs(obs_id='Bla',
                            data_dir=aeronet_sun_datadir,
                            reader=pya.io.read_aeronet_sunv3.ReadAeronetSunV3)

Now, we basically have 2 names for the same dataset:

[43]:
pya.io.read_aeronet_sunv3.ReadAeronetSunV3.SUPPORTED_DATASETS
[43]:
['AeronetSunV3Lev1.5.daily',
 'AeronetSunV3Lev1.5.AP',
 'AeronetSunV3Lev2.daily',
 'AeronetSunV3Lev2.AP',
 'AeronetSunV3L2Subset.daily',
 'Bla']

That is, the data under the above directory is now accessible via 2 IDs: Bla and AeronetSunV3L2Subset.daily.

Before continuing with the reading of observational data, some things need to be said related to the caching of UngriddedData objects.

Caching of UngriddedData

Reading of ungridded data is often rather time-consuming. Therefore, pyaerocom uses a caching strategy that stores loaded instances of the UngriddedData class as pickle files in a cache directory (illustrated in the flowchart shown above). The loaction of the cache directory can be accessed via:

[44]:
pya.const.CACHEDIR
[44]:
'/home/jonasg/MyPyaerocom/_cache/jonasg'

You may change this directory if required.

[45]:
f'Caching is active? {pya.const.CACHING}'
[45]:
'Caching is active? True'
Deactivate / Activate caching
[46]:
pya.const.CACHING = False
[47]:
pya.const.CACHING = True

Note: if caching is active, make sure you have enough disk quota or change location where the cache files are stored.

Read Aeronet Sun v3 level 2 data

As illustrated in the flowchart above, ungridded observation data can be imported using the ReadUngridded class. Like for the model data, observation datasets can be searched as follows:

[48]:
pya.browse_database('Aeronet*');

Dataset name: AeronetSunV3L2Subset.daily
Data directory: /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSunV3Lev2.daily/renamed
Supported variables: ['od340aer', 'od440aer', 'od500aer', 'od870aer', 'ang4487aer', 'ang44&87aer', 'od550aer']
Last revision: n/d

Dataset name: AeronetSDAV3L2Subset.daily
Data directory: /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSDAV3Lev2.daily/renamed
Supported variables: ['od500gt1aer', 'od500lt1aer', 'od500aer', 'ang4487aer', 'od550aer', 'od550gt1aer', 'od550lt1aer']
Last revision: n/d

Dataset name: AeronetInvV3L2Subset.daily
Data directory: /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetInvV3Lev2.daily/renamed
Supported variables: ['abs440aer', 'angabs4487aer', 'od440aer', 'ang4487aer', 'abs550aer', 'od550aer']
Last revision: n/d

The search routine found 3 matches for the 3 different AERONET data products: Sun, SDA, and Inv (inversion). You may read more about the different products at the AERONET website.

Let’s continue with the “Sun” product (AERONET Direct Sun algorithm). As you can see from the output above, this dataset contains daily averages, which is convenient to use for model evaluation.

[49]:
obs_id = 'AeronetSunV3L2Subset.daily'
[50]:
obs_reader = pya.io.ReadUngridded(obs_id)
print(obs_reader)

Dataset name: AeronetSunV3L2Subset.daily
Data directory: /home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSunV3Lev2.daily/renamed
Supported variables: ['od340aer', 'od440aer', 'od500aer', 'od870aer', 'ang4487aer', 'ang44&87aer', 'od550aer']
Last revision: n/d

Let’s read the data (you can read a single or multiple variables at the same time). For now, we only read the AOD at 550 nm:

[51]:
od550aer_aeronet = obs_reader.read(vars_to_retrieve='od550aer')
od550aer_aeronet
[51]:
UngriddedData <networks: ['AeronetSunV3L2Subset.daily']; vars: ['od550aer']; instruments: ['sun_photometer'];No. of metadata units: 22

As you can see, the data object is of type UngriddedData. Other than GriddedData, UngriddedData can hold an arbitrary number of variables, and even networks. The number of metadata units indicates the number of data files that have been read.

Plot all station coordinates

To get an overview, you can plot all site coordinates contained in the dataset. You can also plot multiple times into the same map with different input criteria. For instance, below we first plot all site locations available in the data (in red), and then, on top of it, in green, we plot sites that contain data in 2010.

[52]:
ax = od550aer_aeronet.plot_station_coordinates(markersize=80)
od550aer_aeronet.plot_station_coordinates(color='lime', var_name='od550aer', start=2010, stop=2011, markersize=20, ax=ax)
[52]:
<cartopy.mpl.geoaxes.GeoAxes at 0x7f0963f0fa90>
../_images/pyaerocom-tutorials_getting_started_analysis_123_1.png

Access of individual stations

For intercomparison with model data, we are interested in time-series from individual sites. You can check out all existing site-location names via:

[53]:
od550aer_aeronet.unique_station_names
[53]:
['AAOT',
 'ARIAKE_TOWER',
 'Agoufou',
 'Alta_Floresta',
 'American_Samoa',
 'Amsterdam_Island',
 'Anmyon',
 'Avignon',
 'Azores',
 'BORDEAUX',
 'Barbados',
 'Blyth_NOAH',
 'La_Paz',
 'Mauna_Loa',
 'Tahiti',
 'Taihu',
 'Taipei_CWB',
 'Tamanrasset_INM',
 'The_Hague',
 'Thessaloniki',
 'Thornton_C-power',
 'Trelew']

To access individual site location data as StationData you can simply do:

[54]:
station_data = od550aer_aeronet['La_Paz'] # this is fully equivalent with aeronet_data.to_station_data('Leipzig')
station_data
[54]:
StationData([('dtime',
              array(['2006-02-18T00:00:00.000000000', '2006-02-19T00:00:00.000000000',
                     '2006-02-20T00:00:00.000000000', ...,
                     '2019-05-24T00:00:00.000000000', '2019-05-25T00:00:00.000000000',
                     '2019-05-26T00:00:00.000000000'], dtype='datetime64[ns]')),
             ('var_info',
              BrowseDict([('od550aer',
                           OrderedDict([('units', '1'),
                                        ('overlap', False),
                                        ('ts_type', 'daily'),
                                        ('apply_constraints', False),
                                        ('min_num_obs', None),
                                        ('how', 'mean')]))])),
             ('station_coords',
              {'latitude': -16.538999999999998,
               'longitude': -68.066467,
               'altitude': 3439.0}),
             ('data_err', BrowseDict()),
             ('overlap', BrowseDict()),
             ('numobs', BrowseDict()),
             ('data_flagged', BrowseDict()),
             ('filename',
              '/home/jonasg/MyPyaerocom/testdata-minimal/obsdata/AeronetSunV3Lev2.daily/renamed/La_Paz.lev30'),
             ('station_id', None),
             ('station_name', 'La_Paz'),
             ('instrument_name', 'sun_photometer'),
             ('PI', 'Brent_Holben'),
             ('country', None),
             ('country_code', None),
             ('ts_type', 'daily'),
             ('latitude', -16.538999999999998),
             ('longitude', -68.066467),
             ('altitude', 3439.0),
             ('data_id', 'AeronetSunV3L2Subset.daily'),
             ('dataset_name', None),
             ('data_product', None),
             ('data_version', None),
             ('data_level', None),
             ('revision_date', None),
             ('website', None),
             ('ts_type_src', 'daily'),
             ('stat_merge_pref_attr', None),
             ('data_revision', 'n/d'),
             ('od550aer',
              2006-02-18    0.087568
              2006-02-19         NaN
              2006-02-20         NaN
              2006-02-21         NaN
              2006-02-22    0.101807
                              ...
              2019-05-22         NaN
              2019-05-23    0.059489
              2019-05-24    0.049292
              2019-05-25    0.074172
              2019-05-26    0.060137
              Freq: D, Length: 4846, dtype: float64)])

As you can see, the returned object is of type StationData which has been introduced above (remember, we extracted a time series from the TM5 model for the location of Oslo).

As mentioned above, it can be used like a dictionary, and the variable time-series can be accessed via:

[55]:
station_data['od550aer'].plot()
[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0969e50510>
../_images/pyaerocom-tutorials_getting_started_analysis_129_1.png

You may also plot directly from the StationData object (and do some more other hopefully self-explanatory things):

[56]:
ax = station_data.plot_timeseries('od550aer', marker='x', ls='none')
station_data.plot_timeseries('od550aer', freq='monthly', marker=' ', ls='-', lw=3, ax=ax)
[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f09681bf250>
../_images/pyaerocom-tutorials_getting_started_analysis_131_1.png

Back to UngriddedData: You may also retrieve the StationData with specifying more constraints using to_station_data (e.g. in monthly resolution and only for the year 2010). And you can overlay different curves, by passing the axes instance returned by the plotting method:

[57]:
ax=od550aer_aeronet.to_station_data('La_Paz',
                                    start=2010, stop=2011,
                                    freq='daily').plot_timeseries('od550aer')

ax=od550aer_aeronet.to_station_data('La_Paz',
                                    start=2010,
                                    freq='monthly').plot_timeseries('od550aer', ax=ax)
ax.legend()
ax.set_title('La Paz AODs 2010');
../_images/pyaerocom-tutorials_getting_started_analysis_133_0.png

You can also plot the time-series directly from UngriddedData

[58]:
od550aer_aeronet.plot_station_timeseries('La_Paz', 'od550aer', ts_type='monthly',
                                     start=2018).set_title('Monthly AOD in La Paz, 2018');
../_images/pyaerocom-tutorials_getting_started_analysis_135_0.png

Low-level colocation routine(s)

Let’s colocate the TM5 model data with the AERONET AOD subset for the year 2010 and in monthly resolution. Since we already have both data objects loaded, we can go straight to the low-level colocation routine:

[61]:
coldata = pya.colocation.colocate_gridded_ungridded(od550aer_tm5,
                                                    od550aer_aeronet,
                                                    ts_type='monthly',
                                                    start=2010,
                                                    filter_name='WORLD-noMOUNTAINS')
Input filters {'latitude': [-89.0, 89.0], 'longitude': [-181.5, 175.5]} result in unchanged data object

The filter-name WORLD-noMOUNTAINS denotes that all available AERONET sites are supposed to be used but high altitude sites (located above 1000m a.s.l). A more detailed introduction into available regions and region filters is provided in the getting_started_setup.ipynb tutorial.

You may create a scatter plot from these colocated monthly means, which includes relevant statistical parameters that help to assess model performance:

[62]:
coldata.plot_scatter(loglog=True);
../_images/pyaerocom-tutorials_getting_started_analysis_144_0.png

Does not look too bad, you can see that this result is from 8 sites and 62 data points (monthly averages). The normalised-mean-bias (NMB) is -12.9%, which means that the model slightly underestimates AOD at these locations.

A more illustrative view of the model biases can be retrieved by plotting a bias map:

[63]:
pya.plot.mapping.plot_nmb_map_colocateddata(coldata);
../_images/pyaerocom-tutorials_getting_started_analysis_146_0.png

The fact that you can barely see most of the sites is a good sign, since 0% bias is mapped to white color which is the same as the background color here. The largest bias is found in Amsterdam Island, in the southern Indian Ocean, which could be an indication that the model is simulating too little sea-salt aerosol in this very remote and clean region.

Under the hood …

… the ColocatedData object is an xarray.DataArray:

[64]:
coldata.data
[64]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
'od550aer'
  • data_source: 2
  • time: 12
  • station_name: 8
  • nan 0.1077 0.0365 0.113 nan ... 0.07274 0.5046 0.3815 0.1642 0.04502
    array([[[       nan, 0.10766512, 0.03649583, 0.11300932,        nan,
                    nan, 0.13828996, 0.03593528],
            [       nan, 0.1281071 , 0.05131321, 0.1490295 ,        nan,
                    nan, 0.09390731, 0.03669082],
            [       nan, 0.06650846,        nan, 0.15400053, 0.77145611,
             0.81960458, 0.15375983, 0.03303893],
            [1.04184348, 0.09195735,        nan, 0.17945707, 0.56584754,
             0.66320834, 0.26381093, 0.03862473],
            [1.05366664, 0.0922461 , 0.18108093, 0.13238964, 0.62159417,
                    nan, 0.18389999,        nan],
            [1.00704245,        nan, 0.23711979, 0.12935421, 1.01970277,
                    nan, 0.23285789,        nan],
            [0.98468077,        nan, 0.28315152, 0.15569135,        nan,
                    nan, 0.22223999,        nan],
            [0.34698175, 1.1291196 , 0.22721504, 0.1078829 , 0.6659021 ,
                    nan, 0.28657127, 0.03157442],
            [0.34638257, 1.36759564, 0.22911738,        nan, 0.48163878,
                    nan, 0.17283706, 0.05573316],
            [       nan, 0.51974499, 0.19571329, 0.19495548,        nan,
                    nan, 0.14301278, 0.06043777],
            [       nan, 0.20634955,        nan, 0.09995729,        nan,
                    nan, 0.18381835, 0.04511684],
            [       nan, 0.17557466,        nan,        nan,        nan,
                    nan,        nan, 0.04412376]],
    
           [[0.15579131, 0.074198  , 0.07480989, 0.09742171, 0.51915187,
             0.32987517, 0.15443519, 0.04990007],
            [0.18340141, 0.0763083 , 0.07439816, 0.10789451, 0.49576023,
             0.36209598, 0.21525712, 0.05284398],
            [0.89635843, 0.07341532, 0.06992208, 0.14702028, 1.00326443,
             0.63314253, 0.22872289, 0.05543199],
            [0.77087325, 0.1048945 , 0.06650965, 0.1865073 , 0.7776224 ,
             0.54631037, 0.28632328, 0.03336016],
            [0.80734402, 0.12378095, 0.05653326, 0.17420147, 0.65254241,
             0.45009974, 0.30526772, 0.03229909],
            [0.50654256, 0.14234523, 0.06274261, 0.17343882, 0.58004749,
             0.33348686, 0.30070606, 0.03174512],
            [0.46850282, 0.1863786 , 0.0675569 , 0.17624108, 0.47164536,
             0.19874078, 0.24717812, 0.04090422],
            [0.34062195, 1.90581691, 0.07018556, 0.15123938, 0.36946237,
             0.18504179, 0.26474383, 0.03088068],
            [0.24573354, 1.21507001, 0.11556577, 0.13191637, 0.4020009 ,
             0.26275966, 0.23841298, 0.04008143],
            [0.2798166 , 0.15733013, 0.08971586, 0.13991331, 0.38702768,
             0.34291422, 0.21531013, 0.03859388],
            [0.35316458, 0.08011972, 0.0626139 , 0.069165  , 0.55920124,
             0.38426778, 0.19802441, 0.03579372],
            [0.28721237, 0.0827378 , 0.07033022, 0.07274448, 0.50464362,
             0.3815313 , 0.1641538 , 0.04502364]]])
    • data_source
      (data_source)
      <U26
      'AeronetSunV3L2Subset.daily' 'TM5-met2010_CTRL-TEST'
      array(['AeronetSunV3L2Subset.daily', 'TM5-met2010_CTRL-TEST'], dtype='<U26')
    • var_name
      (data_source)
      <U8
      'od550aer' 'od550aer'
      array(['od550aer', 'od550aer'], dtype='<U8')
    • var_units
      (data_source)
      <U1
      '1' '1'
      array(['1', '1'], dtype='<U1')
    • ts_type_src
      (data_source)
      <U7
      'daily' 'monthly'
      array(['daily', 'monthly'], dtype='<U7')
    • time
      (time)
      datetime64[ns]
      2010-01-15 ... 2010-12-15
      array(['2010-01-15T00:00:00.000000000', '2010-02-15T00:00:00.000000000',
             '2010-03-15T00:00:00.000000000', '2010-04-15T00:00:00.000000000',
             '2010-05-15T00:00:00.000000000', '2010-06-15T00:00:00.000000000',
             '2010-07-15T00:00:00.000000000', '2010-08-15T00:00:00.000000000',
             '2010-09-15T00:00:00.000000000', '2010-10-15T00:00:00.000000000',
             '2010-11-15T00:00:00.000000000', '2010-12-15T00:00:00.000000000'],
            dtype='datetime64[ns]')
    • station_name
      (station_name)
      <U16
      'Agoufou' ... 'Trelew'
      array(['Agoufou', 'Alta_Floresta', 'Amsterdam_Island', 'Avignon', 'Taihu',
             'Taipei_CWB', 'Thessaloniki', 'Trelew'], dtype='<U16')
    • latitude
      (station_name)
      float64
      15.35 -9.871 -37.8 ... 40.63 -43.25
      standard_name :
      latitude
      units :
      degrees
      array([ 15.3454  ,  -9.871339, -37.799544,  43.93275 ,  31.421   ,
              25.014683,  40.63    , -43.249806])
    • longitude
      (station_name)
      float64
      -1.479 -56.1 77.57 ... 22.96 -65.31
      standard_name :
      longitude
      units :
      degrees
      array([ -1.479117, -56.104453,  77.571946,   4.878067, 120.215333,
             121.53837 ,  22.96    , -65.308611])
    • altitude
      (station_name)
      float64
      305.0 277.0 49.0 ... 26.0 60.0 15.0
      array([305., 277.,  49.,  32.,  20.,  26.,  60.,  15.])
  • data_source :
    ['AeronetSunV3L2Subset.daily', 'TM5-met2010_CTRL-TEST']
    var_name :
    ['od550aer', 'od550aer']
    ts_type :
    monthly
    filter_name :
    WORLD-noMOUNTAINS
    ts_type_src :
    ['daily', 'monthly']
    start_str :
    20100101
    stop_str :
    20101231
    var_units :
    ['1', '1']
    vert_scheme :
    None
    data_level :
    3
    revision_ref :
    n/d
    from_files :
    ['aerocom3_TM5_AP3-CTRL2016_od550aer_Column_2010_monthly.nc']
    from_files_ref :
    None
    stations_ignored :
    None
    colocate_time :
    False
    obs_is_clim :
    False
    pyaerocom :
    0.10.0rc2
    apply_constraints :
    None
    min_num_obs :
    None
    outliers_removed :
    True
    region :
    WORLD
    lon_range :
    [-180, 180]
    lat_range :
    [-90, 90]
    alt_range :
    [-1000000.0, 1000.0]
    land_sea :
    None

As you can see, model and obs (stored in data_source dimension) now share the same coordinates (dimension station_name) and time stamps (dimension time). The data_source dimension always contains the observation data at the first index and the model data at the second:

[65]:
obsdata = coldata.data[0]
obsdata
[65]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
'od550aer'
  • time: 12
  • station_name: 8
  • nan 0.1077 0.0365 0.113 nan nan 0.1383 ... nan nan nan nan nan 0.04412
    array([[       nan, 0.10766512, 0.03649583, 0.11300932,        nan,
                   nan, 0.13828996, 0.03593528],
           [       nan, 0.1281071 , 0.05131321, 0.1490295 ,        nan,
                   nan, 0.09390731, 0.03669082],
           [       nan, 0.06650846,        nan, 0.15400053, 0.77145611,
            0.81960458, 0.15375983, 0.03303893],
           [1.04184348, 0.09195735,        nan, 0.17945707, 0.56584754,
            0.66320834, 0.26381093, 0.03862473],
           [1.05366664, 0.0922461 , 0.18108093, 0.13238964, 0.62159417,
                   nan, 0.18389999,        nan],
           [1.00704245,        nan, 0.23711979, 0.12935421, 1.01970277,
                   nan, 0.23285789,        nan],
           [0.98468077,        nan, 0.28315152, 0.15569135,        nan,
                   nan, 0.22223999,        nan],
           [0.34698175, 1.1291196 , 0.22721504, 0.1078829 , 0.6659021 ,
                   nan, 0.28657127, 0.03157442],
           [0.34638257, 1.36759564, 0.22911738,        nan, 0.48163878,
                   nan, 0.17283706, 0.05573316],
           [       nan, 0.51974499, 0.19571329, 0.19495548,        nan,
                   nan, 0.14301278, 0.06043777],
           [       nan, 0.20634955,        nan, 0.09995729,        nan,
                   nan, 0.18381835, 0.04511684],
           [       nan, 0.17557466,        nan,        nan,        nan,
                   nan,        nan, 0.04412376]])
    • data_source
      ()
      <U26
      'AeronetSunV3L2Subset.daily'
      array('AeronetSunV3L2Subset.daily', dtype='<U26')
    • var_name
      ()
      <U8
      'od550aer'
      array('od550aer', dtype='<U8')
    • var_units
      ()
      <U1
      '1'
      array('1', dtype='<U1')
    • ts_type_src
      ()
      <U7
      'daily'
      array('daily', dtype='<U7')
    • time
      (time)
      datetime64[ns]
      2010-01-15 ... 2010-12-15
      array(['2010-01-15T00:00:00.000000000', '2010-02-15T00:00:00.000000000',
             '2010-03-15T00:00:00.000000000', '2010-04-15T00:00:00.000000000',
             '2010-05-15T00:00:00.000000000', '2010-06-15T00:00:00.000000000',
             '2010-07-15T00:00:00.000000000', '2010-08-15T00:00:00.000000000',
             '2010-09-15T00:00:00.000000000', '2010-10-15T00:00:00.000000000',
             '2010-11-15T00:00:00.000000000', '2010-12-15T00:00:00.000000000'],
            dtype='datetime64[ns]')
    • station_name
      (station_name)
      <U16
      'Agoufou' ... 'Trelew'
      array(['Agoufou', 'Alta_Floresta', 'Amsterdam_Island', 'Avignon', 'Taihu',
             'Taipei_CWB', 'Thessaloniki', 'Trelew'], dtype='<U16')
    • latitude
      (station_name)
      float64
      15.35 -9.871 -37.8 ... 40.63 -43.25
      standard_name :
      latitude
      units :
      degrees
      array([ 15.3454  ,  -9.871339, -37.799544,  43.93275 ,  31.421   ,
              25.014683,  40.63    , -43.249806])
    • longitude
      (station_name)
      float64
      -1.479 -56.1 77.57 ... 22.96 -65.31
      standard_name :
      longitude
      units :
      degrees
      array([ -1.479117, -56.104453,  77.571946,   4.878067, 120.215333,
             121.53837 ,  22.96    , -65.308611])
    • altitude
      (station_name)
      float64
      305.0 277.0 49.0 ... 26.0 60.0 15.0
      array([305., 277.,  49.,  32.,  20.,  26.,  60.,  15.])
  • data_source :
    ['AeronetSunV3L2Subset.daily', 'TM5-met2010_CTRL-TEST']
    var_name :
    ['od550aer', 'od550aer']
    ts_type :
    monthly
    filter_name :
    WORLD-noMOUNTAINS
    ts_type_src :
    ['daily', 'monthly']
    start_str :
    20100101
    stop_str :
    20101231
    var_units :
    ['1', '1']
    vert_scheme :
    None
    data_level :
    3
    revision_ref :
    n/d
    from_files :
    ['aerocom3_TM5_AP3-CTRL2016_od550aer_Column_2010_monthly.nc']
    from_files_ref :
    None
    stations_ignored :
    None
    colocate_time :
    False
    obs_is_clim :
    False
    pyaerocom :
    0.10.0rc2
    apply_constraints :
    None
    min_num_obs :
    None
    outliers_removed :
    True
    region :
    WORLD
    lon_range :
    [-180, 180]
    lat_range :
    [-90, 90]
    alt_range :
    [-1000000.0, 1000.0]
    land_sea :
    None
[66]:
modeldata = coldata.data[1]
modeldata
[66]:
Show/Hide data repr Show/Hide attributes
xarray.DataArray
'od550aer'
  • time: 12
  • station_name: 8
  • 0.1558 0.0742 0.07481 0.09742 0.5192 ... 0.5046 0.3815 0.1642 0.04502
    array([[0.15579131, 0.074198  , 0.07480989, 0.09742171, 0.51915187,
            0.32987517, 0.15443519, 0.04990007],
           [0.18340141, 0.0763083 , 0.07439816, 0.10789451, 0.49576023,
            0.36209598, 0.21525712, 0.05284398],
           [0.89635843, 0.07341532, 0.06992208, 0.14702028, 1.00326443,
            0.63314253, 0.22872289, 0.05543199],
           [0.77087325, 0.1048945 , 0.06650965, 0.1865073 , 0.7776224 ,
            0.54631037, 0.28632328, 0.03336016],
           [0.80734402, 0.12378095, 0.05653326, 0.17420147, 0.65254241,
            0.45009974, 0.30526772, 0.03229909],
           [0.50654256, 0.14234523, 0.06274261, 0.17343882, 0.58004749,
            0.33348686, 0.30070606, 0.03174512],
           [0.46850282, 0.1863786 , 0.0675569 , 0.17624108, 0.47164536,
            0.19874078, 0.24717812, 0.04090422],
           [0.34062195, 1.90581691, 0.07018556, 0.15123938, 0.36946237,
            0.18504179, 0.26474383, 0.03088068],
           [0.24573354, 1.21507001, 0.11556577, 0.13191637, 0.4020009 ,
            0.26275966, 0.23841298, 0.04008143],
           [0.2798166 , 0.15733013, 0.08971586, 0.13991331, 0.38702768,
            0.34291422, 0.21531013, 0.03859388],
           [0.35316458, 0.08011972, 0.0626139 , 0.069165  , 0.55920124,
            0.38426778, 0.19802441, 0.03579372],
           [0.28721237, 0.0827378 , 0.07033022, 0.07274448, 0.50464362,
            0.3815313 , 0.1641538 , 0.04502364]])
    • data_source
      ()
      <U26
      'TM5-met2010_CTRL-TEST'
      array('TM5-met2010_CTRL-TEST', dtype='<U26')
    • var_name
      ()
      <U8
      'od550aer'
      array('od550aer', dtype='<U8')
    • var_units
      ()
      <U1
      '1'
      array('1', dtype='<U1')
    • ts_type_src
      ()
      <U7
      'monthly'
      array('monthly', dtype='<U7')
    • time
      (time)
      datetime64[ns]
      2010-01-15 ... 2010-12-15
      array(['2010-01-15T00:00:00.000000000', '2010-02-15T00:00:00.000000000',
             '2010-03-15T00:00:00.000000000', '2010-04-15T00:00:00.000000000',
             '2010-05-15T00:00:00.000000000', '2010-06-15T00:00:00.000000000',
             '2010-07-15T00:00:00.000000000', '2010-08-15T00:00:00.000000000',
             '2010-09-15T00:00:00.000000000', '2010-10-15T00:00:00.000000000',
             '2010-11-15T00:00:00.000000000', '2010-12-15T00:00:00.000000000'],
            dtype='datetime64[ns]')
    • station_name
      (station_name)
      <U16
      'Agoufou' ... 'Trelew'
      array(['Agoufou', 'Alta_Floresta', 'Amsterdam_Island', 'Avignon', 'Taihu',
             'Taipei_CWB', 'Thessaloniki', 'Trelew'], dtype='<U16')
    • latitude
      (station_name)
      float64
      15.35 -9.871 -37.8 ... 40.63 -43.25
      standard_name :
      latitude
      units :
      degrees
      array([ 15.3454  ,  -9.871339, -37.799544,  43.93275 ,  31.421   ,
              25.014683,  40.63    , -43.249806])
    • longitude
      (station_name)
      float64
      -1.479 -56.1 77.57 ... 22.96 -65.31
      standard_name :
      longitude
      units :
      degrees
      array([ -1.479117, -56.104453,  77.571946,   4.878067, 120.215333,
             121.53837 ,  22.96    , -65.308611])
    • altitude
      (station_name)
      float64
      305.0 277.0 49.0 ... 26.0 60.0 15.0
      array([305., 277.,  49.,  32.,  20.,  26.,  60.,  15.])
  • data_source :
    ['AeronetSunV3L2Subset.daily', 'TM5-met2010_CTRL-TEST']
    var_name :
    ['od550aer', 'od550aer']
    ts_type :
    monthly
    filter_name :
    WORLD-noMOUNTAINS
    ts_type_src :
    ['daily', 'monthly']
    start_str :
    20100101
    stop_str :
    20101231
    var_units :
    ['1', '1']
    vert_scheme :
    None
    data_level :
    3
    revision_ref :
    n/d
    from_files :
    ['aerocom3_TM5_AP3-CTRL2016_od550aer_Column_2010_monthly.nc']
    from_files_ref :
    None
    stations_ignored :
    None
    colocate_time :
    False
    obs_is_clim :
    False
    pyaerocom :
    0.10.0rc2
    apply_constraints :
    None
    min_num_obs :
    None
    outliers_removed :
    True
    region :
    WORLD
    lon_range :
    [-180, 180]
    lat_range :
    [-90, 90]
    alt_range :
    [-1000000.0, 1000.0]
    land_sea :
    None

High-level colocation routine

If it wasn’t for the purpose of this notebook, normally, we don’t want to go through the hassle of reading the data individually before colocating. Thus, pyaerocom has a high-level interface that can do colocation straight with the observation and model IDs (under the hood, of course, it uses the same routines that have been used here). By default, this high-level interface also stores all produced ColocatedData objects as NetCDF files, for later analysis:

[67]:
colocator = pya.Colocator(
    model_id=model_id, obs_id=obs_id, obs_vars='od550aer',
    ts_type='monthly',
    model_ts_type_read='monthly',
    filter_name='OCN', #  let's try to better isolate Amsterdam Island
    reanalyse_existing=True,
    save_coldata=True)

colocator
[67]:
Colocator([('save_coldata', True),
           ('_obs_cache_only', False),
           ('obs_vars', ['od550aer']),
           ('obs_vert_type', None),
           ('model_vert_type_alt', None),
           ('read_opts_ungridded', None),
           ('obs_ts_type_read', None),
           ('model_use_vars', None),
           ('model_add_vars', None),
           ('model_keep_outliers', True),
           ('model_to_stp', False),
           ('model_id', 'TM5-met2010_CTRL-TEST'),
           ('model_name', None),
           ('model_data_dir', None),
           ('obs_id', 'AeronetSunV3L2Subset.daily'),
           ('obs_name', None),
           ('obs_data_dir', None),
           ('obs_keep_outliers', False),
           ('obs_use_climatology', False),
           ('obs_add_meta', []),
           ('gridded_reader_id',
            {'model': 'ReadGridded', 'obs': 'ReadGridded'}),
           ('start', None),
           ('stop', None),
           ('ts_type', 'monthly'),
           ('filter_name', 'OCN'),
           ('remove_outliers', True),
           ('apply_time_resampling_constraints', None),
           ('min_num_obs', None),
           ('resample_how', None),
           ('var_outlier_ranges', None),
           ('var_ref_outlier_ranges', None),
           ('harmonise_units', False),
           ('vert_scheme', None),
           ('regrid_res_deg', None),
           ('ignore_station_names', None),
           ('basedir_coldata', '/home/jonasg/MyPyaerocom/colocated_data'),
           ('model_ts_type_read', 'monthly'),
           ('model_read_aux', None),
           ('model_use_climatology', False),
           ('colocate_time', False),
           ('flex_ts_type_gridded', True),
           ('reanalyse_existing', True),
           ('raise_exceptions', False),
           ('_log', None),
           ('logging', True),
           ('data', {}),
           ('file_status', {})])

Quite a few options, a lot of them are for the even higher-level automatic web-processing tools that feed the Aerocom Evaluation websites, so let’s not get lost in these details here.

The colocation can be run as follows:

[68]:
colocator.run()
PREPARING colocation of TM5-met2010_CTRL-TEST vs. AeronetSunV3L2Subset.daily
The following variable combinations will be colocated
MODEL-VAR       OBS-VAR
od550aer        od550aer
Running TM5-met2010_CTRL-TEST / AeronetSunV3L2Subset.daily (od550aer, od550aer)
Deleting and recomputing existing colocated data file od550aer_REF-AeronetSunV3L2Subset.daily_MOD-TM5-met2010_CTRL-TEST_20100101_20101231_monthly_OCN.nc
REMOVE: od550aer_REF-AeronetSunV3L2Subset.daily_MOD-TM5-met2010_CTRL-TEST_20100101_20101231_monthly_OCN.nc

Input filters {'latitude': [-89.0, 89.0], 'longitude': [-181.5, 175.5]} result in unchanged data object
WRITE: od550aer_REF-AeronetSunV3L2Subset.daily_MOD-TM5-met2010_CTRL-TEST_20100101_20101231_monthly_OCN.nc

As you can see in the last line of the output, the colocated data object was stored as NetCDF file. The default direcory for these files can be accessed (and modified) in the const class:

[69]:
pya.const.COLOCATEDDATADIR
[69]:
'/home/jonasg/MyPyaerocom/colocated_data'
[70]:
import os
os.listdir(pya.const.COLOCATEDDATADIR)
[70]:
['logfiles',
 'CAM5.3-Oslo_AP3-CTRL2016-PD',
 'TM5-met2010_CTRL-TEST',
 'NorESM2-met2010_AP3-CTRL']

And you can see that there is a subdirecory which contains all colocated data objects that have been created for the TM5 model. The loaded colocated data object can also be accessed via:

[71]:
coldata2 = colocator.data[model_id]['od550aer']
[72]:
coldata2.plot_scatter();
../_images/pyaerocom-tutorials_getting_started_analysis_162_0.png

That looks like we managed to get only Amsterdam Island here (1 site) and 8 months of data. The corresponding model bias suggests that TM5 is underestimating AOD at 550nm at Amsterdam Island by ca 60%.

As a last step for this tutorial, let’s make sure it is Amsterdam Island that we got here:

[73]:
pya.plot.mapping.plot_nmb_map_colocateddata(coldata2);
../_images/pyaerocom-tutorials_getting_started_analysis_164_0.png

Looks like it! Ciao!