# Introduction into the UngriddedData class and the StationData class¶

This notebook introduces 2 of the most relevant data objects that exist in pyaerocom:

UngriddedData

• Designed to hold a whole database of observations, that is, timeseries data for multiple variables from multiple stations around the globe.

• Supports also 3D variables (e.g. timeseries of lidar profiles).

• Usually, one instance of this data object contains a single network, but it can also contain more than one network.

StationData

• Data object that contains data from a single station.

• Includes metadata and variable timeseries data.

• Arbitrary number of variables supported.

## NOTE¶

This notebook is currently under development and gives only a brief and incomplete introduction into the two data objects.

## The UngriddedData object¶

The first part of the tutorial shows some features of the UngriddedData object.

### Import the UngriddedData object that was created in the previous tutorial¶

[1]:

import pyaerocom as pya
# read the data from the storage
%store -r data

data

Initating pyaerocom configuration
Checking database access...
Init data paths for lustre
Expired time: 0.016 s

[1]:

UngriddedData <networks: ['AeronetSunV3Lev2.daily']; vars: ['od550aer']; instruments: ['sun_photometer'];No. of stations: 1230


### Create an overview map of all stations¶

Before digging a little deeper into the UngriddedData object, let’s get an overview of the bigger picture:

[2]:

# plots all stations as red dots
ax = data.plot_station_coordinates(markersize=12, color='r')

# add all stations that contain AOD data in 2010 in green
ax = data.plot_station_coordinates(var_name='od550aer',
start=2010,
stop=2011, color='lime', ax=ax)


As you can see, you can specify additional input parameters, e.g. to display only stations that contain variable data or to specify a time interval.

In any case, it is always good to know about the help function:

[3]:

help(data.plot_station_coordinates)

Help on method plot_station_coordinates in module pyaerocom.ungriddeddata:

plot_station_coordinates(var_name=None, filter_name=None, start=None, stop=None, ts_type=None, color='r', marker='o', markersize=8, fontsize_base=10, **kwargs) method of pyaerocom.ungriddeddata.UngriddedData instance
Plot station coordinates on a map

All input parameters are optional and may be used to add constraints
related to which stations are plotted. Default is all stations of all
times.

Parameters
----------

var_name : :obj:str, optional
name of variable to be retrieved
filter_name : :obj:str, optional
name of filter (e.g. EUROPE-noMOUNTAINS)
start
start time (optional)
stop
stop time (optional). If start time is provided and stop time not,
then only the corresponding year inferred from start time will be
considered
ts_type : :obj:str, optional
temporal resolution
color : str
color of stations on map
marker : str
marker type of stations
markersize : int
size of station markers
fontsize_base : int
basic fontsize
**kwargs
Addifional keyword args passed to
:func:pyaerocom.plot.plot_coordinates

Returns
-------
axes
matplotlib axes instance



### Quicklook plotting of station timeseries¶

Time series of individual stations can be plotted as follows:

[4]:

data.plot_station_timeseries(station_name='Granada', var_name='od550aer');


### Access metadata of the data files that were read¶

Look into the metadata of the different files. Metadata can be accessed via the metadata attribute, and there is one metadatadictionary for each file that was read:

[5]:

len(data.metadata)

[5]:

1230


Access metadata of first file (index 0):

[6]:

data.metadata[0]

[6]:

{'var_info': OrderedDict([('od550aer', OrderedDict([('units', '1')]))]),
'latitude': 45.3139,
'longitude': 12.508299999999998,
'altitude': 10.0,
'station_name': 'AAOT',
'PI': 'Brent_Holben',
'ts_type': 'daily',
'data_id': 'AeronetSunV3Lev2.daily',
'variables': ['od550aer'],
'instrument_name': 'sun_photometer',
'data_revision': '20190920'}


### Filtering of the data¶

So far, you can filter UngriddedData objects by common metadata attributes. For instance:

[7]:

subset = data.filter_by_meta(latitude=(30, 60), longitude=(0, 45), altitude=(0, 1000))
print(subset)


Pyaerocom UngriddedData
-----------------------
Contains networks: ['AeronetSunV3Lev2.daily']
Contains variables: ['od550aer']
Contains instruments: ['sun_photometer']
Total no. of meta-blocks: 164
Filters that were applied:
Filter time log: 20191002122003
Created od550aer single var object from multivar UngriddedData instance
Filter time log: 20191003180107
latitude: (30, 60)
longitude: (0, 45)
altitude: (0, 1000)

[8]:

subset.plot_station_coordinates();


### Other attributes that may be useful¶

Access all station names and print the first 4:

[9]:

stat_names = data.station_name
print(stat_names[:4])

['AAOT', 'AOE_Baotou', 'ARM_Ascension_Is', 'ARM_Barnstable_MA']


Essentially, what data.station_name does is, it iterates over all metadata-dictionaries (that are stored in data.metadata, and are organised per file that was read) and extracts the station_name attribute and appends it to a list which is then returned by the method.

Hence, the list of station names corresponds to the list of metadata-blocks / files that are stored in the data object:

[10]:

len(stat_names)

[10]:

1230


In a similar manner, you can access coordinates latitude, longitude and altitude arrays for all files.

[11]:

lons = data.longitude
lons[:4]

[11]:

[12.508299999999998,
109.62879999999998,
-14.349805999999996,
-70.29006100000001]

[12]:

lats = data.latitude
lats[:4]

[12]:

[45.3139, 40.851699999999994, -7.966963999999998, 41.669588999999995]

[13]:

alts = data.altitude
alts[:4]

[13]:

[10.0, 1314.0, 341.0, 15.0]


### List of unique station names¶

As mentioned earlier, some databases provide more than one data file per station. Since the ungridded reading (see previous) tutorial is done per data file, this means that their can be more than one metadata-block per station (not the case here, though). In any case, you can get a list of unique station names using:

[14]:

unique_names = data.unique_station_names
unique_names[:4]

[14]:

['AAOT', 'AOE_Baotou', 'ARM_Ascension_Is', 'ARM_Barnstable_MA']


## StationData: Access the data from individual stations¶

As you could see above the metadata dictionaries in the UngriddedData class for each file do only contain the associated metadata. For the sake of performance the actual data arrays are all stored in one big 2D numpy array (which does not need to bother you too much) which is accessible in the _data attribute of the UngriddedData object (if you like to dive into it).

In most cases that concern model evaluation, the observation data is analysed station-by-station. For this purpose the StationData class was designed, which is introduced below.

Starting from an instance of the UngriddedData object, the individual station data (i.e. time series of one or more variables + metadata) can be accessed using the method:

UngriddedData.to_station_data

or using the square brackets [] which is equivalent to the former as it is only a wrapper for to_station_data. This meants, calling

UngriddedData[0]


will give you the same output as

UngriddedData.to_station_data[0]


that is, the data associated with the first file that was read (i.e. the first metadata-block in the object) into the UngriddedData object (see previous tutorial for details regarding the reading of ungridded observation networks).

To specify the station, you can either use the metadata index of the corresponding data file (meta_idx=9, for 10th file) or you can specify the station name or a wildcard specifying the station name.

The method returns a pyaerocom.StationData object, which is a dictionary-like object which contains data vectors and time-stamps as well as metadata.

Below we will illustrate the several options to access station data (and show that they contain the same data):

### Option 1. Get station data using the corresponding metadata indices that match the station name¶

Find index (or indices) that match the station name:

[15]:

index = data.find_station_meta_indices('Granada')
index

[15]:

[497]


The result shows that there is one file that matches this station name (as we would expect for AERONET data) and the corresponding metadata index is 488.

To access the data, you can use the method to_station_data. It helps to have a look into the options of this method:

[16]:

help(data.to_station_data)

Help on method to_station_data in module pyaerocom.ungriddeddata:

to_station_data(meta_idx, vars_to_convert=None, start=None, stop=None, freq=None, merge_if_multi=True, merge_pref_attr=None, merge_sort_by_largest=True, insert_nans=False, **kwargs) method of pyaerocom.ungriddeddata.UngriddedData instance
Convert data from one station to :class:StationData

Todo
----
- Review for retrieval of profile data (e.g. Lidar data)

Parameters
----------
meta_idx : float
index of station or name of station.
vars_to_convert : :obj:list or :obj:str, optional
variables that are supposed to be converted. If None, use all
variables that are available for this station
start
start time, optional (if not None, input must be convertible into
pandas.Timestamp)
stop
stop time, optional (if not None, input must be convertible into
pandas.Timestamp)
freq : str
pandas frequency string (e.g. 'D' for daily, 'M' for month end) or
valid pyaerocom ts_type
merge_if_multi : bool
if True and if data request results in multiple instances of
StationData objects, then these are attempted to be merged into one
:class:StationData object using :func:merge_station_data
merge_pref_attr
only relevant for merging of multiple matches: preferred attribute
that is used to sort the individual StationData objects by relevance.
Needs to be available in each of the individual StationData objects.
For details cf. :attr:pref_attr in docstring of
:func:merge_station_data. Example could be revision_date. If
None, then the stations will be sorted based on the number of
available data points (if :attr:merge_sort_by_largest is True,
which is default).
merge_sort_by_largest : bool
only relevant for merging of multiple matches: cf. prev. attr. and
docstring of :func:merge_station_data method.
insert_nans : bool
if True, then the retrieved :class:StationData objects are filled
with NaNs

Returns
-------
StationData or list
StationData object(s) containing results. list is only returned if
input for meta_idx is station name and multiple matches are
detected for that station (e.g. data from different instruments),
else single instance of StationData. All variable time series are
inserted as pandas Series



So the first input argument takes either the metadata index, or the name of the station. Here we use the metadata index option using the index that we just retrieved:

[17]:

granada_opt1 = data.to_station_data(meta_idx=index[0], insert_nans=True)

[17]:

pyaerocom.stationdata.StationData


The returned data type is an instance of the pyaerocom.StationData class.

NOTE: if there is more than one index match for one station (i.e. data.find_station_meta_indices('Granada') returns more than one match), then, using Option 1, you would need to call to_station_data for each of the index matches. Alternatively you could use either of the following methods, which automatically merge the individual StationData objects into one, in case of multiple matches for that station name.

Let’s have a quick look at the StationData object (it is a dictionary-like object and simple to use):

### Option 2: Retrieve station data using the station name directly¶

[18]:

granada_opt2 = data.to_station_data('Granada', insert_nans=True)


Other than option 1, in case of multiple meta-index matches, this method automatically merges the individual data objects.

### Option 3: Use […] notation¶

This is a wrapper for the method to_station_data so you may use meta-index or station name for access.

[19]:

granada_opt3_1 = data[index[0]]


### Let’s have a look if the data objects are really the same (by plotting the AOD timeseries for the 4 different options):¶

[20]:

ax = granada_opt1.plot_timeseries('od550aer', lw=3, label='Option 1', tit='Method: UngriddedData.to_station_data()')
granada_opt2.plot_timeseries('od550aer', ls='--', lw=1, label='Option 2',ax=ax)

# plot the results from the [] access option into a new figure (don't pass ax)
ax = granada_opt3_1.plot_timeseries('od550aer', lw=3, label='Option 3.1 (using meta-index)',
tit='Method: UngriddedData[]')
granada_opt3_2.plot_timeseries('od550aer', ls='--', lw=1, label='Option 3.2 (using station name)',ax=ax)

[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3a66706a20>


Looks good. Let’s explore the StationData object a little more (you can print it and it will get you a nice overview):

[21]:

print(granada_opt1)


Pyaerocom StationData
---------------------
var_info (BrowseDict):
od550aer (OrderedDict):
units: 1
overlap: False
ts_type: daily
apply_constraints: False
min_num_obs: None
station_coords (dict):
latitude: 37.163999999999994
longitude: -3.6049999999999995
altitude: 680.0
data_err (BrowseDict): <empty_dict>
overlap (BrowseDict): <empty_dict>
data_flagged (BrowseDict): <empty_dict>
filename: None
station_id: None
instrument_name: sun_photometer
PI: Brent_Holben
country: None
ts_type: daily
latitude: 37.163999999999994
longitude: -3.6049999999999995
altitude: 680.0
data_id: AeronetSunV3Lev2.daily
dataset_name: None
data_product: None
data_version: None
data_level: None
revision_date: None
website: None
ts_type_src: daily
stat_merge_pref_attr: None
data_revision: 20190920

Data arrays
.................
dtime (ndarray, 5087 items): [2004-12-30T00:00:00.000000000, 2004-12-31T00:00:00.000000000, ..., 2018-12-02T00:00:00.000000000, 2018-12-03T00:00:00.000000000]
Pandas Series
.................
od550aer (Series, 5087 items)


You can see that the StationData object contains both metadata (e.g. PI) and data vectors which can be either 1D numpy arrays or python lists (e.g. dtime) or pandas.Series objects (e.g. variable od550aer). All attributes can be accessed and manipulated either using dictionary style access (i.e. [] notation), or using the . operator.

Here some examples:

[22]:

# get longitude using "[]" notation

[22]:

-3.6049999999999995

[23]:

# get longitude using "." notation

[23]:

-3.6049999999999995

[24]:

# assign longitude using "." notation and display new value (again using "[]" notation)

[24]:

42

[25]:

granada_opt1['station_name']

[25]:

'Granada'


Get station name:

[26]:

granada_opt1.station_name

[26]:

'Granada'


### Small detour through the pandas world¶

As you can see in the output above, the time-series data in the StationData object is an instance of the pandas.Series class.

[27]:

aod_data = granada_opt1.od550aer
type(aod_data)

[27]:

pandas.core.series.Series


NOTE: pyaerocom relies on pandas, so if you are not familiar with the pandas library, it is a good advice to make yourself familiar with it (especially if you are interested in timeseries analysis). See here for a short introduction into pandas.

Anyways, you should know about the 2 basic datatypes of pandas which are:

Both objects are very similar in their handling and the Series class can be imagined as a single column of the Dataframe object which is a table-like object that has columns (e.g. variables) and rows (e.g. time-stamps). It is hence, easy to go back and forth between the two objects.

Anyways, here is some examples what you can do with an instance of the pandas.Series object that we just accessed from the pyaerocom.StationData object from Granada (and which we named aod_data).

First, you can get a basic clue about the data by using the describe method:

[28]:

aod_data.describe()

[28]:

count    3010.000000
mean        0.137759
std         0.103330
min         0.015534
25%         0.070969
50%         0.108200
75%         0.171291
max         1.278507
dtype: float64


Second, you may plot it:

[29]:

aod_data.plot();


Third, you may extract subsets using fancy indexing:

[30]:

aod_data_march2010 = aod_data['2010-3-1':'2010-4-1']
aod_data_march2010.plot();


Or fourth, resample to another frequency:

[31]:

aod_data_monthly = aod_data.resample('M', 'mean')
aod_data_monthly.plot();


Or fifth, resample to lower frequency, but require a minimum number of observations per period:

[32]:

monthly_with_count = aod_data.resample('M').agg(['mean', 'count'])

[32]:

mean count
2004-12-31 0.056462 1
2005-01-31 0.081300 28
2005-02-28 0.111567 24
2005-03-31 0.156312 17
2005-04-30 0.154527 26

Now, here you see an example, where pandas automatically converted our Series (which is single variable) to a DataFrame (which is a table), since we told the resampler above, to aggregate monthly mean and monthly count.

Now let’s say we require at least 15 observations (here, days, since our original dataset is in daily resolution) per month:

[33]:

invalid_mask = monthly_with_count['count'] < 15
aod_monthly_min15d = monthly_with_count['mean']

[33]:

2004-12-31         NaN
2005-01-31    0.081300
2005-02-28    0.111567
2005-03-31    0.156312
2005-04-30    0.154527
Freq: M, Name: mean, dtype: float64


Now plot both the monthly timeseries from above without constraint and the one with constraint:

[34]:

ax = aod_data_monthly.plot(label='Monthly without constraint', lw=3, figsize=(14,6))
aod_monthly_min15d.plot(ax=ax, style='x-', label='Monthly (at least 15 days per month)')
ax.legend();


As you can see, there is quite some months missing when applying the filter.

You may also be intersted in a climatology:

[35]:

aod_monthly_climatology = aod_data_monthly.groupby(aod_data_monthly.index.month).mean()
aod_monthly_climatology

[35]:

1     0.091885
2     0.125760
3     0.114174
4     0.132566
5     0.146915
6     0.164997
7     0.174380
8     0.192138
9     0.146022
10    0.115006
11    0.091344
12    0.082346
dtype: float64


And do the same for the monthly data with minimum 15 days per month that we created above:

[36]:

aod_monthly_climatology_min15d = aod_monthly_min15d.groupby(aod_monthly_min15d.index.month).mean()
aod_monthly_climatology_min15d

[36]:

1     0.093066
2     0.130359
3     0.119908
4     0.136720
5     0.145820
6     0.164751
7     0.184243
8     0.183453
9     0.141699
10    0.111532
11    0.072294
12    0.089056
Name: mean, dtype: float64

[37]:

ax = aod_monthly_climatology.plot(label='Monthly climatology (no constraint)')
aod_monthly_climatology_min15d.plot(label='Monthly climatology (min 15 days/month)')

[37]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3a6f89a6a0>


That was enough of a detour into the pandas world. As you shall see below, some of these pandas features are also provided in the pyaerocom data objects (e.g. resampling) and more will follow soon!

### Plotting of timeseries data directly from StationData class¶

Let’s come back to the StationData object. Below are some more examples that show how you can plot the timeseries directly from the StationData object. This includes to do a resampling out of the box when plotting:

[38]:

ax = granada_opt1.plot_timeseries('od550aer')
ax = granada_opt1.plot_timeseries('od550aer', freq='monthly', lw=3, ax=ax)
granada_opt1.plot_timeseries('od550aer', freq='yearly', ls='none', marker='o', ms=14, ax=ax);


The resampling of the timeseries in the plotting method is done automatically (if input ts_type is provided).

Currently, this does not apply additional constraints such as a minimum number of available observations when downsampling (like we showed above). From the yearly data (green dots) you can see clearly that this can be an issue, especially for the first and the last year.

In order to account for it, you may to the following:

[39]:

# convert to monthly with at least 5 days per month
od550aer_monthly = granada_opt1.resample_timeseries('od550aer', 'monthly', 'mean', min_num_obs=5)

# assign to our StationData

[40]:

ax = granada_opt1.plot_timeseries('od550aer')