# Choosing stations in Peru pt. 1, Meteorological Stations

## Introduction

Last summer I spent a few months in San Martín, Peru, toward the western border of the Amazon. My trip was dedicated to identifying areas of the region that are affected by seasonal flooding. I was directed to the district of San Rafael, where I secured the support of the local government. The authorities directed me to two smaller communities within their district: Libertad and Panamá. I will write more on the interviews and my direct findings in a future post. For now, I want to get into the nitty-gritty of identifying stations to inform a local climate study in the San Rafael District. This will be the first in a series of posts related to my dissertation research. In today’s post, I will strive to select stations and perform some cursory data exploration. Let’s start by loading the necessary libraries.

library(dplyr)
library(plotly)
library(purrr)
library(senamhiR)
library(tibble)
library(timevis)
library(zoo)

## Data audit

The completeness of these data sets is important to check. I will apply the quick_audit() function to each station. I will filter these tables so that any year that is 100% missing for a given variable will be omitted, hence the year column will be discontinuous.

dat_audit <- lapply(dat, quick_audit)

### Station 1: BELLAVISTA (000382)

dat_audit[[1]] %>% filter_at(2:ncol(.), any_vars(. < 100)) %>% print(n = 50)
## # A tibble: 42 x 13
##     Year Tmax (C) pct N… Tmin (C) pct N… TBS07 (C) pct …
##    <int>            <dbl>            <dbl>            <dbl>
##  1  1963           91.5             91.5             91.5
##  2  1964            0                0                0
##  3  1965            0.274            0.274            0.274
##  4  1966            0                0                0
##  5  1967            0                0                0
##  6  1968            0                0                0
##  7  1969            0                0                0
##  8  1970            0                0                0
##  9  1971            0.274            0.274            0.274
## 10  1972            0.820            0.820            0.820
## 11  1973           16.2             16.2             16.2
## 12  1974            0.274            0                0
## 13  1975            0                0                0
## 14  1976            0                0                0
## 15  1977            0                0                0
## 16  1978            1.10             1.10             1.10
## 17  1979            0.822            0.822            0.822
## 18  1980            3.28             3.28             3.28
## 19  1981           41.9             42.7             41.9
## 20  1995           16.2             16.2             16.2
## 21  1996            0                0                0
## 22  1997            0                0                0
## 23  1998            0                0                0
## 24  1999            0                0                0
## 25  2000            0                0                0
## 26  2001            0                0                0
## 27  2002            0                0                0
## 28  2003            0                0                0
## 29  2004            0                0                0
## 30  2005            0                0                0
## 31  2006            0               18.6              0
## 32  2007            0                0                0
## 33  2008            0                0                0
## 34  2009            0                0                0
## 35  2010            0                0                0
## 36  2011            0                0                0
## 37  2012            0                0                0
## 38  2013            0                0                0
## 39  2014            0                0                0
## 40  2015            0                0                0
## 41  2016           92.1             92.1             92.1
## 42  2017           32.9             35.9             32.9
## # ... with 9 more variables: TBS13 (C) pct NA <dbl>, TBS19 (C) pct
## #   NA <dbl>, TBH07 (C) pct NA <dbl>, TBH13 (C) pct NA <dbl>, TBH19
## #   (C) pct NA <dbl>, Prec07 (mm) pct NA <dbl>, Prec19 (mm) pct
## #   NA <dbl>, Direccion del Viento pct NA <dbl>, Velocidad del Viento
## #   (m/s) pct NA <dbl>

We can see that Bellavista seems to have a full array of data for all variables. This is very good! There is a large stretch of missing data from the early 1980s to the mid ’90s. I have heard (colloquially) that a lot of stations were abandoned during the period of internal conflict in Peru. Sendero Luminoso was active in the Bellavista region, so that might explain this large gap. While I don’t have 30 years of continuous data from Bellavista, I am pretty content with the 25 years preceding the present.

### Station 2: LA UNION (000384)

dat_audit[[2]] %>% filter_at(2:ncol(.), any_vars(. < 100)) %>% print(n = 50)
## # A tibble: 38 x 13
##     Year Tmax (C) pct N… Tmin (C) pct N… TBS07 (C) pct …
##    <int>            <dbl>            <dbl>            <dbl>
##  1  1970           58.1             58.1             58.1
##  2  1971            1.10             0.274            0.274
##  3  1972            9.02             8.74             8.74
##  4  1973           42.7             42.7             42.7
##  5  1974           46.6             21.9             21.9
##  6  1975            0.274            0.822            0.274
##  7  1976            0.546            0.546            0.546
##  8  1977           27.1             27.1             27.1
##  9  1978            0.274            0.274            0.274
## 10  1979            0                0                0
## 11  1980            0                0                0
## 12  1981           83.8             83.8             83.8
## 13  1982           58.6             58.6             58.6
## 14  1993           18.4             18.6             18.6
## 15  1994            0                0                0
## 16  1995            0                2.74             0
## 17  1996            0                0                0
## 18  1997            0                0                0
## 19  1998            0                0                0
## 20  1999            0                0                0
## 21  2000            0.273            0                0
## 22  2001            0                0                0
## 23  2002            0                0                0
## 24  2003            0                0                0
## 25  2004            0                0                0
## 26  2005            0                0                0
## 27  2006            0                0                0
## 28  2007            0                0                0
## 29  2008            0                0                0
## 30  2009            0                0                0
## 31  2010            0                0                0
## 32  2011            8.49             8.49             8.49
## 33  2012            0                0                0
## 34  2013            0                0                0
## 35  2014            0                0                0
## 36  2015            0                0                0
## 37  2016           83.6             83.6             83.6
## 38  2017           32.9             32.9             32.9
## # ... with 9 more variables: TBS13 (C) pct NA <dbl>, TBS19 (C) pct
## #   NA <dbl>, TBH07 (C) pct NA <dbl>, TBH13 (C) pct NA <dbl>, TBH19
## #   (C) pct NA <dbl>, Prec07 (mm) pct NA <dbl>, Prec19 (mm) pct
## #   NA <dbl>, Direccion del Viento pct NA <dbl>, Velocidad del Viento
## #   (m/s) pct NA <dbl>

Union shows a similar gap in data continuity as Bellavista. However, it is still useful, as it has fairly complete data since 1993 or 1994

### Station 3: PICOTA (153313)

I have worked with the Picota data before, so I know that there is no temperature data at this station. Given my focus on flooding, I am more interested in the precipitation record. Let’s see what we have available to us.

dat_audit[[3]] %>% select_at(.var = vars(Year, starts_with("Prec"))) %>% filter_at(2:ncol(.), any_vars(. < 100)) %>% print(n = 99)
## # A tibble: 56 x 3
##     Year Prec07 (mm) pct NA Prec19 (mm) pct NA
##    <int>                <dbl>                <dbl>
##  1  1963               66.6                 66.6
##  2  1964                0                    0
##  3  1965               33.4                 33.4
##  4  1966                0                    0
##  5  1967                0                    0
##  6  1968                0                    0
##  7  1969                0                    0
##  8  1970                3.29                 2.74
##  9  1971                0.274                0.548
## 10  1972                0                    0
## 11  1973                0                    0
## 12  1974                0                    0
## 13  1975                0                    0
## 14  1976                0                    0
## 15  1977                0                    0
## 16  1978                0                    0
## 17  1979                0                    0
## 18  1980                0                    0
## 19  1981                0                    0
## 20  1982                8.49                 8.49
## 21  1983                0                    0
## 22  1984                0                    0
## 23  1985                0                    0
## 24  1986                0                    0
## 25  1987                0                    0
## 26  1988                0                    0
## 27  1989                0                    0
## 28  1990                0                    0
## 29  1991                0                    0
## 30  1992                0                    0
## 31  1993                0                    0
## 32  1994                0                    0
## 33  1995                0                    0
## 34  1996                0                    0
## 35  1997                0                    0
## 36  1998                0                    0
## 37  1999                0                    0
## 38  2000                0                    0
## 39  2001                0                    0
## 40  2002                0.274                0
## 41  2003                0                    0
## 42  2004                0                    0
## 43  2005                0                    0
## 44  2006                0                    0
## 45  2007                0                    0
## 46  2008                0                    0
## 47  2009                0                    0
## 48  2010                0                    0
## 49  2011                0                    0
## 50  2012                0                    0
## 51  2013                0.274                0
## 52  2014                0                    0
## 53  2015                0                    0
## 54  2016               42.1                 41.8
## 55  2017                0.274                0
## 56  2018               50.4                 50.4

Indeed, just precipitation, but nearly a complete record since the 1960s! Picota is downstream of my target site, but I think that the precipitation info here may still prove to be useful.

### Station 4: NUEVO LIMA (153312)

Finally, we have Nuevo Lima, which appears to be another rain gauge station, though there is some very sparse temperature data since around 2012 (not shown here).

dat_audit[[4]] %>% select_at(.var = vars(Year, starts_with("Prec"))) %>% filter_at(2:ncol(.), any_vars(. < 100)) %>% print(n = 99)
## # A tibble: 55 x 3
##     Year Prec07 (mm) pct NA Prec19 (mm) pct NA
##    <int>                <dbl>                <dbl>
##  1  1963               87.4                 87.4
##  2  1964               25.1                 25.1
##  3  1965                0.274                0
##  4  1966                1.64                 4.11
##  5  1967               18.9                 22.5
##  6  1968               22.1                 23.5
##  7  1969               16.4                 17.3
##  8  1970               27.9                 23.8
##  9  1971               21.1                 15.6
## 10  1972               23.2                 19.4
## 11  1973                9.32                14.2
## 12  1974               11.5                 11.2
## 13  1975               12.6                 12.3
## 14  1976                8.47                 5.74
## 15  1977                5.75                 4.66
## 16  1978                4.93                 6.58
## 17  1979                2.47                 6.03
## 18  1980                2.73                 5.46
## 19  1981                3.29                 8.77
## 20  1982                3.56                 5.48
## 21  1983                4.11                 4.38
## 22  1984                6.01                 6.56
## 23  1985                6.58                 3.84
## 24  1986               11.2                  3.84
## 25  1987                8.77                 2.47
## 26  1988               17.5                  0
## 27  1989                8.77                 0
## 28  1990                7.12                 3.29
## 29  1991                7.12                 4.66
## 30  1992                6.83                 5.19
## 31  1993                7.12                 9.86
## 32  1994                4.11                 6.85
## 33  1995                1.92                 0.548
## 34  1996                0                    0
## 35  1997                0.274                0
## 36  1998                0                    0
## 37  1999                0                    0
## 38  2000                0                    0
## 39  2001                0                    0
## 40  2002                0.274                0.274
## 41  2003                0                    0
## 42  2004                0                    0
## 43  2005                0                    0
## 44  2006                0.548                0
## 45  2007                0.274                0.822
## 46  2008                1.91                 3.83
## 47  2009                1.92                 3.84
## 48  2010                2.47                 3.01
## 49  2011                0.548                1.92
## 50  2012                8.74                10.7
## 51  2013                1.10                 1.37
## 52  2014                4.11                 5.75
## 53  2015                0.274                0
## 54  2016               83.6                 83.6
## 55  2017               33.2                 32.9

Another station with a little bit of patchiness, but overall a fairly complete and long-term precipitation record up to 2015.

## Visualizing these data

### Monthly average precipitation

OK, let’s take a look at these data. First, I’ll plot the monthly average precipitation. I will include a record of the completeness of the data using the aforementioned quick_audit(), which I wrote for the senamhiR package. As written, quick_audit() only works on one station at a time, so I’ll use the map() functions in the purrr package to get the desired data. I am going to limit my visualization to the past 25-years (1993 to 2017).

# This has to be nested because of the attribute stored in each table, which is lost when passed to quick_audit. In a future re-write of quick_audit, I will handle this better.
dat_audit <- dat %>%
quick_audit(
filter(., Fecha >= "1993-01-01" & Fecha <= "2017-12-31"),
variables = c("Prec07", "Prec19"), by = "month"),
StationID = attr(., "StationID"), .before = 1)
)

# Binding rows causes yearmon objects to lose attributes; I'll rename it while I'm at it
dat_audit <- dat_audit %>% mutate(Yearmon = as.yearmon(Year-month)) %>%
select(StationID, Yearmon, Prec07 (mm) pct NA, Prec19 (mm) pct NA)

dat_audit
## # A tibble: 1,200 x 4
##    StationID Yearmon       Prec07 (mm) pct NA Prec19 (mm) pct NA
##    <chr>     <S3: yearmon>                <dbl>                <dbl>
##  1 000382    Jan 1993                       100                  100
##  2 000382    Feb 1993                       100                  100
##  3 000382    Mar 1993                       100                  100
##  4 000382    Apr 1993                       100                  100
##  5 000382    May 1993                       100                  100
##  6 000382    Jun 1993                       100                  100
##  7 000382    Jul 1993                       100                  100
##  8 000382    Aug 1993                       100                  100
##  9 000382    Sep 1993                       100                  100
## 10 000382    Oct 1993                       100                  100
## # ... with 1,190 more rows

Now I will get the aggregate monthly total precipitation, and join the information about the percent of missing values. As a first-time visualization, I’ll only filter fully missing months (100% NA values). I just want to get a quick visual, but for a more robust visualization, I would certainly consider a lower threshold.

First, I will do some data cleaning of the daily data to make my life easier.

dat <- dat %>%
collapse_pcd %>% # Uses the collapse_pcd function from senamhiR
filter(Fecha >= "1993-01-01" & Fecha <= "2017-12-31") %>% # filter the data
rename(p07 = Prec07 (mm), p19 = Prec19 (mm)) %>% # rename some vars for easier typing
mutate(Prec = case_when(is.na(p07) & is.na(p19) ~ as.numeric(NA), # avoid false zeroes
TRUE ~ rowSums(.[11:12],  na.rm = TRUE)), # keep one p even if other is NA
Day = factor(case_when(is.na(Prec) ~ as.character(NA),
Prec < 1 ~ "dry",
TRUE ~ "wet"))) # add wet and dry day variables

Now I will create the aggregate monthly series and plot this using plotly.

dat_monthly <- dat %>% group_by(StationID, Yearmon = as.yearmon(Fecha)) %>%
summarize(Tot_Prec = sum(Prec, na.rm = TRUE),
dry_days = sum(Day == "dry", na.rm = TRUE),
wet_days = sum(Day == "wet", na.rm = TRUE))

dat_monthly <- left_join(dat_monthly, dat_audit) %>%
filter(Prec07 (mm) pct NA != 100 & Prec19 (mm) pct NA != 100) %>%
group_by(StationID, Month = format(Yearmon, format = "%m")) %>%
summarize(Tot_Prec = mean(Tot_Prec, na.rm = TRUE),
dry_days = mean(dry_days, na.rm = TRUE),
wet_days = mean(wet_days, na.rm = TRUE))

plot_ly(data = dat_monthly, x = ~Month, y = ~Tot_Prec, color = ~StationID) %>%
layout(title = "Average monthly total precipitation (mm)",
yaxis = list(title = "Total monthly precipitation (mm)"))

By the looks of it, all four stations exhibit fairly similar precipitation patterns. The region shows a double-peak precipitation pattern, with highest rainfall occurring in March, and a second, smaller peak in November. These peaks are consistent with the displacement of the intertropical convergence zone, which shifts north and south based on the solar equinoxes. San Rafael experiences the lowest rainfall in June, July and August, when the ITCZ is to the north. This is a super neat pattern that will require a deeper analysis in another post or paper.

### Wet- and dry-day frequencies and spells

How about wet days and dry days? First, let’s take a look at the average number of wet and dry days in a month. Note that the code for these numbers was in the previous code boxes.

subplot(
plot_ly(data = dat_monthly, x = ~Month, color = ~StationID, y = ~wet_days) %>%
plot_ly(data = dat_monthly, x = ~Month, color = ~StationID, y = ~dry_days) %>%
layout(title = "Average monthly (top) wet (≥ 1 mm p) and (bottom) dry days (< 1 mm p)",
yaxis = list(title = "Number of monthly wet/dry days"),
showlegend = FALSE),
nrows = 2
)

The top panel in the above shows the average number of wet days (days with ≥ 1 mm precipitation) for each month at each station. By the looks of it, none of the stations exceed an average maximum of 11.5 dry days in March. Throughout the year, it seems to rain more than 1 mm on somewhere between 5 and 8 days each month, on average. The bottom panel shows the number of dry-days which, as you might imagine, is the exact opposite of the number of wet days, as I have defined dry days here as the number of days with less than 1 mm of precipitation).

Wet and dry spells may be more interesting than the total number of wet or dry days as a possible indicator of flood events. Let’s take a look at extended (3 days or more) periods of wet days, similar to what I did in a previous post about cold weather events in Toronto.

fx <- function(x, y = NULL, day = NULL, comm = NULL, ...) {
event <- rle(x == day)
event$values[event$lengths < 3] <- FALSE

if (!inherits(comm, "function")) {
# This will return the lengths of the events
lens <- NULL
for (i in 1:length(event[[1]])) {
if (!is.na(event$values[i]) & event$values[i]) {
lens <- c(lens, rep(event$lengths[i], event$lengths[i]))
} else {
lens <- c(lens, rep(NA, event$lengths[i])) } } lens } else { # This will return summary variables of the events vals <- NULL for (i in 1:length(event[[1]])) { if (!is.na(event$values[i]) & event$values[i]) { pull <- c(rep(FALSE, sum(event$lengths[1:(i - 1)])),
rep(TRUE, event$lengths[i])) pull <- c(pull, rep(FALSE, length(y) - length(pull))) vals <- c(vals, rep(comm(y[pull], ...), event$lengths[i]))
} else {
vals <- c(vals, rep(NA, event$lengths[i])) } } vals } } dat_days <- dat %>% group_by(StationID) %>% mutate(dry_start = as.Date(fx(Day, Fecha, "dry", min)), dry_end = as.Date(fx(Day, Fecha, "dry", max)), dryspell = fx(Day, day = "dry"), wet_start = as.Date(fx(Day, Fecha, "wet", min)), wet_end = as.Date(fx(Day, Fecha, "wet", max)), wetspell = fx(Day, day = "wet")) The figure below shows a timeline of wet spells detected at each station from 1993 to 2017. The number of events is pretty dense, so some zooming and panning is required, but the timeline allows us to get a quick idea of where longer-duration events have occurred. timedat <- dat_days %>% filter(!is.na(wetspell)) %>% group_by(StationID, wet_start, wet_end, wetspell) %>% summarize(Tot_Prec = sum(Prec), most_prec = max(Prec, na.rm = TRUE), least_prec = min(Prec, na.rm = TRUE), avg_dly_prec = mean(Prec, na.rm = TRUE)) %>% ungroup() %>% mutate(content = paste(wetspell, "days"), start = wet_start, end = wet_end, group = StationID, title = paste("Total:", Tot_Prec, "mm; Max:", most_prec, "mm/day; Min:", least_prec, "mm/day; Avg:", round(avg_dly_prec, 1), "mm/day")) %>% select(content, start, end, group, title) groups <- tibble(id = unique(timedat$group), content = id)

timevis::timevis(timedat, groups, options = list(stack = FALSE), width = 720, height = 240)

## Preliminary summary

In this post, I have examined the data availability at four meteorological stations near my study site in San Rafael, in San Martín, Peru. I have identified a two-peak precipitation regime, and have classified some clustering of precipitation. This information will be useful to examine alongside some river flow data to begin to unravel the relationship between precipitation and river discharge and level. As my analysis progresses, I may need to consider more rain data from areas upstream of my study site, as well as river data from tributaries to the Huallaga River. In my next post, I will look at some river stations that might inform my current study.