---
title: 'Estimating microcredit impact with low take-up, contamination and inconsistent data. A review of Crépon, Devoto, Duflo and Pariente (American Economic Journal: Applied Economics, 2015)'
author: Anonymised
bibliography: bibliography.bib
output:
  pdf_document:
    keep_tex: yes
    latex_engine: xelatex
  word_document: default
header-includes: \usepackage{caption}
editor_options:
  chunk_output_type: console
abstract: We replicate a flagship randomised control trial carried out in rural Morocco that showed substantial and significant impacts of microcredit on the assets, the outputs, the expenses and the profits of self-employment activities. The original results rely primarily on trimming, that is, the exclusion of observations with the highest values on some variables. However, the trimming procedures applied are inconsistent between the baseline and the endline. Using the same specifications as the original paper reveals large and significant imbalances at the baseline and, at the endline, impacts on implausible outcomes such as household head gender, language or education. This calls into question the reliability of the data and the integrity of the experiment protocol. We find a series of coding, measurement and sampling errors. Correcting these errors leads to different results, yet substantial imbalances at baseline and implausible impacts at the endline remain. Our re-analysis focuses on the lack of internal validity of this experiment, but several of the identified issues also raise concerns about its external validity.
---

\newpage

```{r load_packages, message=F, warning=F, echo=F, error=F}
# Note 1 for code rerunning: our algorithm that checks household composition
# consistency is computationally very demanding and can take ~1h to run.
# The rest of the code takes ~5 minutes. It is possible to run the
# aforementioned algorithm separately, before the rest of the code.
# See instructions in the corresponding chunk (named 'algo_hh_chk').

# Note 2 for code rerunning: the present Rmd file produces a LaTeX PDF by
# default. To do so, you need to have a LaTeX engine installed on your
# computer and the link indicated in RStudio options:
# Tools/Global options/Sweave. Otherwise simply click on Knit and choose
# a format other than pdf (html or word) to reproduce the article.

# IREE requires package and dependency installation to be automated.
# We do that with the following helper.
install_load <- function(mypkg) {
  if (!is.element(mypkg, installed.packages()[,1])) {
    install.packages(mypkg, repos='http://cran.us.r-project.org')
  }
  library(mypkg, character.only=TRUE)
}

# Loading packages for analysis
install_load("haven") # to import .dta (Stata) data files
install_load("tidyverse") # package collection to tidy and plot data
install_load("combinat") # to generate all possible hh member combinations to check possible matches
install_load("labelled") # to extract stata variable labels
install_load("kableExtra") # to produce complex tables
usepackage_latex("threeparttable") # from kableExtra, for long footnotes
install_load("lmtest") # for clustered standard errors
install_load("sandwich") # idem
install_load("broom") # to tidy regression outputs
install_load("grid") # For combined violin plots
install_load("gridExtra") # Same

## Some details on system package versions in case of incompatibilities
## with previous or subsequent versions: sessionInfo() output
## System:
# R version 3.5.0 (2018-04-23)
# Platform: i386-w64-mingw32/i386 (32-bit)
# Running under: Windows 7 (build 7601) Service Pack 1
#
# Attached packages
# [1] bindrcpp_0.2.2 gridExtra_2.3
# [3] broom_0.4.4 sandwich_2.4-0
# [5] lmtest_0.9-36 zoo_1.8-1
# [7] kableExtra_0.8.0 labelled_1.0.1
# [9] combinat_0.0-8 forcats_0.3.0
# [11] stringr_1.3.0 dplyr_0.7.4
# [13] purrr_0.2.4 readr_1.1.1
# [15] tidyr_0.8.0 tibble_1.4.2
# [17] ggplot2_2.2.1 tidyverse_1.2.1
# [19] haven_1.1.1

# Downloading the data
if (!file.exists("Data&Code_AEJApp_MicrocreditMorocco/Input/Microcredit_BL_mini_anonym.dta")) {
  download.file("https://www.aeaweb.org/aej/app/data/0701/2013-0535_data.zip", "data.zip")
  unzip("data.zip")
}
# The latter creates a message in LaTeX that I cannot get rid of,
# so I keep the long name and corresponding path in data import
# file.rename(from = "Data&Code_AEJApp_MicrocreditMorocco", to = "data")

# Download bibliography
if (!file.exists("bibliography.bib")) {
  download.file("https://www.dropbox.com/s/p73wgj0ce3vt7yx/bibliography.bib?raw=1", "bibliography.bib")
}
```

# 1. Introduction

Randomised control trials (RCTs) are increasingly considered the gold standard for producing evidence on what works and what does not, and this trend is particularly strong in development economics [@bedecarrats_all_2017]. In this field, microfinance is the sector most frequently evaluated by RCTs. J-PAL (a global research centre promoting this method for poverty reduction) lists 262 “finance” RCTs among its 902 completed and ongoing RCTs.[^1] A highlight of this undertaking was the 2015 publication of a special issue of the *American Economic Journal: Applied Economics* (AEJ:AE) featuring six RCTs on microcredit [@banerjee2015six]. Leading figures of the RCT movement see this special issue as the decisive contribution that settles a long-standing debate on the subject [@ogden_experimental_2017]. It quickly attracted massive coverage: 2,557 citations in other scientific studies[^2] and J-PAL’s publication of a policy briefcase based on the six papers and drawing general conclusions for finance access strategies worldwide [@loiseau_where_2015].

[^1]: Source: The Abdul Lateef Jameel Poverty Action Lab website: [www.povertyactionlab.org/evaluations](https://www.povertyactionlab.org/evaluations), visited on 23/04/2018.
[^2]: Source: Google Scholar citation indexes for the articles featured in this special issue, [see corresponding webpage](https://scholar.google.fr/scholar?hl=fr&as_sdt=0%2C5&as_ylo=2015&as_yhi=2015&q=microcredit+source%3A%22American+economic+journal+applied+economics%22&btnG=), visited on 23/04/2018.

To strengthen the robustness of empirical research, the scientific community increasingly recommends systematic replication. A replication is a "*study whose main purpose is to determine the validity of one or more empirical results from a previously published study*" [@duvendack_what_2017: 47]. Clemens [-@clemens_meaning_2017] defines two categories and four subcategories of tests that can be used to this effect.

The first category, *replication tests*, uses the same specifications as the original paper, focuses on the same population of interest and is expected to produce the same results. Replication tests can be divided into two subcategories. The *replication-verification* subcategory retains the same sample as the original, to ensure that the reported statistical analysis does indeed produce the same results. Its purpose is mainly to identify flawed measurements, codes, datasets, etc. The *replication-reproduction* subcategory resamples, but from the same population and with the same distribution as the original paper. This is designed to turn up sampling errors and statistical power issues, as well as the kinds of errors that verification detects.

The second category, *robustness tests*, uses different specifications from those of the original paper. These tests are not expected to produce the same results, but the results should remain consistent if the conclusions of the original paper are to hold. Robustness tests can also be divided into two subcategories. The *robustness-reanalysis* subcategory alters the statistical procedures, for instance to include new or recoded variables or to run different types of regressions. It may or may not entail resampling, but it refers to the same population of interest.
The *robustness-extension* subcategory uses different data from a different population or from the same population at a different point in time, but applies the same data analysis procedure.

Replications are still seldom performed, and most of them belong to the *robustness-reanalysis* category. Sukhtankar [-@sukhtankar_replications_2017] systematically reviews development economics articles published in ten top-ranking journals[^3] since 2000. He finds that 71 (6.2%) of the 1,138 empirical articles studied have been the subject of replication or robustness tests in a published or working paper. This rate rises to 12.5% when considering solely the 120 RCTs covered in this systematic review. Yet when the scope is narrowed to studies conducting *replication tests* (verification or reproduction), the ratio falls to just 0.20% for all empirical papers and 0.16% (only two cases) for RCTs. These rates suggest that economists generally take for granted the reliability of the data, sampling and codes of the work produced by their peers and that, when they do take an interest in challenging a publication, they focus the discussion on modelling techniques.

[^3]: *American Economic Review*, *Quarterly Journal of Economics*, *Journal of Political Economy*, *Econometrica*, *Review of Economic Studies*, *American Economic Journal: Applied Economics*, *American Economic Journal: Economic Policy*, *Economic Journal*, *Journal of the European Economic Association*, and *Review of Economics and Statistics*.

```{r eval = FALSE, echo = FALSE}
# # Reanalysis from Sukhtankar data to filter for verifications and reproductions
# # To process, just uncomment and run the following lines.
# download.file("https://assets.aeaweb.org/assets/production/files/4421.zip",
#               "data_replic.zip")
# unzip("data_replic.zip")
# biblio_replic <- read_dta("final/replication_data_final.dta") %>%
#   mutate(verif_reprod = ReplicationVerification + ReplicationReproduction,
#          verif_reprod = ifelse(verif_reprod > 0 ,
#                                "Includes replication", "Robustness only"))
# biblio_replic %>%
#   filter(EmpiricalPaper == "Empirical") %>%
#   group_by(Replicated, verif_reprod) %>%
#   count()
#
# biblio_replic %>%
#   filter(EmpiricalPaper == "Empirical", RCT == "Yes") %>%
#   group_by(Replicated, verif_reprod) %>%
#   count()
```

Replication tests can only be performed if the raw microdata is available. To encourage such tests, a growing number of journals now systematically publish articles jointly with the data and analysis procedure on which they are based. The AEJ:AE data availability policy[^4] states that the raw data should be made available, in particular in the case of experiments. However, in the above-mentioned special issue on microcredit, the raw data is available for just three of the six RCTs: [@crepon_estimating_2015; @attanasio_impacts_2015; @augsburg_impacts_2015]. A subset of pre-processed aggregated variables is provided in two cases [@banerjee_miracle_2015; @angelucci_microcredit_2015], and no data is made available at all in one case [@tarozzi_impacts_2015]. We chose to replicate the Moroccan study by Crépon, Devoto, Duflo and Pariente (hereafter referred to as CDDP). This is the most cited paper of the reproducible half of the AEJ:AE special issue on microcredit. It is also co-authored by two researchers who play a central role as standard setters at J-PAL: Crépon and Duflo [@jatteau_faire_2016: 313]. It could therefore be indicative of common RCT practices in the development field.
[^4]: All journals of the American Economic Association, including AEJ:AE, are subject to the same data availability policy, available online ([www.aeaweb.org/journals/policies/data-availability-policy](https://www.aeaweb.org/journals/policies/data-availability-policy)). This data availability policy has remained the same since at least 2012: the same clause is quoted in this review of journal data policies ([www.edawax.de/wp-content/uploads/2012/07/Data_Policies_WP2.pdf](http://www.edawax.de/wp-content/uploads/2012/07/Data_Policies_WP2.pdf)).

CDDP conducted this RCT impact evaluation with Morocco’s largest microcredit institution (Association Al Amana, hereafter AAA), which was launching microcredit in rural areas not yet covered. The team took advantage of this expansion to perform an RCT at area level. A total of 162 villages were chosen around a central zone where the microfinance institution (MFI) had decided to start up new operations. The villages were then divided into 81 pairs of similar villages based on observable characteristics such as the number of households, accessibility to the centre of the community, existing infrastructure, type of activities carried out by the households and type of agricultural activities. AAA started operations in the villages randomly assigned to treatment, offering joint-liability loans to the men and women living there. The loans granted were similar to urban area loans: group loans with amounts ranging from MAD (Moroccan dirhams) 1,000 to 15,000 (USD 124 to 1,855) per group member. In March 2008, AAA launched individual loans in rural areas: housing and non-agricultural businesses were eligible for larger amounts, but with additional conditions. Most of the loans taken in these areas, however, were group loans. Loan periods ranged from 3 to 18 months and repayments were made weekly, fortnightly or monthly, except for stockbreeding loans, which benefited from a two-month grace period.
Annual interest rates ranged between 12.5% and 14.5% at the time of the study. The authors argue that there was enough distance between pairs of villages to prevent any contamination between treatment and control villages.

The RCT was performed from 2006 to 2010 over four expansion periods. The baseline was conducted in four phases between 2006 and 2007. The sample as a whole was broken down into three household categories: 1) households in the top quartile of the propensity score (the 25% of households with the highest probability of taking out a microloan); 2) five households randomly selected from the three other quartiles and added to this sample in each village (treatment and control); and 3) a third group of 1,433 households added only at the endline by re-estimating take-up scores across the entire sample and matching with administrative data provided by the MFI. The sample contained 4,465 households at the baseline, 92% of which (4,118) could be re-interviewed at the endline. With the 1,433 new households added at the endline, the total sample came to 5,551 households. The authors state that these three categories of potential borrowers capture the heterogeneity across households (borrowers versus non-borrowers) and thus enable them to assess the spillover effect on non-borrowers and “measure the impact of microcredit expansion on the community as a whole” [@loiseau_where_2015: 3].

The main findings of the RCT on the entire population of a village are reported in @crepon_estimating_2015 and @loiseau_where_2015. The first finding is that demand (take-up) for microcredit was low, and lower than the researchers and the partner MFI had expected. While this pattern is similar to other countries such as Ethiopia, India and Mexico, the uptake rate was particularly low in Morocco (13% of eligible borrowers), despite active promotion of microcredit by AAA loan officers during the RCT.
The authors find that the programme had no impact on business start-up, but positive effects were found on a number of business-related outcome variables such as income, assets, investment and profits. The overall positive results were highly heterogeneous, meaning that some households benefited (larger business owners) while others did not (negative impact). Heterogeneity aside, positive impacts on business earnings were offset by a significant decrease in labour supplied outside the home and in salary income. Consumption across an entire village population also decreased, albeit not significantly. Lastly, judging by its impact on two major outcome variables (education and women’s empowerment), microcredit is unlikely to change women’s bargaining power in rural Morocco. The main conclusion the authors derive from their study is that the aggregate impact of microcredit should not be overestimated, as their study finds an overall fairly limited effect on the population at large, at least over a short period of time (two years).

This replication paper is structured as follows. In Section 2, we describe the data and our replication method. Section 3 discusses the trimming procedures used by CDDP and assesses their results' sensitivity to the trimming threshold. Section 4 highlights several significant imbalances at baseline and disconcerting impacts on other outcomes produced with the same specifications as CDDP. Section 5 focuses on coding and measurement errors, while Section 6 addresses sampling errors. Section 7 discusses shortcomings related to external validity and our concluding comments are found in Section 8.

# 2. Data and method

The data and code used by CDDP can be found in the AEJ:AE subsection of the American Economic Association website, via links on the article's page.
The download contains three datasets in Stata (.dta) format: the short preparatory survey (15,145 observations and 25 variables), the baseline survey (4,465 observations and 3,733 variables) and the endline survey (5,551 observations and 4,790 variables). It also includes the endline survey questionnaire, in French and English. Neither the short preparatory survey questionnaire nor the baseline survey questionnaire is provided. Lastly, the download includes five data processing scripts, also in Stata format (.do): "Outcome construction at baseline", "Outcome construction at endline", "Analysis", "Graphs" and "Master".

In the following replication, we refer to specific code sections by giving the Stata file each section comes from (abbreviated respectively as BL, EL, AN, GR and MA), followed by the line number. For example, “BL:43” refers to line 43 of the file “Outcome construction at baseline”. We also refer to specific survey questions and microdata variables, giving their code in single quotation marks. Response categories (modalities) are set in italics. For example, '*Al Amana*' and '*Zakoura*' are two possible answers to survey questionnaire question ‘i3’ on whom the household has borrowed from during the past year.

To ensure that our procedures are fully transparent and reproducible, we computed them using [R](https://cran.r-project.org/) [@r_core_team_r_2018] in [RMarkdown](http://rmarkdown.rstudio.com/) [@rstudio_team_rstudio_2018] format. We published, jointly with this paper, its source file with a .rmd extension, which contains all the scripts to access, download, import, prepare and compute the data. No data or figure was added outside of the script, and the results, tables and figures displayed in the document are produced solely by this code.

Taking Clemens’ typology [-@clemens_meaning_2017], our analysis includes *replication-verification*, *replication-reproduction* and *robustness-reanalysis* tests. These tests are interdependent.
Our verification turns up not only measurement errors, but also sampling errors, calling for resampling analysis. Our verification also raises concerns as to the robustness of the paper’s conclusions. We assessed this by using the same specifications as CDDP, but supplementing the control variables they included in their regressions to account for baseline imbalances with other variables on which we also found major imbalances at baseline. The primary focus of this re-analysis is assessing the internal validity of CDDP's published results and, unless stated otherwise, the shortcomings discussed below all refer to internal validity. Some of the issues we identified when assessing internal validity also have implications for external validity, so we discuss this in the last section of this replication paper.

Verification tests are often restricted to “push button replications”, as the International Initiative for Impact Evaluation (3IE) describes them[^5]: rerunning the script code provided by the authors with the same data and checking that it produces the same outputs. Here, we conducted a more exacting process, consisting of translating the analysis procedure into a different statistical language (R) from the one used by the authors (Stata). Translating the code into another programming language requires the replicators to understand the original authors’ intention, design a script that executes this intention (instead of simply copy-pasting), and analyse any discrepancies between replicated and original results at all stages of the data analysis process until the cause of each and every difference is understood. We ended up with code in which each step of the data analysis is a function. Every time a coding error was identified in the original paper, this coding error was included as an optional parameter in the corresponding function. If the option is activated, the function reproduces the error made by CDDP.
If it is deactivated, it produces a corrected output.

[^5]: See the "Push Button Replication Project" page from the 3IE website ([www.3ieimpact.org](https://www.3ieimpact.org)).

We verified data quality and sampling integrity using basic good practices for survey analysis [@division_household_2005], in particular to check the consistency of household composition with respect to simple criteria such as gender and age. We also verified the variation in respondents’ answers to identical survey questions repeated across the questionnaires.

The original code and paper run regressions on 110 constructed dependent variables, each one built upon information contained in a number (sometimes dozens) of variables from the raw dataset. These variables can be clustered into five groups: credit, self-employment activities, work, consumption and socio-economic variables. We focus here on three of these groups: credit, which corresponds to the treatment being evaluated; self-employment activities, which is where the main impacts have been found; and consumption, as it is used for trimming (see Section 3.1).

We first reproduced the analysis of the original paper in R to show that we had the same data and that we had understood every detail of the analysis procedures applied by CDDP. Table 1 below reproduces some of the balance tests presented in CDDP Table 1. Table 2 below reproduces the average impact estimates of the experiment on access to credit, as in CDDP Table 2. Table 3 below presents the average treatment effect on variables related to self-employment activities, which include the most significant results of this RCT, as in CDDP Table 3.
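
The function-per-step design described earlier in this section, in which each identified coding error sits behind an optional parameter, can be illustrated with a minimal sketch. The function name `toy_expense_agriinv` and the input vector are hypothetical and serve only as an illustration; the division by ten above a threshold of 10,000 mirrors the operation reproduced in our `op_expense_agriinv` function.

```r
# Hypothetical toy example (base R only): one analysis step written as a
# function, with a known coding error kept behind the `correct` flag.
toy_expense_agriinv <- function(inv_agri, correct = FALSE) {
  if (correct) {
    # Corrected output: investment values are used as recorded
    inv_agri
  } else {
    # "As is" reproduction: values above 10,000 are divided by 10
    ifelse(inv_agri > 10000, inv_agri / 10, inv_agri)
  }
}

# Illustrative values, not taken from the survey data
inv <- c(500, 8000, 12000, 45000)
as_published <- toy_expense_agriinv(inv, correct = FALSE)
corrected <- toy_expense_agriinv(inv, correct = TRUE)

# Side-by-side comparison showing which observations the error affects
data.frame(inv, as_published, corrected, differs = as_published != corrected)
```

Running each step twice, once with the flag off and once with it on, isolates the contribution of a single error to the final estimates while leaving the rest of the pipeline unchanged.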
```{r import_data, message=F, warning=F, echo=F}
# Loading the data into R
bl <- read_stata(file = "Data&Code_AEJApp_MicrocreditMorocco/Input/Microcredit_BL_mini_anonym.dta",
                 encoding = "L1") %>% # specify encoding to run on linux systems
  mutate(ident = as.character(ident)) # Convert hh number to character to facilitate joins
el <- read_stata(file = "Data&Code_AEJApp_MicrocreditMorocco/Input/Microcredit_EL_mini_anonym.dta",
                 encoding = "L1") %>% # specify encoding to run on linux systems
  mutate(ident = as.character(ident)) # Convert hh number to character to facilitate joins
ms <- read_stata(file = "Data&Code_AEJApp_MicrocreditMorocco/Input/Microcredit_MS_anonym.dta",
                 encoding = "L1") # specify encoding to run on linux systems

# Convert "custom" NA values used by Crepon et al
bl[bl == -99 | bl == -98 | bl == -97] <- NA
el[el == -99 | el == -98 | el == -97] <- NA

# A function to extract table content
codebook = function(x, fctr = FALSE){
  nom = names(x)
  variable = var_label(x)
  modalites = val_labels(x, prefixed = TRUE)
  modalites = lapply(modalites, FUN = labels)
  modalites = lapply(modalites, FUN = paste, collapse = "; ")
  cb = cbind(nom, variable, modalites)
  cb = as.data.frame(cb, row.names = FALSE)
  return(cb)
}
# To be deleted
bl_cb <- codebook(bl)
el_cb <- codebook(el)
```

```{r load_functions, message=F, warning=F, echo=F}
# This chunk loads all the functions that we prepared to reproduce original
# results. Each time an error was identified, a conditional section was
# included in the function script, and a corresponding parameter was added in
# the function call to reproduce the error or not.
# borrowed_[source] -----------------------------------------------------------
# Function to extract, reshape and label credit information
extract_cred <- function (x, other_as_util = FALSE) {
  x %>%
    # Catch credit variables
    select(ident, matches("^i[1-9]_")) %>%
    gather(variable, value, -ident) %>%
    mutate(rank = sub("(.+)([0-9]+)$", "\\2", variable),
           variable = gsub("(.+)([0-9]+)$", "\\1", variable),
           variable = recode_factor(variable,
                                    "i3_" = "Source",
                                    "i3_aut" = "src_oth",
                                    "i4_" = "maturity",
                                    "i4_unit" = "matur_tu",
                                    "i5_" = "decision",
                                    "i6_" = "name_loan",
                                    "i7_" = "guarantee",
                                    "i8_" = "guar_type",
                                    "i9_" = "amount")) %>%
    spread(key = variable, value = value) %>%
    mutate(`Source` = recode_factor(`Source`,
                                    "1" = "Crédit agricole",
                                    "2" = "Other bank",
                                    "3" = "Al Amana",
                                    "4" = "Zakoura",
                                    "5" = "CA Foundation",
                                    "6" = "Other MFI",
                                    "7" = "Usurier",
                                    "8" = "Jewler",
                                    "9" = "Family",
                                    "10" = "Neighbour",
                                    "11" = "Friend",
                                    "12" = "Shop",
                                    "13" = "Client",
                                    "14" = "Supplier",
                                    "15" = "Cooperative",
                                    "16" = "Other",
                                    "17" = "Utility",
                                    "18" = "NR/NS"),
           source_type = recode_factor(`Source`,
                                       "Al Amana" = "Al Amana",
                                       "Zakoura" = "Other MFI",
                                       "CA Foundation" = "Other MFI",
                                       "Other MFI" = "Other MFI",
                                       "Crédit agricole" = "Other formal",
                                       "Other bank" = "Other formal",
                                       "Cooperative" = "Other formal",
                                       "Usurier" = "Informal",
                                       "Jewler" = "Informal",
                                       "Family" = "Informal",
                                       "Neighbour" = "Informal",
                                       "Friend" = "Informal",
                                       "Shop" = "Informal",
                                       "Client" = "Informal",
                                       "Supplier" = "Informal",
                                       "Other" = ifelse(other_as_util, "Utility", "Other"),
                                       "Utility" = "Utility"),
           source_type2 = recode_factor(source_type,
                                        "Al Amana" = "Al Amana",
                                        "Other MFI" = "Other formal",
                                        "Other formal" = "Other formal",
                                        "Informal" = "Informal",
                                        "Utility" = "Utility",
                                        "Other bank" = "Other formal",
                                        "Cooperative" = "Other formal",
                                        "Usurier" = "Informal",
                                        "Jewler" = "Informal",
                                        "Family" = "Informal",
                                        "Voisin" = "Informal",
                                        "Friend" = "Informal",
                                        "Shop" = "Informal",
                                        "Client" = "Informal",
                                        "Supplier" = "Informal",
                                        "Other" = ifelse(other_as_util, "Utility", "Other"),
                                        "Utility" = "Utility"))
}

# Function to assess % of household with credit from source
count_cred <- function(x, y = source_type, active_only = FALSE) {
  # This is to recreate the error by Crepon et al., who only used active
  # credits at baseline
  if (active_only) {
    x <- filter(x, rank < 4)
  }
  # To assign a variable to a parameter in dplyr,
  # we need to quote (enquo) and unquote it (!!)
  y <- enquo(y)
  x %>%
    group_by(ident, !!y) %>%
    summarise(n = n()) %>%
    spread(key = !!y, value = n, fill = 0) %>%
    # Drop the column that spread() creates for missing sources
    select(-`<NA>`)
}

# Dummy if hh has one or more credit from given source : 0(no)/1(yes)
borrowed <- function (x) {
  x[,-1] <- ifelse(x[,-1] > 0, 1, 0)
  return(x)
}

# loansamt_[source] variables -------------------------------------------------
amt <- function(x) {
  x %>%
    group_by(ident, source_type) %>%
    mutate(amount = as.numeric(amount)) %>%
    summarise(sumcred = sum(amount, na.rm = TRUE)) %>%
    spread(source_type, sumcred, fill = 0) %>%
    # Drop the column that spread() creates for missing sources
    select(-`<NA>`)
}

# cons_repay variable ---------------------------------------------------------
recode_cons_repay <- function(x) {
  x %>%
    mutate(cons_repay = ifelse(h3_10 >= 0 & !is.na(h3_10), 1, 0)) %>%
    select(ident, cons_repay)
}

# asset_agri variable----------------------------------------------------------
# A function to extract and format agricultural assets
extract_ast_agr <- function (x, excl_imp_share = TRUE) {
  out <- x %>%
    select(ident, e1_1:e8_16) %>%
    gather(variable, value, -ident) %>%
    mutate(`Asset` = gsub("(.+)_(.+)", "\\2", variable),
           variable = gsub("(.+)_(.+)", "\\1", variable),
           # labels = var_label(el)[variable],
           `Asset` = recode_factor(`Asset`,
                                   "1" = "Tractor",
                                   "2" = "Reaper",
                                   "3" = "Traditional laborer",
                                   "4" = "Cart",
                                   "5" = "Rake",
                                   "6" = "Shovel",
                                   "7" = "Ax",
                                   "8" = "Wheelbarow",
                                   "9" = "Sickle",
                                   "10" = "Car or truck",
                                   "11" = "Oil Mill",
                                   "12" = "Other 1",
                                   "13" = "Other 2",
                                   "14" = "Other 3",
                                   "15" = "Other 4",
                                   "16" = "Other 5")) %>%
    mutate(value = as.numeric(value)) %>%
    spread(key = variable, value = value) %>%
    rename(has = e1, number = e2, owns = e3, rent = e4,
           share = e5, new = e6, new_cost = e7, fin_src = e8) %>%
    arrange(ident, `Asset`)
  if (excl_imp_share) {
    # Shares above 100% are impossible: keep the original and set them to NA
    out <- out %>%
      mutate(share_origin = share,
             share = ifelse(share > 100, NA, share))
  }
  return(out)
}

# A function to appraise agro assets
# (the default was the self-referential `ast_ap = ast_ap`, which fails when the
# argument is omitted; NULL and "" now both mean "no price table supplied")
appraise_agro_ast <- function(x, ast_ap = NULL, exclude = "") {
  if (is.null(ast_ap) || identical(ast_ap, "")) {
    # otherwise uses specified prices (for both EL and BL)
    # Compute median prices by asset type
    ast_ap <- x %>%
      group_by(`Asset`) %>%
      summarise(median_value = median(new_cost, na.rm = TRUE))
  }
  # apply median price to hh assets
  ast_val <- x %>%
    left_join(ast_ap, by = "Asset") %>%
    mutate(appraisal = ifelse(owns == 2 & !is.na(share),
                              number * median_value * (share/100),
                              ifelse(owns == 1 | owns == 2,
                                     number * median_value, 0))) %>%
    filter(`Asset` != exclude)
  # summarise per hh
  ast_val %>%
    group_by(ident) %>%
    summarise(asset_agri = sum(appraisal, na.rm = TRUE))
}

# expense_agrirent ------------------------------------------------------------
sum_agrirent <- function(x) {
  x %>%
    group_by(ident) %>%
    summarise(expense_agrirent = sum(rent, na.rm = TRUE))
}

# inv_agri --------------------------------------------------------------------
appraise_inv_agri <- function(x) {
  x %>%
    mutate(inv_owns = ifelse(!is.na(new_cost) & new_cost >= 0 & owns == 1,
                             new_cost, 0),
           inv_shares = ifelse(!is.na(new_cost) & new_cost >= 0 &
                                 !is.na(share) & share >0 & owns == 2,
                               new_cost*share/100, 0),
           inv_tot = inv_owns + inv_shares) %>%
    group_by(ident) %>%
    summarise(inv_agri = sum(inv_tot, na.rm = TRUE))
}

# expense_agriinv -------------------------------------------------------------
# Very puzzling operation from Crepon et al.!!!
op_expense_agriinv <- function(x, correct = FALSE) {
  if(correct) {
    x %>%
      mutate(expense_agriinv = inv_agri) %>%
      select(ident, expense_agriinv)
  } else {
    # "As is" reproduction: investments above 10,000 are divided by 10
    x %>%
      mutate(expense_agriinv = ifelse(inv_agri > 10000, inv_agri / 10, inv_agri)) %>%
      select(ident, expense_agriinv)
  }
}

# savings_livestock variable --------------------------------------------------
# "Savings" in livestock
extract_lsk <- function (x) {
  x %>%
    select(ident, f8_0_1:f14_8) %>%
    gather(variable, value, -ident) %>%
    mutate(`Animal` = gsub("(.+)_(.+)$", "\\2", variable),
           variable = gsub("(.+)_(.+)$", "\\1", variable),
           `Animal` = recode_factor(`Animal`,
                                    "1" = "Local cattle breed",
                                    "2" = "Improved cattle breed",
                                    "3" = "Sheep",
                                    "4" = "Goat",
                                    "5" = "Draft animal",
                                    "6" = "Poultry",
                                    "7" = "Rabbit",
                                    "8" = "Other"),
           variable = recode_factor(variable,
                                    "f8_0" = "had",
                                    "f8" = "sold",
                                    "f9" = "totval_sold", # total val of sale
                                    "f10" = "consumed",
                                    "f11" = "lost",
                                    "f12" = "purchased",
                                    "f13" = "totval_purchased",
                                    "f14" = "remain"),
           value = as.numeric(value)) %>%
    spread(key = variable, value = value)
}

# Compute median prices by animal type
appraise_lsk <- function (x) {
  # Compute median prices
  lsk_price <- x %>%
    # A discrepancy in prices appears at endline between R and Stata, because
    # there are 6 obs where purchased = 0 but totval_purchased > 0. Differences
    # in the ways Stata and R handle missing values mean that Stata includes
    # those 6 in the median calculation and R does not. So for "as is"
    # reproduction's sake, we set totval_purchased to 0 in those 6 observations.
    # BTW, this is a Stata problem and R is right to do so.
    mutate(totval_purchased = ifelse(purchased == 0, 0, totval_purchased)) %>%
    group_by(`Animal`) %>%
    summarise(median_price_sold = median(totval_sold/sold, na.rm = TRUE),
              median_price_purchased = median(totval_purchased/purchased, na.rm = TRUE))
  # apply median price to hh remaining animals
  lsk_val <- x %>%
    left_join(lsk_price, by = "Animal") %>%
    mutate(val_remain = ifelse(remain >=0 & !is.na(median_price_purchased),
                               remain * median_price_purchased,
                               ifelse(remain >= 0 & !is.na(median_price_sold),
                                      remain * median_price_sold, 0)))
  # summarise by household
  lsk_val <- lsk_val %>%
    group_by(ident) %>%
    summarise(savings_livestock = sum(val_remain, na.rm = TRUE))
}

# sale_livestockanim ----------------------------------------------------------
# Appraise sales or auto consumption/gifts
appraise_anim_sale <- function (x) {
  # Compute median prices
  lsk_price <- x %>%
    group_by(`Animal`) %>%
    summarise(median_price_sold = median(totval_sold/sold, na.rm = TRUE),
              median_price_purchased = median(totval_purchased/purchased, na.rm = TRUE))
  # apply median price to hh consumed or gifted animals
  lsk_val <- x %>%
    left_join(lsk_price, by = "Animal") %>%
    mutate(val_sold = ifelse(totval_sold > 0 & !is.na(totval_sold), totval_sold, 0),
           val_cons_gift = ifelse(consumed >=0 & !is.na(consumed),
                                  consumed * median_price_sold, 0),
           val_sale = val_sold + val_cons_gift)
  # summarise by household
  lsk_val <- lsk_val %>%
    group_by(ident) %>%
    summarise(sale_livestockanim = sum(val_sale, na.rm = TRUE))
}

# asset_livestock--------------------------------------------------------------
extract_lsk_ast <- function (x) {
  x %>%
    select(ident, f1_1:f7_5) %>%
    gather(variable, value, -ident) %>%
    mutate(`Asset` = gsub("(.+)_(.+)$", "\\2", variable),
           variable = gsub("(.+)_(.+)$", "\\1", variable),
           `Asset` = recode_factor(`Asset`,
                                   "1" = "Milk jars",
                                   "2" = "Cage",
                                   "3" = "Honey wooden hives",
                                   "4" = "Other 1",
                                   "5" = "Other 2"),
           variable = recode_factor(variable,
                                    "f1" = "own",
                                    "f2" = "number",
                                    "f3" = "purchase_inyear",
                                    "f4" = "purchase_price",
                                    "f5" = "financing",
                                    "f6" = "rent",
                                    "f7" = "rent_amount"),
           value = as.numeric(value)) %>%
    spread(key = variable, value = value)
}

appraise_lsk_ast <- function (x, correct = FALSE) {
  # Compute median prices for assets 1 to 3
  median_prices <- x %>%
    group_by(`Asset`) %>%
    mutate(unit_price = ifelse(!is.na(purchase_price) & purchase_price > 0 &
                                 !is.na(number) & number > 0,
                               purchase_price/number, 0)) %>%
    filter(unit_price > 0) %>%
    summarise(median_price = median(unit_price, na.rm = TRUE)) %>%
    filter(`Asset` != "Other 1" & `Asset` != "Other 2")
  # apply median price to hh assets
  val <- x %>%
    left_join(median_prices, by = "Asset") %>%
    mutate(val_ast1 = ifelse(is.na(median_price) & purchase_price > 0,
                             purchase_price, number * median_price),
           val_ast1 = ifelse(!is.na(val_ast1), val_ast1, 0),
           # unit price if purchase
           val_ast2 = ifelse(!is.na(purchase_price) & purchase_price >0 &
                               !is.na(number) & number > 0,
                             purchase_price / number, 0),
           # to reproduce error : Crepon et al added unit price to asset total
           val_ast3 = ifelse(`Asset` == "Other 1" | `Asset` == "Other 2",
                             val_ast1, val_ast1 + val_ast2))
  if (correct) {
    # summarise by household
    val <- val %>%
      group_by(ident) %>%
      summarise(asset_livestock = sum(val_ast1, na.rm = TRUE))
  } else {
    val <- val %>%
      group_by(ident) %>%
      summarise(asset_livestock = sum(val_ast3, na.rm = TRUE))
  }
}

# savings_business-------------------------------------------------------------
# Extract business activities
extract_bac <- function (x) {
  x %>%
    # Catch activity variables
    select(ident, matches("g1[0-5].+|g3[7-9].+|g4[0-3]_.+")) %>%
    gather(variable, value, -ident) %>%
    mutate(`Activity` = gsub("(.+)([0-9]+)$", "\\2", variable),
           variable = gsub("(.+)([0-9]+)$", "\\1", variable),
           variable = gsub("_", "", variable),
           variable = recode_factor(variable,
                                    "g10act" = "Activity code2",
                                    "g10m" = "Month started",
                                    "g10a" = "Year started",
                                    "g11" = "Location",
                                    "g12" = "Owns location",
                                    "g13" = "Rent location",
                                    "g14" = "Inputs week",
                                    "g15" = "Inputs month",
                                    "g37" = "Sales week",
                                    "g38" = "Sales month",
                                    "g39" = "Stock",
                                    "g40" = "Best month",
                                    "g41" = "Sales best month",
                                    "g42" = "Worst month",
                                    "g43" = "Sales worst month"),
           value = as.numeric(value)) %>%
    spread(key = variable, value = value)
}

# Stock = savings_business
appraise_bac_s <- function (x) {
  x %>%
    group_by(ident) %>%
    summarise(savings_business = sum(`Stock`, na.rm = TRUE))
}

# asset_business --------------------------------------------------------------
extract_bas <- function (x, manual_recode = TRUE) {
  out <- x %>%
    select(ident, g1_1:g9_13) %>%
    gather(variable, value, -ident) %>%
    mutate(`Asset` = gsub("(.+)_(.+)$", "\\2", variable),
           variable = gsub("(.+)_(.+)$", "\\1", variable),
           `Asset` = recode_factor(`Asset`,
                                   "1" = "Sewing machine",
                                   "2" = "Weaving machine",
                                   "3" = "Clay turner",
                                   "4" = "Car or truck",
                                   "5" = "Soldering iron",
                                   "6" = "Cart",
                                   "7" = "Saw",
                                   "8" = "Plane Planer",
                                   "9" = "Balance",
                                   "10" = "Other 1",
                                   "11" = "Other 2",
                                   "12" = "Other 3",
                                   "13" = "Other 4"),
           variable = recode_factor(variable,
                                    "g1" = "has",
                                    "g2" = "number",
                                    "g3" = "activity",
                                    "g4" = "own",
                                    "g5" = "percent_own",
                                    "g6" = "recent",
                                    "g7" = "purchase_price",
                                    "g8" = "financing",
                                    "g9" = "rent"),
           value = as.numeric(value)) %>%
    spread(key = variable, value = value)
  if (manual_recode) {
    out <- out %>%
      mutate(number = ifelse(`Asset` == "Car or truck" & number == 25, NA, number),
             percent_own = ifelse(percent_own == 0.25, 25, percent_own))
  }
  return(out)
}

appraise_bas <- function (x) {
  # Compute median prices for assets 1 to 9
  median_prices <- x %>%
    group_by(`Asset`) %>%
    summarise(median_price = median(purchase_price, na.rm = TRUE)) %>%
    filter(`Asset` != "Other 1" & `Asset` != "Other 2" &
             `Asset` != "Other 3" & `Asset` != "Other 4")
  # apply median price to hh assets
  val <- x %>%
    left_join(median_prices, by = "Asset") %>%
    mutate(val_ast = ifelse(is.na(median_price) & purchase_price > 0, purchase_price,
                            ifelse(own == 2, number * median_price * percent_own/100,
                                   ifelse(own == 1, number *
median_price, 0)))) # summarise by household val <- val %>% group_by(ident) %>% summarise(asset_business = sum(val_ast, na.rm = TRUE)) } # expense_busrent ------------------------------------------------------------- sum_busrent <- function(x) { x %>% group_by(ident) %>% summarise(expense_busrent = sum(rent, na.rm = TRUE)) } # expense_businv -------------------------------------------------------------- # NB : gen expense_businv = inv_business_assets; sum_businv <- function(x) { x %>% mutate(inv_own = ifelse(purchase_price >= 0 & !is.na(purchase_price) & own == 1 & recent == 1, purchase_price, 0), inv_share = ifelse(purchase_price >= 0 & !is.na(purchase_price) & percent_own >= 0 & !is.na(percent_own) & own == 2 & recent == 1, purchase_price * percent_own / 100, 0), inv_tot = inv_own + inv_share) %>% group_by(ident) %>% summarise(expense_businv = sum(inv_tot)) } # expense_businputs ----------------------------------------------------------- extract_bus_exp1 <- function (x) { x %>% select(ident, matches("g1[6-9].+|g2[0-2].+")) %>% gather(variable, value, -ident) %>% mutate(`Item` = gsub("(.+)_(.+)$", "\\2", variable), `Activity` = gsub("(.+)_(.+)_(.+)$", "\\2", variable), variable = gsub("(.+)_(.+)_(.+)$", "\\1", variable), variable = recode_factor(variable, "g16" = "purchased", "g17" = "unit", "g18" = "price_unit", "g19" = "number_unit", "g20" = "times", "g21" = "place", "g22" = "seller"), value = as.numeric(value)) %>% spread(key = variable, value = value) } val_expbusinp1 <- function(x) { x %>% mutate(tot = ifelse(price_unit >= 0 & !is.na(price_unit) & number_unit >= 0 & !is.na(number_unit), price_unit * number_unit * 12, 0)) %>% group_by(ident) %>% summarise(expbusinp1 = sum(tot, na.rm = TRUE)) } extract_bus_exp2 <- function (x, correct = FALSE) { x <- x %>% select(ident, matches("g2[3-5].+")) %>% gather(variable, value, -ident) %>% mutate(`Item` = gsub("(.+)_(.+)$", "\\2", variable), `Activity` = gsub("(.+)_(.+)_(.+)$", "\\2", variable), variable = 
gsub("(.+)_(.+)_(.+)$", "\\1", variable),
           variable = recode_factor(variable,
                                    "g23" = "label",
                                    "g24" = "expense",
                                    "g25" = "times"),
           value = as.numeric(value)) %>%
    spread(key = variable, value = value)
  if (!correct) {
    # Mistake in Crepon et al.: their index on items only goes up to 5,
    # but in their data there are up to 6 items per activity
    x <- x %>%
      filter(`Item` < 6)
  }
  return(x)
}

val_expbusinp2 <- function(x) {
  x %>%
    group_by(ident) %>%
    summarise(expbusinp2 = sum(expense, na.rm = TRUE)*12)
}
# One difference here for HH 184105, that I cannot explain

# expense_buslabor ------------------------------------------------------------
# a function to extract section on business labor
extract_buslabor <- function (x) {
  x %>%
    select(ident, matches("g4[7-9].+|g5[0-1].+")) %>%
    gather(variable, value, -ident) %>%
    mutate(`Activity` = gsub("(.+)_(.+)$", "\\2", variable),
           variable = gsub("(.+)_(.+)$", "\\1", variable),
           variable = recode_factor(variable,
                                    "g47" = "nb_permanent",
                                    "g48" = "wage_permanent",
                                    "g49" = "nb_seasonal",
                                    "g50" = "day_seasonal",
                                    "g51" = "day_wage_seasonal"),
           value = as.numeric(value)) %>%
    spread(key = variable, value = value)
}

# a function to appraise business labor expenses
val_buslabor <- function(x) {
  x %>%
    mutate(expense_permanent = ifelse(!is.na(nb_permanent) & nb_permanent > 0 &
                                        !is.na(wage_permanent) & wage_permanent > 0,
                                      nb_permanent * wage_permanent * 12, 0),
           expense_seasonal = ifelse(!is.na(nb_seasonal) & nb_seasonal >0 &
                                       !is.na(day_seasonal) & day_seasonal > 0 &
                                       !is.na(day_wage_seasonal) & day_wage_seasonal > 0,
                                     nb_seasonal * day_seasonal * day_wage_seasonal, 0),
           expense_total = expense_permanent + expense_seasonal) %>%
    group_by(ident) %>%
    summarise(expense_buslabor = sum(expense_total, na.rm = TRUE))
}

# sale_business ---------------------------------------------------------------
extract_bsale <- function (x, exclude_var = "", exclude_item = "") {
  x %>%
    # Catch business sales variables
    select(ident, matches("g2[8-9].+|g35.+")) %>%
    gather(variable,
value, -ident) %>% mutate(`Item` = gsub("(.+)([0-9])$", "\\2", variable), `Activity` = gsub("(.+)_([0-9])_(.+)", "\\2", variable), variable = gsub("(.+)_([0-9])_(.+)", "\\1", variable), variable = recode_factor(variable, "g28" = "product_price", "g29" = "product_units", "g35" = "sale_services"), value = as.numeric(value)) %>% filter(!(variable == exclude_var & `Item` %in% exclude_item)) %>% spread(key = variable, value = value) } appraise_bsale <- function (x) { x %>% mutate(sale_service_y = ifelse(!is.na(sale_services), sale_services * 12, 0), sale_item_y = ifelse(!is.na(product_price) & !is.na(product_units), product_units * product_price * 12, 0), sale_activity = sale_service_y + sale_item_y) %>% group_by(ident) %>% summarise(sale_business = sum(sale_activity, na.rm = TRUE)) } # sale_cereal ----------------------------------------------------------------- extract_cereal <- function (x) { x %>% select(ident, e15_1:e24_20) %>% gather(variable, value, -ident) %>% mutate(`Cereal` = gsub("(.+)_(.+)", "\\2", variable), variable = gsub("(.+)_(.+)", "\\1", variable), `Cereal` = recode_factor(`Cereal`, "1" = "Durum", "2" = "Wheat", "3" = "Corn", "4" = "Barley", "5" = "Sunflower", "6" = "Fava", "7" = "Lentills", "8" = "Peas", "9" = "Chickpeas", "10" = "Dry beans", "11" = "Forage", "12" = "Other 1", "13" = "Other 2", "14" = "Other 3", "15" = "Other 4", "16" = "Other 5", "17" = "Other 6", "18" = "Other 7", "19" = "Other 8", "20" = "Other 9"), variable = recode_factor(variable, "e15" = "grew", "e16" = "area", "e17" = "month_harvest", "e18" = "production", "e19" = "consumed", "e20" = "sold_before_h", "e21" = "price_before_h", "e22" = "sold_after_h", "e23" = "price_after_h", "e24" = "stock"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to appraise cereal sales appraise_cereal <- function(x, correct = FALSE) { # Compute median prices by cereal if (correct) { # Crepon and al. 
should not have computed different price to different # "others" only because their index is different. x <- x %>% mutate(`Cereal` = recode_factor(`Cereal`, "Other 1" = "Other", "Other 2" = "Other", "Other 3" = "Other", "Other 4" = "Other", "Other 5" = "Other", "Other 6" = "Other", "Other 7" = "Other", "Other 8" = "Other", "Other 9" = "Other")) } cer_ap <- x %>% group_by(`Cereal`) %>% summarise(median_price_beforeh = median(price_before_h, na.rm = TRUE), median_price_afterh = median(price_after_h, na.rm = TRUE)) # By default reproduces error from original paper # Here authors applied price after harvest to all sales, including those before # We should have price before harvest for sales before harvest (standing crop sales) if (!correct) { cer_ap$median_price_beforeh <- cer_ap$median_price_afterh } # apply median price to cereal sales cer_val <- x %>% left_join(cer_ap, by = "Cereal") %>% mutate(appraisal_before = ifelse(is.na(sold_before_h), 0, ifelse(!is.na(price_before_h) & price_before_h > 0, sold_before_h * price_before_h, sold_before_h * median_price_beforeh)), appraisal_after = ifelse(is.na(sold_after_h), 0, ifelse(!is.na(price_after_h) & price_after_h > 0, sold_after_h * price_after_h, sold_after_h * median_price_afterh)), appraisal_consum = ifelse(is.na(consumed), 0, consumed * median_price_afterh), appraisal_total = (appraisal_before+ appraisal_after+ appraisal_consum), test1_before = !is.na(price_before_h), test2_before = price_before_h > 0) # # summarise per hh cer_val %>% group_by(ident) %>% summarise(sale_cereal = sum(appraisal_total, na.rm = TRUE)) } # savings_cereal -------------------------------------------------------------- # A function to appraise cereal savings appraise_cereal_savings <- function(x, correct = FALSE) { # Compute median prices by cereal cer_ap <- x %>% group_by(`Cereal`) %>% summarise(median_price_beforeh = median(price_before_h, na.rm = TRUE), median_price_afterh = median(price_after_h, na.rm = TRUE)) # By default reproduces 
error from original paper # Here we should apply price after harvest as stock is after harvest if (correct) { cer_ap$median_price_beforeh <- cer_ap$median_price_afterh } # apply median price to cereal sales cer_val <- x %>% left_join(cer_ap, by = "Cereal") %>% mutate(appraisal_savings = stock * median_price_beforeh) # summarise per hh cer_val %>% group_by(ident) %>% summarise(savings_cereal = sum(appraisal_savings, na.rm = TRUE)) } # sale_tree ------------------------------------------------------------------- extract_tree <- function (x) { x %>% select(ident, e25_1:e35_13) %>% gather(variable, value, -ident) %>% mutate(`Tree` = gsub("(.+)_(.+)", "\\2", variable), variable = gsub("(.+)_(.+)", "\\1", variable), variable = recode_factor(variable, "e25" = "number_tree", "e26" = "harest_month", "e27" = "production", "e28" = "consumed", "e29" = "sold_before_h", "e30" = "price_before_h", "e31" = "sold_during_h", "e32" = "price_during_h", "e33" = "sold_after_h", "e34" = "price_after_h", "e35" = "stock"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to appraise tree sales appraise_tree <- function(x, correct = FALSE) { # replace to NA if no sale x <- x %>% mutate(price_before_h = ifelse(sold_before_h == 0 & price_before_h == 0, NA, price_before_h), price_during_h = ifelse(sold_during_h == 0 & price_during_h == 0, NA, price_during_h), price_after_h = ifelse(sold_after_h == 0 & price_after_h == 0, NA, price_after_h)) # Compute median prices by tree tree_ap <- x %>% group_by(`Tree`) %>% summarise(median_price_beforeh = median(price_before_h, na.rm = TRUE), median_price_duringh = median(price_during_h, na.rm = TRUE), median_price_afterh = median(price_after_h, na.rm = TRUE), median_price_beforeh = ifelse(is.na(median_price_beforeh), median_price_duringh, median_price_beforeh)) # By default reproduces error from original paper if (!correct) { tree_ap <- tree_ap %>% mutate(median_price_duringh = median_price_beforeh, median_price_afterh = 
median_price_beforeh) } # apply median price to hh assets tree_val <- x %>% left_join(tree_ap, by = "Tree") %>% mutate(appraisal_before = ifelse(is.na(sold_before_h), 0, ifelse(!is.na(price_before_h) & price_before_h > 0, sold_before_h * price_before_h, sold_before_h * median_price_beforeh)), appraisal_during = ifelse(is.na(sold_during_h), 0, ifelse(!is.na(price_during_h) & price_during_h > 0, sold_during_h * price_during_h, sold_during_h * median_price_duringh)), appraisal_after = ifelse(is.na(sold_after_h) | sold_after_h < 0, 0, ifelse(!is.na(price_after_h) & price_after_h > 0, sold_after_h * price_after_h, sold_after_h * median_price_afterh)), appraisal_consum = ifelse(is.na(consumed), 0, consumed * median_price_afterh), appraisal_total = (appraisal_before+ appraisal_during + appraisal_after + appraisal_consum)) # # summarise per hh tree_val %>% group_by(ident) %>% summarise(sale_tree = sum(appraisal_total, na.rm = TRUE)) } # savings_tree ---------------------------------------------------------------- # A function to appraise tree savings appraise_tree_savings <- function(x) { # replace to NA if no sale x <- x %>% mutate(price_before_h = ifelse(sold_before_h == 0 & price_before_h == 0, NA, price_before_h), price_during_h = ifelse(sold_during_h == 0 & price_during_h == 0, NA, price_during_h), price_after_h = ifelse(sold_after_h == 0 & price_after_h == 0, NA, price_after_h)) # Compute median prices by tree tree_ap <- x %>% group_by(`Tree`) %>% summarise(median_price_beforeh = median(price_before_h, na.rm = TRUE), median_price_duringh = median(price_during_h, na.rm = TRUE), median_price_afterh = median(price_after_h, na.rm = TRUE), median_price_beforeh = ifelse(is.na(median_price_beforeh), median_price_duringh, median_price_beforeh), median_price = ifelse(!is.na(median_price_afterh), median_price_afterh, ifelse(!is.na(median_price_duringh), median_price_duringh, median_price_beforeh))) # apply median price to hh assets tree_val <- x %>% left_join(tree_ap, by = 
"Tree") %>% mutate(appraisal = ifelse(is.na(stock) | stock < 0, 0, stock * median_price)) # # summarise per hh tree_val %>% group_by(ident) %>% summarise(savings_tree = sum(appraisal, na.rm = TRUE)) } # sale_veg -------------------------------------------------------------------- extract_veg <- function (x) { x %>% select(ident, e36_1:e40b8) %>% gather(variable, value, -ident) %>% mutate(`Crop` = gsub("(.+)([1-8])", "\\2", variable), variable = gsub("(.+)([1-8])", "\\1", variable), `Crop` = recode_factor(`Crop`, "1" = "Carrots", "2" = "Potatoes", "3" = "Tomatoes", "4" = "Onions", "5" = "Mint", "6" = "Melon", "7" = "Other 1", "8" = "Other 2"), variable = recode_factor(variable, "e36_" = "grew", "e37_" = "production", "e38_" = "consumed", "e39_" = "sold", "e40_" = "price_sold", "e40b" = "stock"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to appraise vegetable sales appraise_veg <- function(x, correct = FALSE) { # Compute median prices by crop veg_ap <- x %>% group_by(`Crop`) %>% summarise(median_price = median(price_sold, na.rm = TRUE)) # By default reproduces error from original paper if (correct) { veg_ap <- veg_ap %>% filter(`Crop` != "Other 1" & `Crop` != "Other 2") } # apply median price to hh assets veg_val <- x %>% left_join(veg_ap, by = "Crop") %>% mutate(appraisal_sold = ifelse(is.na(sold), 0, ifelse(!is.na(price_sold) & price_sold > 0, sold * price_sold, sold * median_price)), appraisal_consum = ifelse(is.na(consumed), 0, consumed * median_price), appraisal_total = (appraisal_sold + appraisal_consum)) # # summarise per hh veg_val %>% group_by(ident) %>% summarise(sale_veg = sum(appraisal_total, na.rm = TRUE)) } # savings_veg ----------------------------------------------------------------- # A function to appraise vegetable savings appraise_veg_savings <- function(x) { # Compute median prices by crop veg_ap <- x %>% group_by(`Crop`) %>% summarise(median_price = median(price_sold, na.rm = TRUE)) # apply median price 
to hh assets veg_val <- x %>% left_join(veg_ap, by = "Crop") %>% mutate(appraisal = ifelse(is.na(stock) | stock < 0, 0, stock * median_price)) # summarise per hh veg_val %>% group_by(ident) %>% summarise(savings_veg = sum(appraisal, na.rm = TRUE)) } # sale_livestockprod ---------------------------------------------------------- extract_lskprod <- function (x) { x %>% select(ident, f15_0_1:f18b4) %>% gather(variable, value, -ident) %>% mutate(`Prod` = gsub("(.+)([1-8])", "\\2", variable), variable = gsub("(.+)([1-8])", "\\1", variable), `Prod` = recode_factor(`Prod`, "1" = "Milk", "2" = "Butter", "3" = "Eggs", "4" = "Cheese", "5" = "Honey", "6" = "Wool"), variable = recode_factor(variable, "f15_0_" = "produced", "f15_" = "quantity", "f16_" = "consumed", "f17_" = "sold", "f18_" = "totval_sold", "f18b" = "stock"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to appraise livestock production sales appraise_lsk_sale <- function (x) { # Compute median prices lskprod_price <- x %>% group_by(`Prod`) %>% summarise(median_price_sold = median(totval_sold/sold, na.rm = TRUE)) # apply median price to consumed or gifted animals lskprod_val <- x %>% left_join(lskprod_price, by = "Prod") %>% mutate(val_sold = ifelse(totval_sold > 0 & !is.na(totval_sold), totval_sold, 0), val_cons_gift = ifelse(consumed >=0 & !is.na(consumed), consumed * median_price_sold, 0), val_sale = val_sold + val_cons_gift) # summarise by household lskprod_val %>% group_by(ident) %>% summarise(sale_livestockprod = sum(val_sale, na.rm = TRUE)) } # expense_agriinputs ---------------------------------------------------------- sum_agriinputs <- function (x) { x %>% select(ident, starts_with("e49_")) %>% mutate(expense_agriinputs = rowSums(select(., starts_with("e49_")), na.rm = TRUE), ident = as.character(ident)) %>% select(ident, expense_agriinputs) } # expense_agrilabor ----------------------------------------------------------- sum_agrilabor <- function(x) { x%>% 
select(ident, e44:e48_hr) %>%
    mutate(permanent = ifelse(!is.na(e44) & e45 >= 0 & !is.na(e45),
                              e44*e45*12, 0),
           seasonal_harvest = ifelse(!is.na(e46) & e46 >= 0 &
                                       !is.na(e47_r) & e47_r >= 0 &
                                       !is.na(e48_r) & e48_r >= 0,
                                     e46*(e47_r*e48_r), 0),
           seasonal_nonharvest = ifelse(!is.na(e46) & e46 >= 0 &
                                          !is.na(e47_hr) & e47_hr >= 0 &
                                          !is.na(e48_hr) & e48_hr >= 0,
                                        e46*(e47_hr*e48_hr), 0),
           expense_agrilabor = permanent + seasonal_harvest + seasonal_nonharvest,
           ident = as.character(ident)) %>%
    select(ident, expense_agrilabor)
}

# expense_livestockmatrent ------------------------------------------------
# A function to assess livestock material rental expenses
val_livestockmatrent <- function(x) {
  x %>%
    group_by(ident) %>%
    summarise(expense_livestockmatrent = sum(rent_amount, na.rm = TRUE))
}

# expense_livestockmatinv -----------------------------------------------------
# A function to assess investments (purchases) in livestock material
val_livestockmatinv <- function(x) {
  x %>%
    group_by(ident) %>%
    summarise(expense_livestockmatinv = sum(purchase_price, na.rm = TRUE))
}

# expense_livestockaniminv ----------------------------------------------------
# A function to assess investment in livestock animals
val_livestockaniminv <- function(x) {
  x %>%
    group_by(ident) %>%
    summarise(expense_livestockaniminv = sum(totval_purchased, na.rm = TRUE))
}

# expense_livestocklabor ------------------------------------------------------
# A function to appraise livestock labor expenses
appraise_livestocklabor <- function(x) {
  x %>%
    select(ident,
           nb_permanent = f22,
           wage_permanent = f23,
           nb_seasonal = f24,
           day_seasonal = f25,
           day_wage_seasonal =f26) %>%
    mutate(expense_permanent = ifelse(!is.na(nb_permanent) & nb_permanent > 0 &
                                        !is.na(wage_permanent) & wage_permanent > 0,
                                      nb_permanent * wage_permanent * 12, 0),
           expense_seasonal = ifelse(!is.na(nb_seasonal) & nb_seasonal >0 &
                                       !is.na(day_seasonal) & day_seasonal > 0 &
                                       !is.na(day_wage_seasonal) & day_wage_seasonal > 0,
                                     nb_seasonal * day_seasonal *
day_wage_seasonal, 0),
           expense_total = expense_permanent + expense_seasonal) %>%
    group_by(ident) %>%
    summarise(expense_livestocklabor = sum(expense_total, na.rm = TRUE))
}

# expense_livestockinputs -----------------------------------------------------
appraise_livestockinputs <- function(x) {
  x %>%
    select(ident, matches("f2[7-8]_[1-3]|f2[7-8]_[4-6]m")) %>%
    gather(variable, value, -ident) %>%
    mutate(prod = gsub("(f2[7-8])_([1-6])(m?)", "\\2", variable),
           variable = gsub("(f2[7-8])_([1-6])(m?)", "\\1", variable),
           year_cost = ifelse(variable == "f27", value * 12, value)) %>%
    group_by(ident) %>%
    summarise(expense_livestockinputs = sum(year_cost, na.rm = TRUE))
}

# income_dep ------------------------------------------------------------------
# A function to extract data on incomes from dependent activities
extract_income_dep <- function(x) {
  x %>%
    select(ident, matches("h1[0-3](_|[b-c])[0-9]+")) %>%
    gather(variable, value, -ident) %>%
    mutate(hh_mb = gsub("(h1[0-3](_|[b-c]))([0-9]+)", "\\3", variable),
           variable = gsub("(h1[0-3](_|[b-c]))([0-9]+)", "\\1", variable),
           variable = recode_factor(variable,
                                    "h10_" = "salary_month",
                                    "h11_" = "salary_year",
                                    "h11b" = "wage_month",
                                    "h11c" = "wage_year",
                                    "h12_" = "pension_month",
                                    "h13_" = "pension_year"),
           value = as.numeric(value)) %>%
    spread(key = variable, value = value)
}

# A function to sum up incomes
sum_income_dep <- function(x) {
  x %>%
    mutate(sum_salary = ifelse(!is.na(salary_year) & salary_year >= 0,
                               salary_year, 0),
           sum_wage = ifelse(!is.na(wage_year) & wage_year >= 0,
                             wage_year, 0),
           sum_income = sum_salary + sum_wage) %>%
    group_by(ident) %>%
    summarise(income_dep = sum(sum_income, na.rm = TRUE),
              income_pension = sum(pension_year, na.rm = TRUE))
}

# consumption -----------------------------------------------------------------
# A function to extract week consumption (mostly food)
extract_week <- function (x) {
  x %>%
    select(ident, matches("^h[1-2]_[0-9]+$")) %>%
    gather(variable, value, -ident) %>%
    mutate(item =
gsub("^(h[1-2])_([0-9]+)$", "\\2", variable), variable = gsub("^(h[1-2])_([0-9]+)$", "\\1", variable), item = recode_factor(item, "1" = "Flour", "2" = "Vegetables", "3" = "Fruits and nuts", "4" = "Milk or dairy", "5" = "Bread", "6" = "Oil", "7" = "Tea", "8" = "Coffee", "9" = "Sugar", "10" = "Chicken", "11" = "Meet", "12" = "Eggs", "13" = "Fish", "14" = "Spices", "15" = "Transport", "16" = "Cigarettes", "17" = "Meals from outside", "18" = "Drinks from outside", "19" = "Other1", "20" = "Other2", "21" = "Other3"), variable = recode_factor(variable, "h1" = "bought", "h2" = "autocons"), value = as.numeric(value)) %>% spread(key = variable, value = value) %>% mutate(consummed = ifelse(!is.na(bought), ifelse(!is.na(autocons), bought + autocons, bought), ifelse(!is.na(autocons), autocons, 0))) } # A function to extract month consumption extract_month <- function (x) { x %>% select(ident, matches("^h[3]_[0-9]+$")) %>% gather(variable, value, -ident) %>% mutate(item = gsub("^(h[3])_([0-9]+)$", "\\2", variable), variable = gsub("^(h3)_([0-9]+)$", "\\1", variable), item = recode_factor(item, "1" = "Water", "2" = "Electricity", "3" = "Telephone", "4" = "Lighting", "5" = "Gas", "6" = "Soap", "7" = "Cleaning", "8" = "Healthcare", "9" = "Rent", "10" = "Credit repayment", "11" = "Donations to Imam", "12" = "Journals", "13" = "Rent goods", "14" = "Leisure", "15" = "Other1", "16" = "Other2", "17" = "Other3"), variable = recode_factor(variable, "h3" = "consummed"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to extract year non-durable consumption (foreseen) extract_year <- function (x) { x %>% select(ident, matches("^h[4-5]_[0-9]+$")) %>% gather(variable, value, -ident) %>% mutate(item = gsub("^(h[4-5])_([0-9]+)$", "\\2", variable), variable = gsub("^(h[4-5])_([0-9]+)$", "\\1", variable), item = recode_factor(item, "1" = "School fees", "2" = "Clothes", "3" = "Shoes", "4" = "House", "5" = "Ramadan food", "6" = "Aid el Kebir", "7" = "Travelling", 
"8" = "Gifts", "9" = "Other1", "10" = "Other2"), variable = recode_factor(variable, "h4" = "consummed", "h5" = "financing_source"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to extract year non-durable consumption of unforseen extract_year_occas <- function (x) { x %>% select(ident, matches("^h[6-7]_[0-9]+$")) %>% gather(variable, value, -ident) %>% mutate(item = gsub("^(h[6-7])_([0-9]+)$", "\\2", variable), variable = gsub("^(h[6-7])_([0-9]+)$", "\\1", variable), item = recode_factor(item, "1" = "Pilgrimage", "2" = "Birth", "3" = "Circumcision", "4" = "Marriage", "5" = "Healthcare", "6" = "Death", "7" = "Accident", "8" = "Other1", "9" = "Other2"), variable = recode_factor(variable, "h6" = "consummed", "h7" = "financing_source"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to extract year non-durable consumption of unforseen extract_durable <- function (x) { x %>% select(ident, matches("^c[1-4]_[0-9]+$")) %>% gather(variable, value, -ident) %>% mutate(item = gsub("^(c[1-4])_([0-9]+)$", "\\2", variable), variable = gsub("^(c[1-4])_([0-9]+)$", "\\1", variable), variable = recode_factor(variable, "c1" = "owns", "c2" = "number", "c3" = "recent", "c4" = "consummed"), value = as.numeric(value)) %>% spread(key = variable, value = value) } # A function to sum up consmptions # Specify period of recollection for consuption: "week", "month" or "year" # Same factors than Crepon and al will be applied: *4.345 if week; /12 if year # Exclude goods that should not be accounted for sum_cons <- function(x, exclude = c(""), period = "month", varname = "cons") { quotient <- ifelse(period == "week", 1/4.345, ifelse(period == "year", 12, 1)) x <- x %>% filter(!(item %in% exclude)) x <- x %>% group_by(ident) %>% summarise(total = sum(consummed, na.rm = TRUE) / quotient) colnames(x) <- c("ident", varname) return(x) } # reshape individuals --------------------------------------------------------- # A function 
# to extract, reshape and label individual information
extract_individuals <- function (x) {
  x %>%
    select(ident, matches("^a[1-9]_[0-9]+$|^a1[0-9]_[0-9]+$")) %>%
    gather(variable, value, -ident) %>%
    mutate(order = gsub("(a[1-9][0-9]?)_([0-9]+$)", "\\2", variable),
           variable = gsub("(a[1-9][0-9]?)_([0-9]+$)", "\\1", variable),
           value = as.numeric(value),
           order = as.numeric(order)) %>%
    spread(key = variable, value = value)
}

# members_resid & nadults_resid------------------------------------------------
count_resid <- function(x, correct = FALSE) {
  # if correct, HH with no hh members registered should be NA
  if (correct) {
    x <- x %>%
      filter(a5 == 3 | a5 == 5 | a3 == 1)
  }
  x %>%
    group_by(ident) %>%
    summarise(members_resid = sum(a5 == 3 | a5 == 5 | a3 == 1, na.rm = TRUE),
              nadults_resid = sum((a5 == 3 | a5 == 5 | a3 == 1) & a7 >= 16,
                                  na.rm = TRUE))
}

# head_age --------------------------------------------------------------------
age_head <- function(x, correct = FALSE) {
  age <- x %>%
    filter(a3 == 1) %>%
    select(ident, head_age = a7) %>%
    group_by(ident) %>%
    # keep both the first and the last value, because several members are
    # sometimes registered as head, which induces some inconsistencies
    summarise(head_age_l = last(head_age),
              head_age_f = first(head_age)) %>%
    mutate(head_age = ifelse(is.na(head_age_l), head_age_f, head_age_l)) %>%
    select(ident, head_age)
  # one row per household; head_age is NA when no head is registered
  # (assumed intent for correct = TRUE, by analogy with count_resid; the
  # previous version returned the untouched individual table in that case)
  out <- x %>%
    select(ident) %>%
    unique() %>%
    left_join(age, by = "ident")
  if (!correct) {
    # if not correct, a missing head age is set to 0
    out <- out %>%
      mutate(head_age = ifelse(is.na(head_age), 0, head_age))
  }
  return(out)
}

# act_livestock & act_business ------------------------------------------------
# A function to extract activity list
extract_activities <- function (x) {
  x %>%
    select(ident, matches("^d[1-6]_(.+)?[1-6]$")) %>%
    gather(variable, value, -ident) %>%
    mutate(activity = gsub("(^d[1-6]_(.+)?)([1-6])$", "\\3", variable),
           variable = gsub("(^d[1-6]_(.+)?)([1-6])$", "\\1", variable),
           variable = recode_factor(variable,
                                    "d1_nom" = "nom",
                                    "d1_code" = "code",
                                    "d2_" = "sector",
                                    "d3" = "member_responsible",
                                    "d4" = "place",
"d5" = "need")) %>% spread(key = variable, value = value) %>% mutate(code = as.numeric(code), sector = as.numeric(sector)) } # There is a small mistake in the way variables are recoded in Crepon et al. # booleans not consistent, should keep sector == | code == code_activity <- function(x) { x %>% mutate(code = code / 100, act_livestock = ifelse(sector == 2 | (is.na(sector) & !is.na(nom) & nom != "" & code >= 2 & code < 3), 1, 0), act_business = ifelse((!is.na(nom) & nom != "" & sector >= 3 & sector <= 6) | (is.na(sector) & !is.na(nom) & nom != "" & code >= 3 & code < 7), 1, 0)) %>% group_by(ident) %>% summarise(act_livestock = sum(act_livestock, na.rm = TRUE), act_business = sum(act_business, na.rm = TRUE)) %>% mutate(act_livestock = ifelse(act_livestock > 0, 1, 0), act_business = ifelse(act_business > 0, 1, 0)) } # resp_activ ------------------------------------------------------------------ # A function to identify respondent to section D of questionnaire # takes as argument individual reshape + full table respondent_id <- function(x, y) { if ("id_rep_d" %in% colnames(y)) { x_t <- x %>% select(ident, order, a1, a3) %>% mutate(ident = as.character(ident), a1 = as.character(a1), order = as.character(order)) out <- y %>% select(ident, id_rep_d) %>% mutate(ident = as.character(ident), id_rep_d = as.character(id_rep_d)) %>% left_join(x_t, by = c("ident" = "ident", "id_rep_d" = "order")) %>% mutate(ccm_resp_activ = ifelse(a3 == 2, 1, 0), other_resp_activ = ifelse(a3 > 2, 1, 0)) } else { # return empty variables if no resp var (ie. if baseline) out <- y %>% select(ident) %>% mutate(ident = as.character(ident), id_rep_d = NA, ccm_resp_activ = NA, other_resp_activ = NA) } return(out) } # rename credit sources ------------------------------------------------------- # Our functions to generate credit sources variables have "nice" names # (with capital letters and spaces) to make nicer tables and graphs # We rename them exactly as Crepon et al. 
# to facilitate code comparison
rename_cr_src <- function(x) {
  x <- x %>%
    rename(borrowed_alamana = `Al Amana_borrowed`,
           borrowed_oamc = `Other MFI_borrowed`,
           borrowed_oformal = `Other formal_borrowed`,
           borrowed_informal = `Informal_borrowed`,
           loansamt_alamana = `Al Amana_loansamt`,
           loansamt_oamc = `Other MFI_loansamt`,
           loansamt_oformal = `Other formal_loansamt`,
           loansamt_informal = `Informal_loansamt`)
  # utility and other credit sources are only renamed when present
  if ("Utility_loansamt" %in% colnames(x)) {
    x <- x %>%
      rename(loansamt_branching = `Utility_loansamt`,
             borrowed_branching = `Utility_borrowed`)
  }
  if ("Other_loansamt" %in% colnames(x)) {
    x <- x %>%
      rename(loansamt_other = `Other_loansamt`,
             borrowed_other = `Other_borrowed`)
  }
  return(x)
}

# Compute aggregates ----------------------------------------------------------
sum_outcomes <- function (x, correct = FALSE) {
  x <- x %>%
    ungroup() %>% # we inherit unnecessary grouping from borrowed_ table
    mutate( # then it's the same as in Crépon et al. dofiles
      sale_agri = sale_cereal + sale_tree + sale_veg,
      expense_businputs = expbusinp1 + expbusinp2,
      savings_agri = savings_cereal + savings_tree + savings_veg,
      prod_agri= sale_agri + savings_agri,
      astock_livestock = savings_livestock + asset_livestock,
      stock_agri=savings_agri + asset_agri,
      astock_livestock = savings_livestock + asset_livestock,
      astock_business = savings_business + asset_business,
      sale_livestock = sale_livestockanim + sale_livestockprod,
      output_total= prod_agri + sale_livestock + sale_business,
      assets_total = asset_agri + astock_livestock + asset_business,
      # investment variables are duplicated with expense variables
      # inv_livestock = inv_livestockmat + inv_livestockanim,
      inv_livestock = expense_livestockaniminv + expense_livestockmatinv,
      # idem: inv_total = inv_business_assets + inv_livestock + inv_agri
      inv_total = expense_businv + inv_livestock + inv_agri,
      expense_agri = expense_agriinputs + expense_agrilabor + expense_agriinv +
        expense_agrirent,
      expense_business = expense_busrent + expense_businputs + expense_businv +
expense_buslabor, expense_livestock = expense_livestockmatrent + expense_livestockmatinv + expense_livestockaniminv + expense_livestocklabor + expense_livestockinputs, expense_total = expense_agri + expense_business + expense_livestock, profit_agri = prod_agri - expense_agri, profit_livestock = sale_livestock - expense_livestock, profit_business = sale_business - expense_business, profit_total=profit_agri + profit_livestock + profit_business) # Crepon et al. made an error and did not include credit from oamc if (correct) { x <- x %>% mutate(borrowed_test = select(., starts_with("borrowed")) %>% rowSums(na.rm = TRUE), # borrowed_total = ifelse(borrowed_test > 0, 1, 0), loansamt_total = select(., starts_with("loansamt")) %>% rowSums(na.rm = TRUE)) } else { # reproduce Crepon et al. error not to include other MFI in borrowings at BL x <- x %>% mutate(borrowed_test = select(., starts_with("borrowed"), -borrowed_oamc) %>% rowSums(na.rm = TRUE), # borrowed_total = ifelse(borrowed_test > 0, 1, 0), loansamt_total = select(., starts_with("loansamt"), -borrowed_oamc) %>% rowSums(na.rm = TRUE)) } return(x) } prepare <- function(x, cr_other_as_util = TRUE, #TRUE for as is endline cr_active_only = FALSE, #TRUE for BL in original paper include_cr_oamc = TRUE, #FALSE for BL in original study = control var ast_ap = ast_ap_blel, # By default takes appraisal from both rounds exclude_agro_ast = c("Tractor", "Reaper"), #normally should be "" excl_imp_share_agro_ast = TRUE,#excl hh046012: shares tractor+reaper at EL exp_agriinv_ok = FALSE,#expenses /10 for tractors and reapers lsk_ast_ok = FALSE, #one unit price added to total asset lsk for each cat manual_recode_bas = TRUE, #authors manually recode 2 bus assets cases at EL exclude_var_bsale = "sale_services", #sales from services not included at EL exclude_item_bsale = c(5,6), #only 4/6 were taken into acount at EL cereal_sales_ok = FALSE,#sales before harvest appraised at after prices cereal_sav_ok = FALSE, #savings are appraised at 
price before harvest tree_price_ok = FALSE, #sales during and after harvest at before price veg_price_ok = FALSE, #all "others" are appraised at the sime price mb_resid_ok = FALSE, #Household without residing members are 0 instead of NA head_age_ok = FALSE) {#When several heads takes first valid age # PREPARING VARIABLES --------------------------------------------------------- # Select and reshape credit data cr <- x %>% extract_cred(other_as_util = cr_other_as_util) # Borrowed from diffent source type borr <- cr %>% count_cred(y = source_type, active_only = cr_active_only) %>% borrowed() # Amounts borrowed from source types loansamt <- amt(cr) # Consent to repay cons_repay <- recode_cons_repay(x) # Extract agri assets ast_agr <- extract_ast_agr(x, excl_imp_share = excl_imp_share_agro_ast) # Appraise agricultural assets asset_agri = appraise_agro_ast(ast_agr, ast_ap = ast_ap, exclude = exclude_agro_ast) # Adds up agriculture material renting expenses expense_agrirent <- sum_agrirent(ast_agr) # Appraise agricultural investments inv_agri <- appraise_inv_agri(ast_agr) # Appraise agricultural expenses, error: /10 for tractors and cars expense_agriinv <- op_expense_agriinv(inv_agri, correct = exp_agriinv_ok) # Extract livestock animals anim <- extract_lsk(x) # Value of remaining ("saving") livestock savings_livestock <- appraise_lsk(anim) # Value sales of animals sale_livestockanim <- appraise_anim_sale(anim) # Extract livestock assets lsk_ast <- extract_lsk_ast(x) # Appraise livestock one unit price added to total asset_livestock <- appraise_lsk_ast(lsk_ast, correct = lsk_ast_ok) # Extract and appraise business savings savings_business <- x %>% extract_bac() %>% appraise_bac_s() # Extract business assets bas <- extract_bas(x, manual_recode = manual_recode_bas) # Appraise those assets asset_business <- appraise_bas(bas) # Sum up renting expenses for business assets expense_busrent <- sum_busrent(bas) # Sum up invesments for business assets expense_businv <- 
sum_businv(bas) # Extract business expenses first questionnaire section bus_exp1 <- extract_bus_exp1(x) # Appraise expenses first section expbusinp1 <- val_expbusinp1(bus_exp1) # Extract business expenses second questionnaire section bus_exp2 <- extract_bus_exp2(x) # Appraise expenses second section expbusinp2 <- val_expbusinp2(bus_exp2) # Extract and appriase business laboral expenses expense_buslabor <- x %>% extract_buslabor() %>% val_buslabor() # Extract business sales bsale <- x %>% extract_bsale(exclude_var = exclude_var_bsale, exclude_item = exclude_item_bsale) %>% appraise_bsale() # Extract cereals cereal <- extract_cereal(x) # Appraise sales of cereals sale_cereal <- appraise_cereal(cereal, correct = cereal_sales_ok) # Appraise savings of cereals savings_cereal <- appraise_cereal_savings(cereal, correct = cereal_sav_ok) # Extract variables related to tree production tree <- extract_tree(x) # Appraise sales from tree production sale_tree <- appraise_tree(tree, correct = tree_price_ok) # Appraise savings from tree production savings_tree <- appraise_tree_savings(tree) # Extract vegetable production veg <- extract_veg(x) # Appraise vegetable sales sale_veg <- appraise_veg(veg, correct = veg_price_ok) # Appraise vegetable savings savings_veg <- appraise_veg_savings(veg) # Extract and appraise livestock production sale_livestockprod <- x %>% extract_lskprod() %>% appraise_lsk_sale() # Sum up expenses for agricultural inuputs expense_agriinputs <- sum_agriinputs(x) # Sum up expenses for agricultural labor expense_agrilabor <- sum_agrilabor(x) # Sum up expenses for rent of agricultural material expense_livestockmatrent <- val_livestockmatrent(lsk_ast) # Sum up investments in agricultural material expense_livestockmatinv <- val_livestockmatinv(lsk_ast) # Sum up investments in livestock material expense_livestockaniminv <- val_livestockaniminv(anim) # Appraise expenses in labor for livestock expense_livestocklabor <- appraise_livestocklabor(x) # Appraise expenses 
for livestock inputs expense_livestockinputs = appraise_livestockinputs(x) # Extract and sum up incomes from dependent activities income_dep <- x %>% extract_income_dep() %>% sum_income_dep() # Extract and appraise consumption of non durables by recall category cons_week <- x %>% extract_week() %>% sum_cons(period = "week", varname = "cons_week") cons_month <- x %>% extract_month() %>% sum_cons(exclude = "Credit repayment", period = "month", varname = "cons_month") cons_year <- x %>% extract_year() %>% sum_cons(period = "year", varname = "cons_year") # occasional expenses cons_year_occas <- x %>% extract_year_occas() %>% sum_cons(period = "year", varname = "cons_year_occas") # Extract consumption of durables cons_durable <- x %>% extract_durable() %>% sum_cons(period = "year", varname = "cons_durable") # Consolidate consumption consumption <- cons_week %>% left_join(cons_month, by = "ident") %>% left_join(cons_year, by = "ident") %>% left_join(cons_year_occas, by = "ident") %>% left_join(cons_durable, by = "ident") %>% mutate(consumption = cons_week + cons_month + cons_year + cons_year_occas + cons_durable) # Extract individuals ind <- extract_individuals(x) # HH with no member should be NA members_resid <- count_resid(ind, correct = mb_resid_ok) # idem head_age <- age_head(ind, correct = head_age_ok) # Extract answer to activities activities <- extract_activities(x) # Dummy if livestock and other activities act_livestock <- x %>% extract_activities() %>% code_activity() # Identify respondent relation do hh head resp_activ <- respondent_id(ind, x) additionnal <- x %>% select(ident, treatment, paire, demi_paire, score1, score2, score3, one_of("random5", "random5_end", "newhh", "client", "wave"), m1:m14) # CONSOLIDATING VARIABLES ----------------------------------------------------- consolidated <- borr %>% left_join(loansamt, by = "ident", suffix = c("_borrowed", "_loansamt")) %>% left_join(cons_repay, by = "ident") %>% left_join(asset_agri, by = "ident") %>% 
left_join(expense_agrirent, by = "ident") %>% left_join(inv_agri, by = "ident") %>% left_join(expense_agriinv, by = "ident") %>% left_join(savings_livestock, by = "ident") %>% left_join(sale_livestockanim, by = "ident") %>% left_join(asset_livestock, by = "ident") %>% left_join(savings_business, by = "ident") %>% left_join(asset_business, by = "ident") %>% left_join(expense_busrent, by = "ident") %>% left_join(expense_businv , by = "ident") %>% left_join(expbusinp1, by = "ident") %>% left_join(expbusinp2, by = "ident") %>% left_join(expense_buslabor, by = "ident") %>% left_join(bsale, by = "ident") %>% left_join(sale_cereal, by = "ident") %>% left_join(savings_cereal, by = "ident") %>% left_join(sale_tree, by = "ident") %>% left_join(savings_tree, by = "ident") %>% left_join(sale_veg, by = "ident") %>% left_join(savings_veg, by = "ident") %>% left_join(sale_livestockprod, by = "ident") %>% left_join(expense_agriinputs, by = "ident") %>% left_join(expense_agrilabor, by = "ident") %>% left_join(expense_livestockmatrent, by = "ident") %>% left_join(expense_livestockmatinv, by = "ident") %>% left_join(expense_livestockaniminv, by = "ident") %>% left_join(expense_livestocklabor, by = "ident") %>% left_join(expense_livestockinputs, by = "ident") %>% left_join(income_dep, by = "ident") %>% left_join(consumption, by = "ident") %>% left_join(members_resid, by = "ident") %>% left_join(head_age, by = "ident") %>% left_join(act_livestock, by = "ident") %>% left_join(resp_activ, by = "ident") %>% left_join(additionnal, by = "ident") consolidated %>% rename_cr_src() %>% sum_outcomes(correct = include_cr_oamc) } # List of variables trimmed in Crepon et al. 
trim_vars_el <- c("loansamt_total", "loansamt_alamana", "loansamt_oamc",
                  "loansamt_oformal", "loansamt_branching", "loansamt_other",
                  "loansamt_informal", "assets_total", "stock_agri",
                  "astock_livestock", "astock_business", "output_total",
                  "sale_agri", "savings_agri", "sale_livestockanim",
                  "sale_livestockprod", "sale_business", "expense_total",
                  "expense_agri", "expense_livestock", "expense_business",
                  "income_dep", "consumption")

# Note that quantiles behave differently by default in R and Stata
# See: http://data.princeton.edu/stata/markdown/quantiles.htm
flag_trimobs <- function(x, trim_vars, threshold = 0.995) {
  decile9 <- x %>%
    select(one_of(trim_vars)) %>%
    mutate_all(funs(replace(.,. == 0, NA))) %>% # discard 0 values
    summarise_all(funs(quantile(., probs = 0.9, na.rm = TRUE, type =2)))
  # ratio value / variable ninetieth percentile
  y <- select(x, ident, one_of(trim_vars))
  for (i in seq_along(decile9)) {
    y[,i+1] = y[,i+1]/decile9[[i]]
  }
  # max ratio for each household
  x$maxratio <- apply(y[,-1], 1, max)
  # cut-off: quantile of the max ratios given by `threshold`
  ratiocut <- quantile(x$maxratio, threshold, na.rm = TRUE, type = 2)
  x$trimobs <- ifelse(x$maxratio > ratiocut, 1, 0)
  return(x)
}

# dummies are created for missing values
set_missing_controlvars <- function(x) {
  x %>%
    mutate(ccm_resp_activ_d = ifelse(is.na(ccm_resp_activ_el), 1, 0),
           ccm_resp_activ = ifelse(ccm_resp_activ_d == 1, 0, ccm_resp_activ_el),
           other_resp_activ_d = ifelse(is.na(other_resp_activ_el), 1, 0),
           other_resp_activ = ifelse(other_resp_activ_d == 1, 0, other_resp_activ_el),
           members_resid_bl_d = ifelse(is.na(members_resid_bl), 1, 0),
           members_resid_bl = ifelse(members_resid_bl_d == 1, 0, members_resid_bl),
           nadults_resid_bl_d = ifelse(is.na(nadults_resid_bl), 1, 0),
           nadults_resid_bl = ifelse(nadults_resid_bl_d == 1, 0, nadults_resid_bl),
           head_age_bl_d = ifelse(is.na(head_age_bl), 1, 0),
           head_age_bl = ifelse(head_age_bl_d == 1, 0, head_age_bl),
           act_livestock_bl_d = ifelse(is.na(act_livestock_bl), 1, 0),
           act_livestock_bl =
ifelse(act_livestock_bl_d == 1, 0, act_livestock_bl), act_business_bl_d = ifelse(is.na(act_business_bl), 1, 0), act_business_bl = ifelse(act_business_bl_d == 1, 0, act_business_bl), borrowed_total_bl_d = ifelse(is.na(borrowed_total_bl), 1, 0), borrowed_total_bl = ifelse(borrowed_total_bl_d == 1, 0, borrowed_total_bl), score1 = ifelse(is.na(score1_bl), score1_el, score1_bl), score2 = ifelse(is.na(score2_bl), score2_el, score2_bl), score3 = ifelse(is.na(score3_bl), score3_el, score3_bl), random5 = ifelse(is.na(random5), 0, random5), random5 = ifelse(is.na(random5_end) | is.na(newhh), random5, ifelse(random5_end == 1 & newhh == 0, random5_end, random5)), group = ifelse(!(ident %in% bl$ident) & ident %in% el$ident, "Added", ifelse(ident %in% bl$ident & !(ident %in% el$ident), "Dropout", ifelse(ident %in% bl$ident & ident %in% el$ident, "Panel", "Error: check formula")))) } # Compute criteria used to select observation to be regressed subsample <- function(x) { x %>% filter(newhh == 1) %>% group_by(demi_paire_el) %>% summarise(min_score2_newhh = min(score2, na.rm = TRUE), min_score3_newhh = min(score3, na.rm = TRUE)) %>% right_join(x, by = "demi_paire_el") %>% mutate(random5_final = ifelse(is.na(wave_el), random5, ifelse(score2 > min_score2_newhh & random5 == 1 & wave_el == 2, 0, ifelse(score3 > min_score3_newhh & random5 == 1 & wave_el == 34, 0, random5))), samplemodel = ifelse(random5_final == 0, 1, 0)) } # A function to add stars with usual thresholds make_stars <- function(pval) { stars = "" if(pval <= 0.001) stars = "***" if(pval > 0.001 & pval <= 0.01) stars = "**" if(pval > 0.01 & pval <= 0.05) stars = "*" if(pval > 0.05 & pval <= 0.1) stars = "." 
  stars
}

# A function to add stars with thresholds as in paper
make_stars2 <- function(pval) {
  stars = ""
  if(pval <= 0.01) stars = "***"
  if(pval > 0.01 & pval <= 0.05) stars = "**"
  if(pval > 0.05 & pval <= 0.1) stars = "*"
  stars
}

# A function to flag only p-values at or below 0.001
make_stars3 <- function(pval) {
  stars = ""
  if(pval <= 0.001) stars = "****"
  stars
}

# A function to add a thousands separator (works up to 999,999)
c_th <- function (x) {
  if (as.numeric(str_extract(x, "(\\d+)")) > 999999) {
    print("error : number is too long, my formula only works until 999,999")
  } else if (as.numeric(str_extract(x, "(\\d+)")) < 1000) {
    x
  } else {
    # the decimal point is escaped so that it cannot match a digit
    str_replace(x, "(\\d{1,3}?)(\\d{3})*(\\.\\d+)?$", "\\1,\\2\\3")
  }
}

# regressing as in Crepon et al for a vector of outcome vars
reg <- function(x, dep_vars, fullreg = FALSE, var_out = "treatment_el",
                separator = "\n",
                rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_bl + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + factor(paire_el)") {
  output <- vector("character", length(dep_vars))
  separator = paste(separator, "(", sep = "")
  names(output) <- dep_vars
  for (i in seq_along(dep_vars)) {
    form <- paste(dep_vars[i], rest_form, sep = "")
    regress <- lm(form, data = x)
    result <- coeftest(regress, vcov = vcovCL(regress, x$demi_paire_el))
    coef_out <- custom_round(result[var_out,1])
    se_out <- custom_round(result[var_out,2])
    symbol_out <- make_stars2(result[var_out,4])
    out <- paste(separator, c_th(se_out), ")", sep = "")
    out <- paste(c_th(coef_out), symbol_out, out, sep = "")
    output[i] <- out
  }
  # return either the full results of the last regression only, or the
  # formatted output for every dependent variable
  if (fullreg) {
    return(result)
  } else {
    return(output)
  }
}

reg_balance <- function(treatment, dep_vars, controls, weights = NULL,
                        cluster, data) {
  treatment_v
<- enquo(treatment) treatment_t <- as.character(treatment_v[2]) # Creating a table to store the results output <- tibble(`Dependent variable` = character(), `Obs.` = character(), `Obs. controls` = character(), `Mean controls` = character(), `SD controls` = character(), `Coeff.` = character(), `p-value` = character()) # Concatenating controls controls <- ifelse(controls != "", paste(" + ", controls, sep = ""), "") for (i in seq_along(dep_vars)) { # print(dep_vars[i]) form <- paste(dep_vars[i], " ~ ", treatment_t, controls, sep = " ") if (!is.null(weights)) { data$weights <- data[,weights] regress <- lm(form, weights = weights, data = data) } else { regress <- lm(form, data = data) } data_controls <- data %>% filter(!!treatment_v == 0) %>% select(dep_vars[i]) %>% unlist data$cluster <- cluster result <- coeftest(regress, vcov = vcovCL(regress, data[,cluster])) coef_out <- custom_round(result[treatment_t,1]) p_out <- round(result[treatment_t,4], 3) symbol_out <- make_stars2(result[treatment_t,4]) coef_out <- paste(coef_out, symbol_out, sep = "") output[i, 1] <- dep_vars[i] output[i, 2] <- nrow(data) output[i, 3] <- length(data_controls) output[i, 4] <- custom_round(mean(data_controls, na.rm = TRUE)) output[i, 5] <- custom_round(sd(data_controls, na.rm = TRUE)) output[i, 6] <- coef_out output[i, 7] <- p_out } return(output) } # A function to round custom_round <- function(x) { y = ifelse(abs(x) >= 100, 0, ifelse(abs(x) >= 10, 1, ifelse(abs(x) >= 1, 2, 3))) round(x, y) } ``` ```{r regress_asis, message=F, warning=F, echo=F} # Prepare baseline, endline and merged datasets ------------------------------- consolidated_bl_avtrim <- bl %>% prepare(include_cr_oamc = FALSE, ast_ap = "", cr_active_only = TRUE) consolidated_bl <- consolidated_bl_avtrim %>% flag_trimobs(trim_vars = trim_vars_el) consolidated_el_avtrim <- el %>% prepare(ast_ap = "") consolidated_el <- consolidated_el_avtrim %>% flag_trimobs(trim_vars = trim_vars_el) # merge baseline and endline consolidated <- 
consolidated_bl %>% full_join(consolidated_el, by = "ident", suffix = c("_bl", "_el")) %>% set_missing_controlvars() # Balance at baseline as in Crepon et al. ------------------------------------ consolidated_bl <- consolidated_bl %>% mutate(borrowed_oformal2 = ifelse(borrowed_oamc > 0 | borrowed_oformal > 0, 1,0)) bal_bl <- reg_balance(treatment = treatment, dep_vars = c("members_resid", "nadults_resid", "head_age", "act_livestock", "act_business", "borrowed_alamana", "borrowed_oformal2", "borrowed_informal", "borrowed_branching"), controls = "factor(paire)", cluster = "demi_paire", data = consolidated_bl) bal_bl <- bal_bl %>% mutate(`Dependent variable` = recode(`Dependent variable`, "members_resid" = "Number of household members", "nadults_resid" = "Number of adults", "head_age" = "Head age", "act_livestock" = "Does animal husbandry", "act_business" = "Runs a non-farm business", "borrowed_alamana" = "Loan from Al Amana", "borrowed_oformal2" = "Loan from other formal institution", "borrowed_informal" = "Informal loan", "borrowed_branching" = "Electricity or water connection loan")) colnames(bal_bl) <- c(" ", "Obs. ", "Obs.", "Mean", "SD", "Coeff.", "p-value") bal_bl %>% mutate(`Obs. ` = as.numeric(`Obs. `), `Obs.` = as.numeric(`Obs.`)) %>% # mutate_all(linebreak) %>% kable(format = "latex", booktabs = T, escape = F, format.args = list(big.mark = ","), align = "l", caption = "Summary statistics: reproduction of CDDP balance tests at baseline") %>% kable_styling(latex_options = "HOLD_position") %>% add_header_above(c(" " = 2, "Control group" = 3, "Treatment - Control" = 2)) %>% # column_spec (1, width = "5cm") %>% footnote(general = c("Source: Our reproduction of CDDP Table 1 with R, using the same raw data and specifications and producing the same results. Coefficients and p-values from an OLS regression of the variable on a treated village dummy, controlling for strata dummies (paired villages). 
Standard errors are clustered at the village level.", "*** Significant at the 1 percent level", "** Significant at the 5 percent level", "* Significant at the 10 percent level"), general_title = "", threeparttable = T) # Regress like in Crepon et al. ----------------------------------------------- sample <- consolidated %>% subsample() %>% filter(group == "Panel" | group == "Added") %>% filter(trimobs_el != 1 & samplemodel == 1) ## Credit results ----------------------- reg_cred_asis <- reg(sample, dep_vars = c("client", "borrowed_alamana_el", "borrowed_oamc_el", "borrowed_oformal_el", "borrowed_informal_el", "borrowed_branching_el", "borrowed_total_el"), fullreg = FALSE, var_out = "treatment_el") tb_cr_asis <- reg_cred_asis %>% as_tibble() %>% rownames_to_column(var = "Credit source") tb_cr_asis <- tb_cr_asis %>% column_to_rownames(var = "Credit source") %>% select(`Treated\nvillages` = value) %>% t() %>% as.data.frame() %>% as_tibble() %>% rownames_to_column(var = " ") # Generate table selfact tb_cr_asis %>% mutate_all(as.character) %>% mutate_all(linebreak) %>% kable(format = "latex", booktabs = T, escape = F, format.args = list(big.mark = ","), col.names = linebreak(c("","AAA\nadmin data", "AAA\nsurvey data", "Other\nMFI", "Other\nformal", "Utility\ncompany", "Informal", "Total")), align = "c", caption = "Credit: reproduction of CDDP regression results") %>% kable_styling(latex_options = "HOLD_position") %>% # column_spec (1, width = "5cm") %>% footnote(general = c("Source: Our reproduction of CDDP Table 2 with R, using the same raw data and specifications and producing the same results. Sample includes 4,934 households classified as high probability-to-borrow and surveyed at endline, after trimming 0.5 percent of observations. 
Coefficients and standard errors (in parentheses) from an OLS regression of the variable on a treated village dummy, controlling for strata dummies (paired villages), number of household members, number of adults, head age, does animal husbandry, does other non-agricultural activity, had an outstanding loan over the past 12 months, HH spouse responded to the survey, and other HH member (excluding the HH head) responded to the survey and variables specified below. Standard errors are clustered at the village level.", "*** Significant at the 1 percent level", "** Significant at the 5 percent level", "* Significant at the 10 percent level"), general_title = "", threeparttable = T) ### Selfact results --------------------- selfact <- c("assets_total_el", "output_total_el", "expense_total_el", "inv_total_el", "profit_total_el") reg_selfact_asis <- reg(sample, dep_vars = selfact, fullreg = FALSE, var_out = "treatment_el") # Insert the results as published in the paper (ie. the same) tb_asis <- reg_selfact_asis %>% as_tibble() %>% rownames_to_column(var = "Outcome") # rename and reformat columns tb_asis <- tb_asis %>% column_to_rownames(var = "Outcome") %>% select(`Treated\nvillages` = value) %>% t() %>% as.data.frame() %>% as_tibble() %>% rownames_to_column(var = " ") # Generate table selfact tb_asis %>% mutate_all(as.character) %>% mutate_all(linebreak) %>% kable(format = "latex", booktabs = T, escape = F, format.args = list(big.mark = ","), col.names = linebreak(c(" ","Assets", "Sales and\nhome\nconsumption", "Expenses", "Of which:\nInvestment","Profit")), align = "c", caption = "Self-Employment Activities: reproduction of CDDP results") %>% kable_styling(latex_options = "HOLD_position") %>% column_spec (1, width = "5cm") %>% footnote(general = c("Source: Our reproduction of CDDP Table 3 with R using the same raw data and producing the same results. 
Sample includes 4,934 households classified as high probability-to-borrow and surveyed at endline, after trimming 0.5 percent of observations. Coefficients and standard errors (in parentheses) from an OLS regression of the variable on a treated village dummy, controlling for strata dummies (paired villages), number of household members, number of adults, head age, does animal husbandry, does other non-agricultural activity, had an outstanding loan over the past 12 months, HH spouse responded to the survey, and other HH member (excluding the HH head) responded to the survey and variables specified below. Standard errors are clustered at the village level.",
                       "*** Significant at the 1 percent level",
                       "** Significant at the 5 percent level",
                       "* Significant at the 10 percent level"),
           general_title = "", threeparttable = T)
```

Table 1 shows that CDDP identified some small but significant imbalances at baseline: households in treatment villages have older heads, more frequently engage in animal husbandry and run non-farm businesses, and borrow more frequently from formal and informal credit sources. CDDP used the baseline values of these imbalanced variables as controls in the regressions estimating the average treatment effects at endline, for instance in Table 2 and Table 3. Table 2 suggests that the experiment worked, that is, that households in the villages assigned to the treatment group received significantly more loans from Al Amana and not from other sources. Table 3 shows substantial and significant impacts of the treatment on the assets, outputs, expenses and profits of self-employment activities.

While reproducing CDDP results, however, we identified issues with the trimming procedure, other imbalances at baseline and significant impacts on unlikely outcomes. In-depth verification revealed sampling errors, measurement errors and coding errors. These errors are not acknowledged by CDDP.
After correcting the errors that could be corrected, we found different results, whose validity nevertheless remains uncertain.

# 3. Results rely primarily on the trimming procedure and threshold

@deaton_limitations_2016 issue the following warning regarding trimming in RCTs: "*When there are outlying individual treatment effects, the estimate depends on whether the outliers are assigned to treatments or controls, causing massive reductions in the effective sample size. Trimming of outliers would fix the statistical problem, but only at the price of destroying the economic problem; for example, in healthcare, it is precisely the few outliers that make or break a programme.*" Examining the trimming procedure applied by CDDP reveals that different procedures were applied at baseline and endline and that the final results are heavily dependent on the trimming threshold.

## 3.1 Different trimming procedures were applied at baseline and at endline

CDDP present the procedure they used for trimming as follows: "*Out of the 5,551, to remove obvious outliers without risking cherry-picking, we trimmed 0.5 percent of observations using the following mechanical rule: for each of the main continuous variables of our analysis (total loan amount, Al Amana loan amount, other MFI loan amount, other formal loan amount, utility company loan amount, informal loan amounts, total assets, productive assets of each of the three self-employment activities, total production, production of each of the three self-employment activities, total expenses, expenses of each of the three self-employment activities, income from employment activities, and monthly household consumption), we computed the ratio of the value of the variable and the ninetieth percentile of the variable distribution. We then computed the maximum ratio over all the variables for each household and we trimmed 0.5 percent of households with the highest ratios.
Analysis is thus conducted over 5,424 observations instead of the original 5,551, and no further trimming is done in the data*" [@crepon_estimating_2015: 130].

However, this account is inaccurate in several respects. To begin with, the figure should have read 5,524 instead of 5,424, which corresponds to the number of observations remaining once 0.5% of 5,551 has been removed. Moreover, most of the continuous variables of the analysis were included in the trimming exercise, but not all of them: the number of hours worked, for instance, was not included. In addition, this systematic trimming was applied only to endline data. The baseline data were subject to far more erratic and extensive trimming. Table 4 compares the variables and thresholds applied at baseline and endline.

As can be seen from Table 4, trimming was performed on different variables using different thresholds and at least two different procedures. The complex procedure described by CDDP and quoted above was used at endline. A simpler procedure was used for 24 variables at baseline: a value was removed whenever it lay above a given threshold of that variable's distribution. The thresholds set for this “simple” trimming varied from one variable to another, from 0.1% to 0.4%. In total, 459 of the 4,465 observations in the baseline sample (10.3%) had at least one variable trimmed in this way.

This raises three concerns. First, it is not true that “*no further trimming is done in the data*” [@crepon_estimating_2015: 130]. Second, setting fixed cut-offs for trimming lacks objectivity and is a source of bias, as it does not take into account the structure of the data distribution. Good practice for trimming experimental data consists of using a multiple of the standard deviation as the cut-off and, ideally, setting this multiple according to the sample size [@selst_solution_1994].
Third, the impact estimates are highly sensitive to the selected trimming threshold, as illustrated in the next section.

```{r trim_thresh, message=F, warning=F, echo=F}
`Variable` <- c("Amounts of active loans from AAA, informal & utilities",
                "Amounts of active loans from other formal sources",
                "Amounts of matured loans",
                "Agriculture, livestock and business assets",
                "Livestock & business investments",
                "Agricultural investments",
                "Agricultural sales",
                "Livestock sales",
                "Business sales",
                "Agriculture, livestock and business expenses",
                "Agricultural savings",
                "Livestock & business savings",
                "Consumption",
                "Income from dependent activities",
                "Loan repayments",
                "Income from self-employment activities",
                "Employment in agriculture and livestock",
                "Work from family members in agriculture and livestock",
                "Distance to markets")
`Trimming threshold at baseline`<- c("0.1% (BL:89-92)",
                                     "0.3% (BL:94-5)",
                                     "",
                                     "0.3% (BL:366-9)",
                                     "0.3% (BL:366-9)",
                                     "0.4% (BL:371-2)",
                                     "0.4% (BL:514-5)",
                                     "0.3% (BL:564-5)",
                                     "0.4% (BL:593-4)",
                                     "0.3% (BL:631-2, 675-6, 701-2)",
                                     "0.3% (BL:756-7)",
                                     "0.3% (BL:785-6, 823-4)",
                                     "0.1% (BL:923-4)",
                                     "",
                                     "0.1% (BL:930-1)",
                                     "0.3% (BL:1016-7)",
                                     "0.3% (BL:1073-6)",
                                     "0.3% (BL:1101-4)",
                                     "0.1% (1256-71)")
`Trimming threshold at endline`<- c("",
                                    "",
                                    "0.5% (AN:247-72)*",
                                    "0.5% (AN:247-72)*",
                                    "",
                                    "",
                                    "0.5% (AN:247-72)*",
                                    "0.5% (AN:247-72)*",
                                    "0.5% (AN:247-72)*",
                                    "0.5% (AN:247-72)*",
                                    "0.5% (AN:247-72)*",
                                    "",
                                    "0.5% (AN:247-72)*",
                                    "0.5% (AN:247-72)*",
                                    "",
                                    "",
                                    "0.3% (EL:1299-302)",
                                    "",
                                    "")
trim_comp <- tibble(`Variable`, `Trimming threshold at baseline`,
                    `Trimming threshold at endline`)
#kable(trim_comp)
trim_comp %>%
  kable(format = "latex", booktabs = T, escape = T,
        caption = "Inconsistent trimming procedures and thresholds between baseline and endline by CDDP") %>%
  kable_styling(latex_options = "HOLD_position") %>%
  column_spec (2, width = "3.1cm") %>%
  column_spec (3, width = "3cm") %>%
  footnote(general = c("Source: Examination of CDDP scripts for data preparation at
baseline (BL) and at endline (EL).",
                       "* Those cases are trimmed using the procedure described in Crépon et al. (2015) and presented above (section 3.1): each trimmed observation is removed entirely. For the other cases, only the outlying values of the trimmed variables were set to missing."),
           general_title = "", threeparttable = T)
```

## 3.2 Variation in impact estimates depending on trimming threshold

In Table 5, we use exactly the same data preparation and regression specifications as CDDP, and test other thresholds. Table 5 shows that the results published by CDDP are highly sensitive to the trimming threshold, and that thresholds other than 0.5% point towards different interpretations. Thresholds below 0.5% produce results with no statistically significant impacts on self-employment activity outputs (sales and home consumption) or profits. The logical interpretation would then be that microcredit has no clear impact on self-employment activities. Thresholds above 0.5% generate a statistically significant increase in expenses and decrease in investment, but no statistically significant impact on profits. It would be harder to produce a coherent interpretation of such results since, in particular, a decrease in investment is difficult to reconcile with an increase in assets. The initial conclusions on microcredit effects are also weakened if the provision of liquidity only results in an increase in turnover (sales and expenses), with no effect on investment or profits.
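
The mechanics behind this sensitivity can be sketched on simulated data. The following is a minimal illustration in base R, not CDDP's code and not the `flag_trimobs()` function used in this replication: the data frame `toy`, the helper `flag_trim_toy()` and the chosen distributions are all hypothetical. It applies the rule quoted in section 3.1 (ratio of each value to the ninetieth percentile of its variable, maximum ratio per household, households above the cut-off flagged) and counts the households flagged at several thresholds.

```r
# Hypothetical toy data: 1,000 households, two skewed outcome variables
set.seed(1)
toy <- data.frame(ident  = 1:1000,
                  assets = rlnorm(1000, meanlog = 8, sdlog = 1),
                  sales  = rlnorm(1000, meanlog = 7, sdlog = 1.5))

# Illustrative re-implementation of the max-ratio rule (an assumption-laden
# sketch, not the original Stata code)
flag_trim_toy <- function(d, vars, threshold = 0.995) {
  # ratio of each value to the 90th percentile of the non-zero values
  ratios <- sapply(vars, function(v) {
    p90 <- quantile(d[[v]][d[[v]] != 0], probs = 0.9, na.rm = TRUE, type = 2)
    d[[v]] / p90
  })
  # highest ratio across variables for each household
  maxratio <- apply(ratios, 1, max)
  # households whose highest ratio exceeds the chosen quantile are flagged
  cutoff <- quantile(maxratio, probs = threshold, na.rm = TRUE, type = 2)
  as.integer(maxratio > cutoff)
}

# Number of flagged households at several thresholds
thresholds <- c(0.999, 0.995, 0.99, 0.95)
n_flagged <- sapply(thresholds, function(t) {
  sum(flag_trim_toy(toy, c("assets", "sales"), threshold = t))
})
names(n_flagged) <- thresholds
print(n_flagged)
```

Because the rule operates on the maximum ratio, an extreme value on any single variable is enough to drop a whole household, so moving the threshold changes which tail observations enter the regressions.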
```{r sensitif_regress, message=F, warning=F, echo=F} # Prepare baseline, endline and merged datasets ------------------------------- # Trim at different thresholds at baseline # a function to compute a series of regression as in crépon et al, with # varying thresholds test_sens_trim <- function(x, var_out = "treatment_el", dep_vars = selfact, added = TRUE, cons_p = TRUE) { reg_selfact_all <- tibble("Outcome variables" = c("obs", dep_vars)) for (i in 1:length(x)) { consolidated_bl <- flag_trimobs(consolidated_bl_avtrim, trim_vars = trim_vars_el, threshold = x[i]) consolidated_el <- flag_trimobs(consolidated_el_avtrim, trim_vars = trim_vars_el, threshold = x[i]) consolidated <- consolidated_bl %>% full_join(consolidated_el, by = "ident", suffix = c("_bl", "_el")) %>% set_missing_controlvars() # Samples wether includes added at EL and consistent panel if (added) { sample <- consolidated %>% subsample() %>% filter(group == "Panel" | group == "Added") %>% filter(trimobs_el != 1 & samplemodel == 1) } else if (added == FALSE & cons_p == TRUE) { sample <- consolidated %>% subsample() %>% filter(group == "Panel", keep_hh == 1) %>% filter(trimobs_el != 1 & samplemodel == 1) } else { sample <- consolidated %>% subsample() %>% filter(group == "Panel") %>% filter(trimobs_el != 1 & samplemodel == 1) } # Number of obs obs <- c(nrow(sample)) names(obs) <- "obs" # Regress reg_selfact_i <- reg(sample, dep_vars = selfact, fullreg = FALSE, var_out = "treatment_el") reg_selfact_i <- c(obs, reg_selfact_i) reg_selfact_i <- tibble(`Outcome variables` = names(reg_selfact_i), `Results` = reg_selfact_i) result_lab <- paste("Trimming at ", round((1-x[i])*100,1), "\\%", sep = "") colnames(reg_selfact_i) <- c("Outcome variables", result_lab) # Appends to result list reg_selfact_all <- reg_selfact_all %>% left_join(reg_selfact_i, by = "Outcome variables") } return(reg_selfact_all) } sens_trim <- test_sens_trim(c(1, 0.997, 0.995, 0.993, 0.99, 0.985, 0.98, 0.97, 0.95)) sens_trim %>% as_tibble() 
%>% t() %>%
  as.data.frame() %>%
  as_tibble() %>%
  rownames_to_column(var = "Source") %>%
  filter(`Source` != "Outcome variables") %>%
  mutate_all(as.character) %>%
  mutate_all(linebreak) %>%
  kable(format = "latex", booktabs = T, escape = F, # problem: no footnote possible
        col.names = linebreak(c("Threshold", "Obs.","Assets", "Sales and\nhome\nconsumption", "Expenses", "Of which:\nInvestment","Profit")),
        caption = "Identical analysis to CDDP, but with varying trimming thresholds") %>%
  kable_styling(latex_options = "HOLD_position") %>%
  footnote(general = c("Source: Our reproduction of CDDP Table 3 with R, using the same data and same trimming procedure at endline, but with varying trimming thresholds. The sample includes the households surveyed at endline, minus the households considered as low probability-to-borrow and minus the trimmed observations. The other specifications are the same as CDDP Table 3: Coefficients and standard errors (in parentheses) from an OLS regression of the variable on a treated village dummy, controlling for strata dummies (paired villages), number of household members, number of adults, head age, does animal husbandry, does other non-agricultural activity, had an outstanding loan over the past 12 months, HH spouse responded to the survey, and other HH member (excluding the HH head) responded to the survey and variables specified below. Standard errors are clustered at the village level.",
                       "*** Significant at the 1 percent level",
                       "** Significant at the 5 percent level",
                       "* Significant at the 10 percent level"),
           general_title = "", threeparttable = T)
```

In sum, CDDP trimmed 459 observations (10.3%) at baseline, setting only the most extreme values of those observations to missing, while at endline they trimmed 27 observations (0.5%) by removing them entirely. The fact that the final results vary substantially with the number of removed observations suggests possible data quality issues.

# 4. Imbalances at baseline and impacts on implausible outcomes

CDDP started their analysis by testing the balance between treatment and control groups on a limited number of variables. They found some small but significant differences for some of them: households in treatment villages have more access to credit, more livestock activities and livestock assets, less non-farm business, and household heads are slightly older (see Table 1). The baseline values for these variables were therefore included as controls in the regressions to estimate impacts (Table 2 and Table 3 among others).

However, CDDP did not report the balance for the most important variables in their analysis, namely the outcomes they used to estimate the experiment’s impact. Nor did they report the balance on the characteristics that have been highlighted as essential by qualitative research that aimed to provide contextual insights for this RCT [@morvant-roux_adding_2014]: socio-economic status, belonging to a particular language or ethnic group, attitude towards female empowerment. It also seems important to check the balance on access to water and electricity services, as we will see in Section 5.1.4 that loans to finance connexions to these utilities are the main source of credit in the area, with a significant variation between baseline and endline.

In Table 6, we use the same specification as in Table 1 to assess the balance between control and treatment groups at baseline, but with regressions on these additional variables. We also estimate in Table 6 the average treatment effect on those additional variables, first with the exact same specifications as CDDP Table 3, and second adding as controls the additional variables that appeared as imbalanced at baseline.

Table 6 reveals that, at baseline, households in the treatment group had significantly lower sales and profits from self-employment activities than households in the control group. They were also making higher investments.
There are also imbalances at baseline on several important variables, such as the area of owned land, access to basic services or women's empowerment.

When using the same specifications as CDDP, we also find significant treatment effects on outcomes for which a microcredit impact is hardly plausible: household head gender, absence of education and spoken language.

Controlling for all the variables identified as imbalanced at baseline increases the magnitude and the significance of the estimated impacts on assets, sales and expenses. However, the impact on profits no longer appears significant. Some impacts on unlikely outcomes are no longer significant, but others remain or appear, like household head gender, education and household members leaving the household.

The variables regarding access to electricity, water and sanitation deserve specific attention. They show significant imbalances at baseline, but also strong average treatment effects at endline. This is notable because, as we will see, connexion credit and expansion campaigns from those utilities appear as a possible co-intervention that might have contaminated the experiment (see 5.1.5).

These imbalances at baseline and unlikely average treatment effects call for a closer examination of data quality and experiment integrity. We start by reviewing measurement and coding errors.
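The balance tests discussed above regress a baseline variable on the treatment dummy, with village-pair fixed effects and standard errors clustered at the village level. The following is a minimal base R sketch of that kind of test on simulated data. It is not the `reg_balance()` function used in this document: the function `balance_test()`, the toy data and the choice of an uncorrected (CR0) cluster sandwich with a normal approximation for the p-value are all assumptions made for illustration.

```r
# Toy data: 20 village pairs, one treated and one control village per pair.
set.seed(42)
n_pairs <- 20
hh_per_village <- 25
villages <- data.frame(
  pair      = rep(seq_len(n_pairs), each = 2),
  village   = seq_len(2 * n_pairs),
  treatment = rep(c(0, 1), times = n_pairs)
)
hh <- villages[rep(seq_len(nrow(villages)), each = hh_per_village), ]
village_shock <- rnorm(2 * n_pairs)            # shock shared within a village
hh$y <- 1 + village_shock[hh$village] + rnorm(nrow(hh))

# OLS of the outcome on the treatment dummy with pair fixed effects,
# with cluster-robust (CR0) standard errors; hypothetical helper.
balance_test <- function(data, outcome, cluster) {
  fit <- lm(reformulate(c("treatment", "factor(pair)"), response = outcome),
            data = data)
  X <- model.matrix(fit)
  u <- residuals(fit)
  bread  <- solve(crossprod(X))
  scores <- rowsum(X * u, group = data[[cluster]])  # sum of x_i * u_i by cluster
  vcov_cl <- bread %*% crossprod(scores) %*% bread
  est <- coef(fit)[["treatment"]]
  se  <- sqrt(diag(vcov_cl))[["treatment"]]
  c(estimate = est, cluster_se = se,
    p_value = 2 * pnorm(-abs(est / se)))            # normal approximation
}

res <- balance_test(hh, "y", "village")
print(round(res, 3))
```

Because the toy outcome is generated independently of treatment, any "imbalance" found here is noise; with real data, a small p-value on a baseline variable signals a difference between groups that randomisation alone makes unlikely.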
```{r balance_bl, message=F, warning=F, echo=F} extract_migrants <- function (x) { x %>% select(ident, matches("^a2[0-9]_[0-9]+$|^a2[0-9]_[0-9]+$")) %>% gather(variable, value, -ident) %>% mutate(order = gsub("(a[0-9][0-9]?)_([0-9]+$)", "\\2", variable), variable = gsub("(a[0-9][0-9]?)_([0-9]+$)", "\\1", variable), value = as.numeric(value), order = as.numeric(order)) %>% spread(key = variable, value = value) } check_misc <- function(svy, obs) { ind <- svy %>% extract_individuals() divers <- ind %>% filter(a5 == 3 | a5 == 5 | a3 == 1 | order == 1) %>% group_by(ident) %>% summarise(nb_cm = sum(a3 == 1, na.rm = TRUE), no_cm = nb_cm == 0, svl_cm = nb_cm > 1, nb_mb = sum(a5 == 3 | a5 == 5 | a3 == 1, na.rm = TRUE), no_mb = nb_mb == 0, # Language head cm_darija = sum(a3 == 1 & a9 == 1, na.rm = TRUE), cm_darija_arabe = sum(a3 == 1 & a9 == 2, na.rm = TRUE), cm_darija_arabe_francais = sum(a3 == 1 & a9 == 3, na.rm = TRUE), cm_darija_francais = sum(a3 == 1 & a9 == 4, na.rm = TRUE), cm_berbere = sum(a3 == 1 & a9 == 5, na.rm = TRUE), cm_berbere_darija = sum(a3 == 1 & a9 == 6, na.rm = TRUE), cm_berbere_darija_arabe = sum(a3 == 1 & a9 == 7, na.rm = TRUE), cm_berbere_darija_arabe_francais = sum(a3 == 1 & a9 == 8, na.rm = TRUE), cm_berbere_darija_francais = sum(a3 == 1 & a9 == 9, na.rm = TRUE), cm_darija_tot = cm_darija + cm_darija_arabe + cm_darija_arabe_francais + cm_darija_francais + cm_berbere_darija + cm_berbere_darija_arabe + cm_berbere_darija_arabe_francais + cm_berbere_darija_francais, cm_berbere_tot = cm_berbere + cm_berbere_darija + cm_berbere_darija_arabe + cm_berbere_darija_arabe_francais + cm_berbere_darija_francais, cm_arabe_class_tot = cm_darija_arabe + cm_darija_arabe_francais + cm_berbere_darija_arabe + cm_berbere_darija_arabe_francais, cm_francais_tot = cm_darija_arabe_francais + cm_darija_francais + cm_berbere_darija_arabe_francais + cm_berbere_darija_francais, darija = sum(a9 == 1, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE), darija_arabe = sum(a9 == 2, na.rm = 
TRUE)/sum(!is.na(a9), na.rm = TRUE),
              darija_arabe_francais = sum(a9 == 3, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE),
              darija_francais = sum(a9 == 4, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE),
              berbere = sum(a9 == 5, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE),
              berbere_darija = sum(a9 == 6, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE),
              berbere_darija_arabe = sum(a9 == 7, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE),
              berbere_darija_arabe_francais = sum(a9 == 8, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE),
              berbere_darija_francais = sum(a9 == 9, na.rm = TRUE)/sum(!is.na(a9), na.rm = TRUE),
              darija_tot = darija + darija_arabe + darija_arabe_francais + darija_francais + berbere_darija + berbere_darija_arabe + berbere_darija_arabe_francais + berbere_darija_francais,
              berbere_tot = berbere + berbere_darija + berbere_darija_arabe + berbere_darija_arabe_francais + berbere_darija_francais,
              arabe_class_tot = darija_arabe + darija_arabe_francais + berbere_darija_arabe + berbere_darija_arabe_francais,
              francais_tot = darija_arabe_francais + darija_francais + berbere_darija_arabe_francais + berbere_darija_francais,
              cm_hom = sum(a3 == 1 & a4 == 1, na.rm = TRUE),
              cm_hom = ifelse(cm_hom >= 1, 1, 0),
              part_fem = mean(a4 == 2, na.rm = TRUE),
              nb_fem = sum(a4 == 2, na.rm = TRUE),
              chef_educ_sup = sum(a3 == 1 & a8 > 7 & a8 < 16, na.rm = TRUE),
              tot_educ_sup = sum(a7 > 15 & a8 > 7 & a8 < 16, na.rm = TRUE),
              chef_ss_educ = sum(a3 == 1 & a8 == 1, na.rm = TRUE),
              chef_educ_coran = sum(a3 == 1 & a8 == 2, na.rm = TRUE),
              educ_coran = sum(a8 == 2, na.rm = TRUE),
              cm_fonctionnaire = sum(a3 == 1 & a11 == 3, na.rm = TRUE),
              fonctionnaire = sum(a11 == 3, na.rm = TRUE),
              cm_retraite = sum(a3 == 1 & a11 == 14, na.rm = TRUE),
              retraite = sum(a11 == 14, na.rm = TRUE),
              cm_ne_ds_douar = sum(a3 == 1 & a6 == 1, na.rm = TRUE),
              cm_ne_ds_prov = sum(a3 == 1 & a6 < 5, na.rm = TRUE),
              tot_ne_ds_douar = sum(a6 == 1, na.rm = TRUE),
              tot_ne_ds_prov = sum(a6 < 5, na.rm = TRUE),
              femmes_ss_educ = sum(a4 == 2 & a8 == 1, na.rm = TRUE), # 
check si erreur femmes_ss_educ_per = sum(a4 == 2 & a8 == 1, na.rm = TRUE)/sum(a4 == 2, na.rm = TRUE), cm_act = sum(a3 == 1 & a10 == 1, na.rm = TRUE), mb_act = sum(a10 == 1, na.rm = TRUE), cm_agri = sum(a3 == 1 & a11 == 5, na.rm = TRUE), mb_agri = sum(a11 == 5, na.rm = TRUE), cm_commerce = sum(a3 == 1 & a11 == 2, na.rm = TRUE), mb_commerce = sum(a11 == 2, na.rm = TRUE), cm_journalier = sum(a3 == 1 & a11 == 1, na.rm = TRUE), mb_journalier = sum(a11 == 1, na.rm = TRUE)) divers2 <- ind %>% group_by(ident) %>% summarise(cm_out = sum(a3 == 1 & (a5 == 1 | a5 == 2 | a5 == 4), na.rm = TRUE), ccm_out = sum(a3 == 2 & (a5 == 1 | a5 == 2 | a5 == 4), na.rm = TRUE), all_out = sum(a5 == 1 | a5 == 2 | a5 == 4, na.rm = TRUE)) olivier <- svy %>% extract_tree() %>% filter(`Tree` == 1) %>% group_by(ident) %>% summarise(nb_oliviers = sum(number_tree, na.rm = TRUE)) biens_conso <- svy %>% extract_durable() %>% group_by(ident) %>% summarise( nb_refrig = sum(ifelse(item == 5, number, 0), na.rm = TRUE), nb_TV_couleur = sum(ifelse(item == 10, number, 0), na.rm = TRUE), nb_parabole = sum(ifelse(item == 11, number, 0), na.rm = TRUE), nb_gsm = sum(ifelse(item == 14, number, 0), na.rm = TRUE)) income_dep <- svy %>% extract_income_dep() %>% group_by(ident) %>% summarise(pension = sum(pension_year, na.rm = TRUE), salary = sum(salary_year, na.rm = TRUE), wage = sum(wage_year, na.rm = TRUE), sal_wage = salary + wage, all_income = sal_wage + pension) general <- svy %>% mutate(elec = b15_1 == 1, elec_2 = b15_2 == 1, assain_1 = b14_1 == 1, assain_2 = b14_2 == 1, eau_1 = b13_1 == 1, eau_2 = b13_2 == 1, proprio = b11 == 1, wc = b10_1 == 1, douche = b10_2 == 1, taille = ifelse(is.na(b9_1), 0, b9_1), owns_land = c5 == 1, area_land = ifelse(is.na(c6), 0, c6), area_land_rent = ifelse(is.na(c10), 0, c10), area_irig = ifelse(is.na(c7_1), 0, c7_1), area_melk = ifelse(is.na(c8_1), 0, c8_1), area_jamoua = ifelse(is.na(c8_2), 0, c8_2), area_habous = ifelse(is.na(c8_3), 0, c8_3), wom_no_souk = j10 == 1 | j10 == 2, 
wom_no_bus = j11 == 1 | j11 == 2, wom_no_visit = j12 == 1 | j12 == 2, wom_no_friends = j13 == 1 | j13 == 2) %>% select(ident, elec, elec_2, assain_1, assain_2, eau_1, eau_2, proprio, wc, douche, taille, owns_land, area_land, wom_no_souk, wom_no_bus, wom_no_visit, wom_no_friends, area_land_rent, area_irig, area_melk, area_jamoua, area_habous) migr <- svy %>% extract_migrants() migr <- migr %>% filter(order == 1 | !is.na(a20)) %>% group_by(ident) %>% summarise( migr_1y = sum(a21 == 1, na.rm = TRUE), migr_2y = sum(a21 == 2, na.rm = TRUE), migr_3y = sum(a21 == 3, na.rm = TRUE), migr_4yplus = sum(a21 == 4, na.rm = TRUE), migr_1to2y = migr_1y + migr_2y, migr_tot = migr_1to2y + migr_3y + migr_4yplus) obs_to_analyse <- obs %>% left_join(divers, by = "ident", suffix = c("_el", "_bl")) %>% left_join(divers2, by = "ident", suffix = c("_el", "_bl")) %>% left_join(olivier, by = "ident", suffix = c("_el", "_bl")) %>% left_join(biens_conso, by = "ident", suffix = c("_el", "_bl")) %>% left_join(income_dep, by = "ident", suffix = c("_el", "_bl")) %>% left_join(general, by = "ident", suffix = c("_el", "_bl")) %>% left_join(migr, by = "ident", suffix = c("_el", "_bl")) return(obs_to_analyse) } bl4bal <- check_misc(bl, filter(consolidated_bl, trimobs != 1)) sample_misc <- check_misc(el, sample) sample_misc <- check_misc(bl, sample_misc) vars_analyse <- c("assets_total", "output_total", "expense_total", "inv_total", "profit_total", "cm_darija_tot", "cm_berbere_tot", "cm_arabe_class_tot", "cm_francais_tot", "chef_ss_educ", "cm_fonctionnaire", "cm_ne_ds_douar", "cm_hom", "nb_refrig", "nb_TV_couleur", "nb_gsm", "elec", "elec_2", "assain_1", "assain_2", "eau_1", "eau_2", "owns_land", "area_land", "migr_tot", "wom_no_souk", "wom_no_bus", "wom_no_visit", "wom_no_friends") balance_misc <- reg_balance(treatment = treatment, dep_vars = vars_analyse, controls = "factor(paire)", cluster = "demi_paire", data = bl4bal) vars_analyse_el <- paste0(vars_analyse, "_el") impact_misc <- reg(sample_misc, 
dep_vars = vars_analyse_el, fullreg = FALSE, separator = " ", var_out = "treatment_el")%>% as.data.frame() %>% rownames_to_column(var = "code") %>% mutate(code = str_remove(code, "_el$"), n = nrow(sample_misc)) %>% select(1, 3, 2) impact_misc2 <- reg(sample_misc, dep_vars = vars_analyse_el, fullreg = FALSE, var_out = "treatment_el", separator = " ", rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_bl + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + output_total_bl + profit_total_bl + inv_total_bl + cm_fonctionnaire_bl + cm_ne_ds_douar_bl + cm_darija_tot_bl + area_land_bl + elec_bl + assain_1_bl + assain_2_bl + eau_2_bl + wom_no_souk_bl + wom_no_bus_bl + factor(paire_el)")%>% as.data.frame() %>% rownames_to_column(var = "code") %>% mutate(code = str_remove(code, "_el$")) var_names <- tibble(c("cm_darija_tot", "Darija", "Household head spoken language", 3), c("cm_berbere_tot", "Berber", "Household head spoken language", 3), c("cm_arabe_class_tot", "Classical Arabic", "Household head spoken language", 3), c("cm_francais_tot", "French", "Household head spoken language", 3), c("cm_hom", "Male head", "Household characteristics", 1), c("cm_fonctionnaire", "Head is a public servant", "Household characteristics", 1), c("cm_ne_ds_douar", "Head born in the same village", "Household characteristics", 1), c("chef_ss_educ", "Head without education", "Household characteristics", 1), c("nb_TV_couleur", "Number of color TVs", "Household assets", 6), c("elec", "Electricity from grid", "Access to basic utilities", 8), c("assain_1", "Sewage network", "Access to basic utilities", 8), c("assain_2", "Septic tank", "Access to basic utilities", 8), c("eau_1", "Private connection to piped water", "Access to basic utilities", 8), c("eau_2", "Shared connection 
to public tap", "Access to basic utilities", 8), c("owns_land", "Owns land", "Household assets", 6), c("area_land", "Area of owned land", "Household assets", 6), c("migr_tot", "Members left in the last 5 years", "Household characteristics", 1), c("wom_no_souk", "Go to the souk alone", "Respondent considers that women should not", 9), c("wom_no_bus", "Take the bus alone", "Perception of woman condition", 9), c("assets_total","Assets", "Outcomes on self-employment activities", 0), c("output_total","Sales and home consumption", "Outcomes on self-employment activities", 0), c("expense_total","Expenses", "Outcomes on self-employment activities", 0), c("inv_total","Of which: Investment", "Outcomes on self-employment activities", 0), c("profit_total","Profit", "Outcomes on self-employment activities", 0)) colnames(var_names) <- 1:24 var_names <- t(var_names) colnames(var_names) <- c("code", "label", "category", "rank") results <- var_names %>% as.tibble() %>% arrange(rank) %>% left_join(balance_misc, by = c("code" = "Dependent variable")) %>% left_join(impact_misc, by = "code") %>% left_join(impact_misc2, by = "code") results2 <- results %>% select(-code, -category, -rank) colnames(results2) <- c("Variable", "Obs. ", "Obs.", "Mean", "SD", "Coeff.\\textsuperscript{1}", "p-value", "Obs. ","As in CDDP\\textsuperscript{2}", "Adding controls\\textsuperscript{3}") results2 %>% mutate(`Obs. ` = as.numeric(`Obs. 
`), `Obs.` = as.numeric(`Obs.`)) %>% # mutate_all(linebreak) %>% kable(format = "latex", booktabs = T, escape = F, format.args = list(big.mark = ","), # align = c(rep("l", 10)), caption = "Balance tests at baseline and impact estimates at endline, without correcting coding, measurement and sampling errors") %>% kable_styling() %>% add_header_above(c(" " = 1, "N" = 1, "Control group" = 3, "Treatment - Control" = 2, " " = 1,"ATE estimates" = 2)) %>% add_header_above(c(" " = 1, "Balance at baseline" = 6, "Impact at endline" = 3)) %>% group_rows("Outcomes on self-employment activities", 1, 5) %>% group_rows("Household characteristics", 6, 10) %>% group_rows("Household head spoken language", 11, 14) %>% group_rows("Household assets", 15, 17) %>% group_rows("Access to basic utilities", 18, 22) %>% group_rows("Respondent considers that women should not:", 23, 24) %>% # column_spec (1, width = "5cm") %>% footnote(general = c("*** Significant at the 1 percent level; ** Significant at the 5 percent level; * Significant at the 10 percent level.", "1. Same specifications as in Table 1; 2. Same specifications as in Table 3; 3. Same specifications as in Table 3, adding as controls the baseline values of sales, investments, profits, head is a public servant, head was born in the same village, speaks Darija, area of owned land, household has a connexion to electricity, to the sewage network, to a septic tank, access to a public tap, respondent considers that women should not go to souk alone and that women should not take the bus alone."), general_title = "", threeparttable = T) %>% landscape() ``` # 5. Measurement and coding errors Measurement errors can be observed in all sections of the dataset. We focus here on the variables used in the regression, which therefore have a direct incidence on identification and impact estimates. We also present the coding errors that have an incidence on the results. Other coding errors are listed in Appendix 3. 
## 5.1 Inconsistent treatment (credit) measures

Credit measures are essential to characterise the treatment and to confirm that no contamination or co-intervention threatens the integrity of the experiment. The analysis of coding and measurement errors on access to credit shows that the administrative data appended to the survey data are not reliable, and points to a lower take-up as well as possible contamination and co-interventions.

### 5.1.1 Discrepancies between administrative and survey data

Household access to AAA credit was captured by two different questions, present in both the baseline and endline questionnaires:

* Question 'i3': Did you or a member of the household have a loan from '*[NAME OF SOURCE]*'? Is it outstanding or mature? (the previous question specifies that the recall period for matured loans is 12 months);
* Question 'i62': Do you or any household member have an outstanding loan or a loan that matured during the last 12 months from Al Amana?

Besides variables 'i3' and 'i62', which derive from the survey, CDDP built a third variable named 'client' from data gathered from the AAA client registry.

The variable 'i3' indicates a low average level of borrowing from AAA at endline: 10.5% (289 households) in the treatment group and 2% (57 households) in the control group. The variable 'client' indicates a higher average level of borrowing from AAA at endline: 15.9% in the treatment villages (453 households) and 0% in the control villages. CDDP argue that more than a third of the households that took a loan from AAA did not report it in the survey and propose two interpretations: households might not admit to borrowing because it is frowned upon by Islam, or they might confuse credit from AAA with credit from other formal sources. They conclude that administrative data must be regarded as more reliable than survey data to capture take-up [@crepon_estimating_2015: 133-134].
Qualitative research in the settlements targeted by this RCT confirms that religious norms strongly influence practices and discourses related to credit [@morvant-roux_adding_2014]. Islam frowns upon two aspects. First, charging interest is explicitly prohibited under sharia, which mostly concerns formal credit. Second, being in debt is regarded as a disgrace, which applies to all forms of credit. There is no question in the survey questionnaire that assesses religious practices or observance. If there were, we would probably notice some correlation between religious indicators and credit. It would, however, be difficult to separate what stems from lower borrowing from what stems from lower reporting, as religious norms might lead believers not to borrow at all rather than to borrow and refrain from reporting it to interviewers.

Table 7 presents a cross-tabulation of the three variables that report household borrowing from AAA. It reveals that the inconsistencies are much broader than the differences in average reported borrowing. Such inconsistencies contradict the assertion that the administrative data can be regarded as more reliable than the survey data.

Table 7 yields two insights. First, there are limited inconsistencies across different questions of the same survey: 20 households reported credit from AAA in Question ‘i3’, but not in Question ‘i62’. Conversely, 26 households did not report credit from AAA in Question ‘i3’, but did so in Question ‘i62’. Second, there are major inconsistencies between the survey data and the ‘client’ variable extracted from the AAA administrative data: 152 households declare having contracted a loan from AAA in Question ‘i3’ but do not appear in the ‘client’ variable retrieved from AAA administrative registries, while 241 households identified in the latter as AAA borrowers declare not having an outstanding or matured loan from this microfinance institution (MFI) in Question ‘i3’.
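The kind of comparison described above can be sketched on a toy example. The data frame below is hypothetical (ten invented households, not the CDDP microdata), and the summary computed at the end, the share of agreeing cases among households flagged as borrowers by either source, is one possible way of summarising agreement that is assumed here for illustration.

```r
# Toy household-level indicators (hypothetical values).
toy <- data.frame(
  ident  = 1:10,
  i3     = c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0),  # survey: AAA loan reported in 'i3'
  client = c(1, 1, 0, 1, 1, 0, 0, 0, 1, 0)   # administrative registry flag
)

# Two-way table: rows = survey answer, columns = registry flag.
xtab <- table(survey_i3 = toy$i3, registry_client = toy$client)
print(xtab)

both        <- sum(toy$i3 == 1 & toy$client == 1)  # both sources agree on borrowing
survey_only <- sum(toy$i3 == 1 & toy$client == 0)  # reported, absent from registry
admin_only  <- sum(toy$i3 == 0 & toy$client == 1)  # in registry, not reported

# Share of agreeing cases among households flagged by either source.
concordance <- both / (both + survey_only + admin_only)
print(round(concordance, 3))
```

The off-diagonal cells of the table are the informative ones: a large `survey_only` count cannot be explained by under-reporting, which is the asymmetry the text draws attention to.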
```{r chk_cr_var, message=F, warning=F, message=F, echo=F} cr_el <- el %>% extract_cred(other_as_util = FALSE) %>% left_join(select(el, ident, treatment), by = "ident") %>% mutate(`Wave` = "Endline", `Group` = recode_factor(treatment, "0" = "Control", "1" = "Treatment")) borrowed_el <- cr_el %>% count_cred(y = `Source`, active_only = FALSE) %>% borrowed() chk_cr_var <- el %>% select(ident, i62, client, client_admin) %>% mutate(ident = as.character(ident)) %>% full_join(select(borrowed_el, ident, `Al Amana`), by = "ident") %>% mutate(i62 = recode_factor(i62, "1" = "Yes", "2" = "No", .missing = "No", .default = "No"), `Al Amana` = recode_factor(`Al Amana`, "1" = "Credit from AAA in `i3'", "0" = "No credit from AAA in `i3'"), client = recode_factor(client, "1" = "Yes", "0" = "No")) kable(t(rbind(table(chk_cr_var$i62, chk_cr_var$`Al Amana`), table(chk_cr_var$client, chk_cr_var$`Al Amana`))), caption = "Number of households borrowing from Al Amana at endline: contradictions between survey information and administrative data", format = "latex", booktabs = T, align = "c", longtable = T, format.args = list(big.mark = ",")) %>% kable_styling(full_width = T, latex_options = "HOLD_position") %>% add_header_above(c(" ", "Credit from AAA in `i62'" = 2, "Credit from AAA in `client'" = 2)) %>% # column_spec (1, width = "5cm") %>% footnote(general = c("Source: Our analysis using CDDP microdata retrieved from endline survey (`i3' and `i62') and AAA administrative data ('client')."), general_title = "", threeparttable = T) ``` Of the 241 households identified as clients at endline based on the AAA administrative data and who declared not having an outstanding or matured loan from this microfinance institution (MFI) in Question ‘i3’: * 27 reported at least one other formal credit[^6] at endline; * 25 reported at least one other formal credit at baseline (and 18 of those did not do so at endline); * 2 reported filing a credit application that was refused (one of these two was not 
already reported in the above cases). [^6]: CDDP classify as formal credit: '*Al Amana*'; '*Zakoura*'; '*Crédit Agricole Foundation*'; '*Other MFI*'; '*Crédit agricole*'; and '*Other bank*'. See more details on credit sources in 3.1.4. To sum up, the religion-driven shame argument clearly does not apply to 46 (27+18+1, i.e. 19%) of these 241 households, as they declare borrowing from formal sources elsewhere and, as explained above, the religion-driven shame argument applies equally to AAA microcredit and to other formal sources of credit. On the other hand, an argument of “credit shame” for these 241 households would call for an explanation of “credit pride” for the 152 households who reported having a loan from AAA even though they did not appear in the AAA registries. Turning to the second argument regarding confusing AAA with other sources of formal credit, we show in Section 5.1.5 that access to formal credit did not increase in the treatment group, but remained stable with other formal sources replaced by AAA. In the control group, access to formal credit fell between the baseline and endline. The fact that the other formal sources of credit fell significantly in both groups between the baseline and endline does not leave much room for substantial confusion between AAA and other formal sources at endline. Another plausible hypothesis to explain these discrepancies between survey data and administrative data is that the administrative data is inaccurate, or that it was not properly matched with the survey data. As we will see in Section 6.3, the sampling strategy failed to identify the households with a high propensity to borrow. It is therefore likely that a large part of the households that did borrow from AAA in the treated villages were not included in the survey sample. 
Besides, the microfinance sector in Morocco suffered a serious crisis from 2008 to 2012 (the endline surveys were conducted from May 2008 to January 2010) due to uncontrolled growth, over-indebtedness and widespread fraud by credit officers who used nominees to embezzle loans [@chen_growth_2010; @rozas_ending_2014; @despallier_crises_2015]. A Master’s student who did an internship in AAA’s internal audit division in 2009 substantiated the existence of such fraud in the MSc thesis he published on this subject [@hejjaji_analyse_2010]. AAA had to write off 23%[^7] of its portfolio in the following years as many loans were deemed uncollectable. To this should be added the rather frequent practice of borrowers themselves using nominees to bypass restrictive eligibility rules. These observations show that the reliability of the MFI administrative data should be viewed with caution, and that administrative data cannot be automatically considered more reliable than survey data. As the dataset is anonymised, we are unable to review the quality of the matching between survey and administrative variables.

[^7]: Data from Mix Market database: Write-off ratios from 2006 to 2016 are for each subsequent year: 0.5%, 1.3%, 3.7%, 6.4%, 3.5%, 8.7%, NA, 4.5%, 3.7%, 3.7%, 5.1%. The figure for 2012 is not known.

In sum, the identification of the households borrowing from Al Amana matches across sources in 194 cases, out of 587 cases (241 + 152 + 194) where households appear as borrowing from AAA in either the administrative data or the survey data. That is a concordance rate of 33%, which is low considering that credit from AAA is the "treatment" whose effectiveness is being tested.

A large portion of CDDP's demonstration relies on these credit-taking variables. CDDP use the baseline values of variable 'i3' to produce their Table 1 and as control variables for their Tables 2 to 8. CDDP did not use variable ‘i62’ in their statistical analysis.
The ‘client’ variable created from administrative data was used by CDDP to recompute a new borrowing propensity score, used to test externalities [@crepon_estimating_2015: Table 8], in order to argue that there is no externality of microcredit and to justify the Local Average Treatment Effect (LATE) estimation. This ‘client’ variable was also used to instrument the regression presented in CDDP Table 9.

Therefore, the inaccuracy in borrowers' identification highlighted in this section affects the tests applied to check sample balance at baseline, the estimation of the average treatment effect and the estimation of the local average treatment effect. We can neither rectify these inaccuracies with the available data nor measure their effect on the impact estimates. However, this imprecision regarding which households, and how many of them, benefited from the evaluated intervention undermines the internal validity of the RCT results, and in particular of the local average treatment effect estimations.

### 5.1.2 Credit from other MFIs was omitted at baseline

CDDP did not take into account loans from other MFIs when reporting access to credit and assessing the balance between treatment and control groups at baseline, as explained in more detail in Appendix A.2.1. In their Table 1, CDDP used the number of loans and the dummy (having a loan or not) variables to assess the balance between the treatment and control groups. The ‘total access to credit at baseline’ variable was also one of the control variables used for all regressions presented by CDDP (Tables 2 to 7).
```{r, message=F, warning=F, echo=F}
# Share of households with at least one active loan at baseline,
# without and with loans from other MFIs
cr_bl1 <- bl %>%
  extract_cred() %>%
  count_cred(y = source_type, active_only = TRUE) %>%
  mutate(tot_wo_oamc = `Al Amana` + `Other formal` + `Informal` + `Other`,
         tot_wo_oamc = ifelse(tot_wo_oamc > 0, 1, 0),
         tot_all = `Other MFI` + tot_wo_oamc,
         tot_all = ifelse(tot_all > 0, 1, 0)) %>%
  ungroup() %>%
  summarise(tot_wo_oamc = mean(tot_wo_oamc, na.rm = TRUE),
            tot_all = mean(tot_all, na.rm = TRUE))

# Relative increase (in %) once other MFIs are included
pct_oamc <- round((cr_bl1[[2]] - cr_bl1[[1]]) / cr_bl1[[1]] * 100)
```

Correcting this error increases the measured level of total access to credit at baseline by `r pct_oamc`%, treatment and control groups combined. This error combines with the one presented in Section 5.1.3, which has a larger effect on measured credit access at baseline. It affects the impact evaluation results, as illustrated in the following section.

### 5.1.3 Only outstanding loans were taken into account at baseline

When assessing access to credit at endline, CDDP included the loans outstanding at the time of the survey, plus the loans that were no longer outstanding at the time of the survey but had been outstanding in the past 12 months. When assessing access to credit at baseline, CDDP only included the loans outstanding at the time of the survey. They did not include the loans outstanding in the past 12 months that ended before the survey. Appendix A.2.2 details the coding error that led to this difference.
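The effect of the recall period on a household-level access dummy can be illustrated with a small toy example. The loan records, the `status` coding and the helper `has_credit()` below are hypothetical and are not the `extract_cred()` or `count_cred()` functions used in this document; the sketch only shows how the two definitions diverge.

```r
# Toy loan-level records (hypothetical values).
loans <- data.frame(
  ident  = c(1, 1, 2, 3, 5, 5, 6),
  source = c("Al Amana", "Informal", "Other MFI", "Informal",
             "Other formal", "Informal", "Al Amana"),
  status = c("outstanding", "matured", "matured", "outstanding",
             "matured", "matured", "outstanding")  # matured = ended in last 12 months
)
households <- data.frame(ident = 1:8)

# Household dummy: 1 if at least one loan satisfies the definition, else 0.
has_credit <- function(loans, households, include_matured) {
  kept <- if (include_matured) loans else loans[loans$status == "outstanding", ]
  as.integer(households$ident %in% kept$ident)
}

outstanding_only <- has_credit(loans, households, include_matured = FALSE)
with_recall      <- has_credit(loans, households, include_matured = TRUE)

access <- c(outstanding_only = mean(outstanding_only),
            with_12m_recall  = mean(with_recall))
pct_increase <- (access[["with_12m_recall"]] - access[["outstanding_only"]]) /
  access[["outstanding_only"]] * 100
print(access)
print(round(pct_increase, 1))
```

In this toy case, households 2 and 5 count as having access to credit only under the 12-month recall definition, so comparing a baseline measured one way with an endline measured the other way would mechanically overstate growth in access.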
```{r, echo = FALSE, warning = FALSE, message = FALSE} cr_bl <- bl %>% extract_cred(other_as_util = FALSE) %>% left_join(select(bl, ident, treatment), by = "ident") %>% mutate(`Wave` = "Baseline", `Group` = recode_factor(treatment, "0" = "Control", "1" = "Treatment")) cr_bl2_hh <- cr_bl %>% count_cred(y = source_type, active_only = FALSE) %>% mutate(tot_wo_oamc = `Al Amana` + `Other formal` + `Informal` + `Other`, tot_wo_oamc = ifelse(tot_wo_oamc > 0, 1, 0), tot_all = `Other MFI` + tot_wo_oamc, tot_all = ifelse(tot_all > 0, 1, 0)) %>% ungroup() cr_bl2 <- cr_bl2_hh %>% summarise(tot_wo_oamc = mean(tot_wo_oamc, na.rm = TRUE), tot_all = mean(tot_all, na.rm = TRUE)) pct_recall <- round((cr_bl2[[2]] - cr_bl1[[2]]) / cr_bl1[[2]] * 100) # Good total of credit sample <- cr_bl2_hh %>% select(ident, tot_all) %>% right_join(sample, by = "ident") reg_tot_asis <- reg(sample, dep_vars = c("borrowed_alamana_el", "profit_total_el"), fullreg = FALSE, var_out = "treatment_el", separator = " ") reg_tot_correc <- reg(sample, dep_vars = c("borrowed_alamana_el", "profit_total_el"), fullreg = FALSE, var_out = "treatment_el", separator = " ", rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + tot_all + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + factor(paire_el)") ``` This inconsistency between borrowing recall periods at baseline and at endline is problematic when it comes to evaluating the impact of growth in access to credit. The identical naming and commenting on the code files suggests that the difference was not made on purpose. Besides, CDDP reiterate on three different occasions in their paper that this variable at baseline indicates whether a household “*had an outstanding formal loan over the past 12 months*” (pages 129, 132 and 133). 
Correcting this error increases the measured level of total access to credit at baseline by `r pct_recall`% across the treatment and control groups. The revised levels of access by source and by treatment or control group are detailed in Section 5.1.5, Table 11. Because CDDP use total access to credit as a control variable, the higher values obtained after correcting the errors pointed out in Sections 5.1.2 and 5.1.3 modify the measured impacts. For instance, the average treatment effect on access to AAA credit was estimated in CDDP Table 2 at `r reg_tot_asis[[1]]`, while it falls to `r reg_tot_correc[[1]]` once this error is corrected, that is, an impact of the experiment on credit take-up that is 30% lower. The average treatment effect on self-employment profits was estimated in CDDP Table 3 at `r reg_tot_asis[[2]]`, which is substantial and significant at the 10 percent level. Once the errors in total access to credit at baseline are corrected, the estimated treatment effect on profits becomes `r reg_tot_correc[[2]]`, which is smaller and insignificant.

### 5.1.4 All "other" credits were incorrectly recoded as "utilities" credit

In the baseline and endline surveys, credit sources were collected by the above-mentioned question ‘i3’. Each loan was registered on a specific line of the questionnaire depending on its credit source. Sixteen possible sources were proposed to respondents at baseline (we reproduce here the English translations by CDDP): ‘*Crédit agricole*’; ‘*Other bank*’; ‘*Al Amana*’; ‘*Zakoura*’; ‘*Crédit Agricole Foundation*’; ‘*Other MFI (Microfinance Institution)*’; ‘*Usurer/Rhnane*’; ‘*Jeweler*’; ‘*Family*’; ‘*Neighbor*’; ‘*Friend*’; ‘*Shop*’; ‘*A client*’; ‘*A supplier*’; ‘*Cooperative*’; ‘*Other, specify:*’. A 17th option was added at endline: ‘*Utilities credit*’.
Table 3 presents the number of respondents reporting one or more loan in the ‘*Other, specify:*’ and ‘*Utilities credit*’ categories: ```{r crosstab_crutil, message=F, warning=F, message=F, echo=F} borrowed_bl <- cr_bl%>% count_cred(y = `Source`, active_only = FALSE) %>% borrowed() %>% mutate(`Wave` = "Baseline") util_tb <- borrowed_el %>% mutate(`Wave` = "Endline") %>% bind_rows(borrowed_bl) %>% group_by(`Wave`) %>% summarise_if(is.numeric, sum, na.rm = TRUE) %>% select(`Wave`, `Other`, `Utilities` = `Utility`) %>% mutate(`Surveyed households` = c(nrow(borrowed_bl), nrow(borrowed_el)), `% other` = round((`Other` / `Surveyed households`)*100, 1), `% utilities` = round((`Utilities` / `Surveyed households`)*100, 1)) %>% select(`Wave`, `Surveyed households`, `Other`, `% other`, `Utilities`, `% utilities`) # Wave should be renamed "survey rounds". Better without colnames(util_tb) <- c(" ", colnames(util_tb)[-1]) util_tb %>% kable(caption = "Number of households reporting one or more loan in the ‘Other’ and ‘Utility’ categories", format = "latex", booktabs = T, align = "c", longtable = T, format.args = list(big.mark = ",")) %>% kable_styling(full_width = T, latex_options = "HOLD_position") %>% column_spec (2, width = "4cm") %>% footnote(general = c("Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys."), general_title = "", threeparttable = T) ``` However, when recoding these variables, all sources registered as ‘*Other, specify:*’ were reclassified as ‘*Utilities credit*’ (see code in Appendix A.2.3). In other words, CDDP considered that all credit from sources other than those listed in the questionnaire was credit from water or electricity companies, even at endline where loans from water or electricity companies were specific options listed in the questionnaire. 
To check for consistency, we first correlated the ‘*Other, specify:*’ answers to Question ‘i3’ with the variable indicating whether households had water or electricity supply, both at baseline and endline (Table 9). ```{r chk_othr, message=F, warning=F, echo =F} # As classified, do they have electricity or water? extr_util <- function(x) { x %>% select(ident, `Water` = b13_1, `Electricity` = b15_1) %>% mutate(`Water` = recode_factor(`Water`, "1" = "Yes", "2" = "No", "-99" = "NA"), `Electricity` = recode_factor(`Electricity`, "1" = "Yes", "2" = "No", "-99" = "NA"), `Utility service` = ifelse(`Electricity` == "Yes" | `Water` == "Yes", "Yes", "No")) } util_bl <- extr_util(bl) util_el <- extr_util(el) # Analyse other at BL borrowed_bl %>% ungroup() %>% left_join(util_bl, by = "ident") %>% mutate(`Other` = recode_factor(`Other`, "0" = "No \"Other\" credit at baseline", "1" = "\"Other\" credit at baseline")) %>% count(`Utility service`, `Other`) %>% spread(`Other`, n) %>% rename(`Has electricity or water` = `Utility service`) -> tb_util_bl # Analyse other at EL borrowed_el %>% ungroup() %>% left_join(util_el, by = "ident") %>% mutate(`Other` = recode_factor(`Other`, "0" = "No \"Other\" credit at endline", "1" = "\"Other\" credit at endline")) %>% count(`Utility service`, `Other`) %>% spread(`Other`, n) %>% rename(`Has electricity or water` = `Utility service`) %>% select(1, 3) -> tb_util_el # Analyse other at BL util_tb2 <- borrowed_bl %>% ungroup() %>% left_join(util_bl, by = "ident") %>% mutate(`Other` = recode_factor(`Other`, "0" = "No \"Other\" credit at baseline", "1" = "\"Other\" credit at baseline")) %>% count(`Utility service`, `Other`) %>% spread(`Other`, n) %>% rename(`Has electricity or water` = `Utility service`) %>% select(1,3) %>% left_join(tb_util_el, by = "Has electricity or water") %>% filter(!is.na(`Has electricity or water`)) tot_othbl <- sum(util_tb2$`\"Other\" credit at baseline`) tot_othel <- sum(util_tb2$`\"Other\" credit at endline`) util_tb2 <- 
util_tb2 %>% mutate(`% baseline` = `\"Other\" credit at baseline` / tot_othbl, `% baseline` = round(`% baseline` * 100, 1), `% endline` = `\"Other\" credit at endline` / tot_othel, `% endline` = round(`% endline` * 100, 1)) %>% select(`Has electricity or water`, `\"Other\" credit at baseline`, `%` = `% baseline`, `\"Other\" credit at endline`, `%` = `% endline`) util_tb2 %>% kable(caption = "Number of observations for which ‘other’ credit was recoded as ‘utility’ credit, whether they had access to utility services", format = "latex", booktabs = T, align = "c", longtable = T) %>% kable_styling(full_width = T, latex_options = "HOLD_position") %>% column_spec (3, width = "1cm") %>% column_spec (5, width = "1cm") %>% footnote(general = c("Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys."), general_title = "", threeparttable = T) util_prop_bl <- sum(util_bl$`Utility service` == "Yes", na.rm = T) / sum(!is.na(util_bl$`Utility service`)) util_prop_bl <- round(util_prop_bl * 100, 1) util_prop_el <- sum(util_el$`Utility service` == "Yes", na.rm = T) / sum(!is.na(util_el$`Utility service`)) util_prop_el <- round(util_prop_el * 100, 1) ``` The vast majority of surveyed households were connected to water and electricity, with `r util_prop_bl`% having access to one of these services at baseline and `r util_prop_el`% at endline (these two rates are simple averages without weighting). However, it does not seem appropriate to have recoded all declared “other” credit sources as “utility credit”. It appears, for instance, implausible that households without water and electricity (first row in Table 9) could have received a "utility credit". In the questionnaire, the ‘*Other, specify:*’ option was followed by a field where the respondent was supposed to give the name of this unspecified source. We present in Appendix 1 the occurrences encountered in this complementary variable and their corresponding frequencies. 
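As an illustration of how such free-text answers can be reclassified rather than assigned wholesale to a single category, here is a minimal sketch. The lookup vectors are shortened and purely illustrative, and `classify_other()` is a hypothetical helper name; answers that match no list, including missing ones, stay in the ‘Other’ category.

```r
# Shortened, illustrative lookup vectors of cleaned "specify" answers
utility <- c("electricite", "one", "onep", "temasol")
bank    <- c("eqdom", "wafsalof")
shop    <- c("epicerie", "souk")

# Hypothetical helper: map each free-text answer to a source type.
# Unmatched or missing answers remain "Other".
classify_other <- function(spec) {
  ifelse(spec %in% utility, "Utility",
  ifelse(spec %in% bank,    "Other formal",
  ifelse(spec %in% shop,    "Informal", "Other")))
}

answers <- c("one", "eqdom", "souk", "macon", NA)
recoded <- classify_other(answers)
```

Because `%in%` returns `FALSE` for `NA`, a missing specification is never promoted to ‘Utility’, which is the conservative choice when the respondent gave no information.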
At baseline for instance, a specification corresponding to a utility company was provided in 29% of the cases, but in the others, the specifications corresponded to other types of sources (local stores, consumer lending, real estate purchase, etc.) or were missing. This indicates that, both at baseline and endline, credits registered as ‘*other*’ should not have been systematically reclassified as ‘*utility credit*’. ```{r recode_othr, message=F, warning=F, echo =F} cr_bl <- cr_bl %>% mutate(`Wave` = "Baseline", `Group` = recode_factor(treatment, "0" = "Control", "1" = "Treatment")) cr_el <- cr_el %>% mutate(`Wave` = "Endline", `Group` = recode_factor(treatment, "0" = "Control", "1" = "Treatment")) # Obtain a list of all "please specify", registerd for "other" othr_bl <- filter(cr_bl, `Source` == "Other") othr_el <- filter(cr_el, `Source` == "Other") othr <- bind_rows(othr_bl, othr_el) othr_i <- unique(othr$src_oth) # sort(othr_i) # print list to manually reclassify # Those are clearly for water or electricity connexion utility <- c("arsilaf electricite", "barnchement electricite", "branchemenet electricite", "branchemenr electricite", "branchement d ectricite", "branchement d electricite","branchement electricite", "branchement electrique", "ectricil", "elect one","elec one", "elect one", "electrcite", "electricel", "electricit", "electricit one", "electricite", "electricite one", "eletrul", "elictricite", "energie", "energie solaire", "office nationale electricite", "o n e", "one", "one electricite", "onep", "safac credit", "tema sol", "temasol", "temsol", "tenasol", "tenesol") # Those are clearly for banks bank <- c("ecdam", "ecdom", "eddom", "eedam", "eqdom", "eqdon", "ikdem", "ikdom", "ikdon", "wafsalof") # From shops shop <- c("boucher", "epicerie", "souk") # Those are the others other <- c("afni","ascam", "credit", "e2oom", "ecd091", "en scolaire","hebouss", "kayadat", "macon", "maison de vente", "proprietaire ferme", "remboursement pour la retraite", 
"societe tene shems", "societe tenne shems", "trysol") # Function to recode with these variables recode_othr <- function(x) { x %>% mutate(source_type3 = ifelse(source_type2 != "Other", as.character(source_type2), ifelse(.$src_oth %in% utility, "Utility", ifelse(.$src_oth %in% bank, "Other formal", ifelse(.$src_oth %in% shop, "Informal", "Other")))), source_type3 = recode_factor(source_type3, "Al Amana" = "Al Amana", "Other formal" = "Other formal", "Informal" = "Informal", "Utility" = "Utility", "Other" = "Other")) } # We then recode unkown sources as branching for consistency purpose cr_bl <- recode_othr(cr_bl) cr_el <- recode_othr(cr_el) # Prepare a summary othr <- recode_othr(othr) util_tb3 <- othr %>% group_by(`Wave`) %>% count(`Recoded source` = source_type3) %>% spread(`Wave`, n) %>% mutate(`Collected as` = "Other", `Recoded by CDDP as` = "Utility", `Must instead be recoded as` = recode_factor(`Recoded source`, "Informal" = "Informal", "Other" = "Other", "Other formal" = "Other formal", "Utility" = "Utility")) %>% arrange(`Must instead be recoded as`) %>% mutate(`Respondent specified` = c(paste(shop, collapse = ", "), paste(c("No specification, or:", paste(other, collapse = ", ")), collapse = " "), paste(bank, collapse = ", "), paste(utility, collapse = ", "))) %>% select(`Collected as`, `Recoded by CDDP as`, `Must instead be recoded as`, `Baseline`, `Endline`, `Respondent specified`) ``` ```{r message=F, warning=F, echo=F} # Select credits with no souce but positive amount nd_bl <- filter(cr_bl, is.na(`Source`), amount > 0) # Check in original data to double check that this is indeed missing source data bl %>% filter(ident %in% nd_bl$ident) %>% select(hh_num = ident, matches("^i[[:digit:]]_")) -> chkna_bl nd_el <- filter(cr_el, is.na(`Source`), amount > 0) ``` In addition, `r length(nd_bl)` credits at baseline and `r length(nd_el)` credits at endline were registered with the amount, guarantee and other fields, but no source. 
Due to these missing values in the 'source' variable, these credits were not taken into account in the computations made by CDDP. For our replication, to avoid omitting them from descriptive statistics on access to credit, we replace these empty values with ‘*Other*’ in the ‘*source*’ field. ```{r message=F, warning=F, echo=F} # Recode NA sources with positive credit values by unknown source # at baseline recode_miss <- function(x) { x %>% mutate(source_type3 = ifelse(!is.na(`Source`), as.character(source_type3), ifelse(is.na(`Source`) & amount > 0, "Other", source_type3)), source_type3 = recode_factor(source_type3, "Al Amana" = "Al Amana", "Other formal" = "Other formal", "Informal" = "Informal", "Utility" = "Utility", "Other" = "Other")) } # We then recode unkown sources as branching for consistency purpose # And re-generate credit count and dummy from this new basis cr_bl <- recode_miss(cr_bl) nb_cred_bl <- count_cred(cr_bl, source_type3) borrowed_bl <- borrowed(nb_cred_bl) cr_el <- recode_miss(cr_el) nb_cred_el <- count_cred(cr_el, source_type3) borrowed_el <- borrowed(nb_cred_el) borrowed_bl_t <- borrowed_bl %>% left_join(select(bl, ident, treatment, paire, demi_paire), by = "ident") %>% rename(borrowed_alamana_blok = `Al Amana`, borrowed_oformal_blok = `Other formal`, borrowed_informal_blok = `Informal`, borrowed_utility_blok = `Utility`, borrowed_other_blok = `Other`) %>% mutate(borrowed_formal_blok = ifelse(borrowed_alamana_blok > 0 | borrowed_oformal_blok > 0, 1, 0), borrowed_total_blok = ifelse(borrowed_formal_blok > 0 | borrowed_informal_blok > 0 | borrowed_utility_blok > 0 | borrowed_other_blok > 0, 1, 0)) %>% ungroup() vars_cred <- c("borrowed_alamana_blok", "borrowed_oformal_blok", "borrowed_informal_blok", "borrowed_utility_blok", "borrowed_other_blok") bal_cred <- reg_balance(treatment = treatment, dep_vars = vars_cred, controls = "factor(paire)", cluster = "demi_paire", data = borrowed_bl_t) sample <- borrowed_el %>% rename(borrowed_alamana_elok = 
`Al Amana`, borrowed_oformal_elok = `Other formal`, borrowed_informal_elok = `Informal`, borrowed_utility_elok = `Utility`, borrowed_other_elok = `Other`) %>% mutate(borrowed_formal_elok = ifelse(borrowed_alamana_elok > 0 | borrowed_oformal_elok > 0, 1, 0), borrowed_total_elok = ifelse(borrowed_formal_elok > 0 | borrowed_informal_elok > 0 | borrowed_utility_elok > 0 | borrowed_other_elok > 0, 1, 0)) %>% ungroup() %>% right_join(sample, by = "ident") %>% left_join(select(borrowed_bl_t, ident, starts_with("borr")), by = "ident") imp_cred1 <- reg(sample, dep_vars = c("borrowed_branching_el"), fullreg = FALSE, var_out = "treatment_el", separator = " ") imp_cred2 <- reg(sample, dep_vars = c("borrowed_alamana_elok", "borrowed_oformal_elok", "borrowed_informal_elok", "borrowed_utility_elok", "borrowed_other_elok", "borrowed_formal_elok", "borrowed_total_elok"), fullreg = FALSE, var_out = "treatment_el", separator = " ", rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_blok + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + factor(paire_el)") ``` This approximation regarding credit from utility companies is noteworthy since they appear as the most important credit source in the surveyed villages, and it is also the type of credit source whose penetration varies the most between baseline and endline. After this correction, the results of the balance tests computed in CDDP Table 1 are modified. 
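The balance tests referred to here regress a baseline variable on a treated-village dummy while controlling for pair (strata) dummies. The following minimal sketch reproduces that specification on simulated data; the data, the variable names and the imbalance built into the simulation are hypothetical. For brevity it reports conventional standard errors only, whereas a cluster-robust covariance (for instance with `sandwich::vcovCL()`) would be needed to cluster at the village level.

```r
set.seed(1)

# Simulated design (hypothetical): 20 pairs of villages, one treated and
# one control village per pair, 10 households per village
n_pairs  <- 20
villages <- data.frame(
  pair      = rep(seq_len(n_pairs), each = 2),
  treatment = rep(c(0, 1), times = n_pairs)
)
villages$village <- seq_len(nrow(villages))
hh <- villages[rep(seq_len(nrow(villages)), each = 10), ]

# Hypothetical baseline dummy: utility loan, simulated as more frequent
# in treatment villages to mimic an imbalance
hh$utility_loan <- rbinom(nrow(hh), 1, ifelse(hh$treatment == 1, 0.20, 0.10))

# Balance regression: baseline variable on treatment dummy and pair dummies
fit <- lm(utility_loan ~ treatment + factor(pair), data = hh)

# Coefficient and (non-clustered) p-value on the treatment dummy
balance <- coef(summary(fit))["treatment", c("Estimate", "Pr(>|t|)")]

# Raw difference in means between treatment and control households
diff_means <- mean(hh$utility_loan[hh$treatment == 1]) -
  mean(hh$utility_loan[hh$treatment == 0])
```

In this perfectly balanced simulated design, the coefficient on `treatment` equals the raw difference in means between treatment and control households; with unequal village sizes the two would generally differ.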
```{r, warning = FALSE, echo = FALSE, message = FALSE}
bal_cred <- bal_cred %>%
  mutate(`Dependent variable` = recode(`Dependent variable`,
           "borrowed_alamana_blok" = "Loan from Al Amana",
           "borrowed_oformal_blok" = "Loan from other formal institution",
           "borrowed_informal_blok" = "Informal loan",
           "borrowed_utility_blok" = "Electricity or water connection loan",
           "borrowed_other_blok" = "Other source"))

colnames(bal_cred) <- c(" ", "Obs. ", "Obs.", "Mean", "SD", "Coeff.", "p-value")

bal_cred %>%
  mutate(`Obs. ` = as.numeric(`Obs. `),
         `Obs.` = as.numeric(`Obs.`)) %>%
  # mutate_all(linebreak) %>%
  kable(format = "latex", booktabs = T, escape = F,
        format.args = list(big.mark = ","), align = "l",
        caption = "Summary statistics: rectified balance at baseline on credit variables") %>%
  kable_styling(latex_options = "HOLD_position") %>%
  add_header_above(c(" " = 2, "Control group" = 3, "Treatment - Control" = 2)) %>%
  # column_spec (1, width = "5cm") %>%
  footnote(general = c("Source: Our reproduction of CDDP Table 1 with R using the same specifications and correcting loan reclassification.",
                       "Coefficients and p-values from an OLS regression of the variable on a treated village dummy, controlling for strata dummies (paired villages). Standard errors are clustered at the village level.",
                       "*** Significant at the 1 percent level",
                       "** Significant at the 5 percent level",
                       "* Significant at the 10 percent level"),
           general_title = "", threeparttable = T)
```

As Table 10 shows, once the unjustified reclassification of other credit as utility credit is undone, access to utility credit is significantly unbalanced at baseline. This rectification also alters the estimated average treatment effect on access to utility credit at endline. This effect was estimated at `r imp_cred1[[1]]` in CDDP Table 2, which is small and insignificant. When the unjustified reclassification is undone, it becomes `r imp_cred2[[4]]`, which is larger and significant.
It is unclear whether this significant increase in access to utility credit in treatment villages is an unexpected impact of increased AAA credit or the result of contamination by a co-intervention. In any case, further analysis would be required to disentangle the impact of microcredit from that of utility credit in this context. The existence of such an imbalance at baseline and of such effects at endline is a threat to the internal validity of this RCT. It suggests that the integrity of the experiment may have been compromised, in which case part of the measured results would be attributable to utility credit rather than microcredit. Another RCT conducted during the same period in Morocco found significant impacts of utility credit on household well-being [@devoto_happiness_2012]. These results also raise questions regarding the external validity of the experiment: would the results apply to a context where water and electricity companies make no major effort to expand their services?

### 5.1.5 Credit access and identification of the treatment

In their published article, CDDP are very straightforward in the way they describe the difference in access to credit between treatment and control villages:

> "*Thirteen percent of the households in treatment villages took a loan, and none in control villages did.*" [@crepon_estimating_2015: abstract]

> "*The study has three features that make it a good complement to existing papers. First, it takes place in an area where there is absolutely no other microcredit penetration, before or after the introduction of the product, and for the duration of the study.*" [@crepon_estimating_2015: 124]

> "*The experimental design was generally well respected, and we observe essentially no entry of Al Amana (or any other MFI, as it turns out) in the control group.
Villagers did not travel to other branches to get loans either.*" [@crepon_estimating_2015: 130] We computed credit prevalence in the treatment and control group at baseline and endline. As a substantial number of households (1,433) were added at endline without having been surveyed at baseline, we present the same analysis on the different subsets: - One with the 5,551 households surveyed at endline and the 4,465 households surveyed at baseline (cross sections), and - One with only the 4,118 households surveyed both at baseline and endline (panel). Figure 1 focuses on panel households and Table 11 presents credit access for both panel and cross section households. ```{r analyse_cr_access, message=F, warning=F, echo=F} # Just checking number of recoded from other to other formal who already had a formal loan borrowed2_el <- el %>% extract_cred() %>% count_cred(source_type2) %>% borrowed() # Produce the graphs borrowed_el <- borrowed_el %>% left_join(select(el, ident, `Group` = treatment), by = "ident") %>% mutate(`Wave` = "Endline", `Group` = recode_factor(`Group`, "0" = "Control", "1" = "Treatment")) borrowed_bl <- borrowed_bl %>% left_join(select(bl, ident, `Group` = treatment), by = "ident") %>% mutate(`Wave` = "Baseline", `Group` = recode_factor(`Group`, "0" = "Control", "1" = "Treatment")) borrowed_all <- borrowed_el %>% bind_rows(borrowed_bl) # A function to analyse share of hh having access per source access_src <- function (x, panel = FALSE) { #select or not panel obs # Keep only panel observations if panel = TRUE if (panel) { x <- x %>% filter(ident %in% bl$ident ) %>% #discard added at EL filter(ident %in% el$ident) #discard dropouts } # Prepare summary x %>% # to avoid counting duplicates Al Amana + other formal mutate(`Al Amana only` = ifelse(`Al Amana` == 1 & `Other formal` == 0, 1, 0), `Other formal only` = ifelse(`Al Amana` == 0 & `Other formal` == 1, 1, 0), `Both` = ifelse(`Al Amana` == 1 & `Other formal` == 1, 1, 0)) %>% select(-`Al Amana`, -`Other 
formal`) %>% # rest normal gather(key = `Source`, value = has_cred, -ident, -`Group`, -`Wave`) %>% group_by(`Group`, `Source`, `Wave`) %>% summarise(mean_type = round(mean(has_cred, na.rm = TRUE)*100, 2)) %>% ungroup() %>% mutate(type = ifelse(`Source` == "Al Amana only" | `Source` == "Other formal only" | `Source` == "Both", "Formal", `Source`)) %>% mutate(src_gp = interaction(type, `Wave`)) %>% mutate(type = recode_factor(type, "Formal" = "Formal", "Informal" = "Informal", "Utility" = "Utility", "Other" = "Other")) %>% mutate(src_gp = recode_factor(src_gp, "Formal.Baseline" = "Formal.Baseline", "Formal.Endline" = "Formal.Endline", "Informal.Baseline" = "Informal.Baseline", "Informal.Endline" = "Informal.Endline", "Utility.Baseline" = "Utility.Baseline", "Utility.Endline" = "Utility.Endline", "Other.Baseline" = "Other.Baseline", "Other.Endline" = "Other.Endline")) %>% mutate(src_aaa = recode_factor(`Source`, "Al Amana only" = "AAA and no other formal", "Both" = "AAA and other formal", .default = "Other formal")) } cred_panel <- borrowed_all %>% access_src(panel = TRUE) %>% mutate(`Subsample` = "Panel") cred_all <- borrowed_all %>% access_src(panel = FALSE) %>% mutate(`Subsample` = "Cross-section") # for initial graph before review cred_consolid_orig <- cred_panel %>% bind_rows(cred_all) cred_consolid <- cred_consolid_orig %>% filter(`Subsample` == "Panel") cred_consolid %>% ggplot(aes(x = src_gp, y = `mean_type`, fill = `Wave`)) + geom_vline(xintercept = c(2.5,4.5,6.5), lwd=1, colour="white") + geom_bar(stat = "identity", position = position_dodge()) + geom_bar(aes (x = src_gp, y = `mean_type`, color = src_aaa), stat = "identity", position = position_stack(), lwd = 0.5) + scale_color_manual(values = c("purple", "blue", "grey")) + guides(colour = guide_legend(title = "Contour specify source of formal credit", title.position="top", ncol = 1), fill = guide_legend(title = "Survey", title.position="top", ncol = 1)) + labs(y = "%", x = "Type of credit source", title = 
"Figure 1: Changes in access to credit sources for panel households", subtitle = "Households surveyed both at baseline and endline (N = 4,118)") + theme_gray() + theme(axis.text.x = element_text(hjust = 0), axis.ticks.x = element_blank(), panel.grid.major.x = element_blank(), legend.position = "bottom") + scale_x_discrete(labels = c("Formal", "", "Informal","", "Utility", "", "Other", "")) + scale_y_continuous(limits = c(0, 17.5)) + facet_grid(. ~ `Group`) ``` Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys. The difference in Table 11 between the repeated cross-sections and the panel households highlights sample errors, which we will analyse in more detail in Section 6 of this replication. At this stage, Table 11 shows that the attrition households and the households added at endline are very different in terms of borrowing levels to the households that were interviewed both at baseline and endline. This tends to rule out cross-section analysis and calls for a panel analysis instead. If we focus on growth in access to credit for panel households, as presented in Figure 1, we observe three striking phenomena that undermine the identification strategy used by CDDP. 
```{r cr_access_tb, message=F, warning=F, echo=F}
cred_tb <- cred_consolid_orig %>%
  mutate(group_wave = paste(`Group`, `Wave`, sep = ".")) %>%
  select(`Subsample`, `Type` = type, `Source`, group_wave, mean_type) %>%
  spread(key = group_wave, value = `mean_type`)

sumform <- cred_tb %>%
  group_by(`Subsample`, `Type`) %>%
  summarise_if(is.numeric, sum) %>%
  filter(`Type` == "Formal") %>%
  mutate(`Source` = "Total formal sources")

# Relabel sources for display. The formal-only category is named
# "Other formal only" in access_src(), so that is the key to recode.
cred_tb <- cred_tb %>%
  bind_rows(sumform) %>%
  mutate(`Source` = recode(`Source`,
                           "Al Amana only" = "AAA and no other formal",
                           "Both" = "AAA and other formal",
                           "Other formal only" = "Other formal and no AAA",
                           "Informal" = "Any informal source",
                           "Utility" = "Water or electricity company",
                           "Other" = "None of the above or not specified")) %>%
  arrange(`Subsample`, `Type`, `Source`)

cred_tb %>%
  kable(format = "latex", booktabs = T, linesep = '',
        col.names = c("", "", "Credit source", "Baseline", "Endline", "Baseline", "Endline"),
        caption = "Changes in access to credit sources") %>%
  collapse_rows(1:3, row_group_label_position = "stack") %>%
  add_header_above(c(" " = 3, "Control" = 2, "Treatment" = 2)) %>%
  kable_styling(full_width = T, latex_options = "HOLD_position") %>%
  column_spec (3, width = "6cm") %>%
  column_spec (4, width = "1.4cm") %>%
  column_spec (5, width = "1.4cm") %>%
  column_spec (6, width = "1.4cm") %>%
  column_spec (7, width = "1.4cm") %>%
  footnote(general = "Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys.",
           general_title = "")
```

First, access to formal credit did not notably increase in the treatment group (from 11.31% at baseline to 11.41% at endline). What we observe instead is a substitution of AAA for other formal credit sources.

Second, access to formal credit decreased markedly in the control group (from 7.77% at baseline to 4.70% at endline). This might be due to the microcredit crisis that hit Morocco in 2008 [@chen_growth_2010; @rozas_ending_2014; @despallier_crises_2015].
It could also be explained by an agreement reached at the beginning of the RCT with the leading financial institutions that they would not intervene in the study areas. It might also be caused by AAA (which headed the influential national MFI association at the time) calling on its fellow financial institutions to minimise the contamination of the experiment during the RCT.

Third, according to the survey data, utility credit is by far the most prevalent credit source in both treatment and control groups. The apparent stability of access to utility credit over time rests on the assumption made by CDDP that all “other” sources of credit were “utility” credit. We show that this cannot be true in a significant proportion of cases, where it patently contradicts the available information (Table 9). If we reject the automatic reclassification of “other” credits as “utility” credits when no such specification was given by respondents, then variations in “utility” credit between baseline and endline are substantial.

These observations challenge the very meaning of the experiment put forward by CDDP. What has been evaluated: is it the impact of the replacement of other formal sources with AAA in the treatment group? Is it credit rationing in the control group? Or is it the variation in utility credit?

## 5.2 Outcome measures and controls

### 5.2.1 Incomplete or inconsistent data

We discuss here just two examples of the many survey data inconsistencies we found.
```{r meas_err_out, warning=F, message=F, echo=F} # Counting missing assets asset_na_bl <- bl %>% extract_ast_agr() %>% filter(owns == 1 & is.na(number)) %>% group_by(`Asset`) %>% count() asset_na_el <- el %>% extract_ast_agr() %>% filter(owns == 1 & is.na(number)) %>% group_by(`Asset`) %>% count() asset_na_blt <- sum(asset_na_bl$n, na.rm = TRUE) asset_na_elt <- sum(asset_na_el$n, na.rm = TRUE) # Finding missing activities--------------------------------------------------- # modify function recoding activities to sum instead of dummy code_activity2 <- function(x) { x %>% mutate(code = code / 100, act_livestock = ifelse(sector == 2 | (is.na(sector) & !is.na(nom) & nom != "" & code >= 2 & code < 3), 1, 0), act_business = ifelse((!is.na(nom) & nom != "" & sector >= 3 & sector <= 6) | (is.na(sector) & !is.na(nom) & nom != "" & code >= 3 & code < 7), 1, 0)) %>% group_by(ident) %>% summarise(act_livestock = sum(act_livestock, na.rm = TRUE), act_business = sum(act_business, na.rm = TRUE)) } bus_d_el <- el %>% extract_activities() %>% code_activity2() bus_g_el <- el %>% extract_bac() bus_g_el$empty <- apply(bus_g_el[,3:17], 1, function(x) all(is.na(x))) bus_g_el <- bus_g_el %>% filter(empty == FALSE) %>% group_by(ident) %>% count() bus_el <- bus_d_el %>% left_join(bus_g_el, by = "ident") %>% mutate(n = ifelse(is.na(n), 0, n), dif = act_business - n, status = ifelse(dif > 0, "d_no_g", ifelse(dif < 0, "g_no_d", ifelse(act_business > 0, "d_and_g", "")))) %>% group_by(status) %>% count() d_no_g <- bus_el$nn[bus_el$status == "d_no_g"] d_and_g <- bus_el$nn[bus_el$status == "d_and_g"] g_no_d <- bus_el$nn[bus_el$status == "g_no_d"] d_tot <- d_and_g + d_no_g d_no_g_per <- round(d_no_g / d_tot * 100, 1) g_tot <- g_no_d + d_and_g g_no_d_per <- round(g_no_d / g_tot * 100, 1) ``` In `r c_th(asset_na_blt)` cases at baseline and `r asset_na_elt` cases at endline, households declared having agricultural assets of some kind, but the number of items is missing. 
These assets were therefore not taken into account in the total. This applies to all types of agricultural assets, from tractors, reapers, cars and trucks to shovels, axes and sickles. The same problem affects livestock assets and business assets.

Two sections of the questionnaire focused on the assessment of business (non-farm) activities. Section D on “Household activities” records (only) self-employment activities, and Section G gathers all information, including financial information, on these activities, excluding agriculture (questionnaire section E) and stockbreeding (questionnaire section F). We find that of the `r d_tot` households with business activities registered in D at endline, `r d_no_g` (`r d_no_g_per`%) have no business activity, or only part of their business activities, documented in G. Conversely, of the `r g_tot` households with business activities documented in G, `r g_no_d` (`r g_no_d_per`%) have no business activity, or only part of their business activities, registered in D.

These inconsistencies cannot be corrected with the available information. However, they call into question the quality of the underlying data of this RCT, and hence its internal validity.
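The cross-check between the two questionnaire sections can be sketched as follows. The household counts and column names are hypothetical; the status labels mirror the logic described above (more activities in D than in G, more in G than in D, or consistent records).

```r
# Hypothetical number of business activities per household in each section
sec_d <- data.frame(ident = 1:6, n_d = c(1, 2, 0, 1, 0, 1))          # section D
sec_g <- data.frame(ident = c(1, 2, 3, 6), n_g = c(1, 1, 1, 2))      # section G

# Households absent from G have no activity documented there
chk <- merge(sec_d, sec_g, by = "ident", all.x = TRUE)
chk$n_g[is.na(chk$n_g)] <- 0

# Classify each household by the consistency of its two records
chk$status <- ifelse(chk$n_d > chk$n_g, "d_no_g",
              ifelse(chk$n_d < chk$n_g, "g_no_d",
              ifelse(chk$n_d > 0, "d_and_g", "none")))

# Share of households with activities in D that are missing or
# incompletely documented in G
d_tot      <- sum(chk$status %in% c("d_and_g", "d_no_g"))
d_no_g_per <- round(sum(chk$status == "d_no_g") / d_tot * 100, 1)
```

In this toy example, two of the three households with consistent or excess records in D lack complete documentation in G, which is the kind of gap reported above.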
### 5.2.2 '*Tractors*' and '*reapers*' removed from asset appraisal at endline ```{r warning=F, message=F, echo=F} # summarie values ast_bl_items <- extract_ast_agr(bl) ast_bl <- ast_bl_items %>% group_by(`Asset`) %>% summarise(`Total number of items owned by households at baseline` = sum(ifelse(owns < 3, number, 0), na.rm = TRUE), `Number of times a value was reported at baseline` = sum(new_cost > 0, na.rm = TRUE), `Median value at baseline` = median(as.numeric(new_cost), na.rm = TRUE)) ast_el_items <- extract_ast_agr(el) ast_el <- ast_el_items %>% group_by(`Asset`) %>% summarise(`Total number of items owned by households at endline` = sum(ifelse(owns < 3, number, 0), na.rm = TRUE), `Number of times a value was reported at endline` = sum(new_cost > 0, na.rm = TRUE), `Median value at endline` = median(as.numeric(new_cost), na.rm = TRUE)) ast_comp <- ast_bl %>% full_join(ast_el, by = "Asset") %>% mutate(`% variation of median value` = round((`Median value at endline` - `Median value at baseline`) / `Median value at baseline` * 100)) ast_ap_blel <- ast_bl_items %>% bind_rows(ast_el_items) %>% mutate(`Asset` = recode_factor(`Asset`, "Other 1" = "Other", "Other 2" = "Other", "Other 3" = "Other", "Other 4" = "Other", "Other 5" = "Other")) %>% group_by(`Asset`) %>% summarise(median_value = median(new_cost, na.rm = TRUE)) ast_ap_blel <- ast_ap_blel %>% bind_rows( tibble(`Asset` = c("Other 1", "Other 2", "Other 3", "Other 4", "Other 5"), median_value = c(rep(ast_ap_blel$median_value[ ast_ap_blel$`Asset` == "Other"],5)))) # start section to recompute. 
comp_agro_ast <- appraise_agro_ast(ast_el_items, ast_ap = ast_ap_blel) %>% left_join(appraise_agro_ast(ast_el_items, ast_ap = ""), by = "ident", suffix = c("_full_pr", "_full")) %>% left_join(appraise_agro_ast(ast_el_items, ast_ap = "", exclude = c("Tractor", "Reaper")), by = "ident") %>% left_join(select(el, ident), by = "ident") %>% mutate(ident = as.character(ident)) mean_ast_wotr <- round(mean(comp_agro_ast$asset_agri, na.rm = TRUE)) mean_ast_wtr <- round(mean(comp_agro_ast$asset_agri_full, na.rm = TRUE)) mean_ast_wtr_pr <- round(mean(comp_agro_ast$asset_agri_full_pr, na.rm = TRUE)) test <- sample %>% left_join(comp_agro_ast, by = "ident") %>% mutate(assets_total_wtr = assets_total_el - asset_agri + asset_agri_full, assets_total_wtr_pr = assets_total_el - asset_agri + asset_agri_full_pr) # This one with the same specifications as CDDP imp_ast_asis <- reg(test, dep_vars = c("assets_total_wtr", "assets_total_wtr_pr", selfact), fullreg = FALSE, var_out = "treatment_el", separator = " ", rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_bl + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + factor(paire_el)") # This one with correctd borrowed_total imp_ast_totb <- reg(test, dep_vars = c("assets_total_wtr", "assets_total_wtr_pr", selfact), fullreg = FALSE, var_out = "treatment_el", separator = " ", rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_blok + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + factor(paire_el)") # End section to recompute ``` At baseline, CDDP included all types of assets to calculate 
the total value of the assets owned by all households. However, an examination of the code used to compute endline data (see Appendix A.2.4) shows that two types of assets have been removed from the sum of asset values calculated for each household: tractors and reapers. The code in the baseline and endline preparation do-files is otherwise the same, suggesting that it was copy-pasted. This specific change was therefore made intentionally, but it is not mentioned in the published article. It was probably motivated by the fact that the appraisal method used by CDDP produces inaccurate prices, which are particularly erratic for those two assets (see Section 5.2.3). This exclusion is nevertheless inappropriate, because this RCT aims to evaluate the impact on assets, among other outcomes, and tractors and reapers are the most valuable assets that households possess.

Including tractors and reapers in the asset appraisal at endline increases average asset value in the sample from `r mean_ast_wotr` to `r mean_ast_wtr`. It also modifies the impact estimation on total assets at endline. This was `r imp_ast_asis[[3]]` in CDDP Table 2, which is substantial and significant. It becomes `r imp_ast_asis[[1]]`, which is larger but insignificant, when we include tractors and reapers in total assets while keeping the same control variables as CDDP. However, it becomes `r imp_ast_totb[[1]]`, which is larger and significant, when we include the total access to credit as corrected in Section 5.1.2 and Section 5.1.3. This estimation is further modified when we correct the price calculation used for asset valuation, as explained in the following section.

### 5.2.3 Assets, sales and consumption appraised with inconsistent prices

The survey suffers from a classic problem with price imputation, which arises every time assets, sales, consumption of own production and in-kind savings have to be evaluated [@deaton_analysis_1997: 28-29, 35-39]. A price has to be imputed for each item to account for its value.
Yet in most cases, no transaction price is available for that particular item, either because there was no transaction (assets purchased more than a year ago, consumption of own production or savings) or because the transaction price was not registered (new assets and sales). In these cases, the median price of all observed transactions by other households for this item was imputed. The problem is that for some items, the number of transactions for which a price is available is very small, exposing the median to being skewed by outliers or implausible prices reported by the households. Table 12 presents some illustrations of median prices imputed to agricultural assets.

```{r warning=F, message=F, echo=F}
ast_comp %>%
  kable(caption = "Median prices imputed to agricultural assets at baseline and endline",
        format = "latex", booktabs = T, longtable = T,
        format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = T, latex_options = "HOLD_position") %>%
  column_spec (2, width = "2.1cm") %>%
  column_spec (3, width = "1.8cm") %>%
  column_spec (5, width = "2.1cm") %>%
  column_spec (6, width = "1.8cm") %>%
  footnote(general = c("Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys."),
           general_title = "", threeparttable = T)
```

The median reaper value increased by 3,309% between the baseline and endline. The example of agricultural assets presented in Table 12 shows that imputing a median price where only a small number of transactions have been made in the last year gives rise to erratic asset valuations. With so few recorded transactions, it is clearly preferable to compute a median price that takes into account transactions observed at both baseline and endline. This hinders the capture of genuine price variations (inflation), but seems like a reasonable trade-off considering the absurd price variations observed above.
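The difference between wave-specific and pooled median prices can be sketched as follows. The prices below are invented for illustration only and are not taken from the CDDP microdata; the column names `wave`, `asset` and `price` are likewise hypothetical.

```r
# Hypothetical transaction prices (MAD), for illustration only
prices <- data.frame(
  wave  = c("baseline", "baseline", "baseline", "endline", "endline"),
  asset = c("Reaper", "Reaper", "Tractor", "Reaper", "Tractor"),
  price = c(300, 500, 60000, 12000, 58000)
)

# Wave-specific medians: with one or two transactions per cell,
# a single implausible price drives the imputed value
by_wave <- aggregate(price ~ asset + wave, data = prices, FUN = median)

# Pooled medians: one imputed price per asset, using the transactions
# observed at baseline and endline together
pooled <- aggregate(price ~ asset, data = prices, FUN = median)

pooled_reaper  <- pooled$price[pooled$asset == "Reaper"]
pooled_tractor <- pooled$price[pooled$asset == "Tractor"]
```

In this toy example the wave-specific reaper median jumps from 400 to 12,000, whereas the pooled median is 500 in both waves.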
According to this approach, tractors should have been appraised at MAD 60,000 at endline, which is the median value of the 25 transactions registered at baseline and endline combined. Reapers should be appraised at MAD 10,200 at endline, which is the median value of the 6 transactions registered at baseline and endline combined, etc. This correction further modifies the estimation of the experiment impact introduced in Section 5.2.2. The result on total assets at endline becomes `r imp_ast_asis[[2]]`, which is marginally significant, when we keep the same control variables as CDDP. It becomes `r imp_ast_totb[[2]]`, which is larger and significant, when we replace the total access to credit by the rectified value corrected in Section 5.1.2 and Section 5.1.3.

Moreover, a recurring problem can be seen in Table 12 with all items owned, sold or bought computed by CDDP. The “other” category is always valued at the same median price, even though it covers highly heterogeneous items. This problem with the undefined “other” category is found with the business, livestock and agricultural assets, and also with vegetable, cereal and tree sales. For instance, a tiller, a handheld sprayer and a pruning shear are valued at the same median price as soon as they come under the same “other” category.

### 5.2.4 Other measurement and coding errors on outcome and control variables

A series of other errors have been identified. A disputable amortisation procedure led to the value of some agricultural investments being divided by 10. A Stata coding error added units of livestock assets that do not exist. Prices before, during and after harvest were confused in several places when appraising agricultural sales and consumption. Control variables referring to household composition are altered in some observations: no members, several heads, missing ages, etc. These errors affect a limited number of observations, or affect observations with a limited magnitude.
They have only a marginal effect on the estimated results, so we present them in Appendix 3.

## 5.3 Results with partial corrections

```{r regress_correct, message=F, warning=F, echo=F}
# Regress with corrected errors -----------------------------------------------

# Format reproduced results as in Crépon et al.
reg_selfact_asis <- reg_selfact_asis %>%
  tibble() %>%
  mutate(`Outcome` = names(reg_selfact_asis))
colnames(reg_selfact_asis) <- c("Result as in Crépon et al.",
                                "Outcome variables")

consolidated2_bl <- bl %>%
  prepare(cr_active_only = FALSE,
          include_cr_oamc = TRUE,
          exp_agriinv_ok = TRUE,
          lsk_ast_ok = TRUE,
          exclude_var_bsale = "",
          exclude_item_bsale = "", # only 4/6 were taken into account at EL
          cereal_sales_ok = TRUE,  # sales before harvest appraised at after prices
          cereal_sav_ok = TRUE,    # savings are appraised at price before harvest
          tree_price_ok = TRUE,    # sales during and after harvest at before price
          veg_price_ok = TRUE,     # all "others" are appraised at the same price
          mb_resid_ok = TRUE       # Household without residing members are 0 instead of NA
          ) %>%
  flag_trimobs(trim_vars = trim_vars_el)

consolidated2_el <- el %>%
  prepare(cr_active_only = FALSE,
          include_cr_oamc = TRUE,
          exp_agriinv_ok = TRUE,
          lsk_ast_ok = TRUE,
          exclude_var_bsale = "",
          exclude_item_bsale = "", # only 4/6 were taken into account at EL
          cereal_sales_ok = TRUE,  # sales before harvest appraised at after prices
          cereal_sav_ok = TRUE,    # savings are appraised at price before harvest
          tree_price_ok = TRUE,    # sales during and after harvest at before price
          veg_price_ok = TRUE,     # all "others" are appraised at the same price
          mb_resid_ok = TRUE       # Household without residing members are 0 instead of NA
          ) %>%
  flag_trimobs(trim_vars = trim_vars_el)

# merge baseline and endline
consolidated2 <- consolidated2_bl %>%
  full_join(consolidated2_el, by = "ident", suffix = c("_bl", "_el")) %>%
  set_missing_controlvars()

# Correcting measurement errors with trim 0.5
sample_m_05 <- consolidated2 %>%
  subsample() %>%
  filter(group == "Panel" | group == "Added") %>%
  filter(trimobs_el != 1 & samplemodel == 1)

# corrected measurement and trim at 0.5
reg_sa_m_05 <- reg(sample_m_05, dep_vars = selfact, fullreg = FALSE,
                   var_out = "treatment_el")
reg_sa_m_05 <- reg_sa_m_05 %>%
  tibble() %>%
  mutate(`Outcome` = names(reg_sa_m_05))
colnames(reg_sa_m_05) <- c("Some measurement error corrections and trim at 0.5\\%",
                           "Outcome variables")
```

```{r comp_correct, message=F, warning=F, echo=F}
# Flag, variable by variable, the observations whose value differs between
# the original sample and the corrected sample
comp <- sample %>%
  left_join(sample_m_05, by = "ident") %>%
  mutate(
    d_assets_total_el = ifelse((assets_total_el.x - assets_total_el.y) == 0, 0, 1),
    d_output_total_el = ifelse((output_total_el.x - output_total_el.y) == 0, 0, 1),
    d_expense_total_el = ifelse((expense_total_el.x - expense_total_el.y) == 0, 0, 1),
    d_inv_total_el = ifelse((inv_total_el.x - inv_total_el.y) == 0, 0, 1),
    d_profit_total_el = ifelse((profit_total_el.x - profit_total_el.y) == 0, 0, 1),
    d_consumption_el = ifelse((consumption_el.x - consumption_el.y) == 0, 0, 1),
    d_members_resid_bl = ifelse((members_resid_bl.x - members_resid_bl.y) == 0, 0, 1),
    # compares the number of adults (previously compared members_resid_bl twice)
    d_nadults_resid_bl = ifelse((nadults_resid_bl.x - nadults_resid_bl.y) == 0, 0, 1),
    d_head_age_bl = ifelse((head_age_bl.x - head_age_bl.y) == 0, 0, 1),
    d_act_livestock_bl = ifelse((act_livestock_bl.x - act_livestock_bl.y) == 0, 0, 1),
    d_act_business_bl = ifelse((act_business_bl.x - act_business_bl.y) == 0, 0, 1),
    d_borrowed_total_bl = ifelse((borrowed_total_bl.x - borrowed_total_bl.y) == 0, 0, 1)) %>%
  select(ident, starts_with("d_")) %>%
  mutate(nb_dif = rowSums(.[2:13], na.rm = TRUE),
         dif = ifelse(nb_dif > 0, 1, 0))

tot_dif <- sum(comp$dif)
tot_sample <- nrow(sample)
dif_per <- round((tot_dif / tot_sample * 100), 2)
```

We now recompute the regressions presented by CDDP (Table 3) after correcting the coding and measurement errors that can be rectified.
We correct the measurement errors that can be corrected: account of borrowing at baseline including credit from other MFIs (see Section 5.1.2); borrowing at baseline factoring in all outstanding loans in the past 12 months, instead of just outstanding loans (see Section 5.1.3); appraisal of agricultural assets at baseline including tractors and reapers (see Section 5.1.4); livestock assets excluding non-existent units (see Appendix 3.2); business earnings including all business sales (see Appendix 3.3); prices before, during and after harvest suitably assigned to corresponding sales or consumption (see Section 5.2.4); and investment in agricultural assets not amortised by an arbitrary procedure (see Appendix 4.1). All in all, these corrections affect `r c_th(tot_dif)` of the `r c_th(tot_sample)` observations (`r dif_per`%) used by CDDP (Table 3) for their ATE estimation on self-employment activities. ```{r regress_correct_tb2, message=F, warning=F, echo=F} # LaTeX output reg_sa_m_05_tb <- reg_sa_m_05 %>% select("Outcome variables", "Some measurement error corrections and trim at 0.5\\%") %>% column_to_rownames(var = "Outcome variables") %>% t() %>% as.data.frame() %>% as_tibble() %>% rownames_to_column(var = "Source") %>% filter(`Source` != "Outcome variables") colnames(tb_asis) <- c("Source", colnames(tb_asis)[-1]) reg_sa_m_05_tb <- tb_asis %>% bind_rows(reg_sa_m_05_tb) %>% mutate(`Source` = c("For memory: initial CDDP results", "Some error corrections and trim at 0.5\\%")) reg_sa_m_05_tb %>% mutate_all(as.character) %>% mutate_all(linebreak) %>% kable(format = "latex", booktabs = T, escape = F, # problem with that: no footnote possible valign = "bottom", col.names = linebreak(c(" ","Assets", "Sales and home\nconsumption", "Expenses", "Of which:\nInvestment","Profit")), caption = "Replicated impact estimates correcting some coding and measurement errors") %>% kable_styling (latex_options = "HOLD_position") %>% # column_spec (1, width = "6.5cm") %>% # column_spec 
(3, width = "2.2cm") %>% # column_spec (5, width = "1.5cm") %>% footnote(general = c("Source: Our replication of CDDP Table 3 with R, using the same data but correcting the coding and measurement errors listed in Section 5: omission of credits from other MFIs in total access to credit; omission of credits that matured before the survey in the variable; omission of agricultural assets in the total of assets owned by households; erratic prices used to appraise agricultural assets; livestock assets excluding non-existent units; business earnings omitted some business sales; confusions between prices before, during and after harvest to appraise agricultural sales and consumption; inconsistent amortisation rules for agricultural investments. Same specifications as CDDP Table 3: Sample includes 4,934 households classified as high probability-to-borrow and surveyed at endline, after trimming 0.5 percent of observations. Coefficients and standard errors (in parentheses) from an OLS regression of the variable on a treated village dummy, controlling for strata dummies (paired villages), number of household members, number of adults, head age, does animal husbandry, does other non-agricultural activities, had an outstanding loan over the past 12 months, HH spouse responded to the survey, and other HH member (excluding the HH head) responded to the survey and variables specified below. Standard errors are clustered at the village level.", "*** Significant at the 1 percent level", "** Significant at the 5 percent level", "* Significant at the 10 percent level"), general_title = "", threeparttable = T) ``` Table 13 shows that the standalone correction of some coding errors reduces and cancels out the magnitude and significance of the estimated impacts, as shown for instance the inclusion of credits from other MFIs and all credits outstanding in the 12 previous months. But the correction of other errors considerably reinforces the estimated impacts. 
Taken together, these rectifiable errors appear relatively well balanced between treatment and control groups and their correction does not, in itself, disqualify the conclusions of the first part of the published article. We notice at this stage that estimated impacts on assets and expenses are smaller and less significant, and that estimated impacts on output and profits are larger and more significant.

One should bear in mind that what we have here is only a partial correction, since measurement errors remain: there are still missing and absurd values (see Sections 5.1.4, 5.2.1 and Appendix 3); consumption of own production and in-kind savings are still valued at erratic median prices wherever there were not enough registered transactions to obtain reliable estimates (Section 5.2.3), etc. Besides, the measurement errors observed on credit variables do raise major concerns about the reliability of the externality tests and the local average treatment effects, which form the second part of the CDDP paper and are not reproduced here.

# 6. Sampling errors

CDDP describe their sampling procedure as follows. From a pilot survey including 1,300 households in seven pairs of villages, 24 variables were identified as “good predictors” of whether a household would borrow from AAA. A logit model was built to assess borrowing propensity based on these 24 variables. One village per pair was randomly selected from 81 pairs of similar villages to receive microcredit services from AAA. Prior to the opening of an AAA branch in the village, a short preparatory survey was administered to a sample of 100 households in each village, or to the entire village where the population was less than 100 households. The 24 variables previously mentioned were included in the short questionnaire, and they were used to compute a borrowing propensity score for each of the 15,145 households surveyed in this preparatory phase.
In each village, all the households surveyed during the short preparatory survey that fell in the top quartile of borrowing propensity were included in the sample. Five other households that were surveyed during the short preparatory survey but did not fall in the top quartile were also randomly selected. A total of 4,465 households were interviewed at baseline, of which 92% were successfully re-interviewed at endline. The propensity score to borrow was then re-estimated before the endline for all households interviewed during the short preparatory survey, based on the take-up observed by the AAA information system in the 81 treatment villages. According to this new score, 1,433 households that had not been selected to be interviewed at baseline were considered as having a very high propensity to borrow and were added to the endline sample. In the following section, we call the latter “households added at endline,” as opposed to “panel households”, which were interviewed at both baseline and endline, and “attrition households”, which were only interviewed at baseline.

## 6.1 Household differences between preparatory and baseline surveys (and endline for those added at endline)

We first seek to assess whether the information collected about the households at baseline is consistent with the information collected on those same households by the preparatory survey. We focus on household size, which should not have changed substantially in a short period. We flag the households whose number of members varied by more than 30% and by more than two people (to avoid false positives among small households) between the preparatory survey and the baseline survey. We also examine three variables used to compute the borrowing propensity score that determined household inclusion in the sample: the household owns land (‘*yes*’ or ‘*no*’), the household has olive or argan trees (‘*yes*’ or ‘*no*’), and one or more household members receive a pension (‘*yes*’ or ‘*no*’).
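The household-size flag described above can be sketched as a small function. This is an illustration of the rule as worded in the text, not the code used for the results; the function name `flag_size_change` and its arguments are hypothetical, and measuring the relative change against the preparatory survey count is an assumption.

```r
# Flag a household when its size changed by more than 30% AND by more
# than two people between the preparatory survey and the later survey.
# Assumption: the relative change is measured against the preparatory count.
flag_size_change <- function(n_prep, n_later,
                             rel_threshold = 0.3, abs_threshold = 2) {
  abs_dif <- abs(n_later - n_prep)
  rel_dif <- abs_dif / n_prep
  abs_dif > abs_threshold & rel_dif > rel_threshold
}

# 4 -> 7 members: +3 people and +75%, flagged
# 20 -> 24 members: +4 people but only +20%, not flagged
# 3 -> 5 members: +67% but only 2 people, not flagged
# 6 -> 6 members: no change, not flagged
flags <- flag_size_change(n_prep = c(4, 20, 3, 6), n_later = c(7, 24, 5, 6))
```

Requiring both conditions keeps small households, where one extra member is already a large relative change, from being flagged too easily.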
These three variables are chosen from the 24 included in the propensity score, because they were collected in an identical way in the preparatory and baseline survey questionnaires. Variations on the same households in a short period of time should therefore be limited. In addition to the households surveyed at baseline, we also run the same analysis for households added at endline. ```{r check_ms_bl_el, message=F, warning=F, echo=F} msid <- ms %>% select(ident, ms_hhmb = me_m1, ms_admb = me_m2, ms_pension = me_m5, ms_radio = me_m8s, ms_land = me_m9a, ms_tree = me_m9d) %>% mutate(ident = as.character(ident)) # A function to extract data from baseline or endline and join with # data from msi and compute discrepancies extract_chk_ms <- function(x, y, msid) { x %>% select(ident, land = c5, radio = c1_15, olive = e25_1, arganK = e25_5, arganL = e25_6, starts_with("h13_")) %>% mutate(arganK = ifelse(is.na(arganK), 0, arganK), arganL = ifelse(is.na(arganL), 0, arganL), olive = ifelse(is.na(olive), 0, olive), tree = ifelse(olive + arganK + arganL > 0, 1, 0), pension = rowSums(select(., starts_with("h13_")),na.rm = TRUE), pension = ifelse(pension > 0, 1, 0), land = recode(land, "2" = 0, "1" = 1)) %>% select(ident, land, radio, tree, pension) %>% left_join(select(y, ident, members_resid, nadults_resid), by = "ident") %>% left_join(msid, by = "ident") %>% mutate(dif_mb = abs(ms_hhmb - members_resid), `Significant difference in number of household members` = ifelse(dif_mb > 0.3 & dif_mb > 2, "Yes", "No"), dif_land = abs(ms_land - land), `Difference in land ownership` = ifelse(dif_land != 0, "Yes", "No"), dif_radio = abs(ms_radio - radio), `Difference in radio ownership` = ifelse(dif_radio != 0, "Yes", "No"), ms_tree2 = ifelse(ms_tree > 0, 1, 0), dif_tree = abs(ms_tree2 - tree), `Difference in olive or argan tree ownership` = ifelse(dif_tree != 0, "Yes", "No"), dif_pension = abs(ms_pension - pension), `Difference in pension reception` = ifelse(dif_pension != 0, "Yes", "No")) } 
summarise_chk_ms <- function(x) { x %>% group_by(`Significant difference in number of household members`, `Difference in land ownership`, `Difference in olive or argan tree ownership`, `Difference in pension reception`) %>% count() } # Use this function to compare preparatory survey with household selected # at baseline and households added at endlins match_bl_ms <- bl %>% extract_chk_ms(y = consolidated_bl, msid = msid) %>% summarise_chk_ms() %>% rename(`Selected at baseline` = n) match_el_ms <- el %>% filter(!(ident %in% bl$ident)) %>% extract_chk_ms(y = consolidated_el, msid = msid) %>% summarise_chk_ms() %>% rename(`Added at endline` = n) chk <- match_bl_ms %>% ungroup() %>% full_join(match_el_ms) nb_dif_bl <- sum(chk[chk[,1] == "Yes", 5], na.rm = TRUE) chk_tb <- chk %>% mutate(d = ifelse(`Difference in olive or argan tree ownership` == "Yes", "Yes", ifelse(`Difference in pension reception` == "Yes", "Yes", ifelse(`Difference in land ownership` == "Yes", "Yes", "No")))) %>% rename(`One or more of the 3 selected propensity criteria` = d) %>% group_by(`Significant difference in number of household members`, `One or more of the 3 selected propensity criteria`) %>% summarise(`Selected at baseline` = sum(`Selected at baseline`, na.rm = TRUE), `Added at endline` = sum(`Added at endline`, na.rm = TRUE)) sum_bl <- sum(chk_tb[,3]) sum_add <- sum(chk_tb[,4]) add_chk_tb <- tibble(" ", "Total", sum_bl, sum_add) colnames(add_chk_tb) <- colnames(chk_tb) chk_tb <- chk_tb %>% bind_rows(add_chk_tb) chk_tb <- chk_tb %>% mutate(`%` = round(`Selected at baseline`*100 / sum_bl, 1), `% ` = round(`Added at endline`*100 / sum_add, 1)) %>% ungroup() %>% select(c(1, 2, 3, 5, 4, 6)) colnames(chk_tb)[1] <- "Significant difference in number of household members[note]" colnames(chk_tb)[2] <- "One or more of the 3 selected propensity criteria[note]" chk_tb_print <- chk_tb %>% kable(caption = "Differences in household characteristics between preparatory survey and baseline", format = "latex", 
booktabs = T, longtable = T, format.args = list(big.mark = ",")) %>% kable_styling(full_width = T, latex_options = "HOLD_position") %>% column_spec (1, width = "4cm") %>% column_spec (2, width = "4cm") %>% column_spec (4, width = "1cm") %>% column_spec (6, width = "1cm") %>% footnote(general = c("Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys."), number = c("Number of members varied by more than 30% and by more than two people; ", "The household owns land, the household has olive or argan trees, and one or more household members receive a pension."), general_title = "", threeparttable = T) %>% add_footnote(c("Number of members varied by more than 30% and by more than two people; ", "The household owns land, the household owns olive or argan trees, and one or more household members receive a pension."), notation = "number") ``` We observe in Table 14 that in 985 cases (22.06%), the number of household members is compatible between the preparatory survey and baseline survey, but the selected propensity score criteria are inconsistent. In 431 additional cases (9.65%), the selected propensity score criteria are consistent between the preparatory and baseline surveys, but the number of household members changes significantly. In 104 other cases (2.33%), both the number of household members and the selected propensity score criteria are inconsistent. In total, we observe a mismatch on these key variables between the preparatory and baseline surveys for 1,520 households of the 4,465 households sampled at baseline (34.04%). For households added at endline, substantial changes in household composition can happen considering the time lapse (delay between preparatory and baseline survey plus two years), but not to such an extent. In total, a mismatch is observed on these key variables between the preparatory and endline surveys for 724 households of the 1,433 households added at endline (50.52%). 
We do not try to correct these observed inconsistencies, as doing so would imply removing a large number of observations from the sample, hampering its statistical power. However, we note here a major concern regarding the way households were selected for inclusion in the sample.

```{r, message=F, warning=F, echo=F}
chk_tb_print
```

## 6.2 Inconsistencies in household composition between baseline and endline

For panel households, the same households should have been interviewed at both baseline and endline. A consistent definition of household composition is also needed to make reliable comparisons, as household composition determines all living standards parameters such as income, consumption, poverty and food security [@deaton_analysis_1997: 204--268]. The literature on the informal economy in developing countries also establishes that household composition is the defining criterion to be able to assess all parameters relating to self-employment activities [@cling_informal_2014].

```{r avg_hh_size_bl_el, message=F, warning=F, echo=F}
mb_bl <- consolidated_bl %>%
  filter(members_resid > 0) %>%
  summarise(`Wave` = "Baseline",
            `Mean` = mean(members_resid))
mb_bl1 <- round(mb_bl$`Mean`, 2)

mb_el <- consolidated_el %>%
  filter(members_resid > 0) %>%
  summarise(`Wave` = "Endline",
            `Mean` = mean(members_resid))
mb_el1 <- round(mb_el$`Mean`, 2)

# bind_rows(mb_bl, mb_el) %>%
#   kable(digits = 2, caption = "Average number of residing household members")
```

At baseline and endline, the respondent was asked to list and describe the key characteristics of all household members. We use this information to analyse whether the composition of each household is consistent between the baseline and endline surveys. The average household size was `r mb_bl1` at baseline and `r mb_el1` at endline. This clearly points to a problem, as the number of members per household is not consistent between baseline and endline.
For comparison, national population censuses establish that rural household size in Morocco was 6.59 in 1994, 6.03 in 2004 and 5.35 in 2014 [@direction_de_la_statistique_recensement2_2005: 14; @direction_de_la_statistique_recensement_2015: 3].

We created an algorithm to compare household composition between baseline and endline. The algorithm checks for each household member at baseline whether there is a corresponding household member at endline of the same gender and of a compatible age. The endline survey was conducted two years after the baseline survey, so we consider for each household member that a compatible age at endline is the person’s age at baseline plus 1 to 3 years. To check the sensitivity of our matching analysis, we also broaden the range of compatible ages to a difference of 0 to 5 years between endline age and baseline age. The benefit of the doubt is given in the case of missing information, i.e. when age is not documented. We therefore consider the presence of a household member of the same gender, but with no registered age, as a possible match. All possible combinations between all members at baseline and all members at endline are checked and the configuration with the highest number of matches is retained for each household.
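The matching step for one gender group within one household can be sketched as follows. This is a simplified illustration under stated assumptions, not the code behind the results: the helper names `permutations`, `count_matches` and `best_match` are hypothetical, and the permutations are generated by a small base R helper rather than an external package.

```r
# All orderings of a vector (base R helper, only suitable for short vectors)
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

# Number of pairs whose endline age equals the baseline age plus
# dif_min to dif_max years; a missing age counts as a possible match
count_matches <- function(age_el, age_bl, dif_min = 1, dif_max = 3) {
  ok <- (age_el >= age_bl + dif_min & age_el <= age_bl + dif_max) |
    is.na(age_el) | is.na(age_bl)
  sum(ok, na.rm = TRUE)
}

# Highest number of matches over all pairings of same-gender members
best_match <- function(age_bl, age_el, dif_min = 1, dif_max = 3) {
  if (length(age_bl) == 0 || length(age_el) == 0) return(0L)
  k <- min(length(age_bl), length(age_el))
  if (length(age_el) >= length(age_bl)) {
    scores <- vapply(permutations(age_el), function(p) {
      count_matches(p[seq_len(k)], age_bl, dif_min, dif_max)
    }, integer(1))
  } else {
    scores <- vapply(permutations(age_bl), function(p) {
      count_matches(age_el, p[seq_len(k)], dif_min, dif_max)
    }, integer(1))
  }
  max(scores)
}

m_two   <- best_match(age_bl = c(30, 5), age_el = c(7, 32))    # both members match
m_na    <- best_match(age_bl = c(30, NA), age_el = c(50, 12))  # only the missing age matches
m_wide  <- best_match(age_bl = 30, age_el = 30, dif_min = 0, dif_max = 5)
m_tight <- best_match(age_bl = 30, age_el = 30)
```

The number of orderings grows factorially with group size, which is why a cap on the number of same-gender members is needed in practice.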
We then compute a score that classifies each household according to the proportion of matches in its composition between baseline and endline:

- Identical: all household members match between baseline and endline;
- Slightly different: one-tenth or less of household members do not match between baseline and endline;
- Different: one-tenth to one-quarter of household members do not match between baseline and endline;
- Very different: one-quarter to half of household members do not match between baseline and endline;
- Mostly inconsistent: more than half of the members do not match between baseline and endline;
- No match: none of the members at endline matches the members at baseline;
- Too many members/check manually: the algorithm checks all possible permutations of household same-gender members between endline and baseline. It therefore becomes computationally overwhelming if there are more than ten same-gender members at baseline and/or endline. This only occurs in 16 cases, which we discard from the analysis.
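The categories listed above can be sketched as a simple classification function. The function `classify_household` is hypothetical; the thresholds follow the wording of the list, and the treatment of values falling exactly on a boundary is an assumption.

```r
# Classify one household from its number of members and number of matches.
# Thresholds follow the wording of the list; boundary handling is an assumption.
classify_household <- function(n_members, n_matched) {
  if (n_matched == 0) return("No match")
  share_unmatched <- (n_members - n_matched) / n_members
  if (share_unmatched == 0) return("Identical")
  if (share_unmatched <= 0.10) return("Slightly different")
  if (share_unmatched <= 0.25) return("Different")
  if (share_unmatched <= 0.50) return("Very different")
  "Mostly inconsistent"
}

# Usage on a few hypothetical households
n_members <- c(5, 20, 5, 5, 5, 5)
n_matched <- c(5, 19, 4, 3, 1, 0)
status <- mapply(classify_household, n_members, n_matched)
```

The "No match" test comes first so that households with no matching member are not folded into "Mostly inconsistent".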
```{r algo_hh_chk, message=F, warning=F, eval=T, echo=F} # This algorithm takes a lot of time to run, so it runs only the first time # the next times, if the csv file exists, this chunk will not be evaluated run_match <- function (t = 10, # max number of hh mb from same sex to analyse dif_min = 1, # minimum age, passed to mactch_ind dif_max = 3) {# maximum age # A function to have the same number of elements, completing with NA eq <- function(x, y) { t <- length(x) - length(y) if (t < 0) { x <- c(x, rep(-99, times = -t)) } return(x) } # define logical test to check if age at endline # corresponds to baseline + range match_ind <- function(x, y, var_age_min = dif_min, var_age_max = dif_max) { # x = ag_m_el # for testing purposes # y = ag_m_bl # for testing purposes sum((x >= (y+var_age_min) & x <= (y+var_age_max)) | (is.na(x) | is.na(y)), na.rm = TRUE) } # Replace all missing codes by NAs bli$a7 <- ifelse(bli$a7 == -99 | bli$a7 == -98, NA, bli$a7) # apply this test for every household and stores variables hh_bl <- unique(bli$ident) # list all household numbers # fields to create: counts for matching and non matching by gender end hh flds <- c("ident", "nb_m_bl", "nb_m_el", "nb_m_ck_bl", "nb_ym_el", "nb_f_bl", "nb_f_el", "nb_f_ck_bl", "nb_yf_el", "hh_comp_bl", "hh_comp_el") # Create empty df with one column for every field hh_chk <- data.frame(matrix(0, ncol = length(flds), nrow = length(hh_bl))) # Assign field names to each column colnames(hh_chk) <- flds # t = 10 # set escape threshold # for(i in 1:60) { # for testing purposes for(i in 1:length(hh_bl)) { # selects successively each household # for each hh filter separately male and female at bl and el # i = 16 # for testing purpose # print(i) # to monitor progress ag_m_bl <- filter(bli, ident %in% hh_bl[i], a4 == 1)$a7 ag_m_el <- filter(eli, ident %in% hh_bl[i], a4 == 1)$a7 ag_f_bl <- filter(bli, ident %in% hh_bl[i], a4 == 2)$a7 ag_f_el <- filter(eli, ident %in% hh_bl[i], a4 == 2)$a7 # concatenates values 
    hh_chk$age_f_mb_bl[i] <- paste(ag_f_bl, collapse = ", ")
    hh_chk$age_f_mb_el[i] <- paste(ag_f_el, collapse = ", ")
    hh_chk$age_m_mb_bl[i] <- paste(ag_m_bl, collapse = ", ")
    hh_chk$age_m_mb_el[i] <- paste(ag_m_el, collapse = ", ")
    hh_chk$ident[i] <- hh_bl[i]
    # fields for male
    hh_chk$nb_m_bl[i] <- length(ag_m_bl)
    hh_chk$nb_m_el[i] <- length(ag_m_el)
    # check how many men in el have an age compatible with men in bl
    hh_chk$nb_m_ck_bl[i] <- if (length(ag_m_el) == 0 || length(ag_m_bl) == 0) {
      hh_chk$nb_m_ck_bl[i] <- 0
    } else if (length(ag_m_el) == 1 && length(ag_m_bl) == 1) {
      hh_chk$nb_m_ck_bl[i] <- match_ind(ag_m_el, ag_m_bl)
    } else if (length(ag_m_el) > t || length(ag_m_bl) > t) {
      hh_chk$nb_m_ck_bl[i] <- ifelse(length(ag_m_el) < length(ag_m_bl),
                                     length(ag_m_bl) - length(ag_m_el), 0)
    } else {
      ag_m_el <- eq(ag_m_el, ag_m_bl) # add empty element if less hhm el vs. bl
      ag_m_bl <- eq(ag_m_bl, ag_m_el) # vice versa
      hh_chk$nb_m_ck_bl[i] <- max(sapply(permn(ag_m_el), match_ind, ag_m_bl))
    }
    hh_chk$nb_ym_el[i] <- sum(ag_m_el <= dif_max, na.rm = TRUE)
    # fields for female
    hh_chk$nb_f_bl[i] <- length(ag_f_bl)
    hh_chk$nb_f_el[i] <- length(ag_f_el)
    # check how many women in el have an age compatible with women in bl
    # requires the same number of elements, so we equalize length
    hh_chk$nb_f_ck_bl[i] <- if (length(ag_f_el) == 0 || length(ag_f_bl) == 0) {
      hh_chk$nb_f_ck_bl[i] <- 0
    } else if (length(ag_f_el) == 1 && length(ag_f_bl) == 1) {
      hh_chk$nb_f_ck_bl[i] <- match_ind(ag_f_el, ag_f_bl)
    } else if (length(ag_f_el) > t || length(ag_f_bl) > t) {
      hh_chk$nb_f_ck_bl[i] <- ifelse(length(ag_f_el) < length(ag_f_bl),
                                     length(ag_f_bl) - length(ag_f_el), 0)
    } else {
      ag_f_el <- eq(ag_f_el, ag_f_bl) # add empty element if less hhm el vs. bl
      ag_f_bl <- eq(ag_f_bl, ag_f_el) # vice versa
      hh_chk$nb_f_ck_bl[i] <- max(sapply(permn(ag_f_el), match_ind, ag_f_bl))
    }
    hh_chk$nb_yf_el[i] <- sum(ag_f_el <= dif_max, na.rm = TRUE)
  }
  hh_chk$mb_bl <- hh_chk$nb_m_bl + hh_chk$nb_f_bl
  hh_chk$mb_ck <- hh_chk$nb_m_ck_bl + hh_chk$nb_f_ck_bl
  hh_chk$mb_el <- hh_chk$nb_m_el + hh_chk$nb_f_el
  hh_chk$ymb_el <- hh_chk$nb_ym_el + hh_chk$nb_yf_el
  hh_chk$mb_nochk <- (hh_chk$mb_bl - hh_chk$mb_ck) +
    ((hh_chk$mb_el - hh_chk$ymb_el) - hh_chk$mb_ck)
  hh_chk$chk_index <- ifelse(hh_chk$mb_el == 0, NA,
                             hh_chk$mb_nochk / hh_chk$mb_bl)
  hh_chk$status <- ifelse(hh_chk$mb_ck == 0, "No match",
                     ifelse(hh_chk$chk_index >= 1, "Mostly inconsistent",
                       ifelse(hh_chk$chk_index >= 0.5, "Very different",
                         ifelse(hh_chk$chk_index >= 0.2, "Different",
                           ifelse(hh_chk$chk_index > 0, "Slightly different",
                                  "Identical")))))
  hh_chk$status <- ifelse(hh_chk$nb_m_bl > t | hh_chk$nb_m_el > t |
                            hh_chk$nb_f_bl > t | hh_chk$nb_f_el > t,
                          "Too many members: check manually", hh_chk$status)
  # The following created an error, reclassifying no matches into dropouts
  # table(hh_chk$status)
  # write.csv2(hh_chk, "hh_chk_test.csv") # saving a copy
  return(hh_chk)
}
bli <- extract_individuals(filter(bl, ident %in% el$ident))
eli <- extract_individuals(filter(el, ident %in% bl$ident))
if (!file.exists("hh_chk13.Rdata")) {
  hh_chk13 <- run_match(t = 10, # max number of hh mb from same sex to analyse
                        dif_min = 1, # minimum age difference, passed to match_ind
                        dif_max = 3) # maximum age difference
  save(hh_chk13, file = "hh_chk13.Rdata")
  write.csv(hh_chk13, file = "hh_chk.csv")
}
if (!file.exists("hh_chk05.Rdata")) {
  hh_chk05 <- run_match(t = 10, # max number of hh mb from same sex to analyse
                        dif_min = 0, # minimum age difference, passed to match_ind
                        dif_max = 5) # maximum age difference
  save(hh_chk05, file = "hh_chk05.Rdata")
}
```

```{r smry_hh_chk, message=F, warning=F, echo=F}
load("hh_chk13.Rdata")
# hh_chk <- read_delim("hh_chk - copie.csv", ";", escape_double = FALSE, trim_ws = TRUE)
# specify order to present results
ord <-
  c("Identical", "Slightly different", "Different", "Very different",
    "Mostly inconsistent", "No match", "Too many members: check manually")
hh_chk13$status <- factor(hh_chk13$status, levels = ord)
chk <- hh_chk13$status %>%
  table() %>%
  data.frame()
# kable(chk, caption = "Number of household according to matching level between baseline and endline")
load("hh_chk05.Rdata")
# Select good panels
keep_hh <- hh_chk13 %>%
  select(ident, status) %>%
  mutate(keep_hh = recode(status,
                          "Identical" = 1,
                          "Slightly different" = 1,
                          "Different" = 1,
                          "Very different" = 1,
                          "Too many members: check manually" = 1,
                          .default = 0))
n_1to3 <- hh_chk13 %>%
  group_by(status) %>%
  count()
n_0to5 <- hh_chk05 %>%
  group_by(status) %>%
  count()
n_match <- n_1to3 %>%
  left_join(n_0to5, by = "status") %>%
  ungroup() %>%
  mutate(status = recode_factor(status,
                                "Identical" = "Identical",
                                "Slightly different" = "Slightly different",
                                "Different" = "Different",
                                "Very different" = "Very different",
                                "Mostly inconsistent" = "Mostly incompatible",
                                "No match" = "No match",
                                "Too many members: check manually" = "Too many members: check manually")) %>%
  select(`Status` = status,
         `Threshold 1 to 3 years` = n.x,
         `Threshold 0 to 5 years` = n.y) %>%
  arrange(`Status`)
tot_match1 <- sum(n_match[,2])
tot_match2 <- sum(n_match[,3])
add_n_match <- tibble("Total", tot_match1, tot_match2)
colnames(add_n_match) <- colnames(n_match)
n_match <- n_match %>%
  bind_rows(add_n_match) %>%
  mutate(`%` = round(`Threshold 1 to 3 years`*100 / tot_match1, 1),
         `% ` = round(`Threshold 0 to 5 years`*100 / tot_match2, 1)) %>%
  select(1,2,4,3,5)
kable(n_match, format = "latex", booktabs = T, linesep = '',
      caption = "Number of households according to the proportion of members whose gender and age match between baseline and endline",
      format.args = list(big.mark = ",")) %>%
  kable_styling(latex_options = "HOLD_position") %>%
  column_spec (1, width = "4cm") %>%
  column_spec (2, width = "4cm") %>%
  column_spec (4, width = "1cm") %>%
  footnote(general = "Source:
Our analysis using CDDP microdata retrieved from baseline and endline surveys.",
           general_title = "")
```

Table 15 shows that the composition of 834 households (655 + 179, i.e. 20.25% of panel households) is entirely or mostly incompatible between baseline and endline. As illustrated in Appendix 4, with discrepancies of this magnitude it is not plausible that the same households were re-interviewed. The full list of mismatched households is given in the online appendix to this paper in a `.csv` file. In these cases, it seems plausible that the interviewer failed to reach at endline the household that had been interviewed at baseline and interviewed another household instead.

Removing the observations corresponding to these mismatched households translates into slightly different estimates. But this removal has to be combined with the inclusion of the observations initially discarded by CDDP as "low borrowing propensity households", as explained in Section 6.3. We present the effect of the overall resampling in Section 6.4.

## 6.3 Contradictions in propensity scores used as sampling criteria

The cornerstone of this RCT protocol and the corresponding article’s identification strategy is the household propensity to borrow, which was evaluated by scores. Attentive readers of the article will understand that two scores were used to assess the household borrowing propensity. Examination of the do-files reveals that there were actually four scores:

- **Score 1**: as we explained at the beginning of Section 6, the households sampled at baseline in each village were selected based on a score predicting their propensity to borrow (Score 1). This score was calculated before the baseline using variables collected on all surveyed households by the preparatory survey. In each village, the top quartile of households was classified as a “*high borrowing propensity*” group and sampled.
Five households randomly selected from the rest of the village were also included in the sample and classified as a “*low borrowing propensity*” group;
- **Scores 2 and 3**: at the beginning of the endline survey, given the low take-up observed since the beginning of the RCT, CDDP calculated a second score (Score 2) supposed to be more accurate than the previous one. Matching the preparatory survey with current AAA administrative data, the new score was computed to better identify potential borrowers that were not sampled at baseline in order to include them in the endline survey. They then calculated a third score (Score 3) supposed to be even more accurate (based on the same procedure, but using an updated version of the AAA client register) to select the households for the last phases of the endline survey. Households added based on both scores were classified as a "*very high borrowing propensity*" group.
- **Final score**: CDDP computed a last propensity score, based on the ex-post information contained in the AAA client register. Section 5.1.1 already points out that these administrative data were substantially inconsistent with the information collected by the survey.

All average treatment effects estimated by CDDP (Tables 2 to 7) were calculated for the “high” and “very high” propensity to borrow subsamples and presented as the treatment-on-the-treated (TOT) impact. The analysis of the entire sample (“low”, “high” and “very high” propensity groups) is presented as the intention-to-treat (ITT) impact. The final score is the variable used by CDDP (Table 8-panel C) to segment the sample. The values in this Table 8-panel C are the main argument used to justify the instrumental variable regression (using treatment/control classification as an instrument) conducted by CDDP (Table 9).
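
The Score 1 sampling rule described in the list above (top quartile of each village plus five households drawn at random from the rest) can be sketched as follows. This is a toy illustration under stated assumptions: the data are simulated, `sample_village` is a hypothetical helper, and the exact tie-breaking and quantile conventions used by CDDP are not known to us.

```r
set.seed(1)
# Simulated toy data (NOT CDDP data): 3 villages with 40 households each
hh <- data.frame(village = rep(1:3, each = 40),
                 score1 = rnorm(120))

# Hypothetical helper applying the rule within one village
sample_village <- function(d, n_low = 5) {
  cutoff <- quantile(d$score1, 0.75)        # top-quartile threshold
  high <- d[d$score1 > cutoff, ]            # "high borrowing propensity"
  rest <- d[d$score1 <= cutoff, ]
  low <- rest[sample(nrow(rest), min(n_low, nrow(rest))), ]  # random draw
  high$group <- "High"
  low$group <- "Low"
  rbind(high, low)
}

sampled <- do.call(rbind, lapply(split(hh, hh$village), sample_village))
table(sampled$village, sampled$group)
```

Under this rule, the "low" group is by construction a random draw from the bottom three quartiles of Score 1, so any comparison between groups depends entirely on how well Score 1 reflects actual borrowing propensity.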
### 6.3.1 Scores contradict one another

We analyse whether households classified in different borrowing propensity groups do indeed have consistent scores across the successive estimations made by CDDP. We do so by charting the distribution of observations for each score, separating out “low propensity”, “high propensity” and “very high propensity” observations each time (Figure 2).

```{r compare_scores, message=F, warning=F, echo=F}
# Distribution plots for scores 1 to 3
violin <- consolidated %>%
  select(ident, score1, score2, score3, newhh, random5, random5_end,
         borrowed_alamana_el, treatment_el) %>%
  mutate(`Borrowing propensity group` =
           ifelse(newhh == 1,
                  ifelse(is.na(random5_end) | random5_end == 0, "Very high", "Low"),
                  ifelse(is.na(random5) | random5 == 0, "High", "Low")),
         `Borrowing propensity group` =
           recode_factor(`Borrowing propensity group`,
                         "Low" = "Low",
                         "High" = "High",
                         "Very high" = "Very high",
                         .missing = "Attritors"))
# Reshaping to long
violin_long <- violin %>%
  gather(variable, `Score`, -ident, -`Borrowing propensity group`, -newhh,
         -random5, -random5_end, -borrowed_alamana_el, -treatment_el) %>%
  mutate(variable = recode(variable,
                           "score1" = "Score 1\n(baseline sampling)",
                           "score2" = "Score 2\n(endline sampling)",
                           "score3" = "Score 3\n(endline sampling)"))
# Plotting
plot_1to3 <- ggplot(violin_long, aes(x=`Borrowing propensity group`, y=`Score`)) +
  geom_boxplot() +
  ylim(-15, 5) +
  theme(axis.text.x = element_text(angle=45, hjust = 1),
        axis.title.x = element_blank()) +
  facet_wrap(~ variable, nrow = 1) +
  ylab("Score")
```

```{r compute_phat, message=F, warning=F, echo=F}
# Complete info: el when el, otherwise bl
consolidated <- consolidated %>%
  mutate(m1 = ifelse(!is.na(m1_el), m1_el, m1_bl),
         m2 = ifelse(!is.na(m2_el), m2_el, m2_bl),
         m3 = ifelse(!is.na(m3_el), m3_el, m3_bl),
         m4 = ifelse(!is.na(m4_el), m4_el, m4_bl),
         m5 = ifelse(!is.na(m5_el), m5_el, m5_bl),
         m6 = ifelse(!is.na(m6_el), m6_el, m6_bl),
         m7c = ifelse(!is.na(m7c_el), m7c_el,
                      m7c_bl),
         m8n = ifelse(!is.na(m8n_el), m8n_el, m8n_bl),
         m8s = ifelse(!is.na(m8s_el), m8s_el, m8s_bl),
         m9a = ifelse(!is.na(m9a_el), m9a_el, m9a_bl),
         m9b = ifelse(!is.na(m9b_el), m9b_el, m9b_bl),
         m9c = ifelse(!is.na(m9c_el), m9c_el, m9c_bl),
         m9d = ifelse(!is.na(m9d_el), m9d_el, m9d_bl),
         m9e = ifelse(!is.na(m9e_el), m9e_el, m9e_bl),
         m9fb = ifelse(!is.na(m9fb_el), m9fb_el, m9fb_bl),
         m9fc = ifelse(!is.na(m9fc_el), m9fc_el, m9fc_bl),
         m9g = ifelse(!is.na(m9g_el), m9g_el, m9g_bl),
         m10a = ifelse(!is.na(m10a_el), m10a_el, m10a_bl),
         m10b = ifelse(!is.na(m10b_el), m10b_el, m10b_bl),
         m11 = ifelse(!is.na(m11_el), m11_el, m11_bl),
         m12 = ifelse(!is.na(m12_el), m12_el, m12_bl),
         m13 = ifelse(!is.na(m13_el), m13_el, m13_bl),
         m14 = ifelse(!is.na(m14_el), m14_el, m14_bl),
         treatment = ifelse(!is.na(treatment_el), treatment_el, treatment_bl),
         paire = ifelse(!is.na(paire_el), paire_el, paire_bl))
# The Stata cut function used by Crépon et al. seems bugged. In any case,
# it is not clearly documented
stata_cut5 <- function(x, y, ...) {
  qv <- quo(...)
  z <- x %>%
    select(ident, ...) %>%
    arrange(!is.na(!! qv), !! qv) %>%
    mutate(i = as.numeric(rownames(x)),
           cut_o = ifelse(i <= y[1], 0,
                     ifelse(i <= (y[1]+y[2]), 1,
                       ifelse(i <= (y[1]+y[2]+y[3]), 2,
                         ifelse(i <= (y[1]+y[2]+y[3]+y[4]), 3, 4))))) %>%
    select(ident, cut_o)
  x <- x %>%
    select(ident) %>%
    left_join(z, by = "ident")
  return(x$cut_o)
}
# Set the number of observations by groups as produced by Stata in
# Crépon et al.
g_m6 <- c(954, 1311, 1259, 989, 1385)
g_m9d <- c(0, 0, 3242, 1104, 1552)
g_m9g <- c(0, 0, 0, 0, 5898)
g_m10a <- c(0, 0, 2968, 857, 2073)
g_m10b <- c(0, 0, 3103, 1271, 1524)
g_m13 <- c(0, 1631, 595, 1606, 2066)
# apply the cuts identical to Crépon et al.
consolidated <- consolidated %>%
  mutate(m6_c = stata_cut5(., g_m6, m6),
         m9d_c = stata_cut5(., g_m9d, m9d),
         m9g_c = stata_cut5(., g_m9g, m9g),
         m10a_c = stata_cut5(., g_m10a, m10a),
         m10b_c = stata_cut5(., g_m10b, m10b),
         m13_c = stata_cut5(., g_m13, m13))
# Regress and predict score as in Crépon et al.
m_phat <- glm(client ~ m1 + m2 + m3 + m4 + m5 + m6_c + m7c + m8n + m8s + m9a +
                m9b + m9c + m9d_c + m9e + m9fb + m9fc + m9g_c + m10a_c +
                m10b_c + m11 + m12 + m13_c + m14 + factor(paire),
              family = "binomial",
              data = filter(consolidated, treatment == 1, trimobs_el != 1))
consolidated$phat <- predict.glm(m_phat, consolidated, type = "response")
scores <- consolidated %>%
  select(ident, score1, score2, score3, newhh, random5_end, phat)
```

```{r phat, message=F, warning=F, echo=F}
# Prepare data for the final score distribution plot
violin2 <- scores %>%
  mutate(ident = as.character(ident)) %>%
  select(ident, score1, score2, score3, newhh, random5_end, phat) %>%
  left_join(select(bl, ident, random5), by = "ident") %>%
  mutate(`Borrowing propensity group` =
           ifelse(newhh == 1,
                  ifelse(is.na(random5_end) | random5_end == 0, "Very high", "Low"),
                  ifelse(is.na(random5) | random5 == 0, "High", "Low")),
         `Borrowing propensity group` =
           recode_factor(`Borrowing propensity group`,
                         "Low" = "Low",
                         "High" = "High",
                         "Very high" = "Very high"),
         `Score` = phat,
         variable = "Final score\n(calculated after endline)")
# Plotting
plot_final <- ggplot(violin2, aes(x=`Borrowing propensity group`, y=`Score`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, hjust = 1),
        axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  facet_wrap(~ variable)
test <- grid.arrange(plot_1to3, plot_final,
                     layout_matrix = matrix(c(1, 1, 1, 1, 2, 2), ncol = 6),
                     top = "Figure 2: Contradiction between borrowing propensity scores",
                     bottom = "Borrowing propensity group")
```

Source: Our analysis using CDDP microdata retrieved from endline survey.
If the scores were reliable, Figure 2 would present a difference between the distributions: the “high propensity” group would be well above the “low propensity” group, not only for score 1 (on which the classification was based), but also for scores 2 and 3 and the final score. This is not the case. For instance, the “low propensity” group and “high propensity” group have very similar score 2 and 3 distributions. Moreover, the "very high propensity" group displays a score 1 distribution that is similar to the "low propensity" group.

The low association between scores is puzzling, as they are supposed to reflect, at least in part, the same phenomenon. It can be understood by examining the weights that were attributed to each variable to compute scores 1, 2 and 3 and the final score, as presented in Appendix 5.

```{r scores, message=F, warning=F, echo=F}
score_phat <- scores %>%
  select(ident, phat) %>%
  mutate(ident = as.character(ident))
violin <- violin %>%
  left_join(score_phat, by = "ident")
el <- el %>%
  left_join(score_phat, by = "ident")
# We regress each score on its propensity inputs
s1 <- lm(score1 ~ d24d + com_cCC + pact2 + pterre + e4part + lachat_agriD +
           enew2 + lne241C + ntach_bov + retr + a_rad + a_tapfCC + d_telC +
           d_habC + credf + lnremb + groupCC + prend_crCC,
         data = el)
s1 <- s1 %>%
  tidy() %>%
  mutate(signif = sapply(p.value, function(x) make_stars3(x)),
         est_s = paste(round(estimate, 3), signif, sep = "")) %>%
  select(`Variable` = term, `Score 1` = est_s)
s2 <- lm(score2 ~ com_cCC + pterre + e4part + lachat_agriD + enew2 +
           ntach_bov + lachat_agriD + retr + a_rad + a_tapfCC + d_telC +
           d_habC + credf + lnremb + groupCC + prend_crCC + m1 + m2 + m3 +
           m4 + m6 + m9b + m9c + m9d + m9fb + m9fc + m10b + m13,
         data = mutate(el, pterre = m9a, lachat_agriD = m9e * -1))
s2 <- s2 %>%
  tidy() %>%
  mutate(signif = sapply(p.value, function(x) make_stars3(x)),
         est_s = paste(round(estimate, 3), signif, sep = "")) %>%
  select(`Variable` = term, `Score 2` = est_s)
s3 <- lm(score3 ~ com_cCC + ntach_bov + retr + a_rad + a_tapfCC + credf +
           groupCC + prend_crCC + pterre + lachat_agriD + e4part + m1 + m2 +
           m3 + m4 + m6 + m9b + m9d + m9fb + m9fc + m10a + m10b + m13 +
           factor(demi_paire),
         data = mutate(el, pterre = m9a, lachat_agriD = m9e * -1, e4part = m9c))
s3 <- s3 %>%
  tidy() %>%
  mutate(signif = sapply(p.value, function(x) make_stars3(x)),
         est_s = paste(round(estimate, 3), signif, sep = "")) %>%
  select(`Variable` = term, `Score 3` = est_s)
s4 <- lm(phat ~ com_cCC + pterre + e4part + lachat_agriD + ntach_bov + retr +
           a_rad + a_tapfCC + d_telC + d_habC + credf + groupCC + prend_crCC +
           m4 + m3 + m13 + m9d + m9fb + m9fc + m1 + m2 + m6 + m9b +
           factor(demi_paire),
         data = mutate(el, pterre = m9a, com_cCC = m7c, e4part = m9c,
                       lachat_agriD = m9e * -1, ntach_bov = m9g, retr = m5,
                       a_rad = m8s, a_tapfCC = m8n, d_telC = m10a,
                       d_habC = m10b, credf = m11, groupCC = m12,
                       prend_crCC = m14))
s4 <- s4 %>%
  tidy() %>%
  mutate(signif = sapply(p.value, function(x) make_stars3(x)),
         est_s = paste(round(estimate, 3), signif, sep = "")) %>%
  select(`Variable` = term, `Final score` = est_s)
# Extract variable names from database
vars_score <- c("com_cCC", "pterre", "e4part", "lachat_agriD", "ntach_bov",
                "retr", "a_rad", "a_tapfCC", "d_telC", "d_habC", "credf",
                "groupCC", "prend_crCC", "pact2", "m4", "d24d", "m3",
                "lnremb", "m13", "lne241C", "m9d", "enew2", "m9fb", "m9fc",
                "m1", "m2", "m6", "m9b")
proplabels <- el %>%
  select(one_of(vars_score)) %>%
  map_chr(~attributes(.)$label) %>%
  str_remove("^M[0-9]+([a-z]+)?\\.")
score_coefs <- tibble(`Variable` = vars_score, label = proplabels)
tokeep <- "\\.[0-9]+\\*\\*\\*\\*"
# Then aggregate and display
scores_coefs <- score_coefs %>%
  left_join(s1, by = "Variable") %>%
  left_join(filter(s2, grepl(tokeep, `Score 2`)), by = "Variable") %>%
  left_join(filter(s3, grepl(tokeep, `Score 3`)), by = "Variable") %>%
  left_join(filter(s4, grepl(tokeep, `Final score`)), by = "Variable") %>%
  filter(!(str_detect(`Variable`, "factor")))
scores_coefs <- scores_coefs %>%
  mutate(`Variable` = label) %>%
  mutate(`Variable` = str_replace(`Variable`, "ms ", "")) %>%
  mutate_all(funs(str_replace(., "\\*\\*\\*\\*", ""))) %>%
  mutate_all(funs(ifelse(is.na(.), " ", .))) %>%
  select(-label)
scores_coef_print <- scores_coefs %>%
  kable(caption = "Contradictory coefficients attributed to propensity determinants for each score",
        format = "latex", booktabs = T, longtable = T) %>%
  kable_styling(full_width = T, latex_options = "HOLD_position") %>%
  column_spec (1, width = "8cm") %>%
  # column_spec (7, width = "2cm") %>%
  # column_spec (5, width = "2.3cm") %>%
  # column_spec (6, width = "2.3cm") %>%
  footnote(general = c("Source: Our analysis using CDDP microdata retrieved from endline survey.",
                       "The coefficients reported for score 1, score 2 and score 3 are the result of the three corresponding linear regressions on the CDDP entire endline dataset. The scores computed by CDDP were included in the dataset and we use each one of these variables as the dependent variable for the three corresponding regressions. We use the variables indicated in the first column as independent variables and obtain p-values equal to 0 for each variable and an R-squared equal to 1 for each regression. These coefficients therefore correspond exactly to the factors that were applied by CDDP to the corresponding variables to compute the borrowing propensity scores used to sample households at baseline (score 1) and add new households at endline (score 2 and score 3). The final score was not included in the CDDP dataset. We computed it with a logit regression using the same specification as CDDP in their code (AN: 1537-57). The coefficients reported above are the result of a linear regression using the final score as dependent variable and the variables indicated in the first column as independent variables and controlling for strata dummies (paired villages).
The coefficients reported for the final score have a p-value < 0.001 % but not equal to 0 and the R-squared is 0.90."),
           general_title = "", threeparttable = T)
```

Table 22 shows that the coefficients attributed to each scoring variable change drastically from one score to the next, denoting a lack of robustness in the estimation. Some coefficients that are significant for one score are not significantly different from 0 for another. Moreover, some coefficients change sign, from positive to negative and vice versa. For instance, owning land was attributed a negative factor for propensity scores 1 and 3, but a positive factor for score 2 and the final score. Having a fibre mat corresponded to a positive coefficient for scores 1 and 2, but a negative one for score 3 and the final score. Doing more than three self-employment activities was associated with a significant positive coefficient for score 1 and a negative one for score 2, and was not retained as a scoring variable for score 3 and the final score. We observe such contradictions for most of the variables used to compute the borrowing propensities, suggesting that these scores suffer from a major lack of robustness.

### 6.3.2 Borrowing propensity scores fail to predict borrowing

To be considered as a propensity score for an event, a variable must predict the occurrence of such an event. We analyse whether the borrowing propensity scores are able to predict borrowing. Figure 3, which presents score distribution based on the borrowing status of households in treatment villages, suggests that their power to predict differences in access to credit is very limited.
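
One way to make "predicts the occurrence" concrete is the share of borrower/non-borrower pairs in which the borrower has the higher score, i.e. the area under the ROC curve, which is a rescaling of the Mann-Whitney/Wilcoxon statistic. The sketch below is illustrative only and is not part of the results reported here: `auc_score` is a hypothetical helper and the vectors are toy values, not CDDP data.

```r
# Hypothetical helper: share of (borrower, non-borrower) pairs in which the
# borrower has the higher score; ties count for one half.
# 0.5 means no predictive power, 1 means perfect separation.
auc_score <- function(score, borrowed) {
  keep <- !is.na(score) & !is.na(borrowed)
  score <- score[keep]
  borrowed <- borrowed[keep]
  n1 <- sum(borrowed == 1)
  n0 <- sum(borrowed == 0)
  r <- rank(score)  # average ranks, so ties are shared equally
  (sum(r[borrowed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy values only
auc_score(c(1, 2, 3, 4), c(0, 0, 1, 1))  # borrowers always score higher
auc_score(c(1, 2, 3, 4), c(0, 1, 0, 1))  # partial separation
auc_score(c(5, 5, 5, 5), c(0, 1, 0, 1))  # constant score: uninformative
```

Applied to the data used in this section, the inputs would be a score column and the borrowing indicator restricted to treatment villages (for instance `score1` and `borrowed_alamana_el` in the `violin` data frame), which is an assumption about how one might reuse these objects rather than a step of the original analysis.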
```{r psty_cred, warning=F, message=F, echo=F}
violin_long_t <- violin_long %>%
  filter(treatment_el == 1) %>%
  rename(`Borrowed from Al Amana` = borrowed_alamana_el) %>%
  mutate(`Borrowed from Al Amana` = recode_factor(`Borrowed from Al Amana`,
                                                  "0" = "No",
                                                  "1" = "Yes"))
# Plotting
plot2_1to3 <- ggplot(violin_long_t, aes(x=`Borrowed from Al Amana`, y=`Score`)) +
  geom_boxplot() +
  ylim(-15, 5) +
  theme(axis.title.x=element_blank()) +
  facet_wrap(~ variable, nrow = 1) +
  ylab("Score")
violin_t <- violin %>%
  filter(treatment_el == 1) %>%
  rename(`Borrowed from Al Amana` = borrowed_alamana_el) %>%
  mutate(`Borrowed from Al Amana` = recode_factor(`Borrowed from Al Amana`,
                                                  "0" = "No",
                                                  "1" = "Yes"),
         variable = "Final score\n(calculated after endline)")
# Plotting
plot2_final <- ggplot(violin_t, aes(x=`Borrowed from Al Amana`, y=phat)) +
  geom_boxplot() +
  theme(axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  facet_wrap(~ variable)
plot_final <- ggplot(violin2, aes(x=`Borrowing propensity group`, y=`Score`)) +
  geom_violin(draw_quantiles = 0.5) +
  theme(axis.text.x = element_text(angle=45, hjust = 1),
        axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  facet_wrap(~ variable)
# The x axis of both panels is the borrowing status, so label it accordingly
test <- grid.arrange(plot2_1to3, plot2_final,
                     layout_matrix = matrix(c(1, 1, 1, 1, 2, 2), ncol = 6),
                     top = "Figure 3: Borrowing propensity scores fail to predict who borrows in treatment villages",
                     bottom = "Borrowed from Al Amana")
```

Source: Our analysis using CDDP microdata retrieved from endline survey.
```{r psty_cred_test, message=F, warning=F, echo=F}
scores_treated_aa <- violin %>%
  filter(borrowed_alamana_el == 1)
scores_treated_nonaa <- violin %>%
  filter(borrowed_alamana_el == 0)
# Compute t-test
tts1 <- t.test(scores_treated_aa$score1, scores_treated_nonaa$score1)
tts2 <- t.test(scores_treated_aa$score2, scores_treated_nonaa$score2)
tts3 <- t.test(scores_treated_aa$score3, scores_treated_nonaa$score3)
tts4 <- t.test(scores_treated_aa$phat, scores_treated_nonaa$phat)
tt <- select(tidy(tts1), statistic, p.value) %>%
  rbind(select(tidy(tts2), statistic, p.value)) %>%
  rbind(select(tidy(tts3), statistic, p.value)) %>%
  rbind(select(tidy(tts4), statistic, p.value))
# Compute Wilcoxon
ws1 <- wilcox.test(scores_treated_aa$score1, scores_treated_nonaa$score1)
ws2 <- wilcox.test(scores_treated_aa$score2, scores_treated_nonaa$score2)
ws3 <- wilcox.test(scores_treated_aa$score3, scores_treated_nonaa$score3)
ws4 <- wilcox.test(scores_treated_aa$phat, scores_treated_nonaa$phat)
ws <- select(tidy(ws1), statistic, p.value) %>%
  rbind(select(tidy(ws2), statistic, p.value)) %>%
  rbind(select(tidy(ws3), statistic, p.value)) %>%
  rbind(select(tidy(ws4), statistic, p.value))
sc <- tt %>%
  cbind(ws)
colnames(sc) <- c("T test", "p value T test", "Wilcoxon", "p value Wilcox")
sc <- round(sc, 4)
# Row labels follow the order of the tests above: scores 1, 2, 3, final score
sc$`Score` <- c("Score 1", "Score 2", "Score 3", "Final score")
sc %>%
  select(`Score`, `T test`, `p value T test`, `Wilcoxon`, `p value Wilcox`) %>%
  kable(format = "latex", booktabs = T, linesep = '',
        caption = "Score 1 is not associated with borrowing",
        format.args = list(big.mark = ",")) %>%
  kable_styling(latex_options = "HOLD_position") %>%
  footnote(general = "Source: Our analysis using CDDP microdata retrieved from endline survey. Association between scores and reported borrowing in variable 'i3'.",
           general_title = "", threeparttable = TRUE)
```

For score 1, p-values above 0.05 mean that the null hypothesis cannot be rejected.
We therefore find no statistically significant association between score 1 and borrowing. We can reject the null hypothesis for the other scores, i.e. there is some association between the score and borrowing.

## 6.4 Results with a consistent panel sample and correcting some coding and measurement errors

To tackle the sampling issues listed above, we recompute the impact estimates with resampling. We include the households classified by CDDP as “low borrowing propensity”, because they were selected based on a score that does not reflect their borrowing propensity (see Section 6.3.2) and their actual borrowing propensity is not different from that of the households classified as “high borrowing propensity” (see Section 6.3.1). We also restrict the analysis to households with compatible baseline-endline compositions, which means discarding households classified as “mostly inconsistent” or “no match” (see Section 6.2). Here too, this only partially corrects the sampling errors. For instance, it does not discard households whose characteristics used as sampling criteria differed between the preparatory survey and the baseline. Rectified estimates, with resampling and with the coding errors corrected in Section 5.3, are presented in Table 17.
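
The resampling rule described above can be sketched on a toy data frame. The column names `status` and `group` and the helper objects below are hypothetical and only illustrate the two conditions stated in the text (drop incompatible panels, keep all propensity groups); the chunk that follows applies the actual selection through `keep_hh`.

```r
# Toy panel (NOT CDDP data): one row per household
panel <- data.frame(
  ident  = 1:6,
  status = c("Identical", "Slightly different", "Different",
             "Very different", "Mostly inconsistent", "No match"),
  group  = c("Low", "High", "Very high", "Low", "High", "Low"),
  stringsAsFactors = FALSE
)

# Statuses treated as incompatible baseline-endline compositions
drop_status <- c("Mostly inconsistent", "No match")

# Keep compatible panels, whatever the borrowing propensity group
resampled <- panel[!panel$status %in% drop_status, ]
resampled
```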
```{r other_correct3, message=F, warning=F, echo=F}
keep_hh <- mutate(keep_hh, ident = as.character(ident))
consolidated2_bl_lp <- consolidated2_bl %>%
  left_join(keep_hh, by = "ident") %>%
  filter(keep_hh == 1) %>%
  flag_trimobs(trim_vars = trim_vars_el)
consolidated2_el_lp <- consolidated2_el %>%
  left_join(keep_hh, by = "ident") %>%
  filter(keep_hh == 1) %>%
  flag_trimobs(trim_vars = trim_vars_el)
# merge baseline and endline
consolidated2_lp <- consolidated2_bl_lp %>%
  full_join(consolidated2_el_lp, by = "ident", suffix = c("_bl", "_el")) %>%
  set_missing_controlvars() %>%
  subsample()
# Correcting measurement errors with trim 0.5
sample_m_05_lp <- consolidated2_lp %>%
  filter(trimobs_el != 1)
# Compute N for table
n_consolidated2_lp <- nrow(consolidated2_lp)
n_sample_m_05_lp <- nrow(sample_m_05_lp)
nt_sample_m_05_lp <- sum(sample_m_05_lp$treatment_el == 1)
nc_sample_m_05_lp <- sum(sample_m_05_lp$treatment_el == 0)
# corrected measurement, trim at 0.5, consistent panel and low propensity households
reg_sa_m_05_lp <- reg(sample_m_05_lp, dep_vars = selfact, fullreg = FALSE,
                      var_out = "treatment_el")
reg_sa_m_05_lp <- reg_sa_m_05_lp %>%
  tibble() %>%
  mutate(`Outcome` = names(reg_sa_m_05_lp))
colnames(reg_sa_m_05_lp) <- c("Some measurement error corrections, trim at 0.5\\%, consistent panel and including `low propensity' households",
                              "Outcome variables")
# LaTeX output
reg_sa_m_05_lp <- reg_sa_m_05_lp %>%
  select("Outcome variables",
         "Some measurement error corrections, trim at 0.5\\%, consistent panel and including `low propensity' households") %>%
  column_to_rownames(var = "Outcome variables") %>%
  t() %>%
  as.data.frame() %>%
  as_tibble() %>%
  rownames_to_column(var = "Source") %>%
  filter(`Source` != "Outcome variables")
reg_sa_m_05_lp_tb <- tb_asis %>%
  bind_rows(reg_sa_m_05_lp) %>%
  mutate(`Source` = c("For memory: initial CDDP results",
                      "Consistent panel and some error corrections"))
reg_sa_m_05_lp_tb %>%
  mutate_all(as.character) %>%
  mutate_all(linebreak) %>%
  kable(format = "latex", booktabs = T,
        escape = F, # problem with that: no footnote possible
        col.names = linebreak(c(" ","Assets", "Sales and home\nconsumption",
                                "Expenses", "Of which:\nInvestment","Profit")),
        caption = "Replicated impact estimates correcting some measurement, coding and sampling errors") %>%
  kable_styling(latex_options = "HOLD_position") %>%
  column_spec (1, width = "6.5cm") %>%
  column_spec (3, width = "2.2cm") %>%
  column_spec (5, width = "1.5cm") %>%
  footnote(general = c("Source: Our reproduction of CDDP Table 3 with R using the same raw data, resampling for a consistent panel and correcting the coding and measurement errors listed in Section 5: omission of credits from other MFIs in total access to credit; omission of credits that matured before the survey in the variable; omission of agricultural assets in the total of assets owned by households; erratic prices used to appraise agricultural assets; livestock assets excluding non-existent units; business earnings omitted some business sales; confusions between prices before, during and after harvest to appraise agricultural sales and consumption; and inconsistent amortisation rules for agricultural investments. The sample includes 3,268 households interviewed both at baseline and at endline and whose member gender and age composition is compatible between baseline and endline. 0.5 percent of observations are trimmed using the method applied by CDDP at baseline for Table 3. Coefficients and standard errors (in parentheses) from an OLS regression of the variable on a treated village dummy, controlling for strata dummies (paired villages), number of household members, number of adults, head age, does animal husbandry, does other non-agricultural activity, had an outstanding loan over the past 12 months, HH spouse responded to the survey, and other HH member (excluding the HH head) responded to the survey and variables specified below.
Standard errors are clustered at the village level.",
                       "*** Significant at the 1 percent level",
                       "** Significant at the 5 percent level",
                       "* Significant at the 10 percent level"),
           general_title = "", threeparttable = TRUE)
reg_sa_m_05_lp <- reg(sample_m_05_lp, dep_vars = selfact, fullreg = FALSE,
                      var_out = "treatment_el",
                      rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_bl + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + output_total_bl + profit_total_bl + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + factor(paire_el)")
```

With a consistent panel sample, we have `r c_th(n_sample_m_05_lp)` observations. We see that focusing the analysis on this consistent panel yields different results: the impact estimate on sales is smaller and less significant, the impact estimate on expenses is smaller, and the impact estimate on profits is no longer significant.

In Table 18 we check for imbalances at baseline for this resampling, as well as the impact at endline for a series of outcomes.
```{r other_correct, message=F, warning=F, echo=F} ############################ consolidated_misc_lp <- check_misc(el, consolidated2_lp) consolidated_misc_lp <- check_misc(bl, consolidated_misc_lp) consolidated_misc_lp$trimobs_bl <- ifelse(is.na(consolidated_misc_lp$trimobs_bl), 0, consolidated_misc_lp$trimobs_bl) vars_analyse_bl <- paste0(vars_analyse, "_bl") balance_misc_lp <- reg_balance(treatment = treatment_el, dep_vars = vars_analyse_bl, controls = "factor(paire_bl)", cluster = "demi_paire_bl", data = filter(consolidated_misc_lp, trimobs_bl != 1)) %>% mutate(`Dependent variable` = str_remove(`Dependent variable`, "_bl$")) impact_misc_lp <- reg(filter(consolidated_misc_lp, trimobs_el != 1), dep_vars = vars_analyse_el, fullreg = FALSE, separator = " ", var_out = "treatment_el")%>% as.data.frame() %>% rownames_to_column(var = "code") %>% mutate(code = str_remove(code, "_el$"), n = nrow(filter(consolidated_misc_lp, trimobs_el != 1))) %>% select(1, 3, 2) impact_misc_lp2 <- reg(filter(consolidated_misc_lp, trimobs_el != 1), dep_vars = vars_analyse_el, fullreg = FALSE, var_out = "treatment_el", separator = " ", rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_bl + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + output_total_bl + profit_total_bl + cm_hom_bl + nb_TV_couleur_bl + cm_ne_ds_douar_bl + elec_bl + assain_1_bl + assain_2_bl + eau_2_bl + wom_no_souk_bl + wom_no_bus_bl + factor(paire_el)")%>% as.data.frame() %>% rownames_to_column(var = "code") %>% mutate(code = str_remove(code, "_el$")) var_names <- tibble(c("cm_darija_tot", "Darija", "Household head spoken language", 3), c("cm_berbere_tot", "Berber", "Household head spoken language", 3), c("cm_arabe_class_tot", "Classical Arabic", "Household head spoken language", 3), 
c("cm_francais_tot", "French", "Household head spoken language", 3), c("cm_hom", "Male head", "Household characteristics", 1), c("cm_fonctionnaire", "Head is a public servant", "Household characteristics", 1), c("cm_ne_ds_douar", "Head born in the same village", "Household characteristics", 1), c("chef_ss_educ", "Head without education", "Household characteristics", 1), c("nb_TV_couleur", "Number of color TVs", "Household assets", 6), c("elec", "Electricity from grid", "Access to basic utilities", 8), c("assain_1", "Sewage network", "Access to basic utilities", 8), c("assain_2", "Septic tank", "Access to basic utilities", 8), c("eau_1", "Private connection to piped water", "Access to basic utilities", 8), c("eau_2", "Shared connection to public tap", "Access to basic utilities", 8), c("owns_land", "Owns land", "Household assets", 6), c("area_land", "Area of owned land", "Household assets", 6), c("migr_tot", "Members left in the last 5 years", "Household characteristics", 1), c("wom_no_souk", "Go to the souk alone", "Respondent considers that women should not", 9), c("wom_no_bus", "Take the bus alone", "Perception of woman condition", 9), c("assets_total","Assets", "Outcomes on self-employment activities", 0), c("output_total","Sales and home consumption", "Outcomes on self-employment activities", 0), c("expense_total","Expenses", "Outcomes on self-employment activities", 0), c("inv_total","Of which: Investment", "Outcomes on self-employment activities", 0), c("profit_total","Profit", "Outcomes on self-employment activities", 0)) colnames(var_names) <- 1:24 var_names <- t(var_names) colnames(var_names) <- c("code", "label", "category", "rank") results_lp <- var_names %>% as.tibble() %>% arrange(rank) %>% left_join(balance_misc_lp, by = c("code" = "Dependent variable")) %>% left_join(impact_misc_lp, by = "code") %>% left_join(impact_misc_lp2, by = "code") results_lp2 <- results_lp %>% select(-code, -category, -rank) colnames(results_lp2) <- c("Variable", "Obs. 
", "Obs.", "Mean", "SD", "Coeff.\\textsuperscript{1}", "p-value", "Obs. ","Correcting some errors\\textsuperscript{2}", "Adding controls\\textsuperscript{3}") results_lp2 %>% mutate(`Obs. ` = as.numeric(`Obs. `), `Obs.` = as.numeric(`Obs.`)) %>% # mutate_all(linebreak) %>% kable(format = "latex", booktabs = T, escape = F, format.args = list(big.mark = ","), # align = c(rep("l", 10)), caption = "Balance tests at baseline and impact estimates at endline, correcting some measurement and sampling errors") %>% kable_styling() %>% add_header_above(c(" " = 1, "N" = 1, "Control group" = 3, "Treatment - Control" = 2, " " = 1,"ATE estimates" = 2)) %>% add_header_above(c(" " = 1, "Balance at baseline" = 6, "Impact at endline" = 3)) %>% group_rows("Outcomes on self-employment activities", 1, 5) %>% group_rows("Household characteristics", 6, 10) %>% group_rows("Household head spoken language", 11, 14) %>% group_rows("Household assets", 15, 17) %>% group_rows("Access to basic utilities", 18, 22) %>% group_rows("Respondent considers that women should not:", 23, 24) %>% # column_spec (1, width = "5cm") %>% footnote(general = c("*** Significant at the 1 percent level; ** Significant at the 5 percent level; * Significant at the 10 percent level.", "1. Exact same specifications as in Table 1; 2. Same specifications as in Table 17; 3. Same specifications as in Table 18, adding as controls the baseline values of sales, profits, head was born in the same village, household has a connexion to the electricity grid, to the sewage network, to a septic tank, access to a public tap, respondent considers that women should not go to souk alone and that women should not take the bus alone. 
Sample includes 3,268 households interviewed both at baseline and endline and whose member gender and age composition is compatible between baseline and endline."), general_title = "", threeparttable = T) %>% landscape()
```

Table 18 confirms that even after correcting some measurement and coding errors and focusing on a consistent sample, we still find important imbalances at baseline in sales and profits, household head gender and origin, access to electricity, water and sanitation, and opinions on women's empowerment. When applying the same corrections of coding, measurement and sampling errors at endline, we find that the impacts on assets and profits are not significant, and that the main results are to be found in increasing turnover from self-employment activity. However, we also observe disconcerting estimates on other outcomes. Microcredit would then increase household head education, encourage members to leave the household, increase the knowledge of Arabic and French, impede households' access to public sewage and incentivise the use of septic tanks, as well as access to public taps. We also see that households buy more TVs, while a prominent conclusion of CDDP was that microcredit reduced nonessential expenditures. Such outcomes are hardly plausible and we interpret them as an indication of a lack of quality of the data and of alterations in the protocol and the survey sampling.

# 7. External validity: what might the results be representative of?

If the sampled households do not represent high borrowing propensity rural households, then what do they represent? The inconsistent scoring system explained in Section 4.3 skewed the representativeness of the baseline sample towards a population subset. Score 1 tended towards the sampling of households owning less land, with fewer cows and more non-agricultural self-employment activities. Yet scores 2 and 3, used to add new households at the endline, tended more towards the inclusion of agricultural households.
We can compare some of this population’s characteristics with other Moroccan data taken from sources such as major national surveys or censuses of the rural population. For instance, CDDP report a monthly consumption average of MAD 2,272 per household at baseline (data collected from April 2006 to December 2007) compared with the MAD 3,611 found in Morocco’s rural population by the 2007 National Living Standards Survey [@direction_de_la_statistique_enquete_2007, data collected between December 2006 and November 2007]. This means either that the study population was 37% poorer than the average population or that there are inconsistencies between the household expenditure estimation method used by this RCT survey and the national household survey. @pamies_sumner_development_2015 [: 72-74] pointed out, for instance, that the questionnaire designed by CDDP deviated considerably from the living standards measurement survey questionnaire and procedures developed by Moroccan statisticians for domestic surveys. CDDP also report that household heads are men in 93.5% of cases as opposed to the 87.4% average for rural households found in the 2004 population and housing census [@direction_de_la_statistique_recensement_2005]. Section 6.2 also showed that the average household size in the RCT sample stood at 5.17 members at baseline and 6.13 members at endline. Moroccan rural households had an average of 6.03 members in 2004 and 5.35 members in 2014, displaying a decreasing pattern contrary to the experiment’s observations. In short, the RCT sample covered households with lower income and different demographic characteristics from the average Moroccan rural population, and with the opposite trend in household size. So what are they representative of?

# 8. Conclusion

This replication was made possible by the fact that the authors and the journal shared the data and code used to produce the published results.
This is commendable and should be further encouraged, as it will enhance the reproducibility and credibility of empirical research, in particular in development economics. The replication of this RCT on microcredit in Morocco identifies a number of shortcomings that challenge the conclusions drawn by CDDP. The trimming procedure used on the data by the authors is debatable and the impact estimations rely heavily on the trimming threshold selected. Trimming at slightly different thresholds returns different or statistically non-significant results. We also find that the sample was significantly imbalanced at baseline on the main outcomes, as well as on several other important variables. We apply the same regressions as in the original paper while controlling for these imbalances at baseline, and find that the impacts on profits do not hold and that the increases in expenses and outputs were underestimated. We also find impacts on variables that are unlikely to be influenced by microcredit. This suggests there are issues in the quality of the underlying data or issues with the integrity of the experiment. We identify numerous sampling errors and measurement errors. The measurement errors are due to inconsistent survey data, faulty variable recoding and a number of coding errors. In particular, the authors collected information from the microcredit institution’s information system and appended it to the survey data. Their demonstration relies essentially on this administrative data, which proves to be largely inconsistent with the borrowing information collected by the surveys. The authors’ explanations for the differences between survey data and administrative data are implausible in most cases. Handling the coding errors and measurement errors that can be addressed using the available data alters the average treatment effect coefficients and significance tests.
However, these rectifiable errors are relatively well balanced between treatment and control groups and their correction does not, in itself, disqualify the main conclusions of the first part of the published article. Yet the measurement errors do raise major concerns about the reliability of the second part (externalities and LATE), which is based on inconsistent administrative data. The conclusions of the published article are further called into question when sampling errors are also taken into consideration. Households were sampled based on their answers to a short preparatory survey, but the data collected from the same households on the same variables at baseline differ considerably. The borrowing propensity score used as the sampling criterion at baseline fails to predict borrowing and is at odds with the revised borrowing propensity scores used as sampling criteria in a second stage to add new households at endline. The average number of household members grew from 5.17 to 6.13 between the baseline and endline surveys. The gender and age composition of one fifth of the households interviewed at baseline and re-interviewed at endline differs to such an extent that it is not plausible that the same units were re-interviewed in these cases. These sampling errors undermine both the internal and external validity of the RCT. They also cast doubt over what was tested: whether it was increased access to microcredit in the treatment group, credit rationing in the control group or substantial variations in other credit sources. We conclude that this RCT lacks both internal and external validity. Our understanding of these shortcomings is that they are largely due to poor-quality survey data. Data quality and sampling integrity are systematically analysed for standard surveys (such as Demographic and Health Surveys and Living Standards Measurement Surveys) and are reported in the survey reports’ appendices.
This does not appear to be common practice for most RCT ad-hoc surveys and was not the case with CDDP. It would seem appropriate to align survey methods and practices used for RCTs with the quality standards established for household surveys conducted by national statistical systems [@deaton_analysis_1997; @division_household_2005]. This implies adopting sound unit definitions (household, economic activity, etc.), drawing on nationally tried-and-tested questionnaire examples, working with professional statisticians with experience of quality surveys in the same country (ideally nationals), properly training and closely supervising survey interviewers and data entry clerks, and analysing and reporting measurement and sampling errors. This would also entail taking seriously the question of local context and imperfect RCT implementation processes. In their article, CDDP cite 17 references: nine RCTs, four on econometric methodology, three non-RCT empirical studies from India and one economic theory paper. No reference is made to other studies on Morocco, microfinance particularities or challenges encountered with this particular RCT. This is especially surprising in the case at hand, since this RCT was the subject of debate and of a number of published papers, including in well-regarded journals, prior to the article by CDDP, all seeking to constructively comment on and contextualise this Moroccan RCT [@bernard_impact_2012; @doligez_evaluer_2013; @morvant-roux_adding_2014; @pamies-sumner_les_2014]. @morvant-roux_adding_2014, in particular, built on an extensive literature review on borrowing in rural Morocco and their own qualitative empirical data to improve our understanding of microcredit take-up patterns in treatment and control villages. Among other criteria, they found strong collinearity at village level in terms of agro-ecological settings, land ownership structures and the socio-political relationship with Moroccan Kingdom institutions.
It would be particularly interesting to conduct a reanalysis of CDDP based on compound variables that classify the villages along these criteria. # References
\newpage

# Appendix

## Appendix 1: Reclassification of utility credit

In the questionnaire, the ‘*Other, specify:*’ option was followed by a field where the respondent was supposed to give the name of this unspecified source. We present in Table 19 below the occurrences encountered in this complementary variable and their corresponding frequencies.

```{r, message=F, warning=F, echo =F}
util_tb3 %>%
  kable(caption = "Reclassification of \"other\" credits that had all been reclassified as \"Utility\" by CDDP", format = "latex", booktabs = T, align = c("c","c", "c", "c","c", "l"), longtable = T) %>%
  kable_styling(full_width = T, latex_options = "HOLD_position") %>%
  column_spec (2, width = "1.8cm") %>%
  column_spec (2, width = "2.1cm") %>%
  column_spec (3, width = "1.9cm") %>%
  column_spec (6, width = "7cm") %>%
  footnote(general = c("Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys."), general_title = "", threeparttable = T)
```

We see in Table 19 that, at baseline for instance, a specification corresponding to a utility company was provided in 29% of the cases, but in the others, the specifications corresponded to other types of sources (local stores, consumer lending, real estate purchase, etc.) or were missing. This indicates that, both at baseline and endline, credits registered as ‘*other*’ should not have been systematically reclassified as ‘*utility credit*’.

\newpage

## Appendix 2: Code excerpts of the coding errors explained in Section 3

### A.2.1 Credit from other MFIs was omitted at baseline

The Stata code section the authors used to compute total access to credit and borrowed amount was, at baseline:

- For active loans: `egen aloans_total = rowtotal(aloans_alamana aloans_oformal aloans_informal aloans_branching);` (BL:52)
- For loans that matured in the last 12 months: `egen ploans_total = rowtotal(ploans_alamana ploans_oformal ploans_informal ploans_branching);` (BL:113)

At endline, the same script section became:

- For active loans: `egen aloans_total = rowtotal(aloans_alamana aloans_oamc aloans_oformal aloans_informal aloans_branching);` (EL:138)
- For loans that matured in the last 12 months: `egen ploans_total = rowtotal(ploans_alamana ploans_oamc ploans_oformal ploans_informal ploans_branching);` (EL:158)

A comparison of the baseline and endline code reveals that, at baseline, the variables ‘aloans_oamc’ (i.e. household’s number of outstanding loans from other MFIs) and ‘ploans_oamc’ (i.e. household’s number of loans from other MFIs that matured in the last 12 months) were omitted when creating the ‘aloans_total’ and ‘ploans_total’ variables, which were in turn summed into ‘loans_total’ (i.e. the total number of loans taken by each household). This means that the loans from other MFIs were not taken into account when reporting access to credit and assessing the balance between treatment and control groups at baseline. The same mistake was made for the analogous ‘aloansamt_total’ and ‘ploansamt_total’ variables, which correspond to the total amount borrowed by each household.

### A.2.2 Only outstanding loans were taken into account at baseline

Section 5.1.3 discusses the code used by CDDP to count the number of loans taken out by each household from different source categories: AAA, other MFIs, other formal sources, informal sources and utility companies.
First, loans outstanding at the time of the survey were counted (‘aloans_[SOURCE]’, where [SOURCE] corresponds to each type of source). Second, loans no longer outstanding at the time of the survey, but outstanding in the past 12 months, were counted (‘ploans_[SOURCE]’). Third, the two previous categories (aloans_[SOURCE] and ploans_[SOURCE]) were summed to obtain the total number of loans outstanding in the past 12 months (loans_[SOURCE]). Yet it is not the total number of loans that was taken into account in the analysis. What was taken into account by CDDP is a dummy version of the loan count. In other words, a new variable (named ‘borrowed_[SOURCE]’) was created for each source category. This variable takes the value ‘*0*’ if the household had no loan from the source category in the last 12 months. It takes the value ‘*1*’ if the household had one or more loans from the source category in the last 12 months. There is, however, an error in the way this variable was computed at baseline. This is the code used to produce the ‘borrowed_[SOURCE]’ variables at baseline (BL: 171-176):

```
*** DUMMY of loans over the period ***;
foreach var in alamana oamc oformal informal branching total oformal2{;
  gen borrowed_`var'=0 if loans_`var'!=.;
  replace borrowed_`var'=1 if aloans_`var'>=1 & aloans_`var'!=.;
};
```

The reader will notice that what is transformed into 1 or 0 are the variables ‘aloans_[SOURCE]’ (starting with “a”), that is, only the loans that were *outstanding at the time of the survey*.

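The consequence of the baseline rule can be illustrated with a minimal R sketch. The three households and their loan counts are invented for illustration (they are not CDDP data), and the column names `borrowed_as_coded` and `borrowed_as_described` are hypothetical labels of ours; only the two rules themselves are taken from the description and the Stata excerpt above.

```r
# Toy data: hypothetical loan counts for three households (NOT CDDP data)
hh <- data.frame(
  ident  = c("A", "B", "C"),
  aloans = c(1, 0, 0),  # loans outstanding at the time of the survey
  ploans = c(0, 1, 0),  # loans that matured within the past 12 months
  stringsAsFactors = FALSE
)
hh$loans <- hh$aloans + hh$ploans  # total loans over the past 12 months

# Rule as coded at baseline: the dummy switches to 1 only for outstanding loans
hh$borrowed_as_coded <- ifelse(!is.na(hh$loans), 0, NA)
hh$borrowed_as_coded[!is.na(hh$aloans) & hh$aloans >= 1] <- 1

# Rule as described in the text: any loan over the past 12 months counts
hh$borrowed_as_described <- ifelse(!is.na(hh$loans), 0, NA)
hh$borrowed_as_described[!is.na(hh$loans) & hh$loans >= 1] <- 1

# Households classified differently by the two rules
mismatch <- hh$ident[hh$borrowed_as_coded != hh$borrowed_as_described]
print(hh)
print(mismatch)
```

In this toy example, only household B, whose single loan matured before the survey, is classified differently by the two rules: it is a non-borrower under the baseline code and a borrower under the definition given in the text.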
On the other hand, this is the code that was used to produce the ‘borrowed_[SOURCE]’ variables at endline (EL: 216-221):

```
*** DUMMY equal to 1 if had a loan over the period ***;
foreach var in alamana oformal informal branching oamc total oformal2{;
  gen borrowed_`var'=0 if loans_`var'!=.;
  replace borrowed_`var'=1 if loans_`var'>=1 & loans_`var'!=.;
};
```

The reader will notice that what is transformed into 1 or 0 are the variables ‘loans_[SOURCE]’ (not starting with “a”), that is, the loans that were outstanding at the time of the survey and also the loans that were not outstanding at the time of the survey, but were outstanding in the previous 12 months. In other words, it includes all loans that were *outstanding in the past 12 months*.

### A.2.3 Recoding of "other" credit

When recoding the credit variables presented in 3.1.4, CDDP used the following script in both baseline and endline do-files: `gen branching 'j' = (i3_'j' == 16 | i3_'j' == 17);` (BL:43, EL:92)

This means that all sources registered as ‘*Other, specify:*’ were reclassified as ‘*Utilities credit*’.

### A.2.4 '*Tractors*' and '*reapers*' removed from asset appraisal at endline

The Stata script used by CDDP to compute agricultural assets at baseline includes the following code:

- `egen asset_agri=rsum(ag_1-ag_16);` (BL:269)

The script segment used for the same measure at endline is written as follows:

- `egen asset_agri=rsum(ag_3-ag_16);` (EL:371)

The fact that `ag_1` was replaced by `ag_3` means that, at endline, the assets indexed 1 and 2 in the survey questionnaire (i.e. tractors and reapers) are excluded from the sum of agricultural asset values calculated for each household.

\newpage

## Appendix 3: List of coding errors with minor incidence on impact estimates

This appendix presents a series of measurement and coding errors that were only mentioned in the paper. The coding errors could be corrected and did not substantially or significantly alter the estimated impacts.
The measurement errors were limited in magnitude, but provide an additional illustration of the reliability issues in the data used by CDDP.

### A.3.1 Debatable amortisation rule for asset expenses

CDDP computed a series of variables to capture investments in different activity categories: investment in livestock activities, in agricultural activities and in business activities. They also sum these investments across activity categories in the variable ‘inv_total’, which is one of the main outcome variables on which impact is estimated. All investments correspond precisely to expenses that are also included in the ‘expense_livestock’, ‘expense_agri’ and ‘expense_business’ variables used in the expense impact estimations. There is, however, one notable exception: at endline (EL:736-740), purchases of agricultural assets for an amount over MAD 10,000 (all corresponding to tractors, reapers, cars and trucks) are divided by 10. For instance, a tractor purchased for MAD 60,000 is counted as MAD 6,000. There is no mention of this in the paper, but it presumably corresponds to amortisation. However, this is inconsistent for four reasons:

- No such rule was applied to compute baseline expenses, as reported by CDDP (Table 1);
- No amortisation rule was defined for any other investment in any durable assets;
- Other investments for amounts over MAD 10,000 in business assets (cars and trucks) were not amortised;
- One-tenth of all assets with a value over MAD 10,000 purchased in the last nine years should also have been counted in expenses, but this could not be the case as the recall period for asset purchase was only 12 months.

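The rule described above can be sketched in a few lines of R. The purchase amounts are invented for illustration, apart from the MAD 60,000 tractor taken from the text, and `amortise_large()` is a hypothetical helper of ours, not a function from the CDDP do-files. We also assume a strict "greater than" comparison with the MAD 10,000 threshold.

```r
# Hypothetical agricultural asset purchases in MAD (toy values, NOT CDDP data)
purchases <- c(tractor = 60000, plough = 4000, reaper = 25000)

# Rule as described in the text: purchases above the threshold are divided by 10.
# amortise_large() is an illustrative helper; the strict ">" is an assumption.
amortise_large <- function(amount, threshold = 10000, divisor = 10) {
  ifelse(amount > threshold, amount / divisor, amount)
}

counted <- amortise_large(purchases)

print(counted)         # amount retained for each purchase
print(sum(purchases))  # total spent
print(sum(counted))    # total counted under the rule
```

With these toy values, MAD 89,000 of purchases is counted as MAD 12,500, which shows how strongly the rule can compress recorded agricultural investment for households buying large equipment.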
```{r, echo = FALSE, warning = FALSE, message = FALSE}
invagri_asis <- extract_ast_agr(el) %>%
  appraise_inv_agri() %>%
  op_expense_agriinv(correct = FALSE)

invagri_ok <- extract_ast_agr(el) %>%
  appraise_inv_agri() %>%
  op_expense_agriinv(correct = TRUE) %>%
  left_join(invagri_asis, by = "ident", suffix = c("_ok", "_asis"))

invagri_ok %>%
  summarise(expense_agriinv_asis = mean(expense_agriinv_asis, na.rm = TRUE),
            expense_agriinv_ok = mean(expense_agriinv_ok, na.rm = TRUE))

test <- sample %>%
  left_join(invagri_ok, by = "ident") %>%
  mutate(expense_total_elok = expense_total_el - expense_agriinv_asis + expense_agriinv_ok)

imp_expagri <- reg(test, dep_vars = c("expense_total_elok", "expense_total_el"),
                   fullreg = FALSE, var_out = "treatment_el", separator = " ",
                   rest_form = "~ treatment_el + members_resid_bl + nadults_resid_bl + head_age_bl + act_livestock_bl + act_business_bl + borrowed_total_blok + members_resid_bl_d + nadults_resid_bl_d + head_age_bl_d + act_livestock_bl_d + act_business_bl_d + borrowed_total_bl_d + ccm_resp_activ + other_resp_activ + ccm_resp_activ_d + other_resp_activ_d + factor(paire_el)")
```

### A.3.2 Miscalculation of livestock assets

The following segment of code is used to appraise livestock assets, both at baseline (BL:307-314) and endline (EL:412-422):

```
* Value of stock of livestock assets;
foreach j of numlist 1(1)3 { ;
  gen assetlive`j'=0;
  gen unitprice`j' = f4_`j' / f2_`j' if f4_`j'>0 & f4_`j'!=. & f2_`j'>0 & f2_`j'!=.;
  sum unitprice`j' if unitprice`j'>0, detail;
  replace assetlive`j'=r(p50)*f2_`j' if f2_`j'>0 & f2_`j'!=.;
};
gen assetlive4 = f4_4 if f4_4>0 & f4_4!=.;
gen assetlive5 = f4_5 if f4_5>0 & f4_5!=.;
egen asset_livestock = rsum(assetlive1-assetlive5);
```

This script creates three variables in the form of 'unitprice1', 'unitprice2' and 'unitprice3' to compute median prices of each asset type and inserts them between 'assetlive3' and 'assetlive4'.
The last line of the script sums all variables according to their location in the Stata dataset, starting with 'assetlive1' and ending with 'assetlive5'. We understand that the authors’ intention was to sum only the variables 'assetlive1', 'assetlive2', 'assetlive3', 'assetlive4' and 'assetlive5', but they unintentionally included 'unitprice1', 'unitprice2' and 'unitprice3' in this total. In other words, they mistakenly added one unit price to each asset type when appraising the value of livestock assets owned by households.

```{r, warning=F, message=F, echo=F}
lsk1 <- el %>%
  extract_lsk_ast() %>%
  appraise_lsk_ast(correct = FALSE)

lsk2 <- el %>%
  extract_lsk_ast() %>%
  appraise_lsk_ast(correct = TRUE) %>%
  left_join(lsk1, by = "ident", suffix = c("_ok", "_asis"))

lsk_ok <- lsk2 %>%
  summarise(asset_livestock_asis = mean(asset_livestock_asis, na.rm = TRUE),
            asset_livestock_ok = mean(asset_livestock_ok, na.rm = TRUE),
            nb_change = sum(asset_livestock_asis != asset_livestock_ok, na.rm = TRUE))
```

### A.3.3 Subset of business income not taken into account at endline

The following script is used to capture service sales at endline:

```
foreach j of numlist 1(1)6 {;
  foreach i of numlist 1(1)4 {;
    replace sale_business = sale_business + g35_`j'_`i'*12 if g35_`j'_`i' != . & g35_`j'_`i' >=0;
  };
};
```

It loops over service sales (`g35`) activities (`j`) 1 to 6 and over items (`i`) 1 to 4. However, there are as many as six items in the database. Items 5 and 6 are not accounted for.

### A.3.4 Confusion between prices before, during and after harvest

For each agricultural product (cereals, fruit tree production and vegetables), median prices before, during (for fruit) and after harvest were computed and imputed for all production for which the transaction price was not registered.
However, there are a number of errors at endline:

- Prices after harvest were mistakenly imputed to cereal sales before harvest at endline (EL:581);
- Prices before harvest were mistakenly imputed to cereal savings, i.e. cereal kept after harvest (EL:837); and
- Prices before harvest were mistakenly imputed to tree sales during and after harvest (EL:610-618).

The errors listed above concern only the endline preparation do-file. The sections on sales and savings of cereals, fruit tree production and vegetables for baseline preparation are also plagued by errors: some item types are mysteriously not taken into account (e.g. BL:477 for cereals: only half of the cereal types are included) and the evaluation of (frequent cases of) items with missing transaction prices is inconsistent (sometimes not accounted for and sometimes with a price before or after harvest). These errors at baseline have an effect on the balance test between treatment and control villages put forward by CDDP (Table 1).

### A.3.5 Incomplete and inconsistent information on control variables

The control variables include variables from the baseline survey: number of household members, number of adults, age of household head, household does animal husbandry (‘*yes*’ or ‘*no*’), household does other non-agricultural activities (‘*yes*’ or ‘*no*’), and household had an outstanding loan in the last 12 months (‘*yes*’ or ‘*no*’). They also include dummies for whether the spouse responded to the survey, and whether another household member (excluding the household head) responded to the survey. Missing values for all variables are converted into 0 and dummy variables are created for each of these variables where a value is missing. Some missing and absurd values are found for the controls, albeit in small numbers.
For instance:

- 48 households are registered as having more than one head at baseline;
- The age of the household head is missing in 28 cases at baseline and 20 household heads are registered as being under 10 years old;
- Four households have no members at baseline, and six at endline.

The number of faulty or missing values could be considered relatively low given the sample size. However, it does illustrate that no serious data cleaning was undertaken, even for the most basic variables. In ordinary surveys, and especially in high-quality surveys, such minimal requirements are systematically met.

### A.3.6 Coding errors on control variables

Due to the coding error described in 3.1.2, the “had an outstanding loan over the previous 12 months at baseline” variable does not include credit from other MFIs. There are 28 missing values for the household 'head_age' variable, four for the 'members_resid' variable and nine for 'nadults_resid'. In principle, the variables flagging whether these values are missing (respectively 'head_age_d', 'members_resid_d' and 'nadults_resid_d') should take the value 1. But the code in AN failed to flag them properly.

\newpage

## Appendix 4: Illustration of household composition mismatch between baseline and endline

Tables 20 and 21 below provide a simple illustration of the first five lines in the original dataset classified as totally or mostly inconsistent by the algorithm presented in Section 5.1. The left side columns present the ages and gender of those household members at baseline and the right side columns present the age and gender at endline of the (in principle) same household. The reader can observe the discrepancies: no plausible narrative could explain such a transformation in household composition. Table 20 and Table 21 report the ages of all members at endline and baseline, the total number of members at baseline, the number of mismatches and the matching category in which the household was classified.
A couple of lines are sufficient for readers to be able to assess the consistency of the computation we ran. The third case in Table 20 has no members at baseline. This is one of six occurrences in the entire dataset where no information on members was entered at baseline (see Section 5.2.1). ```{r hh_chk_head, message=F, warning=F, message=F, echo=F} incons_tb <- hh_chk13 %>% filter(status == "Mostly inconsistent") %>% select(`Household identifier` = ident, `Age of female members at BL` = age_f_mb_bl, `Age of female members at EL` = age_f_mb_el, `Age of male members at BL` = age_m_mb_bl, `Age of male members at EL` = age_m_mb_el, `Number of members at BL` = mb_bl, `Number of inconsistencies between BL and EL` = mb_nochk, `Matching status` = status) %>% head(n = 5L) incons_tb %>% kable(caption = "First occurrences of household compositions classified as mostly inconsistent", format = "latex", booktabs = T, longtable = T) %>% kable_styling(full_width = T, latex_options = "HOLD_position") %>% column_spec (7, width = "2.2cm") %>% column_spec (8, width = "2.2cm") %>% # column_spec (5, width = "2.3cm") %>% # column_spec (6, width = "2.3cm") %>% footnote(general = c("Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys."), general_title = "", threeparttable = T) incons_tb2 <- hh_chk13 %>% filter(status == "No match") %>% select(`Household identifier` = ident, `Age of female members at BL` = age_f_mb_bl, `Age of female members at EL` = age_f_mb_el, `Age of male members at BL` = age_m_mb_bl, `Age of male members at EL` = age_m_mb_el, `Number of members at BL*` = mb_bl, `Number of inconsitencies between BL and EL` = mb_nochk, `Matching status` = status) %>% head(n = 5L) incons_tb2 %>% kable(caption = "First occurrences of household compositions classified as no match", format = "latex", booktabs = T, longtable = T) %>% kable_styling(full_width = T, latex_options = "HOLD_position") %>% # column_spec (5, width = "2.3cm") %>% # column_spec (6, 
width = "2.3cm") %>% footnote(general = c("Source: Our analysis using CDDP microdata retrieved from baseline and endline surveys."), general_title = "", threeparttable = T)
```

\newpage

## Appendix 5: Scoring factors that were attributed to each variable to compute borrowing propensity scores

The scoring factors presented below correspond to the regression coefficients of the borrowing propensity models built for the subsequent scores. CDDP only provide the coefficients for score 1 in their article [@crepon_estimating_2015: Table A1]. However, knowing the different scores for each observation and the variables included in the models, we reran the regressions for the four scores. As the recomputed scores are a perfect fit with the initial scores at individual level, we are quite confident that the models used by CDDP are similar to ours. The results are presented in Table 22. Levels of significance for the coefficients are reported as p-values are equal or very close to 0 (< 0.001%).

```{r, message=F, warning=F, message=F, echo=F}
scores_coef_print
```

Observation of Table 22 indicates that the coefficients attributed to each scoring variable change drastically from one score to the next, denoting a lack of estimation robustness. Some of them become not significantly different from 0 and vice versa.