
Sampling chapter #102

Open · wants to merge 7 commits into master

Conversation

AlexandreBlake
Copy link
Contributor

@aspina7 I messed up the branch creation a bit. My bad, I don't use git collaboratively very often.
I will commit and push only to this branch from now on.

There are a few things I might need your feedback on:

  • I still have a lot to tweak here and there, and chunks to modify/add, but it should be enough to start getting feedback, so help yourself.

  • For now I generate data in my chunks to illustrate my points rather than load a pre-existing dataset. I find it more convenient, but it adds code that might not be the main interest of this chapter. I also noticed that in other chapters loading data seems to be the rule. Is that a big deal?

  • I assumed that we were focusing on surveys and put the sample size calculation for analytical studies to the side. Did I assume correctly?

@netlify
Copy link

netlify bot commented Feb 16, 2022

Deploy Preview for epirhandbook ready!

🔨 Latest commit: 2b00449
🔍 Latest deploy log: https://app.netlify.com/sites/epirhandbook/deploys/627c7fb4b967680008971e82
😎 Deploy Preview: https://deploy-preview-102--epirhandbook.netlify.app

@aspina7
Copy link
Contributor

aspina7 commented Feb 26, 2022

@AlexandreBlake - sorry for the delay, under the exam pile at the moment... thanks so much, this looks like a great start!
I have added some comments on the code below (linked to line numbers, as I can't seem to comment directly) - and replies below too.

In general, I think I would structure the chapter by having a general intro to the topic as you have, then for each of the methods show first a sample size calculation example and then a sampling example. Happy to jump on a call and discuss if needed! Thanks again, this is very awesome to see!

  • For now I generate data in my chunks to illustrate my points rather than load a pre-existing dataset. I find it more convenient, but it adds code that might not be the main interest of this chapter. I also noticed that in other chapters loading data seems to be the rule. Is that a big deal?

I think it is fine for this section, as they are only small datasets really - @nsbatra thoughts? (Neale, can you also review the chapter to make sure you think it fits the rest of the handbook style, please? Also in terms of length and amount of theory included? Record time spent under r4epis admin_mgt on Clockify.) My initial feeling is that there is a lot of theory explained, and probably long term we should shift this to the methods manual - but for now leave it here, as it is necessary to understand the page.

  • I assumed that we were focusing on surveys and put the sample size calculation for analytical studies to the side. Did I assume correctly?

Yep, I think surveys are the more complex case, so let's start with that - and then we can add the basic analytical studies after. Priority should be on sample size calculation and then link to sampling methods. @pbkeating - can you confirm and review please?

  • line 28: {metR} looks like an interesting package, but is it necessary for the page? Can't that be done with ggplot otherwise (what function are you using)? As an aside, can this package be used to do a density map with cases/km2 for the GIS page?
  • line 36: maybe a decision-tree-type graphic would be useful to help people decide what section to go to and what to consider?
  • line 50: missing the "m" in sampling
  • line 54: @nsbatra to confirm the appropriate way to link to the survey analysis page
  • line 55: I would simplify the wording to read: "involves randomly selecting a sub-population (number of individuals or sampling units) out of a total population (population of an area or a finite number of sampling units) with a probability $p=\frac{n}{N}$." (NB: don't need to have "it" in front of those)
  • line 56: I would simplify the wording to read: "involves selecting a sub-population (number of individuals or sampling units) using a sampling frame. Unlike with SRS, a constant sampling interval $k=\frac{N}{n}$ is used, with the $1^{st}$ sampling unit randomly chosen between 1 and $k$ and every following $k^{th}$ sampling unit selected (i.e. we select every $k^{th}$ individual). It is a reasonable approximation of SRS."
  • line 58: I would change the example to be e.g. vaccination coverage by district or by camp block or something
  • line 71: do we need example code here already? Maybe we should talk about the sample size calculations first and then go into sampling appropriate numbers from there.
  • line 99: I think keep the first SRS example very basic and don't even mention clustering or stratification - and then in later sections build in the complexity. E.g. just say that we select from a list (sampling frame) of all students attending a school.
  • line 101: I think we need to make it clear by adding a sentence here that, normally, sampling frames come from an external source - e.g. an Excel sheet with the name, gender and status of all the children in the school. But we will create a fake dataset to demonstrate below.
  • line 115: I would break this into two steps: first sample to get the numbers (saving as an object), then subset rows with that object.
  • line 138: before showing how variations can impact sample size, I think we need just a very basic example of how to calculate the number of children needed given an alpha of X and a precision of Y... Also the sample size calculation should be moved up to come before the actual sampling (otherwise they don't know how many they need to sample!) - the same applies for all sections: calculation first, then sample. (A rough sketch of these points is included right after this list.)
  • line 294: here too, break up the code into smaller bits per line, rather than having lots going on between parentheses - it makes it less scary for beginners.
  • line 314: consider taking some of the wording which Annick wrote for the r4epis website on sampling; it's less technical, but I think it addresses a lot of what basic users need. https://r4epis.netlify.app/surveys/
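
For the line 115 and line 138 points, a minimal base R sketch of what that could look like (hypothetical names: `school_list` stands in for the external sampling frame mentioned at line 101, and the alpha/precision values are placeholders, not the chapter's actual code):

```r
# Basic sample size for a proportion, given alpha and absolute precision d
p_expected <- 0.5                      # expected proportion (most conservative choice)
alpha      <- 0.05
d          <- 0.05                     # absolute precision
z          <- qnorm(1 - alpha / 2)
n_needed   <- ceiling(z^2 * p_expected * (1 - p_expected) / d^2)

# Simple random sampling in two steps: draw row numbers, then subset with them
set.seed(1)
selected_rows <- sample(nrow(school_list), size = n_needed)
srs_sample    <- school_list[selected_rows, ]

# Systematic sampling: interval k = N/n, random start between 1 and k
k          <- floor(nrow(school_list) / n_needed)
start      <- sample(k, size = 1)
sys_rows   <- seq(from = start, by = k, length.out = n_needed)
sys_sample <- school_list[sys_rows, ]
```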

@AlexandreBlake
Copy link
Contributor Author

AlexandreBlake commented Feb 27, 2022

@aspina7 No worries... the pile is stacking up high here too...
I will go over it again point by point later this coming week. Two things though:

line 28: {metR} looks like an interesting package, but is it necessary for the page? Can't that be done with ggplot otherwise (what function are you using)? As an aside, can this package be used to do a density map with cases/km2 for the GIS page?

I could probably do without it. But it does a bit of interpolation to keep the figure neat, and it minimizes the number of values I need to calculate to get something that smooth for the heatmap and the contour lines. It also probably keeps the chunk simpler than what would be needed with 100% ggplot2, I think. I assume I would need to play with geom_tile and geom_contour after bumping the resolution if I went full ggplot2.
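
For reference, a rough sketch of what that full-ggplot2 route might look like (assuming a hypothetical data frame `grid_df` with columns x, y and value computed on a fine grid):

```r
library(ggplot2)

# heatmap tiles plus contour lines, no interpolation: the grid has to be dense enough
ggplot(grid_df, aes(x = x, y = y)) +
  geom_tile(aes(fill = value)) +
  geom_contour(aes(z = value), colour = "white") +
  theme_minimal()
```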

Also, regarding the GIS sampling part: should it go here or in the GIS chapter? Because I assume that this chapter will come before the GIS one?

@aspina7
Copy link
Contributor

aspina7 commented Feb 28, 2022

Okie dokie, no worries - stick with {metR} then!
I think let's leave GIS sampling here for now, and if it makes more sense we can migrate it later! Thx!

@nsbatra
Copy link
Contributor

nsbatra commented Mar 2, 2022 via email

Adding some paragraphs
Reshuffling the order of some parts
Adding a tree diagram
Reformatting a bit to stick to the format of the other chapters
@AlexandreBlake
Copy link
Contributor Author

@aspina7


  • Done - I simplified the code a bit, adding steps here and there, and slightly reshuffled the order following your line-by-line suggestions above.
  • I still have to go through the GIS sampling part. It is on my to-do list for the next session.

@nsbatra

  • I modified it to stick to the current format. I will just save the generated data in a separate file to load; the same goes for the decision tree figure. I'm keeping that step for the end though: I do not use RStudio... I am not a fan (I know, I am weird), so until we reach the final step it is not super convenient to do here. But is it problematic to keep the little chunk creating data for villages to select clusters by PPS as is? The point of this chunk is to let readers generate the data several times with different values but always end up with an equal selection probability. (A rough sketch of that PPS step is below.)
  • I keep my references at the end for now. They do not quite fit with the Resources section I have seen in other chapters. Should I just keep them but reformat them differently?
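
For context, a rough sketch of what that systematic PPS selection could boil down to (`villages` is a hypothetical data frame with columns village and pop; the number of clusters is a placeholder):

```r
n_clusters <- 30

# cumulative population defines each village's share of the selection range
villages$cum_pop <- cumsum(villages$pop)

k     <- sum(villages$pop) / n_clusters          # sampling interval
start <- runif(1, min = 0, max = k)              # random start within the first interval
hits  <- start + (seq_len(n_clusters) - 1) * k   # systematic selection points

# each selection point falls in exactly one village's cumulative range;
# large villages can be hit more than once (standard PPS behaviour)
selected <- findInterval(hits, c(0, villages$cum_pop))
villages$village[selected]
```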

@aspina7
Copy link
Contributor

aspina7 commented Mar 7, 2022

ok perfect - poke me if/when you need me to take another look, thx!

@nsbatra
Copy link
Contributor

nsbatra commented Mar 9, 2022

@AlexandreBlake thanks for your flexibility in adapting the data format. For the specific chunk you mentioned, where the intention is for the user to re-generate the data, that sounds fine (and cool!).

I think putting references in the Resources section is fine; the format of the text under the title is flexible.

@AlexandreBlake
Copy link
Contributor Author

@aspina7 Sorry for the silence, a PhD milestone thing cannibalized my brain and my time. That is behind me now, so I can push hard on the chapter.

  • I added a chunk on GIS sampling and turned the mock data into a data.frame that we load. There is a fine line on the GIS stuff where I guess I should not get into details but should still make it functional. I refer the reader to the GIS chapter, but I know that some specific things, such as juggling between formats to load your sampled points onto GPS devices for example, are not covered there. Should I expand on that topic (although it feels a bit like a tangent), or are there going to be some additions to the GIS chapter?
  • I will start looking for/making a figure for the overview of the chapter and throw it in there soon.

@aspina7
Copy link
Contributor

aspina7 commented Apr 8, 2022

@AlexandreBlake - no worries at all, just crawling out from under a rock myself...
I think adding an export to GPX would be great (it doesn't need to get too technical).
This way people can import to OsmAnd (and we will have examples of that in the theory manual, not here) - @pbkeating put example code in his post.
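
As a sketch, the GPX export with {sf} could be as simple as this (`sample_points` is a hypothetical sf object of points in WGS 84; column names may need trimming to fit the GPX schema, hence the extensions option):

```r
library(sf)

st_write(
  sample_points,
  "sample_points.gpx",
  layer           = "waypoints",                  # GPX layer for point features
  driver          = "GPX",
  dataset_options = "GPX_USE_EXTENSIONS=YES"      # keep non-standard columns as extensions
)
```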

poke me when you want me to do a full review pls.

In the meantime, a few brief thoughts - it would be good to hear from @nsbatra and @pbkeating too:

  • do we need to also demonstrate sample size calculations with the packages mentioned, or others?
    • is typing the calculations out too scary for low-level users? And will those users be doing sample size calculations anyway?
  • for spatial sampling, should we stick to pulling buildings from OpenStreetMap, with {osmdata}?

@pbkeating
Copy link

pbkeating commented Apr 18, 2022 via email

@AlexandreBlake
Copy link
Contributor Author

@pbkeating Hi Pat,

The formulae are there for SRS using a categorical or continuous primary variable of interest, and for the design effect/multistage flavours as well. Which one do you have in mind specifically?
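
For reference, a quick sketch of those formulas with placeholder numbers (not the chapter's actual code):

```r
z <- qnorm(1 - 0.05 / 2)      # alpha = 0.05

# categorical primary variable: estimate a proportion p with absolute precision d
p <- 0.30
d <- 0.05
n_prop <- ceiling(z^2 * p * (1 - p) / d^2)

# continuous primary variable: estimate a mean, given standard deviation sigma and precision d
sigma  <- 12
d_mean <- 2
n_mean <- ceiling(z^2 * sigma^2 / d_mean^2)

# multistage / cluster flavour: inflate by a design effect (e.g. DEFF = 2)
deff      <- 2
n_cluster <- ceiling(n_prop * deff)
```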

@AlexandreBlake
Copy link
Contributor Author

AlexandreBlake commented Apr 18, 2022

@aspina7 @pbkeating

  • I dropped the part that sampled points using the boundaries of the urban clusters and instead turned it into sampling points from the building features of the OSM data for the area. One thing though: it means that we only show how to sample from the pre-existing points we pull from OSM. We do not show the strategy that requires the painful cleaning phase. I describe it and the associated limitations, but that's it. I can still add that chunk back from the previous commit if needed, though.
  • The export into KML and GPX has been added.
  • I will make a last round with fresh eyes in the coming days and ping you, @aspina7, when I am done so that you can go through it.

I would assume that creating a figure for the 1st graph illustrating the principle of sampling would be preferable to showing a pre-existing one (with the proper reference)?

Switching the spatial sampling to the case where we pull building gps positions from OSM
@pbkeating
Copy link

pbkeating commented Apr 18, 2022 via email

@aspina7
Copy link
Contributor

aspina7 commented Apr 18, 2022

I think we should keep them!

@pbkeating yep - agree, keep the formulas. But should we also show an example of how to get to the same numbers using package functions (if available)? This would be the same as in other pages where we have demonstrated how to do the same thing in base R and also in {dplyr}, for example.

We do not show the strategy that requires the painful cleaning phase. I describe it and the associated limitations, but that's it. I can still add that chunk back from the previous commit if needed, though.

@AlexandreBlake when you poke me to review, can you point me to where the difference is? Usually it's worth showing the painful method too... but I can just add it back in when reviewing if you link me to it.

I would assume that creating a figure for the 1st graph illustrating the principle of sampling would be preferable to showing a pre-existing one (with the proper reference)?

If it's easy to make a figure, go for it - if not, we have previously just made sure the authors of the figure are okay with us reproducing it here and then referenced accordingly. Both are fine... e.g. here I took an example from a textbook.

@pbkeating
Copy link

pbkeating commented Apr 18, 2022 via email

@AlexandreBlake
Copy link
Contributor Author

AlexandreBlake commented Apr 18, 2022

@aspina7

@AlexandreBlake when you poke me to review, can you point me to where the difference is? Usually it's worth showing the painful method too... but I can just add it back in when reviewing if you link me to it.

I would agree, but I doubt we can show the cleaning procedure in R if we want to keep it simple. That would imply only showing the strategy to draw points from a polygon and keeping the description of the pain associated with the cleaning, but without actually doing the cleaning. Would that still be fine?

Yes, if there are packages that can do the job, then it is good to have alternatives. Showing the formula allows people to get a fuller understanding of the process and takes away the black-box feeling of using, for example, the weight function in the sitrep templates.

@pbkeating I will have a quick look at those packages and add something if their use is not in the "headache realm".

@aspina7
Copy link
Contributor

aspina7 commented Apr 18, 2022

Would that still be fine?

Ah, gotcha, sorry - I have read through your explanation now. You are right, without having polygons for buildings, cleaning would be excruciating (otherwise using intersection and buffer would only be a few lines of code).
But we could demonstrate a very basic example with interactive packages?
E.g. draw a polygon, sample points from there, remove the points that are not over a house (according to the basemap tiles), and then sample more points (the n removed)? Using, for example, {mapedit} based on {leaflet} (a rough sketch follows below).
cc @AmyMikhail who will be using the code from this page to update EPIET case studies
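
A rough sketch of that workflow (not the chapter's final code; object names and the sample size are placeholders):

```r
library(sf)
library(mapedit)
library(mapview)

# 1. draw the sampling area interactively, then sample points inside it
area <- drawFeatures()
pts  <- st_sf(geometry = st_sample(area, size = 30))

# 2. draw polygons around the points that are not over a house, using the
#    sampled points as the background layer, then drop those points
bad  <- drawFeatures(mapview(pts)@map)
keep <- pts[lengths(st_intersects(pts, bad)) == 0, ]

# 3. top the sample back up with replacements for the discarded points
#    (repeat steps 2-3 until enough points fall on houses)
extra     <- st_sf(geometry = st_sample(area, size = nrow(pts) - nrow(keep)))
pts_final <- rbind(keep, extra)
```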

@AlexandreBlake
Copy link
Contributor Author

@aspina7 Will have a stab at it then. Will keep you posted.

Addition of chunks to draw and clean points in GIS sampling using mapedit
Cleaning bits of the text/code
@AlexandreBlake
Copy link
Contributor Author

@aspina7

  • Done with the interactive cleaning part and the GPX/KML export.
  • {mapedit} is pretty good at interacting with polygons or lines, but not with points (unless I missed it when I played with it and went through their pages). So the cleaning is a bit cumbersome, with people drawing polygons around the points to remove, but it illustrates the "cleaning" procedure.
  • I added sample size calculation using a package for the simple cases (SRS with a continuous or categorical primary variable of interest).
  • I added a simple home-made schematic at the beginning of the page.
  • I would need you to have an overall look; I struggle to spot where I left chunks that became a bit redundant now.

@aspina7
Copy link
Contributor

aspina7 commented Jun 3, 2022

Sorry @AlexandreBlake - super delayed in getting back to you! Have been under the books again. The chapter looks really good and I'm using bits of it to update an EPIET case study for next week. Will hopefully be able to do a full review soon... sorry again! And thanks so much for all the work!

@AlexandreBlake
Copy link
Contributor Author

@aspina7 No worries! Busy time here too. I will be hard to reach next week, but we can chat/exchange about it whenever after that.

@AlexandreBlake
Copy link
Contributor Author

@aspina7 No pressure, but is there anything you think needs more polish?

@aspina7
Copy link
Contributor

aspina7 commented Jul 1, 2022

Sorry @AlexandreBlake - offline in Sicily at the moment. But I will do a review and push directly after the summer.

@aspina7
Copy link
Contributor

aspina7 commented Feb 12, 2023

@AlexandreBlake I am so, so sorry! Life took a sharp left turn and I have been under a deep rock. Reviewing this now and will make edits directly and hopefully get the book to knit; I will push directly to this branch once I finish reviewing.

@nsbatra can we get Alex an invoice for this please, and @AlexandreBlake can you send us your bank details so we can get you paid ASAP.

really sorry again!

Note to myself: I am at "performing SRS".

@AlexandreBlake
Copy link
Contributor Author

@aspina7 No worries, life is stochastic on my side as well, and it looks like you have been busy. I can help with some tweaks on the chapter if needed, but I am on the final stretch to defend (a long stretch of a couple of months), so I cannot promise super regular help.

@aspina7
Copy link
Contributor

aspina7 commented Feb 14, 2023

It's all good - I should be able to pull it together - good luck with the final PhD push!

Note to self: mention {osmextract}.

@aspina7
Copy link
Contributor

aspina7 commented Apr 11, 2023

Balls - {maptools} is being retired, will need to shift to {mapedit} or {terra}.
I am about three quarters of the way through reviewing... hopefully done before summer!

@AmyMikhail
Copy link

Hi @AlexandreBlake and all,

Not sure if the above discussion regarding whether to include examples of cleaning the data was resolved or not... but in case it's useful, I just thought to link to a recent discussion initiated by @pbkeating about adding pulling OSM building data to the EPIET case study on spatial sampling.

I proposed some code there which very simply uses st_intersection() to remove points that fall outside a polygon, and I was wondering if the same principle could not be used for the building outlines that you can see on OpenStreetMap, since the data you get from that includes the building polygons?

For removing points that are in locations incorrectly identified as a building, or which turn out not to be a building with the function of interest (i.e. residential) in real life, I think this would require a more interactive approach - for instance doing a validation survey with survey software that collects GPS coordinates. That then just leaves the trickier question of how to convert the 'not a building' coordinates into a polygon that could be used for excluding them with st_intersection() again. Epicentre used an algorithm to create polygons from coordinates based on their proximity to each other - but I don't know if this was using existing packages/functions or something they developed from scratch. I'm curious to know if you already had a way to do that, @AlexandreBlake?
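
For illustration, that first pattern in miniature (`sampled_pts` and `area_poly` are placeholder sf objects):

```r
library(sf)

# keep only the sampled points that fall inside the polygon
pts_inside <- st_intersection(sampled_pts, area_poly)

# same idea, but without adding the polygon's attribute columns to the points
pts_inside2 <- st_filter(sampled_pts, area_poly)
```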

@AmyMikhail
Copy link

Also one thing I just noticed, @AlexandreBlake:

The osm_points for buildings are actually the points that make up the polygon representing the shape of the building on OpenStreetMap - I didn't check, but I assume it is one set of coordinates for each corner.

For the edit to the EPIET case study that I mentioned above, I extracted osm_polygons instead and used st_centroid() to get a single pair of coordinates for each building. In the Kario camp example that is used in the case study, this seems to match up to the roofs well (although, needless to say, sampling from those won't make any adjustments for roof size).
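
A rough sketch of that approach, assuming {osmdata} and a placeholder bounding box (not the case study's actual code):

```r
library(osmdata)
library(sf)
library(dplyr)

# pull building footprints for the area of interest
buildings <- opq(bbox = c(30.00, -3.50, 30.10, -3.40)) %>%   # xmin, ymin, xmax, ymax (placeholder)
  add_osm_feature(key = "building") %>%
  osmdata_sf()

# one point per building: centroid of each footprint polygon
building_pts <- st_centroid(buildings$osm_polygons)

# simple random sample of buildings
sampled_buildings <- slice_sample(building_pts, n = 30)
```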

@AlexandreBlake
Copy link
Contributor Author

Hi @AmyMikhail,
Thanks for sharing this. The chapter includes a basic interactive example to give a feel for how cumbersome it is. The Epicentre algorithm is a complex variation (now included in a fancy Shiny app that pulls plenty of available data/tiles) on the chunk of code shared here.

What we found at the time (definitely not the only option though) was that it was pretty fast to just "clean as you draw points", by visualizing the points with a given buffer on recent imagery as you draw. The cleaning was then a simple click: either keep, so the point is saved and you keep drawing until you reach your desired sample size, or drop, and you redraw the point. By just deciding with y/n or Enter/Space on the keyboard, the cleaning/drawing ends up being pretty fast even for sample sizes of several hundred points. It was still a bit tedious, but way faster than the iterative process of draw points/check/clean and repeat (assuming that you would also draw some reserve points as a buffer for the unavoidable points falling on buildings that do not qualify for your survey). The current version used in Epicentre might rely on a slightly different set of packages than the chunk I shared with Patrick, but last time I chatted with Serge it boiled down to the same functionalities overall in the Geosampler (the name of their Shiny app).

There used to be two main "schools of thought" in Epicentre a couple of years ago: use a very large polygon to draw points but do extensive cleaning of the points, vs spend a lot of time excluding as much "empty space" as possible from the polygon (assuming you have reasonably recent imagery to do it, and still keeping some margin) but save a lot of time cleaning points. Then a third option showed up when there was no time constraint: paying people to put a point on every damn roof and using that to draw points directly. I am not sure of the current state of affairs.

The validation surveys you mention have been used in large surveys in Nigeria, with field logistics so extensive that checking a given area a couple of days before data collection was possible. But it meant redrawing a lot of points frequently, and it was tedious.

The "interactive cleaning" as you draw points used to be the "best" option we used (again, it might be different now). That was partly because we could not confidently rely on the building tag in OSM data in most of the settings we did surveys in (too remote, or with large changes over a short period that were not captured in OSM).

@aspina7
Copy link
Contributor

aspina7 commented Nov 14, 2023

Ach... lost all my changes because I did not push. Will restart... sorry!
