add Module2_Unit4 content

callysto · Mar 19, 2024 · 5907f1b
1 parent 4ece5aa
Showing 2 changed files with 160 additions and 0 deletions.
diff --git a/Module_2/Module2_Unit4.ipynb b/Module_2/Module2_Unit4.ipynb
@@ -0,0 +1,160 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "57f03a22-aedd-4fdd-ae79-af75722a3dd0",
+   "metadata": {},
+   "source": [
+    "![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a711085-097f-460b-b501-b591f2bbc416",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Module 2 Unit 4 - Data Quality\n",
+    "\n",
+    "### Apples to apples\n",
+    "\n",
+    "![apples-to-apples](../_images/Module2-Unit4-image.jpeg)\n",
+    "\n",
+    "Let's imagine we wanted to know how much physical activity students were getting outside of school hours. At the end of the month, we are faced with a pile of reports, but as we begin to go through them, we realize we have some problems.\n",
+    "\n",
+    "Many are handwritten and practically illegible. Some contain long-form sentences about what the student did each day; others are just lists of activities with numbers of hours. Records for some days are missing or appear to have been hastily entered all at one time. Some were recorded by the students themselves, others by a parent or guardian. Spelling mistakes are rampant.\n",
+    "\n",
+    "This data is unstructured, messy, inconsistent, and full of errors and missing values. In this situation, we're not comparing apples to apples, but apples to oranges to kiwis to coconuts. And some of them are rotten.\n",
+    "\n",
+    "Low-quality data sets can take a long time to clean before they yield useful insights, and in some cases the effort isn't worthwhile. Let's explore some aspects of data sets that affect their quality and make them practical for data science projects.\n",
+    "\n",
+    "### 📚 Read\n",
+    ">[Reuser's Guide to Open Data Licensing](https://theodi.org/article/reusers-guide-to-open-data-licensing/)\n",
+    "\n",
+    "One way to find data sets is through a simple internet search. Google has created a specialized search tool for this very purpose: [Google Dataset Search](https://datasetsearch.research.google.com/).\n",
+    "\n",
+    "Let's take a closer look at some different types of data sources.\n",
+    "\n",
+    "### Private data sources\n",
+    "\n",
+    "Companies that provide online services, such as Alphabet (Google), Amazon, Apple, Twitter, and Meta, collect vast amounts of data about users and their online activities.\n",
+    "\n",
+    "These companies use the data to improve their services and develop new products, but also sell access to much of this data to other organizations for advertising, research, and marketing purposes.\n",
+    "\n",
+    "Data from private companies is often not freely available for educational purposes; however, there are exceptions. The social media platform Twitter allows anyone to search and download posts made in the last week, and provides a variety of filters to help tailor the results to their needs.\n",
+    "\n",
+    "That said, people pulling this data must use an application programming interface (API) and are permitted access to only a very limited subset. Even so, this free access is noteworthy and is used extensively by social science researchers, such as those at the Social Media Lab, a research laboratory at Toronto Metropolitan University (formerly Ryerson University).\n",
+    "\n",
+    "A variety of free tools can help people who want to use social media for research to access and download data from Twitter and other platforms.\n",
+    "\n",
+    "### 📚 Read\n",
+    ">[Social media data in research: a review of the current landscape.](https://ocean.sagepub.com/blog/social-media-data-in-research-a-review-of-the-current-landscape) This short 2019 article by Lily Davies, a Digital Humanities master's student at UCL, summarizes some of the tools used to scrape data from social media platforms.\n",
+    "\n",
+    "\n",
+    "### Government data sources\n",
+    "\n",
+    "Governments are increasingly making an effort to provide public access to data they have collected. \n",
+    "\n",
+    "Data that is freely available to be used, shared, and built on is referred to as **open data.** In many cases, this data is also structured to be machine readable and is accompanied by documentation about its format and by metadata describing how the data was collected and how it is intended to be used.\n",
+    "\n",
+    "The Government of Canada, Statistics Canada, the provincial and territorial governments, and even many municipalities have open data portals where anyone can find data sets created as part of government projects.\n",
+    "\n",
+    "**Explore**\n",
+    ">[Open Government Programs in Canada](https://open.canada.ca/en/maps/open-data-canada#toc1) is an interactive map of the various open data portals around the country.\n",
+    "\n",
+    "**Explore**\n",
+    ">[Major Smart Cities with Open Data](https://rlist.io/l/major-smart-cities-with-open-data-portals) is a list of cities around the world with open data portals. \n",
+    "\n",
+    "### Academic data sources\n",
+    "\n",
+    "Post-secondary institutions generate a lot of valuable research data and, thanks to the UNESCO Recommendation on Open Science, are increasingly making their data sets available to the public in formats that allow them to be explored, shared, and expanded upon. These efforts include:\n",
+    "\n",
+    "* [OpenDOAR](http://v2.sherpa.ac.uk/opendoar/), a global directory of open access repositories.\n",
+    "* [Re3Data](https://www.re3data.org/), an online registry of research data repositories.\n",
+    "* [Figshare](https://figshare.com/), another repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner.\n",
+    "* [Dryad](https://datadryad.org/stash), a community-owned and curated research data resource.\n",
+    "\n",
+    "### Non-profit data sources\n",
+    "\n",
+    "Rich data sets are also made available by other sources including non-profit organizations. Some of the non-profit organizations sharing open data sets include:\n",
+    "\n",
+    "* [Gapminder](https://www.gapminder.org/)\n",
+    "* [Billion Prices Project](http://www.thebillionpricesproject.com/)\n",
+    "* [Pew Research Center](https://www.pewresearch.org/download-datasets/)\n",
+    "* [The World Bank](https://data.worldbank.org/)\n",
+    "* [The United Nations](https://data.un.org/)\n",
+    "* [UNESCO Open Data](https://opendata.unesco.org/)\n",
+    "* [UNESCO Core Data Portal](https://core.unesco.org/)\n",
+    "* [The World Health Organization Global Health Observatory](https://www.who.int/data/gho)\n",
+    "\n",
+    "Overall, the challenging part is often finding the relevant data source, which is what makes data set aggregators like Google Dataset Search so valuable.\n",
+    "\n",
+    "### Generating our own data sets\n",
+    "\n",
+    "As previously mentioned, reusing existing data sets can often be faster and easier than creating our own. However, some methods of data collection are reasonable for use in a classroom setting.\n",
+    "\n",
+    "Web scraping involves using automated tools to gather information from webpages and convert it into a format that is convenient for data analysis. \n",
+    "\n",
+    "For example, we could use this method to gather data related to NHL hockey teams and individual player performance records.\n",
+    "\n",
+    "![hockey](../_images/Module2-Unit3-image4.jpeg)\n",
+    "\n",
+    "Scraping live website data can be technically challenging, so we won't be exploring these methods in this course. However, for teachers and students who are interested in learning how to do this on their own, the Codecademy article linked below provides more information.\n",
+    "\n",
+    "### 📚 Read (Optional)\n",
+    ">[Web Scraping MLB Stats with Python and Beautiful Soup](https://news.codecademy.com/web-scraping-python-beautiful-soup-mlb-stats/)\n",
+    "\n",
+    "\n",
+    "\n",
+    "### 🏁 Activity\n",
+    "\n",
+    "* What’s your favourite hobby? Can you find a data set associated with it (preferably an open data set)?\n",
+    "\n",
+    "*Hint: Try [Google Dataset Search](https://datasetsearch.research.google.com/)*\n",
+    "\n",
+    "* Or, based on your location, can you find the nearest [government](https://open.canada.ca/en/maps/open-data-canada#toc1) open data portal that’s relevant to you? Within that data repository, can you find a data set that interests you?\n",
+    "\n",
+    "\n",
+    "### Conclusion\n",
+    "\n",
+    "In this unit, we learned about some of the resources teachers and students can use to access data, and about different types of use licenses.\n",
+    "\n",
+    "Open data portals let us explore real data that is relevant to our lives and more interesting than outdated or made-up examples.\n",
+    "\n",
+    "However, there is so much out there that it can be hard to choose a data set for use in the classroom.\n",
+    "\n",
+    "In the next unit, we'll dive deeper into what makes a data set good for classroom analysis and data science in general."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d55e408c-db2a-42bb-9426-fd5cf38946a4",
+   "metadata": {},
+   "source": [
+    "[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/_images/Module2-Unit4-image.jpeg b/_images/Module2-Unit4-image.jpeg