Add Socrata Data Nodes #306
base: main
Changes from 1 commit
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -35,3 +35,4 @@ dependencies: | |
- pointpats=2.3.0 | ||
- pip: | ||
- ipinfo==4.4.3 | ||
- sodapy==2.2.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -686,3 +686,160 @@ def execute(self, exec_context: knext.ExecutionContext): | |
gdf = get_osmnx().geocoder.geocode_to_gdf(self.placename) | ||
gdf = gdf.reset_index(drop=True) | ||
return knext.Table.from_pandas(gdf) | ||
|
||
|
||
############################################ | ||
# Socrata Search | ||
############################################ | ||
@knext.node( | ||
name="Socrata Search", | ||
node_type=knext.NodeType.SOURCE, | ||
icon_path=__NODE_ICON_PATH + "Socrata Search.png", | ||
category=__category, | ||
after="", | ||
) | ||
@knext.output_table( | ||
name="Socrata dataset list", | ||
description="Socrata dataset based on search keywords", | ||
) | ||
class SocrataSearchNode: | ||
"""Retrive the open data category via Socrata API. | ||
Review comment: Retrieve. Please search for other occurrences and fix them as well.
Reply: revised
||
|
||
The Socrata Open Data API (SODA) is a powerful tool designed for programmatically accessing a vast array of open data resources from various organizations around the world, including governments, non-profits,and NGOs.. | ||
This node uses the [SODA Consumer API](https://dev.socrata.com/consumers/getting-started.html) to get the dataset list. | ||
""" | ||
|
||
queryitem = knext.StringParameter( | ||
label="Input searching item", | ||
description="""Enter search keywords or dataset names to find relevant datasets in the Socrata database. | ||
Review comment: Add a more comprehensive description of what is possible here, e.g. complex queries.
Reply: https://dev.socrata.com/docs/filtering
||
This search is not case-sensitive and can include multiple words separated by spaces. """, | ||
default_value="Massachusetts", | ||
) | ||
|
||
def configure(self, configure_context): | ||
# TODO Create combined schema | ||
return None | ||
|
||
def execute(self, exec_context: knext.ExecutionContext): | ||
from urllib.request import Request, urlopen | ||
import pandas as pd | ||
import json | ||
from pandas import json_normalize | ||
|
||
query_item = self.queryitem | ||
request = Request( | ||
f"http://api.us.socrata.com/api/catalog/v1?q={query_item}&only=datasets&limit=10000" | ||
Review comment: Needs URL encoding, e.g. entering two search strings separated by a space throws an exception.
Reply: encoded_query_item = quote(query_item)
||
) | ||
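The URL-encoding concern raised in the review comment above can be illustrated with a minimal sketch. It assumes the same catalog endpoint and query parameters that appear in the diff; the helper name `build_catalog_url` is hypothetical and not part of the PR. It uses `urllib.parse.urlencode` with `quote_via=quote`, so spaces become `%20` and reserved characters such as `&` cannot break the query string.

```python
from urllib.parse import quote, urlencode

# Endpoint copied from the diff under review.
CATALOG_URL = "http://api.us.socrata.com/api/catalog/v1"


def build_catalog_url(query_item: str, limit: int = 10000) -> str:
    """Hypothetical helper: build the catalog search URL with an encoded query."""
    params = {"q": query_item, "only": "datasets", "limit": limit}
    # quote_via=quote percent-encodes spaces as %20 instead of '+'.
    return f"{CATALOG_URL}?{urlencode(params, quote_via=quote)}"


url = build_catalog_url("fire data")
print(url)
```

Building the whole query string with `urlencode` rather than quoting a single value keeps every parameter encoded consistently if more filters are added later.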
|
||
response = urlopen(request) | ||
response_body = response.read() | ||
|
||
# Load the JSON response into a Python dictionary | ||
data = json.loads(response_body) | ||
|
||
# Extract the "results" key, which contains the dataset information | ||
dataset_info = data["results"] | ||
|
||
# Create a DataFrame from the dataset information, and flatten the nested dictionaries | ||
df = json_normalize(dataset_info) | ||
df = df.drop( | ||
columns=["classification.domain_tags", "classification.domain_metadata"] | ||
Review comment: Make this code more resilient, since the columns are not always present. For example, searching for all_utah_fire_data_long_lat_2018_carto makes the node throw this exception: Execute failed: "['classification.domain_tags', 'classification.domain_metadata'] not found in axis"
Reply: columns_to_drop = ["classification.domain_tags", "classification.domain_metadata"]
||
) | ||
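One way to address the missing-column failure described in the review comment above is to let pandas skip labels that are absent. The following is a sketch only: the helper name `drop_if_present` and the sample frame are made up for illustration, and the PR author may have chosen a different fix.

```python
import pandas as pd


def drop_if_present(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: drop the given columns, ignoring any that are missing."""
    # errors="ignore" suppresses the KeyError for labels not found in the axis.
    return df.drop(columns=columns, errors="ignore")


# Illustrative frame that contains only one of the two columns to drop.
df = pd.DataFrame(
    {"resource.id": ["q22t-rbk9"], "classification.domain_tags": [["health"]]}
)
columns_to_drop = ["classification.domain_tags", "classification.domain_metadata"]
cleaned = drop_if_present(df, columns_to_drop)
print(list(cleaned.columns))
```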
|
||
# Find List | ||
list_columns = [ | ||
col for col in df.columns if any(isinstance(item, list) for item in df[col]) | ||
] | ||
|
||
# Drop error list column | ||
for col in list_columns: | ||
try: | ||
df[col] = df[col].apply( | ||
lambda x: ", ".join(x) if isinstance(x, list) else x | ||
) | ||
except Exception as e: | ||
df.drop(columns=[col], inplace=True) | ||
|
||
# Drop columns that cannot be saved in KNIME | ||
drop_columns = [] | ||
for col in df.columns: | ||
try: | ||
# Attempt to convert the column to a KNIME-compatible data type | ||
knime_table = knext.Table.from_pandas(df[[col]]) | ||
except Exception as e: | ||
# If an exception is raised, add the column to the list of columns to drop | ||
drop_columns.append(col) | ||
|
||
# Drop the columns that cannot be saved in KNIME | ||
df.drop(columns=drop_columns, inplace=True) | ||
df.replace("?", pd.NA, inplace=True) | ||
df.replace("", pd.NA, inplace=True) | ||
df.dropna(axis=1, how="all", inplace=True) | ||
df = df.reset_index(drop=True) | ||
return knext.Table.from_pandas(df) | ||
|
||
|
||
############################################ | ||
# Socrata Data Query | ||
############################################ | ||
@knext.node( | ||
name="Socrata Data Query", | ||
node_type=knext.NodeType.SOURCE, | ||
icon_path=__NODE_ICON_PATH + "Socrata Data Query.png", | ||
category=__category, | ||
after="", | ||
) | ||
@knext.output_table( | ||
name="Socrata dataset", | ||
description="Socrata dataset based on search keywords", | ||
) | ||
class SocrataDataNode: | ||
"""Retrive the open data category via Socrata API. | ||
Review comment: Can you please rewrite the node description to mention first what kind of data can be retrieved, instead of leading with the technology used, which most users won't be interested in?
Reply: Access open datasets from various well-known data resources and organizations effortlessly using the SODA interface.
|
||
|
||
The Socrata Open Data API (SODA) is a powerful tool designed for programmatically accessing a vast array of open data resources from various organizations around the world, including governments, non-profits,and NGOs.. | ||
This node uses the [SODA Consumer API](https://dev.socrata.com/consumers/getting-started.html) to get the dataset from a dataset list generated by Socrata Search Node. | ||
|
||
For instance, this dataset [Incidence Rate Of Breast Cancer](https://opendata.utah.gov/Health/Incidence-Rate-Of-Breast-Cancer-Per-100-000-All-St/q22t-rbk9) has a resource_id of "q22t-rbk9" and a metadata domain of "opendata.utah.gov". | ||
They can be found in the link under API,"https://opendata.utah.gov/resource/q22t-rbk9.json". Both the two items will be used for data retriving. | ||
""" | ||
|
||
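The node description above says that the resource ID and metadata domain can both be read off the API link of a dataset. A small sketch of that extraction, using only the standard library, might look as follows. The function name `split_resource_url` is hypothetical, and the sketch assumes the link has the `https://<domain>/resource/<id>.json` shape shown in the description.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def split_resource_url(api_url: str) -> tuple:
    """Hypothetical helper: return (metadata_domain, resource_id) from an API link."""
    parsed = urlparse(api_url)
    # The host part is the metadata domain.
    domain = parsed.netloc
    # The last path segment without its ".json" suffix is the resource ID.
    resource_id = PurePosixPath(parsed.path).stem
    return domain, resource_id


domain, resource_id = split_resource_url(
    "https://opendata.utah.gov/resource/q22t-rbk9.json"
)
print(domain, resource_id)
```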
metadata_domain = knext.StringParameter( | ||
label="Metadata domain", | ||
description="""The value in the column metadata.domain of a table generated by a Socrata Search node. """, | ||
default_value="", | ||
) | ||
|
||
resource_id = knext.StringParameter( | ||
label="Resource ID", | ||
description="""The value in the column resource.id of a table generated by a Socrata Search node. """, | ||
default_value="", | ||
) | ||
|
||
def configure(self, configure_context): | ||
# TODO Create combined schema | ||
return None | ||
|
||
def execute(self, exec_context: knext.ExecutionContext): | ||
import pandas as pd | ||
import json | ||
import pandas as pd | ||
from sodapy import Socrata | ||
|
||
# Unauthenticated client only works with public data sets. Note 'None' | ||
# in place of application token, and no username or password: | ||
client = Socrata(self.metadata_domain, None) | ||
|
||
# Example authenticated client (needed for non-public datasets): | ||
# client = Socrata(data.cdc.gov, | ||
# MyAppToken, | ||
# username="user@example.com", | ||
# password="AFakePassword") | ||
|
||
# First 2000 results, returned as JSON from API / converted to Python list of | ||
# dictionaries by sodapy. | ||
results = client.get(self.resource_id, limit=100000) | ||
Review comment: What does the node do if the data has more than 100k rows? Can we use paging to loop through the whole result with progress and cancellation support?
Reply: This is only for querying the dataset list, not the data, so it might not be necessary. The doc below mentions that version 2.1 will allow unlimited results, but I haven't found a way to use API 2.1.
Reply: I added paging here.
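The paging approach requested in the thread above can be sketched generically with limit/offset requests. This is not the code the author added; it is an illustration under stated assumptions. The names `fetch_all_rows`, `fetch_page`, `on_page`, and `is_canceled` are hypothetical. In the node, `fetch_page` could wrap a call such as `client.get(self.resource_id, limit=limit, offset=offset)`, and the two callbacks are placeholders for whatever progress and cancellation hooks the execution context provides. The demo uses an in-memory list so that it runs without network access.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def fetch_all_rows(
    fetch_page: Callable[[int, int], List[Row]],
    page_size: int = 50000,
    on_page: Optional[Callable[[int], None]] = None,
    is_canceled: Optional[Callable[[], bool]] = None,
) -> List[Row]:
    """Collect all rows by requesting pages until a short page is returned."""
    rows: List[Row] = []
    offset = 0
    while True:
        # Check for cancellation before each request (hypothetical hook).
        if is_canceled is not None and is_canceled():
            raise RuntimeError("Execution canceled")
        page = fetch_page(page_size, offset)
        rows.extend(page)
        # Report how many rows have been fetched so far (hypothetical hook).
        if on_page is not None:
            on_page(len(rows))
        # A page shorter than page_size means the result set is exhausted.
        if len(page) < page_size:
            break
        offset += page_size
    return rows


# Demo with fake data standing in for the remote dataset.
fake_data = [{"id": i} for i in range(25)]


def fake_fetch(limit: int, offset: int) -> List[Row]:
    return fake_data[offset : offset + limit]


progress_log: List[int] = []
all_rows = fetch_all_rows(fake_fetch, page_size=10, on_page=progress_log.append)
print(len(all_rows), progress_log)
```

When paging against a live dataset, requesting a stable sort order is advisable so that rows do not shift between pages.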
||
|
||
# Convert to pandas DataFrame | ||
results_df = pd.DataFrame.from_records(results) | ||
|
||
return knext.Table.from_pandas(results_df) |
Review comment: Can you please rewrite the node description to mention first what kind of data can be retrieved, instead of leading with the technology used, which most users won't be interested in?
Reply: Socrata dataset list from a wealth of open data resources from governments, non-profits, and NGOs around the world based on the query term.