Models and external data of the 3rd-place efficiency solution for the https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data competition.
See https://www.kaggle.com/code/devinanzelmo/piidd-efficiency-3rd-process-external-data for links to external data and processing code.
See https://www.kaggle.com/code/devinanzelmo/piidd-efficiency-3rd-train for the training code that generated the models.
See https://www.kaggle.com/code/devinanzelmo/piidd-efficiency-3rd-inference for the inference code.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset globally (excluding frigid/polar zones) quantifies the different facets of variability in surface soil (0–30 cm) salinity and sodicity for the period between 1980 and 2018. This is realised by developing 4-D predictive models of Electrical Conductivity of saturated soil Extract (ECe) and soil Exchangeable Sodium Percentage (ESP) as indicators of soil salinity and sodicity. These machine learning-based models make predictions for ECe and ESP at different times, locations, and depths; by extracting meaningful statistics from those predictions, the different facets of variability in surface soil salinity and sodicity are quantified. The dataset includes 10 maps documenting different aspects of soil salinity and sodicity variations, plus the auxiliary data required to generate those maps. Users are referred to the corresponding "READ_ME" file for more information about this dataset.
Development plan “Field Path No. 129 — Construction Line” of the city of Großbottwar, transformed in accordance with INSPIRE and based on an XPlanung dataset in version 5.0.
Data from the "Resistance Against Manipulative AI: key factors and possible actions" article
"Coal Fields of the Conterminous United States" is a digital representation of James Trumbull's "Coal Fields of the United States" (sheet 1, 1960), which is an adaptation of previous maps by Averitt (1942) and Campbell(1908). It is intended to be the first in a series of open file reports that will eventually result in an I-series map that conforms to the U.S. Geological Survey mapping standards. For this edition, coal boundaries were digitized from Trumbull and plotted to represent as closely as possible the original map. In addition, the Gulf Province was updated using generalized boundaries of coal bearing formations digitized from various state geological maps.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains soil analysis data with properties such as pH, organic matter (OM), and salinity (EC); major elements (N, P, K, Mg); and some microelements (Fe, Zn, Mn, Cu, B) with a significant impact on plant nutrition.
Agricultural Soil
Panagiotis Tziachris
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IFHEADS01 - Family Units. Published by the Central Statistics Office. Available under the license Creative Commons Attribution 4.0 (CC-BY-4.0).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Notes: As of June 2020 this dataset has been static for several years. Recent versions of NHD High Res may be more detailed than this dataset for some areas, while this dataset may still be more detailed than NHD High Res in other areas. This dataset is considered authoritative as used by CDFW for particular tracking purposes but may not be current or comprehensive for all streams in the state.
National Hydrography Dataset (NHD) high resolution NHDFlowline features for California were originally dissolved on common GNIS_ID or StreamLevel* attributes and routed from mouth to headwater in meters. The results are measured polyline features representing entire streams. Routes on these streams are measured upstream, i.e., the measure at the mouth of a stream is zero and at the upstream end the measure matches the total length of the stream feature. Using GIS tools, a user of this dataset can retrieve the distance in meters upstream from the mouth at any point along a stream feature.** CA_Streams_v3 Update Notes: This version includes over 200 stream modifications and additions resulting from update requests by CDFW staff and others***. New locator fields from the USGS Watershed Boundary Dataset (WBD) have been added for v3 to enhance users' ability to search for or extract subsets of California Streams by hydrologic area. *See the Source Citation section of this metadata for further information on NHD, WBD, NHDFlowline, GNIS_ID and StreamLevel. **See the Data Quality section of this metadata for further explanation of stream feature development. ***Some current NHD data has not yet been included in CA_Streams; the effort to synchronize CA_Streams with NHD is ongoing.
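The upstream-measure lookup described above can be reproduced with common open-source tools. Below is a minimal sketch using GeoPandas and Shapely, assuming the streams layer has been exported to a file in a meter-based projected coordinate system; the file name, feature selection, and coordinates are hypothetical, not part of the dataset.

    import geopandas as gpd
    from shapely.geometry import Point

    # Hypothetical export of the CA_Streams_v3 layer in a meter-based CRS.
    streams = gpd.read_file("CA_Streams_v3.shp")
    stream = streams.iloc[0].geometry  # one dissolved, routed stream feature

    # Hypothetical point of interest, in the same CRS as the layer.
    point = Point(123456.0, 654321.0)

    # project() returns the distance along the line to the position
    # nearest the point; on a mouth-to-headwater routed feature this
    # corresponds to the upstream measure in meters.
    measure_m = stream.project(point)
    print(f"Distance upstream from the mouth: {measure_m:.1f} m")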
The table Vital sign data 2012 is part of the dataset Baltimore Vital Signs Data, available at https://redivis.com/datasets/bp7s-5nnxzmn8t. It contains 56 rows across 215 variables.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The General Services Administration's data.json harvest source. This file contains the metadata for GSA's public data listing shown on data.gov, as defined by the Project Open Data metadata schema.
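Per the Project Open Data metadata schema, a data.json file carries its datasets in a top-level "dataset" array. A minimal sketch of reading such a file, with a hypothetical URL standing in for the actual harvest source location:

    import json
    import urllib.request

    URL = "https://example.gov/data.json"  # hypothetical harvest source URL

    with urllib.request.urlopen(URL) as resp:
        catalog = json.load(resp)

    # Each entry in the "dataset" array describes one public dataset.
    for ds in catalog.get("dataset", []):
        print(ds.get("title"))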
The dataset represents a compilation of user interaction data generated by users who participated in the project's pilot activities in Patras, Greece. Data was generated by users in the SMARTBUY app and includes information about users, stores, product categories, professions, and events.
The dataset comprises the following data:
- users: user account data for the Patras pilot users
- occupation: all possible occupations that the pilot users could choose from
- stores: stores which participated in the Patras pilot
- sel_products_cat: products uploaded to the SMARTBUY platform by retailers
- events: geo-stamped and time-stamped descriptions of a user interaction event (for instance, "user_id 67 rated product_id 722 with rating 4 at location x1 at datetime y1", or "user_id 91 denoted product_id 78 as favorite at location x2 at datetime y2")
- event_types: all possible event types captured by the SMARTBUY platform ('Product searches', 'Product views', 'Featured product', 'Products near you views', 'Product photos browsed', 'Product ratings', 'Clicks on Read More button to read product reviews', 'Clicks on Open map button', 'Clicks on Send this info by email button', 'Products denoted as Favorite')
Privacy-sensitive information such as user names, retailer owner names, store names, and searched keywords has been anonymized.
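A minimal sketch of how the events and event_types tables might be joined for analysis; the file names and the shared key column are assumptions, not the dataset's documented layout:

    import pandas as pd

    events = pd.read_csv("events.csv")            # hypothetical file name
    event_types = pd.read_csv("event_types.csv")  # hypothetical file name

    # Resolve each event's type label via an assumed "event_type_id" key,
    # then count interactions per event type.
    labeled = events.merge(event_types, on="event_type_id", how="left")
    print(labeled.groupby("event_type").size().sort_values(ascending=False))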
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This code was tested with MATLAB R2015a on Ubuntu 14.04 and on Mac OS X 10.9.5.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It has never been easier to solve database-related problems with a SQL-style query language, and the following gives you an opportunity to see how I worked out some of the relationships within the data using the Panoply.io tool.
I was able to load the coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a data warehouse environment.
The following is a list of SQL queries performed on the dataset attached below, with the final output stored in the Exports folder.

Query 1

    SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed
    FROM "public"."coronavirus_updated"
    WHERE Recovered > (Deaths / 2) AND Deaths > 0

Description: How can we estimate where the coronavirus has infiltrated but patients are recovering effectively? We can view those places by selecting regions where the recovered count exceeds half the death toll.
Query 2

    SELECT country,
           SUM(confirmed) AS "Confirmed Count",
           SUM(Recovered) AS "Recovered Count",
           SUM(Deaths) AS "Death Toll"
    FROM "public"."coronavirus_updated"
    WHERE Recovered > (Deaths / 2) AND Confirmed > 0
    GROUP BY country

Description: This aggregates confirmed, recovered, and death counts per country, restricted to countries where the recovered count exceeds half the death toll.
Query 3

    SELECT country AS "Countries where Coronavirus has reached"
    FROM "public"."coronavirus_updated"
    WHERE confirmed > 0
    GROUP BY country

Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to stay safe is to know which countries have confirmed cases. Here is a list of those countries.
Query 4

    SELECT country,
           SUM(suspected) AS "Suspected Cases under potential CoronaVirus outbreak"
    FROM "public"."coronavirus_updated"
    WHERE suspected > 0 AND deaths = 0 AND confirmed = 0
    GROUP BY country
    ORDER BY SUM(suspected) DESC

Description: The coronavirus is spreading at an alarming rate. Knowing which countries are newly exposed is important, because timely measures taken there could prevent casualties. Here is a list of countries with suspected cases but no confirmed cases or virus-related deaths.
Query 5

    SELECT country,
           SUM(suspected) AS "Coronavirus uncontrolled spread count and human life loss",
           100 * SUM(suspected) / (SELECT SUM(suspected)
                                   FROM "public"."coronavirus_updated")
               AS "Global suspected Exposure of Coronavirus in percentage"
    FROM "public"."coronavirus_updated"
    WHERE suspected > 0 AND deaths = 0
    GROUP BY country
    ORDER BY SUM(suspected) DESC

Description: The coronavirus is gaining ground in particular countries, but how do we measure that? We can measure it as each country's share of suspected patients worldwide, among countries that do not yet have any coronavirus-related deaths. The following query produces that list.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This formatted dataset originates from raw data files from the Institute for Health Metrics and Evaluation Global Burden of Disease study (GBD2017). It is population-weighted worldwide data on male and female cohorts aged 15-69 years, including body mass index (BMI), cardiovascular disease (CVD), and associated dietary, metabolic, and other risk factors. The purpose of creating this formatted database is to explore the univariate and multiple regression correlations of BMI, CVD, and other health outcomes with risk factors. Our research hypothesis is that we can successfully apply artificial intelligence to model BMI and CVD risk factors and health outcomes. We derived a BMI multiple regression risk factor formula that satisfied all nine Bradford Hill causality criteria for epidemiology research. We found that animal products and added fats are negatively correlated with CVD early deaths worldwide, but positively correlated with CVD early deaths at high intakes. We interpret this as showing that optimal cardiovascular outcomes come with moderate (not low and not high) intakes of animal foods and added fats.
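A minimal sketch of the kind of multiple regression described above, using statsmodels; the file and column names are illustrative assumptions, not the authors' actual variables or their derived formula:

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("gbd2017_formatted.csv")  # hypothetical file name

    # Hypothetical risk-factor columns regressed against a CVD outcome.
    X = sm.add_constant(df[["animal_products", "added_fats", "smoking"]])
    y = df["cvd_early_deaths"]

    model = sm.OLS(y, X).fit()
    print(model.summary())  # coefficients, p-values, R-squared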
For questions, please email davidkcundiff@gmail.com. Thanks.
This dataset was created by MaXiaokai
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
This series of fifteen CDs was produced by JPL's Science Digital Data Preservation Task (SDDPT) by migrating the original Mariner Ten image EDRs from old, deteriorating magnetic tapes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples prevented a full and strong statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.
Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
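For readers who want to approximate this workflow outside Orange, here is a minimal scikit-learn sketch with synthetic stand-in data; scikit-learn offers entropy rather than gain ratio and has no majority-stop rule, so the match to the settings above is approximate:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    # Synthetic stand-in for the 36 standardized samples: 11 features,
    # 9 classes with 4 samples each (not the real measurements).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(36, 11))
    y = np.repeat(np.arange(9), 4)

    tree = DecisionTreeClassifier(criterion="entropy",  # approximates gain ratio
                                  min_samples_leaf=2,
                                  min_samples_split=5,
                                  random_state=0)

    # Stratified cross-validation, as in the study; each fold keeps
    # one sample of every class in the test split.
    cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    scores = cross_validate(tree, X, y, cv=cv,
                            scoring=["accuracy", "f1_macro", "roc_auc_ovr"])
    print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})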
MY NASA DATA (MND) is a tool that allows anyone to make use of satellite data that was previously unavailable. Through the use of MND's Live Access Server (LAS), a multitude of charts, plots and graphs can be generated using a wide variety of constraints. This site provides a large number of lesson plans on a wide variety of topics, all with students in mind. Not only can you use our lesson plans, you can use the LAS to improve the ones that you are currently implementing in your classroom.
This data set contains small-scale base GIS data layers compiled by the National Park Service Servicewide Inventory and Monitoring Program and Water Resources Division for use in a Baseline Water Quality Data Inventory and Analysis Report that was prepared for the park. The report presents the results of surface water quality data retrievals for the park from six of the United States Environmental Protection Agency's (EPA) national databases: (1) Storage and Retrieval (STORET) water quality database management system; (2) River Reach File (RF3) Hydrography; (3) Industrial Facilities Discharges; (4) Drinking Water Supplies; (5) Water Gages; and (6) Water Impoundments. The small-scale GIS data layers were used to prepare the maps included in the report that depict the locations of water quality monitoring stations, industrial discharges, drinking water intakes, water gages, and water impoundments. The data layers included in the maps (and this dataset) vary depending on availability, but generally include roads, hydrography, political boundaries, USGS 7.5-minute quadrangle outlines, hydrologic units, trails, and others as appropriate. The scales of each layer vary depending on data source but are generally 1:100,000.
This data set shows 311 service requests in the City of Pittsburgh. This data is collected from the request intake software used by the 311 Response Center in the Department of Innovation & Performance. Requests are collected from phone calls, tweets, emails, a form on the City website, and through the 311 mobile application. For more information, see the 311 Data User Guide. If you are unable to download the 311 Data table due to a 504 Gateway Timeout error, use this link instead: https://tools.wprdc.org/downstream/76fda9d0-69be-4dd5-8108-0de7907fc5a4 NOTE: The data feed for this dataset is broken as of December 21st, 2022. We're working on restoring it.
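A minimal sketch of pulling the table from the mirror link above with pandas, assuming that endpoint serves the data as CSV:

    import pandas as pd

    # Mirror link quoted above; assumed to return the 311 table as CSV.
    URL = "https://tools.wprdc.org/downstream/76fda9d0-69be-4dd5-8108-0de7907fc5a4"

    requests_311 = pd.read_csv(URL)
    print(requests_311.head())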