100+ datasets found
  1. Canada Per Capita Income Single variable data set

    • kaggle.com
    zip
    Updated Sep 9, 2019
    Cite
    Gurdit Singh (2019). Canada Per Capita Income Single variable data set [Dataset]. https://www.kaggle.com/datasets/gurdit559/canada-per-capita-income-single-variable-data-set
    Explore at:
    Available download formats: zip (637 bytes)
    Dataset updated
    Sep 9, 2019
    Authors
    Gurdit Singh
    Area covered
    Canada
    Description

    The dataset for predicting income per capita for Canada is taken from the website: data.worldbank.org

    I am using this data to practice single-variable linear regression as a beginner.


    Content

    The data set contains two columns: year and per capita income.

    Acknowledgements

    The dataset for predicting income per capita for Canada is taken from the website: data.worldbank.org

    Objective

    Predict Canada's per capita income for the year 2020 using linear regression (beginner level, just for practice).
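    A minimal sketch of that objective in Python, assuming the file exposes the two columns described above (the file name and the exact income column header are assumptions):

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical file name; adjust to the downloaded Kaggle file.
    df = pd.read_csv("canada_per_capita_income.csv")

    X = df[["year"]]                 # single predictor: year
    y = df["per capita income"]      # target column name assumed from the description

    model = LinearRegression().fit(X, y)
    pred_2020 = model.predict(pd.DataFrame({"year": [2020]}))[0]
    print(f"Predicted 2020 per capita income: {pred_2020:,.2f}")
    ```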

  2. 2010 County and City-Level Water-Use Data and Associated Explanatory...

    • catalog.data.gov
    • data.usgs.gov
    • +3more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). 2010 County and City-Level Water-Use Data and Associated Explanatory Variables [Dataset]. https://catalog.data.gov/dataset/2010-county-and-city-level-water-use-data-and-associated-explanatory-variables
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data.

    The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class.

    The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity.

    The R scripts construct fixed-effects and Bayesian hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance.

    All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
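    The released scripts are written in R and Stan; purely as an illustration of the pooled / unpooled / partially pooled distinction described above, here is a short Python sketch (the column names are hypothetical placeholders, not the actual headers of county_data.csv):

    ```python
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("county_data.csv")  # column names below are placeholders

    # Fully pooled fixed-effects model: clustering ignored, one coefficient per covariate.
    pooled = smf.ols("water_use ~ median_income + housing_density", data=df).fit()

    # Unpooled: a separate model (and separate coefficients) per climate region.
    unpooled = {
        region: smf.ols("water_use ~ median_income + housing_density", data=grp).fit()
        for region, grp in df.groupby("climate_region")
    }

    # Partial pooling (hierarchical compromise): random intercepts by climate region.
    partial = smf.mixedlm("water_use ~ median_income + housing_density",
                          data=df, groups=df["climate_region"]).fit()
    print(partial.summary())
    ```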

  3. Real State Website Data

    • kaggle.com
    Updated Jun 11, 2023
    Cite
    M. Mazhar (2023). Real State Website Data [Dataset]. https://www.kaggle.com/datasets/mazhar01/real-state-website-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M. Mazhar
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    See: End-to-End Regression Model Pipeline Development with FastAPI: From Data Scraping to Deployment with CI/CD Integration

    This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.

    The key columns in the dataset are as follows:

    1. Location1: the location of the house.
    2. Location2: identical to, or a shorter version of, Location1.
    3. Year: the year of construction.
    4. Type: the type of the house.
    5. Bedrooms: the number of bedrooms in the house.
    6. Bathrooms: the number of bathrooms in the house.
    7. Size_in_SqYds: the size of the house in square yards.
    8. Price: the price of the house.
    9. Parking_Spaces: the number of parking spaces available.
    10. Floors_in_Building: the number of floors in the building.
    11. Elevators: the presence of elevators in the building.
    12. Lobby_in_Building: the presence of a lobby in the building.

    In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.

    By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
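    A brief pandas sketch of the kind of exploratory analysis described above (the file name is an assumption; the column names follow the list given earlier):

    ```python
    import pandas as pd

    df = pd.read_csv("real_estate_listings.csv")     # hypothetical file name

    print(df.shape)                                  # expected: (9819, 54)
    print(df["Price"].describe())                    # price distribution
    print(df.groupby("Bedrooms")["Price"].median())  # price by number of bedrooms

    # Numeric correlations with price, strongest first.
    print(df.corr(numeric_only=True)["Price"].sort_values(ascending=False).head(10))
    ```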

    This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.

  4. titanic5 Dataset

    • paperswithcode.com
    Cite
    titanic5 Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/titanic5-dataset
    Explore at:
    Description

    titanic5 Dataset. Created by David Beltran del Rio, March 2016.

    Notes: This is the final (for now) version of my update to the Titanic data. I think it's finally ready for publishing if you'd like. What I did was to strip all the passenger and crew data from the Encyclopedia Titanica (ET) web pages (excluding channel-crossing passengers), create a unique ID for each passenger and crew member (Name_ID), and then (painstakingly, and hopefully 100% correctly) match to your earlier titanic3 dataset, in order to compare the two and to get your sibsp and parch variables. Since the ET is updated occasionally, the work put into the ID and matching can be reused and refined later. I did eventually hear back from the ET people; they are willing to make the underlying database available in the future, but I have not yet taken them up on it.

    The two datasets line up nicely. Most of the differences in the newer titanic5 dataset are in the age variable, as I had mentioned before - the new set has fewer missing ages: 51 missing (vs 263) out of 1309.

    I am in the process of refining my analysis of the data as well, based on your comments below and your Regression Modeling Strategies example.

    titanic3_wID data can be matched to titanic5 using the Name_ID variable. Tab titanic5 Metadata has the variable descriptions and allowable values for Class and Class/Dept.
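    A minimal pandas sketch of that Name_ID match, assuming the relevant tabs have been exported to CSV (the file names are assumptions):

    ```python
    import pandas as pd

    t5 = pd.read_csv("Titanic5_work.csv")     # titanic5 working data
    t3 = pd.read_csv("titanic3_wID.csv")      # titanic3 with Name_ID added

    # Inner join on the shared identifier; sibsp/parch come from titanic3,
    # so the two versions can be compared side by side.
    merged = t5.merge(t3[["Name_ID", "sibsp", "parch"]], on="Name_ID", how="inner")
    print(merged.shape)
    ```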

    A note about the ages - instead of using the add 0.5 trick to indicate estimated birth day / date I have a flag that indicates how the “final” age (Age_F) was arrived at. It’s the Age_F_Code variable - the allowable values are in the Titanic5_metadata tab in the attached excel. The reason for this is that I already had some fractional ages for infants where I had age in months instead of years and I wanted to avoid confusion for 6 month old infants, although I don’t think there are any in the data! Also, I was thinking to make fractional ages or age in days for all passengers for whom I have DoB, but I have not yet done so.

    Here’s what the tabs are:

    Titanic5_all - all (mostly cleaned) Titanic passenger and crew records
    Titanic5_work - working dataset, crew removed, unnecessary variables removed - this is the one I import into SAS / R to work on
    Titanic5_metadata - variable descriptions and allowable values
    titanic3_wID - original Titanic3 dataset with Name_ID added for merging to Titanic5

    I have a csv, R dataset, and SAS dataset, but the variable names are an older version, so I won't send those along for now to avoid confusion.

    If it helps, send my contact info along to your student in case any questions arise. Gmail address is probably best, on weekends for sure: davebdr@gmail.com

    The tabs in titanic5.xls are

    Titanic5_all
    Titanic5_passenger (the one to be used for analysis)
    Titanic5_metadata (used during analysis file creation)
    Titanic3_wID

  5. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether festival run information is available through the IMDb data.
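    As an illustration of the long-to-wide relationship described above, a small pandas sketch (column names such as film_id, fest and festival_edition_year are placeholders for the actual headers):

    ```python
    import pandas as pd

    long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

    # In long format a film repeats across rows: count its sample-festival appearances.
    appearances = long_df.groupby("film_id")["fest"].nunique()

    # One row per unique film, keeping the first sampled festival (mirrors the wide file's rule).
    wide_like = (long_df.sort_values("festival_edition_year")
                        .drop_duplicates(subset="film_id", keep="first"))
    print(wide_like.shape)  # expected: one row per unique film (n = 9,348)
    ```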


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
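    The matching itself was done with R packages; as a language-neutral illustration of the two string measures named above, here is a small Python sketch implementing cosine similarity over character bigrams and the OSA (optimal string alignment) distance:

    ```python
    import math
    from collections import Counter

    def bigram_profile(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    def cosine_similarity(a: str, b: str) -> float:
        pa, pb = bigram_profile(a), bigram_profile(b)
        dot = sum(pa[g] * pb[g] for g in pa)
        norm = (math.sqrt(sum(v * v for v in pa.values()))
                * math.sqrt(sum(v * v for v in pb.values())))
        return dot / norm if norm else 0.0

    def osa_distance(a: str, b: str) -> int:
        # Optimal string alignment: Levenshtein edits plus adjacent transpositions.
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        return d[len(a)][len(b)]

    print(cosine_similarity("The Hours", "The Hours "))   # near 1.0 for near-identical titles
    print(osa_distance("Amelie", "Aemlie"))               # 1: a single transposition
    ```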

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  6. HR Dataset.csv

    • kaggle.com
    Updated Mar 8, 2024
    Cite
    Fahad Rehman (2024). HR Dataset.csv [Dataset]. https://www.kaggle.com/datasets/fahadrehman07/hr-comma-sep-csv
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fahad Rehman
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🟡 Please upvote my dataset if you like it. ✨

    This dataset contains valuable employee information over time that can be analyzed to help optimize key HR functions. Some potential use cases include:

    Attrition analysis: Identify factors correlated with attrition like department, role, salary, etc. Segment high-risk employees. Predict future attrition.

    Performance management: Analyze the relationship between metrics like ratings and salary increments. Recommend performance improvement programs.

    Workforce planning: Forecast staffing needs based on historical hiring/turnover trends. Determine optimal recruitment strategies.

    Compensation analysis: Benchmark salaries against performance and experience. Identify pay inequities. Inform compensation policies.

    Diversity monitoring: Assess diversity metrics like gender ratio across roles and departments. Identify underrepresented groups.

    Succession planning: Identify high-potential candidates and critical roles. Predict internal promotions/replacements in advance.

    Given its longitudinal employee data and multiple variables, this dataset provides rich opportunities for exploration, predictive modeling, and actionable insights. With a large sample size, it can uncover subtle patterns. Cleaning it, and joining it with other contextual data sources, can yield even deeper insights. This makes it a valuable starting point for many organizational studies and evidence-based decision-making.


    This dataset contains information about different attributes of employees from a company. It includes 1000 employee records and 12 feature columns.

    The columns are:

    satisfaction_level: Employee satisfaction score (1-5 scale)
    last_evaluation: Score on last evaluation (1-5 scale)
    number_project: Number of projects the employee worked on
    average_monthly_hours: Average hours worked in a month
    time_spend_company: Number of years spent with the company
    work_accident: Whether the employee had a workplace accident (yes/no)
    left: Whether the employee has left the company (yes/no)
    promotion_last_5years: Number of promotions in the last 5 years
    Department: Department of the employee
    Salary: Annual salary of the employee
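    A minimal sketch of the attrition-analysis use case mentioned above, using the columns listed (illustrative only, not a tuned model; the file name follows the dataset title but may differ locally):

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("HR Dataset.csv")

    features = ["satisfaction_level", "last_evaluation", "number_project",
                "average_monthly_hours", "time_spend_company", "Department", "Salary"]
    X = pd.get_dummies(df[features], drop_first=True)  # one-hot encode Department and Salary
    y = df["left"]                                     # target: whether the employee left

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Hold-out accuracy:", model.score(X_test, y_test))
    ```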

  7. Panel dataset on Brazilian fuel demand

    • data.mendeley.com
    Updated Oct 7, 2024
    Cite
    Sergio Prolo (2024). Panel dataset on Brazilian fuel demand [Dataset]. http://doi.org/10.17632/hzpwbp7j22.1
    Explore at:
    Dataset updated
    Oct 7, 2024
    Authors
    Sergio Prolo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary: Fuel demand is shown to be influenced by fuel prices, people's income, and motorization rates. We explore the effect of electric vehicles' motorization rates on gasoline demand using this panel dataset.

    Files: dataset.csv - Panel dimensions are the Brazilian state (i) and year (t). The other columns are: gasoline sales per capita (ln_Sg_pc), prices of gasoline (ln_Pg) and ethanol (ln_Pe) and their lags, motorization rates of combustion vehicles (ln_Mi_c) and electric vehicles (ln_Mi_e), and GDP per capita (ln_gdp_pc). All variables are in natural logs, since we use this to calculate demand elasticities in a regression model.

    adjacency.csv - The adjacency matrix used in interaction with electric vehicles' motorization rates to calculate spatial effects. At first, it follows a binary adjacency formula: for each pair of states i and j, the cell (i, j) is 0 if the states are not adjacent and 1 if they are. Then, each row is normalized to have sum equal to one.
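    A small numpy sketch of that row-normalization step (whether adjacency.csv carries a header row or an index column is an assumption here):

    ```python
    import numpy as np
    import pandas as pd

    W = pd.read_csv("adjacency.csv", index_col=0).to_numpy(dtype=float)  # binary 0/1 matrix

    row_sums = W.sum(axis=1, keepdims=True)
    # Divide each row by its sum; rows with no neighbours are left as zeros.
    W_norm = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

    assert np.allclose(W_norm.sum(axis=1)[row_sums.ravel() > 0], 1.0)
    ```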

    regression.do - Series of Stata commands used to estimate the regression models of our study. dataset.csv must be imported for it to work; see the comment section.

    dataset_predictions.xlsx - Based on the estimations from Stata, we use this Excel file to make average predictions by year and by state. By including years beyond the last panel sample, we also forecast the model into the future and evaluate the effects of different policies that influence gasoline prices (taxation) and EV motorization rates (electrification). This file is primarily used to create images, but can be used to further understand how the forecasting scenarios are set up.

    Sources: Fuel prices and sales: ANP (https://www.gov.br/anp/en/access-information/what-is-anp/what-is-anp) State population, GDP and vehicle fleet: IBGE (https://www.ibge.gov.br/en/home-eng.html?lang=en-GB) State EV fleet: Anfavea (https://anfavea.com.br/en/site/anuarios/)

  8. Dataset for paper: Body Positivity but not for everyone

    • sussex.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Kathleen Simon; Megan Hurst (2023). Dataset for paper: Body Positivity but not for everyone [Dataset]. http://doi.org/10.25377/sussex.9885644.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    University of Sussex
    Authors
    Kathleen Simon; Megan Hurst
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Data for a Brief Report/Short Communication published in Body Image (2021). Details of the study are included below via the abstract from the manuscript. The dataset includes online experimental data from 167 women who were recruited via social media and institutional participant pools. The experiment was completed in Qualtrics. Women viewed either neutral travel images (control), body positivity posts with an average-sized model (e.g., ~ UK size 14), or body positivity posts with a larger model (e.g., UK size 18+); which images women viewed is shown in the ‘condition’ variable in the data.

    The data includes the age range, height, weight, calculated BMI, and Instagram use of participants. After viewing the images, women responded to the Positive and Negative Affect Schedule (PANAS), a state version of the Body Satisfaction Scale (BSS), and reported their immediate social comparison with the images (SAC items). Women then selected a lunch for themselves from a hypothetical menu; these selections are detailed in the data, as are the total calories calculated from this and the proportion of their picks which were (provided as a percentage, and as a categorical variable [as used in the paper analyses]). Women also reported whether they were on a special diet (e.g., vegan or vegetarian), had food intolerances, when they last ate, and how hungry they were.

    Women also completed trait measures of Body Appreciation (BAS-2) and social comparison (PACS-R), and were asked to comment on what they thought the experiment was about. Items and computed scales are included within the dataset.

    This item includes the dataset collected for the manuscript (in SPSS and CSV formats), the variable list for the CSV file (for users working with the CSV datafile; the variable list and details are contained within the .sav file for the SPSS version), and the SPSS syntax for our analyses (.sps). Also included are the information and consent form (collected via Qualtrics) and the questions as completed by participants (both in pdf format). Please note that the survey order in the PDF is not the same as in the datafiles; users should utilise the variable list (either in CSV or SPSS formats) to identify the items in the data. The SPSS syntax can be used to replicate the analyses reported in the Results section of the paper; annotations within the syntax file guide the user through these.

    A copy of SPSS Statistics is needed to open the .sav and .sps files.
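    For users without SPSS, the .sav file can also be read in Python via pandas (which relies on the pyreadstat package); the file name below is a placeholder:

    ```python
    import pandas as pd

    df = pd.read_spss("body_positivity_data.sav")     # hypothetical file name
    print(df["condition"].value_counts())             # the three image conditions described above
    ```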

    Manuscript abstract:

    Body Positivity (or ‘BoPo’) social media content may be beneficial for women’s mood and body image, but concerns have been raised that it may reduce motivation for healthy behaviours. This study examines differences in women’s mood, body satisfaction, and hypothetical food choices after viewing BoPo posts (featuring average or larger women) or a neutral travel control. Women (N = 167, 81.8% aged 18-29) were randomly assigned in an online experiment to one of three conditions (BoPo-average, BoPo-larger, or Travel/Control) and viewed three Instagram posts for two minutes, before reporting their mood and body satisfaction, and selecting a meal from a hypothetical menu. Women who viewed the BoPo posts featuring average-size women reported more positive mood than the control group; women who viewed posts featuring larger women did not. There were no effects of condition on negative mood or body satisfaction. Women did not make less healthy food choices than the control in either BoPo condition; women who viewed the BoPo images of larger women showed a stronger association between hunger and calories selected. These findings suggest that concerns over BoPo promoting unhealthy behaviours may be misplaced, but further research is needed regarding women’s responses to different body sizes.

  9. Data from: Macaques preferentially attend to intermediately surprising...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Apr 26, 2022
    Cite
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    Klaviyo
    Yale University
    University of Minnesota
    University of California, Berkeley
    Authors
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.

    Methods

    How the data were collected: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.

    Dataset:

    "csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

    subj: subject_ID = {"B":104, "C":102, "H":101, "J":103, "K":203}
    trialtime: start time of the current trial in seconds
    trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
    seq: current sequence number (one of 80 sequences)
    seq_item: current item number in a seq (in order)
    active_item: pop-up item (active box)
    pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence / no active object before the currently active object in the sequence"}
    next_active: next pop-up item (active box) {-1: "the last active object in the sequence / no active object after the currently active object in the sequence"}
    firstappear: {0: "not first", 1: "first appear in the seq"}
    looks_blank: csv: total amount of time looking at blank space for the current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"}
    looks_offscreen: csv: total amount of time looking offscreen for the current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"}
    time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"}
    looks target: csv: time spent looking at the target object (ms); csv_timestamp: look at the target or not at the current timestamp (1 or 0)
    look1,2,3: time spent looking at each object (ms)
    location 123X, 123Y: location of each box (locations of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence)
    item123id: pop-up item ID (remained static throughout a sequence)
    event time: total time spent for the whole event (pop-up and go back) (ms)
    eyeposX,Y: eye position at the current timestamp

    "csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

    rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0.
    already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and whose already_there is not NA.
    looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
    prob: the probability of the occurrence of the object
    surprisal: unigram surprisal value
    bisurprisal: transitional surprisal value
    std_surprisal: standardized unigram surprisal value
    std_bisurprisal: standardized transitional surprisal value
    binned_surprisal_means: the means of unigram surprisal values binned into three groups of evenly spaced intervals according to surprisal values
    binned_bisurprisal_means: the means of transitional surprisal values binned into three groups of evenly spaced intervals according to surprisal values
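    As a check on how the probability and surprisal columns relate, a minimal Python sketch under the standard definition surprisal = -log(p) (whether the authors used the natural log or log2 is an assumption here):

    ```python
    import numpy as np
    import pandas as pd

    df = pd.read_csv("csv-surprisal-prob.csv")

    # Recompute unigram surprisal from the event probability and standardize it.
    surprisal = -np.log(df["prob"])
    std_surprisal = (surprisal - surprisal.mean()) / surprisal.std()

    # Compare against the released column (correlation absorbs the log-base choice).
    print(np.corrcoef(std_surprisal, df["std_surprisal"])[0, 1])
    ```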

    "csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences

    Empty Values in Datasets:

    There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in "prev_active" and "next_active" mean that the currently active object is the first or the last active object in the sequence, i.e. there is no active object before or after it. When we analyzed the variable "already_there", we eliminated data whose "prev_active" is NA. NAs in "already_there" mean that the subject never looked at the target object in the current event; when we analyzed this variable, we also eliminated data whose "already_there" is NA. Missing values occur in "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" when an event is the first in its sequence and its transitional probability cannot be computed because no event precedes it. When we fitted models for transitional statistics, we eliminated data whose "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" are NA.

    Codes:

    In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd

  10. Global Dataset of Cyber Incidents V.1.2

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 3, 2024
    Cite
    European Repository of Cyber Incidents (EuRepoC) (2024). Global Dataset of Cyber Incidents V.1.2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7848940
    Explore at:
    Dataset updated
    May 3, 2024
    Dataset authored and provided by
    European Repository of Cyber Incidents (EuRepoC)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains data on 2889 cyber incidents between 01.01.2000 and 02.05.2024 using 60 variables, including the start date, names and categories of receivers along with names and categories of initiators. The database was compiled as part of the European Repository of Cyber Incidents (EuRepoC) project.

    EuRepoC gathers, codes, and analyses publicly available information from over 200 sources and 600 Twitter accounts daily to report on dynamic trends in the global, and particularly the European, cyber threat environment. For more information on the scope and data collection methodology, see https://eurepoc.eu/methodology. A codebook is also available. Information about each file:

    Global Database (csv or xlsx): This file includes all variables coded for each incident, organised such that one row corresponds to one incident - our main unit of investigation. Where multiple codes are present for a single variable for a single incident, these are separated with semi-colons within the same cell.

    Receiver Dataset (csv): In this file, the data of affected entities and individuals (receivers) is restructured to facilitate analysis. Each cell contains only a single code, with the data "unpacked" across multiple rows. Thus, a single incident can span several rows, identifiable through the unique identifier assigned to each incident (incident_id).

    Attribution Dataset (csv): This file follows a similar approach to the receiver dataset. The attribution data is "unpacked" over several rows, allowing each cell to contain only one code. Here too, a single incident may occupy several rows, with the unique identifier (incident_id) enabling easy tracking of each incident. In addition, some attributions may have multiple possible codes for one variable; these are also "unpacked" over several rows, with the attribution_id making it possible to track each attribution.

    eurepoc_global_database_1.2 (json): This file contains the whole database in JSON format.
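    A small pandas sketch of that "unpacking" step, going from semicolon-separated cells in the global database to the one-code-per-row layout (the column name receiver_category is a placeholder for illustration):

    ```python
    import pandas as pd

    incidents = pd.read_csv("eurepoc_global_database_1.2.csv")

    # Split the semicolon-separated cell, then give each code its own row.
    unpacked = (incidents
                .assign(receiver_category=incidents["receiver_category"].str.split(";"))
                .explode("receiver_category"))
    unpacked["receiver_category"] = unpacked["receiver_category"].str.strip()

    # A single incident now spans several rows, still traceable via incident_id.
    print(unpacked.groupby("incident_id").size().head())
    ```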

  11. Data from: CY-Bench: A comprehensive benchmark dataset for subnational crop...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Sep 25, 2024
    Cite
    Dilli Paudel; Dilli Paudel; Hilmy Baja; Hilmy Baja; Ron van Bree; Michiel Kallenberg; Michiel Kallenberg; Stella Ofori-Ampofo; Aike Potze; Pratishtha Poudel; Pratishtha Poudel; Abdelrahman Saleh; Weston Anderson; Weston Anderson; Malte von Bloh; Andres Castellano; Oumnia Ennaji; Raed Hamed; Rahel Laudien; Donghoon Lee; Inti Luna; Dainius Masiliūnas; Dainius Masiliūnas; Michele Meroni; Janet Mumo Mutuku; Siyabusa Mkuhlani; Jonathan Richetti; Alex C. Ruane; Ritvik Sahajpal; Guanyuan Shuai; Vasileios Sitokonstantinou; Rogerio de Souza Noia Junior; Amit Kumar Srivastava; Robert Strong; Lily-belle Sweet; Lily-belle Sweet; Petar Vojnović; Allard de Wit; Allard de Wit; Maximilian Zachow; Ioannis N. Athanasiadis; Ron van Bree; Stella Ofori-Ampofo; Aike Potze; Abdelrahman Saleh; Malte von Bloh; Andres Castellano; Oumnia Ennaji; Raed Hamed; Rahel Laudien; Donghoon Lee; Inti Luna; Michele Meroni; Janet Mumo Mutuku; Siyabusa Mkuhlani; Jonathan Richetti; Alex C. Ruane; Ritvik Sahajpal; Guanyuan Shuai; Vasileios Sitokonstantinou; Rogerio de Souza Noia Junior; Amit Kumar Srivastava; Robert Strong; Petar Vojnović; Maximilian Zachow; Ioannis N. Athanasiadis (2024). CY-Bench: A comprehensive benchmark dataset for subnational crop yield forecasting [Dataset]. http://doi.org/10.5281/zenodo.13798797
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    AgML (https://www.agml.org/)
    Authors
    Dilli Paudel; Dilli Paudel; Hilmy Baja; Hilmy Baja; Ron van Bree; Michiel Kallenberg; Michiel Kallenberg; Stella Ofori-Ampofo; Aike Potze; Pratishtha Poudel; Pratishtha Poudel; Abdelrahman Saleh; Weston Anderson; Weston Anderson; Malte von Bloh; Andres Castellano; Oumnia Ennaji; Raed Hamed; Rahel Laudien; Donghoon Lee; Inti Luna; Dainius Masiliūnas; Dainius Masiliūnas; Michele Meroni; Janet Mumo Mutuku; Siyabusa Mkuhlani; Jonathan Richetti; Alex C. Ruane; Ritvik Sahajpal; Guanyuan Shuai; Vasileios Sitokonstantinou; Rogerio de Souza Noia Junior; Amit Kumar Srivastava; Robert Strong; Lily-belle Sweet; Lily-belle Sweet; Petar Vojnović; Allard de Wit; Allard de Wit; Maximilian Zachow; Ioannis N. Athanasiadis; Ron van Bree; Stella Ofori-Ampofo; Aike Potze; Abdelrahman Saleh; Malte von Bloh; Andres Castellano; Oumnia Ennaji; Raed Hamed; Rahel Laudien; Donghoon Lee; Inti Luna; Michele Meroni; Janet Mumo Mutuku; Siyabusa Mkuhlani; Jonathan Richetti; Alex C. Ruane; Ritvik Sahajpal; Guanyuan Shuai; Vasileios Sitokonstantinou; Rogerio de Souza Noia Junior; Amit Kumar Srivastava; Robert Strong; Petar Vojnović; Maximilian Zachow; Ioannis N. Athanasiadis
    License

    https://joinup.ec.europa.eu/page/eupl-text-11-12

    Description

    CY-Bench: A comprehensive benchmark dataset for sub-national crop yield forecasting


    Overview

    CY-Bench is a dataset and benchmark for subnational crop yield forecasting, with coverage of major crop growing countries of the world for maize and wheat. By subnational, we mean the administrative level where yield statistics are published. When statistics are available for multiple levels, we pick the highest resolution. The dataset combines sub-national yield statistics with relevant predictors, such as growing-season weather indicators, remote sensing indicators, evapotranspiration, soil moisture indicators, and static soil properties. CY-Bench has been designed and curated by agricultural experts, climate scientists, and machine learning researchers from the AgML Community, with the aim of facilitating model intercomparison across the diverse agricultural systems around the globe in conditions as close as possible to real-world operationalization. Ultimately, by lowering the barrier to entry for ML researchers in this crucial application area, CY-Bench will facilitate the development of improved crop forecasting tools that can be used to support decision-makers in food security planning worldwide.

    * Crops : Wheat & Maize
    * Spatial Coverage : Wheat (29 countries), Maize (38).
    See CY-Bench paper appendix for the list of countries.
    * Temporal Coverage : Varies. See country-specific data

    Data

    Data format


    The benchmark data is organized as a collection of CSV files, with each file representing a specific category of variable for a particular country. Each CSV file is named according to the category and the country it pertains to, facilitating easy identification and retrieval. The data within each CSV file is structured in tabular format, where rows represent observations and columns represent different predictors related to a category of variable.

    Data content

    All data files are provided as .csv.

    | Data | Description | Variables (units) | Temporal Resolution | Data Source (Reference) |
    |---|---|---|---|---|
    | crop_calendar | Start and end of growing season | sos (day of the year), eos (day of the year) | Static | World Cereal (Franch et al, 2022) |
    | fpar | fraction of absorbed photosynthetically active radiation | fpar (%) | Dekadal (3 times a month; 1-10, 11-20, 21-31) | European Commission's Joint Research Centre (EC-JRC, 2024) |
    | ndvi | normalized difference vegetation index | - | approximately weekly | MOD09CMG (Vermote, 2015) |
    | meteo | temperature, precipitation (prec), radiation, potential evapotranspiration (et0), climatic water balance (= prec - et0) | tmin (C), tmax (C), tavg (C), prec (mm), et0 (mm), cwb (mm), rad (J m-2 day-1) | daily | AgERA5 (Boogaard et al, 2022), FAO-AQUASTAT for et0 (FAO-AQUASTAT, 2024) |
    | soil_moisture | surface soil moisture, rootzone soil moisture | ssm (kg m-2), rsm (kg m-2) | daily | GLDAS (Rodell et al, 2004) |
    | soil | available water capacity, bulk density, drainage class | awc (c m-1), bulk_density (kg dm-3), drainage class (category) | static | WISE Soil database (Batjes, 2016) |
    | yield | end-of-season yield | yield (t ha-1) | yearly | Various country- or region-specific sources (see crop_statistics_... in https://github.com/BigDataWUR/AgML-CY-Bench/tree/main/data_preparation) |

    Folder structure


    The CY-Bench dataset is structured at the first level by crop type and subsequently by country. For each country, the folder name follows the ISO 3166-1 alpha-2 two-character code. A separate .csv file is available for each predictor and for the crop calendar, as shown below. The csv files are named to reflect the corresponding country and crop type, e.g. **variable_croptype_country.csv**.
    ```
    CY-Bench

    └─── maize
    │ │
    │ └─── AO
    │ │ -- crop_calendar_maize_AO.csv
    │ │ -- fpar_maize_AO.csv
    │ │ -- meteo_maize_AO.csv
    │ │ -- ndvi_maize_AO.csv
    │ │ -- soil_maize_AO.csv
    │ │ -- soil_moisture_maize_AO.csv
    │ │ -- yield_maize_AO.csv
    │ │
    │ └─── AR
    │ -- crop_calendar_maize_AR.csv
    │ -- fpar_maize_AR.csv
    │ -- ...

    └─── wheat
    │ │
    │ └─── AR
    │ │ -- crop_calendar_wheat_AR.csv
    │ │ -- fpar_wheat_AR.csv
    │ │ ...
    ```

    Example : CSV data content for maize in country X

    ```
    X
    └─── crop_calendar_maize_X.csv
    │ -- crop_name (name of the crop)
    │ -- adm_id (unique identifier for a subnational unit)
    │ -- sos (start of crop season)
    │ -- eos (end of crop season)

    └─── fpar_maize_X.csv
    │ -- crop_name
    │ -- adm_id
    │ -- date (in the format YYYYMMdd)
    │ -- fpar

    └─── meteo_maize_X.csv
    │ -- crop_name
    │ -- adm_id
    │ -- date (in the format YYYYMMdd)

    │ -- tmin (minimum temperature)
    │ -- tmax (maximum temperature)
    │ -- prec (precipitation)
    │ -- rad (radiation)
    │ -- tavg (average temperature)
    │ -- et0 (evapotranspiration)
    │ -- cwb (crop water balance)

    └─── ndvi_maize_X.csv
    │ -- crop_name
    │ -- adm_id
    │ -- date (in the format YYYYMMdd)
    │ -- ndvi

    └─── soil_maize_X.csv
    │ -- crop_name
    │ -- adm_id
    │ -- awc (available water capacity)
    │ -- bulk_density
    │ -- drainage_class

    └─── soil_moisture_maize_X.csv
    │ -- crop_name
    │ -- adm_id
    │ -- date (in the format YYYYMMdd)
    │ -- ssm (surface soil moisture)
    │ -- rsm (rootzone soil moisture)

    └─── yield_maize_X.csv
    │ -- crop_name
    │ -- country_code
    │ -- adm_id
    │ -- harvest_year
    │ -- yield
    │ -- harvest_area
    │ -- production
    ```

    Data access

    The full dataset can be downloaded directly from Zenodo or using the `zenodo_get` library.
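    Once downloaded, a minimal pandas sketch for loading and joining the per-country files for one crop, following the naming convention and columns described above (maize in AO is used purely as an example):

    ```python
    import pandas as pd

    crop, country = "maize", "AO"
    base = f"CY-Bench/{crop}/{country}"

    yields = pd.read_csv(f"{base}/yield_{crop}_{country}.csv")
    meteo = pd.read_csv(f"{base}/meteo_{crop}_{country}.csv")

    # Daily weather is keyed by subnational unit (adm_id) and date (YYYYMMdd);
    # aggregate it to yearly means before joining onto the yearly yield records.
    meteo["year"] = pd.to_datetime(meteo["date"].astype(str), format="%Y%m%d").dt.year
    weather_yearly = meteo.groupby(["adm_id", "year"])[["tavg", "prec"]].mean().reset_index()

    merged = yields.merge(weather_yearly, left_on=["adm_id", "harvest_year"],
                          right_on=["adm_id", "year"], how="left")
    print(merged.head())
    ```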


    License and citation


    We kindly ask all users of CY-Bench to properly respect licensing and citation conditions of the datasets included.

  12. Performance vs. Predicted Performance

    • kaggle.com
    Updated Dec 21, 2022
    Cite
    Calathea21 (2022). Performance vs. Predicted Performance [Dataset]. http://doi.org/10.34740/kaggle/dsv/4752670
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 21, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Calathea21
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains information about high school students and their actual and predicted performance on an exam. Most of the information, including general information about the students and their grade for an exam, was based on an already existing dataset, while the predicted exam performance was based on a human experiment. In this experiment, participants were shown short descriptions of the students (based on the information in the original data) and had to rank and grade them according to their expected performance. Prior to this task, some participants were exposed to a "Stereotype Activation" suggesting that boys perform less well in school than girls.

    Description of *original_data.csv*

    Based on this dataset (which is also available on kaggle), we extracted a number of student profiles that participants had to make grade predictions for. For more information about this dataset we refer to the corresponding kaggle page: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption

    Note that we performed some preprocessing on the original data:

    • The original data consisted of two parts: the information about students following a Maths course and the information about students following a Portuguese course. Since the same type of information was recorded in both datasets, we merged them and added a column "subject" to show which course each student belongs to.

    • We excluded all data where G3 = 0 (i.e. the grade for the last exam = 0)

    • From original_data.csv we randomly sampled 856 students that participants in our study had to make grade predictions for.

    Description of *CompleteDataAndBiases.csv*

    index - this column corresponds to the indices in the file "original_data.csv". Through these indices, it is possible to add columns from the original data to the dataset with the grade predictions.

    ParticipantID - the ID of the participant who made the performance predictions for the corresponding student. Predictions needed to be made for 856 students, and each participant made 8 predictions in total; thus there are 107 different participant IDs.

    name - to make the prediction task more engaging for participants, each of the 8 student profiles that participants had to grade and rank was randomly matched to one of four boys'/girls' names (depending on the sex of the student).

    sex - the sex of each student, either female (F) or male (M). For benchmarking fair ML algorithms, this can be used as the sensitive attribute. We assume that in the fair version of the decision variable ("Pass"), no sex discrimination occurs. The biased versions of the variable ("Predicted Pass") are mostly discriminatory towards male students.

    studytime - this variable is taken from the original dataset and denotes how long a student studied for their exam. In the original data this variable consisted of four levels (less than 2 hours vs. 2-5 hours vs. 5-10 hours vs. more than 10 hours). We binned the latter two levels together and encoded this column numerically from 1-3.

    freetime - Originally, this variable ranged from 1 (very low) to 5 (very high). We binned this variable into three categories, where level 1 and 2 are binned, as well as level 4 and 5.

    romantic - Binary variable, denoting whether the student is in a romantic relationship or not.

    Walc - This variable shows how much alcohol each student consumes in the weekend. Originally it ranged from 1 to 5 (5 corresponding to the highest alcohol consumption), but we binned the last two levels together.

    goout - This variable shows how often a student goes out in a week. Originally it ranged from 1 to 5 (5 corresponding to going out very often), but we binned the last two levels together.

    Parents_edu - This variable was not present in the original dataset. Instead, the original dataset contained two variables, "mum_edu" and "dad_edu". We obtained "Parents_edu" by taking the higher of the two. The variable consists of 4 levels, where 4 = highest level of education.

    absences - This variable shows the number of absences per student. Originally it ranged from 0 to 93, but because large numbers of absences were infrequent, we binned all absences of >= 7 into one level.

    reason - The reason why a student chose to go to the school in question. The levels are: close to home, school's reputation, school's curriculum, and other.

    G3 - The actual grade each student received for the final exam of the course, ranging from 0-20.

    Pass - A binary variable showing whether G3 is a passing grade (i.e. >=10) or not.

    Predicted Grade - The grade the student was predicted to receive in our experiment

    Predicted Rank - In our ex...

  13. Chapter 12: Data Preparation for Fraud Analytics: Project: Human Recourses...

    • data.mendeley.com
    Updated Nov 1, 2023
    Cite
    ABDELRAHIM AQQAD (2023). Chapter 12: Data Preparation for Fraud Analytics: Project: Human Recourses Analysis - Human_Resources.csv [Dataset]. http://doi.org/10.17632/smypp8574h.1
    Explore at:
    Dataset updated
    Nov 1, 2023
    Authors
    ABDELRAHIM AQQAD
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Project: Human Recourses Analysis - Human_Resources.csv

    Description:

    The dataset, named "Human_Resources.csv", is a comprehensive collection of employee records from a fictional company. Each row represents an individual employee, and the columns represent various features associated with that employee.

    The dataset is rich, highlighting features like 'Age', 'MonthlyIncome', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'EducationField', 'JobSatisfaction', and many more. The main focus is the 'Attrition' variable, which indicates whether an employee left the company or not.

    Employee data were sourced from various departments, encompassing a diverse array of job roles and levels. Each employee's record provides an in-depth look into their background, job specifics, and satisfaction levels.

    The dataset further includes specific indicators and parameters that were considered during employee performance assessments, offering a granular look into the complexities of each employee's experience.

    For privacy reasons, certain personal details and specific identifiers have been anonymized or fictionalized. Instead of names or direct identifiers, each entry is associated with a unique 'EmployeeNumber', ensuring data privacy while retaining data integrity.

    The employee records were subjected to rigorous examination, encompassing both manual assessments and automated checks. The end result of this examination, specifically whether an employee left the company or not, is clearly indicated for each record.
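    As a rough illustration of how the 'Attrition' outcome might be analyzed (a hedged sketch, not part of the dataset's documentation; the file path is an assumption and 'Attrition' is assumed to be coded "Yes"/"No"):

        # Load the employee records (path is an assumption)
        hr <- read.csv("Human_Resources.csv", stringsAsFactors = TRUE)

        # How many employees left vs. stayed?
        table(hr$Attrition)

        # Simple logistic regression of attrition on a few of the listed features
        hr$left <- as.integer(hr$Attrition == "Yes")
        fit <- glm(left ~ Age + MonthlyIncome + JobSatisfaction,
                   data = hr, family = binomial())
        summary(fit)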

  14. A

    ‘School Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘School Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-school-dataset-3c70/2a80983f/?iid=004-128&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘School Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/smeilisa07/number of school teacher student class on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is my first data analysis project. I obtained this dataset from the Open Data Jakarta website (http://data.jakarta.go.id/), so most of the dataset is in Indonesian. However, I have tried to describe it; you can find the description in the VARIABLE DESCRIPTION.txt file.

    Content

    The title of this dataset is jumlah-sekolah-guru-murid-dan-ruang-kelas-menurut-jenis-sekolah-2011-2016, and it is provided as a CSV file, so you can easily access it. The title means "the number of schools, teachers, students, and classrooms by type of school, 2011-2016", so the title alone gives a good idea of the contents. The dataset has 50 observations and 8 variables, covering 2011 to 2016.

    In general, this dataset is about the quality of education in Jakarta: each year the figures for some school levels decrease and others increase, but not significantly.

    Acknowledgements

    This dataset comes from the Indonesian education authorities and was published as a CSV file by Open Data Jakarta.

    Inspiration

    Although this data is publicly available from Open Data Jakarta, I want to keep improving my data science skills, especially in R programming, because I find R easy to learn and it keeps me curious about data science. I am still struggling with the problems below and would appreciate solutions.

    Questions:

    1. How can I clean this dataset? I have tried cleaning it, but I am still not sure; you can check the my_hypothesis.txt file, where I try to clean and visualize the data.

    2. How should I specify a model for machine learning? What steps are recommended?

    3. How should I cluster the dataset if I want the labels to be tingkat_sekolah (school level) for every tahun (year) and jenis_sekolah (school type), rather than numbers? You can check the my_hypothesis.txt file. (A rough clustering sketch follows this quoted description.)

    --- Original source retains full ownership of the source dataset ---
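    One way to approach question 3 above is sketched below in R, under loud assumptions: the CSV file name and the numeric column names (jumlah_sekolah, jumlah_guru, jumlah_murid) are guesses, and tingkat_sekolah is assumed to be a column in the data. This is a rough starting point, not a definitive solution.

        # Read the Open Data Jakarta file (file name assumed)
        sekolah <- read.csv("jumlah-sekolah-guru-murid-dan-ruang-kelas-menurut-jenis-sekolah-2011-2016.csv")

        # Keep only the numeric count columns (names are assumptions) and standardize them
        num_cols <- c("jumlah_sekolah", "jumlah_guru", "jumlah_murid")
        x <- scale(sekolah[, num_cols])

        # k-means with one cluster per school level
        set.seed(42)
        k <- length(unique(sekolah$tingkat_sekolah))
        km <- kmeans(x, centers = k)

        # Relabel each numeric cluster id with the most common school level in that cluster
        cluster_label <- tapply(sekolah$tingkat_sekolah, km$cluster,
                                function(v) names(sort(table(v), decreasing = TRUE))[1])
        sekolah$cluster <- cluster_label[as.character(km$cluster)]
        table(sekolah$cluster, sekolah$tingkat_sekolah)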

  15. m

    Dataset to run examples in SmartPLS 3 (teaching and learning)

    • data.mendeley.com
    • narcis.nl
    Updated Mar 7, 2019
    + more versions
    Cite
    Diógenes de Bido (2019). Dataset to run examples in SmartPLS 3 (teaching and learning) [Dataset]. http://doi.org/10.17632/4tkph3mxp9.2
    Explore at:
    Dataset updated
    Mar 7, 2019
    Authors
    Diógenes de Bido
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This zip file contains:

    • 3 .zip files = projects to be imported into SmartPLS 3:

      – DLOQ-A model with 7 dimensions
      – DLOQ-A model with a second-order latent variable
      – ECSI model (Tenenhaus et al., 2005) to exemplify direct, indirect and total effects, as well as the importance-performance map and moderation with continuous variables
      – ECSI model (Sanches, 2013) to exemplify MGA (multi-group analysis)

    • 5 files (csv, txt) with data to run 7 examples in SmartPLS 3

    Note:
    • DLOQ-A = new dataset (ours)
    • ECSI-Tenenhaus et al. [model for mediation and moderation] = available at: http://www.smartpls.com > Resources > SmartPLS Project Examples
    • ECSI-Sanches [dataset for MGA] = available in the software R > library(plspm) > data(satisfaction)
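    For the ECSI-Sanches data, the dataset mentioned above can be loaded directly in R (assuming the plspm package is installed; it may need to be obtained from the CRAN archive if it is no longer on CRAN):

        # The ECSI customer-satisfaction data used for the MGA example ships with plspm
        library(plspm)
        data(satisfaction)   # loads the 'satisfaction' data frame
        str(satisfaction)    # inspect the indicator columns and the grouping variable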

  16. m

    Data from: Virtual Reality Immersion Dataset

    • data.mendeley.com
    Updated Feb 22, 2022
    + more versions
    Cite
    Matias Selzer (2022). Virtual Reality Immersion Dataset [Dataset]. http://doi.org/10.17632/kj79vpcsc5.2
    Explore at:
    Dataset updated
    Feb 22, 2022
    Authors
    Matias Selzer
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    The present dataset consists of 401 samples obtained from a scientific user study regarding the relationship between the hardware and software characteristics of the Virtual Reality system and the level of immersion perceived by the user.

    This dataset consists of a CSV file containing 23 columns related to the independent variables of the VR system, and one last column corresponding to the immersion level associated with those variable values.

    The independent variables are:

    Trial Variables:
    – Duration Time: from 120 to 1200 seconds (2 to 20 minutes).

    Visual Configuration:
    – Screen Resolution (Width and Height): from 0.1 to 1.0 multiplied by the device max resolution (2160x1200 for the Oculus Rift CV1).
    – Field-of-View (FOV): from 0.3 (30%) to 1 (100%) of the device max FOV.
    – Frame Rate (FPS): from 8 to 60 FPS.
    – Stereopsis: Enabled (1) or Disabled (0).
    – Antialiasing (MSAA): Enabled (1) or Disabled (0).
    – Textures: Enabled (1) or Disabled (0).
    – Illumination: Ambient Light with No Shading (0), or Point Lights with Realistic Shading (1).
    – Saturation: from -1.0 (no saturation at all) to 1.0 (extremely saturated image).
    – Brightness: from -0.8 to 0.8. Higher or lower values create completely dark or white scenes.
    – Contrast: from -0.8 to 0.8.
    – Sharpness: from 0.0 to 1.0.
    – Shadows: Shadow Strength from 0.0 to 1.0.
    – Reflections (Specular Coefficient of Materials): Enabled (1) or Disabled (0).
    – 3D Models Detail: Low-Poly Models (0) or High-Poly Models (1).
    – Depth-of-Field: Enabled (1) or Disabled (0).
    – Particles: Enabled (1) or Disabled (0).

    Audio Configuration:
    – Sound System: No Sound (0), Speakers (1), or Headphones (2).
    – Ambient Sound: Enabled (1) or Disabled (0).
    – Reverberation: Enabled (1) or Disabled (0).
    – 3D Spatial Sound: Enabled (1) or Disabled (0).

    Locomotion Configuration:
    – Locomotion Mode: Teleportation (1), Joystick Movement (2), or Walking-in-Place (WIP) (3).

    The dependent variable is:
    – Total Immersion: from 0 (no immersion) to 100 (complete immersion).
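    As a hedged illustration of how these variables might be related to the immersion score, a simple linear model in R could look like the sketch below. The CSV file name and all column names (total_immersion, fps, fov, stereopsis, sound_system) are placeholders, since only the variable descriptions above are given.

        # Read the 401 samples (file name is an assumption)
        vr <- read.csv("vr_immersion.csv")

        # Linear model of total immersion (0-100) on a handful of configuration variables;
        # replace the placeholder column names with the actual CSV headers
        fit <- lm(total_immersion ~ fps + fov + stereopsis + factor(sound_system), data = vr)
        summary(fit)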

  17. r

    CALY-SWE: Discrete choice experiment and time trade-off data for a...

    • researchdata.se
    • data.europa.eu
    Updated Sep 24, 2024
    Cite
    Kaspar Walter Meili; Lars Lindholm (2024). CALY-SWE: Discrete choice experiment and time trade-off data for a representative Swedish value set [Dataset]. http://doi.org/10.5878/asxy-3p37
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Umeå University
    Authors
    Kaspar Walter Meili; Lars Lindholm
    Time period covered
    Jan 8, 2022 - Apr 18, 2022
    Area covered
    Sweden
    Description

    The data consist of two parts: Time trade-off (TTO) data with one row per TTO question (5 questions), and discrete choice experiment (DCE) data with one row per question (6 questions). The purpose of the data is the calculation of a Swedish value set for the capability-adjusted life years (CALY-SWE) instrument. To protect the privacy of the study participants and to comply with GDPR, access to the data is given upon request.

    The data is provided in 4 .csv files with the names:

    • tto.csv (252 kB)
    • dce.csv (282 kB)
    • weights_final_model.csv (30 kB)
    • coefs_final_model.csv (1 kB)

    The first two files (tto.csv, dce.csv) contain the time trade-off (TTO) answers and discrete choice experiment (DCE) answers of participants. The latter two files (weights_final_model.csv, coefs_final_model.csv) contain the generated value set of CALY-SWE weights and the pertaining coefficients of the main-effects additive model.

    Background:

    CALY-SWE is a capability-based instrument for studying quality of life (QoL). It consists of 6 attributes (health, social relations, financial situation & housing, occupation, security, political & civil rights), each answered on 3 levels (Agree, Agree partially, Do not agree). A configuration or state is one of the 3^6 = 729 possible situations that the instrument describes. Here, a configuration is denoted in the form xxxxxx, one x for each attribute in the order above, where each x is a digit corresponding to the level of the respective attribute, with 3 being the highest (Agree) and 1 the lowest (Do not agree). For example, 222222 encodes a configuration with all attributes on level 2 (Agree partially). The purpose of this dataset is to support the publication of the CALY-SWE value set and to enable reproduction of the calculations (due to privacy concerns we abstain from publishing individual-level characteristics). A value set consists of values on the 0 to 1 scale for all 729 configurations, each of which represents a quality weighting where 1 is the highest capability-related QoL and 0 the lowest.

    The data contains answers to two types of questions: TTO and DCE.

    In TTO questions, participants iteratively chose a number of years x between 1 and 10, such that living x years with full capability (state configuration 333333) is equivalent to living 10 years in the capability state that the TTO question describes. The answer on the 0 to 1 scale is then calculated as x/10. In the DCE questions, participants were given two states and chose the state that they found to be better. We used a hybrid model with a linear regression and a logit model component, where the coefficients were linked through a multiplicative factor, to obtain the weights (weights_final_model.csv). Each weight is calculated as the constant + the coefficients for the respective configuration. Coefficients for level 3 encode the difference to level 2, and coefficients for level 2 the difference to the constant. For example, the weight for 123112 is calculated as constant + socrel2 + finhou2 + finhou3 + polciv2 (no coefficients for health, occupation, and security are involved, as they are on level 1, which is captured in the constant/intercept).

    To assess the quality of TTO answers, we calculated a score per participant that takes into account inconsistencies in answering the TTO question. We then excluded 20% of participants with the worst score to improve the TTO data quality and signal strength for the model (this is indicated by the 'included' variable in the TTO dataset). Details of the entire survey are described in the preprint “CALY-SWE value set: An integrated approach for a valuation study based on an online-administered TTO and DCE survey” by Meili et al. (2023). Please check this document for updated versions.

    Ids have been randomized with preserved linkage between the DCE and TTO dataset.

    Data files and variables:

    Below is a description of the variables in each CSV file.

    • tto.csv:

    config: 6 numbers representing the attribute levels.
    position: The number of the asked TTO question.
    tto_block: The design block of the TTO question.
    answer: The equivalence value indicated by the participant, ranging from 0.1 to 1 in steps of 0.1.
    included: If the answer was included in the data for the model to generate the value set.
    id: Randomized id of the participant.

    • dce.csv:

    config1: Configuration of the first state in the question.
    config2: Configuration of the second state in the question.
    position: The number of the asked DCE question.
    answer: Whether state 1 or 2 was preferred.
    id: Randomized id of the participant.

    • weights_final_model.csv

    config: 6 numbers representing the attribute levels.
    weight: The weight calculated with the final model.
    ciu: The upper 95% credible interval.
    cil: The lower 95% credible interval.

    • coefs_final_model.csv:

    name: Name of the coefficient, composed of an abbreviation for the attribute and a level number (abbreviations in the same order as above: health, socrel, finhou, occu, secu, polciv).
    value: Continuous, weight on the 0 to 1 scale.
    ciu: The upper 95% credible interval.
    cil: The lower 95% credible interval.
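    Following the worked example in the Background section (weight for 123112 = constant + socrel2 + finhou2 + finhou3 + polciv2), a weight can be reconstructed from coefs_final_model.csv roughly as in the R sketch below. Only the documented columns name and value are used; the exact row name of the constant/intercept is an assumption.

        coefs <- read.csv("coefs_final_model.csv")

        # Turn the coefficient table into a named lookup vector
        b <- setNames(coefs$value, coefs$name)

        # Reconstruct the weight for a configuration such as "123112"
        # (attribute order: health, socrel, finhou, occu, secu, polciv)
        weight_for <- function(config, b, constant_name = "constant") {
          attrs <- c("health", "socrel", "finhou", "occu", "secu", "polciv")
          lvls  <- as.integer(strsplit(config, "")[[1]])
          w <- b[[constant_name]]                                  # level 1 is captured by the intercept
          for (i in seq_along(attrs)) {
            if (lvls[i] >= 2) w <- w + b[[paste0(attrs[i], "2")]]  # level 2: difference to the constant
            if (lvls[i] == 3) w <- w + b[[paste0(attrs[i], "3")]]  # level 3: difference to level 2
          }
          w
        }

        weight_for("123112", b)   # should match the corresponding row in weights_final_model.csv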

  18. h

    open-close-drawer-csv

    • huggingface.co
    Updated Apr 18, 2025
    Cite
    Chen (2025). open-close-drawer-csv [Dataset]. https://huggingface.co/datasets/caiyan123/open-close-drawer-csv
    Explore at:
    Dataset updated
    Apr 18, 2025
    Authors
    Chen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open-Close Drawer CSV Dataset

    This dataset is a preprocessed CSV version of two robotic manipulation tasks — open drawer and close drawer — derived from the PhysicalAI-Robotics-Manipulation-Kitchen dataset by NVIDIA. A new variable label_index has been added to indicate the class label (0 for open, 1 for close).

      File Structure
    

    merged_open_close_drawer.csv: Contains episode-wise data of robot states and actions.

      Columns Overview
    

    Column Description… See the full description on the dataset page: https://huggingface.co/datasets/caiyan123/open-close-drawer-csv.

  19. f

    Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, these data must be processed before new insights can be derived through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

    • demographics (281 variables),
    • dietary consumption (324 variables),
    • physiological functions (1,040 variables),
    • occupation (61 variables),
    • questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
    • medications (29 variables),
    • mortality information linked from the National Death Index (15 variables),
    • survey weights (857 variables),
    • environmental exposure biomarker measurements (598 variables), and
    • chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module, one being the uncleaned version and the other the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.

    • "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
    • "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
    • "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
    • "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file.

    • "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
    • "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter codes: The set of starter code to help users conduct exposome analyses consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.

    • "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
    • "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
    • "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design.
    • "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
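    Outside of the bundled tutorials, the sketch below shows what accounting for the NHANES survey design typically looks like in R with the survey package. The merged data frame nh and the column names (psu, strata, weight, outcome, exposure, age, sex) are placeholders, not the actual names used in the curated files; see example_1 for the authors' own workflow.

        library(survey)

        # nh is assumed to be a data frame obtained by merging the curated modules
        # (see example_0); the column names below are placeholders
        des <- svydesign(ids = ~psu, strata = ~strata, weights = ~weight,
                         nest = TRUE, data = nh)

        # Survey-weighted regression of an outcome on an exposure, adjusted for age and sex
        fit <- svyglm(outcome ~ exposure + age + sex, design = des)
        summary(fit)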

  20. A

    ‘dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Dec 2, 2016
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-dataset-9639/ef622eef/?iid=014-116&v=presentation
    Explore at:
    Dataset updated
    Dec 2, 2016
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rksensational/dataset on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Overview

    The data has been split into two groups:

    • training set (train.csv)
    • test set (test.csv)

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Variable Notes

    pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

    age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

    sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

    parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children traveled only with a nanny, therefore parch = 0 for them. (A short R example of handling these encodings follows the quoted description below.)

    --- Original source retains full ownership of the source dataset ---
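    A short, hedged R example of handling the encodings described in the Variable Notes above (column-name capitalization may differ in the actual train.csv, and the derived family_size variable is an illustration, not part of the dataset):

        train <- read.csv("train.csv")

        # pclass is an ordinal proxy for socio-economic status, so treat it as a factor
        train$pclass <- factor(train$pclass, levels = c(1, 2, 3),
                               labels = c("Upper", "Middle", "Lower"))

        # sibsp and parch count siblings/spouses and parents/children aboard;
        # adding 1 for the passenger gives a simple family-size feature
        train$family_size <- train$sibsp + train$parch + 1
        summary(train$family_size)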
