https://creativecommons.org/publicdomain/zero/1.0/
The dataset provides information on the daily top 200 tracks listened to by users of the Spotify digital platform around the world.
I put together this dataset because I really love music (I listen to it for several hours a day) and could not find a similar dataset with track genres on Kaggle.
The dataset can be useful for beginners working with data. It contains missing values, array-valued columns, and so on, which make it great practice for the EDA phase (see the sketch below).
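A minimal EDA sketch of the two quirks mentioned above (missing values and array-like columns), assuming pandas is available; the file name and the artist_genres column are hypothetical and may not match the dataset's actual layout:

```python
import ast
import pandas as pd

df = pd.read_csv("spotify_top200_daily.csv")  # hypothetical file name

# Count missing values per column as a first EDA step.
print(df.isna().sum().sort_values(ascending=False))

# Array-valued columns are often stored as strings like "['pop', 'dance pop']";
# parse them back into Python lists, leaving missing entries untouched.
df["artist_genres"] = df["artist_genres"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)
print(df["artist_genres"].explode().value_counts().head(10))
```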
Soon, my own example will appear here showing how, based on this dataset, you can go on a musical journey around the world and see how humanity's musical tastes have changed.
In addition, I will be very happy to see the work of the community on this dataset.
Also, if there is interest in per-country data, I am ready to publish it upon request.
You can contact me through: telegram @natarov_ivan
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is built on data scraped from Overbuff using Python and Selenium. The development environment was Jupyter Notebook.
The tables contain data for competitive seasons 1-4 and for quick play, for each hero and rank, along with the standard statistics (those common to every hero as well as hero-specific information).
Note: data for some columns are missing on Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. 'Self Healing' column for Bastion was dropped too as Bastion doesn't have this property anymore in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.
Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change a skill tier, the data isn't changed). If you know a site where it's possible to get this data, please, leave a comment. Thank you!
The code is available on GitHub.
The whole procedure is done in five stages:
Data are retrieved directly from HTML elements on the page using Selenium in Python.
After scraping, the data were cleansed: 1) thousands separators were removed (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') were converted to seconds (1*60 + 23 => 83); 3) Lúcio became Lucio and Torbjörn became Torbjorn. A sketch of this cleansing step follows the list below.
Data were arranged into a table and saved to CSV.
Columns which are supposed to have only numeric values are checked. All non-numeric values are dropped. This stage helps to find missing values which contain '—' instead and delete them.
Additional missing values are searched for and dealt with: either a column is renamed (when the program cannot infer the correct column name for missing values) or the column is dropped. This stage ensures all wrong data are truly fixed.
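A minimal sketch of the stage-2 cleansing rules, assuming the scraped values arrive as plain strings; the function names are illustrative and not part of the project's actual code:

```python
import unicodedata

def clean_number(value: str) -> int:
    """Drop thousands separators, e.g. '1,009' -> 1009."""
    return int(value.replace(",", ""))

def time_to_seconds(value: str) -> int:
    """Convert 'mm:ss' to seconds, e.g. '01:23' -> 1*60 + 23 = 83."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

def normalize_name(name: str) -> str:
    """Strip diacritics, e.g. 'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(clean_number("1,009"), time_to_seconds("01:23"), normalize_name("Torbjörn"))
```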
The procedure to fetch the data takes 7 minutes on average.
This project and its code grew out of an existing project on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Young People Survey’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/miroslavsabo/young-people-survey on 30 September 2021.
--- Dataset description provided by original source is as follows ---
In 2013, students of the Statistics class at FSEV UK (https://fses.uniba.sk/en/) were asked to invite their friends to participate in this survey.
The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical). See the columns.csv file if you want to match the data with the original variable names. The variables can be split into the following groups:
Many different techniques can be used to answer many questions, e.g.
(In Slovak) Sleziak, P. - Sabo, M.: Gender differences in the prevalence of specific phobias. Forum Statisticum Slovacum. 2014, Vol. 10, No. 6. [Differences (gender + whether people lived in village/town) in the prevalence of phobias.]
Sabo, Miroslav. Multivariate Statistical Methods with Applications. Diss. Slovak University of Technology in Bratislava, 2014. [Clustering of variables (music preferences, movie preferences, phobias) + Clustering of people w.r.t. their interests.]
a MOCK dataset used to show how to import Qualtrics metadata into the codebook R package
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
ResponseSet | NA | 0 |
Q7 | NA | 0 |
Q10 | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is updated more frequently and can be visualized on NCWQR's data portal.
If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.
The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.
Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.
At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.
Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.
2017 Ohio EPA Project Study Plan and Quality Assurance Plan
Data quality control and data screening
The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.
This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.
A note on detection limits and zero and negative concentrations
It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.
The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often "censored" and replaced with a statement indicating that the value was below the detection limit.
Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.
For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
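As a quick illustration of the convention described above, here is a minimal pandas sketch that treats the -9 and -1 codes as missing while keeping genuine zero and negative concentrations; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("river_data.csv")  # hypothetical file name

# Replace the -9 / -1 missing-value codes with NaN, but keep all other values,
# including small negative concentrations, which are valid analytical results.
df["TP_mg_L"] = df["TP_mg_L"].replace([-9, -1], float("nan"))

# Averages now ignore only the coded-missing values, not the low-end results.
print(df["TP_mg_L"].mean())
```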
Analyte Detection Limits
https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024
For more information, please visit https://ncwqr.org/
The dataset has N=354 rows and 9 columns. 354 rows have no missing values on any column.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
municipio_cod | NA | 0 |
municipio_fato | NA | 0 |
data_fato | NA | 0 |
mes | NA | 0 |
ano | NA | 0 |
risp | NA | 0 |
rmbh | NA | 0 |
tentado_consumado | NA | 0 |
qtde_vitimas | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.2).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the UCR Archive maintained by University of Southampton researchers. Please cite a relevant or the latest full archive release if you use the datasets. See http://www.timeseriesclassification.com/.
The original data include 10 subjects, each performing 10 gestures 10 times. The gesture acquisition device is a Nintendo Wii Remote controller with a built-in three-axis accelerometer. Each subject performs a set of gestures multiple times. Classes are based on gestures (see class labels below). Note that data are shuffled and randomly sampled, so instances across datasets are not synchronized by dimension or subject. Time series are of different lengths. There are no missing values.
The gestures are listed below (original class label: English translation):
poteg: pick-up
shake: shake
desno: one move to the right
levo: one move to the left
gor: one move up
dol: one move down
kroglevo: one left circle
krogdesn: one right circle
suneknot: one move toward the screen
sunekven: one move away from the screen
This dataset contains the acceleration in the y-axis dimension.
Donor: J. Guna
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is provided in a single .xlsx file named "eucalyptus_growth_environment_data_V2.xlsx" and consists of fifteen sheets:
Codebook: This sheet details the index, values, and descriptions for each field within the dataset, providing a comprehensive guide to understanding the data structure.
ALL NODES: Contains measurements from all devices, totalling 102,916 data points. This sheet aggregates the data across all nodes.
GWD1 to GWD10: These subset sheets include measurements from individual nodes, labelled according to the abbreviation “Generic Wireless Dendrometer” followed by device IDs 1 through 10. Each sheet corresponds to a specific node, representing measurements from ten trees (or nodes).
Metadata: Provides detailed metadata for each node, including species, initial diameter, location, measurement frequency, battery specifications, and irrigation status. This information is essential for identifying and differentiating the nodes and their specific attributes.
Missing Data Intervals: Details gaps in the data stream, including start and end dates and times when data was not uploaded. It includes information on the total duration of each missing interval and the number of missing data points.
Missing Intervals Distribution: Offers a summary of missing data intervals and their distribution, providing insight into data gaps and reasons for missing data.
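A minimal sketch of loading the workbook in Python, assuming pandas with an xlsx engine such as openpyxl; the sheet names follow the description above and may differ slightly in the actual file:

```python
import pandas as pd

# Load every sheet into a dict of DataFrames keyed by sheet name.
sheets = pd.read_excel("eucalyptus_growth_environment_data_V2.xlsx", sheet_name=None)

codebook = sheets["Codebook"]    # field index, values, and descriptions
all_nodes = sheets["ALL NODES"]  # aggregated measurements from all devices
metadata = sheets["Metadata"]    # per-node species, location, irrigation status, etc.
gwd1 = sheets["GWD1"]            # measurements from a single dendrometer node

print(all_nodes.shape)
```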
All nodes utilize LoRaWAN for data transmission. Please note that intermittent data gaps may occur due to connectivity issues between the gateway and the nodes, as well as maintenance activities or experimental procedures.
Software considerations: The provided R code named “Simple_Dendro_Imputation_and_Analysis.R” is a comprehensive analysis workflow that processes and analyses Eucalyptus growth and environmental data from the "eucalyptus_growth_environment_data_V2.xlsx" dataset. The script begins by loading necessary libraries, setting the working directory, and reading the data from the specified Excel sheet. It then combines date and time information into a unified DateTime format and performs data type conversions for relevant columns. The analysis focuses on a specified device, allowing for the selection of neighbouring devices for imputation of missing data. A loop checks for gaps in the time series and fills in missing intervals based on a defined threshold, followed by a function that imputes missing values using the average from nearby devices. Outliers are identified and managed through linear interpolation. The code further calculates vapor pressure metrics and applies temperature corrections to the dendrometer data. Finally, it saves the cleaned and processed data into a new Excel file while conducting dendrometer analysis using the dendRoAnalyst package, which includes visualizations and calculations of daily growth metrics and correlations with environmental factors such as vapour pressure deficit (VPD).
a small mock Big Five Inventory dataset
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
session | NA | 0 |
created | user first opened survey | 0 |
modified | user last edited survey | 0 |
ended | user finished survey | 0 |
expired | NA | 28 |
BFIK_open_2 | Ich bin tiefsinnig, denke gerne über Sachen nach. | 0 |
BFIK_agree_4R | Ich kann mich schroff und abweisend anderen gegenüber verhalten. | 0 |
BFIK_extra_2 | Ich bin begeisterungsfähig und kann andere leicht mitreißen. | 0 |
BFIK_agree_1R | Ich neige dazu, andere zu kritisieren. | 0 |
BFIK_open_1 | Ich bin vielseitig interessiert. | 0 |
BFIK_neuro_2R | Ich bin entspannt, lasse mich durch Stress nicht aus der Ruhe bringen. | 0 |
BFIK_consc_3 | Ich bin tüchtig und arbeite flott. | 0 |
BFIK_consc_4 | Ich mache Pläne und führe sie auch durch. | 0 |
BFIK_consc_2R | Ich bin bequem, neige zur Faulheit. | 0 |
BFIK_agree_3R | Ich kann mich kalt und distanziert verhalten. | 0 |
BFIK_extra_3R | Ich bin eher der "stille Typ", wortkarg. | 0 |
BFIK_neuro_3 | Ich mache mir viele Sorgen. | 0 |
BFIK_neuro_4 | Ich werde leicht nervös und unsicher. | 0 |
BFIK_agree_2 | Ich schenke anderen leicht Vertrauen, glaube an das Gute im Menschen. | 0 |
BFIK_consc_1 | Ich erledige Aufgaben gründlich. | 0 |
BFIK_open_4 | Ich schätze künstlerische und ästhetische Eindrücke. | 0 |
BFIK_extra_4 | Ich gehe aus mir heraus, bin gesellig. | 0 |
BFIK_extra_1R | Ich bin eher zurückhaltend, reserviert. | 0 |
BFIK_open_3 | Ich habe eine aktive Vorstellungskraft, bin phantasievoll. | 0 |
BFIK_agree | 4 BFIK_agree items aggregated by aggregation_function | 0 |
BFIK_open | 4 BFIK_open items aggregated by aggregation_function | 0 |
BFIK_consc | 4 BFIK_consc items aggregated by aggregation_function | 0 |
BFIK_extra | 4 BFIK_extra items aggregated by aggregation_function | 0 |
BFIK_neuro | 3 BFIK_neuro items aggregated by aggregation_function | 0 |
age | Alter | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
KuaiRec is a real-world dataset collected from the recommendation logs of the video-sharing mobile app Kuaishou. To date, it is the first dataset that contains a fully observed user-item interaction matrix. By “fully observed”, we mean there are almost no missing values in the user-item matrix, i.e., each user has viewed each video and then left feedback.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘WHO national life expectancy’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mmattson/who-national-life-expectancy on 28 January 2022.
--- Dataset description provided by original source is as follows ---
I am developing my data science skills in areas outside of my previous work. An interesting problem for me was to identify which factors influence life expectancy on a national level. There is an existing Kaggle data set that explored this, but that information was corrupted. Part of the problem solving process is to step back periodically and ask "does this make sense?" Without reasonable data, it is harder to notice mistakes in my analysis code (as opposed to unusual behavior due to the data itself). I wanted to make a similar data set, but with reliable information.
This is my first time exploring life expectancy, so I had to guess which features might be of interest when making the data set. Some were included for comparison with the other Kaggle data set. A number of potentially interesting features (like air pollution) were left off due to limited year or country coverage. Since the data was collected from more than one server, some features are present more than once, to explore the differences.
A goal of the World Health Organization (WHO) is to ensure that a billion more people are protected from health emergencies, and provided better health and well-being. They provide public data collected from many sources to identify and monitor factors that are important to reach this goal. This set was primarily made using GHO (Global Health Observatory) and UNESCO (United Nations Educational Scientific and Culture Organization) information. The set covers the years 2000-2016 for 183 countries, in a single CSV file. Missing data is left in place, for the user to decide how to deal with it.
Three notebooks are provided for my cursory analysis, a comparison with the other Kaggle set, and a template for creating this data set.
There is a lot to explore, if the user is interested. The GHO server alone has over 2000 "indicators". - How are the GHO and UNESCO life expectancies calculated, and what is causing the difference? That could also be asked for Gross National Income (GNI) and mortality features. - How does the life expectancy after age 60 compare to the life expectancy at birth? Is the relationship with the features in this data set different for those two targets? - What other indicators on the servers might be interesting to use? Some of the GHO indicators are different studies with different coverage. Can they be combined to make a more useful and robust data feature? - Unraveling the correlations between the features would take significant work.
--- Original source retains full ownership of the source dataset ---
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
WARNING: This is a pre-release dataset and its field names and data structures are subject to change. It should be considered pre-release until the end of 2024. Expected changes:
Metadata is missing or incomplete for some layers at this time and will be continuously improved.
We expect to update this layer roughly in line with CDTFA at some point, but will increase the update cadence over time as we are able to automate the final pieces of the process.
This dataset is continuously updated as the source data from CDTFA is updated, as often as many times a month. If you require unchanging point-in-time data, export a copy for your own use rather than using the service directly in your applications.
Purpose
County and incorporated place (city) boundaries along with third-party identifiers used to join in external data. Boundaries are from the authoritative source, the California Department of Tax and Fee Administration (CDTFA), altered to show the counties as one polygon. This layer displays the city polygons on top of the county polygons so the area isn't interrupted. The GEOID attribute information is added from the US Census. GEOID is based on merged State and County FIPS codes for the counties. Abbreviations for counties and cities were added from Caltrans Division of Local Assistance (DLA) data. Place Type was populated with information extracted from the Census. Names and IDs from the US Board on Geographic Names (BGN), the authoritative source of place names as published in the Geographic Name Information System (GNIS), are attached as well. Finally, coastal buffers are removed, leaving the land-based portions of jurisdictions. This feature layer is for public use.
Related Layers
This dataset is part of a grouping of many datasets:
Cities: Only the city boundaries and attributes, without any unincorporated areas (With Coastal Buffers / Without Coastal Buffers)
Counties: Full county boundaries and attributes, including all cities within as a single polygon (With Coastal Buffers / Without Coastal Buffers)
Cities and Full Counties: A merge of the other two layers, so polygons overlap within city boundaries. Some customers require this behavior, so we provide it as a separate service. (With Coastal Buffers / Without Coastal Buffers - this dataset)
Place Abbreviations
Unincorporated Areas (Coming Soon)
Census Designated Places (Coming Soon)
Cartographic Coastline (Polygon; Line source Coming Soon)
Working with Coastal Buffers
The dataset you are currently viewing includes the coastal buffers for cities and counties that have them in the authoritative source data from CDTFA. In the versions where they are included, they remain as a second polygon on cities or counties that have them, with all the same identifiers, and a value in the COASTAL field indicating if it's an ocean or a bay buffer. If you wish to have a single polygon per jurisdiction that includes the coastal buffers, you can run a Dissolve on the version that has the coastal buffers, on all the fields except COASTAL, Area_SqMi, Shape_Area, and Shape_Length, to get a version with the correct identifiers.
Point of Contact
California Department of Technology, Office of Digital Services, odsdataservices@state.ca.gov
Field and Abbreviation Definitions
COPRI: county number followed by the 3-digit city primary number used in the Board of Equalization's 6-digit tax rate area numbering system
Place Name: CDTFA incorporated (city) or county name
County: CDTFA county name. For counties, this will be the name of the polygon itself. For cities, it is the name of the county the city polygon is within.
Legal Place Name: Board on Geographic Names authorized nomenclature for area names published in the Geographic Name Information System
GNIS_ID: The numeric identifier from the Board on Geographic Names that can be used to join these boundaries to other datasets utilizing this identifier.
GEOID: numeric geographic identifiers from the US Census Bureau
Place Type: Board on Geographic Names authorized nomenclature for boundary type published in the Geographic Name Information System
Place Abbr: CalTrans Division of Local Assistance abbreviations of incorporated area names
CNTY Abbr: CalTrans Division of Local Assistance abbreviations of county names
Area_SqMi: The area of the administrative unit (city or county) in square miles, calculated in EPSG 3310 California Teale Albers.
COASTAL: Indicates if the polygon is a coastal buffer. Null for land polygons. Additional values include "ocean" and "bay".
GlobalID: While all of the layers we provide in this dataset include a GlobalID field with unique values, we do not recommend you make any use of it. The GlobalID field exists to support offline sync, but is not persistent, so data keyed to it will be orphaned at our next update. Use one of the other persistent identifiers, such as GNIS_ID or GEOID, instead.
Accuracy
CDTFA's source data notes the following about accuracy: City boundary changes and county boundary line adjustments filed with the Board of Equalization per Government Code 54900. This GIS layer contains the boundaries of the unincorporated county and incorporated cities within the state of California. The initial dataset was created in March of 2015 and was based on the State Board of Equalization tax rate area boundaries. As of April 1, 2024, the maintenance of this dataset is provided by the California Department of Tax and Fee Administration for the purpose of determining sales and use tax rates. The boundaries are continuously being revised to align with aerial imagery when areas of conflict are discovered between the original boundary provided by the California State Board of Equalization and the boundary made publicly available by local, state, and federal government. Some differences may occur between actual recorded boundaries and the boundaries used for sales and use tax purposes. The boundaries in this map are representations of taxing jurisdictions for the purpose of determining sales and use tax rates and should not be used to determine precise city or county boundary line locations. COUNTY = county name; CITY = city name or unincorporated territory; COPRI = county number followed by the 3-digit city primary number used in the California State Board of Equalization's 6-digit tax rate area numbering system (for the purpose of this map, unincorporated areas are assigned 000 to indicate that the area is not within a city).
Boundary Processing
These data make a structural change from the source data. While the full boundaries provided by CDTFA include coastal buffers of varying sizes, many users need boundaries to end at the shoreline of the ocean or a bay. As a result, after examining existing city and county boundary layers, these datasets provide a coastline cut generally along the ocean-facing coastline. For county boundaries in northern California, the cut runs near the Golden Gate Bridge, while for cities, we cut along the bay shoreline and into the edge of the Delta at the boundaries of Solano, Contra Costa, and Sacramento counties. In the services linked above, the versions that include the coastal buffers contain them as a second (or third) polygon for the city or county, with the value in the COASTAL field set to whether it's a bay or ocean polygon. These can be processed back into a single polygon by dissolving on all the fields you wish to keep, since the attributes, other than the COASTAL field and geometry attributes (like areas), remain the same between the polygons for this purpose.
Slivers
In cases where a city or county's boundary ends near a coastline, our coastline data may cross back and forth many times while roughly paralleling the jurisdiction's boundary, resulting in many polygon slivers. We post-process the data to remove these slivers using a city/county boundary priority algorithm. That is, when the data run parallel to each other, we discard the coastline cut and keep the CDTFA-provided boundary, even if it extends into the ocean a small amount. This processing supports consistent boundaries for Fort Bragg, Point Arena, San Francisco, Pacifica, Half Moon Bay, and Capitola, in addition to others. More information on this algorithm will be provided soon.
Coastline Caveats
Some cities have buffers extending into water bodies that we do not cut at the shoreline. These include South Lake Tahoe and Folsom, which extend into neighboring lakes, and San Diego and surrounding cities that extend into San Diego Bay, which our shoreline encloses. If you have feedback on the exclusion of these items, or others, from the shoreline cuts, please reach out using the contact information above.
Offline Use
This service is fully enabled for sync and export using Esri Field Maps or other similar tools. Importantly, the GlobalID field exists only to support that use case and should not be used for any other purpose (see note in field descriptions).
Updates and Date of Processing
Concurrent with CDTFA updates, approximately every two weeks. Last Processed: 12/17/2024 by Nick Santos using code path at https://github.com/CDT-ODS-DevSecOps/cdt-ods-gis-city-county/ at commit 0bf269d24464c14c9cf4f7dea876aa562984db63. It incorporates updates from CDTFA as of 12/12/2024. Future updates will include improvements to metadata and update frequency.
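A minimal sketch of the dissolve step described above, assuming geopandas is available and the "with coastal buffers" version of the layer has been exported locally (the file name here is hypothetical):

```python
import geopandas as gpd

# Hypothetical local export of the version that includes coastal buffers.
gdf = gpd.read_file("cities_counties_with_coastal_buffers.geojson")

# Merge land and buffer polygons into one polygon per jurisdiction by grouping
# on every field except COASTAL and the geometry-derived fields.
group_fields = [c for c in gdf.columns
                if c not in ("COASTAL", "Area_SqMi", "Shape_Area", "Shape_Length", "geometry")]
merged = gdf.dissolve(by=group_fields, as_index=False)

print(len(gdf), "->", len(merged), "polygons")
```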
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/WIYLEH
Originally published by Harte-Hanks, the CiTDS dataset is now produced by Aberdeen Group, a subsidiary of Spiceworks Ziff Davis (SWZD). It is also referred to as CiTDB (Computer Intelligence Technology Database). CiTDS provides data on digital investments of businesses across the globe. It includes two types of technology datasets: (i) hardware expenditures and (ii) product installs.
Hardware expenditure data is constructed through a combination of surveys and modeling. A survey is administered to a number of companies, and the survey data is used to develop a prediction model of expenditures as a function of firm characteristics. CiTDS uses this model to predict the expenditures of non-surveyed firms and reports them in the dataset. In contrast, CiTDS does not do any imputation for product install data, which comes entirely from web scraping and surveys. A confidence score between 1 and 3 is assigned to indicate how much the source of information can be trusted: a 3 corresponds to 90-100 percent install likelihood, a 2 corresponds to 75-90 percent install likelihood, and a 1 corresponds to 65-75 percent install likelihood.
CiTDS reports technology adoption at the site level with a unique DUNS identifier. One of these sites is identified as an “enterprise,” corresponding to the firm that owns the sites. Therefore, it is possible to analyze technology adoption both at the site (establishment) and enterprise (firm) levels. CiTDS sources the site population from Dun and Bradstreet every year and drops sites that are not relevant to their clients. Due to this sample selection, there is quite a bit of variation in the number of sites from year to year; on average, 10-15 percent of sites enter and exit every year in the US data. This number is higher in the EU data. We observe similar year-to-year turnover in the products included in the dataset. Some products have become obsolete, and some new products are added every year.
There are two versions of the data: (i) version 3, which covers 2016-2020, and (ii) version 4, which covers 2020-2021. The quality of version 4 is significantly better regarding the information included about the technology products. In version 3, product categories have missing values, and they are abbreviated in a way that is sometimes difficult to interpret. Version 4 does not have any major issues. Since both versions of the data are available in 2020, CiTDS provides a crosswalk between the versions. This makes it possible to use information about products in version 4 for the products in version 3, with the caveat that there will be no crosswalk for products that exist in 2016-2019 but not in 2020. Finally, special attention should be paid to data from 2016, where the coverage is significantly different from 2017. From 2017 onwards, coverage is more consistent.
Years of Coverage:
APac: 2019-2021
Canada: 2015-2021
EMEA: 2019-2021
Europe: 2015-2018
Latin America: 2015, 2019-2021
United States: 2015-2021
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Breast Cancer Diagnostic Dataset (BCD)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/devraikwar/breast-cancer-diagnostic on 14 February 2022.
--- Dataset description provided by original source is as follows ---
The resources for this dataset can be found at https://www.openml.org/d/13 and https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
Class: no-recurrence-events, recurrence-events
age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
menopause: lt40, ge40, premeno
tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
node-caps: yes, no
deg-malig: 1, 2, 3
breast: left, right
breast-quad: left-up, left-low, right-up, right-low, central
irradiat: yes, no
Missing Attribute Values (denoted by “?”): attribute 6 has 8 instances with missing values; attribute 9 has 1.
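A minimal loading sketch based on the attribute list above, assuming the data are available as a header-less CSV; the file name and column order are illustrative:

```python
import pandas as pd

columns = ["class", "age", "menopause", "tumor_size", "inv_nodes",
           "node_caps", "deg_malig", "breast", "breast_quad", "irradiat"]

# "?" marks missing attribute values, so map it to NaN while reading.
df = pd.read_csv("breast-cancer.data", header=None, names=columns, na_values="?")

print(df.isna().sum())             # expect a few missing values in node_caps and breast_quad
print(df["class"].value_counts())  # 201 no-recurrence-events vs. 85 recurrence-events
```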
Class Distribution:
no-recurrence-events: 201 instances recurrence-events: 85 instances
Original data https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
With the attributes described above, can you predict whether a patient will have a recurrence event?
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Original Data from: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Changes made:
- four rows with missing values were removed, leaving 299 records
- the Chest Pain Type, Restecg, and Thal variables were converted to indicator variables
- the class attribute was binarised to -1 (no disease) / +1 (disease; original values 1, 2, 3)
Attributes:
Col 0: CLASS: -1 = no disease, +1 = disease
Col 1: Age (cts)
Col 2: Sex (0/1)
Col 3: indicator (0/1) for typical angina
Col 4: indicator for atypical angina
Col 5: indicator for non-anginal pain
Col 6: resting blood pressure (cts)
Col 7: serum cholesterol (cts)
Col 8: fasting blood sugar > 120 mg/dl (0/1)
Col 9: indicator for electrocardio value 1
Col 10: indicator for electrocardio value 2
Col 11: max heart rate (cts)
Col 12: exercise-induced angina (0/1)
Col 13: ST depression induced by exercise (cts)
Col 14: indicator for slope of peak exercise up
Col 15: indicator for slope of peak exercise down
Col 16: number of major vessels colored by fluoroscopy (roughly cts: 0, 1, 2, 3)
Col 17: Thal reversible defect indicator
Col 18: Thal fixed defect indicator
Col 19: Class 0-4, where 0 is disease not present and 1-4 is present
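A minimal loading sketch based on the column listing above; the file name and the short column names are illustrative, not part of the original distribution:

```python
import pandas as pd

columns = [
    "class_pm1", "age", "sex", "cp_typical", "cp_atypical", "cp_nonanginal",
    "rest_bp", "chol", "fbs_gt120", "restecg_1", "restecg_2", "max_hr",
    "exercise_angina", "st_depression", "slope_up", "slope_down",
    "n_vessels", "thal_reversible", "thal_fixed", "class_0_4",
]

df = pd.read_csv("heart_disease_modified.csv", header=None, names=columns)
print(df.shape)  # expect (299, 20) after the four rows with missing values were removed

# The binarised target in column 0 takes values -1 (no disease) and +1 (disease).
print(df["class_pm1"].value_counts())
```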
An unclean employee dataset can contain various types of errors, inconsistencies, and missing values that affect the accuracy and reliability of the data. Some common issues in unclean datasets include duplicate records, incomplete data, incorrect data types, spelling mistakes, inconsistent formatting, and outliers.
For example, there might be multiple entries for the same employee with slightly different spellings of their name or job title. Additionally, some rows may have missing data for certain columns such as bonus or exit date, which can make it difficult to analyze trends or make accurate predictions. Inconsistent formatting of data, such as using different date formats or capitalization conventions, can also cause confusion and errors when processing the data.
Furthermore, there may be outliers in the data, such as employees with extremely high or low salaries or ages, which can distort statistical analyses and lead to inaccurate conclusions.
Overall, an unclean employee dataset can pose significant challenges for data analysis and decision-making, highlighting the importance of cleaning and preparing data before analyzing it.
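A minimal cleaning sketch illustrating the kinds of fixes described above, assuming hypothetical column names such as Name, Job Title, Hire Date, Salary, and Bonus; these may not match the dataset's actual columns:

```python
import pandas as pd

df = pd.read_csv("employee_data_unclean.csv")  # hypothetical file name

# Standardize formatting before looking for duplicate records.
df["Name"] = df["Name"].str.strip().str.title()
df["Job Title"] = df["Job Title"].str.strip().str.title()
df = df.drop_duplicates(subset=["Name", "Job Title"])

# Parse dates that may arrive in mixed formats; unparseable values become NaT.
df["Hire Date"] = pd.to_datetime(df["Hire Date"], errors="coerce")

# Flag missing bonuses rather than silently dropping the rows.
print("Rows with missing Bonus:", df["Bonus"].isna().sum())

# Treat implausible salaries as outliers using the interquartile range.
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Salary"] < q1 - 1.5 * iqr) | (df["Salary"] > q3 + 1.5 * iqr)]
print("Salary outliers:", len(outliers))
```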
Managing data is hard. So many of our partner institutions are under-resourced when it comes to preparing, archiving, sharing and interpreting HIV-related datasets. Crucial datasets often sit on the laptops of local staff in Excel sheets and Word documents, or in large locked-down data warehouses where only a few have the understanding to access it. But data is useless if it is not accessible by trusted parties for analysis.
UNAIDS has identified the following challenges faced by our local partners:
Administrative burden of data management
Equipment failure
Staff turnover
Duplication of requests for data
Secure sharing of data
Keeping data up-to-date
A new software project has been established to tackle these challenges and streamline the data management process. The AIDS Data Repository aims to improve the quality, accessibility and consistency of HIV data and HIV estimates by providing a centralised platform with tools to help countries manage and share their HIV data. The project includes the following features:
Schema-based dataset management will help local staff with the process of preparing, validating and archiving key datasets according to the requirements from UNAIDS. Schemas that are designed or approved by UNAIDS determine the design of web forms and validation tools that guide users through the process of uploading essential data.
Secure and licensed dataset sharing will give partners confidence that their data should only be used by the parties they trust for the purposes they have agreed.
Data access management tools will help organisations understand who has access to use their datasets. Access can be requested, reviewed and granted through the site, but also revoked. This can be done for individual users or for entire organisations.
Cloud-based archiving and backup of all datasets means that data will not go missing when equipment fails or staff leave. All datasets can be tagged and searched according to their metadata and will be reliably accessible forever.
DHIS2 interoperability will enable administrators to share DHIS2 data with all the features and tools provided by the AIDS Data Repository. Datasets comprising elements automatically pulled from a DHIS2 instance can be added to the site. Periodic pulling of data will ensure that these datasets do not fall out of date. Web-based tools will help administrators configure and monitor the DHIS2 configuration that will likely change over time.
Spectrum/Naomi interoperability will streamline the process of preparing and running the Spectrum and HIVE statistical models that are supported by UNAIDS. Web forms and validation tools guide users through the process of preparing the source data sets. These source data sets can then be automatically pulled into the Spectrum and Naomi statistical modelling software tools, which will return the results to the AIDS Data Repository once finished.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Toulouse Campus surveillance Dataset, named ToCaDa, contains two sets of 25 temporally synchronized videos corresponding to two scripted scenarios.
With the help of about 50 persons (actors and camera holders), these videos were shot on July 17th 2017 at 9:50 a.m. and 11:04 a.m. respectively.
Among the cameras: • 9 were located inside the main building and shot from the windows at different floors. All these cameras are focusing the car park and the path leading to the main entrance of the building with large overlapping fields of view. • 8 were located in front of the building and filmed it with large overlapping fields of view. • 8 cameras were arranged further, scattered around the university campus. Each of their views is disjoint from all the others.
About 20 actors were asked to follow two realistic scenarios by performing scripted actions, like driving a car, walking, entering or leaving a building, or holding an item in hand while being filmed.
In addition to ordinary actions, some suspicious behaviors are present.
Irregularities:
Due to the wide variety of devices used during the shooting of the two scenarios, issues were encountered on some cameras, leading to videos in which a few seconds are missing. To ensure temporal synchronization between videos, black frames were added over the missing intervals of time. We list these particular videos and their missing intervals below:
F1C3: the first 66 seconds are missing. F1C5: the first 2 seconds are missing. F1C8: the first 3 seconds are missing. F1C13: the first 10 seconds are missing. F1C15: the first second is missing. F1C19: the first second is missing. F2C1: the video is accelerated and only lasts a few seconds, so we did not provide it. F2C6: missing from 4:01 to 4:12 and from 4:25 to 4:28. F2C16: missing from 5:15 to 5:26.
Some videos were recorded with mobile devices whose pixel resolution was lower than 1920 x 1080:
F1C3 and F2C3: pixel resolution is 1280 x 720. F1C4 and F2C4: pixel resolution is 640 x 480. F1C15 and F2C15: pixel resolution is 1280 x 720. F1C20 and F2C20: pixel resolution is 1440 x 1080.
More detailed information about the position of the cameras can be found on the following link: http://ubee.enseeiht.fr/dokuwiki/doku.php?id=public:tocada
Citation T. Malon, G. Roman-Jimenez, P. Guyot, S. Chambon, V. Charvillat, A. Crouzil, A. Péninou, J. Pinquier, F. Sèdes and C. Sénac, Toulouse campus surveillance dataset: scenarios, soundtracks, synchronized videos with overlapping and disjoint views, ACM Multimedia Systems Conference, 2018.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains datasets for the manuscript "Practical model selection for prospective virtual screening":
If you use this data in a publication, please cite:
Shengchao Liu+, Moayad Alnammi+, Spencer S. Ericksen, Andrew F. Voter, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Practical model selection for prospective virtual screening. bioRxiv 2018. doi:10.1101/337956
PubChem data were provided by the PubChem database. Follow the PubChem citation guidelines if you use the PubChem data.
The dataset has N=15384 rows and 5 columns. 15343 rows have no missing values on any column.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
name | label | n_missing |
---|---|---|
fecha_unidad | NA | 0 |
volumen_totalizador_arboleda_m3 | NA | 38 |
caudal_arboleda_q_m3_hr | NA | 39 |
volumen_totalizador_aranjuez_m3 | NA | 38 |
caudal_aranjuez_q_m3_hr | NA | 41 |
This dataset was automatically described using the codebook R package (version 0.9.2).